Extracting the text from a Wikipedia article

Wikipedia articles can contain an enormous amount of text – in some cases so much that many language-processing APIs won’t accept it. AlchemyAPI (now seemingly marketed as “IBM Watson”) has an endpoint to parse text from a website, but it only accepts pages up to 600KB (with 50K of output text). Past that limit, it quickly becomes easier to just get the text yourself.

To do this, I recommend Apache Tika, which seems to be one of the better libraries for extracting text and has every imaginable interface – Java, command line, REST, and even a GUI(!).

You only need the Java jar for this:

curl http://apache.spinellicreations.com/tika/tika-app-1.13.jar > tika-app.jar

Tika has a complex set of options for detecting content types [1], but it also responds to file extensions, and when I was testing this I found it was more reliable when I gave the downloaded page an explicit .html extension:

curl https://en.wikipedia.org/wiki/Barack_Obama > data/Barack_Obama.html

Invoking Tika is then simple ($1 here is the article title, as in a shell script):

java -jar tika-app.jar -t data/$1.html > out/$1.txt

The problem is that Wikipedia wraps the article text in a ton of extra content. You could handle this in a few ways: pre-process the HTML to select just what you want, customize Tika so it parses out specific pieces (probably a good option if you want just captions or headings), or hack at the output.
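If you prefer the pre-processing route, here is a minimal sketch using only the standard library’s html.parser to keep just the main content div. It assumes Wikipedia’s body text lives in a div with id mw-content-text (true at the time of writing, but the page layout could change), and it ignores void tags like <br>, which a production version would need to handle:

```python
from html.parser import HTMLParser

class ContentExtractor(HTMLParser):
    """Collects text appearing inside <div id="mw-content-text"> only."""
    def __init__(self):
        super().__init__()
        self.depth = 0    # nesting depth inside the content div (0 = outside)
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if self.depth:
            self.depth += 1
        elif tag == "div" and dict(attrs).get("id") == "mw-content-text":
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth and data.strip():
            self.chunks.append(data.strip())

parser = ContentExtractor()
parser.feed('<html><div id="nav">skip me</div>'
            '<div id="mw-content-text"><p>Keep this text.</p></div></html>')
print(" ".join(parser.chunks))  # → Keep this text.
```

Feed it the downloaded HTML instead of the inline string and you can hand the reduced markup (or just the collected text) to Tika or skip Tika entirely.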

In my case I chose the last option. The following script removes the table of contents, most captions, and the bogus header/footer information that shows up at the end of the file. Tune it to your liking (I removed the references as well).

import fileinput
import re

# The article body sits between these two markers in Tika's output.
start = re.compile(r"Jump to:.*navigation,.*search")
end = re.compile(r"^Notes and references$")

# Lines to drop: "Main article(s):" / "See also:" cross-references,
# and numbered table-of-contents entries like "3.2 Early career".
ignore = re.compile(
    r"^(Main articles?: .*|"
    r"See also: .*|"
    r"\s*[0-9]+\.[0-9]+ .*|"
    r"\s*[0-9]+\.[0-9]+\.[0-9]+ .*)\s*$")

# Inline footnote markers such as "[42]".
footnote = re.compile(r"\[[0-9]+\]")

started = False
ended = False
blank = False

for line in fileinput.input():

    if end.match(line.strip()):
        ended = True

    if started and not ended:
        # Collapse runs of blank lines into a single blank line.
        if not blank or line.strip() != "":
            if not ignore.match(line):
                # Keep sentences (they contain "."), long lines, and blanks;
                # this drops most captions and headings.
                if "." in line or len(line) > 150 or len(line.strip()) == 0:
                    print(footnote.sub("", line.strip()))

                    blank = (line.strip() == "")

    if not started:
        if start.search(line):
            started = True
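As a quick sanity check, patterns equivalent to the script’s (with the dots escaped, which the original regexes need) behave like this on representative lines:

```python
import re

# Same intent as the script's "footnote" and "ignore" patterns.
footnote = re.compile(r"\[[0-9]+\]")
toc = re.compile(r"^(Main articles?: .*|See also: .*|"
                 r"\s*[0-9]+\.[0-9]+(\.[0-9]+)? .*)\s*$")

print(footnote.sub("", "Obama was born in Honolulu.[12]"))  # → Obama was born in Honolulu.
print(bool(toc.match("Main article: Family of Barack Obama")))  # → True
print(bool(toc.match("1.2 Early career")))                      # → True
print(bool(toc.match("He took office in 2009.")))               # → False
```

Without the escaped dots, an entry like "12 34 anything" would also match the TOC pattern, since an unescaped "." matches any character.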

1. https://tika.apache.org/1.1/detection.html#Content_Detection