Wikipedia articles can run very long – in some cases so long that many language-processing APIs won't accept them. Alchemy API (now marketed as part of "IBM Watson") has an endpoint that parses the text out of a web page, but it only accepts pages up to 600KB (and returns at most 50KB of output text). Consequently, it quickly becomes easier to just get the text yourself.
To do this, I recommend Apache Tika, which seems to include one of the better / best libraries for extracting text, and has every imaginable interface – Java, command line, REST, and a GUI(!).
You only need a single jar for this:
curl http://apache.spinellicreations.com/tika/tika-app-1.13.jar > tika-app.jar
Tika has a sophisticated system for detecting content types[1], but it also responds to file extensions, and in my testing it was more reliable when I saved pages with an explicit extension:
curl https://en.wikipedia.org/wiki/Barack_Obama > Barack_Obama.html
Invoking Tika is simple (here inside a small script that takes the article name as $1):
java -jar tika-app.jar -t data/$1.html > out/$1.txt
Problem is, Wikipedia wraps the article text in a ton of extra content. You could handle this in a few ways – pre-process the HTML to keep only the parts you want, customize Tika to parse out the bits you need (probably a good option if you want just captions or headings), or hack at the output.
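For the first option, one way to pre-process is to keep only Wikipedia's main content div before handing the page to Tika. Here's a minimal sketch using only the standard library; it assumes the article body lives in `<div id="mw-content-text">`, which is what MediaWiki's markup used at the time (adjust if the markup changes):

```python
from html.parser import HTMLParser

class ContentExtractor(HTMLParser):
    """Collect only the text inside <div id="mw-content-text">."""
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.depth = 0    # div-nesting depth inside the content div (0 = outside)
        self.chunks = []  # text fragments collected inside the div

    def handle_starttag(self, tag, attrs):
        if self.depth:
            if tag == "div":       # track nested divs so we know when we leave
                self.depth += 1
        elif tag == "div" and dict(attrs).get("id") == "mw-content-text":
            self.depth = 1         # entered the content div

    def handle_endtag(self, tag):
        if self.depth and tag == "div":
            self.depth -= 1        # hit 0 when the content div closes

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)

def content_text(html):
    parser = ContentExtractor()
    parser.feed(html)
    return "".join(parser.chunks)
```

This drops navigation, sidebars, and footers before extraction, which leaves much less to clean up afterwards.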
For my case I chose the last option. The following script will remove the table of contents, most captions, and the bogus header / footer information that shows up at the end of the file. Tune to your liking (I removed the references as well).
import fileinput
import re

# Article text starts after the "Jump to: navigation, search" line...
start = re.compile("Jump to:.*navigation,.*search")
# ...and ends at the references section.
end = re.compile("^Notes and references$")
started = False
ended = False
blank = False
# Lines to drop: "Main article(s):" / "See also:" pointers, and numbered
# table-of-contents entries like "2.1 Early life" or "2.1.3 Chicago".
ignore = re.compile(
    r"^(Main article: .*|"
    r"Main articles: .*|"
    r"See also: .*|"
    r"\s*[0-9]+\.[0-9]+ .*|"
    r"\s*[0-9]+\.[0-9]+\.[0-9]+ .*)\s*$")
# Inline footnote markers like "[12]".
footnote = re.compile(r"\[[0-9]+\]")

for line in fileinput.input():
    if re.match(end, line.strip()):
        ended = True
    if started and not ended:
        # Collapse runs of blank lines into one.
        if not blank or line.strip() != "":
            if not re.match(ignore, line):
                # Keep sentences, long lines, and blanks; drop short
                # period-less fragments (headings, stray captions, etc.).
                if "." in line or len(line) > 150 or len(line.strip()) == 0:
                    print(re.sub(footnote, "", line.strip()))
        blank = line.strip() == ""
    if not started and re.search(start, line) is not None:
        started = True
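To help with the tuning, here's a quick standalone sanity check of what those patterns do and don't drop (the regexes are equivalent to the ones in the script, with the literal dots in the section numbers escaped):

```python
import re

# "Main article:" pointers and numbered TOC entries are dropped.
ignore = re.compile(
    r"^(Main article: .*|"
    r"Main articles: .*|"
    r"See also: .*|"
    r"\s*[0-9]+\.[0-9]+ .*|"
    r"\s*[0-9]+\.[0-9]+\.[0-9]+ .*)\s*$")
# Inline footnote markers like "[12]" are stripped from kept lines.
footnote = re.compile(r"\[[0-9]+\]")

assert ignore.match("Main article: Family of Barack Obama")
assert ignore.match("2.1 Early life")
assert ignore.match("2.1.3 Chicago")
assert not ignore.match("He was born in 1961.")
assert footnote.sub("", "He won the election.[12]") == "He won the election."
```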
[1] https://tika.apache.org/1.1/detection.html#Content_Detection