“Natural Language Processing with Python” (read my review) has lots of motivating examples for natural language processing. I quickly found it valuable to build indices ahead of time: I have a corpus of legal texts, and I build a set of n-gram indices from it.
The first index is simply the tokenized text of every file combined into one list. You may wish to add a special token that marks where each file ends (a sketch of that appears after the code below).
import nltk;
import os;
import re;

all_tokens = []
idx = 0
for root, directories, filenames in os.walk('.'):
    for file in filenames:
        if file.endswith('.txt'):
            idx = idx + 1
            # os.walk returns names relative to root, so join them
            contents = open(os.path.join(root, file))
            raw = contents.read()
            contents.close()
            lc_raw = raw.lower()
            new_tokens = nltk.word_tokenize(lc_raw)
            # keep purely alphabetic words under 20 characters,
            # plus single-character tokens (punctuation, digits)
            new_tokens_filtered = [w for w in new_tokens
                                   if len(w) < 20 and
                                   (re.search('^[a-zA-Z]+$', w) or len(w) == 1)]
            all_tokens = all_tokens + new_tokens_filtered
# one token per line; redirect to a file (e.g. all_tokens5) for the next script
for token in all_tokens:
    print token
The list comprehension filters the tokens; depending on the situation you may wish to modify it (e.g. whether or not you want punctuation).
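Here is a minimal sketch of the end-of-file marker idea, writing one token per line straight to the file the next script reads instead of redirecting the print loop; the '</file>' marker string and writing directly to 'all_tokens5' are my illustrative choices, not part of the original run (and the token filtering above is omitted for brevity).

import nltk
import os

# Sketch only: tokenize each .txt file, write one token per line to the
# file the n-gram script reads, and append a marker token after each file.
out = open('all_tokens5', 'w')
for root, directories, filenames in os.walk('.'):
    for file in filenames:
        if file.endswith('.txt'):
            contents = open(os.path.join(root, file))
            raw = contents.read()
            contents.close()
            for token in nltk.word_tokenize(raw.lower()):
                out.write(token + '\n')
            out.write('</file>\n')   # end-of-file marker token
out.close()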
The following turns the token index into an n-gram index. It runs pretty quickly as long as the counts fit in memory; it could easily be extended to work on parts of a document set, combining the partial results by adding counts in a map-reduce style, especially if the output was sorted (a sketch of such a merge appears after the code). One surprising result is that the 5-gram index can be quite a bit larger than the original data.
import nltk;
import os;

contents = open('all_tokens5', 'rU')
n1_gram_word = ' '
n2_gram_word = ' '
n3_gram_word = ' '
n4_gram_word = ' '
n5_gram_word = ' '
n1_counts = {}
n2_counts = {}
n3_counts = {}
n4_counts = {}
n5_counts = {}
index = 0

def incr(c, w):
    try:
        c[w] = c[w] + 1
    except KeyError:
        c[w] = 1

for word in contents:
    index = index + 1
    if (index % 10000 == 0):
        print "Processed %-d words" % (index)
    # known defects: no end-of-file markers, and the blank padding words
    # produce spurious bigrams/trigrams at the start of the stream
    (n5_gram_word, n4_gram_word, n3_gram_word, n2_gram_word, n1_gram_word) = \
        (n4_gram_word, n3_gram_word, n2_gram_word, n1_gram_word, word.rstrip('\n'))
    n1_gram = n1_gram_word
    n2_gram = n2_gram_word + ' ' + n1_gram
    n3_gram = n3_gram_word + ' ' + n2_gram
    n4_gram = n4_gram_word + ' ' + n3_gram
    n5_gram = n5_gram_word + ' ' + n4_gram
    incr(n1_counts, n1_gram)
    incr(n2_counts, n2_gram)
    incr(n3_counts, n3_gram)
    incr(n4_counts, n4_gram)
    incr(n5_counts, n5_gram)
contents.close()

def save_ngram(c, f):
    output = open(f, 'w')
    # write the n-grams in ascending order of count
    ordered = sorted(c.items(), key=lambda kv: kv[1])
    for a, b in ordered:
        output.write('%-s %-d\n' % (a, b))
    output.close()

save_ngram(n1_counts, '1gram')
save_ngram(n2_counts, '2gram')
save_ngram(n3_counts, '3gram')
save_ngram(n4_counts, '4gram')
save_ngram(n5_counts, '5gram')
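To illustrate the combine-by-addition step mentioned above, here is a minimal sketch that merges two partial count files produced by save_ngram; the merge_counts and incr_by names and the example filenames are mine, purely for illustration.

# Sketch: combine two partial n-gram count files by adding counts per key.
# Assumes the "<n-gram> <count>" line format written by save_ngram above.
def incr_by(c, w, n):
    try:
        c[w] = c[w] + n
    except KeyError:
        c[w] = n

def merge_counts(file_a, file_b, file_out):
    totals = {}
    for name in (file_a, file_b):
        f = open(name)
        for line in f:
            parts = line.rstrip('\n').rsplit(' ', 1)   # the count is the last field
            if len(parts) == 2:
                incr_by(totals, parts[0], int(parts[1]))
        f.close()
    save_ngram(totals, file_out)

# e.g. merge_counts('5gram.part1', '5gram.part2', '5gram')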
Isn’t there an n-gram building tool in NLTK?
Yes, there is, and I used that version first. I wrote my own because I was running into memory issues and wanted to control how it worked, and it’s pretty simple to write anyway.
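For reference, the built-in route looks roughly like the sketch below, using nltk.ngrams and nltk.FreqDist; this is just to show the API, not the exact code I ran.

# Rough sketch of the built-in approach: nltk.ngrams yields tuples and
# FreqDist counts them, all in one in-memory structure.
import nltk

tokens = open('all_tokens5').read().split()
fd = nltk.FreqDist(nltk.ngrams(tokens, 5))
out = open('5gram_nltk', 'w')
for gram in fd:
    out.write(' '.join(gram) + ' %d\n' % fd[gram])
out.close()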
You are aware that the semicolons at the end of your import statements are ugly and useless?
Habit 🙂
You might want to have a look at this:
http://docs.python.org/2/library/collections.html#collections.Counter
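For example, a unigram pass with Counter instead of the incr helper would look roughly like this (illustrative only):

# Counter gives missing keys a default count of 0, so incr is unnecessary,
# and most_common() handles the sorting.
from collections import Counter

counts = Counter()
for word in open('all_tokens5'):
    counts[word.rstrip('\n')] += 1
for gram, count in counts.most_common():
    print gram, count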
Thanks!