WordNet is a database of hierarchical relationships between words – “a tree is part of a forest”, “a car is a type of motor vehicle”, “an engine is part of a car” (meronyms, holonyms, hypernyms). “Natural Language Processing with Python” (read my review) suggests that you can discover these relationships in a corpus by searching for strings like “is a” and filtering the results down – thus finding entries that could be added to WordNet by hand. Presumably this is how the database was first constructed.
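For reference, NLTK’s WordNet interface exposes these relationships directly. A minimal sketch – the exact synsets returned depend on your WordNet version:

from nltk.corpus import wordnet as wn

# Hypernyms: a car is a type of motor vehicle.
print(wn.synset('car.n.01').hypernyms())         # [Synset('motor_vehicle.n.01')]

# Meronyms: the parts of a car (doors, engine, and so on).
print(wn.synset('car.n.01').part_meronyms())

# Holonyms: a tree is a member of a forest.
print(wn.synset('tree.n.01').member_holonyms())  # [Synset('forest.n.01')]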
Searching for these strings sounds like a job for a simple regex, but in practice it generates a lot of noise. Rather than searching the original text files, which would produce a lot of duplicates, I’m using an n-gram index I generated. It holds frequency counts for phrases across 15,000 court cases, with garbage tokens already filtered from the text. You still see a lot of strings like “defendant is a flight risk”, which is interesting if you want to report on the case, but not for listing relationships.
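I won’t reproduce the index-generation step here, but the idea is simple enough to sketch – count every n-gram in the tokenized corpus and write out the counts. The file name and tokenization below are placeholders, not the original pipeline:

from collections import Counter
import nltk

def build_ngram_index(tokens, n=4):
    # Count every n-gram of length n in the token stream.
    return Counter(' '.join(gram) for gram in nltk.ngrams(tokens, n))

# Hypothetical input file; the real index was built from 15,000 court cases.
tokens = nltk.word_tokenize(open('cases.txt').read().lower())
with open('4gram', 'w') as out:
    for phrase, count in build_ngram_index(tokens).items():
        out.write('%s %d\n' % (phrase, count))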
NLTK ships with several text corpora, so I joined my data against two of them. The first is a stopwords list – this removes a lot of garbage entries that probably make sense in their original context, but not here.
import nltk
from nltk import memoize

@memoize
def get_stopwords():
    # Build the stopword set once; set membership tests are fast.
    return set(nltk.corpus.stopwords.words())

def has_stopword(test_word):
    return test_word in get_stopwords()
I also use WordNet to check the known uses of a word, to see whether it can ever be a noun (a word like “mint” can be a noun or a verb, for instance, whereas “within” gets removed). A surprising number of words can be nouns (“have”, as in “haves” and “have nots”). I also remove hapaxes (words that occur only once) – this drops some people’s names and bogus misspellings (a sketch of that filter appears after the noun check below).
@memoize
def get_all_words():
    return set(nltk.corpus.words.words())

def is_word(test_word):
    return test_word in get_all_words()

def can_be_noun(test_word):
    synsets = nltk.corpus.wordnet.synsets(test_word)
    if len(synsets) == 0:
        # Unknown to WordNet - give the word the benefit of the doubt.
        return True
    for s in synsets:
        if s.pos() == 'n':
            return True
    return False
Note the use of sets above – this makes the lookups faster. The memoize decorator comes from NLTK.
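The hapax filter isn’t in the listing above; a minimal sketch, assuming access to the same tokenized corpus the index was built from (the file name is a placeholder):

from nltk import FreqDist, word_tokenize

# Placeholder: the same cleaned corpus text the n-gram index was built from.
corpus_tokens = word_tokenize(open('cases.txt').read().lower())

freq = FreqDist(corpus_tokens)
hapaxes = set(freq.hapaxes())  # words that occur exactly once

def is_hapax(test_word):
    return test_word in hapaxes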
Now that we’ve defined the filter functions, we can walk through the file. Each entry is of the form “phrase count”, e.g. “law 321” or “a black cat 7”, depending on which n-gram file you’re looking at. The order in which the tests are run against a phrase matters a bit, since WordNet lookups take longer than the rest, so the cheap checks come first.
def get_words():
    relationships = ["is a", "forms a", "contains a"]
    with open('4gram') as grams:
        for line in grams:
            vals = line.split(' ')
            ngram = ' '.join(vals[0:-1])
            count = int(vals[-1])
            if count > 3:
                for rln in relationships:
                    segments = ngram.split(' ' + rln + ' ')
                    if len(segments) == 2:
                        begin_segment = segments[0]
                        end_segment = segments[1]
                        # Cheap dictionary and stopword checks first; the WordNet lookup is slowest.
                        if is_word(begin_segment) and is_word(end_segment):
                            if not has_stopword(begin_segment) and not has_stopword(end_segment):
                                if can_be_noun(begin_segment) and can_be_noun(end_segment):
                                    yield ngram

[w for w in get_words()]
This still generates a whole lot of garbage, but now a manageable amount.
If you filter this down manually, you get some real relationships – though not all of them may be what the authors intended, and some may actually be cropped terms. A better approach would be to tag the texts with parts of speech first, then use the tags to determine where noun phrases begin and end. One could then filter to certain arrangements (noun phrase / verb / noun phrase) so that the noun phrases are kept whole.
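As a sketch of that approach, NLTK’s pos_tag plus a RegexpParser chunker can pull out noun phrases; the grammar below is a deliberately simple example, not a tuned one:

import nltk

# A very simple noun-phrase grammar: optional determiner, any adjectives, then nouns.
grammar = 'NP: {<DT>?<JJ>*<NN.*>+}'
chunker = nltk.RegexpParser(grammar)

def noun_phrases(sentence):
    tokens = nltk.word_tokenize(sentence)
    tagged = nltk.pos_tag(tokens)
    tree = chunker.parse(tagged)
    for subtree in tree.subtrees(filter=lambda t: t.label() == 'NP'):
        yield ' '.join(word for word, tag in subtree.leaves())

print(list(noun_phrases('The engine is a part of the car')))
# Something like: ['The engine', 'a part', 'the car']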
This technique can also be used to find opposites (antonymy) or entailment (a verb that implies another verb, either because one action contains the other or because they are synonyms).
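WordNet already exposes both relation types, which is handy for checking candidates found this way; a small sketch (again, exact results depend on the WordNet version):

from nltk.corpus import wordnet as wn

# Antonyms are stored on lemmas, not synsets.
good = wn.synset('good.a.01').lemmas()[0]
print(good.antonyms())                         # e.g. [Lemma('bad.a.01.bad')]

# Entailment: snoring entails sleeping.
print(wn.synset('snore.v.01').entailments())   # e.g. [Synset('sleep.v.01')]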