Zipf’s Law states that the frequency of a word in a corpus of text is proportional to it’s rank – first noticed in the 1930’s. Unlike a “law” in the sense of mathematics or physics, this is purely on observation, without strong explanation that I can find of the causes.
We can explore this concept fairly simply on a bit of text using NLTK, which provides handy APIs for accessing and processing text. There is a good textbook on the subject, “Natural Language Processing with Python” (read my review), with lots of motivating examples like this.
import nltk
from nltk.corpus import reuters
from nltk.corpus import wordnet
reuters_words = [w.lower() for w in reuters.words()]
words = set(reuters_words)
counts = [(w, reuters_words.count(w)) for w in words]
The first step of counting word frequencies takes quite a while (an hour or two). Here are the top tokens (obviously not all words)
>>> [(w, c) for (w, c) in counts if c > 5000]
[('.', 94687), ('s', 15680), ('with', 6179), ("'", 11272),
('>', 7449), ('year', 7529), ('000', 10277), ('loss', 5124),
('u', 6392), ('pct', 9810), ('"', 6816), ('from', 8217),
('for', 13782), ('2', 6528), ('at', 7017), ('be', 6357),
('the', 69277), (';', 8762), ('he', 5215), ('net', 6989),
('is', 7668), ('it', 11104), ('in', 29253), ('billion', 5829),
('lt', 8696), ('-', 13705), ('of', 36779), ('&', 8698),
('to', 36400), ('vs', 14341), ('was', 5816), ('1', 9977),
('and', 25648), ('dlrs', 12417), ('by', 7101), ('its', 7402),
('mln', 18623), ('cts', 8361), ('on', 9244), ('that', 7540),
('3', 5091), ('a', 25103), (',', 72360), ('said', 25383),
('will', 5952)]
Next, I joined this to Wordnet data – while exploring Zipf’s law, I want to test the hypthesis that common words are more likely to be irregular. We also generate the rankings of each word in the set, but frequencies, and by number of “synsets” in Wordnet (synsets are sets meanings)
import scipy.stats as ss
amb = [(w, c, len(wordnet.synsets(w))) /
for (w, c) in counts if len(wordnet.synsets(w)) > 0]
amb_p_rank = ss.rankdata([p for (w, c, p) in amb])
amb_c_rank = ss.rankdata([c for (w, c, p) in amb])
amb_ranked = zip(amb, amb_p_rank, amb_c_rank)
amb_ranked[100:110]
Out[37]:
[('regulator', 2, 3, 8945.0, 5500.0),
('friend', 2, 5, 12344.0, 5500.0),
('feeling', 19, 19, 16810.0, 12979.0),
('sustaining', 7, 7, 14215.0, 10282.5),
('spectrum', 8, 2, 6142.0, 10684.5),
('consenting', 1, 2, 6142.0, 2218.5),
('resignations', 3, 3, 8945.0, 7215.5),
('dozen', 11, 2, 6142.0, 11554.0),
('affairs', 75, 5, 12344.0, 15397.5),
('mostly', 57, 2, 6142.0, 15038.0)]
The last line combines the data values together into one big list of tuples, for convenience. To do a quick test of Zipf’s law, we can test that the rank correlates to the log of the count:
numpy.corrcoef(amb_c_rank, [math.log(c) for (w, c, p) in amb])
Out[106]:
array([[ 1. , 0.95322962],
[ 0.95322962, 1. ]])
We can also demonstrate this a different way. First, we sort the records by occurence:
amb_ranked_sorted = sorted(amb_ranked, key=lambda (w, c, p, cr, pr): c)
And then we take some samples. Note how when we move by approximately 3x rank, the frequency goes down by about 1/3 (this will vary based on how the rank is counted, especially at the long tail where there are many matching counts in a row)
amb_ranked_sorted[-50]
Out[121]: ('profit', 2960, 4, 10907.0, 17128.0)
amb_ranked_sorted[-150]
Out[122]: ('major', 1000, 13, 16262.0, 17028.0)
amb_ranked_sorted[-450]
Out[123]: ('goods', 408, 4, 10907.0, 16728.0)
amb_ranked_sorted[-1350]
Out[124]: ('hope', 113, 9, 15215.0, 15827.5)
Finally, we can plot this:
import matplotlib
rev = [l-r+1 for r in amb_c_rank]
plt.plot([math.log(c) for c in rev], [math.log(c) for (w, c, p) in amb], 'ro')
This isn’t exactly straight, especially at the end, likely due to how ranks are computed, but it’s close. Now we can plot the number of “meanings” in wordnet vs use in the reuters corpus:
plt.plot([c for (w,c,p) in amb], [p for (w,c,p) in amb], 'bs')
Out[150]: []
plt.show()
The problem is that there is no notion of context, for instance the common word “at” is only referenced by the chemical element “astatine”:
wordnet.synsets('at')
Out[149]: [Synset('astatine.n.01'), Synset('at.n.02')]
Thus, a better technique is to look at verbs only.
def wordnet_verbs(w):
synsets = wordnet.synsets(w)
verbs = [w for w in synsets if w.pos == 'v']
return verbs
amb_v = [(w, c, len(wordnet_verbs(w)), len(wordnet.synsets(w))) \
for (w, c) in counts if len(wordnet_verbs(w)) > 0]
amb_v[100:110]
Out[167]:
[('shoots', 1, 20, 22),
('suffice', 1, 1, 1),
('acquainting', 1, 3, 3),
('perfumes', 1, 2, 4),
('safeguard', 14, 2, 4),
('arrays', 4, 2, 6),
('crowns', 143, 4, 16),
('roll', 21, 18, 33),
('intend', 63, 4, 4),
('palms', 2, 1, 5)]
plt.plot([c for (w,c,vc,wc) in amb_v], [vc for (w,c,vc,wc) in amb_v], 'bs',\
[c for (w,c,vc, wc) in amb_v], [wc for (w,c,vc,wc) in amb_v], 'ro')
And, we can plot these (y axis is # of alternate meanings, x axis is frequency, blue is counts of verb meanings, and red is counts of word meanings).
There does seem to be a pattern here, but not what I expect – I would expect some straight-line upwards, with alternate meanings increasing with usage, which is actually the opposite. This likely means that the wordnet thesaurus counts are a poor indicator of irregularity. Rather, words which have many meanings are discouraged in news writing, as they encourage ambiguity.
For a final note, lets look at what are these common words are:
[(w,c,vc,wc) for (w,c,vc,wc) in amb_v if vc > 40]
Out[179]:
[('cut', 905, 41, 70),
('makes', 169, 49, 51),
('runs', 38, 41, 57),
('making', 369, 49, 52),
('gave', 208, 44, 44),
('breaks', 14, 59, 75),
('take', 745, 42, 44),
('broke', 40, 59, 60),
('run', 125, 41, 57),
('give', 303, 44, 45),
('cuts', 319, 41, 61),
('broken', 36, 59, 72),
('giving', 99, 44, 48),
('took', 140, 42, 42),
('breaking', 19, 59, 60),
('takes', 91, 42, 44),
('taken', 244, 42, 44),
('make', 592, 49, 51),
('taking', 197, 42, 44),
('given', 370, 44, 47),
('gives', 47, 44, 45),
('made', 987, 49, 52),
('break', 76, 59, 75),
('giveing', 1, 44, 44),
('cutting', 122, 41, 54),
('running', 73, 41, 52),
('ran', 29, 41, 41)]
“Cut” has a staggering 70 entrieds in wordnet – many, but not all verbs. “Has”, by contrast, has merely twenty.
Well, the first step (obtaining word frequencies) takes so ridiculously long since you are calling count() during each iteration step. That’s not exactly a good method to obtain word frequencies. Here is how I would’ve approached the problem:
# ———————————————–
import collections
import nltk.corpus
reuters_corpus = nltk.corpus.reuters
word_frequencies = collections.Counter(w.lower() for w in reuters_corpus.words())
# ———————————————–
That’s pretty much it. On my trust old Celeron M 430 this yields the same results, except it takes 8 seconds to finish.
Ah, thanks!
Interesting post Gary.
I did something similar a while ago to look at Use Case readiness. My initial results showed a correlation between word count and the number of open items in individual use cases.
Do you have similar data? My sample size was too small to be statistically significant…
Says this. Please explain.
rev = [l-r+1 for r in amb_c_rank]
NameError: name ‘l’ is not defined
P.S. : I dont have any website. So i wrote google there.