Modal verbs are auxiliary verbs that express semantic information about an action, e.g. likelihood (will, should), permission (could, may), or obligation (shall, must). One interesting question is whether the frequency of these verbs varies across different types of text, and whether that variation means anything.
“Natural Language Processing with Python” (read my review) has an example of how to start this process: comparing modal verb frequencies across genres of text using the Brown corpus, a well-known collection of texts assembled in the 1960s for language research.
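For reference, the book’s version looks roughly like this (reconstructed from memory, so the exact genre list may differ slightly):

import nltk
from nltk.corpus import brown
# Count (genre, word) pairs across the Brown corpus only
cfd = nltk.ConditionalFreqDist(
    (genre, word)
    for genre in brown.categories()
    for word in brown.words(categories=genre))
genres = ['news', 'religion', 'hobbies', 'science_fiction', 'romance', 'humor']
modals = ['can', 'could', 'may', 'might', 'must', 'will']
cfd.tabulate(conditions=genres, samples=modals)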
I extended the example to include an additional corpus of court cases (the contents of roughly 15,000 legal documents) and a couple of extra modal verbs.
We first define a function to retrieve the list of genres, and a second to retrieve the words for a genre. For the legal documents, I read from an index of n-grams (i.e. word/phrase counts) that I built previously.
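The 1gram index is simply a text file with one word and its count per line, separated by a space; for the modals it contains lines like these (counts taken from the legal row in the table below):

may 26968
must 15974
will 20757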
import nltk
import os
from nltk.corpus import brown

# The Brown corpus genres, plus my additional 'legal' corpus
def get_genres():
    yield 'legal'
    for genre in brown.categories():
        yield genre

modals = ['can', 'could', 'may', 'might', 'must', 'will', 'would', 'should']

# Yield the words for a genre; the 'legal' genre is read from the
# pre-built 1-gram index of "word count" lines rather than from raw text
def get_words(genre):
    if genre == 'legal':
        grams = open('1gram', 'rU')
        for line in grams:
            vals = line.split(' ')
            word = vals[0]
            count = int(vals[1])
            if word in modals:
                # yield a modal once per occurrence recorded in the index
                for index in range(0, count):
                    yield word
            else:
                # non-modal words are yielded only once; their counts are not needed here
                yield word
        grams.close()
    else:
        for word in brown.words(categories=genre):
            yield word
The Natural Language Toolkit provides a class for tracking the frequencies of “experiment” results – here we track the use of each modal verb, conditioned on genre.
cfd = nltk.ConditionalFreqDist(
    (genre, word)
    for genre in get_genres()
    for word in get_words(genre)
)

genres = [g for g in get_genres()]
cfd.tabulate(conditions=genres, samples=modals)
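As a quick sanity check, the distribution can also be indexed directly – each condition behaves like a frequency distribution keyed by word, and the printed counts should match the table below:

print(cfd['news']['will'])    # raw count of 'will' in the Brown news genre
print(cfd['legal']['may'])    # raw count of 'may' in the legal corpus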
The tabulate method is provided by NLTK and prints a nicely formatted chart (on the command line the columns line up neatly):
genre | can | could | may | might | must | will | would | should |
legal | 13059 | 7849 | 26968 | 1762 | 15974 | 20757 | 19931 | 13916 |
adventure | 46 | 151 | 5 | 58 | 27 | 50 | 191 | 15 |
belles_lettres | 246 | 213 | 207 | 113 | 170 | 236 | 392 | 102 |
editorial | 121 | 56 | 74 | 39 | 53 | 233 | 180 | 88 |
fiction | 37 | 166 | 8 | 44 | 55 | 52 | 287 | 35 |
government | 117 | 38 | 153 | 13 | 102 | 244 | 120 | 112 |
hobbies | 268 | 58 | 131 | 22 | 83 | 264 | 78 | 73 |
humor | 16 | 30 | 8 | 8 | 9 | 13 | 56 | 7 |
learned | 365 | 159 | 324 | 128 | 202 | 340 | 319 | 171 |
lore | 170 | 141 | 165 | 49 | 96 | 175 | 186 | 76 |
mystery | 42 | 141 | 13 | 57 | 30 | 20 | 186 | 29 |
news | 93 | 86 | 66 | 38 | 50 | 389 | 244 | 59 |
religion | 82 | 59 | 78 | 12 | 54 | 71 | 68 | 45 |
reviews | 45 | 40 | 45 | 26 | 19 | 58 | 47 | 18 |
romance | 74 | 193 | 11 | 51 | 45 | 43 | 244 | 32 |
science_fiction | 16 | 49 | 4 | 12 | 8 | 16 | 79 | 3 |
Looking at these numbers, it is clear that we need some normalization. My added corpus has far more tokens than any Brown genre, which makes the rows hard to compare.
The frequency distribution class exists to count things, and I didn’t see a built-in way to normalize the rows, so I rewrote the tabulate function to do it – it sums the modal counts for each row, divides each count by that total, and multiplies by 100.
import sys

# Like cfd.tabulate, but normalizes each row: each cell is a percentage of
# that genre's total modal count
def tabulate(cfd, conditions, samples):
    max_len = max(len(w) for w in conditions)
    sys.stdout.write(" " * (max_len + 1))
    for c in samples:
        sys.stdout.write("%-s\t" % c)
    sys.stdout.write("\n")
    for c in conditions:
        sys.stdout.write(" " * (max_len - len(c)))
        sys.stdout.write("%-s" % c)
        sys.stdout.write(" ")
        dist = cfd[c]
        norm = sum([dist[w] for w in modals])
        for s in samples:
            value = 100 * dist[s] / norm
            sys.stdout.write("%-d\t" % value)
        sys.stdout.write("\n")

tabulate(cfd, genres, modals)
This makes it easier to scan up and down the chart:
genre | can | could | may | might | must | will | would | should |
legal | 10 | 6 | 22 | 1 | 13 | 17 | 16 | 11 |
adventure | 8 | 27 | 0 | 10 | 4 | 9 | 35 | 2 |
belles_lettres | 14 | 12 | 12 | 6 | 10 | 14 | 23 | 6 |
editorial | 14 | 6 | 8 | 4 | 6 | 27 | 21 | 10 |
fiction | 5 | 24 | 1 | 6 | 8 | 7 | 41 | 5 |
government | 13 | 4 | 17 | 1 | 11 | 27 | 13 | 12 |
hobbies | 27 | 5 | 13 | 2 | 8 | 27 | 7 | 7 |
humor | 10 | 20 | 5 | 5 | 6 | 8 | 38 | 4 |
learned | 18 | 7 | 16 | 6 | 10 | 16 | 15 | 8 |
lore | 16 | 13 | 15 | 4 | 9 | 16 | 17 | 7 |
mystery | 8 | 27 | 2 | 11 | 5 | 3 | 35 | 5 |
news | 9 | 8 | 6 | 3 | 4 | 37 | 23 | 5 |
religion | 17 | 12 | 16 | 2 | 11 | 15 | 14 | 9 |
reviews | 15 | 13 | 15 | 8 | 6 | 19 | 15 | 6 |
romance | 10 | 27 | 1 | 7 | 6 | 6 | 35 | 4 |
science_fiction | 8 | 26 | 2 | 6 | 4 | 8 | 42 | 1 |
One thing this makes clear is that most genres use ‘would’ heavily and ‘should’ comparatively rarely.
It might be nicer to see these on a scale of 0-10 with one decimal place – the varying widths of the printed numbers in each column communicate something on their own.
# The same tabulation on a 0-10 scale, with one decimal place
def tabulate(cfd, conditions, samples):
    max_len = max(len(w) for w in conditions)
    sys.stdout.write(" " * (max_len + 1))
    for c in samples:
        sys.stdout.write("%-s\t" % c)
    sys.stdout.write("\n")
    for c in conditions:
        sys.stdout.write(" " * (max_len - len(c)))
        sys.stdout.write("%-s" % c)
        sys.stdout.write(" ")
        dist = cfd[c]
        norm = sum([dist[w] for w in modals])
        for s in samples:
            value = 10 * float(dist[s]) / norm
            sys.stdout.write("%.1f\t" % value)
        sys.stdout.write("\n")

tabulate(cfd, genres, modals)
genre | can | could | may | might | must | will | would | should |
legal | 1.1 | 0.7 | 2.2 | 0.1 | 1.3 | 1.7 | 1.7 | 1.2 |
adventure | 0.8 | 2.8 | 0.1 | 1.1 | 0.5 | 0.9 | 3.5 | 0.3 |
belles_lettres | 1.5 | 1.3 | 1.2 | 0.7 | 1.0 | 1.4 | 2.3 | 0.6 |
editorial | 1.4 | 0.7 | 0.9 | 0.5 | 0.6 | 2.8 | 2.1 | 1.0 |
fiction | 0.5 | 2.4 | 0.1 | 0.6 | 0.8 | 0.8 | 4.2 | 0.5 |
government | 1.3 | 0.4 | 1.7 | 0.1 | 1.1 | 2.7 | 1.3 | 1.2 |
hobbies | 2.7 | 0.6 | 1.3 | 0.2 | 0.8 | 2.7 | 0.8 | 0.7 |
humor | 1.1 | 2.0 | 0.5 | 0.5 | 0.6 | 0.9 | 3.8 | 0.5 |
learned | 1.8 | 0.8 | 1.6 | 0.6 | 1.0 | 1.7 | 1.6 | 0.9 |
lore | 1.6 | 1.3 | 1.6 | 0.5 | 0.9 | 1.7 | 1.8 | 0.7 |
mystery | 0.8 | 2.7 | 0.3 | 1.1 | 0.6 | 0.4 | 3.6 | 0.6 |
news | 0.9 | 0.8 | 0.6 | 0.4 | 0.5 | 3.8 | 2.4 | 0.6 |
religion | 1.7 | 1.3 | 1.7 | 0.3 | 1.2 | 1.5 | 1.4 | 1.0 |
reviews | 1.5 | 1.3 | 1.5 | 0.9 | 0.6 | 1.9 | 1.6 | 0.6 |
romance | 1.1 | 2.8 | 0.2 | 0.7 | 0.6 | 0.6 | 3.5 | 0.5 |
science_fiction | 0.9 | 2.6 | 0.2 | 0.6 | 0.4 | 0.9 | 4.2 | 0.2 |
It would be nice to see how similar these genres are. We can compute that by treating each genre’s modal counts as a vector; the angle between two vectors approximates their “similarity” (this is cosine similarity, which also ignores the overall size of each corpus). Another nice property of this measure is that it ignores all other words – words that appear in only one text often say more about how well the data was cleaned than about the genre of literature.
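To make the “percent similarity” scale concrete before applying it to real counts, here is a tiny standalone sketch of the angle-to-percent mapping (angle_percent is just an illustrative helper, not part of the original code): identical directions map to 100, orthogonal vectors map to 0.

import math

def angle_percent(u, v):
    # cosine of the angle between u and v, then the angle mapped linearly onto 0-100
    dotp = sum(a * b for (a, b) in zip(u, v))
    length_u = math.sqrt(sum(a * a for a in u))
    length_v = math.sqrt(sum(b * b for b in v))
    angle = math.acos(dotp / (length_u * length_v))
    return (math.pi / 2 - angle) / (math.pi / 2) * 100

print(angle_percent([1.0, 0.0], [1.0, 0.0]))  # 100.0 - same direction
print(angle_percent([1.0, 0.0], [0.0, 1.0]))  # 0.0   - orthogonal
print(angle_percent([1.0, 0.0], [1.0, 1.0]))  # ~50   - 45 degrees apart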
import math

# Measure how close each genre's modal-count vector is to a chosen base genre
def distance(cfd, conditions, samples, base):
    base_cond = cfd[base]
    base_vector = [base_cond[w] for w in samples]
    base_length = math.sqrt(sum(a * a for a in base_vector))
    for c in conditions:
        cond = cfd[c]
        cond_vector = [cond[w] for w in samples]
        dotp = sum(a * b for (a, b) in zip(base_vector, cond_vector))
        cond_length = math.sqrt(sum(a * a for a in cond_vector))
        angle = math.acos(dotp / (cond_length * base_length))
        percent = (math.pi / 2 - angle) / (math.pi / 2) * 100
        print "%-s similarity to %-s: %-.1f" % (c, base, percent)

distance(cfd, genres, modals, 'legal')
The results are interesting – the genres closest to legal in this case are government and religion.
As an interesting side-note, belles_lettres means “fine writing”, i.e. poems, drama, fiction.
legal similarity to legal: 100.0
adventure similarity to legal: 41.6
belles_lettres similarity to legal: 72.4
editorial similarity to legal: 68.8
fiction similarity to legal: 42.9
government similarity to legal: 80.6
hobbies similarity to legal: 63.5
humor similarity to legal: 50.1
learned similarity to legal: 80.6
lore similarity to legal: 78.6
mystery similarity to legal: 41.3
news similarity to legal: 58.1
religion similarity to legal: 81.2
reviews similarity to legal: 73.5
romance similarity to legal: 42.9
science_fiction similarity to legal: 41.8
Some genres appear similar to legal documents. It is possible, however, that some modal verbs are not independent of one another – for instance, ‘may’ and ‘might’ could show nearly the same similarity profile. One way to test this is to flip what we track for distance: build a vector for each modal, rather than for each genre.
The following code measures the distance between each modal’s vector and the mean vector, using the genres as dimensions. Since every modal contributes to the mean, there is guaranteed to be some similarity, but some are closer than others. Note also that the counts have to be normalized per genre, as in the last example, or the answer will be dominated by the much larger ‘legal’ corpus.
# Compare each modal's per-genre profile to a base vector built from all modals
def distance(cfd, conditions, samples):
    base_vector = [0.0 for w in conditions]
    norm = {}
    # Build the base vector: for each genre, sum the normalized modal fractions
    for c_i in range(0, len(conditions)):
        cond_name = conditions[c_i]
        cond = cfd[cond_name]
        norm[cond_name] = float(sum(cond[s] for s in samples))
        for s in samples:
            base_vector[c_i] = base_vector[c_i] + float(cond[s]) / norm[cond_name]
    base_length = math.sqrt(sum(a * a for a in base_vector))
    for s in samples:  # compute each modal's vector - can, might, etc.
        sample_vector = []
        for c in conditions:  # one dimension per genre
            sample_vector.append(cfd[c][s] / norm[c])
        dotp = sum(a * b for (a, b) in zip(base_vector, sample_vector))
        sample_length = math.sqrt(sum(a * a for a in sample_vector))
        angle = math.acos(dotp / (sample_length * base_length))
        percent = (math.pi / 2 - angle) / (math.pi / 2) * 100
        print "%-s similarity to mean: %-.1f" % (s, percent)

distance(cfd, genres, modals)
What I’d infer from this is that the least helpful verb for distinguishing genres is “must,” and the most helpful is “may.”
can similarity to mean: 76.0
could similarity to mean: 67.6
may similarity to mean: 61.5
might similarity to mean: 70.0
must similarity to mean: 79.7
will similarity to mean: 67.7
would similarity to mean: 73.6
should similarity to mean: 74.2