NLP Analysis in Python using Modal Verbs

Modal verbs are auxiliary verbs which convey semantic information about an action, e.g. likelihood (will, should), permission (could, may), or obligation (shall/must). One interesting question to explore is whether the frequency of these verbs varies across different types of text, and whether that variation means anything.

“Natural Language Processing with Python” (read my review) has an example of how to start this process, comparing modal verb frequencies across various genres of text using the Brown corpus, a well-known collection of texts assembled in the 1960s for language research.
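
If you want to follow along, the Brown corpus ships with NLTK; the short sketch below (assuming a standard NLTK install) downloads it and lists its genres.

import nltk

# one-time download of the Brown corpus used throughout this post
nltk.download('brown')

from nltk.corpus import brown
print(brown.categories())  # adventure, belles_lettres, editorial, ...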

I extended the example to include an additional corpus of court cases, plus a few extra modal verbs – this corpus includes the contents of roughly 15,000 legal documents.

We first define a function to retrieve the genres of literature, and a second to retrieve the words in each genre. For the legal documents, I am reading from an index of n-grams (i.e. word/phrase counts) that I previously built.

import nltk
from nltk.corpus import brown

def get_genres():
  yield 'legal'
  for genre in brown.categories():
    yield genre

modals = ['can', 'could', 'may', 'might', 'must', 'will', 'would', 'should']

def get_words(genre):
  if genre == 'legal':
    # index lines look like "<word> <count>"
    grams = open('1gram', 'r')
    for line in grams:
      vals = line.split(' ')
      word = vals[0]
      count = int(vals[1])
      if word in modals:
        # expand the stored count back into individual tokens
        for index in range(0, count):
          yield word
      else:
        yield word
    grams.close()
  else:
    for word in brown.words(categories=genre):
      yield word
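
Based on how the file is parsed above, the 1gram index is assumed to be a plain text file with one space-separated "word count" pair per line; the legal figures in the table further down would come from lines such as:

may 26968
will 20757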

The Natural Language Toolkit provides a class for tracking the frequencies of “experiment” results – here we track the use of the different modal verbs within each genre.

cfd = nltk.ConditionalFreqDist(
  (genre, word)
  for genre in get_genres()
  for word in get_words(genre)
)
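
As a quick sanity check, individual counts can be read straight from the conditional distribution (each condition maps to a FreqDist); for example, the value below should match the 389 shown for news/will in the table that follows.

print(cfd['news']['will'])  # occurrences of 'will' in the news genre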

genres = [g for g in get_genres()]
cfd.tabulate(conditions=genres, samples=modals)

The tabulate method is provided by NLTK and prints a nicely formatted chart (on a command line, everything lines up neatly):

can could may might must will would should
legal 13059 7849 26968 1762 15974 20757 19931 13916
adventure 46 151 5 58 27 50 191 15
belles_lettres 246 213 207 113 170 236 392 102
editorial 121 56 74 39 53 233 180 88
fiction 37 166 8 44 55 52 287 35
government 117 38 153 13 102 244 120 112
hobbies 268 58 131 22 83 264 78 73
humor 16 30 8 8 9 13 56 7
learned 365 159 324 128 202 340 319 171
lore 170 141 165 49 96 175 186 76
mystery 42 141 13 57 30 20 186 29
news 93 86 66 38 50 389 244 59
religion 82 59 78 12 54 71 68 45
reviews 45 40 45 26 19 58 47 18
romance 74 193 11 51 45 43 244 32
science_fiction 16 49 4 12 8 16 79 3

Looking at these numbers, it is clear that we need some form of normalization: my added corpus has far more tokens than the Brown corpus, which makes it hard to compare across genres.

The frequency distribution class exists to count things, and I didn’t see a built-in way to normalize the rows, so I re-wrote the tabulate function to do it – it totals the modal counts for each row, divides each cell by that total, and multiplies by 100, so each row reads as percentages. For example, the legal row contains 120,216 modal tokens in all, so ‘may’ becomes 100 * 26968 / 120216 ≈ 22.

import sys

def tabulate(cfd, conditions, samples):
  max_len = max(len(w) for w in conditions)
  # header row: one column per modal
  sys.stdout.write(" " * (max_len + 1))
  for s in samples:
    sys.stdout.write("%-s\t" % s)
  sys.stdout.write("\n")
  for c in conditions:
    # right-align the genre name
    sys.stdout.write(" " * (max_len - len(c)))
    sys.stdout.write("%-s" % c)
    sys.stdout.write(" ")
    dist = cfd[c]
    # normalize by the total modal count for this genre
    norm = sum(dist[w] for w in samples)
    for s in samples:
      value = 100 * dist[s] / norm
      sys.stdout.write("%-d\t" % value)
    sys.stdout.write("\n")

tabulate(cfd, genres, modals)

This makes it much easier to scan up and down the chart:

can could may might must will would should
legal 10 6 22 1 13 17 16 11
adventure 8 27 0 10 4 9 35 2
belles_lettres 14 12 12 6 10 14 23 6
editorial 14 6 8 4 6 27 21 10
fiction 5 24 1 6 8 7 41 5
government 13 4 17 1 11 27 13 12
hobbies 27 5 13 2 8 27 7 7
humor 10 20 5 5 6 8 38 4
learned 18 7 16 6 10 16 15 8
lore 16 13 15 4 9 16 17 7
mystery 8 27 2 11 5 3 35 5
news 9 8 6 3 4 37 23 5
religion 17 12 16 2 11 15 14 9
reviews 15 13 15 8 6 19 15 6
romance 10 27 1 7 6 6 35 4
science_fiction 8 26 2 6 4 8 42 1

One thing this makes clear is that most genres use ‘would’ heavily and ‘should’ comparatively rarely.

It might be nice to see these on a scale of 0–10 – the columns of numbers communicate something in their lengths.

def tabulate(cfd, conditions, samples):
  max_len = max(len(w) for w in conditions)
  sys.stdout.write(" " * (max_len + 1))
  for s in samples:
    sys.stdout.write("%-s\t" % s)
  sys.stdout.write("\n")
  for c in conditions:
    sys.stdout.write(" " * (max_len - len(c)))
    sys.stdout.write("%-s" % c)
    sys.stdout.write(" ")
    dist = cfd[c]
    norm = sum(dist[w] for w in samples)
    for s in samples:
      # scale each row to 0-10 rather than 0-100
      value = 10 * float(dist[s]) / norm
      sys.stdout.write("%.1f\t" % value)
    sys.stdout.write("\n")

tabulate(cfd, genres, modals)

can could may might must will would should
legal 1.1 0.7 2.2 0.1 1.3 1.7 1.7 1.2
adventure 0.8 2.8 0.1 1.1 0.5 0.9 3.5 0.3
belles_lettres 1.5 1.3 1.2 0.7 1.0 1.4 2.3 0.6
editorial 1.4 0.7 0.9 0.5 0.6 2.8 2.1 1.0
fiction 0.5 2.4 0.1 0.6 0.8 0.8 4.2 0.5
government 1.3 0.4 1.7 0.1 1.1 2.7 1.3 1.2
hobbies 2.7 0.6 1.3 0.2 0.8 2.7 0.8 0.7
humor 1.1 2.0 0.5 0.5 0.6 0.9 3.8 0.5
learned 1.8 0.8 1.6 0.6 1.0 1.7 1.6 0.9
lore 1.6 1.3 1.6 0.5 0.9 1.7 1.8 0.7
mystery 0.8 2.7 0.3 1.1 0.6 0.4 3.6 0.6
news 0.9 0.8 0.6 0.4 0.5 3.8 2.4 0.6
religion 1.7 1.3 1.7 0.3 1.2 1.5 1.4 1.0
reviews 1.5 1.3 1.5 0.9 0.6 1.9 1.6 0.6
romance 1.1 2.8 0.2 0.7 0.6 0.6 3.5 0.5
science_fiction 0.9 2.6 0.2 0.6 0.4 0.9 4.2 0.2

It would also be nice to see how similar these genres are. We can compute that by treating each genre’s modal counts as a vector; the angle between two vectors then approximates their “similarity” (this is essentially cosine similarity). A nice property of this measure is that it ignores all the other words – words which may only exist in one text, often an artifact of how well the data was cleaned rather than of the genre itself.
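
As a toy illustration (with made-up vectors), two vectors pointing in the same direction get an angle of zero no matter how large their counts are, which is why the much larger legal corpus does not dominate this comparison:

import math

a = [2, 4, 6]
b = [1, 2, 3]  # same direction as a, half the magnitude
dotp = sum(x * y for (x, y) in zip(a, b))
cos_angle = dotp / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))
print(math.degrees(math.acos(min(cos_angle, 1.0))))  # ~0.0 degrees: as similar as possible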

import math

def distance(cfd, conditions, samples, base):
  base_cond = cfd[base]
  base_vector = [base_cond[w] for w in samples]
  base_length = math.sqrt(sum(a * a for a in base_vector))
  for c in conditions:
    cond = cfd[c]
    cond_vector = [cond[w] for w in samples]
    # cosine of the angle between this genre's modal vector and the base genre's
    dotp = sum(a * b for (a, b) in zip(base_vector, cond_vector))
    cond_length = math.sqrt(sum(a * a for a in cond_vector))
    angle = math.acos(dotp / (cond_length * base_length))
    # map the angle onto 0-100, where 100 means the vectors point the same way
    percent = (math.pi / 2 - angle) / (math.pi / 2) * 100
    print("%-s similarity to %-s: %-.1f" % (c, base, percent))

The results are interesting – the genres closest to legal in this case are religion, government, and learned.

As an interesting side-note, belles_lettres means “fine writing”, i.e. poems, drama, fiction.

legal similarity to legal: 100.0
adventure similarity to legal: 41.6
belles_lettres similarity to legal: 72.4
editorial similarity to legal: 68.8
fiction similarity to legal: 42.9
government similarity to legal: 80.6
hobbies similarity to legal: 63.5
humor similarity to legal: 50.1
learned similarity to legal: 80.6
lore similarity to legal: 78.6
mystery similarity to legal: 41.3
news similarity to legal: 58.1
religion similarity to legal: 81.2
reviews similarity to legal: 73.5
romance similarity to legal: 42.9
science_fiction similarity to legal: 41.8

Some genres appear similar to legal documents – it is possible, however, that some of the verbs are not independent of each other; for instance, “may” and “might” may track each other closely. One way to test this is to flip what we compare: build a vector for each modal (with one dimension per genre) rather than for each genre.

The following code computes the distance between each modal and the mean, using the different genres as dimensions. Since every modal contributes something to the mean, there is guaranteed to be some similarity, but some are closer than others. Note also that the counts have to be normalized per genre, as in the tabulate example above, or the answer would be dominated by the ‘legal’ genre.

def distance(cfd, conditions, samples):
  # build the "mean" vector: one dimension per genre, each normalized by
  # that genre's total modal count so the large legal corpus does not dominate
  base_vector = [0.0 for w in conditions]
  norm = {}
  for c_i in range(0, len(conditions)):
    cond_name = conditions[c_i]
    cond = cfd[cond_name]
    norm[cond_name] = float(sum(cond[s] for s in samples))
    for s in samples:
      base_vector[c_i] = base_vector[c_i] + float(cond[s]) / norm[cond_name]
  base_length = math.sqrt(sum(a * a for a in base_vector))
  for s in samples: # compute each modal's vector - can, might, etc.
    sample_vector = []
    for c in conditions: # one dimension per genre
      sample_vector.append(cfd[c][s] / norm[c])
    dotp = sum(a * b for (a, b) in zip(base_vector, sample_vector))
    sample_length = math.sqrt(sum(a * a for a in sample_vector))
    angle = math.acos(dotp / (sample_length * base_length))
    percent = (math.pi / 2 - angle) / (math.pi / 2) * 100
    print("%-s similarity to mean: %-.1f" % (s, percent))

distance(cfd, genres, modals)

Since a modal that sits close to the mean does little to distinguish genres, what I’d infer from this is that the least helpful verb for telling genres apart is “must,” and the most helpful is “may.”

can similarity to mean: 76.0
could similarity to mean: 67.6
may similarity to mean: 61.5
might similarity to mean: 70.0
must similarity to mean: 79.7
will similarity to mean: 67.7
would similarity to mean: 73.6
should similarity to mean: 74.2