One of the difficulties inherent in machine learning techniques is that the most accurate algorithms refuse to tell a story: we can discuss the confusion matrix, testing and training data, accuracy and the like, but it’s often hard to explain in simple terms what’s really going on.
Practically speaking this isn’t a big issue from an engineering perspective, but in a general political sense it is: highly accurate machine learning models are often considered creepy, especially when it’s not apparent how they figured something out.
A simple case of this is part of speech tagging – you can read a book on how it works, and see the output, but it’s really hard to figure out whether something is “good” and develop an intuition for the personality of the algorithms. To that end, I’ve experimented with comparing the output of two taggers on common pieces of text, below.
The first tagger is the POS tagger included in NLTK (Python). This is presented in some detail in “Natural Language Processing with Python” (read my review), which has lots of motivating examples for natural language processing built around NLTK, a natural language processing library maintained by the book’s authors. The second is the Stanford NLP tagger (Java). Conveniently, the two use a similar set of tags.
For the first example, we’ll take a simple sentence and compare the output of the two products. In this case, you can see the formatting is quite different, but the tags are the same.
import nltk  # assumes the standard NLTK tokenizer and tagger models are installed

nltk.pos_tag(nltk.word_tokenize(
"""This Court has jurisdiction to
consider the merits of the case."""))
[('This', 'DT'),
('Court', 'NNP'),
('has', 'VBZ'),
('jurisdiction', 'NN'),
('to', 'TO'),
('consider', 'VB'),
('the', 'DT'),
('merits', 'NNS'),
('of', 'IN'),
('the', 'DT'),
('case', 'NN'),
('.', '.')]
This_DT Court_NNP has_VBZ jurisdiction_NN
to_TO consider_VB the_DT merits_NNS of_IN
the_DT case_NN ._.
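For reference, the Stanford output above came from running the Java distribution directly. If you’d rather stay in Python, NLTK ships a wrapper for the Stanford tagger; here’s a minimal sketch, assuming you’ve downloaded the Stanford tagger yourself (the jar and model paths are placeholders, and in older NLTK versions the class is nltk.tag.stanford.POSTagger):

import nltk
from nltk.tag import StanfordPOSTagger

# Placeholder paths: point these at wherever the Stanford download lives.
st = StanfordPOSTagger('models/english-bidirectional-distsim.tagger',
                       'stanford-postagger.jar')
print(st.tag(nltk.word_tokenize(
    "This Court has jurisdiction to consider the merits of the case.")))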
For reference, there are quite a few possible tags in a POS tagger, far more than what you learn in high school English class – this granularity helps later processing steps produce more accurate results. Here they are, from the Penn Treebank documentation:
CC - Coordinating conjunction
CD - Cardinal number
DT - Determiner
EX - Existential there
FW - Foreign word
IN - Preposition or subordinating conjunction
JJ - Adjective
JJR - Adjective, comparative
JJS - Adjective, superlative
LS - List item marker
MD - Modal
NN - Noun, singular or mass
NNS - Noun, plural
NNP - Proper noun, singular
NNPS - Proper noun, plural
PDT - Predeterminer
POS - Possessive ending
PRP - Personal pronoun
PRP$ - Possessive pronoun (prolog version PRP-S)
RB - Adverb
RBR - Adverb, comparative
RBS - Adverb, superlative
RP - Particle
SYM - Symbol
TO - to
UH - Interjection
VB - Verb, base form
VBD - Verb, past tense
VBG - Verb, gerund or present participle
VBN - Verb, past participle
VBP - Verb, non-3rd person singular present
VBZ - Verb, 3rd person singular present
WDT - Wh-determiner
WP - Wh-pronoun
WP$ - Possessive wh-pronoun (prolog version WP-S)
WRB - Wh-adverb
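Incidentally, you don’t need to keep this list handy: NLTK can print the same documentation, with example words, for any tag or regular-expression tag pattern (assuming the one-time download of the “tagsets” resource):

import nltk
# nltk.download('tagsets')  # one-time download of the tag documentation
nltk.help.upenn_tagset('JJ.*')  # prints JJ, JJR, and JJS with definitions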
The following are some more involved examples, rendered side by side. I’ve edited the output to facilitate comparison. NLTK is on the left; Stanford NLP on the right.
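The mechanical part of that rendering is simple enough; here’s a rough sketch, assuming each tagger has already produced a list of (token, tag) pairs (the helper name is mine, not part of either toolkit). When the tokenizers disagree, as in the final example below, the rows drift out of alignment and need hand editing:

from itertools import zip_longest

def side_by_side(left_tags, right_tags):
    # Pad the shorter list so differing tokenizations still print.
    for left, right in zip_longest(left_tags, right_tags, fillvalue=('', '')):
        print('%s: %s | %s: %s |' % (left + right))

side_by_side([('Can', 'NNP'), ('my', 'PRP$')],
             [('Can', 'MD'), ('my', 'PRP$')])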
For the first of these, I’ve chosen a sonnet; the tagging is surprisingly similar between the two tools.
nltk.pos_tag(nltk.word_tokenize("""
Can my love excuse the slow offence,
Of my dull bearer, when from thee I speed,
From where thou art, why should I haste me thence?
Till I return of posting is no need.
"""))
Can: NNP | Can: MD |
my: PRP$ | my: PRP$ |
love: NN | love: NN |
excuse: NN | excuse: NN |
the: DT | the: DT |
slow: JJ | slow: JJ |
offence: NN | offence: NN |
,: , | ,: , |
Of: IN | Of: IN |
my: PRP$ | my: PRP$ |
dull: NN | dull: JJ |
bearer: NN | bearer: NN |
,: , | ,: , |
when: WRB | when: WRB |
from: IN | from: IN |
thee: NN | thee: NN |
I: PRP | I: PRP |
speed: VBP | speed: VBP |
,: , | ,: , |
From: NNP | From: IN |
where: WRB | where: WRB |
thou: PRP | thou: JJ |
art: VBP | art: NN |
,: , | ,: , |
why: WRB | why: WRB |
should: MD | should: MD |
I: PRP | I: PRP |
haste: VB | haste: NN |
me: PRP | me: PRP |
thence: NN | thence: VB |
?: . | ?: . |
Till: NNP | Till: IN |
I: PRP | I: PRP |
return: VBP | return: VBP |
of: IN | of: IN |
posting: VBG | posting: VBG |
is: VBZ | is: VBZ |
no: DT | no: DT |
need: NN | need: NN |
.: . | .: . |
For a second example, I’ve chosen a very wordy passage from a recent Supreme Court case. This type of text is interesting because it’s a common kind of thing one might want to analyze, and it contains entity names. For such a long sentence, there is very little deviation – it makes me wonder if these aren’t two versions of the same code/model.
nltk.pos_tag(nltk.word_tokenize("""Roy Koontz, Sr., whose estate
is represented here by petitioner, sought permits to develop a
section of his property from respondent St. Johns River Water
Management District (District), which, consistent with Florida
law, requires permit applicants wishing to build on wetlands to
offset the resulting environmental damage."""))
Roy: NNP | Roy: NNP |
Koontz: NNP | Koontz: NNP |
,: , | ,: , |
Sr.: NNP | Sr.: NNP |
,: , | ,: , |
whose: WP$ | whose: WP$ |
estate: NN | estate: NN |
is: VBZ | is: VBZ |
represented: VBN | represented: VBN |
here: RB | here: RB |
by: IN | by: IN |
petitioner: NN | petitioner: NN |
,: , | ,: , |
sought: VBD | sought: VBD |
permits: NNS | permits: NNS |
to: TO | to: TO |
develop: VB | develop: VB |
a: DT | a: DT |
section: NN | section: NN |
of: IN | of: IN |
his: PRP$ | his: PRP$ |
property: NN | property: NN |
from: IN | from: IN |
respondent: NN | respondent: NN |
St.: NNP | St.: NNP |
Johns: NNP | Johns: NNP |
River: NNP | River: NNP |
Water: NNP | Water: NNP |
Management: NNP | Management: NNP |
District: NNP | District: NNP |
(: NNP | -LRB-: -LRB- |
District: NNP | District: NNP |
): NNP | -RRB-: -RRB- |
,: , | ,: , |
which: WDT | which: WDT |
,: , | ,: , |
consistent: VBD | consistent: JJ |
with: IN | with: IN |
Florida: NNP | Florida: NNP |
law: NN | law: NN |
,: , | ,: , |
requires: VBZ | requires: VBZ |
permit: NN | permit: NN |
applicants: NNS | applicants: NNS |
wishing: VBG | wishing: VBG |
to: TO | to: TO |
build: VB | build: VB |
on: IN | on: IN |
wetlands: NNS | wetlands: NNS |
to: TO | to: TO |
offset: VB | offset: VB |
the: DT | the: DT |
resulting: VBG | resulting: VBG |
environmental: JJ | environmental: JJ |
damage: NN | damage: NN |
.: . | .: . |
Now, for a really interesting example: gibberish made to look like English. For this test, we’re going to use Lewis Carroll’s Jabberwocky:
nltk.pos_tag(nltk.word_tokenize("""
Twas bryllyg, and ye slythy toves
Did gyre and gymble in ye wabe:
All mimsy were ye borogoves;
And ye mome raths outgrabe.
"""))
At last, we have something where the output varies. One obvious lesson is that these algorithms are more than happy to guess in order to improve accuracy, even when they have no idea what’s going on, much like guessing on a multiple-choice test. It may be prudent to develop a class of algorithms that lose points for consistently guessing wildly incorrectly, similar to the scoring method once used on the SATs (a toy sketch of this idea follows the output below).
Twas: NNP | Twas: NNP |
bryllyg: NN | bryllyg: NN |
,: , | ,: , |
and: CC | and: CC |
ye: VB | ye: NN |
slythy: JJ | slythy: NN |
toves: NNS | toves: VBZ |
Did: NNP | Did: VBD |
gyre: NN | gyre: NN |
and: CC | and: CC |
gymble: JJ | gymble: NN |
in: IN | in: IN |
ye: NN | ye: JJ |
wabe: NN | wabe: NN |
:: : | :: : |
All: DT | All: DT |
mimsy: NN | mimsy: NN |
were: VBD | were: VBD |
ye: NN | ye: JJ |
borogoves: NNS | borogoves: NNS |
;: : | ;: : |
And: CC | And: CC |
ye: NN | ye: VB |
mome: NN | mome: FW |
raths: NNS | raths: FW |
outgrabe: VBP | outgrabe: FW |
.: . | .: . |
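To make the SAT-style scoring idea concrete, here is a toy sketch. It’s entirely illustrative (neither toolkit scores itself this way), and it assumes a tagger that can abstain by emitting None:

def sat_score(predicted, gold, penalty=0.25):
    # +1 for a correct tag, -penalty for a wrong guess, 0 for abstaining.
    score = 0.0
    for p, g in zip(predicted, gold):
        if p is None:
            continue
        score += 1.0 if p == g else -penalty
    return score / len(gold)

print(sat_score(['NN', None, 'VB'], ['NN', 'JJ', 'NN']))  # 0.25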
For a final sample, we have a commonly cited section of Winnie the Pooh: while completely decipherable as English, it’s excruciatingly long. It may be worth noting that while this is verbose for modern tastes, many legal documents are written in the form of a single long sentence, separated by conjunctions (whereas a, whereas b, …) – this also bears strong resemblance to the writings of Victor Hugo:
In after-years [Piglet] liked to think that he had been in Very Great Danger during the Terrible Flood, but the only danger he had really been in was in the last half-hour of his imprisonment, when Owl, who had just flown up, sat on a branch of his tree to comfort him, and told him a very long story about an aunt who had once laid a seagull’s egg by mistake, and the story went on and on, rather like this sentence, until Piglet who was listening out of his window without much hope, went to sleep quietly and naturally, slipping slowly out of the window towards the water until he was only hanging on by his toes, at which moment luckily, a sudden loud squawk from Owl, which was really part of the story, being what his aunt said, woke Piglet up and just gave him time to jerk himself back into safety and say, “How interesting, and did she?” when-well, you can imagine his joy when at last he saw the good ship, The Brain of Pooh (Captain, C. Robin; 1st Mate, P. Bear) coming over the sea to rescue him.
It’s worth noting here that the two tools tokenize this passage slightly differently – one handles a set of unicode characters more gracefully, and the other inserts extra token breaks (a small demonstration follows the output below).
In: IN | In: IN |
after-years: NNS | after-years: JJ |
[: : | -LSB-: -LRB- |
Piglet: NNP | Piglet: NN |
]: : | -RSB-: -RRB- |
liked: VBD | liked: VBD |
to: TO | to: TO |
think: VB | think: VB |
that: IN | that: IN |
he: PRP | he: PRP |
had: VBD | had: VBD |
been: VBN | been: VBN |
in: IN | in: IN |
Very: NNP | Very: RB |
Great: NNP | Great: JJ |
Danger: NNP | Danger: NN |
during: IN | during: IN |
the: DT | the: DT |
Terrible: NNP | Terrible: JJ |
Flood: NNP | Flood: NNP |
,: , | ,: , |
but: CC | but: CC |
the: DT | the: DT |
only: JJ | only: JJ |
danger: NN | danger: NN |
he: PRP | he: PRP |
had: VBD | had: VBD |
really: RB | really: RB |
been: VBN | been: VBN |
in: IN | in: IN |
was: VBD | was: VBD |
in: IN | in: IN |
the: DT | the: DT |
last: JJ | last: JJ |
half-hour: JJ | half-hour: NN |
of: IN | of: IN |
his: PRP$ | his: PRP$ |
imprisonment: NN | imprisonment: NN |
,: , | ,: , |
when: WRB | when: WRB |
Owl: NNP | Owl: NN |
,: , | ,: , |
who: WP | who: WP |
had: VBD | had: VBD |
just: RB | just: RB |
flown: VBN | flown: VBN |
up: RP | up: RP |
,: , | ,: , |
sat: JJ | sat: VBD |
on: IN | on: IN |
a: DT | a: DT |
branch: NN | branch: NN |
of: IN | of: IN |
his: PRP$ | his: PRP$ |
tree: NN | tree: NN |
to: TO | to: TO |
comfort: VB | comfort: NN |
him: PRP | him: PRP |
,: , | ,: , |
and: CC | and: CC |
told: VBD | told: VBD |
him: PRP | him: PRP |
a: DT | a: DT |
very: RB | very: RB |
long: JJ | long: JJ |
story: NN | story: NN |
about: IN | about: IN |
an: DT | an: DT |
aunt: NN | aunt: NN |
who: WP | who: WP |
had: VBD | had: VBD |
once: RB | once: RB |
laid: VBN | laid: VBN |
a: DT | a: DT |
seagull\xe2\x80\x99s: JJ | seagull: NN |
 | s: POS |
egg: NN | egg: NN |
by: IN | by: IN |
mistake: NN | mistake: NN |
,: , | ,: , |
and: CC | and: CC |
the: DT | the: DT |
story: NN | story: NN |
went: VBD | went: VBD |
on: IN | on: IN |
and: CC | and: CC |
on: IN | on: IN |
,: , | ,: , |
rather: RB | rather: RB |
like: IN | like: IN |
this: DT | this: DT |
sentence: NN | sentence: NN |
,: , | ,: , |
until: IN | until: IN |
Piglet: NNP | Piglet: NNP |
who: WP | who: WP |
was: VBD | was: VBD |
listening: VBG | listening: VBG |
out: RP | out: IN |
of: IN | of: IN |
his: PRP$ | his: PRP$ |
window: NN | window: NN |
without: IN | without: IN |
much: JJ | much: JJ |
hope: NN | hope: NN |
,: , | ,: , |
went: VBD | went: VBD |
to: TO | to: TO |
sleep: VB | sleep: VB |
quietly: RB | quietly: RB |
and: CC | and: CC |
naturally: RB | naturally: RB |
,: , | ,: , |
slipping: VBG | slipping: VBG |
slowly: RB | slowly: RB |
out: IN | out: IN |
of: IN | of: IN |
the: DT | the: DT |
window: NN | window: NN |
towards: NNS | towards: IN |
the: DT | the: DT |
water: NN | water: NN |
until: IN | until: IN |
he: PRP | he: PRP |
was: VBD | was: VBD |
only: RB | only: RB |
hanging: VBG | hanging: VBG |
on: IN | on: IN |
by: IN | by: IN |
his: PRP$ | his: PRP$ |
toes: NNS | toes: NNS |
,: , | ,: , |
at: IN | at: IN |
which: WDT | which: WDT |
moment: NN | moment: NN |
luckily: RB | luckily: RB |
,: , | ,: , |
a: DT | a: DT |
sudden: JJ | sudden: JJ |
loud: NN | loud: JJ |
squawk: NN | squawk: NN |
from: IN | from: IN |
Owl: NNP | Owl: NN |
,: , | ,: , |
which: WDT | which: WDT |
was: VBD | was: VBD |
really: RB | really: RB |
part: NN | part: NN |
of: IN | of: IN |
the: DT | the: DT |
story: NN | story: NN |
,: , | ,: , |
being: VBG | being: VBG |
what: WP | what: WP |
his: PRP$ | his: PRP$ |
aunt: NN | aunt: NN |
said: VBD | said: VBD |
,: , | ,: , |
woke: NN | woke: VBD |
Piglet: NNP | Piglet: NNP |
up: IN | up: IN |
and: CC | and: CC |
just: RB | just: RB |
gave: VBD | gave: VBD |
him: PRP | him: PRP |
time: NN | time: NN |
to: TO | to: TO |
jerk: VB | jerk: VB |
himself: PRP | himself: PRP |
back: RB | back: RB |
into: IN | into: IN |
safety: NN | safety: NN |
and: CC | and: CC |
say: VB | say: VB |
,: , | ,: , |
``: `` | ``: `` |
How: WRB | How: WRB |
interesting: JJ | interesting: JJ |
,: , | ,: , |
and: CC | and: CC |
did: VBD | did: VBD |
she: PRP | she: PRP |
?: . | ?: . |
'': '' | ': '' |
when-well: NNP | when-well: NN |
,: , | ,: , |
you: PRP | you: PRP |
can: MD | can: MD |
imagine: VB | imagine: VB |
his: PRP$ | his: PRP$ |
joy: NN | joy: NN |
when: WRB | when: WRB |
at: IN | at: IN |
last: JJ | last: JJ |
he: PRP | he: PRP |
saw: VBD | saw: VBD |
the: DT | the: DT |
good: JJ | good: JJ |
ship: NN | ship: NN |
,: , | ,: , |
The: NNP | The: DT |
Brain: NNP | Brain: NN |
of: IN | of: IN |
Pooh: NNP | Pooh: NNP |
(: NNP | -LRB-: -LRB- |
Captain: NNP | Captain: NNP |
,: , | ,: , |
C.: NNP | C.: NNP |
Robin: NNP | Robin: NNP |
;: : | ;: : |
1st: CD | 1st: CD |
Mate: NNP | Mate: NN |
,: , | ,: , |
P.: NNP | P.: NNP |
Bear: NNP | Bear: NNP |
): NNP | -RRB-: -RRB- |
coming: VBG | coming: VBG |
over: IN | over: IN |
the: DT | the: DT |
sea: NN | sea: NN |
to: TO | to: TO |
rescue: VB | rescue: VB |
him: PRP | him: PRP |
.: . | .: . |
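Here is the tokenization difference in miniature; the exact behavior varies by NLTK version, so treat this as a sketch of what the taggers saw at the time rather than what a current install will print:

import nltk
# At the time, NLTK's tokenizer left the curly-quote possessive intact,
# while the Stanford tokenizer split it into "seagull" and "'s".
print(nltk.word_tokenize(u"who had once laid a seagull\u2019s egg by mistake"))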
On wild guessing: many taggers also calculate probabilities for the tags they assign, which describe their confidence in each tag. I’m sure there’s an option somewhere to get the Stanford tagger and/or the NLTK tagger to output them.
(I don’t know if this will be useful for you, but our Twitter POS tagger outputs confidence scores by default: http://www.ark.cs.cmu.edu/TweetNLP/)
If you only want tags where the tagger is likely to be right, simply use only those tags with at least 95% or 98% confidence, or whatever level you want.
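If your tagger does expose probabilities, the filtering itself is trivial. Here’s a sketch, assuming a hypothetical tagger that returns (token, tag, probability) triples; the triple format and the helper name are mine, not any particular library’s API:

def confident_tags(tagged, threshold=0.95):
    # Keep only the tags the (hypothetical) tagger is reasonably sure about.
    return [(tok, tag) for tok, tag, prob in tagged if prob >= threshold]

print(confident_tags([('ye', 'NN', 0.41), ('need', 'NN', 0.99)]))
# [('need', 'NN')]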