Part of Speech Tagging: NLTK vs Stanford NLP

One of the difficulties inherent in machine learning techniques is that the most accurate algorithms refuse to tell a story: we can discuss the confusion matrix, testing and training data, accuracy and the like, but it’s often hard to explain in simple terms what’s really going on.

Practically speaking this isn’t a big issue from an engineering perspective, but in a general political sense it is: highly accurate machines are often considered creepy, especially when it’s not apparent how they figured something out.

A simple case of this is part of speech tagging – you can read a book on how it works, and see the output, but it’s really hard to figure out whether something is “good” and develop an intuition for the personality of the algorithms. To that end, I’ve experimented with comparing the output of two taggers on common pieces of text, below.

The first tagger is the POS tagger included in NLTK (Python). It is presented in some detail in “Natural Language Processing with Python” (read my review), which builds lots of motivating natural language processing examples around NLTK, a library maintained by the book’s authors. The second toolkit is the Stanford NLP tagger (Java). Conveniently, the two use a similar set of tags, so their output can be compared directly.
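
Throughout the post only the NLTK invocations are shown, so for completeness, here is a minimal sketch of one way the Stanford output can be produced from Python via NLTK’s wrapper. The jar and model paths are placeholders for a locally downloaded Stanford distribution, Java must be on the PATH, and older or newer NLTK releases may expose the wrapper under a slightly different name:

import nltk
from nltk.tag import StanfordPOSTagger  # available in NLTK 3.x

text = "The quick brown fox jumps over the lazy dog."
tokens = nltk.word_tokenize(text)  # may need a one-time nltk.download('punkt')

# NLTK's built-in tagger returns a list of (token, tag) tuples.
print(nltk.pos_tag(tokens))

# The Stanford tagger, driven through NLTK's wrapper; requires Java and a
# locally downloaded Stanford POS tagger distribution (paths are placeholders).
stanford = StanfordPOSTagger(
    "/path/to/models/english-bidirectional-distsim.tagger",
    "/path/to/stanford-postagger.jar")
print(stanford.tag(tokens))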

For the first example, we’ll take a simple sentence and compare the output of the two products. In this case, you can see the formatting is quite different, but the tags are the same.

nltk.pos_tag(nltk.word_tokenize(
"""This Court has jurisdiction to 
consider the merits of the case."""))

[('This', 'DT'),
 ('Court', 'NNP'),
 ('has', 'VBZ'),
 ('jurisdiction', 'NN'),
 ('to', 'TO'),
 ('consider', 'VB'),
 ('the', 'DT'),
 ('merits', 'NNS'),
 ('of', 'IN'),
 ('the', 'DT'),
 ('case', 'NN'),
 ('.', '.')]

This_DT Court_NNP has_VBZ jurisdiction_NN 
to_TO consider_VB the_DT merits_NNS of_IN 
the_DT case_NN ._. 
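
The two formats are easy to inter-convert, which makes comparing the taggers less painful. A minimal sketch turning NLTK’s tuple list into the Stanford-style word_tag stream:

import nltk

# Render NLTK's [(token, tag), ...] output in Stanford's "token_tag" style.
def to_word_tag(pairs):
    return " ".join("%s_%s" % (token, tag) for token, tag in pairs)

pairs = nltk.pos_tag(nltk.word_tokenize(
    "This Court has jurisdiction to consider the merits of the case."))
print(to_word_tag(pairs))
# This_DT Court_NNP has_VBZ jurisdiction_NN to_TO consider_VB ...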

For reference, there are quite a few possible tags in a POS tagger, far more than the parts of speech you learn in high school English class – the finer distinctions help later processing stages produce more accurate results. Here are examples, from the Penn Treebank documentation (a snippet after the list shows how to look these up from within NLTK):

CC - Coordinating conjunction
CD - Cardinal number
DT - Determiner
EX - Existential there
FW - Foreign word
IN - Preposition or subordinating conjunction
JJ - Adjective
JJR - Adjective, comparative
JJS - Adjective, superlative
LS - List item marker
MD - Modal
NN - Noun, singular or mass
NNS - Noun, plural
NNP - Proper noun, singular
NNPS - Proper noun, plural
PDT - Predeterminer
POS - Possessive ending
PRP - Personal pronoun
PRP$ - Possessive pronoun (prolog version PRP-S)
RB - Adverb
RBR - Adverb, comparative
RBS - Adverb, superlative
RP - Particle
SYM - Symbol
TO - to
UH - Interjection
VB - Verb, base form
VBD - Verb, past tense
VBG - Verb, gerund or present participle
VBN - Verb, past participle
VBP - Verb, non-3rd person singular present
VBZ - Verb, 3rd person singular present
WDT - Wh-determiner
WP - Wh-pronoun
WP$ - Possessive wh-pronoun (prolog version WP-S)
WRB - Wh-adverb
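
NLTK ships these definitions as a data package, so you can look them up without leaving the interpreter (a minimal sketch; the tagset documentation has to be downloaded once):

import nltk

nltk.download("tagsets")  # one-time download of the tagset documentation

# Definition and examples for a single tag...
nltk.help.upenn_tagset("VBZ")
# ...or for every tag matching a regular expression.
nltk.help.upenn_tagset("NN.*")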

The following are some more involved examples, rendered side by side. I’ve edited the output to facilitate comparison. NLTK is on the left; Stanford NLP on the right.
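
If you would rather not munge the output by hand, something like the following sketch produces a comparable two-column layout, provided both taggers happened to tokenize the text identically (the column width is an arbitrary choice):

# Print two tagged sequences in two columns, one token per line.
# Only meaningful when both taggers produced the same tokenization.
def side_by_side(nltk_pairs, stanford_pairs, width=25):
    for (tok_a, tag_a), (tok_b, tag_b) in zip(nltk_pairs, stanford_pairs):
        print(("%s: %s" % (tok_a, tag_a)).ljust(width) + "%s: %s" % (tok_b, tag_b))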

For the first side-by-side example, I’ve chosen a sonnet, for which the two tools produce surprisingly similar taggings.

nltk.pos_tag(nltk.word_tokenize("""
Can my love excuse the slow offence,
Of my dull bearer, when from thee I speed,
From where thou art, why should I haste me thence?
Till I return of posting is no need.
"""))
Can: NNP Can: MD
my: PRP$ my: PRP$
love: NN love: NN
excuse: NN excuse: NN
the: DT the: DT
slow: JJ slow: JJ
offence: NN offence: NN
,: , ,: ,
Of: IN Of: IN
my: PRP$ my: PRP$
dull: NN dull: JJ
bearer: NN bearer: NN
,: , ,: ,
when: WRB when: WRB
from: IN from: IN
thee: NN thee: NN
I: PRP I: PRP
speed: VBP speed: VBP
,: , ,: ,
From: NNP From: IN
where: WRB where: WRB
thou: PRP thou: JJ
art: VBP art: NN
,: , ,: ,
why: WRB why: WRB
should: MD should: MD
I: PRP I: PRP
haste: VB haste: NN
me: PRP me: PRP
thence: NN thence: VB
?: . ?: .
Till: NNP Till: IN
I: PRP I: PRP
return: VBP return: VBP
of: IN of: IN
posting: VBG posting: VBG
is: VBZ is: VBZ
no: DT no: DT
need: NN need: NN
.: . .: .

For a second example, I’ve chosen a very wordy sentence from a recent Supreme Court case. This type of text is interesting because it is exactly the sort of thing one might want to analyze, and it contains entity names. Even for such an involved sentence, there is very little deviation – it makes me wonder whether these aren’t two versions of the same code/model.

nltk.pos_tag(nltk.word_tokenize("""Roy Koontz, Sr., whose estate
is represented here by petitioner, sought permits to develop a 
section of his property from respondent St. Johns River Water 
Management District (District), which, consistent with Florida 
law, requires permit applicants wishing to build on wetlands to 
offset the resulting environmental damage."""))
Roy: NNP Roy: NNP
Koontz: NNP Koontz: NNP
,: , ,: ,
Sr.: NNP Sr.: NNP
,: , ,: ,
whose: WP$ whose: WP$
estate: NN estate: NN
is: VBZ is: VBZ
represented: VBN represented: VBN
here: RB here: RB
by: IN by: IN
petitioner: NN petitioner: NN
,: , ,: ,
sought: VBD sought: VBD
permits: NNS permits: NNS
to: TO to: TO
develop: VB develop: VB
a: DT a: DT
section: NN section: NN
of: IN of: IN
his: PRP$ his: PRP$
property: NN property: NN
from: IN from: IN
respondent: NN respondent: NN
St.: NNP St.: NNP
Johns: NNP Johns: NNP
River: NNP River: NNP
Water: NNP Water: NNP
Management: NNP Management: NNP
District: NNP District: NNP
(: NNP -LRB-: -LRB-
District: NNP District: NNP
): NNP -RRB-: -RRB-
,: , ,: ,
which: WDT which: WDT
,: , ,: ,
consistent: VBD consistent: JJ
with: IN with: IN
Florida: NNP Florida: NNP
law: NN law: NN
,: , ,: ,
requires: VBZ requires: VBZ
permit: NN permit: NN
applicants: NNS applicants: NNS
wishing: VBG wishing: VBG
to: TO to: TO
build: VB build: VB
on: IN on: IN
wetlands: NNS wetlands: NNS
to: TO to: TO
offset: VB offset: VB
the: DT the: DT
resulting: VBG resulting: VBG
environmental: JJ environmental: JJ
damage: NN damage: NN
.: . .: .
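
One way to put a number on “very little deviation” is to count token-level agreement between the two columns (a minimal sketch; it assumes both taggers produced the same tokens, which holds for this sentence):

# Fraction of tokens for which the two taggers chose the same tag.
def agreement(nltk_pairs, stanford_pairs):
    matches = sum(1 for (_, a), (_, b) in zip(nltk_pairs, stanford_pairs) if a == b)
    return float(matches) / len(nltk_pairs)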

Now, for a really interesting example: gibberish made to look like English. For this test, we’re going to use Lewis Carroll’s Jabberwocky:

nltk.pos_tag(nltk.word_tokenize("""
Twas bryllyg, and ye slythy toves
Did gyre and gymble in ye wabe:
All mimsy were ye borogoves;
And ye mome raths outgrabe.
"""))

At last, we have something where the output varies. One obvious lesson from this is that these algorithms are more than happy to guess in order to improve accuracy, even where they have no idea what’s going on, a strategy similar to guessing on multiple-choice tests. It may be prudent to develop a class of algorithms which lose points for consistently guessing wildly incorrectly (similar to the scoring method used on the SATs).
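
As a sketch of what such a scoring rule could look like (purely an illustration of the idea, not an established metric): reward a correct tag, dock a fraction of a point for a wrong guess, and let the tagger abstain at no cost.

# SAT-style scoring: +1 for a correct tag, -0.25 for a wrong guess,
# 0 when the tagger abstains (signalled here by a None tag).
def penalized_score(predicted, gold, wrong_penalty=0.25):
    score = 0.0
    for tag, truth in zip(predicted, gold):
        if tag is None:  # the tagger declined to guess
            continue
        score += 1.0 if tag == truth else -wrong_penalty
    return score / len(gold)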

Twas: NNP Twas: NNP
bryllyg: NN bryllyg: NN
,: , ,: ,
and: CC and: CC
ye: VB ye: NN
slythy: JJ slythy: NN
toves: NNS toves: VBZ
Did: NNP Did: VBD
gyre: NN gyre: NN
and: CC and: CC
gymble: JJ gymble: NN
in: IN in: IN
ye: NN ye: JJ
wabe: NN wabe: NN
:: : :: :
All: DT All: DT
mimsy: NN mimsy: NN
were: VBD were: VBD
ye: NN ye: JJ
borogoves: NNS borogoves: NNS
;: : ;: :
And: CC And: CC
ye: NN ye: VB
mome: NN mome: FW
raths: NNS raths: FW
outgrabe: VBP outgrabe: FW
.: . .: .

For a final sample, we have a commonly cited section of Winnie the Pooh: while completely decipherable as English, it’s excruciatingly long. It may be worth noting that while this is verbose for modern tastes, many legal documents are written in the form of a single long sentence, separated by conjunctions (whereas a, whereas b, …) – this also bears strong resemblance to the writings of Victor Hugo:


In after-years [Piglet] liked to think that he had been in Very Great Danger during the Terrible Flood, but the only danger he had really been in was in the last half-hour of his imprisonment, when Owl, who had just flown up, sat on a branch of his tree to comfort him, and told him a very long story about an aunt who had once laid a seagull’s egg by mistake, and the story went on and on, rather like this sentence, until Piglet who was listening out of his window without much hope, went to sleep quietly and naturally, slipping slowly out of the window towards the water until he was only hanging on by his toes, at which moment luckily, a sudden loud squawk from Owl, which was really part of the story, being what his aunt said, woke Piglet up and just gave him time to jerk himself back into safety and say, “How interesting, and did she?” when-well, you can imagine his joy when at last he saw the good ship, The Brain of Pooh (Captain, C. Robin; 1st Mate, P. Bear) coming over the sea to rescue him.

It’s worth noting here that the two tools tokenize this passage slightly differently: Stanford splits the possessive off “seagull’s” and normalizes the square brackets to -LSB- and -RSB-, while NLTK handles the curly apostrophe less gracefully, leaving it fused into a single mis-encoded token.
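
The NLTK half of that difference is easy to reproduce by tokenizing the same phrase with an ASCII apostrophe and with the curly apostrophe that appears in the quoted text (a minimal sketch; the exact behaviour depends on the NLTK version):

# -*- coding: utf-8 -*-
import nltk

# With an ASCII apostrophe the possessive is split into its own token...
print(nltk.word_tokenize("a seagull's egg"))
# ['a', 'seagull', "'s", 'egg']

# ...but with the curly apostrophe from the source text, older NLTK versions
# leave "seagull's" fused, which is where the mis-encoded token below comes from.
print(nltk.word_tokenize(u"a seagull\u2019s egg"))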

In: IN In: IN
after-years: NNS after-years: JJ
[: : -LSB-: -LRB-
Piglet: NNP Piglet: NN
]: : -RSB-: -RRB-
liked: VBD liked: VBD
to: TO to: TO
think: VB think: VB
that: IN that: IN
he: PRP he: PRP
had: VBD had: VBD
been: VBN been: VBN
in: IN in: IN
Very: NNP Very: RB
Great: NNP Great: JJ
Danger: NNP Danger: NN
during: IN during: IN
the: DT the: DT
Terrible: NNP Terrible: JJ
Flood: NNP Flood: NNP
,: , ,: ,
but: CC but: CC
the: DT the: DT
only: JJ only: JJ
danger: NN danger: NN
he: PRP he: PRP
had: VBD had: VBD
really: RB really: RB
been: VBN been: VBN
in: IN in: IN
was: VBD was: VBD
in: IN in: IN
the: DT the: DT
last: JJ last: JJ
half-hour: JJ half-hour: NN
of: IN of: IN
his: PRP$ his: PRP$
imprisonment: NN imprisonment: NN
,: , ,: ,
when: WRB when: WRB
Owl: NNP Owl: NN
,: , ,: ,
who: WP who: WP
had: VBD had: VBD
just: RB just: RB
flown: VBN flown: VBN
up: RP up: RP
,: , ,: ,
sat: JJ sat: VBD
on: IN on: IN
a: DT a: DT
branch: NN branch: NN
of: IN of: IN
his: PRP$ his: PRP$
tree: NN tree: NN
to: TO to: TO
comfort: VB comfort: NN
him: PRP him: PRP
,: , ,: ,
and: CC and: CC
told: VBD told: VBD
him: PRP him: PRP
a: DT a: DT
very: RB very: RB
long: JJ long: JJ
story: NN story: NN
about: IN about: IN
an: DT an: DT
aunt: NN aunt: NN
who: WP who: WP
had: VBD had: VBD
once: RB once: RB
laid: VBN laid: VBN
a: DT a: DT
seagull\xe2\x80\x99s: JJ seagull: NN
egg: NN s: POS
egg: NN
by: IN by: IN
mistake: NN mistake: NN
,: , ,: ,
and: CC and: CC
the: DT the: DT
story: NN story: NN
went: VBD went: VBD
on: IN on: IN
and: CC and: CC
on: IN on: IN
,: , ,: ,
rather: RB rather: RB
like: IN like: IN
this: DT this: DT
sentence: NN sentence: NN
,: , ,: ,
until: IN until: IN
Piglet: NNP Piglet: NNP
who: WP who: WP
was: VBD was: VBD
listening: VBG listening: VBG
out: RP out: IN
of: IN of: IN
his: PRP$ his: PRP$
window: NN window: NN
without: IN without: IN
much: JJ much: JJ
hope: NN hope: NN
,: , ,: ,
went: VBD went: VBD
to: TO to: TO
sleep: VB sleep: VB
quietly: RB quietly: RB
and: CC and: CC
naturally: RB naturally: RB
,: , ,: ,
slipping: VBG slipping: VBG
slowly: RB slowly: RB
out: IN out: IN
of: IN of: IN
the: DT the: DT
window: NN window: NN
towards: NNS towards: IN
the: DT the: DT
water: NN water: NN
until: IN until: IN
he: PRP he: PRP
was: VBD was: VBD
only: RB only: RB
hanging: VBG hanging: VBG
on: IN on: IN
by: IN by: IN
his: PRP$ his: PRP$
toes: NNS toes: NNS
,: , ,: ,
at: IN at: IN
which: WDT which: WDT
moment: NN moment: NN
luckily: RB luckily: RB
,: , ,: ,
a: DT a: DT
sudden: JJ sudden: JJ
loud: NN loud: JJ
squawk: NN squawk: NN
from: IN from: IN
Owl: NNP Owl: NN
,: , ,: ,
which: WDT which: WDT
was: VBD was: VBD
really: RB really: RB
part: NN part: NN
of: IN of: IN
the: DT the: DT
story: NN story: NN
,: , ,: ,
being: VBG being: VBG
what: WP what: WP
his: PRP$ his: PRP$
aunt: NN aunt: NN
said: VBD said: VBD
,: , ,: ,
woke: NN woke: VBD
Piglet: NNP Piglet: NNP
up: IN up: IN
and: CC and: CC
just: RB just: RB
gave: VBD gave: VBD
him: PRP him: PRP
time: NN time: NN
to: TO to: TO
jerk: VB jerk: VB
himself: PRP himself: PRP
back: RB back: RB
into: IN into: IN
safety: NN safety: NN
and: CC and: CC
say: VB say: VB
,: , ,: ,
“: “ “: “
How: WRB How: WRB
interesting: JJ interesting: JJ
,: , ,: ,
and: CC and: CC
did: VBD did: VBD
she: PRP she: PRP
?: . ?: .
”: ” ‘: ”
when-well: NNP when-well: NN
,: , ,: ,
you: PRP you: PRP
can: MD can: MD
imagine: VB imagine: VB
his: PRP$ his: PRP$
joy: NN joy: NN
when: WRB when: WRB
at: IN at: IN
last: JJ last: JJ
he: PRP he: PRP
saw: VBD saw: VBD
the: DT the: DT
good: JJ good: JJ
ship: NN ship: NN
,: , ,: ,
The: NNP The: DT
Brain: NNP Brain: NN
of: IN of: IN
Pooh: NNP Pooh: NNP
(: NNP -LRB-: -LRB-
Captain: NNP Captain: NNP
,: , ,: ,
C.: NNP C.: NNP
Robin: NNP Robin: NNP
;: : ;: :
1st: CD 1st: CD
Mate: NNP Mate: NN
,: , ,: ,
P.: NNP P.: NNP
Bear: NNP Bear: NNP
): NNP -RRB-: -RRB-
coming: VBG coming: VBG
over: IN over: IN
the: DT the: DT
sea: NN sea: NN
to: TO to: TO
rescue: VB rescue: VB
him: PRP him: PRP
.: . .: .

2 Replies to “Part of Speech Tagging: NLTK vs Stanford NLP”

  1. On wild guessing — actually, many taggers also calculate probabilities of the tags, which describe their confidence for each tag. I’m sure there’s an option somehow to get the Stanford tagger and/or the NLTK tagger to output them.

    (I don’t know if this will be useful for you, but our Twitter POS tagger outputs confidence scores by default: http://www.ark.cs.cmu.edu/TweetNLP/)

    If you only want tags where the tagger is likely to be right, simply keep the tags with at least 95% or 98% confidence, or whatever threshold you want.
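
    For instance (a minimal sketch, assuming a hypothetical tagger that returns (token, tag, confidence) triples; neither library is guaranteed to expose exactly this interface):

    # Keep a tag only when the tagger was confident about it; everything else becomes None.
    def confident_tags(triples, threshold=0.95):
        return [(token, tag if confidence >= threshold else None)
                for token, tag, confidence in triples]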
