I’ve been exploring APIs for Named Entity Recognition (and other language processing / AI techniques) as part of a project to discover university lectures and historical speeches.
Named Entity Recognition is a collection of techniques used to label and classify “entities” mentioned in a piece of text – e.g. to list countries mentioned in a speaker’s bio, makes and models of cars mentioned in accident reports, and so on. The automated labeling process typically tags parts of speech, then trains a machine learning algorithm to recognize the desired classes of values from manually tagged texts. Entity recognition systems may also return a unique identifier (useful for companies, people, etc), and thus must attempt to disambiguate two entities with the same name, and unify references to the same entity under different names (e.g. Microsoft, MSFT).
Once trained, the algorithm is expected to recognize values it has never seen before – new people, companies, countries, and so on.
There are several commercial and open-source systems that implement portions of this functionality. Training such a system requires a large dataset, so I suspect that as time progresses, the commercial offerings will greatly outstrip the free ones.
The Stanford Named Entity Recognizer is promoted as being good at tagging people, organizations, and locations. I’ve investigated two commercial systems in detail: AlchemyAPI (IBM) and Open Calais (Thomson Reuters). Both have metered pricing.
AlchemyAPI
AlchemyAPI advertises that it can recognize several hundred entity types, which is likely the broadest coverage of any of these systems. In my experience, some of these types produce a lot of false positives, but I imagine this will improve with time. AlchemyAPI also lets you upload your own data and train new entity types, although this is quite expensive. One nice feature is that, when it can identify an entity, it links to existing open data systems (DBpedia and Freebase in the example below) rather than to its own identifiers.
{
"type": "Facility",
"relevance": "0.817492",
"count": "12",
"text": "Alban Berg Quartet"
},
{
"type": "Organization",
"relevance": "0.350804",
"count": "1",
"text": "Cavatina Chamber Music Trust"
},
{
"type": "Person",
"relevance": "0.344306",
"count": "2",
"text": "Emma Parker",
"disambiguated": {
"name": "Emma Parker",
"dbpedia": "http://dbpedia.org/resource/Emma_Parker",
"freebase": "http://rdf.freebase.com/ns/m.0dln2wr"
}
},
{
"type": "JobTitle",
"relevance": "0.326067",
"count": "4",
"text": "Ernest Bloch Lecturer"
}
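As a rough sketch of how a response like this might be consumed, the Python below filters entities by relevance and collects the open-data links where a "disambiguated" block is present. The field names come from the excerpt above; the enclosing list, the helper names, and the 0.3 threshold are assumptions for illustration, not part of AlchemyAPI's documented interface. Note that "relevance" and "count" arrive as strings and need converting.

```python
import json

# Part of the excerpt above, wrapped in a list (the wrapper is an assumption).
SAMPLE = """
[
  {"type": "Facility", "relevance": "0.817492", "count": "12",
   "text": "Alban Berg Quartet"},
  {"type": "Person", "relevance": "0.344306", "count": "2",
   "text": "Emma Parker",
   "disambiguated": {
     "name": "Emma Parker",
     "dbpedia": "http://dbpedia.org/resource/Emma_Parker",
     "freebase": "http://rdf.freebase.com/ns/m.0dln2wr"}},
  {"type": "JobTitle", "relevance": "0.326067", "count": "4",
   "text": "Ernest Bloch Lecturer"}
]
"""

def relevant_entities(entities, min_relevance=0.3):
    """Keep entities whose relevance (a string in the response) meets the threshold."""
    return [e for e in entities if float(e["relevance"]) >= min_relevance]

def linked_data_ids(entity):
    """Return the open-data URLs for an entity, or an empty dict if it has none."""
    links = entity.get("disambiguated", {})
    return {source: url for source, url in links.items() if source != "name"}

entities = json.loads(SAMPLE)
for e in relevant_entities(entities):
    print(e["type"], e["text"], linked_data_ids(e))
```

Only the "Person" entry carries links here, which matches my experience that many entities come back without a disambiguated identifier.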
One thing that surprises me about entity recognition systems is that more of them don’t check against fixed lists of values. For instance, there are hundreds of “former countries and territorial entities”, but few of these were identified by AlchemyAPI in my testing.
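A minimal sketch of what such a fixed-list check could look like if you ran it yourself alongside the API results. The three names and the find_listed_names helper are illustrative assumptions; a real list would hold hundreds of entries and would need to handle alternate spellings.

```python
import re

# Hypothetical, hand-maintained list; a real one would be much longer.
FORMER_COUNTRIES = ["Prussia", "Ottoman Empire", "Yugoslavia"]

def find_listed_names(text, names):
    """Return (name, start_offset) pairs for each whole-word match of a listed name."""
    # Longest names first, so a multi-word name wins over a shorter overlapping one.
    ordered = sorted(names, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(n) for n in ordered) + r")\b",
        re.IGNORECASE,
    )
    # Map lowercased matches back to the spelling used in the list.
    canonical = {n.lower(): n for n in names}
    return [(canonical[m.group(0).lower()], m.start()) for m in pattern.finditer(text)]

sample = "The lecture compares Prussia with the late Ottoman Empire."
print(find_listed_names(sample, FORMER_COUNTRIES))
```

A lookup like this cannot find names it has never seen, which is the whole point of a trained recognizer, but it is cheap and predictable for closed sets.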
Open Calais
Open Calais is run by Thomson Reuters. I found that it typically returns about twice as many entities as AlchemyAPI, but with fewer categories. It is a more REST-oriented API, in that it returns URLs for the items it finds, so you can use these as IDs. It also returns context clues, so at least you know where it found something (I imagine this might be useful in a search engine):
"_typeGroup": "entities",
"_type": "Position",
"forenduserdisplay": "false",
"name": "Professor of the History of Christianity and Leverhulme Major Research Fellow",
"_typeReference": "http://s.opencalais.com/1/type/em/e/Position",
"instances": [
{
"detection": "[ Professor in the History of Religion. He is also ]Professor of the History of Christianity and Leverhulme Major Research Fellow[ at Durham University. \nHis first series of]",
"prefix": " Professor in the History of Religion. He is also ",
"exact": "Professor of the History of Christianity and Leverhulme Major Research Fellow",
"suffix": " at Durham University. \nHis first series of",
"offset": 80,
"length": 77
}
],
"relevance": 0.2
On the other hand, Open Calais placed talks from Gresham College in Gresham, Oregon (the college is in the U.K.), which illustrates the peril of relying too heavily on these types of systems.
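The context fields lend themselves to a quick sanity check or a search-result snippet. The sketch below uses the instance shown above; the helper names are made up, and because I am only assuming that "offset" indexes into the originally submitted text, the code relies on "prefix", "exact", "suffix", and "length" alone.

```python
# One instance copied from the Open Calais response above.
instance = {
    "prefix": " Professor in the History of Religion. He is also ",
    "exact": "Professor of the History of Christianity and Leverhulme Major Research Fellow",
    "suffix": " at Durham University. \nHis first series of",
    "offset": 80,
    "length": 77,
}

def is_consistent(inst):
    """Check that the reported length matches the matched text."""
    return len(inst["exact"]) == inst["length"]

def snippet(inst, marker="**"):
    """Build a one-line context snippet with the match wrapped in a marker."""
    raw = f'{inst["prefix"]}{marker}{inst["exact"]}{marker}{inst["suffix"]}'
    # Collapse newlines and repeated spaces so the snippet fits on one line.
    return " ".join(raw.split())

print(is_consistent(instance))
print(snippet(instance))
```

Showing the snippet next to each tag also makes mistakes like the Gresham one much easier to spot when reviewing results by hand.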
Problem Areas
Finally, if you want to use these systems, it’s important to know where they break down.
- In tagging a lot of transcripts of historical speeches, AlchemyAPI tends to treat phrases like “Nazi Germany” or “Tsarist Russia” as the names of actual countries (if you were thinking it may have learned to detect propaganda, it did not detect “Axis of Evil” as a country). I don’t know of a way around this problem other than manually correcting or hiding bad results.
- The “Job Title” entity type is pretty slick, but it can’t tell if a “President” is a University President or a U.S. President. Depending on your needs, you may get more accurate information by getting someone’s name and scraping Wikipedia and LinkedIn instead.
- Occasionally AlchemyAPI detects a list and treats it as an entity. E.g. “President and Vice President”.
- AlchemyAPI has a really neat feature: it can tag health conditions. Generally this works great (see here), but unfortunately it does pick up “The Great Depression” as the medical condition “depression”.
- AlchemyAPI has an additional API for “relationship extraction”, which is supposed to tell you how entities relate to each other. However, I have found that it returns a lot of noise, and it can’t tell you whether a text is “about” an entity or just mentions it (a hard problem, to be sure).
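Since manually hiding bad results seems to be the practical fix for most of these, one option is a small post-processing filter. The sketch below is only an illustration: the entity dicts mimic the AlchemyAPI excerpt earlier, while the "Country" and "HealthCondition" type labels, the exact entity text, and both rules are assumptions that would need to match what the API actually returns for your data.

```python
# Hand-curated (type, lowercased text) pairs to hide, based on the problems above.
BLOCKED = {
    ("Country", "nazi germany"),
    ("Country", "tsarist russia"),
    ("HealthCondition", "the great depression"),
}

def looks_like_list(text):
    """Heuristic: an 'and' inside the phrase suggests a list rather than one entity."""
    return " and " in text.lower()

def keep_entity(entity):
    """Return False for entities that match a known false-positive rule."""
    key = (entity["type"], entity["text"].strip().lower())
    if key in BLOCKED:
        return False
    if entity["type"] == "JobTitle" and looks_like_list(entity["text"]):
        return False
    return True

# Made-up results in the shape of the AlchemyAPI excerpt.
results = [
    {"type": "Country", "text": "Nazi Germany"},
    {"type": "Country", "text": "France"},
    {"type": "HealthCondition", "text": "The Great Depression"},
    {"type": "JobTitle", "text": "President and Vice President"},
    {"type": "JobTitle", "text": "Ernest Bloch Lecturer"},
]

kept = [e for e in results if keep_entity(e)]
print([e["text"] for e in kept])
```

The list heuristic is deliberately crude and will also drop legitimate titles that contain “and”, so it is worth reviewing what it hides before trusting it.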
Through this project I’ve made a few tens of thousands of entity extraction calls, so if you have questions feel free to ask below.