Entity recognition with Scala and Stanford NLP Named Entity Recognizer

The following sample extracts the text of a court case from a PDF with Apache Tika, then tries to recognize names and locations using the Named Entity Recognizer from Stanford NLP. As the sample output shows, it’s fairly good at finding proper nouns, but less reliable at identifying the type of each one.

The entities I’d actually like to see are different (companies, law firms, lawyers, etc.), but this test is good enough. The classifiers that ship with the software let you choose among different sets of recognized types: {Location, Person, Organization}, {Location, Person, Organization, Misc}, and {Time, Location, Organization, Person, Money, Percent, Date}. Extracting the PDF text and classifying it takes about five seconds.
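To make the choice between those sets concrete, here is a small sketch that maps each bundled model file to the types it recognizes and picks the models that cover what you need. The file names are the ones used in the program in this post; the upper-case label spellings and the helper name modelsCovering are my own assumptions for illustration, not part of Stanford NER.

```scala
object ClassifierChoice {
  // Model file name -> entity types it can emit.
  // Label spellings are assumed to be upper-case, as in the sample output.
  val labelSets: Map[String, Set[String]] = Map(
    "english.all.3class.distsim.crf.ser.gz" ->
      Set("LOCATION", "PERSON", "ORGANIZATION"),
    "english.conll.4class.distsim.crf.ser.gz" ->
      Set("LOCATION", "PERSON", "ORGANIZATION", "MISC"),
    "english.muc.7class.distsim.crf.ser.gz" ->
      Set("TIME", "LOCATION", "ORGANIZATION", "PERSON", "MONEY", "PERCENT", "DATE")
  )

  // Hypothetical helper: every model whose label set contains all wanted types,
  // returned in alphabetical order of file name.
  def modelsCovering(wanted: Set[String]): Seq[String] =
    labelSets.collect {
      case (model, labels) if wanted.subsetOf(labels) => model
    }.toSeq.sorted
}
```

For example, ClassifierChoice.modelsCovering(Set("PERSON", "MONEY")) leaves only the 7-class model, since it is the only one of the three that knows about money amounts.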

For this text, switching between these classifiers sometimes changed the type assigned to the same noun: one time it’s a person, another time it’s an organization, and so on. One improvement might be to run several classifiers and let them vote. The classifier also sometimes drops words: if a subject is listed with a first, middle, and last name, it may pick up only two of the three. I’ve noticed similar issues with company names.
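I haven’t built the voting idea, but a minimal sketch might look like this. It assumes every classifier has labelled the same token sequence (so the label lists line up one-to-one), and it breaks ties in favour of whichever classifier is listed first. LabelVote and vote are names I made up for this example. Note that a label only one model knows about (Misc or Date, say) will tend to be outvoted.

```scala
object LabelVote {
  // labelsPerClassifier(c)(t) is the label classifier c assigned to token t.
  // Returns one label per token, chosen by simple majority.
  def vote(labelsPerClassifier: Seq[Seq[String]]): Seq[String] = {
    require(labelsPerClassifier.nonEmpty, "need at least one classifier")
    require(labelsPerClassifier.map(_.size).distinct.size == 1,
      "all classifiers must label the same number of tokens")

    // transpose turns "labels per classifier" into "votes per token"
    labelsPerClassifier.transpose.map { votes =>
      val counts = votes.groupBy(identity).map { case (label, vs) => label -> vs.size }
      val best = counts.values.max
      // On a tie, the earliest classifier in the list wins
      votes.find(label => counts(label) == best).get
    }
  }
}
```

With three classifiers answering PERSON, ORGANIZATION, and PERSON for a token, the vote settles on PERSON.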

import org.apache.tika.parser.pdf._
import org.apache.tika.metadata._
import org.apache.tika.parser._
import java.io._
import org.xml.sax._
import edu.stanford.nlp.ie.crf.CRFClassifier
import edu.stanford.nlp.ling.CoreAnnotations

// Collects the text Tika emits through SAX callbacks; every other event is ignored
object pdfHandler extends ContentHandler {
  val contents: StringBuffer = new StringBuffer()

  // Append only the slice [start, start + length); the rest of ch is not part of this event
  def characters(ch: Array[Char], start: Int, length: Int): Unit = {
    contents.append(ch, start, length)
  }

  def endDocument(): Unit = {}

  def endElement(uri: String, localName: String, qName: String): Unit = {}

  def endPrefixMapping(prefix: String): Unit = {}

  def ignorableWhitespace(ch: Array[Char], start: Int, length: Int): Unit = {}

  def processingInstruction(target: String, data: String): Unit = {}

  def setDocumentLocator(locator: Locator): Unit = {}

  def skippedEntity(name: String): Unit = {}

  def startDocument(): Unit = {}

  def startElement(uri: String, localName: String, qName: String, atts: Attributes): Unit = {}

  def startPrefixMapping(prefix: String, uri: String): Unit = {}
}

object pdf extends App {
  val file = """e:\data\11-1285_i4dk.pdf"""

  val pdf: PDFParser = new PDFParser()

  val stream: InputStream = new FileInputStream(file)
  val handler: ContentHandler = pdfHandler
  val metadata: Metadata = new Metadata()
  val context: ParseContext = new ParseContext()

  // Close the stream even if parsing throws
  try {
    pdf.parse(stream, handler, metadata, context)
  } finally {
    stream.close()
  }

  val contents: String = pdfHandler.contents.toString()
  println(contents)

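  // Classifier models from the Stanford NER download; each recognizes a different set of types
  val src = "stanford-ner-2013-04-04/classifiers/"
  val classifier1 = "english.all.3class.distsim.crf.ser.gz"   // Location, Person, Organization
  val classifier2 = "english.conll.4class.distsim.crf.ser.gz" // Location, Person, Organization, Misc
  val classifier3 = "english.muc.7class.distsim.crf.ser.gz"   // Time, Location, Organization, Person, Money, Percent, Date

  // Swap in classifier2 or classifier3 to try a different set
  val serializedClassifier = src + classifier1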

  val classifier = CRFClassifier.getClassifierNoExceptions(serializedClassifier)
  val out = classifier.classify(contents)

  // Print every token, wrapping each run of same-class tokens in [CLASS] ... [/CLASS]
  var words = 0
  for (i <- 0 until out.size()) {
    val sentence = out.get(i)

    var oldWordClass = ""

    for (j <- 0 until sentence.size()) {
      val word = sentence.get(j)
      val wordClass = word.get(classOf[CoreAnnotations.AnswerAnnotation]) + ""

      // "O" marks a token that is not part of any entity
      val inEntity = !wordClass.equals("O") && !wordClass.equals("")
      val wasInEntity = !oldWordClass.equals("O") && !oldWordClass.equals("")

      if (!oldWordClass.equals(wordClass)) {
        if (wasInEntity) print("[/" + oldWordClass + "]")
        if (inEntity) print("[" + wordClass + "]")
      }

      oldWordClass = wordClass

      words = words + 1
      print(word)
      print(" ")

      // Break the output into short lines
      if (words > 10) {
        words = 0
        println(" ")
      }
    }

    // Close a tag left open by an entity that ends the sentence
    if (!oldWordClass.equals("O") && !oldWordClass.equals("")) {
      print("[/" + oldWordClass + "]")
    }
  }
}
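
The program above only prints tags inline. If you want the entities as data (to count them, or to compare classifiers), the same grouping of consecutive same-class tokens can be written as a pure function. This is a sketch under my own assumptions: it takes plain (word, label) pairs rather than Stanford’s token objects, and EntitySpans, Entity, and extract are illustrative names.

```scala
object EntitySpans {
  // One recognized entity: its class label and the tokens it covers
  case class Entity(label: String, tokens: Seq[String])

  // Groups consecutive tokens that share a label other than "O" into one Entity.
  def extract(tagged: Seq[(String, String)]): Seq[Entity] = {
    val start = (Vector.empty[Entity], Option.empty[Entity])
    val (done, open) = tagged.foldLeft(start) {
      case ((acc, current), (word, label)) =>
        current match {
          // Same label as the open entity: extend it
          case Some(e) if e.label == label =>
            (acc, Some(e.copy(tokens = e.tokens :+ word)))
          // Label changed: close the open entity (if any), maybe start a new one
          case _ =>
            val closed = acc ++ current.toList
            if (label == "O" || label.isEmpty) (closed, None)
            else (closed, Some(Entity(label, Seq(word))))
        }
    }
    // Do not lose an entity that runs to the end of the input
    done ++ open.toList
  }
}
```

Like the printing loop, this treats two adjacent entities of the same class as one span, so it inherits the same limitations.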

Sample output of the pdf program:

11-1285 [ORGANIZATION]US Airways , Inc. [/ORGANIZATION]v.  
[PERSON]McCutchen [/PERSON]-LRB- 4\/16\/13 -RRB- 1 -LRB-  
Slip Opinion -RRB- OCTOBER TERM ,  
2012 Syllabus NOTE : Where it  
is feasible , a syllabus -LRB-  
headnote -RRB- will be released ,  
as isbeing done in connection with  
this case , at the time  
the opinion is issued . The  
syllabus constitutes no part of the  
opinion of the Court but has  
beenprepared by the Reporter of Decisions  
for the convenience of the reader  
. See [LOCATION]United States [/LOCATION]v. [ORGANIZATION]Detroit  
Timber & Lumber Co. [/ORGANIZATION], 200  
U. S. 321 , 337 .  
SUPREME COURT OF THE [ORGANIZATION]UNITED STATES  
Syllabus US AIRWAYS [/ORGANIZATION], INC. ,  
IN ITS CAPACITY AS FIDUCIARY AND  
PLAN ADMINISTRATOR OF THE [LOCATION]US [/LOCATION]AIRWAYS  
, INC. . EMPLOYEE BENEFITS PLAN  
v. [PERSON]MCCUTCHEN [/PERSON]ET AL. . CERTIORARI  
TO THE [ORGANIZATION]UNITED STATES [/ORGANIZATION]COURT OF  
APPEALS FOR THE THIRD CIRCUIT No.  
11 -- 1285 . Argued November  
27 , 2012 -- Decided April  
16 , 2013 The health benefits  
plan established by petitioner [ORGANIZATION]US Airways  
[/ORGANIZATION]paid $ 66,866 in medical expenses  
for injuries suffered by respondentMcCutchen ,  
a [ORGANIZATION]US Airways [/ORGANIZATION]employee , in  
a car accident caused by athird  
party . The plan entitled [ORGANIZATION]US  
Airways [/ORGANIZATION]to reimbursement if 
[PERSON]McCutchen [/PERSON]