The following sample extracts the text of a court case from a PDF and attempts to recognize names and locations using the entity recognition software from Stanford NLP. As the output shows, it is fairly good at finding nouns, but less reliable at identifying the type of each noun.
The entities I'd really like to see here are different (companies, law firms, lawyers, etc.), but this test is good enough. The bundled classifiers let you choose between three sets of recognizable types: {Location, Person, Organization}, {Location, Person, Organization, Misc}, and {Time, Location, Organization, Person, Money, Percent, Date}. Extracting the PDF text and classifying it takes about five seconds in total.
For this text, switching between these classifiers sometimes changed the type assigned to the same noun: one time it's a person, another time it's an organization, and so on. One improvement might be to run several classifiers and let them vote. The classifier also sometimes drops words: if a subject is listed with a first, middle, and last name, it may pick out just two of the words. I've noticed similar issues with company names.
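The voting idea can be sketched without touching the NER library at all. The snippet below is a minimal illustration, not part of the original program: it assumes each classifier has already produced one label per token over the same token sequence, and the names `LabelVoting`, `vote`, and `voteAll` are hypothetical. It relies on "O" being the label Stanford NER gives to tokens outside any entity. Note that a 3-class model can never vote for a type it does not know (such as Date), so mixed label sets will bias the result toward "O".

```scala
object LabelVoting {
  // Label Stanford NER assigns to tokens that are not part of any entity.
  val Outside = "O"

  /** Majority vote for a single token. Ties and empty input fall back to "O". */
  def vote(labels: Seq[String]): String =
    if (labels.isEmpty) Outside
    else {
      // Count how many classifiers chose each label.
      val counts = labels.groupBy(identity).map { case (label, votes) => label -> votes.size }
      val best = counts.values.max
      val winners = counts.collect { case (label, n) if n == best => label }.toSeq
      // Only accept a label that strictly wins; otherwise treat the token as untagged.
      if (winners.size == 1) winners.head else Outside
    }

  /** Votes position by position. Each inner Seq holds one classifier's labels. */
  def voteAll(perClassifier: Seq[Seq[String]]): Seq[String] = {
    require(perClassifier.nonEmpty, "need at least one classifier")
    require(perClassifier.map(_.size).distinct.size == 1,
      "all classifiers must label the same number of tokens")
    perClassifier.transpose.map(vote)
  }
}
```

With three classifiers, a call such as `LabelVoting.vote(Seq("PERSON", "PERSON", "ORGANIZATION"))` settles on the label two of them agree on. Falling back to "O" on a tie is only one possible policy; preferring the classifier you trust most would be another.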
import java.io.{FileInputStream, InputStream}
import org.apache.tika.parser.pdf._
import org.apache.tika.metadata._
import org.apache.tika.parser._
import org.xml.sax._
import edu.stanford.nlp.ie.crf.CRFClassifier
import edu.stanford.nlp.ling.CoreAnnotations

// Collects the text Tika extracts from the PDF; every other SAX callback is a no-op.
object pdfHandler extends ContentHandler {
  val contents: StringBuffer = new StringBuffer()
  def characters(ch: Array[Char], start: Int, length: Int): Unit = {
    // Append only the reported slice; the rest of the buffer may hold stale data.
    contents.append(ch, start, length)
  }
  def endDocument(): Unit = {}
  def endElement(uri: String, localName: String, qName: String): Unit = {}
  def endPrefixMapping(prefix: String): Unit = {}
  def ignorableWhitespace(ch: Array[Char], start: Int, length: Int): Unit = {}
  def processingInstruction(target: String, data: String): Unit = {}
  def setDocumentLocator(locator: Locator): Unit = {}
  def skippedEntity(name: String): Unit = {}
  def startDocument(): Unit = {}
  def startElement(uri: String, localName: String, qName: String, atts: Attributes): Unit = {}
  def startPrefixMapping(prefix: String, uri: String): Unit = {}
}

object pdf extends App {
  val file = """e:\data\11-1285_i4dk.pdf"""
  val pdf: PDFParser = new PDFParser()
  val stream: InputStream = new FileInputStream(file)
  val handler: ContentHandler = pdfHandler
  val metadata: Metadata = new Metadata()
  val context: ParseContext = new ParseContext()
  // Run the parser so the handler receives the text; without this call contents stays empty.
  try pdf.parse(stream, handler, metadata, context) finally stream.close()
  val contents: String = pdfHandler.contents.toString()

  // The three bundled models: 3-class, 4-class (adds Misc), and 7-class.
  val src = "stanford-ner-2013-04-04/classifiers/"
  val classifier1 = "english.all.3class.distsim.crf.ser.gz"
  val classifier2 = "english.conll.4class.distsim.crf.ser.gz"
  val classifier3 = "english.muc.7class.distsim.crf.ser.gz"
  val serializedClassifier = src + classifier1
  val classifier = CRFClassifier.getClassifierNoExceptions(serializedClassifier)
  val out = classifier.classify(contents)
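  var words = 0
  // Only fragments of this loop survive: the range after "i <-" and the code that
  // prints each token are missing. The surviving lines appear to reset the counter
  // and start a new output line once more than 10 words have been printed.
  for (i <- ...) {
    ...
    if (words > 10) {
      words = 0
      println(" ")
    }
  }
}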
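The bracketed format in the output below can be produced by tracking when the label changes from one token to the next. The following is a minimal sketch under stated assumptions, not the original printing code: `Token`, `TaggedOutput`, `render`, and `wordsPerLine` are hypothetical names, and the tokens are plain word/label pairs so the snippet runs on its own. In the real program the word and label would presumably come from each `CoreLabel` in `out`, via `word()` and `get(classOf[CoreAnnotations.AnswerAnnotation])`.

```scala
object TaggedOutput {
  /** One classified token: the word and its label ("O" means no entity). */
  final case class Token(word: String, label: String)

  /**
   * Wraps each run of consecutive tokens that share a non-"O" label in
   * [LABEL]...[/LABEL] and breaks the line after every `wordsPerLine` words.
   */
  def render(tokens: Seq[Token], wordsPerLine: Int = 10): String = {
    val sb = new StringBuilder
    var previous = "O" // label of the token printed last
    var words = 0      // words printed on the current line
    for (t <- tokens) {
      if (t.label != previous) {
        // Close the entity that just ended, then open the one that starts here.
        if (previous != "O") sb.append("[/").append(previous).append("]")
        if (t.label != "O") sb.append("[").append(t.label).append("]")
        previous = t.label
      }
      sb.append(t.word).append(" ")
      words += 1
      if (words >= wordsPerLine) {
        sb.append("\n")
        words = 0
      }
    }
    // Close an entity that runs to the end of the input.
    if (previous != "O") sb.append("[/").append(previous).append("]")
    sb.toString
  }
}
```

Because tags are emitted only when the label changes, a multi-word name such as "US Airways" ends up inside a single pair of brackets rather than being tagged word by word.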
11-1285 [ORGANIZATION]US Airways , Inc. [/ORGANIZATION]v. [PERSON]McCutchen [/PERSON]-LRB- 4\/16\/13 -RRB- 1 -LRB- Slip Opinion -RRB- OCTOBER TERM , 2012 Syllabus NOTE : Where it is feasible , a syllabus -LRB- headnote -RRB- will be released , as isbeing done in connection with this case , at the time the opinion is issued . The syllabus constitutes no part of the opinion of the Court but has beenprepared by the Reporter of Decisions for the convenience of the reader . See [LOCATION]United States [/LOCATION]v. [ORGANIZATION]Detroit Timber & Lumber Co. [/ORGANIZATION], 200 U. S. 321 , 337 . SUPREME COURT OF THE [ORGANIZATION]UNITED STATES Syllabus US AIRWAYS [/ORGANIZATION], INC. , IN ITS CAPACITY AS FIDUCIARY AND PLAN ADMINISTRATOR OF THE [LOCATION]US [/LOCATION]AIRWAYS , INC. . EMPLOYEE BENEFITS PLAN v. [PERSON]MCCUTCHEN [/PERSON]ET AL. . CERTIORARI TO THE [ORGANIZATION]UNITED STATES [/ORGANIZATION]COURT OF APPEALS FOR THE THIRD CIRCUIT No. 11 -- 1285 . Argued November 27 , 2012 -- Decided April 16 , 2013 The health benefits plan established by petitioner [ORGANIZATION]US Airways [/ORGANIZATION]paid $ 66,866 in medical expenses for injuries suffered by respondentMcCutchen , a [ORGANIZATION]US Airways [/ORGANIZATION]employee , in a car accident caused by athird party . The plan entitled [ORGANIZATION]US Airways [/ORGANIZATION]to reimbursement if [PERSON]McCutchen [/PERSON]