Finding people, places, and things with Named Entity Recognition

One thing that's fairly easy to pull out of documents is named entities. These include things such as people's names, organizations, locations, and dates. The task of identifying them is called named entity recognition (NER), and while NER systems are not perfect, they're generally pretty good. Error rates under 0.1 are normal.

The OpenNLP library has classes to perform NER, and depending on what you train them with, they will identify people, locations, dates, or a number of other things. The clojure-opennlp library also exposes these classes in a nice, Clojure-friendly way.

Getting ready

We'll continue building on the previous recipes in this chapter. Because of this, we'll use the same project.clj file:

(defproject com.ericrochester/text-data "0.1.0-SNAPSHOT"
  :dependencies [[org.clojure/clojure "1.6.0"]
                 [clojure-opennlp "0.3.2"]])

From the Tokenizing text recipe, we'll use tokenize, and from the Focusing on content words with stoplists recipe, we'll use normalize.
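If you're starting from a fresh namespace, the setup might look something like the following sketch. The namespace name is hypothetical, and the tokenizer model path assumes you downloaded en-token.bin alongside the NER models; adjust both to match your project:

(ns com.ericrochester.text-data.ner
  (:require [opennlp.nlp :as nlp]))

;; From the Tokenizing text recipe. The model path is an assumption;
;; point it at wherever you saved the tokenizer model.
(def tokenize (nlp/make-tokenizer "models/en-token.bin"))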

Pretrained models can be downloaded from http://opennlp.sourceforge.net/models-1.5/. I downloaded en-ner-person.bin, en-ner-organization.bin, en-ner-date.bin, en-ner-location.bin, and en-ner-money.bin. Then, I saved these models in models/.

How to do it…

To set things up, we have to load the models and bind them to function names. To load the models, we'll use the opennlp.nlp/make-name-finder function. We can use this to load each recognizer individually, as follows:

(def get-persons
  (nlp/make-name-finder "models/en-ner-person.bin"))
(def get-orgs
  (nlp/make-name-finder "models/en-ner-organization.bin"))
(def get-date
  (nlp/make-name-finder "models/en-ner-date.bin"))
(def get-location
  (nlp/make-name-finder "models/en-ner-location.bin"))
(def get-money
  (nlp/make-name-finder "models/en-ner-money.bin"))

Now, in order to test this out, let's load the latest SOTU address in our corpus. This is Barack Obama's 2013 State of the Union:

(def sotu (tokenize (slurp "sotu/2013-0.txt")))

We can call each of these functions on the tokenized text to see the results, as shown here:

user=> (get-persons sotu)
("John F. Kennedy" "Most Americans—Democrats" "Government" "John McCain" "Joe Lieberman" "So" "Tonight I" "Joe Biden" "Joe" "Tonight" "Al Qaida" "Russia" "And" "Michelle" "Hadiya Pendleton" "Gabby Giffords" "Menchu Sanchez" "Desiline Victor" "Brian Murphy" "Brian")
user=> (get-orgs sotu)
("Congress" "Union" "Nation" "America" "Tax" "Apple" "Department of Defense and Energy" "CEOs" "Siemens America—a" "New York Public Schools" "City University of New York" "IBM" "American" "Higher Education" "Federal" "Senate" "House" "CEO" "European Union" "It")
user=> (get-date sotu)
("this year" "18 months ago" "Last year" "Today" "last 15" "2007" "today" "tomorrow" "20 years ago" "This" "last year" "This spring" "next year" "2014" "next two decades" "next month" "a")
user=> (get-location sotu)
("Washington" "United States of America" "Earth" "Japan" "Mexico" "America" "Youngstown" "Ohio" "China" "North Carolina" "Georgia" "Oklahoma" "Germany" "Brooklyn" "Afghanistan" "Arabian Peninsula" "Africa" "Libya" "Mali" "And" "North Korea" "Iran" "Russia" "Asia" "Atlantic" "United States" "Rangoon" "Burma" "the Americas" "Europe" "Middle East" "Egypt" "Israel" "Chicago" "Oak Creek" "New York City" "Miami" "Wisconsin")
user=> (get-money sotu)
("$ 2.5 trillion" "$ 4 trillion" "$ 140 to" "$")

How it works…

When you glance at the results, it appears to have performed well. Of course, to be certain, we'd need to go back through the document and see what it missed.

The process for using this is similar to the tokenizer or the sentence chunker: load the model from a file and then call the result as a function.
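Since each finder is just a function over a token sequence, it's easy to run all of them at once. Here's a small sketch that collects every entity type into a single map; the names finders and get-entities are helpers introduced here for illustration:

;; Map each entity type to its finder function (defined earlier).
(def finders
  {:person   get-persons
   :org      get-orgs
   :date     get-date
   :location get-location
   :money    get-money})

;; Run every finder over the tokens, keyed by entity type.
(defn get-entities [tokens]
  (into {}
        (map (fn [[entity-type finder]]
               [entity-type (finder tokens)])
             finders)))

Calling (get-entities sotu) would then return one map with a seq of matches under each key, which is convenient when you want to report all the entity types for a document together.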
