Words (tokens) aren't the only structures that we're interested in, however. Another useful grammatical structure is the sentence. In this recipe, we'll follow a process similar to the one in the previous recipe, Tokenizing text, to create a function that pulls sentences from a string, just as tokenize pulled tokens from one.
We'll need to include clojure-opennlp in our project.clj file:
(defproject com.ericrochester/text-data "0.1.0-SNAPSHOT"
  :dependencies [[org.clojure/clojure "1.6.0"]
                 [clojure-opennlp "0.3.2"]])
We will also need to require it into the current namespace:
(require '[opennlp.nlp :as nlp])
Finally, we'll download a model for a statistical sentence splitter. I downloaded en-sent.bin from http://opennlp.sourceforge.net/models-1.5/ and saved it as models/en-sent.bin.
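If you'd rather fetch the model from the REPL than from a browser, something like the following sketch should work. The download-file! helper is hypothetical (it isn't part of clojure-opennlp), and the full URL is an assumption formed by joining the directory above with the file name. If the server has moved or redirects, download the file by hand instead.

```clojure
(require '[clojure.java.io :as io])

;; Hypothetical helper: copy whatever lives at `url` into the file `dest`,
;; creating any missing parent directories first.
(defn download-file! [url dest]
  (let [target (io/file dest)]
    (io/make-parents target)
    (with-open [in  (io/input-stream url)
                out (io/output-stream target)]
      (io/copy in out))
    target))

;; Assumed URL: the models directory plus the file name en-sent.bin.
;; Wrapped in `comment` so that nothing is downloaded on load.
(comment
  (download-file! "http://opennlp.sourceforge.net/models-1.5/en-sent.bin"
                  "models/en-sent.bin"))
```

Evaluating the form inside comment by hand should leave the model where the rest of this recipe expects it.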
As in the Tokenizing text recipe, we will start by loading the sentence identification model data, as shown here:
(def get-sentences (nlp/make-sentence-detector "models/en-sent.bin"))
Now, we can call the resulting function to split a text into a series of sentences, as follows:
user=> (get-sentences "I never saw a Purple Cow. I never hope to see one. But I can tell you, anyhow. I'd rather see than be one.")
["I never saw a Purple Cow." "I never hope to see one." "But I can tell you, anyhow." "I'd rather see than be one."]
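To see why a statistical model is worth the extra download, it can help to compare it with a naive baseline. The following is only a sketch: naive-sentences is a hypothetical function, not part of clojure-opennlp, and it simply breaks the text at whitespace that follows a period, exclamation point, or question mark.

```clojure
(require '[clojure.string :as str])

;; Naive baseline (hypothetical, for comparison only): split wherever
;; whitespace follows sentence-final punctuation. The lookbehind keeps
;; the punctuation attached to the sentence it ends.
(defn naive-sentences [text]
  (str/split (str/trim text) #"(?<=[.!?])\s+"))

(naive-sentences "I never saw a Purple Cow. I never hope to see one.")
;; => ["I never saw a Purple Cow." "I never hope to see one."]

;; Abbreviations trip it up, because every period looks like a boundary.
(naive-sentences "Dr. Smith arrived. He was late.")
;; => ["Dr." "Smith arrived." "He was late."]
```

The rule works on simple text like the poem, but it breaks on abbreviations. A trained sentence detector is generally better placed to handle such cases, since it weighs the context around each candidate boundary instead of applying one fixed pattern.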