A stoplist, or stopword list, is a list of words that should be excluded from further analysis. Usually, this is because they're so common that they don't add much information to the analysis.
These lists are usually dominated by what are known as function words—words that have a grammatical purpose in the sentence but which do not carry much meaning themselves. For example, the indicates that the noun that follows is definite, but it does not have a meaning by itself. Other function words, such as the preposition after, do have a meaning, but they are so common that they tend to get in the way.
On the other hand, chair has a meaning beyond what it's doing in the sentence, and in fact, its role in the sentence will vary (subject, direct object, and so on).
You won't always want to use a stoplist, since removing stopwords throws away information. However, because function words are more frequent than content words, focusing on the content words can sometimes add clarity to your analysis and its output. Removing stopwords can also speed up processing.
This recipe will build on the work that we've done so far in this chapter. As such, it will use the same project.clj file that we used in the Tokenizing text and Finding sentences recipes:

(defproject com.ericrochester/text-data "0.1.0-SNAPSHOT"
  :dependencies [[org.clojure/clojure "1.6.0"]
                 [clojure-opennlp "0.3.2"]])
However, we'll use a slightly different set of requirements for this recipe:
(require '[opennlp.nlp :as nlp]
         '[clojure.java.io :as io])
We'll also need to have a list of stopwords. You can easily create your own list, but for the purpose of this recipe, we'll use the English stopword list included with the Natural Language Toolkit (http://www.nltk.org/). You can download this from http://nltk.github.com/nltk_data/packages/corpora/stopwords.zip. Unzip it into your project directory and make sure that the stopwords/english file exists.
We'll also use the tokenize and get-sentences functions that we created in the previous two recipes.
We'll need to create a function to process and normalize the tokens, as well as a utility function to load the stopword list. Once these are in place, we'll see how to use the stopwords. To do this, perform the following steps:
First, we'll create a normalize function to handle the lowercasing of each token:

(defn normalize [token-seq]
  (map #(.toLowerCase %) token-seq))
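As a quick sanity check, normalize simply lowercases every token in a sequence. The snippet below repeats the definition so that it is self-contained, and the sample token sequence is made up for illustration:

```clojure
;; normalize lowercases each token in a sequence of strings.
(defn normalize [token-seq]
  (map #(.toLowerCase %) token-seq))

;; A made-up token sequence for illustration:
(normalize ["I" "Never" "Saw" "A" "Purple" "Cow" "."])
;; => ("i" "never" "saw" "a" "purple" "cow" ".")
```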
Next, the load-stopwords function will read in the file, break it into lines, and fold them into a set, as follows:

(defn load-stopwords [filename]
  (with-open [r (io/reader filename)]
    (set (doall (line-seq r)))))

(def is-stopword (load-stopwords "stopwords/english"))
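The reason a set works as the is-stopword predicate is that Clojure sets implement the function interface: calling a set with an argument returns the element if it's a member, or nil otherwise. The sketch below uses a tiny made-up set as a stand-in for the NLTK list:

```clojure
;; A tiny in-memory stand-in for the NLTK stopword set:
(def is-stopword #{"the" "a" "of"})

;; Calling the set returns the element (truthy) when present,
;; or nil (falsy) when absent:
(is-stopword "the")   ;; => "the"
(is-stopword "cow")   ;; => nil

;; This is why the set can be passed directly to remove:
(remove is-stopword ["the" "purple" "cow"])
;; => ("purple" "cow")
```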
(def tokens
  (map #(remove is-stopword (normalize (tokenize %)))
       (get-sentences "I never saw a Purple Cow. I never hope to see one. But I can tell you, anyhow. I'd rather see than be one.")))
Now, you can see that the tokens returned are more focused on the content and are missing all of the function words:
user=> (pprint tokens)
(("never" "saw" "purple" "cow" ".")
 ("never" "hope" "see" "one" ".")
 ("tell" "," "anyhow" ".")
 ("'d" "rather" "see" "one" "."))