One of the main challenges in text mining is transforming unstructured written natural language into structured attribute-based instances. The process involves many steps as shown in the following image:
First, we extract text from the Internet, existing documents, or databases. At the end of this step, the text may still be in XML or some other proprietary format. The next step, therefore, is to extract the actual text only and segment it into parts of the document, for example, the title, headline, abstract, and body. The third step normalizes the text encoding to ensure the characters are represented the same way; for example, documents encoded as ASCII, ISO 8859-1, or Windows-1250 are converted to Unicode. Next, tokenization splits the document into individual words, while the following step removes frequent words that usually have low predictive power, for example, the, a, I, we, and so on.
The part-of-speech (POS) tagging and lemmatization step can be included to transform each token (that is, word) into its basic form, known as a lemma, by removing word endings and modifiers; for example, running becomes run, better becomes good, and so on. A simplified approach is stemming, which operates on a single word without any context of how the word is used and, therefore, cannot distinguish between words that have different meanings depending on the part of speech; for example, axes may be the plural of axe as well as of axis.
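To make the difference concrete, here is a minimal, purely illustrative suffix-stripping stemmer (not the Porter algorithm or any library implementation); note how it mangles forms that a dictionary-based lemmatizer would handle correctly:

```java
import java.util.List;

public class NaiveStemmer {
    // Strip a few common English suffixes; a real stemmer applies many
    // ordered rules, and a lemmatizer consults a dictionary instead.
    static String stem(String word) {
        String w = word.toLowerCase();
        for (String suffix : List.of("ing", "es", "s")) {
            if (w.endsWith(suffix) && w.length() > suffix.length() + 2) {
                return w.substring(0, w.length() - suffix.length());
            }
        }
        return w;
    }

    public static void main(String[] args) {
        System.out.println(stem("running")); // "runn", not the lemma "run"
        System.out.println(stem("axes"));    // "axe", whether axe or axis was meant
        System.out.println(stem("better"));  // "better", cannot map to "good"
    }
}
```
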
The last step transforms tokens into a feature space. Most often, the feature space uses the bag-of-words (BoW) representation. In this representation, a set of all the words appearing in the dataset is created, that is, a bag of words. Each document is then represented as a vector counting how many times each word appears in the document.
Consider the following example with two sentences:
The bag of words in this case consists of {Jacob, likes, table, tennis, Emma, too, also, basketball}, which has eight distinct words. The two sentences can now be represented as vectors over this list, where the value at each index indicates how many times the word at that index appears in the document, as follows:
[1, 2, 2, 2, 1, 0, 0, 0]
[1, 1, 0, 0, 0, 0, 1, 1]
Such vectors finally become instances for further learning.
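The vectorization itself is straightforward; here is a minimal sketch in plain Java (the sentences and helper names are illustrative, not part of Mallet):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class BagOfWords {
    // Build the vocabulary: every distinct word across all documents,
    // in first-seen order
    static List<String> buildVocab(List<List<String>> docs) {
        List<String> vocab = new ArrayList<>();
        for (List<String> doc : docs)
            for (String word : doc)
                if (!vocab.contains(word)) vocab.add(word);
        return vocab;
    }

    // Represent a document as a vector of word counts over the vocabulary
    static int[] vectorize(List<String> doc, List<String> vocab) {
        int[] vector = new int[vocab.size()];
        for (String word : doc)
            vector[vocab.indexOf(word)]++;
        return vector;
    }

    public static void main(String[] args) {
        List<List<String>> docs = List.of(
            List.of("jacob", "likes", "table", "tennis"),
            List.of("jacob", "likes", "basketball"));
        List<String> vocab = buildVocab(docs);
        System.out.println(vocab); // [jacob, likes, table, tennis, basketball]
        for (List<String> doc : docs)
            System.out.println(Arrays.toString(vectorize(doc, vocab)));
        // [1, 1, 1, 1, 0]
        // [1, 1, 0, 0, 1]
    }
}
```
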
Another very powerful representation is word2vec, introduced in 2013 by a team of researchers led by Tomas Mikolov at Google. Word2vec is a shallow neural network that learns distributed representations for words from the contexts they appear in (its two training architectures are known as continuous bag-of-words and skip-gram). An interesting property of this representation is that related words appear in clusters, such that some word relationships, such as analogies, can be reproduced using vector arithmetic. A famous example shows that king - man + woman returns queen.
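The analogy mechanism can be illustrated with hand-crafted toy vectors; real word2vec embeddings are learned from data and typically have hundreds of dimensions, so the three dimensions and all the values below are purely illustrative:

```java
import java.util.Map;

public class AnalogyDemo {
    // Cosine similarity between two vectors
    static double cosine(double[] a, double[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            na += a[i] * a[i];
            nb += b[i] * b[i];
        }
        return dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    // Answer "a is to b as c is to ?" via vector arithmetic: b - a + c,
    // then return the nearest word that is not one of the inputs
    static String analogy(Map<String, double[]> vecs, String a, String b, String c) {
        double[] va = vecs.get(a), vb = vecs.get(b), vc = vecs.get(c);
        double[] target = new double[va.length];
        for (int i = 0; i < va.length; i++) target[i] = vb[i] - va[i] + vc[i];

        String best = null;
        double bestSim = Double.NEGATIVE_INFINITY;
        for (Map.Entry<String, double[]> e : vecs.entrySet()) {
            if (e.getKey().equals(a) || e.getKey().equals(b) || e.getKey().equals(c)) continue;
            double sim = cosine(target, e.getValue());
            if (sim > bestSim) { bestSim = sim; best = e.getKey(); }
        }
        return best;
    }

    public static void main(String[] args) {
        // Toy dimensions: [royalty, male, female]
        Map<String, double[]> vecs = Map.of(
            "king",   new double[]{1, 1, 0},
            "queen",  new double[]{1, 0, 1},
            "man",    new double[]{0, 1, 0},
            "woman",  new double[]{0, 0, 1},
            "person", new double[]{0, 0.5, 0.5});
        System.out.println(analogy(vecs, "man", "king", "woman")); // queen
    }
}
```
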
Further details and implementation are available at the following link:
In this chapter, we will not look into how to scrape a set of documents from a website or extract them from a database. Instead, we will assume that we have already collected them as a set of documents and stored them in the .txt file format. Now let's look at two options for loading them. The first option addresses the situation where each document is stored in its own .txt file. The second option addresses the situation where all the documents are stored in a single file, one per line.
Mallet supports reading from a directory with the cc.mallet.pipe.iterator.FileIterator class. A file iterator is constructed with the following three parameters:

File[]: This is an array of directories containing the text files
FileFilter: This limits which files within those directories are read
Pattern: This specifies how to derive the class label from the file path, for example, FileIterator.LAST_DIRECTORY

Consider the data structured into folders as shown in the following image. We have documents organized into five topics by folders (tech, entertainment, politics, sport, and business). Each folder contains documents on the particular topic, as shown in the following image:
In this case, we initialize the iterator as follows:
FileIterator iterator = new FileIterator(
    new File[]{new File("path-to-my-dataset")},
    new TxtFilter(),
    FileIterator.LAST_DIRECTORY);
The first parameter specifies the path to our root folder, the second parameter limits the iterator to .txt files only, and the last parameter tells the iterator to use the last directory name in the path as the class label.
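The TxtFilter class used above is not part of Mallet; a minimal sketch, assuming it simply accepts files whose names end in .txt, might look as follows:

```java
import java.io.File;
import java.io.FileFilter;

// Accepts only files ending in .txt, so the iterator skips everything else
public class TxtFilter implements FileFilter {
    @Override
    public boolean accept(File file) {
        return file.getName().toLowerCase().endsWith(".txt");
    }
}
```
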
Another option to load the documents is through cc.mallet.pipe.iterator.CsvIterator.CsvIterator(Reader, Pattern, int, int, int), which assumes all the documents are in a single file and returns one instance per line, extracted by a regular expression. The class is initialized with the following components:

Reader: This is the object that specifies how to read from a file
Pattern: This is a regular expression that extracts three groups: data, target label, and document name
int, int, int: These are the indexes of the data, target, and name groups, as they appear in the regular expression

Consider a text document in the following format, specifying the document name, category, and content:
AP881218 local-news A 16-year-old student at a private Baptist...
AP880224 business The Bechtel Group Inc. offered in 1985 to...
AP881017 local-news A gunman took a 74-year-old woman hostage...
AP900117 entertainment Cupid has a new message for lovers this...
AP880405 politics The Reagan administration is weighing w...
To parse a line into three groups, we can use the following regular expression:
^(\S*)[\s,]*(\S*)[\s,]*(.*)$
There are three groups that appear in parentheses, (), where the third group contains the data, the second group contains the target class, and the first group contains the document ID. The iterator is initialized as follows:
CsvIterator iterator = new CsvIterator(fileReader,
    Pattern.compile("^(\\S*)[\\s,]*(\\S*)[\\s,]*(.*)$"), 3, 2, 1);
Here, the regular expression extracts three groups separated by whitespace; the last three arguments, 3, 2, 1, tell the iterator which group holds the data, the target, and the name, respectively.
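To see exactly what the pattern captures, here is a quick standalone check against one of the sample lines above (pure Java, no Mallet required; the parse helper is just for illustration):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexCheck {
    static final Pattern LINE = Pattern.compile("^(\\S*)[\\s,]*(\\S*)[\\s,]*(.*)$");

    // Returns {name, target, data} for a line, or null if it doesn't match
    static String[] parse(String line) {
        Matcher m = LINE.matcher(line);
        if (!m.matches()) return null;
        return new String[]{m.group(1), m.group(2), m.group(3)};
    }

    public static void main(String[] args) {
        String[] groups = parse("AP881218 local-news A 16-year-old student at a private Baptist...");
        System.out.println("name:   " + groups[0]); // AP881218
        System.out.println("target: " + groups[1]); // local-news
        System.out.println("data:   " + groups[2]); // A 16-year-old student...
    }
}
```
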
Now, let's move on to the data pre-processing pipeline.
Once we have initialized an iterator that will traverse the data, we need to pass the data through a sequence of transformations, as described at the beginning of this section. Mallet supports this process through a pipeline and a wide variety of steps that can be included in it, collected in the cc.mallet.pipe package. Some examples are as follows:
Input2CharSequence: This is a pipe that can read from various kinds of text sources (either a URI, File, or Reader) into a CharSequence
CharSequenceRemoveHTML: This pipe removes HTML from a CharSequence
MakeAmpersandXMLFriendly: This converts & to &amp; in tokens of a token sequence
TokenSequenceLowercase: This converts the text in each token of the token sequence in the data field to lowercase
TokenSequence2FeatureSequence: This converts the token sequence in the data field of each instance to a feature sequence
TokenSequenceNGrams: This converts the token sequence in the data field to a token sequence of ngrams, that is, combinations of two or more words

The full list of processing steps is available in the following Mallet documentation:
http://mallet.cs.umass.edu/api/index.html?cc/mallet/pipe/iterator/package-tree.html
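To make the n-gram step concrete, here is a minimal, illustrative n-gram generator; this is not Mallet's TokenSequenceNGrams implementation, just a sketch of the idea:

```java
import java.util.ArrayList;
import java.util.List;

public class NGrams {
    // Slide a window of size n over the tokens, joining each window with a space
    static List<String> ngrams(List<String> tokens, int n) {
        List<String> result = new ArrayList<>();
        for (int i = 0; i + n <= tokens.size(); i++) {
            result.add(String.join(" ", tokens.subList(i, i + n)));
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> tokens = List.of("jacob", "likes", "table", "tennis");
        System.out.println(ngrams(tokens, 2)); // [jacob likes, likes table, table tennis]
    }
}
```
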
Now we are ready to build a class that will import our data.
First, let's build a pipeline, where each processing step is denoted as a pipe in Mallet. Pipes can be wired together serially in an ArrayList<Pipe> object:
ArrayList<Pipe> pipeList = new ArrayList<Pipe>();
Begin by reading data from a file object and converting all the characters into lower case:
pipeList.add(new Input2CharSequence("UTF-8"));
pipeList.add(new CharSequenceLowercase());
Next, tokenize raw strings with a regular expression. The following pattern matches runs of Unicode letters, numbers, and the underscore character:

Pattern tokenPattern = Pattern.compile("[\\p{L}\\p{N}_]+");
pipeList.add(new CharSequence2TokenSequence(tokenPattern));
Remove stop words, that is, frequent words with low predictive power, using a standard English stop list. Two additional parameters indicate whether stop-word removal should be case-sensitive and whether to mark deletions instead of just deleting the words. We'll set both of them to false:
pipeList.add(new TokenSequenceRemoveStopwords(false, false));
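Conceptually, this pipe filters the token stream against a stop list. A Mallet-free sketch of the idea, with a tiny illustrative stop list (Mallet ships a much larger English one):

```java
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

public class StopwordRemoval {
    // A tiny illustrative stop list, not Mallet's standard English list
    static final Set<String> STOPWORDS = Set.of("the", "a", "i", "we", "and", "of");

    // Keep only the tokens that are not on the stop list
    static List<String> removeStopwords(List<String> tokens) {
        return tokens.stream()
                .filter(t -> !STOPWORDS.contains(t.toLowerCase()))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(removeStopwords(
            List.of("the", "gunman", "took", "a", "woman", "hostage")));
        // [gunman, took, woman, hostage]
    }
}
```
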
Instead of storing the actual words, we can convert them into integers, indicating a word index in the bag of words:
pipeList.add(new TokenSequence2FeatureSequence());
We'll do the same for the class label; instead of the label string, we'll use an integer indicating the position of the label in the label alphabet:
pipeList.add(new Target2Label());
We could also print the features and the labels by invoking the PrintInputAndTarget pipe:
pipeList.add(new PrintInputAndTarget());
Finally, we store the list of pipes in a SerialPipes class that will convert an instance through a sequence of pipes:
SerialPipes pipeline = new SerialPipes(pipeList);
Now, let's take a look at how to apply this in a text mining application!