How it works...

In step 1, we created a dataset iterator using FileLabelAwareIterator.

The FileLabelAwareIterator is a simple filesystem-based implementation of the LabelAwareIterator interface. It assumes that you have one or more folders organized in the following way:

  • First-level subfolder: Label name
  • Second-level subfolder: The documents for that label

For example, with the finance and health labels used in this recipe, the labeled data directory contains a finance subfolder and a health subfolder, each holding the text documents for that label.
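The following is a minimal sketch of step 1 under these assumptions: the labeled documents live in a hypothetical paravec/labeled directory on the classpath, exception handling is omitted, and the import paths match recent DL4J versions:

    import org.deeplearning4j.text.documentiterator.FileLabelAwareIterator;
    import org.deeplearning4j.text.documentiterator.LabelAwareIterator;
    import org.nd4j.common.io.ClassPathResource;

    // Assumed layout on the classpath:
    //   paravec/labeled/finance/f01.txt   -> label "finance"
    //   paravec/labeled/health/h01.txt    -> label "health"
    ClassPathResource resource = new ClassPathResource("paravec/labeled");

    LabelAwareIterator iterator = new FileLabelAwareIterator.Builder()
            .addSourceFolder(resource.getFile())  // labels come from the folder names
            .build();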

In step 3, we created the ParagraphVectors model and configured its required hyperparameters. The purpose of paragraph vectors is to associate arbitrary documents with labels. Paragraph vectors are an extension of Word2Vec that learns to correlate labels with words, whereas Word2Vec only correlates words with other words. We need to define labels for the paragraph vectors to work.
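As a sketch of step 3, the builder below uses illustrative hyperparameter values, not the recipe's exact settings:

    import org.deeplearning4j.models.paragraphvectors.ParagraphVectors;
    import org.deeplearning4j.text.tokenization.tokenizerfactory.DefaultTokenizerFactory;
    import org.deeplearning4j.text.tokenization.tokenizerfactory.TokenizerFactory;

    TokenizerFactory tokenizerFactory = new DefaultTokenizerFactory();

    ParagraphVectors paragraphVectors = new ParagraphVectors.Builder()
            .learningRate(0.025)            // initial learning rate
            .minLearningRate(0.001)         // floor the learning rate decays to
            .batchSize(1000)                // words per training batch
            .epochs(20)                     // passes over the labeled corpus
            .iterate(iterator)              // the FileLabelAwareIterator from step 1
            .trainWordVectors(true)         // learn word vectors alongside label vectors
            .tokenizerFactory(tokenizerFactory)
            .build();

    paragraphVectors.fit();                 // train the model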

For more information on what we did in step 5, refer to the directory structure under the unlabeled directory in the project.

The directory names can be arbitrary, and no specific labels are required. Our task is to find the proper labels (document classifications) for these documents.

Word embeddings are stored in the lookup table. For any given word, the lookup table returns its vector of numbers.

In step 6, we created an InMemoryLookupTable from the paragraph vectors. InMemoryLookupTable is the default word lookup table in DL4J. Conceptually, the lookup table acts as the hidden layer of the network, and the word/document vectors are its outputs.
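A one-line sketch of step 6, assuming the trained paragraphVectors model from step 3:

    import org.deeplearning4j.models.embeddings.inmemory.InMemoryLookupTable;
    import org.deeplearning4j.models.word2vec.VocabWord;

    InMemoryLookupTable<VocabWord> lookupTable =
            (InMemoryLookupTable<VocabWord>) paragraphVectors.getLookupTable();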

Steps 8 to 12 are solely used to calculate the document vector of each document.

In step 8, we created tokens for the document using the tokenizer that was created in step 2. In step 9, we used the lookup table that was created in step 6 to obtain the VocabCache. The VocabCache stores the information needed to operate the lookup table, and we can use it to look up words in the lookup table.
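A sketch of steps 8 and 9; documentText is a hypothetical string holding the raw contents of one unlabeled document:

    import org.deeplearning4j.models.word2vec.VocabWord;
    import org.deeplearning4j.models.word2vec.wordstore.VocabCache;
    import org.deeplearning4j.text.tokenization.tokenizer.Tokenizer;
    import java.util.List;

    // Step 8: tokenize the document with the tokenizer factory from step 2.
    Tokenizer tokenizer = tokenizerFactory.create(documentText);
    List<String> tokens = tokenizer.getTokens();

    // Step 9: obtain the vocabulary that backs the lookup table.
    VocabCache<VocabWord> vocabCache = lookupTable.getVocab();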

In step 11, we stored the word vectors of every document token that occurs in the vocabulary in an INDArray.

In step 12, we calculated the mean of this INDArray to get the document vector.

Taking the mean along dimension zero averages across the rows (one row per word), collapsing the matrix of word vectors into a single document vector.
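Steps 10 to 12 can be sketched as follows, reusing the tokens, vocabCache, and lookupTable objects from the earlier snippets:

    import org.nd4j.linalg.api.ndarray.INDArray;
    import org.nd4j.linalg.factory.Nd4j;
    import java.util.ArrayList;
    import java.util.List;

    // Step 10: keep only the tokens that actually occur in the vocabulary.
    List<String> knownTokens = new ArrayList<>();
    for (String token : tokens) {
        if (vocabCache.containsWord(token)) {
            knownTokens.add(token);
        }
    }

    // Step 11: one row per known token, one column per embedding dimension.
    INDArray allWordVectors = Nd4j.create(knownTokens.size(), lookupTable.layerSize());
    for (int i = 0; i < knownTokens.size(); i++) {
        allWordVectors.putRow(i, lookupTable.vector(knownTokens.get(i)));
    }

    // Step 12: averaging along dimension 0 collapses the rows into one document vector.
    INDArray documentVector = allWordVectors.mean(0);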

In step 13, the cosine similarity is calculated by calling the cosineSim() method provided by ND4J, which measures how similar two document vectors are. vecLabel represents the document vector for a label from the classified documents. We then compared vecLabel with our unlabeled document vector, documentVector.
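A sketch of step 13; the label name "finance" is an assumption for illustration:

    import org.nd4j.linalg.ops.transforms.Transforms;

    // Label vectors are stored in the same lookup table, keyed by the label name.
    INDArray vecLabel = lookupTable.vector("finance");

    // Cosine similarity between the label vector and the unlabeled document vector.
    double similarity = Transforms.cosineSim(vecLabel, documentVector);
    System.out.println("finance similarity: " + similarity);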

After step 14, you should see the cosine similarity value for each document/label pair in the output.

We choose the label that has the higher cosine similarity value. From this output, we can infer that the first document is more likely finance-related content, with a cosine similarity of 69.7%, and the second document is more likely health-related content, with a similarity of 53.2%.
