Document clustering – understanding the number of clusters k in a semantic context

We are given the following relative frequency counts (as a percentage of all words in the text) for the words money and god(s) in 17 books from Project Gutenberg:

Book number  Book name  Money in %  God(s) in %
1  The Vedanta-Sutras with the Commentary by Ramanuja, translated by George Thibaut  0  0.07
2  The Mahabharata of Krishna-Dwaipayana Vyasa - Adi Parva, by Kisari Mohan Ganguli  0  0.17
3  The Mahabharata of Krishna-Dwaipayana Vyasa, Part 2, by Krishna-Dwaipayana Vyasa  0.01  0.10
4  Mahabharata of Krishna-Dwaipayana Vyasa Bk. 3 Pt. 1, by Krishna-Dwaipayana Vyasa  0  0.32
5  The Mahabharata of Krishna-Dwaipayana Vyasa Bk. 4, by Kisari Mohan Ganguli  0  0.06
6  The Mahabharata of Krishna-Dwaipayana Vyasa Bk. 3 Pt. 2, translated by Kisari Mohan Ganguli  0  0.27
7  The Vedanta-Sutras with the Commentary by Sankaracarya  0  0.06
8  The King James Bible  0.02  0.59
9  Paradise Regained, by John Milton  0.02  0.45
10  Imitation of Christ, by Thomas A Kempis  0.01  0.69
11  The Koran as translated by Rodwell  0.01  1.72
12  The Adventures of Tom Sawyer, Complete, by Mark Twain (Samuel Clemens)  0.05  0.01
13  Adventures of Huckleberry Finn, Complete, by Mark Twain (Samuel Clemens)  0.08  0
14  Great Expectations, by Charles Dickens  0.04  0.01
15  The Picture of Dorian Gray, by Oscar Wilde  0.03  0.03
16  The Adventures of Sherlock Holmes, by Arthur Conan Doyle  0.04  0.03
17  Metamorphosis, by Franz Kafka, translated by David Wyllie  0.06  0.03

We would like to cluster the books into groups by their semantic context, based on these two word frequency counts.

Analysis:

First we rescale the data: the highest frequency count of the word money is only 0.08%, whereas the highest frequency count of the word god(s) is 1.72%, so without rescaling the god(s) dimension would dominate the Euclidean distances used by k-means. We therefore divide the frequency counts of money by 0.08 and the frequency counts of god(s) by 1.72, putting both features on a common [0, 1] scale:

Book number Money scaled God(s) scaled
1 0 0.0406976744
2 0 0.0988372093
3 0.125 0.0581395349
4 0 0.1860465116
5 0 0.0348837209
6 0 0.1569767442
7 0 0.0348837209
8 0.25 0.3430232558
9 0.25 0.261627907
10 0.125 0.4011627907
11 0.125 1
12 0.625 0.0058139535
13 1 0
14 0.5 0.0058139535
15 0.375 0.0174418605
16 0.5 0.0174418605
17 0.75 0.0174418605
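
This rescaling can be reproduced with a few lines of Python (a minimal sketch; the raw percentages are typed in directly from the table above, and the printed values match the scaled table up to floating-point formatting):

# Rescale each feature to [0, 1] by dividing by its column maximum.
money = [0, 0, 0.01, 0, 0, 0, 0, 0.02, 0.02, 0.01, 0.01,
         0.05, 0.08, 0.04, 0.03, 0.04, 0.06]
gods = [0.07, 0.17, 0.10, 0.32, 0.06, 0.27, 0.06, 0.59, 0.45,
        0.69, 1.72, 0.01, 0, 0.01, 0.03, 0.03, 0.03]

money_scaled = [m / max(money) for m in money]
gods_scaled = [g / max(gods) for g in gods]

for m, g in zip(money_scaled, gods_scaled):
    print('%s,%s' % (m, g))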

Now that we have the rescaled data, let us apply the k-means clustering algorithm, dividing the data into different numbers of clusters.
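
The script k-means_clustering.py used below is not listed in this section; its core is the standard Lloyd's iteration, which in outline looks roughly like the following sketch (the initial centroids are supplied by the caller, since the script's own initialization strategy is not shown here):

import math

def closest(point, centroids):
    # Index of the centroid nearest to the point (Euclidean distance).
    return min(range(len(centroids)),
               key=lambda i: math.dist(point, centroids[i]))

def k_means(points, centroids):
    # Repeat: assign each point to its nearest centroid, then move each
    # centroid to the mean of its assigned points, until nothing changes.
    while True:
        groups = [closest(p, centroids) for p in points]
        new_centroids = []
        for i in range(len(centroids)):
            members = [p for p, g in zip(points, groups) if g == i]
            if members:
                new_centroids.append(tuple(sum(c) / len(members)
                                           for c in zip(*members)))
            else:
                # Keep an empty cluster's centroid where it is.
                new_centroids.append(centroids[i])
        if new_centroids == centroids:
            # Pairs of (point, group), mirroring the script's point_groups.
            return list(zip(points, groups)), centroids
        centroids = new_centroids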

Input:

source_code/5/document_clustering/word_frequencies_money_god_scaled.csv
0,0.0406976744
0,0.0988372093
0.125,0.0581395349
0,0.1860465116
0,0.0348837209
0,0.1569767442
0,0.0348837209
0.25,0.3430232558
0.25,0.261627907
0.125,0.4011627907
0.125,1
0.625,0.0058139535
1,0
0.5,0.0058139535
0.375,0.0174418605
0.5,0.0174418605
0.75,0.0174418605

Output for 2 clusters:

$ python k-means_clustering.py document_clustering/word_frequencies_money_god_scaled.csv 2 last
The total number of steps: 3
The history of the algorithm:
Step number 0: point_groups = [((0.0, 0.0406976744), 0), ((0.0, 0.0988372093), 0), ((0.125, 0.0581395349), 0), ((0.0, 0.1860465116), 0), ((0.0, 0.0348837209), 0), ((0.0, 0.1569767442), 0), ((0.0, 0.0348837209), 0), ((0.25, 0.3430232558), 0), ((0.25, 0.261627907), 0), ((0.125, 0.4011627907), 0), ((0.125, 1.0), 0), ((0.625, 0.0058139535), 1), ((1.0, 0.0), 1), ((0.5, 0.0058139535), 1), ((0.375, 0.0174418605), 0), ((0.5, 0.0174418605), 1), ((0.75, 0.0174418605), 1)]
centroids = [(0.0, 0.0406976744), (1.0, 0.0)]
Step number 1: point_groups = [((0.0, 0.0406976744), 0), ((0.0, 0.0988372093), 0), ((0.125, 0.0581395349), 0), ((0.0, 0.1860465116), 0), ((0.0, 0.0348837209), 0), ((0.0, 0.1569767442), 0), ((0.0, 0.0348837209), 0), ((0.25, 0.3430232558), 0), ((0.25, 0.261627907), 0), ((0.125, 0.4011627907), 0), ((0.125, 1.0), 0), ((0.625, 0.0058139535), 1), ((1.0, 0.0), 1), ((0.5, 0.0058139535), 1), ((0.375, 0.0174418605), 1), ((0.5, 0.0174418605), 1), ((0.75, 0.0174418605), 1)]
centroids = [(0.10416666666666667, 0.21947674418333332), (0.675, 0.0093023256)]
Step number 2: point_groups = [((0.0, 0.0406976744), 0), ((0.0, 0.0988372093), 0), ((0.125, 0.0581395349), 0), ((0.0, 0.1860465116), 0), ((0.0, 0.0348837209), 0), ((0.0, 0.1569767442), 0), ((0.0, 0.0348837209), 0), ((0.25, 0.3430232558), 0), ((0.25, 0.261627907), 0), ((0.125, 0.4011627907), 0), ((0.125, 1.0), 0), ((0.625, 0.0058139535), 1), ((1.0, 0.0), 1), ((0.5, 0.0058139535), 1), ((0.375, 0.0174418605), 1), ((0.5, 0.0174418605), 1), ((0.75, 0.0174418605), 1)]
centroids = [(0.07954545454545454, 0.2378435517909091), (0.625, 0.01065891475)]
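
To read the final point_groups assignment back in terms of book numbers, we can collect the indices per group label (a small helper sketch, with the labels copied from step number 2 above):

# Group labels for books 1-17, in order, from the final step above.
final_groups = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
for cluster in sorted(set(final_groups)):
    books = [i + 1 for i, group in enumerate(final_groups) if group == cluster]
    print('Cluster %d: books %s' % (cluster, books))
# Cluster 0: books [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
# Cluster 1: books [12, 13, 14, 15, 16, 17]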

We can observe that clustering into 2 clusters divides the books into the religious ones (group 0 in the output above, drawn in blue in the accompanying scatter plot) and the non-religious ones (group 1, drawn in red). Let us try to cluster the books into 3 clusters to observe how the algorithm divides the data.

Output for 3 clusters:

$ python k-means_clustering.py document_clustering/word_frequencies_money_god_scaled.csv 3 last
The total number of steps: 3
The history of the algorithm:
Step number 0: point_groups = [((0.0, 0.0406976744), 0), ((0.0, 0.0988372093), 0), ((0.125, 0.0581395349), 0), ((0.0, 0.1860465116), 0), ((0.0, 0.0348837209), 0), ((0.0, 0.1569767442), 0), ((0.0, 0.0348837209), 0), ((0.25, 0.3430232558), 0), ((0.25, 0.261627907), 0), ((0.125, 0.4011627907), 0), ((0.125, 1.0), 2), ((0.625, 0.0058139535), 1), ((1.0, 0.0), 1), ((0.5, 0.0058139535), 1), ((0.375, 0.0174418605), 0), ((0.5, 0.0174418605), 1), ((0.75, 0.0174418605), 1)]
centroids = [(0.0, 0.0406976744), (1.0, 0.0), (0.125, 1.0)]
Step number 1: point_groups = [((0.0, 0.0406976744), 0), ((0.0, 0.0988372093), 0), ((0.125, 0.0581395349), 0), ((0.0, 0.1860465116), 0), ((0.0, 0.0348837209), 0), ((0.0, 0.1569767442), 0), ((0.0, 0.0348837209), 0), ((0.25, 0.3430232558), 0), ((0.25, 0.261627907), 0), ((0.125, 0.4011627907), 0), ((0.125, 1.0), 2), ((0.625, 0.0058139535), 1), ((1.0, 0.0), 1), ((0.5, 0.0058139535), 1), ((0.375, 0.0174418605), 1), ((0.5, 0.0174418605), 1), ((0.75, 0.0174418605), 1)]
centroids = [(0.10227272727272728, 0.14852008456363636), (0.675, 0.0093023256), (0.125, 1.0)]
Step number 2: point_groups = [((0.0, 0.0406976744), 0), ((0.0, 0.0988372093), 0), ((0.125, 0.0581395349), 0), ((0.0, 0.1860465116), 0), ((0.0, 0.0348837209), 0), ((0.0, 0.1569767442), 0), ((0.0, 0.0348837209), 0), ((0.25, 0.3430232558), 0), ((0.25, 0.261627907), 0), ((0.125, 0.4011627907), 0), ((0.125, 1.0), 2), ((0.625, 0.0058139535), 1), ((1.0, 0.0), 1), ((0.5, 0.0058139535), 1), ((0.375, 0.0174418605), 1), ((0.5, 0.0174418605), 1), ((0.75, 0.0174418605), 1)]
centroids = [(0.075, 0.16162790697), (0.625, 0.01065891475), (0.125, 1.0)]

This time the algorithm separates The Koran from the other religious books into its own (green) cluster. This is because god is in fact the 5th most frequent word in The Koran, so its relative frequency is far higher than in the other religious texts. The clustering here happens to divide the books according to their writing style as well. Clustering into 4 clusters (output not shown) further separates one book with a relatively high frequency of the word money from the red cluster of non-religious books into a separate cluster. Let us look at the clustering into 5 clusters.
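
Any of these runs, including the 4-cluster one whose output is omitted here, can also be cross-checked with scikit-learn, if it is available (a sketch; scikit-learn's cluster labels are arbitrary, so the numbering may differ from the book's script even when the grouping is identical):

import numpy as np
from sklearn.cluster import KMeans

points = np.loadtxt('document_clustering/word_frequencies_money_god_scaled.csv',
                    delimiter=',')
for k in (2, 3, 4, 5):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(points)
    print(k, labels.tolist())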

Output for 5 clusters:

$ python k-means_clustering.py word_frequencies_money_god_scaled.csv 5 last
The total number of steps: 2
The history of the algorithm:
Step number 0: point_groups = [((0.0, 0.0406976744), 0), ((0.0, 0.0988372093), 0), ((0.125, 0.0581395349), 0), ((0.0, 0.1860465116), 0), ((0.0, 0.0348837209), 0), ((0.0, 0.1569767442), 0), ((0.0, 0.0348837209), 0), ((0.25, 0.3430232558), 4), ((0.25, 0.261627907), 4), ((0.125, 0.4011627907), 4), ((0.125, 1.0), 2), ((0.625, 0.0058139535), 3), ((1.0, 0.0), 1), ((0.5, 0.0058139535), 3), ((0.375, 0.0174418605), 3), ((0.5, 0.0174418605), 3), ((0.75, 0.0174418605), 3)]
centroids = [(0.0, 0.0406976744), (1.0, 0.0), (0.125, 1.0), (0.5, 0.0174418605), (0.25, 0.3430232558)]
Step number 1: point_groups = [((0.0, 0.0406976744), 0), ((0.0, 0.0988372093), 0), ((0.125, 0.0581395349), 0), ((0.0, 0.1860465116), 0), ((0.0, 0.0348837209), 0), ((0.0, 0.1569767442), 0), ((0.0, 0.0348837209), 0), ((0.25, 0.3430232558), 4), ((0.25, 0.261627907), 4), ((0.125, 0.4011627907), 4), ((0.125, 1.0), 2), ((0.625, 0.0058139535), 3), ((1.0, 0.0), 1), ((0.5, 0.0058139535), 3), ((0.375, 0.0174418605), 3), ((0.5, 0.0174418605), 3), ((0.75, 0.0174418605), 3)]
centroids = [(0.017857142857142856, 0.08720930231428571), (1.0, 0.0), (0.125, 1.0), (0.55, 0.0127906977), (0.20833333333333334, 0.3352713178333333)]

This clustering further divides the remaining religious books: the Hindu texts (books 1 to 7) stay in the blue cluster, while the Christian texts (books 8 to 10) are split off into the gray cluster.

We can use clustering in this way to group items with similar properties and then quickly find items similar to a given example. The clustering parameter k controls the granularity of the grouping: the higher the value of k, the more similar the items within a cluster can be expected to be, but the fewer items each cluster will contain.
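
A common heuristic for choosing k is the elbow method: track how the within-cluster sum of squared distances (the inertia) drops as k grows, and stop at the point where additional clusters bring little further improvement. A sketch, again assuming scikit-learn is available:

import numpy as np
from sklearn.cluster import KMeans

points = np.loadtxt('document_clustering/word_frequencies_money_god_scaled.csv',
                    delimiter=',')
for k in range(1, 8):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(points)
    # Inertia: sum of squared distances of points to their nearest centroid.
    print('k=%d  inertia=%.4f' % (k, model.inertia_))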
