Zipf's law states that the frequency of a token in a text is inversely proportional to its rank in the list of tokens sorted by decreasing frequency. This law describes how tokens are distributed in languages: a few tokens occur very frequently, some occur with intermediate frequency, and a large number occur rarely.
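As a small illustration of what the law implies, the sketch below builds a rank-frequency table with the standard library. The helper name rank_frequency and the token list are assumptions made up for this example; the counts (6, 3, 2) are synthetic and were chosen to follow the law exactly, so the product of rank and frequency comes out the same for every token. Real text only approximates this.

```python
from collections import Counter

def rank_frequency(tokens):
    """Return (rank, token, count) triples, most frequent token first."""
    counts = Counter(tokens)
    return [(rank, tok, n)
            for rank, (tok, n) in enumerate(counts.most_common(), start=1)]

# Synthetic tokens (not real corpus data): counts 6, 3, 2 follow f = 6 / rank.
tokens = ['the'] * 6 + ['of'] * 3 + ['and'] * 2

for rank, tok, n in rank_frequency(tokens):
    # Under an exact Zipfian distribution, rank * count is constant.
    print(rank, tok, n, rank * n)
```

On a real corpus the product only stays roughly constant, which is why the law is usually checked visually on a log-log plot.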
Let's see the NLTK code that produces the log-log plot of word frequency against rank, which is used to check Zipf's law:
>>> import nltk
>>> from nltk.corpus import gutenberg
>>> from nltk.probability import FreqDist
>>> import matplotlib
>>> matplotlib.use('TkAgg')  # select the backend before importing pyplot
>>> import matplotlib.pyplot as plt
>>> fd = FreqDist()
>>> for text in gutenberg.fileids():
...     for word in gutenberg.words(text):
...         fd[word] += 1  # fd.inc(word) was removed in NLTK 3
...
>>> ranks = []
>>> freqs = []
>>> # most_common() yields (word, count) pairs sorted by decreasing count
>>> for rank, (word, freq) in enumerate(fd.most_common()):
...     ranks.append(rank + 1)
...     freqs.append(freq)
...
>>> plt.loglog(ranks, freqs)
>>> plt.xlabel('rank(r)', fontsize=14, fontweight='bold')
>>> plt.ylabel('frequency(f)', fontsize=14, fontweight='bold')
>>> plt.grid(True)
>>> plt.show()
The preceding code plots the frequency of each word against its rank, counted over all the texts in the Gutenberg corpus. We can check whether Zipf's law holds by inspecting the relationship between rank and frequency: if frequency is inversely proportional to rank, the log-log plot is approximately a straight line with a slope close to -1.
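Reading the slope off a plot is imprecise, so one possible complement is to estimate it numerically. The following is a minimal sketch that fits a least-squares line to log(frequency) against log(rank) using only the standard library. The function name loglog_slope is an assumption introduced here, not part of NLTK, and the input counts are synthetic values of the form 1000 / r used purely as a sanity check.

```python
import math

def loglog_slope(freqs):
    """Least-squares slope of log(frequency) against log(rank).

    `freqs` must hold positive counts sorted in decreasing order, so that
    position i corresponds to rank i + 1. At least two values are needed.
    """
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope = covariance(x, y) / variance(x)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var

# Synthetic, idealized counts f(r) = 1000 / r (not measured from any corpus).
ideal = [1000 / r for r in range(1, 101)]
print(loglog_slope(ideal))  # very close to -1 for an exact Zipfian distribution
```

Assuming the freqs list was built from a frequency-sorted distribution as in the session above, it could be passed to loglog_slope directly. A value near -1 would be consistent with Zipf's law, although a single fitted slope can hide deviations at the head and tail of the distribution.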