Subsampling frequent words

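In our corpus, some words occur very frequently, such as the, is, and so on, while others occur only rarely. To maintain a balance between the two, we use a subsampling technique: a word whose frequency exceeds a certain threshold is removed with the probability $P(w_i)$, which can be represented as:

$$P(w_i) = 1 - \sqrt{\frac{t}{f(w_i)}}$$

Here, $t$ is the threshold and $f(w_i)$ is the frequency of the word $w_i$. The more frequent a word is, the larger $f(w_i)$ becomes and the closer the removal probability gets to 1; for a word whose frequency is at or below $t$, the expression is zero or negative, so the word is never removed.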
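The removal rule can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the function names `discard_probability` and `subsample` are hypothetical, the frequency is assumed to be a word's share of all tokens in the corpus, and the default threshold of `1e-5` is only an illustrative value.

```python
import math
import random
from collections import Counter


def discard_probability(word_freq, threshold=1e-5):
    """Return P(w_i) = 1 - sqrt(t / f(w_i)), clipped at 0 for rare words.

    word_freq is assumed to be the word's relative frequency (count / total tokens).
    """
    return max(0.0, 1.0 - math.sqrt(threshold / word_freq))


def subsample(tokens, threshold=1e-5, seed=0):
    """Drop each token independently with its word's discard probability."""
    rng = random.Random(seed)      # fixed seed so the sketch is reproducible
    counts = Counter(tokens)       # raw count of every word
    total = len(tokens)
    kept = []
    for tok in tokens:
        freq = counts[tok] / total  # relative frequency f(w_i)
        # Keep the token when the random draw falls outside the discard region.
        if rng.random() >= discard_probability(freq, threshold):
            kept.append(tok)
    return kept
```

With a very small threshold, words such as the and is are dropped most of the time, while words at or below the threshold are always kept. Because the draw is made per occurrence, a frequent word is thinned out rather than removed from the vocabulary entirely.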
