In our corpus, some words, such as the and is, occur very frequently, while others occur rarely. To maintain a balance between the two, we use a subsampling technique: each word whose frequency exceeds a certain threshold is removed with probability $p(w_i)$, which can be represented as:

$$p(w_i) = 1 - \sqrt{\frac{t}{f(w_i)}}$$

Here, $t$ is the threshold and $f(w_i)$ is the frequency of the word $w_i$.
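The subsampling step can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names `discard_probabilities` and `subsample` are hypothetical, the frequency of a word is assumed to be its relative frequency (count divided by the total number of tokens), and negative probabilities, which arise for words rarer than the threshold, are clipped to zero so that such words are always kept.

```python
import math
import random
from collections import Counter


def discard_probabilities(tokens, t=1e-5):
    """Return the removal probability 1 - sqrt(t / f(w)) for every word.

    Assumption: f(w) is the relative frequency of w in `tokens`.
    """
    counts = Counter(tokens)
    total = len(tokens)
    probs = {}
    for word, count in counts.items():
        freq = count / total
        # Words with freq <= t would get a non-positive value; clip to 0
        # so that they are never removed.
        probs[word] = max(0.0, 1.0 - math.sqrt(t / freq))
    return probs


def subsample(tokens, t=1e-5, seed=0):
    """Drop each token independently with its word's removal probability."""
    rng = random.Random(seed)  # seeded only to make the sketch reproducible
    probs = discard_probabilities(tokens, t)
    return [w for w in tokens if rng.random() >= probs[w]]


# Toy corpus; the threshold is set far higher than usual because the
# corpus is tiny.
corpus = ["the"] * 90 + ["cat"] * 9 + ["sat"]
probs = discard_probabilities(corpus, t=0.05)
kept = subsample(corpus, t=0.05)
```

In this toy corpus, the very frequent word the receives the highest removal probability, while sat, whose relative frequency is below the threshold, is never removed. On a real corpus the threshold would typically be much smaller; the value used here is chosen only so that the effect is visible on a hundred tokens.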