What is TF-IDF?

Tf-idf (term frequency-inverse document frequency) is a way to represent documents as feature vectors. But what is it? A tf-idf weight can be understood as the raw term frequency (tf) scaled by the inverse document frequency (idf). The tf is the count of how often a particular word occurs in a given document. The idf shrinks as the number of documents containing that word grows. The concept behind tf-idf is to down-weight terms according to the number of documents in which they occur. Here, the idea is that terms that occur in many different documents are likely to be unimportant, or to carry little useful information, for NLP tasks such as document classification.
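To make the weighting concrete, here is a minimal sketch in plain Python. It assumes the simple textbook form tf(t, d) * log(N / df(t)), where N is the number of documents and df(t) is the number of documents containing term t. That exact formula, the whitespace tokenizer, the function name `tf_idf`, and the three toy sentences are all assumptions made for illustration; libraries commonly use smoothed or normalized variants, so their numbers will differ.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per document.

    Assumes the plain weighting tf(t, d) * log(N / df(t));
    this is an illustrative choice, not the only definition in use.
    """
    # Naive tokenizer (assumption): lowercase and split on whitespace.
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)

    # df(t): number of documents that contain term t at least once.
    df = Counter(term for tokens in tokenized for term in set(tokens))

    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)  # raw count of each term in this document
        weights.append(
            {term: count * math.log(n_docs / df[term]) for term, count in tf.items()}
        )
    return weights


# Hypothetical toy corpus, used only to show the effect.
docs = [
    "the sun is shining",
    "the weather is sweet",
    "the sun is shining and the weather is sweet",
]
weights = tf_idf(docs)
```

In this toy corpus, "the" and "is" appear in every document, so their idf is log(3/3) = 0 and their weight vanishes no matter how often they occur. A word such as "and", which appears in only one document, gets the largest idf, log(3).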

