Removing noise

Any text in a sentence that is not relevant to the context of the data can be termed noise.

For example, this can include language stop words (commonly used words in a language, such as is, am, the, of, and in), URLs or links, social media entities (mentions, hashtags), and punctuation.

To remove noise from a sentence, the general approach is to maintain a dictionary of noise words, iterate through the tokens of the sentence, and remove every token that appears in the dictionary. The dictionary should be updated regularly so that it covers newly encountered noise.
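
The dictionary-based approach can be sketched in Python as follows. This is a minimal illustration rather than a definitive implementation: the names NOISE_WORDS and remove_noise, the regular expressions, and the whitespace tokenization are all assumptions made for the example, and the noise dictionary holds only the stop words listed above.

```python
import re

# Hypothetical noise dictionary: only the stop words mentioned in the text.
# A real dictionary would be larger and tailored to the data.
NOISE_WORDS = {"is", "am", "the", "of", "and", "in"}

# Simple patterns assumed for this sketch; real data may need stricter ones.
URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")   # URLs or links
ENTITY_PATTERN = re.compile(r"[@#]\w+")              # mentions and hashtags
PUNCT_PATTERN = re.compile(r"[^\w\s]")               # punctuation


def remove_noise(sentence, noise_words=NOISE_WORDS):
    """Return the tokens of `sentence` with noise removed."""
    # Strip pattern-based noise first, replacing it with spaces.
    text = URL_PATTERN.sub(" ", sentence)
    text = ENTITY_PATTERN.sub(" ", text)
    text = PUNCT_PATTERN.sub(" ", text)
    # Tokenize on whitespace and drop tokens found in the dictionary.
    tokens = text.lower().split()
    return [token for token in tokens if token not in noise_words]


if __name__ == "__main__":
    sample = "The price of gold is rising in London, see https://example.com #gold @trader"
    print(remove_noise(sample))
```

Storing the noise words in a set keeps each lookup fast, and updating the dictionary only requires adding entries to NOISE_WORDS or passing a different set as the noise_words argument.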

