Segmentation, annotation, and chunking

When text is available in digital form, finding words is relatively easy: we can split the stream on non-word characters. Spoken language analysis is harder, because continuous speech has no explicit word delimiters. In this case, segmenters try to optimize a metric, for example, minimizing the number of distinct words in the lexicon together with the length or complexity of the phrase (Natural Language Processing with Python by Steven Bird et al., O'Reilly Media Inc., 2009).
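Splitting on non-word characters can be sketched in a few lines of Python. This is a minimal illustration, not a production tokenizer; the function name tokenize and the choice to lowercase the tokens are assumptions made for the example.

```python
import re

def tokenize(text):
    """Split text on non-word characters and lowercase the pieces."""
    # \w+ matches runs of letters, digits, and underscores;
    # everything else acts as a separator.
    return [token.lower() for token in re.findall(r"\w+", text)]

tokens = tokenize("We saw the yellow dog.")
print(tokens)  # ['we', 'saw', 'the', 'yellow', 'dog']
```

Note that this simple rule also splits on apostrophes, so a contraction such as don't becomes two tokens, don and t. Real tokenizers usually add special handling for such cases.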

Annotation usually refers to part-of-speech tagging. In English, the traditional parts of speech are nouns, pronouns, verbs, adjectives, adverbs, articles, prepositions, conjunctions, and interjections. For example, in the phrase "we saw the yellow dog", "we" is a pronoun, "saw" is a verb, "the" is an article, "yellow" is an adjective, and "dog" is a noun.
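As a rough illustration, the annotation of the example phrase can be reproduced with a lookup table. The lexicon below is a hypothetical, hand-built dictionary that covers only this sentence; real taggers are trained on annotated corpora and take context into account.

```python
# Hypothetical hand-built lexicon covering only the example sentence.
TOY_LEXICON = {
    "we": "pronoun",
    "saw": "verb",
    "the": "article",
    "yellow": "adjective",
    "dog": "noun",
}

def tag(tokens, lexicon=TOY_LEXICON, default="unknown"):
    """Pair each token with its part of speech, or `default` if it is missing."""
    return [(token, lexicon.get(token.lower(), default)) for token in tokens]

for word, pos in tag("we saw the yellow dog".split()):
    print(f"{word}: {pos}")
```

A pure lookup cannot handle words with more than one possible tag: "saw" is a verb here, but it is a noun in "a rusty saw". Resolving such cases requires looking at the surrounding words.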

In some languages, chunking and annotation depend on context. For example, in Chinese, a phrase that literally translates to "love country person" can mean either "country-loving person" or "love country-person". In Russian, казнить нельзя помиловать, literally "execute not pardon", can mean either "execute, don't pardon" or "don't execute, pardon". In written language, commas disambiguate the phrase; in spoken language, the difference is usually very hard to recognize, although intonation can sometimes help to segment the phrase properly.
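To see why chunking is ambiguous, it helps to enumerate the candidate groupings explicitly. The sketch below lists every way to cut a token sequence into contiguous chunks; the function name chunkings is an assumption for this example, and the code only generates candidates, it does not choose between them.

```python
from itertools import combinations

def chunkings(tokens):
    """Yield every way to cut a token list into contiguous, non-empty chunks."""
    n = len(tokens)
    # A cut can be placed in any of the n - 1 gaps between adjacent tokens.
    for k in range(n):
        for cuts in combinations(range(1, n), k):
            bounds = (0, *cuts, n)
            yield [tokens[i:j] for i, j in zip(bounds, bounds[1:])]

for chunks in chunkings(["love", "country", "person"]):
    print(" | ".join("-".join(chunk) for chunk in chunks))
```

A sequence of n tokens has 2^(n-1) such groupings, so three tokens give four candidates, including both "love-country | person" and "love | country-person". A segmenter has to pick one of them using context or a scoring metric.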

For techniques based on word frequencies in bags of words, some extremely common words, which are of little value in helping select documents, are explicitly excluded from the vocabulary. These words are called stop words. There is no good general strategy for building a stop list, but a common approach is to exclude very frequent words that appear in almost every document and therefore do not help to differentiate between documents for classification or information retrieval purposes.
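One simple way to build such a stop list is to count in how many documents each word occurs and drop the words above a cutoff. The following is a minimal sketch under that assumption; the function name frequent_words, the 0.8 threshold, and the three sample documents are all illustrative choices, not recommended values.

```python
from collections import Counter

def frequent_words(documents, min_doc_fraction=0.8):
    """Return words that occur in at least `min_doc_fraction` of the documents."""
    doc_freq = Counter()
    for doc in documents:
        # Count each word once per document, however often it repeats.
        doc_freq.update(set(doc.lower().split()))
    cutoff = min_doc_fraction * len(documents)
    return {word for word, count in doc_freq.items() if count >= cutoff}

docs = [
    "the dog saw the cat",
    "the cat sat on the mat",
    "we saw the yellow dog",
]
stop_words = frequent_words(docs)
filtered = [[w for w in doc.split() if w not in stop_words] for doc in docs]
print(stop_words)  # {'the'}
print(filtered)
```

Here only "the" appears in all three documents, so it is the only word removed. On a real corpus, the threshold would need tuning, and many systems combine a frequency rule like this with a curated list.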
