Segmentation, annotation, and chunking

When text is available in digital form, finding words is relatively easy because we can split the character stream on non-word characters. Spoken language analysis is harder, since speech carries no explicit word boundaries. In this case, segmenters try to optimize a metric, for example, minimizing the number of distinct words in the lexicon together with the length or complexity of the phrase (Natural Language Processing with Python, Steven Bird et al., O'Reilly Media Inc., 2009).
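As a minimal sketch of both ideas, the Python snippet below first splits text on non-word characters, then scores a candidate segmentation of unsegmented input. The function names (`tokenize`, `segment`, `segmentation_cost`) and the boundary-string encoding are assumptions made for this illustration. The cost, number of words plus the size of the lexicon, is one simple way to express the kind of metric described above, not a definitive objective.

```python
import re

def tokenize(text):
    """Split text on runs of non-word characters and drop empty pieces."""
    return [tok for tok in re.split(r"\W+", text.lower()) if tok]

def segment(text, boundaries):
    """Cut text into words; boundaries[i] == '1' marks a break after text[i]."""
    words, start = [], 0
    for i, flag in enumerate(boundaries):
        if flag == "1":
            words.append(text[start:i + 1])
            start = i + 1
    words.append(text[start:])
    return words

def segmentation_cost(text, boundaries):
    """Illustrative cost: number of words plus lexicon size (lower is better)."""
    words = segment(text, boundaries)
    # Each distinct word costs its length plus one separator character.
    lexicon_size = sum(len(word) + 1 for word in set(words))
    return len(words) + lexicon_size

print(tokenize("We saw the yellow dog."))  # ['we', 'saw', 'the', 'yellow', 'dog']

text = "thedogthedogthedog"
good = "001" * 5 + "00"        # the|dog|the|dog|the|dog
none = "0" * (len(text) - 1)   # no breaks at all
print(segmentation_cost(text, good), segmentation_cost(text, none))  # 14 20
```

A segmenter built on such a cost would search over boundary strings for the lowest score. On very short inputs with no repeated words, the unsegmented string can score best, so the metric is only meaningful on longer text.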

Annotation usually refers to part-of-speech tagging. In English, the parts of speech are nouns, pronouns, verbs, adjectives, adverbs, articles, prepositions, conjunctions, and interjections. For example, in the phrase "we saw the yellow dog", "we" is a pronoun, "saw" is a verb, "the" is an article, "yellow" is an adjective, and "dog" is a noun.
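The tagging of the example phrase can be sketched with a lookup table. The `LEXICON` dictionary and the `tag` function are hypothetical and hand-written for this illustration. A real tagger also has to use context, because a word such as "saw" can be a noun or a verb.

```python
# Hypothetical hand-written lexicon covering only the example phrase.
LEXICON = {
    "we": "pronoun",
    "saw": "verb",
    "the": "article",
    "yellow": "adjective",
    "dog": "noun",
}

def tag(tokens, lexicon=LEXICON, default="unknown"):
    """Pair each token with its part of speech, or `default` if not listed."""
    return [(tok, lexicon.get(tok.lower(), default)) for tok in tokens]

for word, pos in tag("we saw the yellow dog".split()):
    print(word, pos)
```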

In some languages, chunking and annotation depend on context. For example, in Chinese, a phrase that literally translates to "love country person" can mean either "country-loving person" or "love country-person". In Russian, казнить нельзя помиловать, literally translating to "execute not pardon", can mean either "execute, don't pardon" or "don't execute, pardon". In written language, this can be disambiguated using commas. In spoken language, it is usually very hard to recognize the difference, even though the intonation can sometimes help to segment the phrase properly.

For techniques based on word frequencies in a bag-of-words representation, some extremely common words, which are of little value in selecting documents, are explicitly excluded from the vocabulary. These words are called stop words. There is no good general strategy for building a stop list, but a common approach is to exclude very frequent words that appear in almost every document and so do not help to differentiate between documents for classification or information retrieval purposes.
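One way to derive such a stop list is by document frequency, as in the sketch below. The function names and the `min_doc_fraction` threshold of 0.8 are assumptions chosen for this illustration. In practice the cutoff depends on the corpus.

```python
from collections import Counter

def find_stop_words(documents, min_doc_fraction=0.8):
    """Return words that occur in at least min_doc_fraction of the documents."""
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(doc))  # count each word once per document
    threshold = min_doc_fraction * len(documents)
    return {word for word, n in doc_freq.items() if n >= threshold}

def bag_of_words(tokens, stop_words):
    """Count word occurrences, skipping stop words."""
    return Counter(tok for tok in tokens if tok not in stop_words)

docs = [
    ["the", "dog", "barks"],
    ["the", "cat", "sleeps"],
    ["the", "yellow", "dog"],
]
stop_words = find_stop_words(docs)
print(stop_words)                         # {'the'}
print(bag_of_words(docs[0], stop_words))  # Counter({'dog': 1, 'barks': 1})
```

Here "the" appears in all three documents and is dropped, while "dog" appears in only two of three and stays in the vocabulary.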
