Filtering insignificant words from a sentence

Many of the most commonly used words are insignificant when it comes to discerning the meaning of a phrase. For example, in the phrase "the movie was terrible", the most significant words are "movie" and "terrible", while "the" and "was" are almost useless. You get the same meaning if you take them out, leaving "movie terrible" or "terrible movie". Either way, the sentiment is the same. In this recipe, we'll learn how to remove the insignificant words and keep the significant ones by looking at their part-of-speech tags.

Getting ready

First, we need to decide which part-of-speech tags are significant and which are not. Looking through the treebank corpus for stopwords yields the following table of insignificant words and tags:

Word	Tag
a	DT
all	PDT
an	DT
and	CC
or	CC
that	WDT
the	DT

Other than CC, all the tags end with DT. This means we can filter out insignificant words by looking at the tag's suffix. Refer to Appendix A, Penn Treebank Part-of-speech Tags, for details on tag meanings.
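To see how such a table can be derived, here is a minimal sketch. It assumes a small hand-tagged sample and a tiny stopword set in place of the full treebank corpus and a real stopword list, and tags_for_stopwords() is a hypothetical helper that is not part of transforms.py:

```python
from collections import defaultdict

# Hypothetical stand-ins: a few hand-tagged words instead of the treebank
# corpus, and a tiny stopword set instead of a full stopword list.
SAMPLE_TAGGED_WORDS = [
    ('the', 'DT'), ('movie', 'NN'), ('was', 'VBD'), ('terrible', 'JJ'),
    ('and', 'CC'), ('all', 'PDT'), ('the', 'DT'), ('actors', 'NNS'),
    ('that', 'WDT'), ('a', 'DT'), ('or', 'CC'), ('an', 'DT'),
]
SAMPLE_STOPWORDS = {'a', 'all', 'an', 'and', 'or', 'that', 'the'}


def tags_for_stopwords(tagged_words, stopwords):
    """Map each stopword found in tagged_words to the set of tags it carries."""
    found = defaultdict(set)
    for word, tag in tagged_words:
        if word.lower() in stopwords:
            found[word.lower()].add(tag)
    return dict(found)


table = tags_for_stopwords(SAMPLE_TAGGED_WORDS, SAMPLE_STOPWORDS)
for word in sorted(table):
    print(word, sorted(table[word]))
```

Running the same idea over a full tagged corpus would likely surface more words and tags than this sample does, so treat the result as a starting point for choosing suffixes.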

How to do it...

The transforms.py module contains a function called filter_insignificant(). It takes a single chunk, which should be a list of tagged words, and returns a new chunk without any insignificant tagged words. By default, it filters out any word whose tag ends with DT or CC:

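def filter_insignificant(chunk, tag_suffixes=('DT', 'CC')):
  # A tuple default avoids sharing one mutable list between calls.
  good = []

  for word, tag in chunk:
    ok = True

    # Reject the word as soon as its tag matches any suffix.
    for suffix in tag_suffixes:
      if tag.endswith(suffix):
        ok = False
        break

    if ok:
      good.append((word, tag))

  return good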

And now we can use it on the part-of-speech tagged version of the terrible movie:

>>> from transforms import filter_insignificant
>>> filter_insignificant([('the', 'DT'), ('terrible', 'JJ'), ('movie', 'NN')])
[('terrible', 'JJ'), ('movie', 'NN')]

As you can see, the word "the" is eliminated from the chunk.

How it works...

The filter_insignificant() function iterates over the tagged words in the chunk. For each tagged word, it checks whether the tag ends with any of the tag_suffixes. If it does, the tagged word is skipped. Otherwise, the tagged word is appended to a new list, good, which is returned once every word has been checked.
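Python's str.endswith() also accepts a tuple of suffixes, so the inner loop can be collapsed into a single call. The following is a sketch of an equivalent formulation, not the version in transforms.py, and the name filter_insignificant_compact() is hypothetical:

```python
def filter_insignificant_compact(chunk, tag_suffixes=('DT', 'CC')):
    """Return the tagged words in chunk whose tag ends with none of tag_suffixes."""
    # str.endswith() accepts a tuple and returns True if any suffix matches.
    suffixes = tuple(tag_suffixes)
    return [(word, tag) for word, tag in chunk if not tag.endswith(suffixes)]


chunk = [('the', 'DT'), ('terrible', 'JJ'), ('movie', 'NN')]
print(filter_insignificant_compact(chunk))
# prints [('terrible', 'JJ'), ('movie', 'NN')]
```

Converting tag_suffixes with tuple() means callers can still pass a list, as the examples in this recipe do.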

There's more...

Because of the way filter_insignificant() is defined, you can pass in your own tag suffixes if DT and CC are not enough, or are incorrect for your case. For example, you might decide that possessive words and pronouns such as "you", "your", "their", and "theirs" are no good, but DT and CC words are ok. The tag suffixes would then be PRP and PRP$. Both must be listed, because the check is a suffix match and the tag PRP$ does not end with PRP:

>>> filter_insignificant([('your', 'PRP$'), ('book', 'NN'), ('is', 'VBZ'), ('great', 'JJ')], tag_suffixes=['PRP', 'PRP$'])
[('book', 'NN'), ('is', 'VBZ'), ('great', 'JJ')]

Filtering insignificant words can be a good complement to stopword filtering for purposes such as search engine indexing and querying and text classification.
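As a rough illustration of how the two filters can complement each other, the sketch below chains tag-based filtering with a word-based stopword filter. The STOPWORDS set and the filter_stopwords() helper are hypothetical, and filter_insignificant() is restated compactly here only so that the example runs on its own:

```python
# Hypothetical stopword set for illustration; a real application would
# load a full stopword list instead.
STOPWORDS = {'the', 'is', 'a', 'was'}


def filter_insignificant(chunk, tag_suffixes=('DT', 'CC')):
    """Drop tagged words whose tag ends with any of tag_suffixes."""
    return [(w, t) for w, t in chunk if not t.endswith(tuple(tag_suffixes))]


def filter_stopwords(chunk, stopwords=STOPWORDS):
    """Drop tagged words whose lowercased word is in stopwords."""
    return [(w, t) for w, t in chunk if w.lower() not in stopwords]


tagged = [('the', 'DT'), ('movie', 'NN'), ('was', 'VBD'),
          ('terrible', 'JJ'), ('and', 'CC'), ('long', 'JJ')]
significant = filter_stopwords(filter_insignificant(tagged))
print(significant)
# prints [('movie', 'NN'), ('terrible', 'JJ'), ('long', 'JJ')]
```

Here the tag filter removes "the" and "and", while the stopword filter catches "was", whose VBD tag the default suffixes do not cover.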
