Replacing synonyms

It is often useful to reduce the vocabulary of a text by replacing words with common synonyms. By compressing the vocabulary without losing meaning, you can save memory in tasks such as frequency analysis and text indexing. More details about these topics are available at https://en.wikipedia.org/wiki/Frequency_analysis and https://en.wikipedia.org/wiki/Full_text_search. Vocabulary reduction can also increase the occurrence of significant collocations, a topic covered in the Discovering word collocations recipe of Chapter 1, Tokenizing Text and WordNet Basics.

Getting ready

You will need a defined mapping of a word to its synonym. This is a simple controlled vocabulary. We will start by hardcoding the synonyms as a Python dictionary, and then explore other options to store synonym maps.

How to do it...

We'll first create a WordReplacer class in replacers.py that takes a word replacement mapping:

class WordReplacer(object):
  def __init__(self, word_map):
    self.word_map = word_map

  def replace(self, word):
    return self.word_map.get(word, word)

Then, we can demonstrate its usage for simple word replacement:

>>> from replacers import WordReplacer
>>> replacer = WordReplacer({'bday': 'birthday'})
>>> replacer.replace('bday')
'birthday'
>>> replacer.replace('happy')
'happy'

How it works...

The WordReplacer class is simply a class wrapper around a Python dictionary. The replace() method looks up the given word in its word_map dictionary and returns the replacement synonym if it exists. Otherwise, the given word is returned as is.
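To see how this lookup reduces a vocabulary, here is a minimal sketch that applies replace() to every token in a list and compares the distinct word counts before and after. The token list is made up for illustration, and WordReplacer is repeated so that the snippet runs on its own:

```python
from collections import Counter

class WordReplacer(object):
  def __init__(self, word_map):
    self.word_map = word_map

  def replace(self, word):
    return self.word_map.get(word, word)

# Hypothetical token list and synonym map, chosen only for illustration.
tokens = ['happy', 'bday', 'to', 'you', 'happy', 'birthday']
replacer = WordReplacer({'bday': 'birthday'})

# Replace each token with its synonym, if one is defined.
reduced = [replacer.replace(token) for token in tokens]

before = Counter(tokens)   # counts 'bday' and 'birthday' separately
after = Counter(reduced)   # counts both under 'birthday'

print(len(before), len(after))
print(after['birthday'])
```

Because 'bday' and 'birthday' collapse into one entry, the frequency table has one fewer key and the count for 'birthday' reflects both spellings.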

If you were only using the word_map dictionary, you wouldn't need the WordReplacer class and could instead call word_map.get() directly. However, WordReplacer can act as a base class for other classes that construct the word_map dictionary from various file formats. Read on for more information.

There's more...

Hardcoding synonyms in a Python dictionary is not a good long-term solution. Two better alternatives are to store the synonyms in a CSV file or in a YAML file. Choose whichever format is easiest for those who maintain your synonym vocabulary. Both of the classes outlined in the following sections inherit the replace() method from WordReplacer.

CSV synonym replacement

The CsvWordReplacer class extends WordReplacer in replacers.py in order to construct the word_map dictionary from a CSV file:

import csv

class CsvWordReplacer(WordReplacer):
  def __init__(self, fname):
    word_map = {}
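    # Use a context manager so the file is closed after reading.
    with open(fname) as f:
      for line in csv.reader(f):
        # Each row must have exactly two columns: word, synonym.
        word, syn = line
        word_map[word] = syn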
    super(CsvWordReplacer, self).__init__(word_map)

Your CSV file should consist of two columns, where the first column is the word and the second column is the synonym meant to replace it. If this file is called synonyms.csv and the first line is bday,birthday (with no space after the comma, because csv.reader would otherwise keep that space as part of the synonym), then you can perform the following:

>>> from replacers import CsvWordReplacer
>>> replacer = CsvWordReplacer('synonyms.csv')
>>> replacer.replace('bday')
'birthday'
>>> replacer.replace('happy')
'happy'
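If the people maintaining the file are likely to leave blank lines or type a space after the comma, a more forgiving loader may be worth having. The following is only a sketch under those assumptions: load_word_map() is a hypothetical helper, not part of replacers.py, and the temporary file stands in for a real synonyms.csv. It relies on the skipinitialspace option of csv.reader and skips any row that does not have exactly two columns:

```python
import csv
import os
import tempfile

def load_word_map(fname):
  # Hypothetical helper: build a word -> synonym dict from a two-column CSV.
  word_map = {}
  with open(fname, newline='') as f:
    # skipinitialspace=True drops the space that follows a comma.
    for row in csv.reader(f, skipinitialspace=True):
      if len(row) != 2:
        continue  # ignore blank or malformed rows
      word, syn = row
      word_map[word] = syn
  return word_map

# Write a small sample file (made-up contents) and load it back.
with tempfile.TemporaryDirectory() as tmpdir:
  path = os.path.join(tmpdir, 'synonyms.csv')
  with open(path, 'w', newline='') as f:
    f.write('bday, birthday\n\ncolour,color\n')
  word_map = load_word_map(path)

print(word_map)
```

The resulting dictionary can be passed straight to WordReplacer. Whether to skip malformed rows silently or raise an error is a design choice; skipping is shown here only because it keeps the sketch short.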

YAML synonym replacement

If you have PyYAML installed, you can create YamlWordReplacer in replacers.py as shown in the following:

import yaml

class YamlWordReplacer(WordReplacer):
  def __init__(self, fname):
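    # safe_load parses plain data only; recent PyYAML versions require
    # an explicit Loader argument when calling yaml.load().
    with open(fname) as f:
      word_map = yaml.safe_load(f)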
    super(YamlWordReplacer, self).__init__(word_map)

Note

Download and installation instructions for PyYAML are located at http://pyyaml.org/wiki/PyYAML. You can also type pip install pyyaml at the command prompt.

Your YAML file should be a simple mapping of word: synonym, such as bday: birthday. Note that the YAML syntax is very particular, and the space after the colon is required. If the file is named synonyms.yaml, then you can perform the following:

>>> from replacers import YamlWordReplacer
>>> replacer = YamlWordReplacer('synonyms.yaml')
>>> replacer.replace('bday')
'birthday'
>>> replacer.replace('happy')
'happy'

See also

You can use the WordReplacer class to perform any kind of word replacement, even spelling correction for more complicated words that can't be automatically corrected, as we did in the previous recipe. In the next recipe, we will cover antonym replacement.
