Spelling correction with Enchant

Replacing repeating characters is actually an extreme form of spelling correction. In this recipe, we will take on the less extreme case of correcting minor spelling issues using Enchant, a spelling correction API.

Getting ready

You will need to install Enchant and a dictionary for it to use. Enchant is an offshoot of the AbiWord open source word processor, and more information on it can be found at http://www.abisource.com/projects/enchant/.

For dictionaries, Aspell is a good open source spellchecker and dictionary that can be found at http://aspell.net/.

Finally, you will need the PyEnchant library, which can be found at the following link: http://pythonhosted.org/pyenchant/

You should be able to install it with the easy_install command that comes with Python setuptools, such as by typing sudo easy_install pyenchant on Linux or Unix; on current Python installations, pip install pyenchant does the same job. On a Mac, PyEnchant may be difficult to install. If you have difficulties, consult http://pythonhosted.org/pyenchant/download.html.

How to do it...

We will create a new class called SpellingReplacer in replacers.py, and this time, the replace() method will check Enchant to see whether the word is valid. If not, we will look up the suggested alternatives and return the best match using nltk.metrics.edit_distance():

import enchant
from nltk.metrics import edit_distance

class SpellingReplacer(object):
  def __init__(self, dict_name='en', max_dist=2):
    self.spell_dict = enchant.Dict(dict_name)
    self.max_dist = max_dist

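  def replace(self, word):
    # A word the dictionary already knows needs no correction.
    if self.spell_dict.check(word):
      return word
    suggestions = self.spell_dict.suggest(word)

    # Accept the top suggestion only if it is close enough to the original.
    if suggestions and edit_distance(word, suggestions[0]) <= self.max_dist:
      return suggestions[0]
    else:
      return word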

The preceding class can be used to correct English spellings, as follows:

>>> from replacers import SpellingReplacer
>>> replacer = SpellingReplacer()
>>> replacer.replace('cookbok')
'cookbook'
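
In practice, you will usually correct a whole list of tokens rather than a single word. The following is a minimal sketch of that pattern. Note that replace_tokens() and LookupReplacer are hypothetical names introduced only for this illustration; LookupReplacer is a stand-in with the same replace() method, so the example runs even without Enchant installed. With Enchant available, you would pass a SpellingReplacer instance instead:

```python
class LookupReplacer:
    """Hypothetical stand-in for SpellingReplacer (not part of replacers.py).

    It only exposes the same replace(word) method, backed by a fixed
    mapping of misspellings to corrections.
    """

    def __init__(self, fixes):
        self.fixes = fixes

    def replace(self, word):
        # Return the known correction, or the word itself if there is none.
        return self.fixes.get(word, word)


def replace_tokens(tokens, replacer):
    """Apply replacer.replace() to every token and return a new list."""
    return [replacer.replace(token) for token in tokens]


stand_in = LookupReplacer({'cookbok': 'cookbook'})
print(replace_tokens(['a', 'python', 'cookbok'], stand_in))
# prints ['a', 'python', 'cookbook']
```

Because replace_tokens() relies only on the replace() method, any of the replacer classes in this chapter should work with it.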

How it works...

The SpellingReplacer class starts by creating a reference to an Enchant dictionary. Then, in the replace() method, it first checks whether the given word is present in the dictionary. If it is, no spelling correction is necessary and the word is returned unchanged. If the word is not found, it looks up a list of suggestions and returns the first suggestion, as long as its edit distance from the original word is less than or equal to max_dist; otherwise, the original word is returned. The edit distance is the number of single-character changes (insertions, deletions, or substitutions) necessary to transform the given word into the suggested word. The max_dist value thus acts as a constraint on the output of the Enchant suggest() function, ensuring that no unlikely replacement words are returned. Here is an example showing all the suggestions for languege, a misspelling of language:

>>> import enchant
>>> d = enchant.Dict('en')
>>> d.suggest('languege')
['language', 'languages', 'languor', "language's"]

Only the correct suggestion, language, is a single edit away from languege. The others are farther away: languages is two edits away, while languor and language's are three each. You can check distances like these yourself with the following code:

>>> from nltk.metrics import edit_distance
>>> edit_distance('language', 'languege')
1
>>> edit_distance('languor', 'languege')
3
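
To make the idea of edit distance more concrete, here is a minimal pure-Python sketch of the Levenshtein distance, which counts single-character insertions, deletions, and substitutions. The function name levenshtein() is our own and is not part of NLTK; as far as we know, nltk.metrics.edit_distance() with its default arguments computes the same quantity, but treat that as an assumption rather than a guarantee:

```python
def levenshtein(a, b):
    """Return the number of single-character insertions, deletions,
    and substitutions needed to turn string a into string b."""
    # previous[j] is the distance between the prefix of a processed so far
    # and the first j characters of b. Start with the empty prefix of a.
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        # Turning the first i characters of a into '' takes i deletions.
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute, or keep a match
            ))
        previous = current
    return previous[-1]


print(levenshtein('cookbok', 'cookbook'))   # prints 1
print(levenshtein('languege', 'language'))  # prints 1
print(levenshtein('languege', 'languor'))   # prints 3
```

With the default max_dist of 2, a first suggestion at distance 1 is accepted, while one at distance 3 would be rejected and the original word returned.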

There's more...

You can use language dictionaries other than en, such as en_GB, assuming the dictionary has already been installed. To check which other languages are available, use enchant.list_languages():

>>> enchant.list_languages()
['en', 'en_CA', 'en_GB', 'en_US']

Tip

If you try to use a dictionary that doesn't exist, you will get enchant.DictNotFoundError. You can first check whether the dictionary exists using enchant.dict_exists(), which will return True if the named dictionary exists, or False otherwise.
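
One way to apply this check is to walk through a list of preferred dictionary names and use the first one that exists. The sketch below is only an illustration: first_available() is a hypothetical helper, and it takes the existence check as a parameter so that the example runs without Enchant installed. With PyEnchant, you would pass enchant.dict_exists as the second argument:

```python
def first_available(tags, exists):
    """Return the first tag for which exists(tag) is true, or None."""
    for tag in tags:
        if exists(tag):
            return tag
    return None


# Stand-in for the installed dictionaries; with PyEnchant you could call
# first_available(['en_AU', 'en_GB', 'en'], enchant.dict_exists) instead.
installed = {'en', 'en_CA', 'en_GB', 'en_US'}

print(first_available(['en_AU', 'en_GB', 'en'], installed.__contains__))
# prints en_GB
print(first_available(['fr_FR'], installed.__contains__))
# prints None
```

Returning None rather than raising leaves it to the caller to decide whether a missing dictionary is an error.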

The en_GB dictionary

Always ensure that you use the correct dictionary for whichever language you are performing spelling correction on. The en_US dictionary can give you different results than en_GB, such as for the word theater. The word theater is the American English spelling whereas the British English spelling is theatre:

>>> import enchant
>>> dUS = enchant.Dict('en_US')
>>> dUS.check('theater')
True
>>> dGB = enchant.Dict('en_GB')
>>> dGB.check('theater')
False
>>> from replacers import SpellingReplacer
>>> us_replacer = SpellingReplacer('en_US')
>>> us_replacer.replace('theater')
'theater'
>>> gb_replacer = SpellingReplacer('en_GB')
>>> gb_replacer.replace('theater')
'theatre'

Personal word lists

Enchant also supports personal word lists. These can be combined with an existing dictionary, allowing you to augment the dictionary with your own words. So, let's say you had a file named mywords.txt that had nltk on one line. You could then create a dictionary augmented with your personal word list as follows:

>>> d = enchant.Dict('en_US')
>>> d.check('nltk')
False
>>> d = enchant.DictWithPWL('en_US', 'mywords.txt')
>>> d.check('nltk')
True
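
The word list itself is just a plain text file with one word per line. As a minimal sketch, the following code writes such a file into a temporary directory. The helper name write_word_list() and the second word in the list are our own choices for illustration; the resulting path is what you would pass as the second argument to enchant.DictWithPWL():

```python
import os
import tempfile


def write_word_list(path, words):
    """Write one word per line, the layout described for mywords.txt."""
    with open(path, 'w', encoding='utf-8') as f:
        for word in words:
            f.write(word + '\n')


# Use a temporary directory so the example does not clutter the working one.
path = os.path.join(tempfile.mkdtemp(), 'mywords.txt')
write_word_list(path, ['nltk', 'pyenchant'])

with open(path, encoding='utf-8') as f:
    print(f.read().splitlines())
# prints ['nltk', 'pyenchant']

# With PyEnchant installed: d = enchant.DictWithPWL('en_US', path)
```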

To use an augmented dictionary with our SpellingReplacer class, we can create a subclass in replacers.py that takes an existing spelling dictionary:

class CustomSpellingReplacer(SpellingReplacer):
  def __init__(self, spell_dict, max_dist=2):
    self.spell_dict = spell_dict
    self.max_dist = max_dist

This CustomSpellingReplacer class will not replace any words that you put into mywords.txt:

>>> from replacers import CustomSpellingReplacer
>>> d = enchant.DictWithPWL('en_US', 'mywords.txt')
>>> replacer = CustomSpellingReplacer(d)
>>> replacer.replace('nltk')
'nltk'

See also

The previous recipe covered an extreme form of spelling correction by replacing repeating characters. You can also perform spelling correction by simple word replacement as discussed in the next recipe.
