Storing a conditional frequency distribution in Redis

The nltk.probability.ConditionalFreqDist class is a container for FreqDist instances, with one FreqDist per condition. It counts frequencies that depend on a condition, such as another word or a class label. We used this class in the Calculating high information words recipe in Chapter 7, Text Classification. Here, we'll create an API-compatible class on top of Redis using the RedisHashFreqDist from the previous recipe.

Getting ready

As in the previous recipe, you'll need to have Redis and redis-py installed with an instance of redis-server running.

How to do it...

We define a RedisConditionalHashFreqDist class in redisprob.py that extends nltk.probability.ConditionalFreqDist and overrides the __getitem__() method. We override __getitem__() so we can create an instance of RedisHashFreqDist instead of a FreqDist:

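from nltk.probability import ConditionalFreqDist
from rediscollections import encode_key

# RedisHashFreqDist is the class from the previous recipe; it is assumed to be
# defined earlier in this same redisprob.py module.

class RedisConditionalHashFreqDist(ConditionalFreqDist):
  def __init__(self, r, name, cond_samples=None):
    self._r = r
    self._name = name
    ConditionalFreqDist.__init__(self, cond_samples)

    # Find existing hashes named '<name>:<condition>' and load each condition.
    for key in self._r.keys(encode_key('%s:*' % name)):
      # redis-py may return keys as bytes (the Python 3 default).
      if isinstance(key, bytes):
        key = key.decode()
      condition = key.split(':')[1]
      self[condition] # calls self.__getitem__(condition)

  def __getitem__(self, condition):
    # Create the Redis-backed FreqDist the first time a condition is accessed.
    if condition not in self:
      key = '%s:%s' % (self._name, condition)
      val = RedisHashFreqDist(self._r, key)
      super(RedisConditionalHashFreqDist, self).__setitem__(condition, val)

    return super(RedisConditionalHashFreqDist, self).__getitem__(condition)

  def clear(self):
    # Empty every per-condition RedisHashFreqDist.
    for fdist in self.values():
      fdist.clear()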

An instance of this class can be created by passing in a Redis connection and a base name. After that, it works just like a ConditionalFreqDist:

>>> from redis import Redis
>>> from redisprob import RedisConditionalHashFreqDist
>>> r = Redis()
>>> rchfd = RedisConditionalHashFreqDist(r, 'condhash')
>>> rchfd.N()
0
>>> rchfd.conditions()
[]

>>> rchfd['cond1']['foo'] += 1
>>> rchfd.N()
1
>>> rchfd['cond1']['foo']
1
>>> rchfd.conditions()
['cond1']
>>> rchfd.clear()

How it works...

The RedisConditionalHashFreqDist uses name prefixes to reference RedisHashFreqDist instances. The name passed into the RedisConditionalHashFreqDist is a base name that is combined with each condition to create a unique name for each RedisHashFreqDist. For example, if the base name of the RedisConditionalHashFreqDist is 'condhash', and the condition is 'cond1', then the final name for the RedisHashFreqDist is 'condhash:cond1'. This naming pattern is used at initialization to find all the existing hash maps using the keys command. By searching for all keys matching 'condhash:*', we can identify all the existing conditions and create an instance of RedisHashFreqDist for each.

Combining strings with colons is a common naming convention for Redis keys as a way to define namespaces. In our case, each RedisConditionalHashFreqDist instance defines a single namespace of hash maps.
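The naming scheme can be tried out without a Redis server. The following is a minimal sketch, not the recipe's code: make_key() and condition_from_key() are hypothetical helpers, the key list is made up for illustration, and fnmatchcase() only approximates the glob matching that the keys command performs.

```python
from fnmatch import fnmatchcase

def make_key(base_name, condition):
    """Join the base name and a condition into one namespaced key."""
    return '%s:%s' % (base_name, condition)

def condition_from_key(key):
    """Recover the condition from a key, accepting bytes or str."""
    if isinstance(key, bytes):
        key = key.decode('utf-8')
    # Split once so a condition that itself contains ':' survives intact.
    return key.split(':', 1)[1]

# Hypothetical keys, standing in for what the keys command might return.
existing_keys = [b'condhash:cond1', b'condhash:cond2', b'otherhash:cond9']

# Keep only the keys in the 'condhash' namespace.
matching = [k for k in existing_keys
            if fnmatchcase(k.decode('utf-8'), 'condhash:*')]
conditions = [condition_from_key(k) for k in matching]
print(conditions)  # ['cond1', 'cond2']
```

Note that this sketch splits on the first colon only, whereas the recipe's split(':')[1] would truncate a condition that contains a colon. Either approach assumes the base name itself has no colon in it.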

There's more...

RedisConditionalHashFreqDist also defines a clear() method. This is a helper method that calls clear() on all the internal RedisHashFreqDist instances. The clear() method is not defined in ConditionalFreqDist.
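To see the lazy-creation and clear() pattern in isolation, here is a rough in-memory sketch. LazyConditionalCounts is a hypothetical stand-in that uses a plain dict and collections.Counter in place of ConditionalFreqDist and RedisHashFreqDist, so it illustrates the shape of the class rather than its Redis behavior.

```python
from collections import Counter

class LazyConditionalCounts(dict):
    """In-memory stand-in: one Counter per condition, created on first access."""

    def __getitem__(self, condition):
        if condition not in self:
            # The recipe builds a RedisHashFreqDist at this point;
            # a Counter plays that role in this sketch.
            super().__setitem__(condition, Counter())
        return super().__getitem__(condition)

    def clear(self):
        # Empty each per-condition container, leaving the conditions in place.
        for counts in self.values():
            counts.clear()

counts = LazyConditionalCounts()
counts['cond1']['foo'] += 1
counts['cond2']['bar'] += 2

total_before = sum(sum(c.values()) for c in counts.values())
counts.clear()
total_after = sum(sum(c.values()) for c in counts.values())

print(total_before, total_after, sorted(counts))  # 3 0 ['cond1', 'cond2']
```

As in the recipe's clear(), only the per-condition counts are emptied; the conditions themselves remain as keys of the container.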

See also

The previous recipe covers RedisHashFreqDist in detail. Also, see the Calculating high information words recipe in Chapter 7, Text Classification, for example usage of ConditionalFreqDist.
