Chapter 8. Using NLTK with Other Python Libraries

In this chapter, we will explore some of the backbone libraries of Python for machine learning and natural language processing. Until now, we have used NLTK, Scikit, and genism, which had very abstract functions, and were very specific to the task in hand. Most of statistical NLP is heavily based on the vector space model, which in turn depends on basic linear algebra covered by NumPy. Also many NLP tasks, such as POS or NER tagging, are really classifiers in disguise. Some of the libraries we will discuss are heavily used in all these tasks.

The idea behind this chapter is to give you a quick overview of some the most fundamental Python libraries. This will help us understand more than just the data structure, design, and math behind some of the coolest libraries, such as NLTK and Scikit, which we have discussed in the previous chapters.

We will look at the following four libraries. I have tried to keep it short, but I highly encourage you to read in more detail about these libraries if you want Python to be a one-stop solution to most of your data science needs.

  • NumPy (Numeric Python)
  • SciPy (Scientific Python)
  • Pandas (Data manipulation)
  • Matplotlib (Visualization)

NumPy

NumPy is a Python library for dealing with numerical operations, and it's really fast. NumPy provides some of the highly optimized data structures, such as ndarrays. NumPy has many functions specially designed and optimized to perform some of the most common numeric operations. This is one of the reasons NLTK, scikit-learn, pandas, and other libraries use NumPy as a base to implement some of the algorithms. This section will give you a brief summary with running examples of NumPy. This will not just help us understand the fundamental data structures beneath NLTK and other libraries, but also give us the ability to customize some of these functionalities to our needs.

Let's start with discussion on ndarrays, how they can be used as matrices, and how easy and efficient it is to deal with matrices in NumPy.

ndarray

An ndarray is an array object that represents a multidimensional, homogeneous array of fixed-size items.

We will start with building an ndarray using an ordinary Python list:

>>>x=[1,2,5,7,3,11,14,25]
>>>import numpy as np
>>>np_arr=np.array(x)
>>>np_arr

As you can see, this is a linear 1D array. The real power of Numpy comes with 2D arrays. Let's move to 2D arrays. We will create one using a Python list of lists.

>>>arr=[[1,2],[13,4],[33,78]]
>>>np_2darr= np.array(arr)
>>>type(np_2darr)
numpy.ndarray

Indexing

The ndarray is indexed more like Python containers. NumPy provides a slicing method to get different views of the ndarray.

>>>np_2darr.tolist()
[[1, 2], [13, 4], [33, 78]]
>>>np_2darr[:]
array([[1, 2], [13,  4], [33, 78]])
>>>np_2darr[:2]
array([[1, 2], [13, 4]])
>>>np_2darr[:1]
array([[1, 2]])
>>>np_2darr[2]
array([33, 78])
>>>    np_2darr[2][0]
>>>33
>>>    np_2darr[:-1]
array([[1, 2], [13, 4]])

Basic operations

NumPy also has some other operations that can be used in various numeric processing. In this example, we want to get an array with values ranging from 0 to 10 with a step size of 0.1. This is typically required for any optimization routine. Some of the most common libraries, such as Scikit and NLTK, actually use these NumPy functions.

>>>>import numpy as np
>>>>np.arange(0.0, 1.0, 0.1)
array([ 0. ,  0.1,  0.2,  0.3,  0.4,  0.5,  0.6,  0.7,  0.8,  0.9,  1]

We can do something like this, and generate a array with all ones and all zeros:

>>>np.ones([2, 4])
array([[1., 1., 1., 1.], [1., 1., 1., 1.]])
>>>np.zeros([3,4])
array([[0., 0., 0., 0.], [0., 0., 0., 0.], [0., 0., 0., 0.]])

Wow!

If you have done higher school math, you know that we need all these matrixes to perform many algebraic operations. And guess what, most of the Python machine learning libraries also do that!

>>>np.linspace(0, 2, 10)
array([    0.,    0.22222222,    0.44444444,    0.66666667,    0.88888889,    1.11111111,    1.33333333,    1.55555556,    1.77777778,    2,    ])

The linespace function returns number samples which are evenly spaced, calculated over the interval from the start and end values. In the given example we were trying to get 10 sample in the range of 0 to 2.

Similarly, we can do this at the log scale. The function here is:

>>>np.logspace(0,1)
array([    1.,    1.04811313,    1.09854114,    1.1513954,    7.90604321,    8.28642773,    8.68511374,    9.10298178,    9.54095476,    10.,    ])

You can still execute Python's help function to get more details about the parameter and the return values.

>>>help(np.logspace)
Help on function logspace in module NumPy.core.function_base:

logspace(start, stop, num=50, endpoint=True, base=10.0)
    Return numbers spaced evenly on a log scale.
    
    In linear space, the sequence starts at ``base ** start``
    (`base` to the power of `start`) and ends with ``base ** stop``
    (see `endpoint` below).
    
    Parameters
    ----------
    start : float

So we have to provide the start and end and the number of samples we want on the scale; in this case, we also have to provide a base.

Extracting data from an array

We can do all sorts of manipulation and filtering on the ndarrays. Let's start with a new Ndarray, A:

>>>A = array([[0, 0, 0], [0, 1, 2], [0, 2, 4], [0, 3, 6]])

>>>B = np.array([n for n in range n for n in range(4)])
>>>B
array([0, 1, 2, 3])

We can do this kind of conditional operation, and it's very elegant. We can observe this in the following example:

>>>less_than_3 = B<3 # we are filtering the items that are less than 3.
>>>less_than_3
array([ True,  True,  True, False], dtype=bool)
>>>B[less_than_3]
array([0, 1, 2])

We can also assign a value to all these values, as follows:

>>>B[less_than_3] = 0
>>>: B
array([0, 0, 0, 3])

There is a way to get the diagonal of the given matrix. Let's get the diagonal for our matrix A:

>>>np.diag(A)
array([0, 1, 4])

Complex matrix operations

One of the common matrix operations is element-wise multiplication, where we will multiply one element of a matrix by an element of another matrix. The shape of the resultant matrix will be same as the input matrix, for example:

>>>A = np.array([[1,2],[3,4]])
>>>A * A
array([[ 1,  4], [ 9, 16]])

Note

However, we can't perform the following operation, which will throw an error when executed:

>>>A * B
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-53-e2f71f566704> in <module>()
----> 1 A*B

ValueError: Operands could not be broadcast together with shapes (2,2) (4,).

Simply, the numbers of columns of the first operand have to match the number of rows in the second operand for matrix multiplication to work.

Let's do the dot product, which is the backbone of many optimization and algebraic operations. I still feel doing this in a traditional environment was not very efficient. Let's see how easy it is in NumPy, and how super-efficient it is in terms of memory.

>>>np.dot(A, A)
array([[ 7, 10], [15, 22]])

We can do operations like add, subtract, and transpose, as shown in the following example:

>>>A - A
array([[0, 0], [0, 0]])
>>>A + A
array([[2, 4], [6, 8]])
>>>np.transpose(A)
array([[1, 3], [2, 4]])
>>>>A
array([[1, 2], [2, 3]])

The same transpose operations can be performed using an alternative operation, such as this:

>>>A.T
array([[1, 3], [2, 4]])

We can also cast these ndarrays into a matrix and perform matrix operations, as shown in the following example:

>>>M = np.matrix(A)
>>>M
matrix([[1, 2], [3, 4]])
>>> np.conjugate(M)
matrix([[1, 2], [3, 4]])
>>> np.invert(M)
matrix([[-2, -3], [-4, -5]])

We can perform all sorts of complex matrix operations with NumPy, and they are pretty simple to use too! Please have a look at documentation for more information on NumPy.

Let's switch back to some of the common mathematics operations, such as min, max, mean, and standard deviation, for the given array elements. We have generated the normal distributed random numbers. Let's see how these things can be applied there:

>>>N = np.random.randn(1,10)
>>>N
array([[    0.59238571,    -0.22224549,    0.6753678,    0.48092087,    -0.37402105,    -0.54067842,    0.11445297,    -0.02483442,    -0.83847935,    0.03480181,    ]])
>>>N.mean()
-0.010232957191371551
>>>N.std()
0.47295594072935421

This was an example demonstrating how NumPy can be used to perform simple mathematic and algebraic operations of finding out the mean and standard deviation of a set of numbers.

Reshaping and stacking

In case of some of the numeric, algebraic operations we do need to change the shape of resultant matrix based on the input matrices. NumPy has some of the easiest ways of reshaping and stacking the matrix in whichever way you want.

>>>A
array([[1, 2], [3, 4]])

If we want a flat matrix, we just need to reshape it using NumPy's reshape() function:

>>>>(r, c) = A.shape  # r is rows and c is columns
>>>>r,c
(2L, 2L)
>>>>A.reshape((1, r * c))
array([[1, 2, 3, 4]])

This kind of reshaping is required in many algebraic operations. To flatten the ndarray, we can use the flatten() function:

>>>A.flatten()
array([1, 2, 3, 4])

There is a function to repeat the same elements of the given array. We need to just specify the number of times we want the element to repeat. To repeat the ndarray, we can use the repeat() function:

>>>np.repeat(A, 2)
array([1, 1, 2, 2, 3, 3, 4, 4])
>>>>A
array([[1, 2],[3, 4]])

In the preceding example, each element is repeated twice in sequence. A similar function known as tile() is used for for repeating the matrix, and is shown here:

>>>np.tile(A, 4)
array([[1, 2, 1, 2, 1, 2, 1, 2], [3, 4, 3, 4, 3, 4, 3, 4]])

There are also ways to add a row or a column to the matrix. If we want to add a row, we use the concatenate() function shown here:

>>>B = np.array([[5, 6]])
>>>np.concatenate((A, B), axis=0)
array([[1, 2], [3, 4], [5, 6]])

This can also be achieved using the Vstack() function shown here:

>>>np.vstack((A, B))
array([[1, 2], [3, 4], [5, 6]])

Also, if you want to add a column, you can use the concatenate() function in the following manner:

>>>np.concatenate((A, B.T), axis=1)
array([[1, 2, 5], [3, 4, 6]])

Tip

Alternatively, the hstack() function can be used to add columns. This is used very similarly to the vstack() function in the example shown above.

Random numbers

Random number generation is also used across many tasks involving NLP and machine learning tasks. Let's see how easy it is to get a random sample:

>>>from numpy import random
>>>#uniform random number from [0,1]
>>>random.rand(2, 5)
array([[ 0.82787406, 0.21619509, 0.24551583, 0.91357419, 0.39644969], [ 0.91684427, 0.34859763, 0.87096617, 0.31916835, 0.09999382]])

There is one more function called random.randn(), which generates normally distributed random numbers in the given range. So, in the following example, we want random numbers between 2 and 5.

>>>>random.randn(2, 5)
array([[-0.59998393, -0.98022613, -0.52050449, 0.73075943, -0.62518516], [ 1.00288355, -0.89613323,  0.59240039, -0.89803825, 0.11106479]])

This is achieved by using the function random.randn(2,5).

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset