SciPy

Scientific Python, or SciPy, is a framework built on top of NumPy and its ndarray data structure, developed for advanced scientific operations such as optimization, integration, linear algebra, and Fourier transforms.

The idea was to use ndarrays efficiently to provide these common scientific algorithms in a memory-efficient manner. Because of NumPy and SciPy, the community can focus on writing libraries such as scikit-learn and NLTK, which address domain-specific problems, while NumPy / SciPy do the heavy lifting. We will give you a brief overview of the data structures and common operations provided in SciPy. This will also help us get into the details of black-box libraries, such as scikit-learn, and understand what goes on behind the scenes.

>>>import scipy as sp

This is how you import SciPy. Here, sp is used as an alias, but you can just as well import scipy without one.

Let's start with something we are already familiar with: integration, using the quad() function.

>>>from scipy.integrate import quad, dblquad, tplquad
>>>def f(x):
>>>     return x
>>>x_lower = 0 # the lower limit of x
>>>x_upper = 1 # the upper limit of x
>>>val, abserr = quad(f, x_lower, x_upper)
>>>print val, abserr
0.5 5.55111512313e-15

If we integrate f(x) = x, the antiderivative is x²/2, which evaluated from 0 to 1 gives 0.5, matching the output above.
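The other quadrature routines imported above follow the same pattern. Here is a minimal sketch using dblquad (the integrand x*y over the unit square is our own illustrative choice; note that dblquad expects the inner variable as the first argument and the inner limits as callables):

>>>def g(y, x):   # inner variable y comes first for dblquad
>>>    return x * y
>>>val, abserr = dblquad(g, 0, 1, lambda x: 0, lambda x: 1)
>>>print val
0.25

Analytically, the integral of x*y over the unit square is (1/2) * (1/2) = 0.25, which matches.

There are other scientific functions, such as these: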

  • Interpolation (scipy.interpolate)
  • Fourier transforms (scipy.fftpack)
  • Signal processing (scipy.signal)

But we will focus only on linear algebra and optimization, because these are more relevant in the context of machine learning and NLP.

Linear algebra

The linear algebra module contains a lot of matrix-related functions. Probably the best contribution of SciPy is the sparse matrix (the CSR matrix), which is used heavily in other packages for matrix manipulation.

SciPy provides one of the best ways of storing sparse matrices and doing data manipulation on them. It also provides common operations, such as solving linear equations and computing eigenvalues and eigenvectors, matrix functions (for example, matrix exponentiation), and more complex operations such as decompositions (SVD). Some of these are the behind-the-scenes optimizations in our ML routines. For example, SVD is at the core of LSA (latent semantic analysis), one of the simplest forms of topic modeling, which we used in Chapter 6, Text Classification.

The following is an example showing how the linear algebra module can be used:

>>>import scipy as sp
>>>from scipy import linalg as LA
>>>A = sp.rand(2, 2)   # a random 2 x 2 matrix
>>>B = sp.rand(2, 2)
>>>X = LA.solve(A, B)  # solve A X = B for X
>>>sp.dot(A, B)        # matrix product of A and B
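The decompositions mentioned earlier are also available here. The following is a minimal SVD sketch on the same matrix A (the names U, s, and Vh are our own; LA.svd returns the left singular vectors, the singular values, and the right singular vectors):

>>>import numpy as np
>>>U, s, Vh = LA.svd(A)   # factor A as U * diag(s) * Vh
>>>np.allclose(A, np.dot(U, np.dot(np.diag(s), Vh)))   # reconstruct A
True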

Note

Detailed documentation is available at http://docs.scipy.org/doc/scipy/reference/linalg.html.

Eigenvalues and eigenvectors

In some NLP and machine learning applications, we represent documents as term-document matrices. Eigenvalues and eigenvectors are calculated in many different mathematical formulations. Say A is our matrix, and there exists a vector v such that Av = λv.

In this case, λ will be our eigenvalue and v will be our eigenvector. One of the most commonly used operations, the singular value decomposition (SVD), relies on this kind of eigen computation. It's quite simple to achieve this in SciPy.

>>>evals = LA.eigvals(A)
>>>evals
array([-0.32153198+0.j, 1.40510412+0.j])

And the eigenvectors are as follows:

>>>evals, evect = LA.eig(A)
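As a quick sanity check of the definition Av = λv (a sketch; it assumes LA.eig's convention of returning the eigenvectors as the columns of evect, which it follows):

>>>import numpy as np
>>>v = evect[:, 0]                            # first eigenvector
>>>np.allclose(np.dot(A, v), evals[0] * v)    # does Av equal lambda * v?
True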

We can perform other matrix operations, such as inverse, transpose, and determinant:

>>>LA.inv(A)
array([[-1.24454719,  1.97474827],
       [ 1.84807676, -1.15387236]])
>>>LA.det(A)
-0.4517859060209965

The sparse matrix

In a real-world scenario, most of the elements of a typical matrix are zeroes. It is highly inefficient to iterate over all these zero elements for any matrix operation. As a solution to this problem, sparse matrix formats have been introduced, with the simple idea of storing only the non-zero items.

A matrix in which most of the elements are non-zero is called a dense matrix, and one in which most of the elements are zero is called a sparse matrix.
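To make the storage saving concrete, here is a minimal sketch (the 1000 x 1000 identity matrix is just an illustrative, extremely sparse matrix):

>>>import numpy as np
>>>from scipy import sparse
>>>D = np.eye(1000)           # only 1,000 of the 1,000,000 entries are non-zero
>>>D.nbytes                   # dense storage: 8 bytes per float64 element
8000000
>>>S = sparse.csr_matrix(D)
>>>S.data.nbytes              # only the non-zero values are stored
8000

The sparse version also keeps two small index arrays, but the overall footprint is still a tiny fraction of the dense one.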

A matrix is typically a 2D array in which a row and column index pair identifies each element. There are different ways in which we can store sparse matrices, listed here (a short construction example follows the list):

  • DOK (Dictionary of keys): Here, we store a dictionary whose keys are (row, col) pairs and whose values are the corresponding non-zero elements.
  • LIL (list of lists): Here, we store one list per row, holding only the (column index, value) pairs of the non-zero elements.
  • COO (Coordinate list): Here, a list of (row, col, value) tuples is stored.
  • CRS/CSR (Compressed Row Storage): A CSR matrix reads values first by row; a column index is stored for each value, and row pointers are stored (val, col_ind, row_ptr). Here, val is an array of the non-zero values of the matrix, col_ind holds the column index corresponding to each value, and row_ptr is the list of val indexes where each row starts. The name is based on the fact that row index information is compressed relative to the COO format. This format is efficient for arithmetic operations, row slicing, and matrix-vector products.
  • CSC (Compressed Sparse Column): This is similar to CSR, except that the values are read first by column; a row index is stored for each value, and column pointers are stored. In other words, CSC is (val, row_ind, col_ptr).
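As the construction example promised above, here is a minimal sketch of building a sparse matrix from (row, col, value) triplets (the rows, cols, and vals names are our own); each format has a corresponding constructor in scipy.sparse:

>>>from scipy import sparse
>>>rows = [0, 1, 2]
>>>cols = [0, 1, 2]
>>>vals = [1, 2, 3]
>>>M = sparse.coo_matrix((vals, (rows, cols)), shape=(3, 3))  # COO from triplets
>>>M.tocsr()    # convert to CSR for fast arithmetic and row slicing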

Let's have some hands-on experience with CSR matrix manipulation. We have a sparse matrix A:

>>>from numpy import array
>>>from scipy import sparse
>>>A = array([[1,0,0],[0,2,0],[0,0,3]])
>>>A
array([[1, 0, 0],
       [0, 2, 0],
       [0, 0, 3]])
>>>C = sparse.csr_matrix(A)
>>>C
<3x3 sparse matrix of type '<type 'numpy.int32'>'
    with 3 stored elements in Compressed Sparse Row format>

If you look carefully, the CSR matrix stored just three elements: the non-zero values. Let's verify what it stored:

>>>C.toarray()
array([[1, 0, 0],
       [0, 2, 0],
       [0, 0, 3]])
>>>C * C.todense()
matrix([[1, 0, 0],
        [0, 4, 0],
        [0, 0, 9]])

This is exactly what we are looking for: without iterating over all the zeroes, we still got the same result with the CSR matrix. Multiplying two sparse matrices directly works the same way:

>>>C.dot(C).todense()
matrix([[1, 0, 0],
        [0, 4, 0],
        [0, 0, 9]])
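To connect this back to the (val, col_ind, row_ptr) layout described earlier, CSR exposes the stored arrays as attributes (data, indices, and indptr are scipy's actual attribute names; the integer dtype may vary by platform):

>>>C.data        # val: the non-zero values
array([1, 2, 3])
>>>C.indices     # col_ind: the column index of each value
array([0, 1, 2], dtype=int32)
>>>C.indptr      # row_ptr: where each row starts in data
array([0, 1, 2, 3], dtype=int32)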

Optimization

I hope you appreciate that every time we built a classifier or a tagger, some sort of optimization routine was running in the background. Let's get a basic understanding of the functions SciPy provides for this. We will start with finding the minimum of a given polynomial. Let's jump to one of the example snippets of the optimization routines provided by SciPy.

>>>from scipy import optimize
>>>def f(x):
>>>    return x**2 - 4
>>>optimize.fmin_bfgs(f, 0)
Optimization terminated successfully.
         Current function value: -4.000000
         Iterations: 0
         Function evaluations: 3
         Gradient evaluations: 1
array([0])

Here, the first argument is the function you want the minimum of, and the second is the initial guess for the minimum. In this example, we already knew that zero would be the minimum. To get more details, use the help() function, as shown here:

>>>help(optimize.fmin_bfgs)
Help on function fmin_bfgs in module scipy.optimize.optimize:

fmin_bfgs(f, x0, fprime=None, args=(), gtol=1e-05, norm=inf, epsilon=1.4901161193847656e-08, maxiter=None, full_output=0, disp=1, retall=0, callback=None)
    Minimize a function using the BFGS algorithm.
    
    Parameters
    ----------
    f : callable f(x,*args)
        Objective function to be minimized.
    x0 : ndarray
        Initial guess.
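For a case where the minimum is not at the starting guess, here is a minimal sketch (the shifted parabola g is our own illustrative function, and the printed digits may vary slightly):

>>>def g(x):
>>>    return (x - 3) ** 2 + 1
>>>optimize.fmin_bfgs(g, 0, disp=0)   # disp=0 suppresses the convergence log
array([ 2.99999999])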
Unlike fmin_bfgs, which searches for a minimum, fsolve finds the roots of f(x) = 0. For our f, the roots are ±2, and starting near zero we land on the positive one:

>>>from scipy import optimize
>>>optimize.fsolve(f, 0.2)
array([ 2.])

Extra arguments are passed to the function through the args tuple:

>>>def f1(x, y):
>>>    return x ** 2 + y ** 2 - 4
>>>optimize.fsolve(f1, 0.2, args=(0,))
array([ 2.])

To summarize, we now have enough knowledge of SciPy's most basic data structures and some of the most common optimization techniques. The intention was to motivate you to not just run black-box machine learning or natural language processing routines, but to go beyond that: to learn the mathematical context of the ML algorithms you are using, and to have a look at the source code and try to understand it.

Doing this will not just improve your understanding of the algorithms, but will also allow you to optimize/customize the implementations to your needs.
