Working with sparse arrays

Sparse matrices are matrices whose values are mostly zeros. They occur naturally in certain kinds of data problems, such as natural language processing (NLP), counting events (such as customers' purchases), categorical data transformed into binary variables (a technique called one-hot encoding, which we will discuss in the next chapter), or even images if they have lots of black pixels.

You need to handle sparse matrices with the right tools because they represent a memory and computational challenge for most machine learning algorithms.
First of all, sparse matrices are huge (if treated as a normal matrix, they may not fit into memory), and they contain zero values in all but a few cells. Data structures optimized for sparse matrices store them efficiently because the elements valued as zero do not occupy any memory space. In a regular NumPy array (which, in contrast, we will call a dense array), every zero value occupies memory space because the array keeps track of all of its values.

In addition, sparse matrices, being large, require a lot of computation to be processed, yet most of their values are not used for any prediction. Algorithms that can take advantage of sparse matrix data structures can run in much less time than standard algorithms operating on dense matrices.
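As a minimal sketch of the memory savings (the coo_matrix format used here is introduced over the rest of this section, and the exact byte counts depend on your platform and library versions), compare a mostly-zero NumPy array with its sparse counterpart:

In: import numpy as np
from scipy import sparse

dense = np.zeros((10 ** 3, 10 ** 3))    # a 1,000x1,000 float64 array
dense[::100, ::100] = 1.0               # only 100 cells are non-zero
sparse_version = sparse.coo_matrix(dense)

# The dense array stores every cell (8,000,000 bytes for float64);
# the sparse one stores only the 100 non-zero values plus their
# row and column indexes, amounting to a few kilobytes at most
print(dense.nbytes)
print(sparse_version.data.nbytes +
      sparse_version.row.nbytes +
      sparse_version.col.nbytes)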

In Python, SciPy's sparse module offers different sparse data structures that are able to address sparse problems. More specifically, it offers seven different kinds of sparse matrices:

  • csc_matrix: Compressed Sparse Column format
  • csr_matrix: Compressed Sparse Row format
  • bsr_matrix: Block Sparse Row format
  • lil_matrix: List of Lists format
  • dok_matrix: Dictionary of Keys format
  • coo_matrix: COOrdinate format (also known as IJV, triplet format)
  • dia_matrix: DIAgonal format

Each kind of matrix features a different way to store sparse information, and each particular way affects how the matrix performs under different circumstances. We are going to illustrate each sparse matrix kind and look at which operations are fast and efficient and which ones do not perform at all. For instance, the documentation points out dok_matrix, lil_matrix, and coo_matrix as the best ones for constructing a sparse matrix from scratch. We will start with this problem, beginning with coo_matrix.


You can find all of SciPy's documentation about sparse matrices at https://docs.scipy.org/doc/scipy/reference/sparse.html.

Let's start by creating a sparse matrix:

  1. In order to create a sparse matrix, you can either generate it from a NumPy array (just by passing the array to one of SciPy's sparse matrix formats), or by providing three vectors containing row indexes, column indexes, and data values to a COO matrix, respectively:
In: import numpy as np
from scipy import sparse

row_idx = np.array([0, 1, 3, 3, 4])
col_idx = np.array([1, 2, 2, 4, 2])
values = np.array([1, 1, 2, 1, 2], dtype=float)
sparse_coo = sparse.coo_matrix((values, (row_idx, col_idx)))
sparse_coo

Out: <5x5 sparse matrix of type '<class 'numpy.float64'>'
with 5 stored elements in COOrdinate format>
  2. Displaying the COO matrix will tell you its shape and how many non-zero elements it contains. The ratio of zero elements to the total number of elements in the matrix gives you the sparsity measure, which can otherwise be computed as follows:
In: sparsity = 1.0 - (sparse_coo.count_nonzero() /
np.prod(sparse_coo.shape))
print(sparsity)

Out: 0.8

The sparsity is 0.8; that is, 80% of the matrix is actually empty.

You can investigate sparsity graphically as well by using the spy command from matplotlib. In the following example, we will create a random sparse matrix and easily represent it in graphic form to provide an idea of how much data is effectively available in the matrix:

In: import matplotlib.pyplot as plt

%matplotlib inline

large_sparse = sparse.random(10 ** 3, 10 ** 3, density=0.001, format='coo')
plt.spy(large_sparse, marker=',')
plt.show()

The resulting graph will provide you with an idea of the empty space in the matrix:

If needed, you can always convert a sparse matrix into a dense one by using the todense method: sparse_coo.todense().
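For instance, with the sparse_coo matrix created previously, either of the following conversions works; note that todense returns a numpy.matrix, whereas toarray returns a plain ndarray:

In: dense_matrix = sparse_coo.todense()   # a numpy.matrix
dense_array = sparse_coo.toarray()        # a plain numpy.ndarray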

You can try to figure out how a COO matrix is constituted by printing it:

In: print(sparse_coo)

Out: (0, 1) 1.0
(1, 2) 1.0
(3, 2) 2.0
(3, 4) 1.0
(4, 2) 2.0

From the output representation, we can figure out that a sparse coordinate format matrix works by storing the printed values in three separate storage arrays: one for the row coordinates, one for the column coordinates, and one for the values. This means that COO matrices are really fast at inserting information (each new element is a new entry in each storage array) but slow at processing it, because figuring out the values in a given row or column requires scanning all three arrays.
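You can inspect these three storage arrays directly through the row, col, and data attributes of the COO matrix:

In: print(sparse_coo.row)
print(sparse_coo.col)
print(sparse_coo.data)

Out: [0 1 3 3 4]
[1 2 2 4 2]
[1. 1. 2. 1. 2.]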

The same is true for dictionary of keys (dok) and list of lists (lil) matrices. The first operates by using a dictionary that maps coordinates to values (so it is fast at retrieving single elements); the second uses two lists, both arranged by row, one containing the non-zero column positions in each row and the other their values (which makes it easy to expand by adding more rows).
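As a brief sketch (the shapes and values here are arbitrary), both formats are straightforward to fill in incrementally:

In: # lil_matrix: grow the matrix element by element
sparse_lil = sparse.lil_matrix((5, 5))
sparse_lil[0, 1] = 1.0
sparse_lil[3, 4] = 2.0
print(sparse_lil.rows)   # per-row lists of non-zero column positions
print(sparse_lil.data)   # per-row lists of the corresponding values

# dok_matrix: fast access to single elements
sparse_dok = sparse.dok_matrix((5, 5))
sparse_dok[2, 2] = 5.0
print(sparse_dok[2, 2])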
Another advantage of COO matrices is that they can be promptly converted into other kinds of matrices that are specialized in working efficiently at the row or column level: csr and csc matrices.

Compressed sparse row (csr) and compressed sparse column (csc) are the most commonly used formats for operating on sparse matrices after they have been created. They use an indexing system that favors computations over rows for csr and over columns for csc. However, that makes editing quite computationally costly (for this reason, it is not convenient to change them after creation).
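For instance, converting the earlier COO matrix takes a single call, after which row extraction (on csr) or column extraction (on csc) is efficient:

In: sparse_csr = sparse_coo.tocsr()
sparse_csc = sparse_coo.tocsc()
row_slice = sparse_csr[3, :]   # cheap row access on csr
col_slice = sparse_csc[:, 2]   # cheap column access on csc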


The performance of csr and csc really depends on the algorithm used and how it optimizes its parameters. You have to actually try them out on your algorithm to find out which performs best.

Finally, diagonal (dia) format matrices are sparse data structures specialized for diagonal matrices, while block sparse row (bsr) format matrices resemble csr matrices in their characteristics, apart from the way they store data, which is based on entire blocks of values.
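As a quick sketch, SciPy's diags helper can build a diagonal-format matrix directly (the size and values here are arbitrary):

In: diagonal_matrix = sparse.diags([1.0] * 5, offsets=0, format='dia')
print(diagonal_matrix.shape)

Out: (5, 5)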
