Using bloom filters and sparse matrix

Sparse matrix can be used as highly efficient data structure. A sparse matrix has more 0 values compared to actual values. For example, a 100 X 100 matrix may have 10,000 cells. Now, out of this 10,000 cells, only 100 have values; rest are 0. Other than the 100 values, remaining cells are occupied with the default value of 0, and they are taking same byte size to store the value 0 to indicate the empty cell. It is a huge waste of space, and we can reduce it using the sparse matrix. We can use different techniques to store the values to the sparse matrix in a separate matrix that will be very lean and will not take any unnecessary spaces. We can also use a linked list to represent the sparse matrix. Here is an example of the sparse matrix:

0

0

0

0

0

1

0

0

0

0

1

0

0

0

0

0

0

0

0

0

0

0

0

0

2

0

0

0

0

0

0

0

2

0

0

0

0

0

0

0

0

0

0

0

0

0

1

0

0

0

0

0

0

0

0

0

0

2

0

0

0

0

0

0

0

0

1

0

0

0

1

0

0

0

0

0

0

0

0

0

Row

Col

Value

0

5

1

1

0

1

2

4

2

3

2

2

4

6

1

5

7

2

6

6

1

7

1

1

Since PHP array is dynamic in nature, the best approach for sparse matrix in PHP will be using only the indexes that have values; others are not used at all. When we are using the cell, we can do a check to see whether the cell has any value; else, the default value of 0 is used, just as shown in the following example:

$sparseArray = []; 
$sparseArray[0][5] = 1;
$sparseArray[1][0] = 1;
$sparseArray[2][4] = 2;
$sparseArray[3][2] = 2;
$sparseArray[4][6] = 1;
$sparseArray[5][7] = 2;
$sparseArray[6][6] = 1;
$sparseArray[7][1] = 1;

function getSparseValue(array $array, int $i, int $j): int {
if (isset($array[$i][$j]))
return $array[$i][$j];
else
return 0;
}

echo getSparseValue($sparseArray, 0, 2) . " ";
echo getSparseValue($sparseArray, 7, 1) . " ";
echo getSparseValue($sparseArray, 8, 8) . " ";

This will produce the following output in the command line:

0
1
0

When we have a large dataset, doing lookup in the dataset can be very time consuming and costly. Let's assume we have a dataset of 10 million phone numbers and we want to search for a particular phone number. This can be easily done using a database query. However, what if it is 1 billion phone numbers? Will it still be faster to find from a database? Such a big database can create slow-performing lookups. In order to solve this problem, an efficient approach can be using bloom filters.

A bloom filter is a space-efficient, probabilistic data structure that determines whether a particular item is part of a set or not. It returns two values: "possibly in set" and "definitely not in set". If an item does not belong to a set, bloom filter returns false. However, if it returns true, the item may or may not be in the set. The reason for this is described here.

In general, a bloom filter is a bit array of size m, where all initial values are 0. There is k different hash function, which converts an item to a hashed integer value, which is mapped in the bit array. This hash value can be between 0 to m, as m is the max size of our bit array. The hash functions are similar to md5, sha1, crc32, and so on, but they are very fast and efficient. Usually, in bloom filter fnv, murmur, Siphash, and so on, hash functions are used. Let's take an example of 16 (16+1 cells) bit bloom filter with the initial value of 0:

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

Let's assume that we have two hash functions, k1 and k2, to convert our items to integer values between 0 to 16. Let our first item to store in the bloom filter be "PHP". Then, our hash function will return following:

k1("PHP") = 5 
k2("PHP") = 9

Two hash functions have returned two different values. We can now put 1 in the bit array to mark that. The bit array will now look like this:

0

0

0

0

0

1

0

0

0

1

0

0

0

0

0

0

0

Let's now add another item in the list, for example, "algorithm". Suppose our hash functions will return the following values:

k1("algorithm") = 2 
k2("algorithm") = 5

Since we can see that 5 is already marked by another item, we do not have to mark it again. Now, the bit array will look like this:

0

0

1

0

0

1

0

0

0

1

0

0

0

0

0

0

0

For example, now, we want to check an item called "error", which is hashed to the following values:

k1("error") = 2 
k2("error") = 9

As we can see that our hash functions k1 and k2 returned a hashed value for string "error," which is not present in the array. So, this is definitely an error, and we expect to have such errors if our hash functions are only few in number. The more hash functions we have, lesser the errors we will have, as different hash functions will return different values. There is a relationship between error rate, number of hash functions, and the size of bloom filter. For example, a bloom filter for 5000 items and 0.0001 error rate will require roughly 14 hash functions and approximately 96000 bits. We can get such numbers from online bloomfilter calculators such as https://krisives.github.io/bloom-calculator/.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset