index
Numerics
2d keys 219 – 220, 224 – 225
A
acceptance-rejection method 147
additive error 172 – 173
adversary argument 231
algorithms
comment data example 3 – 8
comment data as stream 7
comment data in database 8
how to solve 4 – 8
designing with hardware in mind 12 – 14
sampling from data streams 163 – 166
structure and purpose of book 8 – 10
analysis tier 122, 125
anchor slot 66
approximate membership
Bloom filters 50 – 52
quotient filters 63 – 72
approximate quantiles 168 – 193
additive error 172 – 173
exact quantiles 169 – 171
q-digest 184 – 189
constructing from scratch 184 – 186
error and space considerations in 188
merging 186 – 187
quantile queries with 188 – 189
relative error 173 – 174
simulation code and results 189 – 192
t-digest 174 – 184
digestion 175 – 176
merging 180 – 183
scale functions 177 – 180
space bounds for 183 – 184
arrival index intervals 175
(article-id -> keyword_frequency) dictionaries 4
(article-id -> keyword_frequency) hash tables 7
Art of Computer Programming, Volume 2, The (Knuth) 145
B
B+-trees 227 – 229
balanced binary search trees 22
bandwidth, latency vs. 12
Bernoulli sampling 7, 143 – 146
Bε-trees
buffering mechanics 233 – 234
cost analysis 236 – 238
deletes 236
I/Os 240
inserts 236
lookups 236
overview 233
spectrum of data structures 238
use case 238 – 239
BFPRT (Blum, Floyd, Pratt, Rivest, and Tarjan) 171
BFs (Bloom filters) 6, 50 – 52, 127
adaptations and alternatives 62 – 63
better false positive rates 61 – 62
configuring 56 – 59
inserting items into 51
lookups 51 – 52
quotient filters compared to 72 – 74
successful lookups 74
uniform random inserts 73
uniform random lookups 73 – 74
simple implementation of 55 – 56
theory 59 – 62
use cases 53 – 54
Bitcoin mobile app 54
Squid 53 – 54
biased parameter 165
biased reservoir sampling 151
biased sampling strategy 136
binary search 204 – 207
minimum median income 204 – 206
runtime analysis 206 – 207
bioinformatics 204 – 206
bitarray library 55
Bitcoin mobile app 54
BLOCK_SIZE_ELEMENTS 210 – 211
Bloom filters see BFs (Bloom filters)
Bloom-joins 126 – 128
Blum, Floyd, Pratt, Rivest, and Tarjan (BFPRT) 171
brokers 129
B-trees 219 – 230
balancing 220 – 221
deletes 224 – 227, 231 – 232
inserts 221 – 224, 231 – 232
lookups 221, 230 – 232
use case 229 – 230
bucket_occupied bit 66 – 67, 70
buffer_in list 210
buildFingerTables(self) method 46
C
cache digests 53
cancer genomics 250 – 251
cardinality estimation 98 – 118
aggregation using HLL 114 – 116
counting distinct items in databases 99 – 100
incremental design 100 – 109
use cases for HLL 109 – 110, 114 – 116
Cassandra 245
centroid 175
chain sampling 156 – 160
Chernoff bounds 60
Chord 44 – 47
chordLookup(self,hashValue) method 47
ChunkStash [1] 24
close operation 204
clustered index 216
clusters 66
CMS (count-min sketch) 75 – 97
error vs. space in 88
estimate operation 80 – 81
general heavy-hitters problem 78 – 79
majority problem 76 – 79
range queries with 91 – 97
computing dyadic intervals 95 – 97
dyadic intervals 91 – 92
estimate phase 94 – 95
update phase 93 – 94
simple implementation of 88 – 91
update operation 80
use cases 82 – 87
scaling distributional similarity of words 85 – 87
top-k restless sleepers 82 – 85
collision resolution 29 – 32
comment data example 3 – 8
comment data as stream 7
comment data in database 8
how to solve 4 – 8
(comment-id -> frequency) dictionary 4
(comment-id -> frequency) hash table 6 – 7
compression parameter 184
concept drift 133
concept shifts 133
consistent hashing 34 – 47
adding new nodes/resources 39 – 41
Chord 44 – 47
hashring 36 – 38
lookup 38 – 39
removing nodes 41 – 44
typical problem 35
constant density 169
constant-time operations 28 – 29, 52
convex hull 249
CountMinSketch class 89
count(v) 185
crawlers 136
cuckoo hashing 32
CurrentSample object 165
D
DAM (disk-access model) 197 – 214
binary search 204 – 207
runtime analysis 206 – 207
use case 204 – 206
finding minimum 201 – 204
merging K sorted lists 209 – 213
optimal searching 207 – 209
overview 199 – 201
simple vs. simplistic 213
data intensive, meaning of 3
data stream data (DSD) objects 163
data stream task (DST) class 163, 165
data structures
comment data 3 – 8
as stream 7
in database 8
solving 4 – 8
overview 22 – 23
structure and purpose of book 8 – 10
deduplication 24, 128 – 129
in backup/storage solutions 24 – 26
delete operation 62
dictionaries 22
dict key-value dictionary 32
dict library 4
dict type 46
digest 174
distance method 38
DISTINCT keyword 99
distributed systems
hash tables for 34 – 47
adding new nodes/resources 39 – 41
Chord 44 – 47
hashring 36 – 38
lookup 38 – 39
removing nodes 41 – 44
typical problem 35
massive datasets and 12
distributional similarity 85
dkeys 224
DSC_Sample() function 164
DSC_Sample class 165
DSD (data stream data) objects 163
DSD_Gaussians() function 163
DSD_ReadCSV class 164
DST (data stream task) class 163, 165
dyadic intervals
computing 95 – 97
overview 91 – 92
E
ε-approximate φ quantile 172
empirical relative errors 175
estimate operation 79 – 81, 83
estimate query 87
estimation
count-min sketch 80 – 81, 94 – 95
streaming data 135, 139 – 141
event stream 7
external-memory algorithms 2, 8
external memory sorting 248 – 265
challenges of 251 – 255
external memory merge-sort 255 – 257
external quick-sort 258 – 263
multiway 259 – 260
pivots 260 – 262
two-way 258 – 259
use cases 249 – 251
cancer genomics 250 – 251
robot motion planning 249 – 250
external quick-sort 258 – 263
multiway 259 – 260
pivots 260 – 262
two-way 258 – 259
F
file_names list 210
file_processed list 211
files_loc list 210
fingerprint 63
fingerprint variable 65
fingerTable attribute 46
frequency estimation 75 – 97
error vs. space in count-min sketch 88
estimate operation 80 – 81
general heavy-hitters problem 78 – 79
majority problem 76 – 79
range queries with count-min sketch 91 – 97
simple implementation of count-min sketch 88 – 91
update operation 80
use cases for count-min sketch 82 – 87
fully merged t-digest 179
G
general heavy-hitters problem 78 – 79
gripping power 170
H
hash128 function 34
hash64 function 34
hash-based sketching data structures 8
hashing 19 – 47
collision resolution 29 – 32
consistent hashing 34 – 47
adding new nodes/resources 39 – 41
Chord 44 – 47
hashring 36 – 38
lookup 38 – 39
removing nodes 41 – 44
typical problem 35
constant-time operations 28 – 29
MurmurHash 33 – 34
ubiquitous nature of 20 – 22
usage scenarios 24 – 28
deduplication 24 – 26
plagiarism detection 26 – 28
Python dict key-value dictionary 32 – 33
HashMap library 4
hashring 36 – 38
HashRing class 36 – 38, 46
hashValue attribute 37
h-bit hash 64
HDFS query processor (HQP) 128
HLL1[1..m] HyperLogLog 115
HLL2[1..m] HyperLogLog 115
HLL (HyperLogLog) 98 – 118
aggregation using 114 – 116
counting distinct items in databases 99 – 100
effect of number of buckets 113 – 114
incremental design 100 – 109
error and space considerations in 109
LogLog 105 – 106
probabilistic counting 101 – 103
stochastic averaging 103 – 104
stochastic averaging with harmonic mean 106 – 109
use cases 109 – 110, 114 – 116
HLL_UNION[1..m] HyperLogLog 115
Horvitz-Thompson type estimator 140
HQP (HDFS query processor) 128
HyperLogLog data structure 7
I
inclusion probability 150 – 151
indexing 216 – 218
inserts
Bloom filters 51, 73
B-trees 221 – 224, 231 – 232
quotient filters 66 – 69, 73
inverse probability integral transform 145
inverted index 26
is_shifted bit 66
ith bucket 103
K
keys fingers 45
K-mers 204
krandom bits 74
k-wise independent hash functions 31
L
landmark stream,sampling from 143 – 156
Bernoulli sampling 143 – 146
biased reservoir sampling 151 – 156
reservoir sampling 146 – 151
latency, bandwidth vs. 12
leveling merge policy 243
light node 54
likes attribute 7
limited working storage 133
linear probing 29
linked lists 22
load balancing 130 – 132
load shedding 125
LogLog 105 – 106
error and space considerations in 105 – 106
super LogLog 106
lookupNode method 39, 41, 45
lookups
Bloom filters 51 – 52, 73 – 74
B-trees 221, 230 – 232
hash tables 38 – 39
quotient filters 69 – 71, 73 – 74
LSM-trees (log-structured merge-trees) 240 – 245
cost analysis 245
overview 241 – 244
use case 245
M
M/B-way merge-sort (external memory merge-sort) 255 – 257
majority problem 76 – 79
map library 4
marked attribute 95
massive datasets 1 – 15
challenges of 10 – 12
CPU memory performance gap 10 – 11
distributed systems 12
latency vs. bandwidth 12
memory hierarchy 11 – 12
comment data 3 – 8
as stream 7
in database 8
solving 4 – 8
structure and purpose of book 8 – 10
median of medians algorithm (Blum, Floyd, Pratt, Rivest, and Tarjan) (BFPRT) 171
memory
CPU memory performance gap 10 – 11
memory hierarchy 11 – 12
mergeability 183
message-queuing tier 122
min-heap 83
min variable 202
mmh3 MurmurHash wrapper 88
mmh3 package 33
MOSS (Measure of Software Similarity) 26 – 28
moveResources helper method 39
Murmur 33
MurmurHash 33 – 34
MySQL 229 – 230
N
network traffic tracking 130 – 132
Node class 37, 46
nodes
adding 39 – 41
removing 41 – 44
O
old cluster 182
O(log n) nodes 45
one pass 132
online aggregation 143
open addressing 29
open operation 204
P
PERTURB_SHIFT constant 33
perturb variable 33
pivot 219
plagiarism detection 26 – 28
PMI (pointwise mutual information) 85
Poisson sampling 146
priority sampling 160 – 163
PRNG (pseudo-random number generator algorithm) 144
probabilistic counting 101 – 103
probability of residing at N 151
probability p 60
product_id attribute 99
Psaltis, Andrew G. 122
Python
dict key-value dictionary 32 – 33
quotient filter lookups 69 – 71
Q
q-digest 184 – 189
constructing from scratch 184 – 186
error and space considerations in 188
merging 186 – 187
quantile queries with 188 – 189
quantile 170
query processing time 133
quotient 64
QuotientFilter class 69
quotient filters 63 – 72
Bloom filters compared to 72 – 74
successful lookups 74
uniform random inserts 73
uniform random lookups 73 – 74
false positive rate and space considerations 72
inserting items into 66 – 69
metadata bits 65 – 66
Python code for lookups 69 – 71
quotienting 64 – 65
resizing and merging 71 – 72
storing 70 – 71
quotienting 64
R
Rabin-Karp fingerprinting 26 – 28
range queries
with count-min sketch 91 – 97
computing dyadic intervals 95 – 97
dyadic intervals 91 – 92
estimate phase 94 – 95
update phase 93 – 94
read_block function 202
readline() function 207
read operation 204
read optimized 8
read-write optimized 8
real-time analytics 132
relative error
in data domain 174
overview 173 – 174
remainder 64
representative sampling 136
reservoir sampling
biased 151 – 156
overview 146 – 151
residing probabilities 152
resources dictionary 38, 41
restriction rule 106
robot motion planning 249 – 250
rolling hashes 26
run_continued bit 66, 70
runs 65, 242
S
sampling from data streams 135 – 167
algorithms comparison 163 – 166
biased strategy 136 – 139
from landmark stream 143 – 156
Bernoulli sampling 143 – 146
biased reservoir sampling 151 – 156
reservoir sampling 146 – 151
from sliding window 156 – 163
chain sampling 156 – 160
priority sampling 160 – 163
seed parameter 33
seek() function 204, 207
seek operation 204
SELECT operation 99
session_id attribute 99
signed parameter 33
simple random sample (SRS) 136, 160
simplified payment verification (SPV) 54
sketch 174
sliding window model 133 – 135
sampling from 156 – 163
chain sampling 156 – 160
priority sampling 160 – 163
Slot class 69
small space 132
small time 133
sort() function 248
sorted arrays 22
sorting and selection problem 171
spatial locality 12
speed of aging factor 153
SPV (simplified payment verification) 54
squeezing 148 – 149
Squid 53 – 54
SRS (simple random sample) 136, 160
SSTs (string-sorted tables) 49, 245
stochastic averaging
overview 103 – 104
with harmonic mean 106 – 109
streaming data 121 – 141
approximate quantiles on 168 – 193
additive error 172 – 173
exact quantiles 169 – 171
q-digest 184 – 189
relative error 173 – 174
relative error in data domain 174
simulation code and results 189 – 192
t-digest 174 – 184
estimation 135, 139 – 141
meta example 126 – 132
Bloom-joins 126 – 128
deduplication 128 – 129
load balancing and tracking network traffic 130 – 132
practical constraints and concepts 132 – 135
concept shifts and concept drifts 133
real-time analytics 132
sliding window model 133 – 135
small time and small space 133
sampling 135 – 167
algorithms comparison 163 – 166
biased strategy 136 – 139
landmark stream 143 – 156
sliding window 156 – 163
Streaming Data (Psaltis) 122
stream package 156, 163 – 164
succinct data structures 6
sudden bursts 130
summary 174
T
t-digest 174 – 184
digestion 175 – 176
merging 180 – 183
scale functions 177 – 180
space bounds for 183 – 184
tdigest library 189
tell() function 207
tiering merge policy 244
time/date logs, merging 210 – 213
external memory version 210 – 213
RAM version 210
timestamp attribute 99
TokuDB 238 – 239
top-k PMIs 86
top-k queries 76
top-k restless sleepers 82 – 83, 85
top-k trending queries 75
truncation rule 106
two-way merge-sort 252
U
unclustered index 216
unsorted array 22
update operation 80, 93 – 94
(user-id, amount) pair 82 – 84
(user-id,sensor-id) pair 82
user_ip_address attribute 99
V
visit_duration attribute 99
W
weight 175
weighted Bloom filter 63
(word, context), pair 87
write amplification effect 244
write operation 204
write optimized 8
writes 240