index

Numerics

2d keys 219220, 224225

A

acceptance-rejection method 147

additive error 172173

adversary argument 231

algorithms

comment data example 38

comment data as stream 7

comment data in database 8

how to solve 48

designing with hardware in mind 1214

sampling from data streams 163166

structure and purpose of book 810

analysis tier 122, 125

anchor slot 66

approximate membership

Bloom filters 5052

quotient filters 6372

approximate quantiles 168193

additive error 172173

exact quantiles 169171

q-digest 184189

constructing from scratch 184186

error and space considerations in 188

merging 186187

quantile queries with 188189

relative error 173174

simulation code and results 189192

t-digest 174184

digestion 175176

merging 180183

scale functions 177180

space bounds for 183184

arrival index intervals 175

(article-id -> keyword_frequency) dictionaries 4

(article-id -> keyword_frequency) hash tables 7

Art of Computer Programming, Volume 2, The (Knuth) 145

B

B+-trees 227229

balanced binary search trees 22

bandwidth, latency vs. 12

Bernoulli sampling 7, 143146

Bε-trees

buffering mechanics 233234

cost analysis 236238

deletes 236

I/Os 240

inserts 236

lookups 236

overview 233

spectrum of data structures 238

use case 238239

BFPRT (Blum, Floyd, Pratt, Rivest, and Tarjan) 171

BFs (Bloom filters) 6, 5052, 127

adaptations and alternatives 6263

better false positive rates 6162

configuring 5659

inserting items into 51

lookups 5152

quotient filters compared to 7274

successful lookups 74

uniform random inserts 73

uniform random lookups 7374

simple implementation of 5556

theory 5962

use cases 5354

Bitcoin mobile app 54

Squid 5354

biased parameter 165

biased reservoir sampling 151

biased sampling strategy 136

binary search 204207

minimum median income 204206

runtime analysis 206207

bioinformatics 204206

bitarray library 55

Bitcoin mobile app 54

BLOCK_SIZE_ELEMENTS 210211

Bloom filters see BFs (Bloom filters)

Bloom-joins 126128

Blum, Floyd, Pratt, Rivest, and Tarjan (BFPRT) 171

brokers 129

B-trees 219230

balancing 220221

deletes 224227, 231232

inserts 221224, 231232

lookups 221, 230232

use case 229230

bucket_occupied bit 6667, 70

buffer_in list 210

buildFingerTables(self) method 46

C

cache digests 53

cancer genomics 250251

cardinality estimation 98118

aggregation using HLL 114116

counting distinct items in databases 99100

incremental design 100109

use cases for HLL 109110, 114116

Cassandra 245

centroid 175

chain sampling 156160

Chernoff bounds 60

Chord 4447

chordLookup(self,hashValue) method 47

ChunkStash [1] 24

close operation 204

clustered index 216

clusters 66

CMS (count-min sketch) 7597

error vs. space in 88

estimate operation 8081

general heavy-hitters problem 7879

majority problem 7679

range queries with 9197

computing dyadic intervals 9597

dyadic intervals 9192

estimate phase 9495

update phase 9394

simple implementation of 8891

update operation 80

use cases 8287

scaling distributional similarity of words 8587

top-k restless sleepers 8285

collision resolution 2932

comment data example 38

comment data as stream 7

comment data in database 8

how to solve 48

(comment-id -> frequency) dictionary 4

(comment-id -> frequency) hash table 67

compression parameter 184

concept drift 133

concept shifts 133

consistent hashing 3447

adding new nodes/resources 3941

Chord 4447

hashring 3638

lookup 3839

removing nodes 4144

typical problem 35

constant density 169

constant-time operations 2829, 52

convex hull 249

CountMinSketch class 89

count(v) 185

crawlers 136

cuckoo hashing 32

CurrentSample object 165

D

DAM (disk-access model) 197214

binary search 204207

runtime analysis 206207

use case 204206

finding minimum 201204

merging K sorted lists 209213

optimal searching 207209

overview 199201

simple vs. simplistic 213

data intensive, meaning of 3

data stream data (DSD) objects 163

data stream task (DST) class 163, 165

data structures

comment data 38

as stream 7

in database 8

solving 48

overview 2223

structure and purpose of book 810

deduplication 24, 128129

in backup/storage solutions 2426

delete operation 62

dictionaries 22

dict key-value dictionary 32

dict library 4

dict type 46

digest 174

distance method 38

DISTINCT keyword 99

distributed systems

hash tables for 3447

adding new nodes/resources 3941

Chord 4447

hashring 3638

lookup 3839

removing nodes 4144

typical problem 35

massive datasets and 12

distributional similarity 85

dkeys 224

DSC_Sample() function 164

DSC_Sample class 165

DSD (data stream data) objects 163

DSD_Gaussians() function 163

DSD_ReadCSV class 164

DST (data stream task) class 163, 165

dyadic intervals

computing 9597

overview 9192

E

ε-approximate φ quantile 172

empirical relative errors 175

estimate operation 7981, 83

estimate query 87

estimation

count-min sketch 8081, 9495

streaming data 135, 139141

event stream 7

external-memory algorithms 2, 8

external memory sorting 248265

challenges of 251255

external memory merge-sort 255257

external quick-sort 258263

multiway 259260

pivots 260262

two-way 258259

use cases 249251

cancer genomics 250251

robot motion planning 249250

external quick-sort 258263

multiway 259260

pivots 260262

two-way 258259

F

file_names list 210

file_processed list 211

files_loc list 210

fingerprint 63

fingerprint variable 65

fingerTable attribute 46

frequency estimation 7597

error vs. space in count-min sketch 88

estimate operation 8081

general heavy-hitters problem 7879

majority problem 7679

range queries with count-min sketch 9197

simple implementation of count-min sketch 8891

update operation 80

use cases for count-min sketch 8287

fully merged t-digest 179

G

general heavy-hitters problem 7879

gripping power 170

H

hash128 function 34

hash64 function 34

hash-based sketching data structures 8

hashing 1947

collision resolution 2932

consistent hashing 3447

adding new nodes/resources 3941

Chord 4447

hashring 3638

lookup 3839

removing nodes 4144

typical problem 35

constant-time operations 2829

MurmurHash 3334

ubiquitous nature of 2022

usage scenarios 2428

deduplication 2426

plagiarism detection 2628

Python dict key-value dictionary 3233

HashMap library 4

hashring 3638

HashRing class 3638, 46

hashValue attribute 37

h-bit hash 64

HDFS query processor (HQP) 128

HLL1[1..m] HyperLogLog 115

HLL2[1..m] HyperLogLog 115

HLL (HyperLogLog) 98118

aggregation using 114116

counting distinct items in databases 99100

effect of number of buckets 113114

incremental design 100109

error and space considerations in 109

LogLog 105106

probabilistic counting 101103

stochastic averaging 103104

stochastic averaging with harmonic mean 106109

use cases 109110, 114116

HLL_UNION[1..m] HyperLogLog 115

Horvitz-Thompson type estimator 140

HQP (HDFS query processor) 128

HyperLogLog data structure 7

I

inclusion probability 150151

indexing 216218

inserts

Bloom filters 51, 73

B-trees 221224, 231232

quotient filters 6669, 73

inverse probability integral transform 145

inverted index 26

is_shifted bit 66

ith bucket 103

K

keys fingers 45

K-mers 204

krandom bits 74

k-wise independent hash functions 31

L

landmark stream,sampling from 143156

Bernoulli sampling 143146

biased reservoir sampling 151156

reservoir sampling 146151

latency, bandwidth vs. 12

leveling merge policy 243

light node 54

likes attribute 7

limited working storage 133

linear probing 29

linked lists 22

load balancing 130132

load shedding 125

LogLog 105106

error and space considerations in 105106

super LogLog 106

lookupNode method 39, 41, 45

lookups

Bloom filters 5152, 7374

B-trees 221, 230232

hash tables 3839

quotient filters 6971, 7374

LSM-trees (log-structured merge-trees) 240245

cost analysis 245

overview 241244

use case 245

M

M/B-way merge-sort (external memory merge-sort) 255257

majority problem 7679

map library 4

marked attribute 95

massive datasets 115

challenges of 1012

CPU memory performance gap 1011

distributed systems 12

latency vs. bandwidth 12

memory hierarchy 1112

comment data 38

as stream 7

in database 8

solving 48

structure and purpose of book 810

median of medians algorithm (Blum, Floyd, Pratt, Rivest, and Tarjan) (BFPRT) 171

memory

CPU memory performance gap 1011

memory hierarchy 1112

mergeability 183

message-queuing tier 122

min-heap 83

min variable 202

mmh3 MurmurHash wrapper 88

mmh3 package 33

MOSS (Measure of Software Similarity) 2628

moveResources helper method 39

Murmur 33

MurmurHash 3334

MySQL 229230

N

network traffic tracking 130132

Node class 37, 46

nodes

adding 3941

removing 4144

O

old cluster 182

O(log n) nodes 45

one pass 132

online aggregation 143

open addressing 29

open operation 204

P

PERTURB_SHIFT constant 33

perturb variable 33

pivot 219

plagiarism detection 2628

PMI (pointwise mutual information) 85

Poisson sampling 146

priority sampling 160163

PRNG (pseudo-random number generator algorithm) 144

probabilistic counting 101103

probability of residing at N 151

probability p 60

product_id attribute 99

Psaltis, Andrew G. 122

Python

dict key-value dictionary 3233

quotient filter lookups 6971

Q

q-digest 184189

constructing from scratch 184186

error and space considerations in 188

merging 186187

quantile queries with 188189

quantile 170

query processing time 133

quotient 64

QuotientFilter class 69

quotient filters 6372

Bloom filters compared to 7274

successful lookups 74

uniform random inserts 73

uniform random lookups 7374

false positive rate and space considerations 72

inserting items into 6669

metadata bits 6566

Python code for lookups 6971

quotienting 6465

resizing and merging 7172

storing 7071

quotienting 64

R

Rabin-Karp fingerprinting 2628

range queries

with count-min sketch 9197

computing dyadic intervals 9597

dyadic intervals 9192

estimate phase 9495

update phase 9394

read_block function 202

readline() function 207

read operation 204

read optimized 8

read-write optimized 8

real-time analytics 132

relative error

in data domain 174

overview 173174

remainder 64

representative sampling 136

reservoir sampling

biased 151156

overview 146151

residing probabilities 152

resources dictionary 38, 41

restriction rule 106

robot motion planning 249250

rolling hashes 26

run_continued bit 66, 70

runs 65, 242

S

sampling from data streams 135167

algorithms comparison 163166

biased strategy 136139

from landmark stream 143156

Bernoulli sampling 143146

biased reservoir sampling 151156

reservoir sampling 146151

from sliding window 156163

chain sampling 156160

priority sampling 160163

seed parameter 33

seek() function 204, 207

seek operation 204

SELECT operation 99

session_id attribute 99

signed parameter 33

simple random sample (SRS) 136, 160

simplified payment verification (SPV) 54

sketch 174

sliding window model 133135

sampling from 156163

chain sampling 156160

priority sampling 160163

Slot class 69

small space 132

small time 133

sort() function 248

sorted arrays 22

sorting and selection problem 171

spatial locality 12

speed of aging factor 153

SPV (simplified payment verification) 54

squeezing 148149

Squid 5354

SRS (simple random sample) 136, 160

SSTs (string-sorted tables) 49, 245

stochastic averaging

overview 103104

with harmonic mean 106109

streaming data 121141

approximate quantiles on 168193

additive error 172173

exact quantiles 169171

q-digest 184189

relative error 173174

relative error in data domain 174

simulation code and results 189192

t-digest 174184

estimation 135, 139141

meta example 126132

Bloom-joins 126128

deduplication 128129

load balancing and tracking network traffic 130132

practical constraints and concepts 132135

concept shifts and concept drifts 133

real-time analytics 132

sliding window model 133135

small time and small space 133

sampling 135167

algorithms comparison 163166

biased strategy 136139

landmark stream 143156

sliding window 156163

Streaming Data (Psaltis) 122

stream package 156, 163164

succinct data structures 6

sudden bursts 130

summary 174

T

t-digest 174184

digestion 175176

merging 180183

scale functions 177180

space bounds for 183184

tdigest library 189

tell() function 207

tiering merge policy 244

time/date logs, merging 210213

external memory version 210213

RAM version 210

timestamp attribute 99

TokuDB 238239

top-k PMIs 86

top-k queries 76

top-k restless sleepers 8283, 85

top-k trending queries 75

truncation rule 106

two-way merge-sort 252

U

unclustered index 216

unsorted array 22

update operation 80, 9394

(user-id, amount) pair 8284

(user-id,sensor-id) pair 82

user_ip_address attribute 99

V

visit_duration attribute 99

W

weight 175

weighted Bloom filter 63

(word, context), pair 87

write amplification effect 244

write operation 204

write optimized 8

writes 240

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset