Index
Page numbers followed by “f” indicate figures; those followed by “t” indicate tables.
A
A* search algorithm
Access-pattern limitations,
68,
80
executable plans, generating,
81–84
Acyclicity constraints,
166
Agglomerative hierarchical clustering (AHC),
180,
200
Algorithm decide completeness,
91
Analysis time vs. execution time,
225
Answering queries, using views,
43–44
algorithms, comparison of,
57–58
closed-world assumption,
60–61
interpreted predicates with,
61–62
inverse-rules algorithm,
56–57
open-world assumption,
59–60
Archiving update logs,
444
Attribute correspondences, schema mappings,
351
Attribute names, mediated schema,
65–66
Attribute-level uncertainty,
347
Autonomous data sources, interfacing with,
223
Autonomy for data integration,
B
representing and reasoning with,
184–185
Bidirectional expansion,
404
Bidirectional mappings,
322
Block-independent-disjoint (BID) model,
349–350
Build large-scale structured Web databases,
454–455
C
Cardinality estimation,
214
Character data (CDATA),
297
Classifier techniques,
133
Closed-world assumptions,
60–61
Cloud-based parallel process,
457
Cluster-based parallel process,
457
Clustering
collective matching based on,
200–201
Collaboration in data integration
annotation
corrections and feedback process,
436–437
user updates propagated, upstream/downstream,
437–438
Collaborative data sharing system (CDSS)
reconciliation process,
449
entity mentions in documents,
201–202
Commercial relational databases,
23
Compatible data values, discovering,
408
Complete orderings, query refinements and,
36–37
Complex query reformulation algorithms,
75
Composing scores, uncertainty,
347
Computational complexity,
356
Condition variables, uncertainty,
347
Conjunctive queries,
26–28
interpreted predicates,
35–37
query containment of,
32–34
Consistent target instance,
352–353
interpreted predicates with,
35
Content-free element,
295
Conventional query processor, modules in,
211f
Corrective query processing (CQP),
232
Cost-based backwards expansion,
404
Cost-based reoptimization, CQP,
235–238
Curation, scientific annotation and,
440
D
placement and shipment in DBMS,
217–218
relationships
transformation modules,
275
materialized repository,
283
Data integration
challenges of,
setting expectations,
social and administrative,
8–9
goal of,
Data integration engine,
222
probabilistic approaches to,
182–183
Data-level heterogeneity,
92–93
Data-level variations,
67
Database concepts, review of
conjunctive queries,
26–28
integrity constraints,
24–25
queries and answer,
25–26
Database management system (DBMS)
cost and cardinality estimation,
214
Database reasoning vs. description logics,
333–334
Database systems, queries,
25
Declarative warehousing, data exchange,
276–277
Dependent join operator,
224
Distinguished variables,
26
Distributed vs. parallel DBMS,
216–217
Document object model (DOM),
300–301
Document type definition (DTD),
296–298
Domain integrity constraints,
135–137
Dynamic data, CDSS
reconciliation process,
449
Dynamic-programming algorithm,
97
E
Eddy
lottery scheduling routing,
234–235
queueing-based plan selection,
232–234
Edges
Efficient reformulation,
70
Enterprise information integration (EII),
283
Equality-generating dependencies (EGDs),
24,
80,
277
Event-condition-action rule framework,
226
Event-driven adaptivity,
226
handling source failures and delays,
227–228
handling unexpected cardinalities,
228–231
Executable plans, generating,
81–84
Executable query plans,
81–82
Execution time vs. analysis time,
225
Existential variables,
26
Explanation, provenance,
363
eXtensible Markup Language (XML),
292,
446
namespaces and qualified names,
294–295
query language
schema mapping for
nested mappings, query reformulation with,
321–322
structural and schema definitions
tags, elements, and attributes,
293–294
Extensional database (EDB) relations,
28
External data, direct analysis of,
284–287
Extract-transform-load (ETL)
Extraction rules with Lixto,
267–269
F
FindMapping algorithm,
156
Flat-file-based data analysis,
287
Foreign key constraints,
24
Fullserve company database,
1–4,
2f,
7–8
Functional dependencies,
24
G
matching entity mentions,
195
Global-and-Local-as-View (GLAV),
77–78
with integrity constraints,
88–89
Google’s MapReduce programming paradigm,
284
Graph expansion algorithms,
403–404
Graph random-walk algorithms,
401
Graphical user interface,
153
H
Handling limited access patterns,
224
Hash-based exchange scheme,
217
Hash-based operators for faster initial results,
223
Head-left-right-tail (HLRT) wrappers,
249–250
semantic,
Higher-level similarity measure,
108
Horizontal partitioning,
217
Hybrid similarity measures
Monge-Elkan similarity measure,
109
HyperText Markup Language (HTML),
292
I
Immediate consequent, provenance,
361
Incremental update propagation,
447
Information-gathering query operators,
229
Instance-based matchers,
132
Integrated data, visualization,
456
on mediated schema,
85–89
Intensional database (IDB) relations,
28
Interactive wrapper construction,
263
creating extraction rules with Lixto,
267–269
identifying extraction results with Poly,
264–267
labeling of pages with Stalker,
263–264
Internet data, query execution for,
222
Interpreted atoms,
27,
35
Inverse document frequency (IDF) measure,
105–106,
105f
J
Jaccard similarity measures,
200
Jaro-Winkler measure,
104
K
Keyword search
Knowledge representation (KR) systems,
325–327
L
Learning-based wrapper construction,
249
Left outer join operator,
317
Linearly weighted matching rules,
176
Lixto system, creating extraction rules with,
267–269
Local completeness,
89–90
Local contributions table,
447
Local data, direct analysis of,
284–287
Local rejections table,
447
Local-as-View (LAV),
73,
415
syntax and semantics,
74–75
with integrity constraints,
85–87
Local-completeness constraint,
89–90
Logistic regression matching rules,
175–176
Lottery scheduling scheme for routing,
234–235
M
Machine learning techniques,
409
Manual wrapper construction,
247–249
Margin-Infused Ranking Algorithm (MIRA),
410
Master data management (MDM),
273–274
Match predictions, combining,
134
Materialized repository,
283
integrity constraints on
Message-passing systems,
162
translations between,
166
Mid-query reoptimization,
228,
238
MiniCon description (MCD),
51,
424
use of generic set of,
161
Model management systems,
163,
170
Modern database optimizers,
212
Monge-Elkan similarity measure,
109
Multi-strategy learning,
146
N
Naive Bayes
classification technique,
134
Negative log likelihood,
367
Nested mappings, query reformulation with,
321–322
Nested tuple-generating dependency (nested tgds),
320–321
O
Object-oriented database schemas vs. description logics,
334
Online analytic processing (OLAP) queries,
273
Open DataBase Connectivity (ODBC) wrapper,
223
Open-world assumption,
59–60
Optimizer, runtime reinvocation of,
231
Overlap similarity measure,
104,
113
P
Parallel vs. distributed DBMS,
216–217
Pay-as-you-go
Peer data management systems (PDMSs),
413
complexity of query answering in,
419–421
for coordinating emergency response,
415
with looser mappings
query reformulation algorithm,
421–426
reformulation construction,
426
storage descriptions,
414,
417
inclusion and equality,
417
interpreted predicates in,
421
Performance-driven adaptivity,
231–232
Phonetic similarity measures,
109–110
Physical database,
Physical query plan for data integration,
223
Physical-level query operators,
217
Piazza-XML mappings language,
318–319
Probabilistic conditional table (Pc-table),
348–349
Probabilistic data representations
tuple-independent model,
349
Probabilistic generative model,
201
Probabilistic mappings (P-mappings),
350,
352
semi-automatic schema mapping tool,
351
Probabilistic matching method,
204,
205
Probability
of perturbation types,
196,
197
Processing instruction,
293,
295
Prolog programming language,
29
Publishing update logs,
444
Q
Query answer-based feedback,
401
Query answering inference in description logics,
332–333
Query capabilities and limited data statistics,
209–210
Query containment, conjunctive queries,
32–34
Query plans, generating initial,
221–222
Query tree, score as sum of weights in,
402–403
R
Real-world data matching systems,
177
Reconciliation process, CDSS,
449
Recurrence equation
for Needleman-Wunsch score,
99,
99f
Reformulation
Relation names, mediated schema,
65–66
Reoptimization
Resolving cycle constraints,
166
Resource Description Framework (RDF),
335–337
Resource Description Framework Schema (RDFS),
335,
340–341
Rewriting queries, length of,
47–48
Root-leaf costs, score as sum of,
403
Runtime reinvocation of optimizer,
231
S
Scalability challenge,
96
Scalable automatic edge inference,
407–408
Scalable query answering,
409
combined similarity matrix for,
138t
with integrity constraints,
137f
propagating constraints,
142
tree representation of,
143f
languages
tuple-generating dependencies,
78–80
uncertainty
Scientific data sharing setting,
440–441
Scoring
Select-project-join (SPJ) expression,
211,
212
Semantics
compatibility, considering,
408
Semi-supervised learning,
409,
456
Semiautomatic techniques,
345
Sequence-based similarity measures
Jaro-Winkler measure,
104
Set-based similarity measures
Similarity measures
hybrid
Monge-Elkan similarity measure,
109
sequence-based
Jaro-Winkler measure,
104
set-based
Simple delete-insert update model,
449–450
Single-database context,
401
Social media, integration of,
456
Softened overlap set,
107
Source descriptions, vertical-search engine,
382
Spreading activation,
404
Stalker extraction rules,
254
Standard data integration applications,
388
State modules (STeMs),
235
Statistics collection operators,
229
Steiner tree algorithms,
402,
403
Streaming XPath evaluation,
312
String matching
problem description of,
95–96
scaling up
similarity measures
Structured data, keyword search,
399–403
Standard Generalized Markup Language (SGML),
292
Suboperators for eddy,
233
Subsumption inference in description logics,
331–332
Support vector machines (SVM),
178
T
Target data instance,
276
Threshold Algorithm (TA),
404,
405
Top-k query processing,
404
Transactions, challenges of,
449–450
Transient data integration tasks,
378
Tuple-generating dependencies (tgds),
24,
78–80,
277
Tuple-independent model,
349
Tuple-level uncertainty,
347
Two-way Bloomjoin operator,
218–219
U
probabilistic data representations,
348–350
Uniform resource identifier (URI),
294,
338
User-supervised techniques,
392
V
Variable network connectivity and source performance,
210
Vertical partitioning,
217
Vertical-search engines,
378,
385
Virtual data integration,
10
Virtual integration system, caching,
284
W
lightweight combination of
multiple data sets, combining,
393
structured data, discovering,
391
Web end user information sharing,
440
Web Service Description Language (WSDL),
300
Web sites with databases of jobs,
4–5,
5f
Web-based applications,
284
Web-oriented data integration systems,
225
Weighted-sum combiners,
134
Wrappers
construction
X
XML Stylesheet Language Transformations (XSLT),
300
Z