Index

A

AADM format, 281, 292
Abstract Syntax Notation One (ASN.1), 407
abstraction, process, 358–359
access, restriction, in Sequence Retrieval System, 128
accession number, 44
accuracy, 25–27
Affymetrix Analysis Data Model, 407
Affymetrix GeneChip microarray, 280
GeneExpress system and, 282
aggregation, 407
algebra, relational, 42–43, 420
algorithm
cell averaging, 280
gene expression data, 286–287
AllGenes project, 53–54
AllGenes query, 57, 58
ampersand, 120
analysis, 404
complexity of, 6–7
analysis package, Kleisli query system and, 165
analysis program, sensitivity of, 26
analysis software, 19
analysis tool in Sequence Retrieval System, 137–139
annotation, gene
as integration challenge, 289–290
standardization involving, 282
annotation data mapping, gene, 295–296
annotation data space, gene, 279
annotation pipeline, genome, 26
anomaly, update, 40
ANSI-SPARC three-schema architecture, 254–257
application programming interface (API), 381, 407
application semantics, 19
architecture
DiscoveryLink, 309–312
federated. See federation
grid, 91–92
of KIND model-based mediator, 361
mediator, 256–261
of Sequence Retrieval System, 111
three-schema, 254–257
array
different versions of, 285–286
probe, 280
Atlas, SMART, 362–364
automated server maintenance in Sequence Retrieval System, 141–143
automatic summary table, 407
autonomous data source, 18
autonomy of databases, 407

B

bag, 408
Basic Local Alignment Search Tool (BLAST), 25, 137–138, 408
DiscoveryLink and, 311, 313–316
FASTA and, 146
functionality of, 383
integration of, 45–46
querying vs. browsing, 47
batch queue, 139
benchmarks in performance evaluation, 374–375
bi-valued semantics, 90
Binary Large Object (BLOB), 408
bindjoin, 408
bioinformatics
biological data integration, 4–7
definition of, 3
future of, 394–396
problem and scope of, 2–4
system development, 7–10
biological data, nature of, 15–17
biological data integration, 7–10
biological database, Kleisli query system and, 165–166
biological ontology, 216–217
biological resource, 397–405
query processing and, 92–93
biological sample data space, 278–279
biological tool, legacy, 79–80
biology
fusion with information science, 2–3
systems, 421
blastn, 408
blastp, 408
Boolean circuit, 408
Boolean query, 24
BottomUpOnce strategy, 172
box plot, 408
browsing
definition of, 408
design of, 89–90
example of, 50–52
querying vs., 46–48
scientific objects, 100–101
semantic, in model-based mediation, 344
strengths and weaknesses of, 61–62
bulk data type, 408

C

calcium channel protein, example using, 319–322
Call-Level Interface (CLI), 409
canned query, 139
capability, source, 93
capturing, relational schema, 125–126
capturing process knowledge, 340–341
CDATA, 408
cDNA, 409
CDS, 409
cell averaging algorithm, 280
Cell-Centered Database, 345–347, 362–364
CGI, 409
challenges of information integration, 11–31
data integration, 21–24
meta-data specification, 24–25
ontology, 27–30
provenance and accuracy, 25–27
Web presentations, 30–31
Character Large Object (CLOB), 409
CLUSTAL, 137–138
clustering technique, 13
CNS tissue, 409
co-clustered fragment, 409
code
Icarus, 113–115
Perl, 167, 168
code generator, 260
Collection Programming Language (CPL)
definition of, 409
DiscoveryLink and, 308
K2 system and, 228
P/FDM mediator and, 267
query processor and, 205
combining old and new data, 68
Common Object Request Broker Architecture (CORBA), 22, 91, 140, 141
definition of, 410
TAMBIS and, 214–215
comparative genomics, 409
compensation in query optimization, 317–318
compilation of domain maps, 354–355
compiler
condition, 260
execution plan, in KIND model-based mediator, 362
complex DTDs, 121
complex multiple-world scenario, 336–337
complex objects in Sequence Retrieval System, 134
complex value data, 233, 409
composite structure, links to create, 136
composition, view, 68
comprehension syntax-based language, 151
Comprehensive Data Center, 397
computational analysis tool, 19
concept
definition of, 190
parameterized, 356–357
recursive, 356
restricting of, 200–201
role as, 353
in system design, 85–86
concept description, query as, 197–202
concept integration, 4–5
concept overloading, 5
Conceptual Model (CM), 410
conceptual schema, 44, 255
condition compiler, 260
consortium, Gene Ontology, 29
construction, of links, 131–132
context-sensitive optimizations, 171–174
contextual references, in model-based mediation, 349
contextualization, in model-based mediation, 344, 350–351
controlled vocabulary, 40
cost model in performance evaluation, 372–374
cost of query processing, 96–97
DiscoveryLink and, 318, 322–326
coverage of information sources, 92
CPL2Perl, 176–179
CPU, 410
creating wrapper in DiscoveryLink registration, 313
criterion, 193
curated database, 26
curated gene data source, simple, 37–38
curation, data, definition of, 410

D

Daplex query language
capabilities of, 264–265
example using, 261, 262, 264
functional data model and, 252, 253
data
model, in K2 information integration system, 232–235
multimedia, 99–100
standardization involving, 282
data cleansing, 410
data curation, 410
data dictionary, 22
data distribution, in system evaluation, 386–387
data-driven integration, 91–92
data driver
decoupled, 242
integrated, 241
data exchange
for integration of third-party gene expression data, 291–293
standards for, 282
data federation, use case, 68–69
data format, updating of, 6
data fusion, 82, 410
data integration. See Integration, data
data loading, 296–297
data management, 35–69
basics, 36–39
gene expression, 277–299. See also gene expression data management
relational model, 41–44
retrieving genes, 38–39
semi-structured text files, 40–41
simple curated gene data source, 37–38
spreadsheets, 39–40
traditional, 41–44
transforming of database structure, 44
data mapping, semantic, 293–296
data mining, 87–89, 411
data model, 411
in K2 information integration system, 232–235
non-relational, 64
relational, 41–44
strengths and weaknesses of, 64
data organization, traditional, 81
data provenance, 25–27
data provider in model-based mediation, 343
data replication approach, 250–251
data repository, 4
data-shipping, 411
data source
characteristics of, 17–19
definition of, 147, 411
DiscoveryLink registration and, 314
gene expression data management and, 290
in K2 information integration system, 240–242
Kleisli query system and, 165–167
mediator and, 349–351
P/FDM mediator and, 265–266
simple curated gene, 37–38
Web, 65–67
data space, gene expression, 278–281
biological sample, 278–279
gene annotation, 279
gene expression measurement, 279–281
data transformation, 5
data type, 411
data warehouse. See Warehousing
databank
definition of, 112
relational, viewing entry from, 128–129
XML, loading from, 135
Databank, in Sequence Retrieval System, 109–116
database
autonomy of, 407
biologic, Kleisli query system and, 165–166
cell-centered, 362–364
definition of, 36, 112, 410
Expressed Sequence Tag, 319
flat files vs., 78
heterogeneous, definition of, 414
link-driven federation of, 415–416
number of, 4
patent, Kleisli query system and, 164
relational, query performance to, 128
virtual, in DiscoveryLink, 305
database management, traditional, 80–81
database management system (DBMS), 36
definition of, 411
relational, 21–22
database structure, transforming, 44
database system, 405
Datalog, 411
DB2 DataJoiner, 306
DDL statement, 313–314
declarative access, procedural access vs., 49
declarative query language, 63
decomposition, query, 68
decoupled data driver, 242
definition
integrated view, 345
intensional, 348–349
delivery pattern in query processing, 93
Department of Energy unanswerable query challenge, 226, 228, 229, 375–376
deployment issues in GeneExpress system, 283–284
description, concept, query as, 197–202
description logic ontology, 194
description logics, 411
design of biological information system, 75–101
browsing, 89–90
concepts and ontologies, 85–86
data fusion, 82
engineering vs. experimental science, 76–77
fully structured vs. semi-structured, 82–84
generic system vs. query-driven, 77–78
legacy data and tools, 78–80
queries, 86–98. See also Query
scientific object identity, 84–85
searching, 87–89
tool-driven vs. data-driven, 91–92
traditional database management, 80–81
visualization, 98–101
development process, 9
dictionary
data, 22
in K2 system, 233
difference operation, 42
discovery process, life sciences, 12–14
discoveryHub, efficiency of, 377
DiscoveryLink, 24, 55–58, 303–331
approach, 306–316
architecture, 309–312
registration, 313–316
ease of use, scalability, and performance of, 327–329
efficiency of, 377
functionality of, 383
Kleisli query system and, 181–182
materialized vs. non-materialized approach and, 386
query processing in, 316–326
determining costs, 322–326
example of, 319–322
optimization and, 317–319
system information for, 428
distributed data, 45
distributed database systems, 411
distributed integration approach, 22
distributed object technology, 91
distribution, data, in system evaluation, 386–387
diversity, 15–16, 19–20
DNA, definition of, 412
DNA microarray, 412
DNA sequence, resources for, 397–398
DNA sequencing, 412
domain, constantly changing, 80
domain map, 335
definition of, 412
for model-based mediator system, 352–357
compilation of, 354–355
definition of, 352–353
deriving role hierarchy, 355–356
as logic rules, 354–355
parameterized role and concepts, 356–357
recursive concepts, 356
reified roles as concepts, 353
remarks, 355
role hierarchy, 354
domain semantics, 337
domain-specific benchmark, 374
driver
decoupled data, 242
integrated data, 241
DTD file, complex, 121
DTDGenerator, 120–121

E

EcoCyc, 216
efficiency
as implementation criterion, 377–378
as user criterion, 382
elaboration, process, 358–359
elaboration identifier, 358
EMBOSS, 138
Empty syntax, XML and, 118–119
end user in model-based mediation, 344
engineering
experimental science vs., 76–77
knowledge, 353
entity, general, 119–120
Entrez interface, 88–89
entry ID, hub table as, 126
environment, for life science discovery, 14–15
ENZYME, 403
enzyme, definition of, 412
ER model, 412
error
propagation of, 26
in spreadsheet, 40
EST sequence, definition of, 412
European Bioinformatics Institute (EBI), 91
evaluation, query, 95, 96
evaluation matrix, 372
evaluation of data management system, 9–10, 371–390
implementation criteria for, 376–381
efficiency, 377–378
extensibility, 378–379
functionality, 379
scalability, 379–380
understandability, 380
usability, 381
performance model for, 371–376
benchmarks, 374–375
cost model, 372–374
evaluation matrix, 372
tradeoffs in, 385–389
data distribution and heterogeneity, 386–387
integrating applications, 389
materialized vs. non-materialized approach, 385–386
semi-structured vs. fully structured data, 387–388
user criteria for, 382–385
efficiency, 382
extensibility, 382–383
functionality, 383
scalability, 383
understandability, 384
usability, 384–385
evolution biology, 12
Excel, 39–40
exchange format
Kleisli, 156, 157
self-describing, 156
standards for, 282
for third-party gene expression data integration, 291–293
execution plan compiler in KIND model-based mediator, 362
experimental science, engineering vs., 76–77
explorer window in TAMBIS, 195–197
exporter in P/FDM mediator, 251
exporting from SRS to XML, 136–137
Expressed Sequence Tag database, 319
expression
shorthand, 119–120
table, 421
expression profile, 13
extensibility
as implementation criterion, 378–379
as user criterion, 382–383
extensible markup language (XML), 43–44
for biological Web services, 30–31
browsing and, 90
categories of, 83
database integration into Sequence Retrieval System, 116–124
challenge of, 122–124
procedure for, 120–121
support features, 121–122
uniqueness of, 118–120
definition of, 423
exporting objects from SRS, 136–137
loading from, 135
navigational capabilities of, 90
semi-structured vs. fully structured data and, 387–388
Sequence Retrieval System and, 110, 116–124
TAMBIS and, 215
wrapper, 312
external schema, 254

F

FASTA, 137–138, 146
Feature table of GenBank, 159
federation, 22
definition of, 412
DiscoveryLink based on, 306
example of, 54–58
link-driven, 415–416
P/FDM mediator and, 249–272
alternative architectures for integration, 250–252
analysis, 266–272
data sources, 265–266
example of, 261–264
functional data model, 252–254
mediator architecture, 257–261
query capabilities, 264–265
schemas in federation, 254–257
Sequence Retrieval System and, 143
use case, 68–69
warehousing vs., 49
fields, SRS, 130
file
hypertext markup language, 147–148
probe intensity, 281
semi-structured text, 40–41
filler, 193
filter, 208
First Order logic, 413
flat file, database vs., 78
flat file databank integration, 112–116
foreign key, 413
format
data
semi-structured text, 40–41
updating of, 6
exchange
Kleisli, 156, 157
self-describing, 156
standards for, 282
for third-party gene expression data integration, 291–293
self-describing exchange, 156
fragment, gene, 289
definition of, 413
frame-based system, 217
frame of reference, terminological, 347
FTP, 413
fully structured data, semi-structured data vs., 387–388
fully structured information system, 82–84
functional data model, 252–254
functional genomics, 413
functional programming language, 413
functionality
as implementation criterion, 379
as user criterion, 383
fuser, result, 261
fusion
data, 82
definition of, 410
vertical loop, 170
future of bioinformatics, 394–396

G

Garlic project, 306–307
GenAtlas, querying in, 85
GenBank
accession number, 44
feature table of, 159
identifiers in, 100–101
Kleisli query system and, 150
materialized vs. non-materialized approach and, 385–386
search in, 66–67
gene, definition of, 413
gene annotation
as integration challenge, 289–290
standardization involving, 282
gene annotation data mapping, 295–296
gene annotation data space, 279
gene chip microarray technology, 414
gene data source, simple curated, 37–38
gene discovery, 319
gene expression, 399, 413
Gene Expression Array (GXA), 283–284
gene expression data management, 277–299
data spaces, 278–281
biological sample, 278–279
gene annotation, 279
gene expression measurement, 279–281
GeneExpress system for, 282–284
integration in, 285–290
algorithms and normalization and, 286–287
array versions and, 285–286
gene annotation and, 289–290
sample data and, 288
of third-party gene expression data, 291–298
variability and, 287–288
gene expression measurement data space, 279–281
gene fragment, definition of, 413
Gene Logic, DiscoveryLink and, 308
Gene Nomenclature Committee (HGNC), 28, 402
Gene Ontology (GO) Consortium, 29, 217
description of, 402
gene product, 413
GeneCards, search in, 66–67
GeneChip, 413
GeneChip microarray, 280
GeneExpress, system information for, 427
GeneExpress Data Warehouse (GXDW), 283–284
gene annotation component of, 290
GeneExpress system, 282–284
algorithms in, 286–287
components of, 283
deployment and update issues in, 283–284
integrating third-party expression data in, 291–298
sample data in, 288
general entity, 119–120
generator
code, 260
logic plan, 360–361
generic approach, 49–50
query-driven approach vs., 77–78
strengths and weaknesses of, 63
generic benchmark, 374
generic query optimization, 267–268
genetics, 399
Genetics Computer Group (GCG), 307–308
genome
definition of, 414
resources of, 398
genome annotation pipeline, 26
Genome DataBase (GDB)
Kleisli query system and, 150–151
materialized vs. non-materialized approach and, 385–386
object identity and, 84–85
genome project, 414
genomic data source as integration challenge, 289–290
Genomic Unified Schema, 385–386
genomics, 414
functional, 413
research needs of, 12–13
GenPept report, 153–154
creating warehouse of, 164–165
Glimpse search engine, 88
global-as-view technique, 216
definition of, 414
in model-based mediation, 349, 350
global integration schema, 266
global schema, 45–46, 414
Globus Pallidus External, 351
GO databank in Sequence Retrieval System, 126–127
GRAIL, 202
GRAIL query, 205–206
query planner, 208–211
graphical interface, 179
graphical user interface, for P/FDM, 269, 271
Grid, 414
grid architecture, 91–92
GUI, 414

H

hard-coding, 49–50
legacy tools including, 80
strengths and weaknesses of, 63
hardwired access to data sources, 304
hardwiring of mapping in GeneExpress system, 295
hash table, 321
heterogeneity
in semantic data integration, 58–59
syntactic and semantic, 212
heterogeneous data format, 18, 19–20
heterogeneous database, definition of, 414
hierarchy, role, 355–356
hierarchy, in GeneExpress system, 293
host variable, 414
HTTP, 414
hub table, 126–127
HUGO name, withdrawn or approved, 84–85
human computer interaction, 375
Human Genome Initiative, 415
Human Genome Project, 415
Human Genome Organization (HUGO), 28, 402
hybrid integration approach, 64–65
hybridization, 415
hypertext markup language file (HTML), 147–148
hypothesis as design step, 76

I

Icarus code, 113–114
ICode rewriter, 260
ID, entry, hub table as, 126
identifier, elaboration, 358
identity
pre-defined, 81
scientific object, 84–85
IBM DiscoveryLink middleware system, 24
ImMunoGeneTics information system, 403
implementation, experiment as, 76
implementation criteria for system evaluation, 376–381
efficiency, 377–378
extensibility, 378–379
functionality, 379
scalability, 379–380
understandability, 380
usability, 381
in silico discovery kit (ISDK), 160, 161, 415
indexing, SRS support for, 121–122
indexing tool output, 138
industrial merger, 303
information integration
in bioinformatics, 213–215
biologic ontologies, 216–217
data challenges, 21–24
data provenance and accuracy, 25–27
knowledge based, 215–216
meta-data specification, 24–25
ontology, 27–30
Web presentations, 30–31
information integration system, K2, 225–247. See also K2 information integration system
information science, fusion with biology, 2–3
Informax, 307
Infosleuth, 266
initial process semantics, 357
input, processing of, 138
input/output format, 19
integrated data driver, 241
Integrated Taxonomic Information System, 402
integrated view definition, 345
integrated view of biology, 12
integration
schema, 421
in system evaluation, 389
view, 423
integration, data, 4–10, 60–69
browsing vs. querying, 46–48, 61–62
as challenges, 21–24
challenges of, 11–31
concept, 4–5
declarative query language, 63
definition, 410
development process, 9
evaluation of, 9–10
of flat file databanks with SRS, 112–116
of gene expression data, 285–290
algorithms and normalization and, 286–287
array versions and, 285–286
sample data and, 288
gene annotation and, 289–290
variability and, 287–288
generic approach to, 63
hard-coded approach to, 63
hybrid approach to, 64–65
issues of, 4–7
procedural code, 63
relational vs. non-relational, 64
semantic, 58–60
semantic query planning, 65–67
specifications for, 7–8
syntactic vs. semantic, 48–49
technical approach, 8–9
of third-party gene expression data, 291–298
data exchange formats for, 291–293
data loading issues in, 296–297
semantic data mapping issues in, 293–296
structural data transformation issues in, 293
update issues in, 297–298
tool-driven vs. data-driven, 91–92
use case for, 45–46
Web data sources, 66
integration schema, global, 266
intensional definitions, 348–349
intensity file, probe, 281
interaction, human computer, 375
interface
application programming, 407
Entrez, 88–89
graphical, 179
in K2 information integration system, 243–244
keyword-search querying, 24
Kleisli query system and, 166
for P/FDM, 268–271
to Sequence Retrieval System, 139–141
TAMBIS, 195–205
constructing queries, 197–202
exploring ontology, 195–197
query processor, 205–212
reasoning in query formulation, 202–205
intermediary, 8
internal language, of K2 information integration system, 239–240
internal schema, 254, 256
International Classification of Diseases, Ninth Revision, 402
International Organization for Standardization, 415
International Union of Biochemistry and Molecular Biology (IUBMB), 28, 403
International Union of Pure and Applied Chemistry (IUPAC), 28, 403
is-a hierarchy, 192
ISA relationship, 415
iteration, 207

J

Java-based visual interface, for P/FDM, 268
Java DataBase Connectivity (JDBC), 229, 415
Java RMI, 241–242
join, 42
joining data in DiscoveryLink query processing, 317–318
joins, spatial, 337
Journal of Nucleic Acid Research, 17

K

K2 information integration system, 225–247
approach in, 229–232
data model and languages in, 232–235
data sources in, 240–242
example of, 235–239
impact of, 245–246
internal language of, 239–240
Kleisli vs., 228–229
query optimization in, 242–243
scalability of, 244–245
system information for, 426
user interfaces in, 243–244
K2MDL, 231–232, 415
key, primary, 81
keyword-search querying interface, 24
KIND
mediator prototype, 360–362
system information for, 428–429
understandability of, 381, 384
Kleisli query system, 23–24, 147–184
approach of, 151–153
data model and representation in, 153–157
data sources in, 165–167
DiscoveryLink and, 181–182
efficiency of, 377–378
functionality of, 383
K2 information integration system vs., 228–229
motivating example for, 149–151
Object-Protocol Model and, 182–183
optimizations, 167–169
context-sensitive, 171–174
monadic, 169–170
relational, 174–175
query capability of, 158–163
Sequence Retrieval System and, 179–181
system information for, 425
understandability of, 384
user interfaces, 175–179
graphical, 179
program language, 175–179
warehousing capability of, 163–165
knowledge, process, 340–341
knowledge base, 90
knowledge based information integration, TAMBIS, 215–216
knowledge engineering, 353
knowledge representation in model-based mediator system
domain maps for, 352–357
compilation of, 354–355
definition of, 352–353
deriving role hierarchy, 355–356
as logic rules, 354–355
parameterized role and concepts, 356–357
recursive concepts, 356
reified roles as concepts, 353
remarks, 355
role hierarchy, 354
process maps for, 357–360
domain maps and, 358
initial process, 357
as logic rules, 359–360
process elaboration and abstraction, 358–359
known gene, 416
KRAFT, 266
Kyoto Encyclopedia of Genes and Genomes (KEGG), 416

L

Laboratory Information Management System (LIMS), 13, 127
definition of, 416
GeneChip, 281
output, 20
language
Daplex, 253
extensible markup. See extensible markup language (XML)
functional programming, 413
of K2 information integration system, 232–235, 239–240
query
definition of, 419
limitations of, 86–87
legacy data and tools
biologic, 78–79
workflows, 79–80
LENS, 86
library, subentry, 116
life sciences discovery process, 12–14
link
browsing, 89–90
in browsing scientific objects, 100
link-driven federation of databases, 416
link operator in SRS query language, 132–133
linking, databank, to Sequence Retrieval System, 130–133
LION, 307
LISP, 416
list, definition of, 416
list comprehension, 257
literature reference, 401
loader, object, in Sequence Retrieval System, 133–137
loading
data, 296–297
from XML databank, 135
local-as-view technique, 216
definition of, 416
in model-based mediation, 350–351
local ontology, in model-based mediation, 344
local schema, 45–46
LocusLink, 403
logic
First Order, 413
temporal, 90
logic plan generator, 360–361
logic rule
domain map as, 354
process map as, 359–360
logics, description, 411
LOGSPACE, 416
long-term potentiation in nerve cell, 340
loop design, 76
loosely coupled system, 250

M

maintenance, automated server, in Sequence Retrieval System, 141–143
management
data, 35–69. See also data management
multimedia, 99–100
schema, 67–69
space, 373
time, 372–373
traditional database, 80–81
map
domain, 335
definition of, 412
in neuroscience, 339–342
process, 335
definition of, 419
simple process, 342
subprocess, 359
mapped role, 208
mapping
P/FDM mediator and, 263
schema, 68
semantic data, in integration of third-party expression data, 293–296
markup language, extensible. See extensible markup language
MAS algorithm, 286–287
materialized approach, 385–386
materialized view, 44, 416
matrix
evaluation, 372
measurement data space, gene expression, 279–281
mediation, semantic, 364
mediator
definition of, 417
sources and, 349–351
mediator architecture, 256–261
mediator database system, 22–24
mediator system
description of, 237–239
model-based, 335–366. See also model-based mediator system
P/FDM, 249–272. See also P/FDM mediator
prototype, 261–266
MEDLINE, 66
MEDLINE report, 153
merger, industrial, 303
meta-data, 56
Sequence Retrieval System and, 109–110, 111
meta-data specification, 24–25
meta language (ML), 417
microarray
different versions of, 285–286
microarray analysis, 404
Microarray Gene Expression Database society (MGED), 281, 417
microarray suite algorithm, 286–287
microarray suite (MAS), GeneChip, 280
microarray technology, gene chip, 414
Microsoft Distributed Component Object Model (DCOM), 91
Microsoft Visual Basic, 40
middleware, 417
middleware system, DiscoveryLink, 24. See also DiscoveryLink
minimum information about a microarray experiment (MIAME), 281–282, 417
mining, data, 87–89, 411
mismatch probe, 280
model
conceptual, 410
cost, 372–374
data, relational, 41–44
ER, 412
functional data, 252–254
object-oriented, 418
relational, 420
sources and services, 206–208
model-based mediator system, 335–366
background of, 336–337
Cell-Centered Database and SMART Atlas, 362–364
challenges from neurosciences, 338–342
conceptual models and source registration at, 344–349
for Cell-Centered Database, 345–347
contextual references, 349
creating terminological frame of reference, 347
intensional definitions, 348–349
ontological grounding of OM (S), 348
semantics of relationships in, 347–348
domain maps for, 352–357
compilation of, 354–355
definition of, 352–353
deriving role hierarchy, 355–356
as logic rules, 354–355
parameterized role and concepts, 356–357
recursive concepts, 356
reified roles as concepts, 353
remarks, 355
role hierarchy, 354
interplay between mediator and sources, 349–351
KIND mediator prototype, 360–362
process maps for, 357–360
domain maps and, 358
initial process, 357
as logic rules, 359–360
process elaboration and abstraction, 358–359
protagonists in, 343–344
reason-able meta-data, 365–366
related work, 364–365
model-based mediation (MBM), 417
module
optimizer, 260
reordering, 260
monad approach, 228
monadic optimizations, 169–170
motif, 192, 204
motivating use case, 45–46, 47
Mouse Genome Database
syntactic vs. semantic integration, 48–49
use case for integration, 45–46
mRNA, 417
multi-database approach, 251–252, 417
multidisciplinary approach, 15
multimedia data, 99–100
multiple sequence alignment, 404

N

name, HUGO, withdrawn or approved, 84–85
National Biological Information Infrastructure, 402
NCBI Entrez, 51–52
NCMIR, 338–339
nested object in Sequence Retrieval System, 134
Nested Relational Calculus (NRC), 152, 163, 418
nested relationalized version of SQL, 151–153
nested structure in K2 system, 226
neuroinformatics, 12
neuroscience, data integration in, 338–339
nomenclature, sample data mapping, 294–295
non-databased query, 175–176
non-materialized approach, 385–386
non-materialized view, 44, 418
non-relational data model, 64
relational data model vs., 50
nonsensical question, 201–202
normal syntax, XML and, 118
normalization, gene expression data and, 286–287
novel gene discovery, 319
NP (NPTIME), 418
NP-complete, 418
number, accession, 44

O

OASIS, 31
object
browsing of, 100–101
complex and nested, 134
Sequence Retrieval System, 140–141
Object Data Management Group (ODMG), 231–233, 418
Object Definition Language (ODL), 418
object identity, scientific, 84–85
object loader in Sequence Retrieval System, 133–137
complex and nested objects, 134
exporting objects to XML, 136–137
links to create composite structures, 136
support for, 135
Object Management Group (OMG), 22, 28, 419
object model, 344
object-oriented database, 308
object-oriented interface to Sequence Retrieval System, 140–141
object-oriented model, 418
object-oriented programming, 253, 254
object-oriented technology, 22
Object-Protocol Model (OPM), 24
DiscoveryLink and, 308
Kleisli query system and, 162, 182–183
system based on, 85–86
TAMBIS and, 213–214
Object Query Language (OQL), 86, 419
definition of, 418
K2 system and, 228
ODB-Tools, 365
on-line analytical processing (OLAP), 419
one-world/multiple-world scenarios, 419
ontological grounds of OM (S), 348
ontology, 27–30
biological, 216–217
definition of, 419
in model-based mediation, 344
neuroscience, 339
in system design, 85–86
Ontology Inference Layer (OIL), 418
Ontology for Molecular Biology (OMB), 217
Open DataBase Connectivity (ODBC), 418
optimization, query, 95–98
Daplex and, 264
in DiscoveryLink, 317–319
generic, 267–268
in K2 information integration system, 242–243
Kleisli query system and, 167–169
monadic, 169–170
relational, 174–175
semantic, 258, 267
optimizer module, 260
Oracle, 308
Oracle wrapper, 311
organ resources, 401
organism resources, 401
organization, data, 78–79
traditional, 81
output, processing of, 138
overloading, concept, 5

P

P (PTIME), 420
P/FDM mediator, 249–272
alternative architectures for integration, 250–252
analysis, 266–272
optimization, 267–268
scalability, 271–272
user interface, 268–271
data sources, 265–266
example of, 261–264
functional data model, 252–254
mediator architecture, 257–261
query capabilities, 264–265
schemas in federation, 254–257
system information for, 427
package, analysis, 165
parameterized roles and concepts, 356–357
parser module, 257
parsing tool output, 138
patent database, 166
pattern, in query processing
delivery, 93
statistical, 93
pattern recognition, 405
perfect-match probe, 280
performance model for system evaluation, 371–376
benchmarks, 374–375
cost model, 372–374
evaluation matrix, 372
performance of DiscoveryLink, 327–329
Perl codes, 167, 168
pharmacogenomics, 400–401
definition of, 420
pharmacology research, 304
phrase-based system, 217
phylogeny and evolution biology, 12
pipeline, genome annotation, 26
planning, query, 94–95
Plant Ontology Consortium, 402
platform, establishing, 8
pre-defined identity, 81
pre-processing, 138
precision, of text retrieval, 388–389
primary key, 81, 419
Prisma, SRS, 141–143
probe, definition of, 419
probe array, 280
probe array version, 285
probe data, 280
probe intensity file, 281
probe pair, 280
procedural access, declarative access vs., 49
procedural code, 63
process
life sciences discovery, 12–14
map, definition of, 419
process elaboration and abstraction, 358–359
process knowledge, capturing, 340–341
process map, 335
in neuroscience, 339
simple, 342
process maps for model-based mediator system, 357–360
domain maps and, 358
initial process, 357
as logic rules, 359–360
process elaboration and abstraction, 358–359
process semantics, initial, 357
processing, query, 92–98
processor, query, 205–212, 220. See also query processor
profile, user, 7–8
program, structural recursion, 162–163
programming, object-oriented, 253
programming interface, application, 407
programming language, functional, 413
projection, 42
Prolog, 254
propagation of errors, 26
protein, calcium channel, 319–322
protein domain, 400
protein family, 400
protein sequence, resources for, 397–398
proteome, definition of, 419
proteomics, 400, 419
prototype mediator, 261–266
KIND, 360–362
provenance, 25–27
provider
data, 343
view, 343–344
Public Catalog of Databases, 17
public data source, 17–18
PubMed
identifiers in, 100–101
search in, 51–52, 66–67, 89

Q

query, 86–98
AllGenes, 57, 58
Boolean, 24
browsing, 89–90
cost of processing, 322–326
Daplex, 252, 261, 262, 264
capabilities of, 264–265
definition of, 420
DiscoveryLink and, 305–306, 316–326
architecture and, 309–310
determining costs, 322–326
example of, 319–322
optimization and, 317–319
efficiency of, 377–378
old and new data, 68
reasoning in formulation of, 202–205
in relational database, 128
searching and mining, 87–89
semantics of, 90
in Sequence Retrieval System, 128, 129–130
SQL, 127
in TAMBIS, 191, 197–202
unanswerable, 226, 228, 229, 375–376
to Web interface, 139
query decomposition, 68
query-driven approach, 77–78
query execution plan, 65
query language
declarative, 63
definition of, 420
standard, 43–44
query optimization
in K2 information integration system, 242–243
semantic, 258
query processing, 92–98
biological resources in, 92–93
optimization in, 95–98
planning in, 94–95
query processor, TAMBIS, 205–212, 220
query planner, 208–211
sources and services model, 206–208
syntactic and semantic heterogeneity, 212
wrappers, 211–212
query rewriter in KIND model-based mediator, 362
query-shipping, 420
query splitter, 260, 268
query system, Kleisli, 147–184. See also Kleisli query system
querying, 420
browsing vs., 46–48
object identity and, 84–85
SRS support for, 121–122
strengths and weaknesses of, 61–62
querying interface, keyword-search, 24
question, nonsensical, 201–202
queue, batch, 139

R

reason-able meta-data, 365–366
reasoning, in query formulation, 202–205
record, definition of, 420
recursion program, structural, 162–163
recursive concept, 356
reductionist molecular biology, 12
registration
in DiscoveryLink, 309
process of, 313–316
in model-based mediation, 344–349
reified roles as concepts, 353
relational algebra, 42–43, 420
relational data model, 41–44
non-relational model vs., 50
strengths and weaknesses of, 64
relational database, 153
integration into Sequence Retrieval System, 124–129
capturing relational schema, 125–126
hub table selection, 126–127
query performance, 128
restricting access, 128
SQL generation, 127
summary of, 129
viewing entries, 128–129
whole schema integration, 124–125
query performance to, 128
viewing entry from, 128–129
relational database management system (RDBMS), 21–22
Kleisli query system and, 165
relational model, 420
relational optimizations, 174–175
relational schema, capturing, 125–126
relationships, semantics of, in model-based mediation, 347–348
relevance
semantic, 364–365
source, 92–93
reliability, data provenance and, 26–27
reordering module, 260
replication approach, data, 250–251
report
GenPept, 153–154
creating warehouse of, 164–165
MEDLINE, 153
repository, data, 4
research and development, revolution in, 2–3
resolution, concept integration and, 4–5
resource, biological
list of, 397–405
in query processing, 92–93
Resource Description Framework (RDF), 420
restriction
access, 128
concept, 200–201
result fuser, 261
retrieval, text, 388–389
retrieval system, 405
rewriter
ICode, 260
query, in KIND model-based mediator, 362
RiboWeb, 216, 403
RNA, 420
role, 193
as concept, 353
mapped, 208
parameterized, 356–357
in TAMBIS, 207–208
role hierarchy, 355–356
rule
Icarus, 113–115
logic
domain map as, 354
process map as, 359–360
in query optimization, 96
rule-based rewriter, 258

S

sample data
gene expression, 288
standardization involving, 282
sample data mapping
nomenclature, 295
studies of, 294
sample data space, biological, 278–279
sanctioning, 203
scalability
of DiscoveryLink, 327–329
as implementation criterion, 379–380
of K2 information integration system, 244–245
P/FDM and, 271–272
as user criterion, 383
scaling factor, 287
schema
conceptual, 44
in database federation, 258
definition of, 41–42, 421
global integration, 266
relational, capturing, 125
three-schema architecture, 254–257
whole schema integration, 124–125
schema integration, 421
schema management, 67–69
schema mapping, 68
science, experimental, engineering vs., 76–77
scientific analysis program, sensitivity of, 26
scientific analysis tool in Sequence Retrieval System, 137–139
scientific object, browsing of, 100–101
scientific object identity, 84–85
search, spreadsheet, 40
search engine, Glimpse, 88
searching
definition of, 421
design of, 87–89
and mining, 87–89
selection, 42
self-describing exchange format, 156
semantic browsing in model-based mediation, 344
semantic data integration, 58–60
semantic data mapping in integration of third-party expression data, 293–296
semantic heterogeneity, 212
semantic mediation, 364
semantic query optimization, 258
semantic relevance, 364–365
semantic vs. syntactic integration, 48–49
Semantic Web, 421
semantics
application, 19
of biological data, 5
initial process, 357
in model-based mediation, 347–348
of query, 90
semi-structured data, fully structured data vs., 387–388
semi-structured information system, 82–83
semi-structured text file, advantages and disadvantages, 40–41
SeqStore, 307
sequence
DNA or protein, resources for, 397–398
EST, definition of, 412
sequence data source, searching against, 87
sequence folding, 404
Sequence Retrieval System (SRS), 109–144
architecture of, 111
automated server maintenance, 141–143
integrating flat file databanks, 112–116
subentry libraries, 116
token server, 113–115
interfaces to, 139–141
Kleisli query system and, 179–181
linking databanks, 130–133
object loader, 133–137
complex and nested objects, 134
exporting objects to XML, 136–137
links to create composite structures, 136
support for, 135
query language of, 129–130
relational database integration, 124–129
capturing relational schema, 125–126
hub table selection, 126–127
query performance, 128
restricting access, 128
SQL generation, 127
summary of, 129
viewing entries, 128–129
whole schema integration, 124–125
scientific analysis tools, 137–139
system information for, 425
TAMBIS and, 213
XML database integration, 116–124
challenge of, 122–124
procedure for, 120–121
support features, 121–122
uniqueness of, 118–120
sequence similarity search, 404
sequencing, definition of, 412
server
DiscoveryLink, 309
query processing and, 318
GeneExpress system on, 283
SOAP, TAMBIS and, 214–215
token, 113–115
server in Sequence Retrieval System, 111–112
maintenance of, 141–143
set, definition of, 421
shorthand expression, 119–120
simple curated gene data source, 37–38
simple multiple-world scenario, 336
Simple Object Access Protocol (SOAP), 141, 421
TAMBIS and, 214–215
simple one-world scenario, 336
simple process map, 342
simplified SQL, 148–149
simplified Structured Query Language (sSQL), 148–149, 151–152, 421
simplifier, 257
single channel gene expression microarray system, 279–281
SNOMED. See Systematized Nomenclature of Medicine
software, analysis, 19
software benchmark, 374–375
source, data
characteristics of, 17–19
definition of, 147, 411
gene expression data management and, 290
in K2 information integration system, 240–242
Kleisli query system and, 165–167
mediator and, 349–351
P/FDM mediator and, 265–266
simple curated gene, 37–38
types of, 78
Web, 65–67
source dependent query plan, 191
source relevance, 92–93
sources and services model, 206–208
space management, 373
spatial joins, 337
Spatial Markup Rendering Tool (SMART) Atlas, 360, 362–364
specification, meta-data, 24–25
specifications
determining, 7–8
translating into technical approach, 8–9
splitter, query, 260
spreadsheet, 39–40
SRS Prisma, 141–143
SRSCS, 140, 141
stackPACK, 138
Staged Prisma, 142
Standard Markup Language (SML), definition of, 421
standard query language, 21–22
standardization
benefits and limitations of, 281–282
of gene names, 28
Stanford-IBM Manager of Multiple Information Sources (TSIMMIS), 24
statement, DDL, 313–314
statistical pattern in query processing, 93
statistical technique for gene expression data, 287–288
storage schema, 256
stored procedure, 422
structural data transformation in integration of third-party gene expression data, 293
structural recursion program, 162–163
structure
composite, links to create, 136
database, transformation of, 44
resources of, 399
structure prediction, 404
Structured Query Language (SQL), 43, 86
definition of, 422
DiscoveryLink and, 311
generation of, 127
mining and, 87
plan generator, 362
subentry library in integrating flat file databanks, 116
subprocess map, 359
summary table, automatic, 407
survey, TAMBIS, 218
Swiss-Prot
accession number, 44
query optimization and, 97–98
SYNAPSE, 338–339
syntactic heterogeneity, 212
syntactic vs. semantic integration, 48–49
syntactical problem, SRS solution of, 123–124
synthetic approach to biology, 12
system evaluation, 9–10
system requirements
determining, 7–8
translating into technical approach, 8–9
Systematized Nomenclature of Medicine (SNOMED), 28, 288, 294–295, 402
systems analysis, demands of, 12
systems biology, 422

T

table
automatic summary, 407
hash, 321
table expression, 422
tagged union type, 153
TAMBIS, 24, 66, 149, 189–220
current and future developments in, 217–219
DiscoveryLink and, 308
extensibility of, 378–379
information integration, 213–215
biological ontologies, 216–217
knowledge based, 215–216
ontology, 192–197
P/FDM mediator and, 267
scalability of, 380
semantic integration and, 60
system information for, 426
tools-driven technology used by, 91
understandability of, 384
usability of, 381
user interface
constructing queries, 197–202
exploring ontology, 195–197
query processor, 205–212
reasoning in query formulation, 202–205
technology, gene chip microarray, 413
temporal logic, 90
term, 85
terminological frame of reference, 347
text file, semi-structured, advantages and disadvantages, 40–41
text retrieval, in system evaluation, 388–389
third-party gene expression data, integration of, 291–298
data exchange formats for, 291–293
data loading issues in, 296–297
semantic data mapping issues in, 293–296
update issues in, 297–298
three-level hierarchy, in GeneExpress system, 293
three-schema architecture, 254–257
tightly coupled system, 250
time management, 372–373
tissue resources, 401
token server, 113–115
tool
legacy, 79–80
scientific analysis, in Sequence Retrieval System, 137–139
tool-driven integration, 91–92
traditional database management, 80–81
traditional database system, searching and mining in, 88
transcription, 422
transcriptome, 422
transformation
data, 5
of database structure, 44
translation, 422
Transparent Access to Multiple Bioinformatics Information Sources. See TAMBIS
tuple, 81
two channel gene expression microarray system, 279–281
two-level hierarchy in GeneExpress system, 293

U

UMLS ontology, 363
unanswerable query challenge, 226, 228, 229, 375–376
understandability
as implementation criterion, 380
as user criterion, 384
Unified Modeling Language (UML), 422
Uniform Resource Locators (URL), 423
union, 42
Universe, Sequence Retrieval System and, 110–111
update anomaly, 40
updating
GeneExpress system, 283–284
in integration of third-party gene expression data, 297–298
usability
as implementation criterion, 381
as user criterion, 384–385
use case, 36–39
combining old and new data, 68
data federation, 68–69
data warehousing, 68
for integration, 45–46
retrieving genes and associated expression results, 38–39
simple curated gene data source, 37–38
user interface
in K2 information integration system, 243–244
for P/FDM, 268–271
in TAMBIS, 220
constructing queries, 197–202
exploring ontology, 195–197
query processor, 205–212
reasoning in query formulation, 202–205
user profile, 7–8
user survey, TAMBIS, 218

V

variability, 17
in gene expression data, 287–288
variant, definition of, 423
vector, differing meanings of, 29
vertical loop fusion, 170
view
definition of, 423
materialized, 416
non-materialized, 418
view building, 68
view composition, 68
view integration, 228, 423
view provider in model-based mediation, 343–344
viewing entry from relational databank, 128–129
virtual database in DiscoveryLink, 305
visualization
browsing scientific objects, 100–101
multimedia data, 99–100
vocabulary
consistent, 30
controlled, 40

W

warehousing, 21–22
definition of, 411
DiscoveryLink and, 307–308
example of, 52–54
federation vs., 49
gene expression data management and, 290
GeneExpress system and, 283
in K2 system, 229
in Kleisli query system, 163–165
strengths and weaknesses of, 62–63
use case, 68
Web data source, 65–67
Web interface
for P/FDM, 268–269, 270
to Sequence Retrieval System, 139
Web presentation, 30–31
Web services, 141
webomim-get-detail function in Kleisli system, 166–167
whole schema integration, 124–125
window, explorer, in TAMBIS, 195–197
withdrawn HUGO name, 84–85
workflow
biological tools and, 80
definition of, 423–424
World Wide Web, 30–31
data sources on, 6, 17–18
wrapped sources, 191
wrapper, 23, 49–50
BLAST, 315
in database federation, 260–261
definition of, 424
DiscoveryLink, 56, 308, 310–311
cost of query processing and, 322–326
registration and, 313–316
TAMBIS, 211–212

X

XA, 423
XPath, 90
XQuery, 90, 423