Index

Note: Page numbers followed by “f” indicate figures.

A

Abbreviations 60–62
Abstraction 289, 292–293
Acronyms  See Abbreviations
Advanced Encryption Standard (AES) 205, 216
Algorithm, term extraction and index 103
Alphabetization, sorting 48–49
Alphanumeric sorting, text structuring 49
Annotation and metadata, text structuring
    HTML 63
    tags 62, 64
    XML tags 64
Apollo Lunar Surface Experiments Package (ALSEP) 285
Archeological effort 286
ASCII
    sorting, text structuring 51, 52f, 55
    text structuring 69
ASCIIbetical order 51
Associative array 107, 246, 298, 301
Authentication 204–208, 351
Autocoder, index 111
Autocoding/autoencoding
    defined 105
    index with nomenclatures
        associative array 107
        autocoding/autoencoding, defined 105
        nomenclature, defined 104
        Perl code 107
        plesionymous terms 104–105
        Python code 106–107
        Ruby code snippet 106
        textual data 108

B

Bayesian analysis 331
Beauty and simplicity 135f
Blended classes, data analysis 143
Burrows-Wheeler transform (BWT) 348–351

C

caBIG™ 8
CAG repeat 343–344
Central limit theorem 326
Chaffing 351–352
Classification 1, 3
    algorithm 239–240
    corrupted 265f
    definition 236
    driving down complexity with 236–240
    graphic version of 264–265
    hypothesis-generating machines 250
    of living organisms 246f, 248, 249f
    nonchaotic and computable 245
    reality creation 252
    self-converging 248
    self-correcting 247
    unreasonable effectiveness of 244–253
    visualization of relationships in 262–267, 264f
Classifiers 239
Command line utilities 12–14
Committees, text structuring standards 67
Complexity barrier 5–8
Computational operations, index
    autocoder 111
    encryption protocols 109
    one-way hash value 109
Concordances, index lists and
    Perl script 93–94, 96–97
    Python 94–95
    reconstruction 97–98
    Ruby 95
Converting nonprintable files to plain-text 75–77
Corrupted classifications 265f
Cross multiple classes 253–256
Cygwin 14–16

D

Data analysis
    blended classes 143
    data profiling 148–152
        data signature 148–149
        Gnuplot output 153f
        Pareto’s principle 150
        Perl script 151
        phashconvert 149–150
        Python script 151–152
        stopwords 152
        Zipf distribution 150
    data reduction 153–155
        data flattening 155
        Euclidean distance 154
        overfitting 155
        principal component analysis 155
        supercomputers 154
    data retrieval 144–148
        header 146
        identify-verbose command 146, 147f
        ImageMagick 144
        imdisplay 144
        Python script 144
        Ruby script 145
    mean field approximation 142
    multimodality 142
    open source tools
        Gnuplot 156–158
        ImageMagick 162–163
        LaTeX, displaying equations 163–165
        Matplotlib 159–160
        normalized compression distance 165–168
        NumPy (Numerical Python) 161
        Pearson’s correlation 168–169
        R, for statistical programming 160–161
        SciPy 162
        simple dot product 169–171
    ranges and outliers 136–140
        data object 136
        data quality 137
        data set 136–137
        metadata 136
        secondary data 137
    simple estimates 140
Data encryption and authentication 204–208
Data flattening 155
Data immutability 346–348
Data integration 234
Data object 2, 136, 234
Data permanence 346–348
Data profiling
    data signature 148–149
    Gnuplot output 153f
    Pareto’s principle 150
    Perl script 151
    phashconvert 149–150
    Python script 151–152
    stopwords 152
    Zipf distribution 150
Data quality 137
Data Quality Act 343
Data reduction 153–155
    data flattening 155
    Euclidean distance 154
    overfitting 155
    principal component analysis 155
    supercomputers 154
Data retrieval 144–148
    header 146
    identify-verbose command 146, 147f
    ImageMagick 144
    imdisplay 144
    Python script 144
    Ruby script 145
Data scrubbing 201–204
    minimal necessary concept 202
    "safe phrases" list 202
Data set 136–137
Data signature 148–149
Decryption with OpenSSL 215–217
Deidentifiers 198–201
Digital Encryption Standard (DES) 216–217
Digital signal processing (DSP) 348
Digital watermarking 219–220
DOS batch scripts 16–18
DOS directory, sorting 49–50
Doublet lists, index 114–116
Dublin Core, text structuring 77–78

E

Encapsulation 288
Encryption
    with OpenSSL 215–217
    protocols, index 109
Euclidean distance 154
Euler’s identity 244
Event identifiers 208–212
Extensible Markup Language (XML)
    annotation and metadata 64
    simplifications 2
    text structuring standards 68

F

Format commands 73–75
Free-text 47

G

Glyphs 1
Gnuplot 156–158
    output 153f
GraphViz 262–263

H

Hash implementations, one-way 217–219
Header, data retrieval 146
Homonyms 45
HTML
    annotation and metadata 63
    simplifications 2
    sorting, text structuring 53

I

Identification system 238–239
Identified data 199
Identifiers 189
    ImageMagick 191–193
    implementation 191
    poor 194–198
Identify-verbose command 146, 147f
ImageMagick 144, 162–163
Imdisplay, data retrieval 144
Index
    computational operations
        autocoder 111
        encryption protocols 109
        one-way hash value 109
    find operation 91–92
    international standard (ISO 999) 93
    lists and concordances
        Perl script 93–94, 96–97
        Python 94–95
        reconstruction 97–98
        Ruby 95
    with nomenclatures, autoencoding
        associative array 107
        autocoding/autoencoding, defined 105
        nomenclature, defined 104
        Perl code 107
        plesionymous terms 104–105
        Python code 106–107
        Ruby code snippet 106
        textual data 108
    open source tools
        doublet lists 114–116
        Linux wordlist 113–114
        Ngrams lists 116–120, 117f, 119f
        word lists 112–114
    PageRank 93
    properties of 92–93
    search box 91
    term extraction and
        algorithm 103
        indexer 98
        Perl script 99
        Python script 103
        Ruby script 103
        scalable methods 100
        software extracts 98
        stop/barrier word list 100–102
Indexer, term extraction 98
Inheritance 288
Interactive line interpreters 19–20
International Code of Botanical Nomenclature (ICBN) 248
International standard (ISO 999) 93
Introspection 287–288

J

JADE project 286
Janus sentences 45

L

LaTeX, displaying equations 163–165
Legacy data, text structuring standards 66
Libraries 1
LibreOffice 12
Line-by-line sentence parsing 57
Linux bash scripts 18–19
Linux emulation for Windows 14–16
Linux wordlist, index 113–114

M

MapReduce 2
Mathematics 2
Matplotlib 159–160
Mayan glyphs 286
Mean field approximation 142
Meaning 233–236
Meaninglessness, text structuring 45–48
    homonyms 45
    Janus sentences 45
    noncompositionality 46
    reifications 46
    unrestricted extensibility 46
    word meanings 46
Mersenne Twister algorithm 212
Message digest version 5 (md5) 217
Metadata 136 See also Annotation and metadata
Monte Carlo simulations 327–335
M programming language  See Mumps
Multimodality 142
Mumps 236

N

Namespace 190, 235
National Cancer Institute dbGaP Data Access Committee 203–204
National Patient Identifier 190
Natural language processor 59
Ngrams lists, index 116–120, 117f, 119f
Noncompositionality 46
Normalized compression distance 165–168
Notation3 (N3) 68, 258
Nouns and names 1
N-triples 258
Numerals 1
NumPy (Numerical Python) 161

O

Object-oriented data
    abstraction 289, 292–293
    benefits 293–297
    encapsulation 288
    inheritance 288
    introspection and reflection 287–288
    persistent data 297–303
    polymorphism 289
    self-explaining data 285–287
    SQLite databases 303–306
    triplestore databases 293–297
Object-oriented programming, ontology 241–242
One-way hash
    implementations 217–219
    value, index 109
Ontology
    driving up complexity with 240–243
    object-oriented programming 241–242
    SUMO 237–238
OpenOffice 11
Open source tools
    encryption and decryption with OpenSSL 215–217
    Gnuplot 156–158
    ImageMagick 162–163
    LaTeX, displaying equations 163–165
    Matplotlib 159–160
    normalized compression distance 165–168
    NumPy (Numerical Python) 161
    one-way hash implementations 217–219
    Pearson’s correlation 168–169
    pseudorandom number generators 212–214
    RDF parsers 260–262
    RDF Schema 259–260
    R, for statistical programming 160–161
    SciPy 162
    simple dot product 169–171
    steganography 219–220
    syntax for triples 256–259
    UUID 214–215
    visualizing class relationships 262–267, 264f
OpenSSL 208
    encryption and decryption with 215–217
Open systems interconnection (OSI) 65–66
Outliers  See Ranges and outliers
Overfitting 155

P

Package installers 20–21
PageRank, index 93
Papyrus 1
Pareto’s principle, profiling 150
Pearson’s correlation 168–169
Perl script 10–11
    index lists and concordances 93–94, 96–97
    Monte Carlo simulations 328–329
    persistent data 298–299
    profiling 151
    random number generation 321–322
    sorting, text structuring 55
    term extraction and index 99
    triplestore data 295–297
Permutating 335–342
Persistent data 297–303
PETRA collider data 285–286
Phashconvert, profiling 149–150
Polymorphism 289
Poor identifiers 194–198
Principal component analysis 155
Problem simplification
    data permanence and immutability 346–348
    Monte Carlo simulations 327–335
    random numbers 321–326
    reanalysis 343–344
    resampling and permutating 335–342
    validation 343
    verification 342–343
    winnowing and chaffing 351–352
Profiling  See Data profiling
Pseudorandom number generators 212–214
Public domain 303
Python script 11
    data retrieval 144
    index lists and concordances 94–95
    Monte Carlo simulations 330
    persistent data 299
    profiling 151–152
    term extraction and index 103

R

Random number generation 321–326
Random number generator 213
Ranges and outliers 136–140
    data object 136
    data quality 137
    data set 136–137
    metadata 136
    secondary data 137
Reanalysis 343–344
Reconstruction, index lists and concordances 97–98
Reflection 287–288
Regular expressions, text structuring 69–73
Reidentifiers 198–201
Reifications 46
Repurposing project 286
Resampling 335–342
Resource Description Framework (RDF) 236
    parsers 260–262
    schema 259–260
    text structuring standards 68
R, for statistical programming 160–161
Ruby script 11
    classified triples 289–291
    data retrieval 145
    index lists and concordances 95
    introspection 287
    Monte Carlo simulations 330–331
    random number generation 324
    term extraction and index 103
Runes 1

S

Scalable methods, term extraction and index 100
Science 1–2
SciPy 162
Script loads, sentence parsing 58
Search box, index 91
Secondary data 137
Secure Hash Algorithm (SHA) 217–219
Self-explaining data 285–287
Sentence parsing
    line-by-line 57
    natural language processor 59
    script loads 58
Signatures 208–212
Simple dot product 169–171
Simple estimates 140
Simplification tools 1–2
Social Security Act 196
Social Security number 195–196
Software extracts, index 98
Sorting, text structuring
    alphabetization 48–49
    alphanumeric 49
    ASCII 51, 52f, 55
    DOS directory 49–50
    guidelines 51
    HTML 53
    importance 49
    order of business 50
    Perl script 55
    Unix directory 49
    web browser rendition 54f
    word processor directory 50
Specification 68 See also Text structuring
SQLite databases 303–306
Standards, text structuring
    committees 67
    legacy data 66
    Notation3 68
    open systems interconnection (OSI) 65–66
    RDF 68
    specification 68
    Turtle 68
    XML 68
Steganography 219–220
Stone tablets 1
Stop/barrier word list 100–102
Stopwords, profiling 152
SUMO ontology 237–238
Supercomputers 154
System calls 21–23

T

Tags, annotation and metadata 62, 64
Term extraction and index
    algorithm 103
    indexer 98
    Perl script 99
    Python script 103
    Ruby script 103
    scalable methods 100
    software extracts 98
    stop/barrier word list 100–102
Text editors 11
Text structuring
    abbreviations 60–62
    annotation and metadata
        HTML 63
        tags 62, 64
        XML tags 64
    meaninglessness 45–48
        homonyms 45
        Janus sentences 45
        noncompositionality 46
        reifications 46
        unrestricted extensibility 46
        word meanings 46
    open source tools
        ASCII 69
        converting nonprintable files to plain-text 75–77
        Dublin Core 77–78
        format commands 73–75
        regular expressions 69–73
    sentence parsing
        line-by-line 57
        natural language processor 59
        script loads 58
    sorting
        alphabetization 48–49
        alphanumeric 49
        ASCII 51, 52f, 55
        DOS directory 49–50
        guidelines 51
        HTML 53
        importance 49
        order of business 50
        Perl script 55
        Unix directory 49
        web browser rendition 54f
        word processor directory 50
    standards
        committees 67
        legacy data 66
        Notation3 68
        open systems interconnection (OSI) 65–66
        RDF 68
        specification 68
        Turtle 68
        XML 68
Timestamps 208–212
Triples 233–236
    examples of 233–234
    syntax for 256–259
Triplestore
    databases 236, 293–297
    resources 235
Turtle, text structuring standards 68
Turtle triples 258–259

U

Unique identifiers 189–194
Uniqueness 189
Universally Unique Identifier (UUID) 190–191, 214–215
    implementation 214
    timestamp 209
Unix directory, sorting 49
Unreasonable effectiveness, classifications 244–253
Unrestricted extensibility 46

V

Validation 343
Verification 342–343

W

Watermarking 219–220
Web browser rendition, sorting 54f
Winnowing 351–352
Word lists, index 112–114
Word meanings 46
Word processor directory 50

X

Z

Zipf distribution, profiling 150