© Raul Estrada and Isaac Ruiz 2016

Raul Estrada and Isaac Ruiz, Big Data SMACK, 10.1007/978-1-4842-2175-4_11

11. Glossary

Raul Estrada and Isaac Ruiz1

(1)Mexico City, Mexico

This glossary of terms and concepts aids in understanding the SMACK stack.

ACID

The acronym for Atomic, Consistent, Isolated, and Durable. (See Chapter 9.)

agent

A software component that resides within another, much larger, software component. An agent can access the context of the host component and execute tasks automatically; it is typically used to execute tasks remotely. In essence, it is an extension of a software program customized to perform tasks on its behalf.

API

The acronym for application programming interface. A set of instructions, statements, or commands that allow certain software components to interact or integrate with one another.

BI

The acronym for business intelligence. In general, the set of techniques that allow software components to group, filter, debug, and transform large amounts of data with the aim of improving business processes.

big data

The volume and variety of information collected. Big data is an evolving term that describes any large amount of structured, semi-structured, and unstructured data that has the potential to be mined for information. Although big data doesn’t refer to any specific quantity, the term is often used when speaking about petabytes and exabytes of data. Big data systems facilitate the exploration and analysis of large data sets.

CAP

The acronym for Consistent, Available, and Partition Tolerant. (See Chapter 9.)

CEP

The acronym for complex event processing. A technique used to analyze data streams continuously. Each flow of information is analyzed and generates events; in turn, these events are used to initiate other processes at higher levels of abstraction within a workflow/service.

client-server

An application execution paradigm for distributed environments, formed by two components: a server, which receives and processes requests, and clients, which issue them. For each request received, the server is committed to returning a response.

cloud

Systems that are accessed remotely and mainly hosted on the Internet. They are generally administered by third parties.

cluster

A set of computers working together through a software component. Computers that are part of the cluster are referred to as nodes. Clusters are a fundamental part of a distributed system; they maintain the availability of data.

column family

In the NoSQL world, a paradigm for managing data using tuples: a key is linked to a value and a timestamp. It handles larger units of information than the key-value paradigm.

coordinator

In concurrent scenarios, the coordinator is a cornerstone. The coordinator is tasked with distributing the operations to be performed and ensuring their execution. It also manages any errors that may arise in the process.

CQL

The acronym for Cassandra Query Language. A statement-based language very similar to SQL in that it uses SELECT, INSERT, UPDATE, and DELETE statements. This similarity allows quick adoption of the language and increases productivity.

cqlsh

A Cassandra-owned CLI tool used to run CQL statements.

concurrency

In general, the ability to run multiple tasks. In the world of computer science, it refers to the ability to decompose a task into smaller units that can run separately, and then to join the results of those isolated tasks to complete the total task.

commutative operations

A set of operations are said to be commutative if they can be applied in any order without affecting the ending state. For example, a list of account credits and debits is considered commutative because any ordering leads to the same account balance. If there is an operation in the set that checks for a negative balance and charges a fee, however, then the order in which the operations are applied does matter, so it is not commutative.
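The account example above can be sketched in a few lines of hypothetical Python (not from the book): plain credits and debits commute, but adding an overdraft fee makes the outcome depend on the order of operations.

```python
def apply_ops(balance, ops):
    # Plain credits and debits: addition commutes, so any order
    # yields the same final balance.
    for op in ops:
        balance += op
    return balance

def apply_ops_with_fee(balance, ops, fee=10):
    # Charging a fee whenever the balance goes negative depends on
    # the order of operations, so the set is no longer commutative.
    for op in ops:
        balance += op
        if balance < 0:
            balance -= fee
    return balance

ops = [+100, -30, +50]
same = apply_ops(0, ops) == apply_ops(0, list(reversed(ops)))
order_matters = (apply_ops_with_fee(0, [-30, +100])
                 != apply_ops_with_fee(0, [+100, -30]))
```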

CRDTs

The acronym for conflict-free replicated data types. A collection of data structures designed to run on systems with weak CAP consistency, often across multiple data centers. They leverage commutativity and monotonicity to achieve strong eventual consistency in a replicated state. Compared to strongly consistent structures, CRDTs offer weaker guarantees, additional complexity, and can require additional space. However, they remain available for writes during network partitions that would cause strongly consistent systems to stop processing.
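As a toy illustration (a Python sketch of one well-known CRDT, the grow-only counter; not from the book): each replica increments only its own slot, and merging takes the element-wise maximum. Because max is commutative, associative, and idempotent, replicas converge no matter how often or in what order they merge.

```python
class GCounter:
    """Grow-only counter CRDT (toy sketch)."""

    def __init__(self, replica_id):
        self.replica_id = replica_id
        self.counts = {}  # one slot per replica

    def increment(self, n=1):
        # A replica only ever increments its own slot.
        self.counts[self.replica_id] = self.counts.get(self.replica_id, 0) + n

    def merge(self, other):
        # Element-wise max: commutative, associative, idempotent.
        for rid, c in other.counts.items():
            self.counts[rid] = max(self.counts.get(rid, 0), c)

    def value(self):
        return sum(self.counts.values())
```

Merging in either direction (or repeatedly) leaves both replicas with the same total, which is the "strong eventual consistency" the definition refers to.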

dashboard

A graphical presentation of indicators that report on certain processes or services. Mainly used for monitoring critical activities.

data feed

An automated mechanism used to retrieve updates from a source of information. The data source must be structured to read data in a generic way.

DBMS

The acronym for database management system. A software system used to create and manage databases. It provides mechanisms to create, modify, retrieve, and manage databases.

determinism

In data management, a deterministic operation always has the same result given a particular input and state. Determinism is important in replication. A deterministic operation can be applied to two replicas, assuming the results will match. Determinism is also useful in log replay. Performing the same set of deterministic operations a second time will give the same result.
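For instance (a hypothetical Python sketch): replaying a log of deterministic operations on two replicas yields identical state, whereas an operation that draws randomness may cause replicas to diverge.

```python
import random

def deterministic_debit(balance, amount):
    # Same input and state always produce the same result:
    # safe to apply on two replicas or to replay from a log.
    return balance - amount

def nondeterministic_debit(balance):
    # The result depends on a random draw, so replaying this
    # operation on another replica may produce a different state.
    return balance - random.randint(1, 10)

log = [10, 20, 5]
replica1 = 100
replica2 = 100
for amount in log:
    replica1 = deterministic_debit(replica1, amount)
for amount in log:
    replica2 = deterministic_debit(replica2, amount)
# replica1 == replica2: log replay is reliable for deterministic ops
```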

dimension data

Infrequently changing data that expands upon data in fact tables or event records. For example, dimension data may include products for sale, current customers, and current salespeople. The record of a particular order might reference rows from these tables so as not to duplicate data. Dimension data not only saves space, but it also allows a product to be renamed and have that new name instantly reflected in all open orders. Dimensional schemas also allow the easy filtering, grouping, and labeling of data. In data warehousing, a single fact table, a table storing a record of facts or events, combined with many dimension tables full of dimension data, is referred to as a star schema.

distributed computing

A physical and logical model that allows communication between computers distributed across a network. Its goal is to make those computers behave as a single computer, thereby improving resource utilization. This is a complex issue in the world of computer science.

driver

In a general sense, a driver is a connection between two heterogeneous pieces of hardware or software. A driver connects the software of two separate systems and provides an interface that allows interaction between them.

ETL

An acronym for extract, transform, load. The traditional sequence by which data is loaded into a database. Fast data pipelines may either compress this sequence, or perform analysis on or in response to incoming data before it is loaded into the long-term data store.

exabyte

(EB) Equivalent to 1024^6 bytes.

exponential backoff

A way to manage contention during failure. During failure, many clients try to reconnect at the same time, overloading the recovering system. Exponential backoff is a strategy of exponentially increasing the timeouts between retries on failure. If an operation fails, wait one second to retry. If that retry fails, wait two seconds, then four seconds, and so forth. This allows simple one-off failures to recover quickly, but for more complex failures, there will eventually be a load low enough to successfully recover. Often the growing timeouts are capped at some large number to bound recovery times, such as 16 seconds or 32 seconds.
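The timeout sequence described above can be sketched as follows (hypothetical Python; the base delay and cap are illustrative assumptions):

```python
def backoff_delays(max_retries, base=1.0, cap=32.0):
    # The timeout doubles after each failed retry: 1s, 2s, 4s, ...,
    # capped at `cap` seconds to bound recovery time.
    return [min(base * 2 ** attempt, cap) for attempt in range(max_retries)]

# A retry loop would sleep for each delay before reconnecting, e.g.:
#   for delay in backoff_delays(7):
#       if try_reconnect():
#           break
#       time.sleep(delay)
```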

failover

Also known as fault tolerance, this is the mechanism by which a system continues operating despite failures.

fast data

The processing of streaming data at real-time velocity, enabling instant analysis, awareness, and action. Fast data is data in motion, streaming into applications and computing environments from hundreds of thousands to millions of endpoints—mobile devices, sensor networks, financial transactions, stock tick feeds, logs, retail systems, telco call routing and authorization systems, and more. Systems and applications designed to take advantage of fast data enable companies to make real-time, per-event decisions that have a direct, real-time impact on business interactions and observations. Fast data operationalizes the knowledge and insights derived from “big data,” enabling in-transaction decisions through streaming analysis of interactions and observations.

gossip

(Protocol) The protocol that Cassandra uses to maintain communication between the nodes that form a cluster. Gossip is designed to spread information quickly between nodes, allowing failures to be detected and overcome rapidly and thus keeping the data reliable.

graph database

In the NoSQL world, a type of data store that uses graph theory to manage data: nodes maintain their relationships through edges, and both nodes and edges carry properties that can be queried and traversed.

HDFS

The acronym for Hadoop Distributed File System. A distributed file system that is scalable and portable. Designed to handle large files and used in conjunction with the TCP/IP and RPC protocols. Originally designed for the Hadoop framework, today it is used by a variety of frameworks.

HTAP

The acronym for Hybrid Transaction Analytical Processing architectures. Enables applications to analyze live data as it is created and updated by transaction processing functions. According to the Gartner 2014 Magic Quadrant, HTAP is described as follows: “…they must use the data from transactions, observations, and interactions in real time for decision processing as part of, not separately from, the transactions.”1

IaaS

The acronym for Infrastructure as a Service. Provides the infrastructure of a data center on demand, including (but not limited to) computing, storage, and networking services. The IaaS user is responsible for maintaining all installed software.

idempotence

An idempotent operation is an operation that has the same effect no matter how many times it is applied. See Chapter 9 for a detailed discussion on idempotence, including an example of idempotent processing.
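A quick contrast in hypothetical Python (separate from the Chapter 9 example): setting a value is idempotent, while incrementing it is not.

```python
state = {}

def set_balance(account, value):
    # Idempotent: applying it once or many times leaves the same
    # state, so duplicate message delivery is harmless.
    state[account] = value

def add_to_balance(account, delta):
    # Not idempotent: each replay changes the result.
    state[account] = state.get(account, 0) + delta

set_balance("acct", 100)
set_balance("acct", 100)    # still 100
add_to_balance("acct", 10)
add_to_balance("acct", 10)  # now 120, not 110
```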

IMDG

The acronym for in-memory data grid. A data structure that resides entirely in RAM and is distributed across multiple servers. It is designed to store large amounts of data.

IoT

The acronym for the Internet of Things. The ability to connect everyday objects with the Internet. These objects generally get real-world information through sensors, which take the information to the Internet domain.

key-value

In the NoSQL world, a paradigm for managing data using associative arrays: data is associated with a key, and the key is the means of access to the value, whether to read, update, or delete it.
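In its simplest form (a Python dict standing in for a key-value store; the key and value shown are purely illustrative):

```python
store = {}

# Create or update: the key is the only access path to the value.
store["user:42"] = {"name": "Ada", "plan": "pro"}

# Read, update, and delete all go through the same key.
profile = store.get("user:42")
profile["plan"] = "basic"
del store["user:42"]
```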

keyspace

In Apache Cassandra, a keyspace is a logical grouping of column families. Given the similarities between Cassandra and an RDBMS, think of a keyspace as a database.

latency

(Net) The time interval that occurs between the source (send) and the destination (receive). Communication networks require physical devices, which generate the physical reasons for this “delay.”

master-slave

A communication model in which multiple nodes (slaves) depend on a master node for data or processing. Usually, this communication requires that the slaves have a driver installed to communicate with the master.

metadata

Data that describes other data. Metadata summarizes basic information about data, which makes finding and working with particular instances of data easier.

NoSQL

Data management systems that (unlike RDBMS systems) do not use a fixed schema, handle non-relational data, and are "cluster friendly," and are therefore less strict when managing data, which allows better performance.

operational analytics

(Another term for operational BI). The process of developing optimal or realistic recommendations for real-time, operational decisions based on insights derived through the application of statistical models and analysis against existing and/or simulated future data, and applying these recommendations to real-time interactions. Operational database management systems (also referred to as OLTP, or online transaction processing databases) are used to manage dynamic data in real time. These types of databases allow you to do more than simply view archived data; they allow you to modify that data (add, change, or delete) in real time.

RDBMS

The acronym for relational database management system. A particular type of DBMS that is based on the relational model. It is currently the most widely used model in production environments.

real-time analytics

An overloaded term. Depending on context, “real time” means different things. For example, in many OLAP use cases, “real time” can mean minutes or hours; in fast data use cases, it may mean milliseconds. In one sense, “real time” implies that analytics can be computed while a human waits. That is, answers can be computed while a human waits for a web dashboard or a report to compute and redraw. “Real time” also may imply that analytics can be done in time to take some immediate action. For example, when someone uses too much of their mobile data plan allowance, a real-time analytics system notices this and triggers a text message to be sent to that user. Finally, “real time” may imply that analytics can be computed in time for a machine to take action. This kind of real time is popular in fraud detection or policy enforcement. The analysis is done between the time a credit or debit card is swiped and the transaction is approved.

replication

(Data) The mechanism for sharing information with the aim of creating redundancy between different components. In a cluster, data replication is used to maintain consistent information.

PaaS

The acronym for Platform as a Service. Offers integration with other systems or development platforms, which provides a reduction in development time.

probabilistic data structures

Probabilistic data structures are data structures that have a probabilistic component. In other words, there is a statistically bounded probability for correctness (as in Bloom filters). In many probabilistic data structures, the access time or storage can be an order of magnitude smaller than an equivalent non-probabilistic data structure. The price for this savings is the chance that a given value may be incorrect, or it may be impossible to determine the exact shape or size of a given data structure. However, in many cases, these inconsistencies are either allowable or can trigger a broader, slower search on a complete data structure. This hybrid approach allows many of the benefits of using probability, and also can ensure correctness of values.
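A Bloom filter, mentioned above, is the canonical example. This toy Python sketch (sizes and hash count chosen arbitrarily for illustration) answers membership queries with no false negatives but a small chance of false positives, using far less space than storing the full set:

```python
import hashlib

class BloomFilter:
    """Toy Bloom filter: probabilistic set-membership test."""

    def __init__(self, size=1024, hashes=3):
        self.size = size
        self.hashes = hashes
        self.bits = [False] * size

    def _positions(self, item):
        # Derive `hashes` bit positions from salted SHA-256 digests.
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = True

    def might_contain(self, item):
        # False means definitely absent; True means probably present.
        return all(self.bits[pos] for pos in self._positions(item))
```

On a miss, a system can fall back to the slower, complete data structure, which is the hybrid approach the definition describes.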

SaaS

The acronym for Software as a Service. Allows the use of cloud-hosted applications, typically accessed through a web browser. Its main advantages are reduced initial and maintenance costs, allowing a company to focus on its business rather than on hardware and software issues.

scalability

A system's ability to adapt stably to continued growth; that is, without interfering with the availability and quality of the services or tasks offered.

shared nothing

A distributed computing architecture in which each node is independent and self-sufficient. There is no single point of contention across the system. More specifically, none of the nodes share memory or disk storage.

Spark-Cassandra Connector

A connector that allows a Spark execution context to access an existing keyspace on a Cassandra server.

streaming analytics

Streaming analytics platforms can filter, aggregate, enrich, and analyze high-throughput data from multiple disparate live data sources in any data format to identify simple and complex patterns, visualize the business in real time, detect urgent situations, and automate immediate actions. Streaming operators include filter, aggregate, geo, time windows, temporal patterns, and enrich.

synchronization

Data synchronization. In a cluster that consists of multiple nodes, you must keep data synchronized to achieve availability and reliability.

unstructured data

Any information that is not generated from a model or scheme or is not organized in a predefined manner.

Footnotes

1 Gartner, Inc., “Hybrid Transaction/Analytical Processing Will Foster Opportunities for Dramatic Business Innovation,” January 2014, https://www.gartner.com/doc/2657815/hybrid-transactionanalytical-processing-foster-opportunities .
