This glossary of terms and concepts aids in understanding the SMACK stack.
ACID
The acronym for Atomic, Consistent, Isolated, and Durable. (See Chapter 9.)
agent
A software component that resides within another, much larger, software component. An agent can access the context of the component and execute tasks. It works automatically and is typically used to execute tasks remotely. It is an extension of a software program customized to perform tasks.
API
The acronym for application programming interface. A set of instructions, statements, or commands that allow certain software components to interact or integrate with one another.
BI
The acronym for business intelligence. In general, the set of techniques that allow software components to group, filter, debug, and transform large amounts of data with the aim of improving business processes.
big data
The volume and variety of information collected. Big data is an evolving term that describes any large amount of structured, semi-structured, and unstructured data that has the potential to be mined for information. Although big data doesn’t refer to any specific quantity, the term is often used when speaking about petabytes and exabytes of data. Big data systems facilitate the exploration and analysis of large data sets.
CAP
The acronym for Consistent, Available, and Partition Tolerant. (See Chapter 9.)
CEP
The acronym for complex event processing. A technique used to analyze data streams steadily. Each flow of information is analyzed and generates events; in turn, these events are used to initiate other processes at higher levels of abstraction within a workflow/service.
client-server
An application execution paradigm, formed by two components, that enables distributed environments: a server, which receives and processes requests, and clients, which issue them. For each request received, the server is committed to returning a response.
cloud
Systems that are accessed remotely; mainly hosted on the Internet. They are generally administrated by third parties.
cluster
A set of computers working together through a software component. Computers that are part of the cluster are referred to as nodes. Clusters are a fundamental part of a distributed system; they maintain the availability of data.
column family
In the NoSQL world, a paradigm for managing data using tuples in which a key is linked to a value and a timestamp. It handles larger units of information than the key-value paradigm.
coordinator
In concurrent scenarios, the coordinator is a cornerstone. The coordinator is tasked with distributing the operations to be performed and ensuring their execution. It also manages any errors that arise in the process.
CQL
The acronym for Cassandra Query Language. A statements-based language very similar to SQL in that it uses SELECT, INSERT, UPDATE, and DELETE statements. This similarity allows quick adoption of the language and increases productivity.
cqlsh
The command-line shell that ships with Cassandra for running CQL statements.
concurrency
In general, the ability to run multiple tasks. In the world of computer science, it refers to the ability to decompose a task into smaller units so that you can run them separately, then join these isolated units to complete the total task.
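A minimal sketch of this idea (the function and chunk size are hypothetical, chosen only for illustration): a summation task is decomposed into smaller units, the units run separately, and their results are joined to produce the total.

```python
# Decompose a task into smaller units, run them separately, and join
# the partial results into the result of the total task.
from concurrent.futures import ThreadPoolExecutor

def concurrent_sum(numbers, chunk_size=3):
    # Decompose: split the input into independent chunks.
    chunks = [numbers[i:i + chunk_size]
              for i in range(0, len(numbers), chunk_size)]
    with ThreadPoolExecutor() as pool:
        # Run the units separately.
        partial_sums = list(pool.map(sum, chunks))
    # Join: the union of the isolated tasks is the total result.
    return sum(partial_sums)

print(concurrent_sum(list(range(10))))  # prints 45
```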
commutative operations
A set of operations are said to be commutative if they can be applied in any order without affecting the ending state. For example, a list of account credits and debits is considered commutative because any ordering leads to the same account balance. If there is an operation in the set that checks for a negative balance and charges a fee, however, then the order in which the operations are applied does matter, so it is not commutative.
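The account example above can be sketched in a few lines of Python (the amounts and the fee rule are hypothetical): plain credits and debits yield the same balance in every order, while adding an overdraft fee makes the outcome order-dependent.

```python
from itertools import permutations

def apply_ops(balance, ops):
    # Plain credits/debits: addition commutes, so order is irrelevant.
    for op in ops:
        balance += op
    return balance

ops = [+100, -30, +50]
balances = {apply_ops(0, p) for p in permutations(ops)}
assert balances == {120}  # every ordering gives the same balance

def apply_with_fee(balance, ops):
    # Charging a fee on a negative balance depends on intermediate state.
    for op in ops:
        balance += op
        if balance < 0:
            balance -= 25  # overdraft fee
    return balance

balances = {apply_with_fee(0, p) for p in permutations(ops)}
assert len(balances) > 1  # order now matters: not commutative
```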
CRDTs
The acronym for conflict-free replicated data types. A collection of data structures designed to run on systems with weak CAP consistency, often across multiple data centers. They leverage commutativity and monotonicity to achieve strong eventual guarantees in a replicated state. Compared to strongly consistent structures, CRDTs offer weaker guarantees and additional complexity, and they can require additional space. However, they remain available for writes during network partitions that would cause strongly consistent systems to stop processing.
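One of the simplest CRDTs is a grow-only counter (G-Counter); the sketch below is a toy illustration, not a production implementation. Each replica increments only its own slot, and merging takes the per-slot maximum, so merges commute and replicas converge regardless of merge order.

```python
# Toy G-Counter CRDT: per-replica counts, merged by per-slot maximum.
class GCounter:
    def __init__(self, replica_id, counts=None):
        self.replica_id = replica_id
        self.counts = dict(counts or {})

    def increment(self, n=1):
        # A replica only ever increments its own slot (monotonic growth).
        self.counts[self.replica_id] = self.counts.get(self.replica_id, 0) + n

    def merge(self, other):
        # Per-slot max is commutative, associative, and idempotent.
        merged = dict(self.counts)
        for rid, n in other.counts.items():
            merged[rid] = max(merged.get(rid, 0), n)
        return GCounter(self.replica_id, merged)

    def value(self):
        return sum(self.counts.values())

a, b = GCounter("a"), GCounter("b")
a.increment(3)
b.increment(2)
# Merging in either order converges to the same total.
assert a.merge(b).value() == b.merge(a).value() == 5
```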
dashboard
A graphical way for indicators to report certain processes or services. Mainly used for monitoring critical activities.
data feed
An automated mechanism used to retrieve updates from a source of information. The data source must be structured to read data in a generic way.
DBMS
The acronym for database management system. A software system used to create and manage databases. It provides mechanisms to create, modify, retrieve, and manage databases.
determinism
In data management, a deterministic operation always has the same result given a particular input and state. Determinism is important in replication. A deterministic operation can be applied to two replicas, assuming the results will match. Determinism is also useful in log replay. Performing the same set of deterministic operations a second time will give the same result.
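A small sketch of why determinism matters for replication (the function names and amounts are hypothetical): a deterministic operation applied to two replicas yields matching state, while a nondeterministic one may cause replicas to diverge.

```python
import random

def deterministic_apply(state, amount):
    # Same input and state always produce the same result,
    # so the operation is safe to replicate or replay from a log.
    return state + amount

replica1 = deterministic_apply(100, 5)
replica2 = deterministic_apply(100, 5)
assert replica1 == replica2 == 105

def nondeterministic_apply(state):
    # Each replica may compute a different value: unsafe to replicate
    # as an operation; only its *result* can be replicated safely.
    return state + random.randint(0, 10)
```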
dimension data
Infrequently changing data that expands upon data in fact tables or event records. For example, dimension data may include products for sale, current customers, and current salespeople. The record of a particular order might reference rows from these tables so as not to duplicate data. Dimension data not only saves space, but it also allows a product to be renamed and have that new name instantly reflected in all open orders. Dimensional schemas also allow the easy filtering, grouping, and labeling of data. In data warehousing, a single fact table, a table storing a record of facts or events, combined with many dimension tables full of dimension data, is referred to as a star schema.
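The rename behavior described above can be illustrated with two tiny in-memory tables (the products and orders are hypothetical): fact rows reference dimension rows by key, so one dimension update is instantly visible through every fact row.

```python
# Dimension table: infrequently changing reference data, keyed by id.
products = {1: {"name": "Widget"}, 2: {"name": "Gadget"}}

# Fact table: event records reference dimension keys instead of
# duplicating the product name in every row.
orders = [
    {"order_id": 10, "product_id": 1, "qty": 3},
    {"order_id": 11, "product_id": 1, "qty": 1},
]

# Renaming a product is a single dimension update...
products[1]["name"] = "Widget Pro"

# ...instantly reflected in all open orders that reference it.
names = [products[o["product_id"]]["name"] for o in orders]
assert names == ["Widget Pro", "Widget Pro"]
```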
distributed computing
A physical and logical model that allows communication between computers distributed across a network. Its goal is to present the networked computers as a single coherent system, thus achieving better resource utilization. This is a complex issue in the world of computer science.
driver
In a general sense, a driver is a connection between two heterogeneous pieces of hardware or software. A driver connects the software of two separate systems and provides an interface that allows interaction between them.
ETL
An acronym for extract, transform, load. The traditional sequence by which data is loaded into a database. Fast data pipelines may either compress this sequence, or perform analysis on or in response to incoming data before it is loaded into the long-term data store.
exabyte
(EB) Equivalent to 1024^6 bytes.
exponential backoff
A way to manage contention during failure. During failure, many clients try to reconnect at the same time, overloading the recovering system. Exponential backoff is a strategy of exponentially increasing the timeouts between retries on failure. If an operation fails, wait one second to retry. If that retry fails, wait two seconds, then four seconds, and so forth. This allows simple one-off failures to recover quickly, but for more complex failures, there will eventually be a load low enough to successfully recover. Often the growing timeouts are capped at some large number to bound recovery times, such as 16 seconds or 32 seconds.
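The doubling-with-a-cap schedule described above can be sketched in a few lines (the base and cap values are hypothetical defaults). In practice, random jitter is usually added so that many recovering clients do not retry in lockstep.

```python
import random

def backoff_delays(attempts, base=1.0, cap=32.0):
    """Seconds to wait before each retry: 1, 2, 4, ... capped at `cap`."""
    return [min(cap, base * 2 ** i) for i in range(attempts)]

# One-off failures recover quickly; persistent ones back off to the cap.
assert backoff_delays(7) == [1, 2, 4, 8, 16, 32, 32]

def jittered(delay):
    # Spreading retries over [0, delay) avoids synchronized reconnects.
    return random.uniform(0, delay)
```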
failover
Often discussed alongside fault tolerance, this is the mechanism by which a system switches to a redundant component so that it keeps operating despite a failure.
fast data
The processing of streaming data at real-time velocity, enabling instant analysis, awareness, and action. Fast data is data in motion, streaming into applications and computing environments from hundreds of thousands to millions of endpoints—mobile devices, sensor networks, financial transactions, stock tick feeds, logs, retail systems, telco call routing and authorization systems, and more. Systems and applications designed to take advantage of fast data enable companies to make real-time, per-event decisions that have direct, real-time impact on business interactions and observations. Fast data operationalizes the knowledge and insights derived from “big data” and enables developers to design fast data applications that make real-time, per-event decisions. These decisions may have direct impact on business results through streaming analysis of interactions and observations, which enables in-transaction decisions to be made.
gossip
(Protocol) The protocol that Cassandra uses to maintain communication between nodes that form a cluster. Gossip is designed to quickly spread information between nodes and thereby quickly overcome the failures that occur, thus achieving the reliability of the data.
graph database
In the NoSQL world, a type of data storage that uses graph theory to manage data. This basically means that nodes maintain their relationships through edges; both nodes and edges can carry properties that describe them.
HDFS
The acronym for Hadoop Distributed File System. A distributed file system that is scalable and portable. Designed to handle large files and used in conjunction with the TCP/IP and RPC protocols. Originally designed for the Hadoop framework, today it is used by a variety of frameworks.
HTAP
The acronym for Hybrid Transaction Analytical Processing architectures. Enables applications to analyze live data as it is created and updated by transaction processing functions. According to the Gartner 2014 Magic Quadrant, HTAP is described as follows: “…they must use the data from transactions, observations, and interactions in real time for decision processing as part of, not separately from, the transactions.”1
IaaS
The acronym for Infrastructure as a Service. Provides the infrastructure of a data center on demand. This includes (but is not limited to) computing, storage, and networking services. The IaaS user is responsible for maintaining all installed software.
idempotence
An idempotent operation is an operation that has the same effect no matter how many times it is applied. See Chapter 9 for a detailed discussion on idempotence, including an example of idempotent processing.
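As a quick illustration (the account names and amounts are hypothetical, separate from the Chapter 9 example): setting a value is idempotent, so a retried or duplicated message is harmless, while adding to a value is not.

```python
state = {}

def set_balance(account, amount):
    # Idempotent: applying it once or many times leaves the same state.
    state[account] = amount

def add_to_balance(account, amount):
    # NOT idempotent: each application changes the state further.
    state[account] = state.get(account, 0) + amount

set_balance("acct1", 100)
set_balance("acct1", 100)  # a retry or duplicate changes nothing
assert state["acct1"] == 100

add_to_balance("acct2", 100)
add_to_balance("acct2", 100)  # a duplicate corrupts the balance
assert state["acct2"] == 200
```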
IMDG
The acronym for in-memory data grid. A data structure that resides entirely in RAM and is distributed across multiple servers. It is designed to store large amounts of data.
IoT
The acronym for the Internet of Things. The ability to connect everyday objects with the Internet. These objects generally get real-world information through sensors, which take the information to the Internet domain.
key-value
In the NoSQL world, a paradigm for managing data using associative arrays, in which each value is related to a key. The key is the means of access to the value in order to update or delete it.
keyspace
In Apache Cassandra, a keyspace is a logical grouping of column families. Given the similarities between Cassandra and an RDBMS, think of a keyspace as a database.
latency
(Net) The time interval that occurs between the source (send) and the destination (receive). The physical devices that communication networks require are the physical cause of this “delay.”
master-slave
A communication model that allows multiple nodes (slaves) to maintain the data dependency or processes of a master node (master). Usually, this communication requires that slaves have a driver installed to communicate with the master.
metadata
Data that describes other data. Metadata summarizes basic information about data, which makes finding and working with particular instances of data easier.
NoSQL
Data management systems that (unlike RDBMS systems) do not use a fixed schema, store non-relational data, and are "cluster friendly," and are therefore less strict when managing data. This allows better performance.
operational analytics
(Another term for operational BI). The process of developing optimal or realistic recommendations for real-time, operational decisions based on insights derived through the application of statistical models and analysis against existing and/or simulated future data, and applying these recommendations to real-time interactions. Operational database management systems (also referred to as OLTP, or online transaction processing databases) are used to manage dynamic data in real time. These types of databases allow you to do more than simply view archived data; they allow you to modify that data (add, change, or delete) in real time.
RDBMS
The acronym for relational database management system. A particular type of DBMS that is based on the relational model. It is currently the most widely used model in production environments.
real-time analytics
An overloaded term. Depending on context, “real time” means different things. For example, in many OLAP use cases, “real time” can mean minutes or hours; in fast data use cases, it may mean milliseconds. In one sense, “real time” implies that analytics can be computed while a human waits. That is, answers can be computed while a human waits for a web dashboard or a report to compute and redraw. “Real time” also may imply that analytics can be done in time to take some immediate action. For example, when someone uses too much of their mobile data plan allowance, a real-time analytics system notices this and triggers a text message to be sent to that user. Finally, “real time” may imply that analytics can be computed in time for a machine to take action. This kind of real time is popular in fraud detection or policy enforcement. The analysis is done between the time a credit or debit card is swiped and the transaction is approved.
replication
(Data) The mechanism for sharing information with the aim of creating redundancy between different components. In a cluster, data replication is used to maintain consistent information.
PaaS
The acronym for Platform as a Service. Offers integration with other systems or development platforms, which provides a reduction in development time.
probabilistic data structures
Probabilistic data structures are data structures that have a probabilistic component. In other words, there is a statistically bounded probability for correctness (as in Bloom filters). In many probabilistic data structures, the access time or storage can be an order of magnitude smaller than an equivalent non-probabilistic data structure. The price for this savings is the chance that a given value may be incorrect, or it may be impossible to determine the exact shape or size of a given data structure. However, in many cases, these inconsistencies are either allowable or can trigger a broader, slower search on a complete data structure. This hybrid approach allows many of the benefits of using probability, and also can ensure correctness of values.
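The Bloom filter mentioned above is the classic example; the sketch below is a toy version (the size and hash-count parameters are arbitrary). Membership queries may return false positives but never false negatives, in far less space than storing the items themselves, and a positive answer can trigger the slower, exact search described above.

```python
import hashlib

class BloomFilter:
    """Toy Bloom filter: k hash positions over a fixed-size bit array."""

    def __init__(self, size=1024, hashes=3):
        self.size, self.hashes = size, hashes
        self.bits = [False] * size

    def _positions(self, item):
        # Derive k positions by salting a cryptographic hash.
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = True

    def might_contain(self, item):
        # True means "probably present" (could be a false positive);
        # False means "definitely absent" (no false negatives).
        return all(self.bits[pos] for pos in self._positions(item))

bf = BloomFilter()
bf.add("cassandra")
assert bf.might_contain("cassandra")  # added items are always found
```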
SaaS
The acronym for Software as a Service. Allows the use of hosted cloud applications. These applications are typically accessed through a web browser. Its main advantages are to reduce initial cost and to reduce maintenance costs. It allows a company to focus on their business and not on hardware and software issues.
scalability
The property of a system to adapt stably to continued growth; that is, without interfering with the availability and quality of the services or tasks offered.
shared nothing
A distributed computing architecture in which each node is independent and self-sufficient. There is no single point of contention across the system. More specifically, none of the nodes share memory or disk storage.
Spark-Cassandra Connector
A connector that allows a Spark execution context to access an existing keyspace on a Cassandra server.
streaming analytics
Streaming analytics platforms can filter, aggregate, enrich, and analyze high-throughput data from multiple, disparate live data sources, in any data format, to identify simple and complex patterns, visualize the business in real time, detect urgent situations, and automate immediate actions. Typical streaming operators include filter, aggregate, geo, time windows, temporal patterns, and enrich.
synchronization
Data synchronization. In a cluster that consists of multiple nodes, you must keep data synchronized to achieve availability and reliability.
unstructured data
Any information that is not generated from a model or schema or is not organized in a predefined manner.
Footnotes
1 Gartner, Inc., “Hybrid Transaction/Analytical Processing Will Foster Opportunities for Dramatic Business Innovation,” January 2014, https://www.gartner.com/doc/2657815/hybrid-transactionanalytical-processing-foster-opportunities .