Chapter 3. Moving from Data Silos to Real-Time Data Pipelines

Providing a modern user experience at scale requires a streamlined data processing infrastructure. Users expect tailored content, short load times, and information that is always up to date. Applying these same guiding principles to business operations can make them more effective. For example, publishers, advertisers, and retailers can drive higher conversion by targeting display media and recommendations based on users’ history and demographic information. Applications like real-time personalization strain legacy data processing systems, which keep operational and analytical data in separate silos.

The Enterprise Architecture Gap

A traditional data architecture uses an OLTP-optimized database for operational data processing and a separate OLAP-optimized data warehouse for business intelligence and other analytics. In practice, these systems are often very different from one another and likely come from different vendors. Transferring data between them requires an ETL (extract, transform, load) process (Figure 3-1).

Legacy operational databases and data warehouses ingest data differently. In particular, legacy data warehouses cannot efficiently handle one-off inserts and updates. Instead, data must be organized into large batches and loaded all at once. Generally, due to batch size and rate of loading, this is not an online operation and runs overnight or at the end of the week. 

Figure 3-1. Legacy data processing model

The challenge with this approach is that fresh, real-time data does not make it into the analytical database until a batch load runs. Suppose you wanted to build a system that optimizes display advertising performance by selecting ads that have performed well recently. This application has a transactional component, recording each impression and charging the advertiser for it, and an analytical component, running a query that selects candidate ads to show to a user and orders them by some conversion metric over the past x minutes or hours.
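The analytical half of that workload can be sketched in plain Python. This is a hedged illustration only, not MemSQL or Spark code; the `impressions` data, the `top_ads` function, and the 30-minute window are all hypothetical.

```python
from datetime import datetime, timedelta

# Hypothetical impression log: (ad_id, timestamp, clicked)
impressions = [
    ("ad_1", datetime(2015, 6, 1, 12, 0), True),
    ("ad_1", datetime(2015, 6, 1, 12, 5), False),
    ("ad_2", datetime(2015, 6, 1, 12, 6), True),
    ("ad_2", datetime(2015, 6, 1, 12, 7), True),
    ("ad_1", datetime(2015, 6, 1, 11, 0), True),  # outside the window
]

def top_ads(impressions, now, window_minutes=30):
    """Rank ads by click-through rate over the trailing window."""
    cutoff = now - timedelta(minutes=window_minutes)
    stats = {}
    for ad_id, ts, clicked in impressions:
        if ts < cutoff:
            continue  # only recent impressions count
        shown, clicks = stats.get(ad_id, (0, 0))
        stats[ad_id] = (shown + 1, clicks + int(clicked))
    return sorted(stats, key=lambda a: stats[a][1] / stats[a][0], reverse=True)

now = datetime(2015, 6, 1, 12, 10)
ranking = top_ads(impressions, now)  # ad_2 (CTR 1.0) ranks above ad_1 (CTR 0.5)
```

The point of the example is the freshness requirement: if the impression log lags behind by a nightly batch load, the ranking is computed over stale data.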

In a legacy system with data silos, users can only analyze ad impressions that have been loaded into the data warehouse. Moreover, many data warehouses are not designed around the low latency requirements of a real-time application. They are meant more for business analysts to query interactively, rather than computing programmatically generated queries in the time it takes a web page to load.

On the other side, the OLTP database should be able to handle the transactional component, but, depending on the load on the database, probably will not be able to execute the analytical queries simultaneously. Legacy OLTP databases, especially those that use disk as the primary storage medium, are not designed for and generally cannot handle mixed OLTP/OLAP workloads.

This example of real-time display ad optimization demonstrates the fundamental flaw in the legacy data processing model. Both the transactional and analytical components of the application must complete in the time it takes the page to load and, ideally, take into account the most recent data. As long as data remains siloed, this will be very challenging. Instead of silos, modern applications require real-time data pipelines in which even the most recent data is always available for low-latency analytics.

Real-Time Pipelines and Converged Processing

Real-time data pipelines can be implemented in many ways, and they will look different for every business. However, a few fundamental principles must be followed:

  1. Data must be processed and transformed “on the fly” so that, when it reaches a persistent data store, it is immediately available for query.
  2. The operational data store must be able to run analytics with low latency.
  3. The system of record must be converged with the system of insight.
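As a rough illustration of the first principle, a transformation step might parse and enrich each raw event in flight, so the row that lands in the persistent store is already queryable. The message format, field names, and schema below are hypothetical.

```python
import json
from datetime import datetime, timezone

def transform(raw_message):
    """Parse a raw JSON event into a structured, immediately queryable row."""
    event = json.loads(raw_message)
    return {
        "ad_id": event["ad"],
        "user_id": event["user"],
        # Normalize the Unix timestamp to an ISO-8601 string
        "ts": datetime.fromtimestamp(event["t"], tz=timezone.utc).isoformat(),
        "clicked": bool(event.get("clicked", 0)),
    }

row = transform('{"ad": "ad_7", "user": "u42", "t": 1433160000, "clicked": 1}')
```

Because the transformation happens before persistence, no downstream batch job is needed to make the data analyzable.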

On the second point, note that the operational data store need not replace the full functionality of a data warehouse—this may happen, but is not required. However, to enable use cases like the real-time display ad optimization example, it needs to be able to execute more complex queries than traditional OLTP lookups.

One example of a common real-time pipeline configuration is to use Kafka, Spark Streaming, and MemSQL together.

At a high level, Kafka, a message broker, functions as a centralized location from which Spark can read disparate data streams. Spark acts as a transformation layer, processing and enriching data in micro-batches. MemSQL serves as the persistent data store, ingesting processed data from Spark. The advantage of using MemSQL for persistence is twofold:

  1. With its in-memory storage, distributed architecture, and modern data structures, MemSQL enables concurrent transactional and analytical processing.
  2. MemSQL has a SQL interface and the analytical query surface area to support business intelligence.

Because data travels from one end of the pipeline to the other in seconds, analysts have access to the most recent data. Moreover, the pipeline, and MemSQL in particular, enable use cases like real-time display ad optimization. Impression data is queued in Kafka, preprocessed in Spark, then stored and analyzed in MemSQL. As a transactional system, MemSQL can process business transactions (charging advertisers and crediting publishers, for instance) in addition to powering and optimizing the ad platform.
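The flow just described can be mocked end to end with standard-library stand-ins: a deque for the Kafka topic, a micro-batch function for the Spark Streaming job, and sqlite3 for MemSQL's SQL interface. The message format and table schema are hypothetical; this is a sketch of the shape of the pipeline, not of any product's API.

```python
import sqlite3
from collections import deque

# Stand-in for a Kafka topic holding "ad_id,clicked" messages
topic = deque(["ad_1,1", "ad_2,1", "ad_1,0"])

# Stand-in for MemSQL: any SQL-speaking persistent store
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE impressions (ad_id TEXT, clicked INTEGER)")

def process_micro_batch(topic, db, batch_size=10):
    """Drain up to batch_size messages, transform them, and persist them."""
    rows = []
    while topic and len(rows) < batch_size:
        ad_id, clicked = topic.popleft().split(",")
        rows.append((ad_id, int(clicked)))
    db.executemany("INSERT INTO impressions VALUES (?, ?)", rows)
    db.commit()

process_micro_batch(topic, db)

# Analytical query over data that arrived moments ago
best = db.execute(
    "SELECT ad_id, AVG(clicked) AS ctr FROM impressions "
    "GROUP BY ad_id ORDER BY ctr DESC"
).fetchall()
```

The same store that accepts the transactional inserts answers the analytical ranking query, which is the converged-processing property the chapter describes.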

In addition to enabling new applications, and with them new top-line revenue, this kind of pipeline can improve the bottom line as well. Using fewer, more powerful systems can dramatically reduce your hardware footprint and maintenance overhead. Moreover, building a real-time data pipeline can simplify data infrastructure. Instead of managing and attempting to synchronize many different systems, there is a single unified pipeline. This model is conceptually simpler and reduces connection points.

Stream Processing, with Context

Stream processing technology has improved dramatically with the rise of memory-optimized data processing tools. While leading stream processing systems provide some analytics capabilities, these systems, on their own, do not constitute a full pipeline. Stream processing tools are intended to be temporary data stores, ingesting and holding only an hour’s or day’s worth of data at a time. If the system provides a query interface, it only gives access to this window of data and does not give the ability to analyze the data in a broader historical context. In addition, if you don’t know exactly what you’re looking for, it can be difficult to extract value from streaming data. With a pure stream processing system, there is only one chance to analyze data as it flies by (see Figure 3-2).
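The windowing limitation can be pictured as a bounded buffer: once an event falls out of the window, a pure stream processor can no longer answer queries about it. The window size of three events below is an arbitrary stand-in for "an hour's or day's worth of data."

```python
from collections import deque

# A pure stream processor retains only a bounded window of recent events;
# anything older is evicted and cannot be queried again.
window = deque(maxlen=3)
for event in ["e1", "e2", "e3", "e4", "e5"]:
    window.append(event)

recent = list(window)  # only the last three events survive; e1 and e2 are gone
```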

Figure 3-2. Availability of data in stream processing engine versus database

To provide access to real-time and historical data in a single system, some businesses employ distributed, high-throughput NoSQL data stores for “complex event processing” (CEP). These data stores can ingest streaming data and provide some query functionality. However, NoSQL stores provide limited analytic functionality, omitting common RDBMS features like joins, which give a user the ability to combine information from multiple tables. To execute even basic business intelligence queries, data must be transferred to another system with greater query surface area.
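A join is a one-line statement in SQL but typically must be emulated in application code on a join-less NoSQL store. The sketch below uses sqlite3 purely as a stand-in for a relational interface; the tables and rows are hypothetical.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE impressions (ad_id TEXT, clicked INTEGER)")
db.execute("CREATE TABLE ads (ad_id TEXT, advertiser TEXT)")
db.executemany("INSERT INTO impressions VALUES (?, ?)",
               [("ad_1", 1), ("ad_1", 0), ("ad_2", 1)])
db.executemany("INSERT INTO ads VALUES (?, ?)",
               [("ad_1", "acme"), ("ad_2", "globex")])

# The join combines the impression stream with advertiser metadata --
# a basic BI query that a join-less store cannot express in one step.
result = db.execute(
    "SELECT a.advertiser, SUM(i.clicked) FROM impressions i "
    "JOIN ads a ON i.ad_id = a.ad_id "
    "GROUP BY a.advertiser ORDER BY a.advertiser"
).fetchall()
```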

The NoSQL CEP approach presents another challenge in that it trades data structure for speed. Ingesting data as is, without a schema, makes querying the data and extracting value from it much harder. A more sophisticated approach is to structure data before it lands in a persistent data store, so that by the time data reaches the end of the pipeline, it is already in a queryable format.

Conclusion

There is more to the notion of a real-time data pipeline than “what we had before but faster.” Rather, the shift from data silos to pipelines represents a shift in thinking about business opportunities. More than just being faster, a real-time data pipeline eliminates the distinction between real-time and historical data, such that analytics can inform business operations in real time.
