Chapter 11. The Future of Data Processing for Artificial Intelligence

As discussed in prior chapters, one major thread in designing machine learning (ML) pipelines and data processing systems more generally is pushing computation “down” whenever possible.

Data Warehouses Support More and More ML Primitives

Increasingly, databases provide the foundations to implement ML algorithms that run efficiently. As databases begin to support different data models and offer built-in operators and functions that are required for training and scoring, more machine learning computation can execute directly in the database.

Expressing More of ML Models in SQL, Pushing More Computation to the Database

As database management software grows more versatile and sophisticated, it has become feasible to push more and more ML computation to the database. As discussed in Chapter 7, modern databases already offer tools for efficiently storing and manipulating vectors. When data is stored in a database, there is no faster way to manipulate it for ML than with single instruction, multiple data (SIMD) vector operations applied directly where the data resides. This approach eliminates both data transfer and the computation needed to convert between data types, and it executes extremely efficiently using low-level vector instructions.
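To get a feel for why vectorized execution matters, here is a small illustrative sketch in Python/NumPy. It is not database code; it simply contrasts a single vectorized operation (which libraries and query engines can dispatch to SIMD-capable routines) with an element-by-element loop over the same data.

```python
import numpy as np

# Two feature vectors, as a database might store them in a vector column.
a = np.array([1.0, 2.0, 3.0, 4.0], dtype=np.float32)
b = np.array([4.0, 3.0, 2.0, 1.0], dtype=np.float32)

# One vectorized call: the library operates on many elements per
# instruction instead of one at a time.
dot = float(np.dot(a, b))

# The scalar equivalent touches each element individually.
dot_scalar = sum(float(x) * float(y) for x, y in zip(a, b))

assert dot == dot_scalar  # same result; the vectorized path is far faster at scale
```

At this toy size the difference is invisible, but over millions of rows the vectorized path avoids per-element interpreter and dispatch overhead entirely.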

When designing an analytics application, do not assume that all computation needs to happen directly in your application. Rather, think of your application as the top-level actor, delegating as much computation as possible to the database.
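As a concrete sketch of this delegation, the following Python example uses the standard-library `sqlite3` module (standing in for a production database) to push an aggregation into the database. The table and column names are hypothetical; the point is that only the small summary result crosses the application boundary, not every underlying row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user_id INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [(1, 10.0), (1, 5.0), (2, 7.5)],
)

# Delegate the aggregation to the database: two summary rows come back,
# rather than every event row being transferred and summed in the app.
rows = conn.execute(
    "SELECT user_id, SUM(amount) FROM events GROUP BY user_id ORDER BY user_id"
).fetchall()
```

The same principle scales up: the more filtering, joining, and aggregating the database performs, the less data your application has to move and process.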

External ML Libraries/Frameworks Could Push Down Computation

When choosing analytics tools like ML libraries or business intelligence (BI) software, one thing to look for is the ability to “push down” parts of a query to the underlying database. A good example of this is a JOIN in a SQL query. Depending on the type of JOIN, the result set may contain significantly fewer rows than were scanned to produce it. By performing the join in the database, you avoid transferring arbitrarily many rows that would simply be thrown out for failing the join condition (see Figures 11-1 and 11-2).

Figure 11-1. An ML library pushing JOIN to a distributed database
Figure 11-2. Efficiently combining data for a distributed join
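The join-pushdown idea can be demonstrated with the standard-library `sqlite3` module; the tables here are hypothetical stand-ins for a distributed warehouse. An inner JOIN executed in the database returns only matching rows, so non-matching rows never leave it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("CREATE TABLE orders (user_id INTEGER, total REAL)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?)",
    [(1, "ada"), (2, "bob"), (3, "cyd")],
)
conn.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 20.0), (1, 5.0)])

# The inner join runs in the database: both tables are scanned there,
# but only rows satisfying the join condition are returned.
matched = conn.execute(
    "SELECT u.name, o.total FROM users u JOIN orders o ON u.id = o.user_id"
).fetchall()
# Users 2 and 3 have no orders, so their rows never cross the boundary.
```

A library that instead fetched both tables and joined them client-side would transfer all five rows to produce the same two-row result; with real table sizes, that gap becomes enormous.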

ML in Distributed Systems

There are several ways in which distributed data processing can expedite ML. For one, some ML algorithms can be implemented to execute in parallel, though only some ML libraries provide interfaces for distributed computing. One of the benefits of expressing all or part of a model in SQL and using a distributed database is that you get parallelized execution for “free.” For instance, if run on a modern distributed SQL database, the sample query in Chapter 7 that trains a linear regression model is naturally parallelized. A database with a good query optimizer will send queries out to the worker machines, which compute the linear functions over the data on each machine and send back the results of the computation rather than all of the underlying data. In effect, each worker machine trains the model on its own data, and each of these intermediate models is weighted and combined into a single, unified model.
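The train-locally-then-combine step can be sketched in Python/NumPy. This is a toy simulation of what a distributed query engine would do internally, not actual database code, and the size-weighted average of per-partition least-squares fits is a deliberate simplification of how real systems merge partial results.

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_ols(X, y):
    """Ordinary least squares on one partition: minimizes ||X @ beta - y||."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

# Synthetic data following y = 1 + 2*x, split across three "workers".
n = 300
X = np.column_stack([np.ones(n), rng.uniform(0, 10, n)])
y = X @ np.array([1.0, 2.0]) + rng.normal(0, 0.1, n)
partitions = np.array_split(np.arange(n), 3)

# Each worker fits a model on its local partition of the data...
local_models = [fit_ols(X[idx], y[idx]) for idx in partitions]

# ...and the coordinator combines the small coefficient vectors,
# weighting each by its partition size, instead of moving raw rows.
weights = np.array([len(idx) for idx in partitions], dtype=float)
combined = np.average(local_models, axis=0, weights=weights)
# combined is close to the true coefficients [1.0, 2.0]
```

Note what crosses the network: three two-element coefficient vectors rather than 300 rows of training data, which is exactly the economy the chapter describes.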

Toward Intelligent, Dynamic ML Systems

Maximizing value from ML applications hinges not only on having good models, but on having a system in which the models can continuously be made better. The reason to employ data scientists is because there is no such thing as a self-contained and complete ML solution. In the same way that the work at a growing business is never done, intelligent companies are always improving their analytics infrastructure.

The days of single-purpose infrastructure and narrowly defined organizational roles are over. In practice, most people working with data play many roles. In the real world, data scientists write software, software engineers administer systems, and systems administrators collect and analyze data (one might even claim that they do so “scientifically”).

The need for cross-functionality puts a premium on choosing tools that are powerful but familiar. Most data scientists and software engineers do not know how to optimize a query for execution in a distributed system, but they all know how to write a SQL query.

Along the same lines, collaboration between data scientists, engineers, and systems administrators will be smoothest when the data processing architecture as a whole is kept simple wherever possible, enabling ML models to go from development to production faster. Not only will models become more sophisticated and accurate, but businesses will also be able to extract more value from them through faster training and more frequent deployment. It is an exciting time for businesses that take advantage of the ML and distributed data processing technology that already exists, waiting to be harnessed.
