2

DATA

From Maximum to Minimum, and Back Again

The notion of leveraging something in business—capital, talent, research, technology, a process—is so frequently invoked that, like many familiar metaphors, it has lost its expressive power. “To leverage” has come to mean little more than “to use advantageously.” But it’s worth recalling the full power of the metaphor—what leverage really means is having a small number of resources yield a high level of return.

Nowhere is that dynamic more apt than with the use of data in AI as it travels from the “edge” of a local device to the cloud and sometimes back again to a device.

Data-hungry AI can certainly deliver some degree of leverage in an organization. As voracious and expensive as it might be, it can still yield higher return than alternatives (like manual processes). But recent developments in “small data” leverage AI at its source, processing data locally for AI uses at the “edge”—directly on the smartphones, sensors, autonomous vehicles, drones, and potentially any other device with a robust-enough microchip that generates streams of data. All this small data adds up to something big: IDC estimates that connected IoT devices such as these will collectively generate 73 zettabytes (or 73 trillion gigabytes) of data by 2025, quadrupling in size from 2019.1

Using small data on the edge to power big data insights can yield higher returns on AI performance itself, using far fewer resources of training time, computational power, infrastructure, people, and money. This data-leveraged AI then brings the leverage of AI generally to the organization. Mastering the art of marrying small data, big data, and AI could help make the competitive difference for many organizations, especially those finding themselves in a data arms race they’re unlikely to win.

The Trouble with Maximum Data

Starsky Robotics looked to be one of the emerging success stories in autonomous vehicles. The company’s proprietary AI system guided big-rig trucks over highways and on local roads and allowed a human driver, working from a remote operations center, to take over. Starsky was no pie-in-the-sky venture built on a vague possibility of one day disrupting an industry. The company actually owned assets, employed former truckers as remote operators, and was pursuing a practical business model that seemed to put it commercially ahead of ambitious rivals like Waymo and Uber.

Starsky also enjoyed a string of increasingly impressive accomplishments: In 2016, its self-driving truck became the first street-legal vehicle to do real work with no one behind the wheel. In 2018, it completed a seven-mile trip on a closed Florida road without a human in the cab—the first fully unmanned truck run in history. In 2019, it became the first unmanned truck to drive on a public highway. It navigated a rest stop, merged onto the Florida Turnpike, and covered a 9.4-mile stretch of road at an average speed of 55 mph. Only the first and last segments of the trip, about two-tenths of a mile, were handled by a remote operator, sitting 200 miles away in Jacksonville.

In 2020, the company closed its doors.

What went wrong? In a wrenching farewell blog post, Starsky founder Stefan Seltz-Axmacher detailed what he called the many problems that plague the autonomous vehicle industry. The biggest problem: supervised machine learning doesn’t live up to the hype—especially when dealing with unusual situations (also known as “edge cases,” not to be confused with “edge AI”). “It’s widely understood that the hardest part of building AI is how it deals with situations that happen uncommonly,” wrote the Starsky founder. “In fact, the better your model, the harder it is to find robust datasets of novel edge cases. Additionally, the better your model, the more accurate the data you need to improve it. Rather than seeing exponential improvements in the quality of AI performance (à la Moore’s Law), we’re instead seeing exponential increases in the cost to improve AI systems.”2 It was a classic instance of diminishing returns. The number of training examples increased exponentially, but accuracy increased only linearly.

Unlike infants, who can learn and extrapolate from a single instance of a phenomenon, much AI runs on algorithms that must be trained on mountains of data. Driverless vehicles, like the Starsky truck, are trained on as many traffic situations as possible. (Human drivers, to receive a license, typically undergo only twenty to thirty hours of classroom instruction and a mere dozen or so hours of experience behind the wheel—and yet are able to react instantly to one-of-a-kind situations.) The first computer program to defeat a professional player in the ancient board game Go was trained on 30 million games.3 AIs that diagnose diseases can do so because they’ve been fed data on how often millions of other people with the same set of symptoms have contracted a particular ailment.

These deep learning and supervised machine learning approaches to AI, requiring massive amounts of data to train and maintain, are now running up against some serious limitations, including:

Lack of Existing Big Datasets for AI

There is “Big Data” and there are “big datasets”—and it’s important to distinguish between the two. Big Data refers to statistical analytic techniques brought to bear on huge volumes of data in order to extract patterns. A dataset is data that has been organized into some type of data structure (e.g., a collection of names and contact information). Big sets of training data in machine learning—sets like ImageNet, with its more than 14 million hand-labeled images—enable the machine to teach itself. But for most AI problems, there are no existing big datasets on the order of ImageNet. That means the data must be laboriously gathered and labeled before any novel application can even be considered—an expensive and time-consuming process. And over time, as the focus of specific AI applications shifts, training sets may gradually lose relevance.

Similarly, for most business problems or opportunities that exist, there are no big datasets “cleansed” and ready for use by AI systems. In fact, in most organizations, the biggest obstacle to comprehensive AI solutions remains noisy, sparse, or incomplete data, much of it semi- or unstructured. This explosion of messy data comes from sources such as log files, call center recordings, videos, social media posts, transactions, and a wide range of devices. For example, Walmart collects 2.5 petabytes of unstructured data (2.5 million gigabytes) from 1 million customers every hour—equivalent to 167 times the number of books in the Library of Congress.4 By 2025, each human being will create an estimated 3.4 exabytes of data per day (3.4 billion gigabytes), mostly through social media, video sharing, and communications.5

To add to the complexity of all this unstructured data, mountains of structured data are often trapped within legacy enterprise data warehouses, replicating organizational silos and thereby limiting data sharing and innovation at scale.

The Need for Ever More Massive Infrastructure

Deep learning is caught in a vicious cycle. The more data at its disposal, the better it can perform its tasks, whatever they might be, from piloting vehicles to diagnosing diseases to classifying objects in images. That requires ever larger neural networks with billions of parameters and massive hardware and computation infrastructure. In fact, says the former head of Intel’s AI Products Group, “The rapid growth in the size of neural networks is outpacing the ability of hardware to keep up.”6

Yet the volume of data—from search engines, social media, enterprise resource planning systems, and countless other sources—continues to grow exponentially. For instance, Facebook trained image recognition networks on large sets of public images with hashtags, the biggest of which included 3.5 billion images and 17,000 hashtags.7 These vast and rapidly expanding troves of data invite the construction of larger and larger AI models that require more and more computational power. Since 2013, the amount of computational power required to train a deep learning model has increased 600,000-fold.8

Astronomically Rising Costs

Researchers at the University of Massachusetts, Amherst, analyzed the costs to train and develop the deep neural networks of several prominent natural language processing (NLP) AIs—the most data-hungry models of which attain the greatest accuracy.9 They found that while training a single model is relatively inexpensive, the cost of tuning a model for a new dataset or performing the full R&D required to develop the model quickly goes through the roof. For instance, they calculated that the cloud computing costs alone for training a model called Transformer, with 213 million parameters and a “neural architecture search” feature, could run anywhere from $1 million to $3 million. That may not seem like much to a big tech company, but for smaller companies and academic researchers it is a prohibitive sum. Continual fine-tuning drives costs even higher.

Increasingly Unattainable Resource Requirements

Deep learning requires resources that lie beyond the reach of many organizations. Only a limited number—Alibaba, Amazon, Apple, Google, Microsoft, and some Global 1000 companies—can keep up. Those that are digital businesses to start with can efficiently harvest training and tuning data on a massive scale. They are already fully staffed with data scientists, computer scientists, and other experts. And they, like other large companies, can afford the cost of massive deep learning systems. Many other enterprises cannot. Further, as deep learning systems grow bigger, research is becoming concentrated in fewer hands, with the required resources far beyond the means of academic labs. Professors or graduate students with a promising new idea for a computationally expensive AI simply cannot afford to test it.

Privacy Concerns

The advent of Big Data and data mining raised privacy challenges that now seem almost quaint. In our age of ubiquitous machine learning, systems can incorporate countless personal datapoints and features in complex combinations, compromising privacy in ways that are difficult to understand, predict, or prevent.

These concerns aren’t theoretical. Countries and municipalities across the globe are using advances in facial recognition and analytics to create pervasive surveillance systems trained on data from billions of citizens. Meanwhile, AI’s more beneficent uses in fields like healthcare, where AI applications have helped save lives, are circumscribed by legal requirements to preserve privacy.

Government regulation, like the EU’s General Data Protection Regulation (GDPR), which went into effect in mid-2018, raises particularly thorny issues for systems that depend on deep learning. For instance, the GDPR requires that individuals who are the subjects of automated decisions based on their personal data can demand an explanation of how a decision was reached—a requirement that deep learning’s notorious “black box” problem is likely to run up against. The Covid-19 pandemic triggered debate about the use of smartphone GPS data to track individuals who may have been exposed to the virus, as well as the privacy implications of contact tracing apps. In the United States, congressional lawmakers have pushed legislation that would prohibit contact tracing without affirmative consent.10

Doing More with Less Data

For every dataset with 1 billion entries, there are 1,000 datasets with 1 million entries, and 1,000,000 datasets with only 1,000 entries.11 By capturing the value of these small datasets, companies can unlock a thousand-fold opportunity, especially when they use them in conjunction with big datasets they can build or buy. Moreover, since 90 percent of the work in data-hungry AI involves data cleansing, normalizing, and wrangling, the use of small data allows AI workers to focus on higher-value tasks and frees big datasets for the most strategic use cases.

Now, a few leading organizations are learning how to leverage smaller, unstructured datasets in ways that competitors can’t match. Instead of assuming that competitive advantage comes only from enormous, carefully cleansed datasets, they are redefining notions of usable data, discovering where it resides, liberating it, and activating it across apps, infrastructure, and silos in their organizations as a new basis of innovation.

For example, in the fall of 2021, Apple announced that Siri will process voice instructions directly on the newest iPhones, using the built-for-AI processor Apple calls a “neural engine,” rather than sending them out to the cloud for processing. Google has been doing the same on its Pixel phones since 2019, transcribing voice without an internet connection. These AI applications still need to be trained on the cloud using the latest data, but eventually scientists expect edge AI systems to learn on their own.12 In the meantime, pioneering companies, along with cutting-edge researchers, are finding innovative ways to balance the complementary nature of big datasets and small datasets. These pioneers are addressing three distinct challenges: (1) when big and noisy data obscures the small subset of relevant, high-quality data needed to train AI; (2) when the amount of data for training AI is small; and (3) when there is no relevant data at all.

Filtering Out the Noise

The e-commerce retailer Wayfair maintains a vast catalog of more than 14 million home goods items, with product categories that range from furniture to storage to lighting to décor and more. Some of the categories include hundreds of thousands of items. That’s great for range of selection but challenging if you’re a customer looking for just the right option among all of those possibilities. And it’s even more challenging for you and the company if you’re a first-time customer with no shopping history on the site. The company doesn’t know your preferences and can’t personalize your experience. So, instead, Wayfair has found a way to make it easy for first-timers to find the products with the broadest appeal.

The problem is much harder than it looks. When the company’s data scientists try to identify the most appealing products, they run into the conundrum of “position effects” created by the site’s sorting algorithm. Products positioned at the top of the first page for a particular category tend to be ordered more frequently, regardless of their inherent appeal to a broad base of customers. In some cases, a product at the top of the page gets ordered twice as often as a more appealing product with less visibility.

To correct for position effects, you can model a product’s inherent appeal as the difference between its order rate and the average order rate of whatever product appears in that position. That calculation yields each product’s historical performance. With enough data, that big dataset approach would suffice. But although the site may handle as many as 9 million orders in a typical quarter, those orders are spread over millions of products, resulting in just a few orders per product. “Small integers like these can be extremely noisy, so we always have to worry that one product simply seems better than another because of random chance,” write two of Wayfair’s data scientists. “For example, it is hard to tell if a product that happened to attract three orders is actually any better than one that happened to attract two, or if it just got lucky.”13

To cut down on the noise, Wayfair added information about customers’ behavior at each stage of a product’s potential order: clicking on the product, adding it to their shopping cart, and ordering it. Because each step depends on the one before it, they can be weighted and multiplied to arrive at a reasonable estimate of the product’s likely order rate. To correct for changing appeal over time, Wayfair incorporates data from a single day’s orders into the model, rather than relying on a massive amount of big data that’s constantly changing. And it also shows products in a range of random positions each day, while making sure that the most appealing products are positioned prominently. Wayfair is now trying to apply these highly data-efficient techniques to other optimization and ranking challenges.
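To make the arithmetic concrete, here is a rough Python sketch of that kind of estimate. The event fields, function names, and exact weighting are our illustration of the general approach, not Wayfair’s actual code.

```python
# Illustrative sketch (not Wayfair's code): estimate a product's appeal from
# sparse, position-biased data by chaining funnel-stage rates and normalizing
# against the average order rate observed at each page position.
from collections import defaultdict

def position_baselines(impressions):
    """Average order rate at each page position, across all products."""
    totals = defaultdict(lambda: [0, 0])            # position -> [orders, views]
    for imp in impressions:                          # imp: {"position", "ordered", ...}
        totals[imp["position"]][0] += imp["ordered"]
        totals[imp["position"]][1] += 1
    return {pos: orders / views for pos, (orders, views) in totals.items()}

def estimated_appeal(product_events, baselines):
    """Chain click -> add-to-cart -> order rates, then correct for position."""
    views = len(product_events)
    clicks = sum(e["clicked"] for e in product_events)
    carts = sum(e["added_to_cart"] for e in product_events)
    orders = sum(e["ordered"] for e in product_events)
    # Each stage conditions on the previous one, so the stage rates multiply.
    expected_order_rate = (clicks / max(views, 1)) * \
                          (carts / max(clicks, 1)) * \
                          (orders / max(carts, 1))
    # Subtract the average order rate for the positions this product occupied.
    avg_baseline = sum(baselines[e["position"]] for e in product_events) / max(views, 1)
    return expected_order_rate - avg_baseline
```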

When Available Training Data Is Inherently Small

Each time a pure deep-learning AI takes on a new task, it must once again undergo training on massive amounts of data. But researchers have made great headway in recent years on techniques that can train for new tasks using far fewer examples (few-shot learning), one example (one-shot), and no examples (zero-shot). These less data-intensive techniques could help ensure that AI innovation isn’t limited to large technology companies.

Few-Shot Learning

Why can children learn the difference between an apple and an orange after seeing just a few examples, while machine learning models might require orders of magnitude more labeled data to reliably identify objects? Data scientists at companies as diverse as a Swedish pizza-snack company and the NFL are using an AI approach known as “few-shot learning” to approximate this exquisitely complex human process.

Dafgårds is a family business in Sweden that has been making popular foods like meatballs and pizza snacks for distribution around the world for more than eighty years. With its popular line of pizza snacks, it needed to make sure each item had the right amount of cheese to meet the high standards of its discerning customers. The company’s IT team of twelve wanted to use a more intelligent and efficient method of quality control, but the team had limited experience in machine learning. So it partnered with Amazon Web Services (AWS) to build a machine learning system to do automated quality inspection.14

Using the Amazon Lookout for Vision service of AWS, customers like Dafgårds can identify defects in industrial processes with as few as thirty images to train the model: ten images with defects or anomalies and another twenty normal images. But a common complication is that modern manufacturing processes are so finely tuned that defect rates run at 1 percent or less, and the defects themselves can be very slight or nuanced. That yields a small dataset for quality control, one that often doesn’t match the reality of what’s happening on the shop floor.

To get around the relative lack of defect data, AWS built a mock factory, complete with a system of conveyor belts and objects of various types, to simulate a range of manufacturing environments. It used trial and error to create synthetic defect datasets by drawing realistic anomalies on normal images of objects, such as missing components, scratches, discolorations, and other effects. This approach to few-shot learning occasionally allowed the team to work with no real images of defects at all, significantly speeding up quality control.
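For a sense of how a trained model like this is used in practice, here is a minimal Python sketch of an inference call to Amazon Lookout for Vision through the AWS SDK (boto3). The project name, model version, and image file are placeholders, and the exact API details should be checked against the current AWS documentation.

```python
# Minimal sketch: ask an already-trained Lookout for Vision model whether an
# item image looks anomalous. Project name and file name are hypothetical.
import boto3

client = boto3.client("lookoutvision")

with open("pizza_snack_0423.jpg", "rb") as image:
    response = client.detect_anomalies(
        ProjectName="cheese-coverage-inspection",   # hypothetical project
        ModelVersion="1",
        ContentType="image/jpeg",
        Body=image.read(),
    )

result = response["DetectAnomalyResult"]
if result["IsAnomalous"]:
    print(f"Defect suspected (confidence {result['Confidence']:.2f})")
else:
    print("Item looks normal")
```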

Or consider how the NFL used computer vision to search more easily and quickly through thousands of video and other media assets from its games and find the small number of relevant images for videos like highlight reels. Tagging all of these assets manually at scale would have been time- and cost-prohibitive. Instead, the NFL’s content creation teams used the Custom Labels feature of Amazon Rekognition, a service for automated image and media analysis, to apply detailed tags for players, teams, numbers, jerseys, locations, events like penalties and injuries, and other metadata to their internal media asset collection. The automated process took a fraction of the time and used a much smaller set of examples than manual tagging previously required.

The system combines Rekognition’s existing big-data training on tens of millions of images across numerous categories to identify objects and scenes with a user-provided small dataset of as few as ten images per category label. The user uploads a small dataset into the system and can start analyzing it with a few clicks. “In today’s media landscape, the volume of unstructured content that organizations manage is growing exponentially,” says Brad Boim, senior director of post-production and asset management at NFL Media. “Using traditional tools, users can have difficulty in searching through the thousands of media assets in order to locate a specific element they are looking for. These tools allow our production teams to leverage this data directly and provide enhanced products to our customers across all of our media platforms.”15
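As a rough illustration of what such automated tagging looks like from a developer’s point of view, the following Python sketch calls a trained Rekognition Custom Labels model through boto3. The project ARN, bucket, file names, and labels are invented for the example; they are not the NFL’s actual configuration.

```python
# Minimal sketch: tag a media frame with a Rekognition Custom Labels model
# that was trained on a small, user-provided dataset. All names are placeholders.
import boto3

rekognition = boto3.client("rekognition")

response = rekognition.detect_custom_labels(
    ProjectVersionArn="arn:aws:rekognition:us-east-1:123456789012:project/"
                      "game-footage/version/game-footage.2021/0000000001",
    Image={"S3Object": {"Bucket": "media-archive", "Name": "week7/frame_1042.jpg"}},
    MinConfidence=80.0,
)

for label in response["CustomLabels"]:
    print(f"{label['Name']}: {label['Confidence']:.1f}%")   # e.g. "penalty-flag: 93.4%"
```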

More Efficient Robot Reasoning

When robots have a conceptual understanding of the world, as humans do, it is easier to teach them things, using far less data. Vicarious, a highly secretive Silicon Valley startup backed by such investors as Peter Thiel, Mark Zuckerberg, Elon Musk, and Jeff Bezos, is working to develop artificial general intelligence for robots, enabling them to generalize from few examples. The company has developed robotic arms that get better at sorting items as they do it.16 These smarter robots have already been put to work assembling product sampler packs for makeup company Sephora, a job that previously had to be done exclusively by humans, owing to the large number of potential combinations of fast-changing SKUs, pouches, boxes, and types of sample packs. The system lowered the cost of this massive combinatorial problem by 80 percent.17 Examples of such combinatorial complexity are common in fast-moving, limitless inventory warehouses like Amazon’s, where items are often unstructured, inconsistently displayed, and constantly shifting places based on demand but still need to be found quickly and with 100 percent accuracy.18

Modeled on features of the neocortex in the human brain, the Vicarious models have several advantages over deep learning approaches that can learn only from big datasets: for example, they are better able to generalize, like the human brain, from a small number of examples, and they are better at handling what are called “adversarial examples,” optical illusions for machines19 that can be used to fool a neural network.20 Dileep George, one of the co-founders, is quick to point out the limitations of deep learning models: they can’t reason, understand causes, or do anything outside their experience. “Just scaling up deep learning is not going to solve those fundamental limitations,” he says. “We’ve made a conscious decision to find and tackle those problems.”21

Research from Vicarious shows how incorporating aspects of human learning to divide and conquer problems using “object factorization” and “sub-goaling” can enable AI systems to infer a high-level concept from an image and then apply it to a diverse array of circumstances. Researchers say such cognitive techniques are beginning to approach the general-learning algorithms of the human mind and can dramatically improve AI performance and explainability for complex problem-solving involving massive numbers of permutations.22

Consider those jumbles of letters and numerals that websites use to determine whether you’re a human or a robot. Called CAPTCHAs (Completely Automated Public Turing tests to tell Computers and Humans Apart), they are easy for humans to solve and hard for computers. Drawing on computational neuroscience (the study of brain function through computer modeling and mathematical analysis), researchers at Vicarious have developed a model that can break through CAPTCHAs at a far higher rate than deep neural networks and with 300-fold more data efficiency.23 To parse CAPTCHAs with almost 67 percent accuracy, the Vicarious model required only five training examples per character, while a state-of-the-art deep neural network required a 50,000-fold larger training set of actual CAPTCHA strings, owing to the millions of combinations of ways letters can be made to appear. Such models, with their ability to train faster and generalize more broadly than AI approaches commonly used today, are putting us on a path toward robots that have a human-like conceptual understanding of the world.

From Few Shots to One Shot

Google’s DeepMind Technologies is using “matching networks,” a neural network architecture that draws on recent advances in attention and memory, to address the challenge of rapidly learning new concepts from as little as one labeled example. It’s the equivalent of a child being shown a single image of a giraffe and being able to identify all giraffes in the future. The researchers devised a novel architecture and training strategy that augments the system with a small “memory matrix,” or a support set of data filled with labeled information helpful to solving a problem. The approach significantly improved the accuracy of identifying images from the gigantic ImageNet and Omniglot datasets with only one labeled example, as compared to competing approaches.24 (The Omniglot dataset, designed for developing more human-like learning algorithms, is an encyclopedia of writing systems and languages that contains 1,623 different handwritten characters from fifty different alphabets.)
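The core mechanism can be sketched in a few lines: classify a new example by attending over a small labeled support set. The toy Python below uses fixed embeddings and cosine-similarity attention; real matching networks learn the embedding functions end to end, so this is only a schematic of the idea, with made-up numbers.

```python
# Toy sketch of the matching-networks idea: a query is classified by attending
# over a tiny labeled "support set," so one labeled example per class suffices.
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def matching_predict(query, support_embeddings, support_labels, num_classes):
    # Cosine similarity between the query and each support example.
    q = query / np.linalg.norm(query)
    s = support_embeddings / np.linalg.norm(support_embeddings, axis=1, keepdims=True)
    attention = softmax(s @ q)                 # one weight per support example
    # Predicted distribution = attention-weighted sum of the support labels.
    probs = np.zeros(num_classes)
    for weight, label in zip(attention, support_labels):
        probs[label] += weight
    return probs.argmax()

# One labeled example per class ("one-shot"): class 0 = giraffe, class 1 = zebra.
support = np.array([[0.9, 0.1, 0.0], [0.1, 0.8, 0.3]])
labels = [0, 1]
print(matching_predict(np.array([0.85, 0.2, 0.05]), support, labels, num_classes=2))  # -> 0
```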

At Samsung’s AI Center in Moscow and the Skolkovo Institute of Science and Technology, engineers and researchers have recently developed a face animation model that thrives on both few-shot and one-shot learning.25 Called “Talking Heads,” the model does have to be trained initially on a large dataset of face videos, which demonstrates the value of combining big and small data approaches. But after that, it can identify and extract facial landmarks from just a few examples of a new face and animate it into a realistic avatar that looks like actual video footage.

In a striking demonstration video, the narrator’s face, as he talks, runs side by side with an almost indistinguishable avatar.26 The model created the narrator’s avatar after seeing just eight video frames of him. An avatar of the actor Neil Patrick Harris, trained on thirty-two frames, is virtually perfect. And the system can generate avatars from just a few selfie photos, which differ sharply from video frames as source material.

The system also boasts an impressive one-shot learning capability. From single, iconic photographic portraits of celebrated people, including Marilyn Monroe, Salvador Dali, Fyodor Dostoevsky, and Albert Einstein, the system has generated convincing avatars. The system has even had some success generating avatars from classic paintings like the Mona Lisa.

The video of Dostoevsky is particularly jarring, given that the great nineteenth-century Russian novelist died in 1881, before the advent of film. And, taken together, the animations of the famous raise the specter of “deep fakes”—videos in which people are depicted doing and saying things they never did, from sex tapes to outrageous political statements. The Talking Heads creators acknowledge the danger of deep fakes, but point out that tools are rapidly being developed to detect them. However, some applications use what are called generative adversarial networks (GANs) in which an algorithm that generates an image is pitted against an algorithm that tries to determine whether the image is genuine, with both algorithms constantly improving, making detection increasingly difficult.

Malicious uses of the technology aside, lifelike telepresence could transform the world in the not-too-distant future, especially as users gain the ability to create avatars themselves.27 Potential benefits include reducing long-distance travel and short-distance commuting, democratizing education, and improving the quality of life for people with disabilities and health conditions. VR goggles have already allowed surgeons to step inside large-scale, accurate 3D models of a specific patient’s brain. Today, at the Ottawa Hospital in Canada, this approach helps implant microelectrodes thinner than a human hair into the brain, with millimeter precision.28

Remote brain surgery is not a futuristic notion. In March 2019, Ling Zhipei performed China’s first remote, 5G-supported brain surgery, operating on a patient 3,000 kilometers away. And medical students at Stanford University now use immersive systems to explore inside the human skull. Led by an instructor/avatar, they can see tumors and aneurysms from different angles and walk through the steps of surgical procedures.29

HOW TO DO MORE WITH LESS DATA

As AI continues to evolve, researchers and organizations are developing techniques that are rapidly advancing the ability to do more with less data. These techniques can work well in situations where companies have access only to smaller datasets, owing to the limited number of labeled examples available to study, as well as for edge cases involving small amounts of outlier data featuring rare defects or characteristics. Those small-data techniques include the following.

DATA ECHOING.   Researchers at Google Brain have been exploring a technique that reuses (or “echoes”) data to speed up the training of AI.a In a typical training process, an AI system first reads and decodes the input data and then shuffles the data, applying a set of transformations to augment it before gathering examples into batches and iteratively updating parameters to reduce error. Data echoing inserts a stage in the process that repeats the output data of a previous stage, thereby improving efficiency as the system reclaims idle compute capacity.b
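A schematic of the idea in Python, with stand-in functions for the input pipeline and the training step (not Google’s implementation):

```python
# Rough sketch of data echoing: when reading, decoding, and augmenting the
# input is the bottleneck, repeat each prepared batch a few times so the
# accelerator keeps training instead of sitting idle waiting for data.
def echoed_batches(pipeline, echo_factor=2):
    for batch in pipeline:              # slow: read, decode, shuffle, augment
        for _ in range(echo_factor):
            yield batch                 # fast: reuse the same prepared batch

def train(model, pipeline, train_step, echo_factor=2):
    for batch in echoed_batches(pipeline, echo_factor):
        train_step(model, batch)        # parameter update runs echo_factor times per read
```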

DYNAMIC FILTERING.   Presto, a distributed engine for big data analysis, parses a data query, assigns it to multiple “workers,” and creates an optimal plan for answering that query. It collects data from different sources, using both big and small datasets, and determines whether one data source is significantly smaller than another. It then dynamically filters the data in order to skip the scanning of irrelevant data from the larger source. This enables a significant performance improvement when joining data from different sources.c
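Stripped of Presto’s distributed machinery, the core logic might be sketched in plain Python like this (illustrative only):

```python
# Schematic dynamic filtering during a join: scan the small source first,
# collect its join keys, and use them to skip irrelevant rows while streaming
# the much larger source.
def dynamic_filter_join(small_rows, scan_large_rows, key):
    keys = {row[key] for row in small_rows}            # build the filter from the small side
    small_by_key = {row[key]: row for row in small_rows}  # assumes unique keys, for simplicity
    for row in scan_large_rows():                      # stream the large side
        if row[key] not in keys:                       # skip rows that cannot match
            continue
        yield {**row, **small_by_key[row[key]]}        # join matching rows
```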

SIMULTANEOUS TRAINING.   Google trained a single deep neural network on eight different tasks simultaneously. It can detect objects in images, provide captions, recognize speech, translate between four pairs of languages, and analyze sentences. The researchers showed that it is not only possible to achieve good performance while training jointly on multiple tasks, but also that performance actually improves on tasks with limited quantities of data.d
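A toy PyTorch sketch of the pattern, a shared body with per-task heads trained on alternating batches; the real model is far larger and spans multiple modalities, so this only illustrates the structure:

```python
# Illustrative multi-task training: a shared encoder with task-specific heads.
# Low-data tasks benefit because the shared layers see batches from every task.
import torch
import torch.nn as nn

class MultiTaskNet(nn.Module):
    def __init__(self, input_dim, hidden_dim, task_output_dims):
        super().__init__()
        self.shared = nn.Sequential(nn.Linear(input_dim, hidden_dim), nn.ReLU())
        self.heads = nn.ModuleDict(
            {task: nn.Linear(hidden_dim, out) for task, out in task_output_dims.items()}
        )

    def forward(self, x, task):
        return self.heads[task](self.shared(x))

model = MultiTaskNet(128, 256, {"captioning": 1000, "parsing": 50})
optimizer = torch.optim.Adam(model.parameters())
loss_fn = nn.CrossEntropyLoss()

# Alternate batches from different tasks (random tensors stand in for real data).
for task, (x, y) in [("captioning", (torch.randn(8, 128), torch.randint(0, 1000, (8,)))),
                     ("parsing",    (torch.randn(8, 128), torch.randint(0, 50, (8,))))]:
    optimizer.zero_grad()
    loss = loss_fn(model(x, task), y)
    loss.backward()
    optimizer.step()
```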

ACTIVE LEARNING.   In this approach to AI, the algorithm chooses which data to learn from. New York University Shanghai has been employing active learning algorithms to develop a computational framework that personalizes the assessment of human vision change.e The system rapidly evaluates test results in relation to a large library of potential test results and converges on a set of optimal queries for each patient based on their previous responses. Researchers believe AI-driven improvements to the standard eye chart would help with early detection and treatment of many eye-related afflictions, such as glaucoma and cataracts.
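A generic sketch of the pattern, using uncertainty sampling with any scikit-learn-style classifier; this is not NYU Shanghai’s vision-testing framework, just the common shape of an active learning loop:

```python
# Pool-based active learning with uncertainty sampling: the algorithm picks
# which unlabeled examples are worth sending to the oracle (a person, a patient
# response, etc.) for labeling.
import numpy as np

def most_uncertain(model, unlabeled_pool, batch_size=10):
    """Indices of the examples the model is least sure about."""
    probs = model.predict_proba(unlabeled_pool)     # any classifier with predict_proba
    uncertainty = 1.0 - probs.max(axis=1)           # low top-class probability = uncertain
    return np.argsort(uncertainty)[-batch_size:]

def active_learning_loop(model, labeled_x, labeled_y, pool_x, ask_oracle, rounds=5):
    for _ in range(rounds):
        model.fit(labeled_x, labeled_y)
        picks = most_uncertain(model, pool_x)
        new_y = ask_oracle(pool_x[picks])           # request labels only for informative points
        labeled_x = np.concatenate([labeled_x, pool_x[picks]])
        labeled_y = np.concatenate([labeled_y, new_y])
        pool_x = np.delete(pool_x, picks, axis=0)
    return model
```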

LOCAL DATA.   Just as global firms create geographically relevant products and customer channels, it’s important that data collected locally be used to train the algorithms that drive a company’s approach to interacting with local customers. Locally collected data embodies the culture, behaviors, and values that can be used to tailor a company’s approach to sales and achieve greater customer satisfaction. Radically human AI models are trained with data that’s relevant to the populations they will affect; they are not ready-made models for the masses.

FEDERATED LEARNING.   Combined with edge computing, federated learning is helping to make data smaller. In federated learning, each device on the edge downloads the machine learning model from the cloud, updates it with locally stored data, and sends the update back to the cloud, where it is averaged with updates from other devices. The model is thus trained on local data that never leaves the device. This approach eliminates a centralized big-data store, contributes to algorithmic development, and is particularly useful in scenarios where wireless latency is an issue.
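A bare-bones sketch of one round of federated averaging; real systems add device sampling, weighting by local dataset size, and secure aggregation, and the gradient function here is a stand-in:

```python
# Minimal federated averaging: each device updates the shared model on its own
# data, and only the updated parameters travel back to be averaged.
import numpy as np

def local_update(global_weights, local_data, compute_gradient, lr=0.01, steps=5):
    w = global_weights.copy()
    for _ in range(steps):
        w -= lr * compute_gradient(w, local_data)   # training happens on the device
    return w

def federated_round(global_weights, devices, compute_gradient):
    updates = [local_update(global_weights, d, compute_gradient) for d in devices]
    return np.mean(updates, axis=0)                 # the server sees only averaged weights
```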

SYNTHETIC DATA.   Unlike large tech companies, startups and researchers might not always have access to the data they need to train algorithms. But with synthetic data—artificial information generated by computers that mimics real data—engineers can teach an AI system how to react to novel situations. Synthetic data can help level the playing field for startups competing against the tech giants. Early-stage startup AiFi is using synthetic data to build a checkout-free solution for retailers along the lines of Amazon Go.f But this democratization of data doesn’t mean that the tech giants are sleeping.g Waymo, Alphabet’s self-driving car unit, drives more than three million miles daily in a simulated environment, allowing engineers to test features before unleashing them in the real world.h
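As a simple illustration, the sketch below fabricates labeled “defect” images by drawing artificial scratches on copies of normal images, in the spirit of the synthetic-defect approach described earlier; the image sizes and defect type are arbitrary assumptions:

```python
# Generate a small labeled training set of synthetic defects from normal images.
import numpy as np

def add_synthetic_scratch(image, rng):
    scratched = image.copy()
    row = rng.integers(0, image.shape[0])
    start = rng.integers(0, image.shape[1] // 2)
    length = rng.integers(10, image.shape[1] // 2)
    scratched[row, start:start + length] = 255       # a bright horizontal line
    return scratched

rng = np.random.default_rng(0)
normals = [rng.integers(80, 120, size=(64, 64), dtype=np.uint8) for _ in range(20)]
defects = [add_synthetic_scratch(img, rng) for img in normals]
labels = [0] * len(normals) + [1] * len(defects)     # 0 = normal, 1 = synthetic defect
```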

a. Dami Choi, Alexandre Passos, Christopher J. Shallue, and George E. Dahl, “Faster Neural Network Training with Data Echoing,” arXiv, January 4, 2020, https://arxiv.org/abs/1907.05550.

b. Kyle Wiggers, “Google’s New ‘Data Echoing’ Technique Speeds Up AI Training,” VentureBeat, July 15, 2019, https://venturebeat.com/2019/07/15/googles-new-data-echoing-technique-speeds-up-ai-training/.

c. Shai Greenberg, “Querying Multiple Data Sources with a Single Query Using Presto’s Query Federation,” BigData Boutique, May 26, 2020, https://blog.bigdataboutique.com/2020/05/querying-multiple-data-sources-with-a-single-query-using-prestos-query-federation-veulwi.

d. Lukasz Kaiser, Aidan N. Gomez, Noam Shazeer, Ashish Vaswani, Niki Parmar, Llion Jones, and Jakob Uszkoreit, “One Model to Learn Them All,” arXiv, June 16, 2017, https://arxiv.org/pdf/1706.05137.pdf.

e. James Devitt, “A.I. Could Give Eye Charts a Personalized Overhaul,” Futurity, August 29, 2019, https://www.futurity.org/eye-charts-artificial-intelligence-2146852/.

f. Evan Nisselson, “Deep Learning with Synthetic Data Will Democratize the Tech Industry,” TechCrunch, May 11, 2018, https://techcrunch.com/2018/05/11/deep-learning-with-synthetic-data-will-democratize-the-tech-industry/.

g. Yashar Behzadi, “Why Synthetic Data Could Be the Ultimate AI Disruptor,” Upside, June 28, 2019, https://tdwi.org/articles/2019/06/28/adv-all-synthetic-data-ultimate-ai-disruptor.aspx.

h. Nisselson, “Deep Learning with Synthetic Data Will Democratize the Tech Industry.”

Creating a Modern Data Foundation

Mastering the use of big and small data to generate value from AI requires that organizations lay a solid foundation. In every industry, companies’ current successes are happening in spite of their foundations, not because of them. Data is locked in legacy, on-premise platforms that are often siloed, making it difficult, if not impossible, for people to get different types of data to work together. That makes it even harder for business users to find and consume the data they need to make the right decisions.

Creating a modern foundation requires breaking data out of legacy silos so it can be unified in the cloud across different dimensions and processed with cutting-edge analytical tools. As we will see in chapter 4, this requires the right architecture (the A in IDEAS): the right storage, the right warehouses, the right computers, the right access—all in the cloud—to enable agile data capabilities that create bottom-line results.

Companies that get this right enjoy substantial competitive advantage. As we noted in the Introduction, according to recent Accenture research, companies that lead in this regard grew five times faster than laggards.30 And a significant number of “leapfroggers” scaled their investments in key technologies such as cloud and AI during the pandemic, not just to keep the lights on but also to create “second-mover advantage.”

How have they made the leap? Three capabilities are key: modern data engineering, AI-assisted data governance, and data democratization.

Modern Data Engineering

In a modern data foundation, data comes from a variety of internal and external sources through a number of mechanisms, including batch and real-time processing and application programming interfaces (APIs). It gets stitched together into highly curated and reusable datasets that can be consumed for a variety of analytic purposes. A good foundation relies on reusable frameworks for data ingestion and ETL (extract, transform, load) that support diverse data types. These frameworks also handle rules for data quality and standardization as well as metadata capture and data classification, and they enable a configuration-driven approach to data ingestion, processing, and curation so that new data pipelines for analytic use cases and data products can be developed quickly and at scale on the cloud.
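To make the idea of a configuration-driven approach concrete, here is a hypothetical Python sketch in which each new source is described declaratively and a generic, reusable framework handles extraction, quality rules, and loading. The source names, rule names, and helper functions are illustrative assumptions, not a reference to any particular product.

```python
# Hypothetical configuration-driven ingestion: adding a new analytic use case
# means adding a config entry, not writing a new pipeline from scratch.
PIPELINES = [
    {
        "source": "s3://raw-zone/call-center/*.json",          # placeholder paths
        "format": "json",
        "quality_rules": ["non_empty:customer_id", "valid_date:call_start"],
        "classification": "confidential",
        "target": "curated.customer_interactions",
    },
    {
        "source": "jdbc:postgresql://erp/orders",
        "format": "table",
        "quality_rules": ["non_null:order_id"],
        "classification": "internal",
        "target": "curated.orders",
    },
]

def run_pipeline(config, extract, standardize, load):
    """extract/standardize/load are supplied by the framework (stand-ins here)."""
    records = extract(config["source"], config["format"])
    clean = [r for r in records if standardize(r, config["quality_rules"])]
    load(clean, config["target"], metadata={"classification": config["classification"]})
```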

AI-Assisted Data Governance

Cloud-based AI tools offer the advanced capabilities and scale to help automatically cleanse, classify, and secure data gathered on the cloud as it is ingested, which supports better data quality, veracity, and ethical handling.

Data Democratization

A modern data foundation gets more data into more hands. It makes data accessible and easy to use in a timely manner, while enabling multiple ways to consume data, including self-service, AI, business intelligence, and data science. As we will see in chapter 6, the latest cloud-based tools democratize data and empower more people across the enterprise to easily find and leverage data that’s relevant to their specific business needs—faster.


Together, these three capabilities help companies overcome some of the most common barriers to value: data accessibility, data trustworthiness, data readiness, and data timeliness. They enable companies to blend data from big and small datasets together in real time, build agile reporting, and leverage analytics and AI to create broadly accessible customer, market, and operational insights that deliver meaningful business outcomes.

With a solid data foundation—more data from more sources, AI-assisted data management, and more data in more of your people’s hands—you are no longer dominated by data, but driving it to ever more powerful and more fine-grained uses. As with more human-like intelligence, this approach to data transforms mechanistic processes into activities requiring more, not less, involvement of people, positioning you to unleash human expertise—the E in IDEAS—to which we turn in the next chapter.
