Chapter 2. A Data Workflow Framework

In this chapter, we present a framework for working with data. Our goal is to cover the most common sequences of actions that people take as they move through the process of accessing, transforming, and using their data. We’ll begin at the end of this process, and discuss the value you will get from your data.

In the introduction, we talked about near-term and long-term value. Another dimension of value to consider is how that value will be delivered into your organization. Will value be delivered directly, through systems that can take automated actions based on data as it is processed? Or will value be delivered indirectly, by empowering people in your organization to take a different course of action than they otherwise would have?

Indirect value
Data provides value to your organization by influencing people’s decisions or inspiring changes in processes. Example: risk modeling in the insurance industry.
Direct value
Data provides value to your organization by feeding automated systems. Example: Netflix’s recommendation system.

Indirect value from data has a long tradition. Entire professions are built on it: accounting, risk modeling in insurance, experimental design in medical research, and intelligence analytics. On a smaller scale, you might have used data to generate reports or interactive visualizations. These reports and visualizations both use data to deliver indirect value. How? When others view your report or visualization, they incorporate the presented information into their understanding of the world and then use their updated understanding to improve their actions. In other words, the data shown in your reports and visualizations indirectly influences other people’s decisions. Most of the near-term, known potential value in your data will be delivered indirectly.

Direct value from data involves handing decisions to data-driven systems for speed, accuracy, or customization. The most common example of this involves automatic delivery and routing of resources. In the world of high-frequency trading and modern finance, this resource is primarily money. In some industries, like consumer packaged goods (think Walmart or Amazon), physical goods are routed automatically. A close cousin to routing physical goods is routing virtual ones: digital media companies like Netflix and Comcast use automated pipelines to optimize the delivery of digital content to their customers. At a smaller scale, systems like antilock brakes in cars use data from sensors to route energy to different wheels. Modern testing systems, like the GRE graduate school entrance exam, now dynamically sequence questions based on the test-taker’s evolving performance. In all of these examples, a significant number of operational decisions are directly controlled by data-driven systems without any human input.

How Data Flows During and Across Projects

Deriving indirect, human-mediated value from your data is a prerequisite to deriving direct, automated value. At the outset, human oversight is required to discover what is “in” your data and to assess whether the quality of your data is sufficiently high to use it in direct and automated ways. You can’t send data blindly into an automated system and expect valuable results. Reports must be authored and digested to understand the wider potential of your data. As that wider potential comes into focus, automated systems can be designed to use the data directly.

This is the natural progression of data projects: from near-term answering of known questions, to longer-term analyses that assess the core quality and potential applications of a dataset, and finally to production systems that use data in an automated way. Underlying this progression is the movement of data through three main data stages: raw, refined, and production. Table 2-1 provides an overview of this progression. For each stage, we list the primary objectives.

Table 2-1. Data moves through stages

  Primary objectives, by data stage:

  Raw:
  • Ingest data
  • Data discovery and metadata creation

  Refined:
  • Create canonical data for widespread consumption
  • Conduct analyses, modeling, and forecasting

  Production:
  • Create production-quality data
  • Build regular reporting and automated data products/services

In the raw stage, the primary goal is to discover the data. When examining raw data, you ask questions aimed at understanding what your data looks like. For example:

  • What kinds of records are in the data?

  • How are the record fields encoded?

  • How does the data relate to your organization, to the kinds of operations you have, and to the other data you are already using?

Armed with an understanding of the data, you can then refine the data for deeper exploration by removing unusable parts of the data, reshaping poorly formatted elements, and establishing relationships between multiple datasets. Assessing potential data quality issues is also frequently a concern during the refined stage, because quality issues might negatively affect any automated use of the data downstream.

Finally, after you understand the data’s quality and potential applications in automated systems, you can move the data to the production stage. At this point, production-quality data can feed automated products and services, or enter previously established pipelines that drive regular reporting and analytics activities.

A minority of data projects will end in the raw or production stages. The majority will end in the refined stage. Projects ending in the refined stage will add indirect value by delivering insights and models that drive better decisions. In some cases, these projects might last multiple years. Google’s Project Oxygen is a great example of a project that ended in the refined stage.1 Realizing that managing people is a critical skill for a successful organization, Google kicked off a multiyear study to assess the characteristics of a good manager and then test how effective they could be at teaching those characteristics. The results of the study indirectly influenced employee behavior, but the study data itself was not incorporated into a production pipeline.

The hand-off between IT shared services organizations and lines of business traditionally occurs in the refined stage. In such an environment, IT is responsible for Extract-Transform-Load (ETL) operations. ETL moves data through the three data stages in a centrally controlled manner. Lines of business own the data analysis process, including everything from reporting and ad hoc research tasks, to advanced modeling and forecasting, to data-driven operational changes. This division of concerns and responsibilities has two intended benefits: basic data governance due to centralized data processing, and efficiency gains due to IT engineers reusing broadly useful data transformations.

However, in practice, the perceived benefits of centrally transforming data are often eclipsed by the reality of organizational inefficiencies and bottlenecks. Most of these bottlenecks arise from line-of-business analysts being dependent upon IT. In the age of agile analytics and data-driven services, there is increasing pressure to speed up the extraction of value from your data. Unsurprisingly, the best plan of attack involves identifying and removing bottlenecks.

In our experience, there are two primary bottlenecks. The first bottleneck is the time it takes to wrangle your data. Even when you start from refined data, there are often nontrivial transformations required to prepare your data for analysis. These transformations can include removing unnecessary records, joining in additional information, aggregating data, or pivoting datasets. We will discuss each of these common transformation actions in more detail in later chapters.
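To make these transformation types concrete, here is a minimal sketch in Python and pandas (one of the wrangling approaches discussed later in this chapter). The sales and store tables, and every column name and value in them, are hypothetical.

    import pandas as pd

    # Hypothetical transaction records and a store reference table.
    sales = pd.DataFrame({
        "store_id": [1, 1, 2, 2, 2],
        "month": ["2016-01", "2016-02", "2016-01", "2016-01", "2016-02"],
        "amount": [120.0, 80.0, 200.0, -15.0, 95.0],
    })
    stores = pd.DataFrame({"store_id": [1, 2], "region": ["West", "East"]})

    # Remove unnecessary records (here, negative amounts such as returns).
    cleaned = sales[sales["amount"] >= 0]

    # Join in additional information from a second dataset.
    joined = cleaned.merge(stores, on="store_id", how="left")

    # Aggregate to a coarser granularity: total sales per region per month.
    totals = joined.groupby(["region", "month"], as_index=False)["amount"].sum()

    # Pivot so that months become columns, with one row per region.
    pivoted = totals.pivot(index="region", columns="month", values="amount")
    print(pivoted)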

The second bottleneck is the simple capacity mismatch that arises when a large pool of analysts relies on a small pool of IT professionals to prepare “refined” data for them. Removing this bottleneck is more of an organizational challenge than anything else, and it involves expanding the range of users who have access to raw data and providing them with the requisite training and skills.

To help motivate these organizational changes, let’s step back and consider the gross mechanics of successfully using data. The most valuable uses of your data will be production uses that take the form of automated reports or data-driven services and products. But every production use of your data depends on hundreds or even thousands of exploratory, ad hoc analyses. In other words, there is a funnel of effort leading to direct, production value that begins with exploratory analytics. And, as with any funnel, your conversion rate will not be 100 percent. You’ll need as many people as possible exploring your data and deriving insights in order to discover a relatively small number of valuable applications of your data.

As Figure 2-1 demonstrates, a large number of raw data sources and exploratory analyses are required to produce a single valuable application of your data.

Figure 2-1. Data value funnel

When it comes to delivering production value from your data, there are two critical points to consider. First, data can produce insights that are not useful to you and your business. These insights might not be actionable, or their potential impact might be too small to warrant a change in existing processes. A good strategy for mitigating this risk is to empower the people who know your business priorities to explore your data. Second, your exploratory analytics efforts should be as efficient as possible. This brings us back to data wrangling. The faster you can wrangle data, the more explorations of your data you can conduct, and the more analyses you will be able to move into production. Ultimately, implementing an effective data wrangling workflow can enable more business analysts to explore a larger quantity of data at a faster pace.

Connecting Analytic Actions to Data Movement: A Holistic Workflow Framework for Data Projects

We began this chapter with a discussion of the direct and indirect value delivered by data projects.

In this section, we expand our discussion of data stages into a complete framework that captures the basic analytic actions involved in most data projects. Figure 2-2 illustrates the overall framework and will serve as our map through the rest of the book.

As Figure 2-2 illustrates, data moves through stages, from raw to refined to production. Each stage has a small set of primary actions. The actions come in two types: the top three boxes in Figure 2-2 are actions whose results are the data itself, and the bottom six boxes are actions whose results are derived from or built on top of the data (e.g., insights, reports, products, or services). For simplicity, the connecting links between actions in Figure 2-2 are drawn in one direction. However, real data projects will often loop back through actions, iterating toward better results.

Figure 2-2. A holistic workflow framework for data projects

Of course, many individuals and organizations will customize the steps in this framework to fit their specific needs. Although we describe each possible action in our framework, not every data project will involve all of these actions. You might decide to define variants of each action that are tailored to specific customers or business objectives. You might also decide to create multiple locations for refined data and multiple locations for production data. We have seen this frequently at organizations where data security is important, and different business units are not allowed to access each other’s data. However, most organizations that we have worked with follow the uncustomized version of this framework.

In the rest of this chapter, we’ll discuss the actions in Figure 2-2. The discussion will move through the three data stages in order.

Raw Data Stage Actions: Ingest Data and Create Metadata

There are three primary actions in the raw data stage: ingestion of data, creation of generic metadata, and creation of custom metadata. We can separate these actions into two groups based on their output, as shown in Figure 2-3. One group is focused on outputting data (the ingestion of known and unknown data). The second group is focused on outputting insights and information derived from the data (the metadata creation actions).

Figure 2-3. Primary action and output actions in the raw data stage

Ingesting Known and Unknown Data

The process of ingesting data can vary widely in its complexity. At the less complex end of the spectrum, many people receive their data as files via channels like email, shared network folders, or FTP sites. At the more complex end of the spectrum, modern open source tools like Sqoop, Flume, and Kafka enable more granular and real-time transferring of data, though at the cost of requiring nontrivial software engineering to set up and maintain. Somewhere in the middle of this spectrum are proprietary platforms like Alteryx, Talend, and Informatica Cloud that support a variety of data transfer and ingestion functionality, with an eye toward easing configuration and maintenance for nonengineers.

In traditional enterprise data warehouses, the ingestion process involves some initial data transformation operations. These transformations are primarily aimed at mapping inbound elements to those elements’ standard representations in the data warehouse. For example, you might be ingesting a comma-separated values (CSV) file and need each field in that file to correspond to a particular column in a relational data warehouse. After it is transformed to match the syntax rules defined by the warehouse, the data is stored in predefined locations. Often this involves appending newly arrived data to related prior data. In some cases, appends can be simple, literally just adding new records at the “end” of the dataset. In other cases, when the incoming data contains edits to prior data as well as new data, the append operation becomes more complicated. These scenarios often require you to ingest new data into a separate location, where more complex merging rules can be applied during the refined data stage.

Some modern NoSQL databases like MongoDB or Cassandra support less-rigid syntax constraints on incoming data while still supporting many of the classic data access controls of more traditional warehouses. Further along the spectrum (toward relaxed constraints on incoming data) are basic storage infrastructures like HDFS and Amazon S3 buckets. For most users, S3 and HDFS look and act like regular filesystems. There are folders and files. You can add to, modify, and move them around. And, if necessary, you can control access on a per-file, per-user basis.

The primary benefit of modern distributed filesystems like HDFS and S3 is that data ingestion can be as simple as copying files or storing a stream of data into one or more files. In this environment, the work to make this data usable and accessible is often deferred until the data is transformed and moved to the refined data stage. This style of data ingestion is often referred to as schema-on-read. In schema-on-read ingestion, you do not need to construct or enforce a usable data structure until you need to use the data. Traditional data warehouses, in contrast, require schema-on-write, in which the data must adhere to certain structural and syntactic constraints in order to be ingested.
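As a rough illustration of the difference, the following Python sketch contrasts the two approaches: a schema-on-write load into a relational table (SQLite is used here purely for convenience) versus a schema-on-read load, in which a raw file is simply copied to storage and structure is imposed only when pandas reads it back. All table, file, and column names are hypothetical.

    import sqlite3
    import pandas as pd

    # Schema-on-write: the table's structure is enforced at ingestion time.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (store_id INTEGER, sale_date TEXT, amount REAL)")
    conn.execute("INSERT INTO sales VALUES (?, ?, ?)", (1, "2016-09-30", 19.99))

    # Schema-on-read: the raw file is just copied into storage (a local CSV
    # stands in for HDFS or S3); types are imposed only when reading it back.
    with open("raw_sales.csv", "w") as f:
        f.write("store_id,sale_date,amount\n1,2016-09-30,19.99\n")

    df = pd.read_csv(
        "raw_sales.csv",
        dtype={"store_id": "int64", "amount": "float64"},
        parse_dates=["sale_date"],
    )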

In other words, the two ends of the ingestion complexity spectrum differ based on when the initial enforcement of data structure happens. However, it is important to note that along this entire spectrum of ingestion infrastructures, you will still require a separate refined data stage. This is because refined data has been further wrangled to align with foreseeable analyses.

Let’s consider an example data ingestion use case. It is common practice for consumer packaged goods (CPG) retailers (e.g., Walmart and Target) and manufacturers (e.g., Pepsico and General Mills) to share data about their supply chains. This data enables better forecasting, helping both sides to better manage inventory. Depending on the size of the companies, data might be shared on a daily, weekly, or monthly basis. The ingestion complexity comes from the many-to-many partnerships in this ecosystem: retailers sell products from many manufacturers, and manufacturers sell products to many retailers. Each of these companies produces data in different formats and conforms to different syntactic conventions. For example, each company might refer to products by using their own product IDs or product descriptions. Or some companies might report their data at case or bundle levels instead of the individual units that an end consumer would purchase. Retailers with strong weekly patterns (e.g., much higher sales activity on weekends versus weekdays) might report their overall sales activity on a weekly basis instead of a monthly basis. Even retailers that report their data at the same frequency might define the beginning and ending of each period differently. Further complexity arises in retailer sales data when consumers return purchased goods. Return transactions require amendments to previously shared sales data, often going back multiple weeks.

The ingestion processes for these CPG companies can range from simple file transfers, which wait for the refined data stage to tackle the potentially complex wrangling tasks required to sort out the aforementioned difficulties, to more engineered ETL processes that fix some of these difficulties as the data is ingested. In either case, both retailers and manufacturers are interested in forecasting future sales. Because these forecasts are regularly refreshed, and because the historical sales data on which they are based can and is amended, most large CPG companies work with supply chain data in a time-versioned way. This means that a forecast for the first week of January 2017 based on data received through August 31, 2016 is kept separate and distinct from a forecast for the same first week of January 2017 using data received through September 30, 2016.
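As a minimal sketch, time-versioned supply chain data might be stored in a tabular layout keyed by target week and as-of date; the layout and numbers below are hypothetical.

    import pandas as pd

    # Two forecasts for the same target week, distinguished by the as-of date
    # of the data used to produce them (hypothetical values).
    forecasts = pd.DataFrame({
        "target_week":    ["2017-01-02", "2017-01-02"],
        "data_as_of":     ["2016-08-31", "2016-09-30"],
        "forecast_units": [12500, 13100],
    })

    # Retrieve the forecast for the first week of January 2017 that was based
    # on data received through September 30, 2016.
    latest = forecasts[
        (forecasts["target_week"] == "2017-01-02")
        & (forecasts["data_as_of"] == "2016-09-30")
    ]
    print(latest)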

In addition to storing data in time-versioned partitions, data from different partners is often ingested into separate datasets. This greatly simplifies the ingestion logic. After ingestion, as the data moves into the refined stage, the separate partner datasets are harmonized to a standard data format so that cross-partner analyses can be efficiently conducted.

Creating Metadata

In most cases, the data that you are ingesting during the raw data stage is known; that is, you know what you are going to get and how to work with it. But what happens when your organization adds a new data source? In other words, what do you do when your data is partially or completely unknown? Ingesting unknown data triggers two additional actions, both related to the creation of metadata. One action is focused on understanding the characteristics of your data, or describing your data. We refer to this action as generating generic metadata. A second action is focused on using the characteristics of your data to make a determination about your data’s value. This action involves creating custom metadata.

Before discussing the two metadata-producing actions, let’s cover some basics. Datasets are composed of records. Records are composed of fields. Records often represent or correspond to people, objects, relationships, or events. The fields within a record represent or correspond to measurable aspects of the person, object, relationship, or event. For example, if we consider a dataset that contains transactions from a store, each record could correspond to a single purchase, and fields might represent the monetary value of the purchase, the specific goods purchased, the time of the purchase, and so on. If you’re used to working with relational data, you might be used to speaking about rows and columns. Records are synonymous with rows, and fields are synonymous with columns.
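For readers who prefer to see this in code, here is a tiny, purely illustrative pandas example in which each row is a record (one purchase) and each column is a field.

    import pandas as pd

    # Each record (row) is one purchase; each field (column) is a measurable
    # aspect of that purchase. All values are hypothetical.
    transactions = pd.DataFrame({
        "purchase_time": ["2016-09-01 10:32", "2016-09-01 11:05"],
        "item":          ["coffee", "notebook"],
        "amount_usd":    [3.50, 7.25],
    })
    print(transactions.shape)  # (2 records, 3 fields)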

When you are describing your data, you should be focused on understanding the structure, granularity, accuracy, temporality, and scope of your data. Structure, granularity, accuracy, time, and scope are key aspects of representational consistency. As such, they are also the characteristics of a dataset that must be tuned or improved by your wrangling efforts.

Beyond generic metadata descriptions, the data discovery process often requires inferring and creating custom metadata related to the potential value of your data. Whereas the generic metadata should be broadly useful to anyone working with the dataset, custom metadata contextualizes this information to a specific analysis or organization. In other words, custom metadata builds on or extends generic metadata.

Both generic and custom metadata are composed of the same base set of characteristics: structure, granularity, accuracy, temporality, and scope. We will discuss each of these characteristics in turn and explain how you can better understand them in the context of your data.

Structure

The structure of a dataset refers to the format and encoding of its records and fields. We can consider datasets on a spectrum related to the homogeneity of their records and fields. At one end of the spectrum, the dataset is “rectangular” and can be formatted as a table with a fixed number of rows and columns. In this format, rows in the table correspond to records, and columns correspond to fields.

If the record fields in a dataset are not consistent (some records have additional fields, others are missing fields, etc.), you could be dealing with a “jagged” table. Such a table is no longer perfectly rectangular. Data formats like JSON and XML support datasets like this, in which record fields are not fully consistent.
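As a small example of handling jagged records, the pandas json_normalize function can flatten a list of records with inconsistent fields into a rectangular table; the records below are hypothetical.

    import pandas as pd

    # Hypothetical "jagged" records: not every record has the same fields.
    records = [
        {"customer_id": 1, "name": "Ana", "email": "ana@example.com"},
        {"customer_id": 2, "name": "Ben"},                       # missing email
        {"customer_id": 3, "name": "Chi", "phone": "555-0100"},  # extra field
    ]

    # json_normalize produces a rectangular table, filling missing fields
    # with NaN so that every row has the same columns.
    table = pd.json_normalize(records)
    print(table)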

Further along the spectrum are datasets with a heterogeneous set of records. For example, a heterogeneous dataset from a retail organization might mix customer information and customer transactions. This is a common occurrence if you consider the tabs in a complex Excel workbook. Most analysis and visualization tools will require these different kinds of records to be split into separate files or tables. Analyses that require all of these records will often start by joining or blending the records together based on shared fields.

The encoding of a dataset specifies how the record fields are stored and presented to the user. Are the timestamps in local time zones or mapped to UTC? Do date/time fields conform to specifications like ISO 8601? Are ages stored in years, days, or fractions of decades? Are account balances in US dollars? Are the postal codes up to date? Are accented characters supported, or have they been replaced with their nonaccented counterparts? Are right-to-left writing systems, like Arabic, supported?
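Many of these encoding questions can be settled during wrangling. As one illustrative sketch, assuming pandas and hypothetical local-time values, timestamps can be localized, converted to UTC, and rendered as ISO 8601 strings:

    import pandas as pd

    # A hypothetical timestamp field recorded in local time (US/Pacific).
    events = pd.DataFrame({
        "event_time": ["2016-09-30 14:05:00", "2016-10-01 08:30:00"],
    })

    # Parse the strings, declare the local time zone, and convert to UTC so
    # the field has a single, unambiguous encoding.
    local = pd.to_datetime(events["event_time"]).dt.tz_localize("US/Pacific")
    events["event_time_utc"] = local.dt.tz_convert("UTC")

    # Render as ISO 8601 strings for storage or exchange.
    events["event_time_iso"] = events["event_time_utc"].dt.strftime("%Y-%m-%dT%H:%M:%S%z")
    print(events)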

In many cases, it is advisable to encode a dataset in plain text. This makes the file human-readable. A major drawback to encoding the file in plain text is file size; it is far more space efficient to use binary encodings of numerical values, or to compress the file using standard algorithms like gzip and bzip2.

Assessing the structure of your dataset is primarily a generic metadata question. Before you can begin wrangling your data, you need to understand how that data is structured. This often requires counting the number of records and fields in a dataset, and determining the dataset’s encoding.
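A minimal profiling sketch along these lines, assuming Python and pandas and a small stand-in file; the optional third-party chardet library is used only to guess the character encoding, with a UTF-8 fallback if it is not installed.

    import pandas as pd

    # Create a small file to stand in for an unknown raw dataset (illustrative only).
    path = "raw_data.csv"
    with open(path, "w", encoding="utf-8") as f:
        f.write("customer_id,city,amount\n1,São Paulo,19.99\n2,Austin,5.00\n")

    # Guess the character encoding before parsing.
    with open(path, "rb") as f:
        sample = f.read(100_000)
    try:
        import chardet
        encoding = chardet.detect(sample)["encoding"] or "utf-8"
    except ImportError:
        encoding = "utf-8"

    df = pd.read_csv(path, encoding=encoding)
    print(f"{len(df)} records, {len(df.columns)} fields, encoding guessed as {encoding}")
    print(df.dtypes)  # how each field was parsed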

Beyond these generic metadata concerns, you might need to generate custom metadata pertaining to the specific structure of your dataset. Earlier in this chapter, we discussed CPG organizations that need to work with data provided by their external trading partners. Because this data originates outside of the organization, CPG analysts often need to determine the custom fields that each trading partner is adding to the dataset. Maybe Walmart’s datasets include a slightly different set of fields than those for Target. When you make these determinations, you generate custom metadata.

Granularity

The granularity of a dataset refers to the kinds of entities that each data record represents or contains information about. In their most common form, records in a dataset will contain information about many instances of the same kind of entity.

We typically describe granularity in terms of coarseness and fineness. In the context of data, this means the level of depth or the number of distinct entities represented by a single record of your dataset. For example, a dataset in which a single record represents a single sales transaction by a single customer at a particular store would have a fine granularity. A dataset in which each record represents the total sales in each store for each day would have a coarse granularity. At an even coarser granularity, you might have a dataset in which each record represents total sales by region and week. Depending on your intended use of the dataset, the granularity might be just right, too coarse, or too fine.
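As a brief illustration of moving between granularities, the following hypothetical pandas sketch aggregates transaction-level records (fine granularity) up to one record per store per day (coarser granularity).

    import pandas as pd

    # Fine granularity: one record per individual sales transaction.
    transactions = pd.DataFrame({
        "store_id":  [1, 1, 2, 2],
        "sale_date": ["2016-09-01", "2016-09-01", "2016-09-01", "2016-09-02"],
        "amount":    [12.50, 3.99, 45.00, 7.25],
    })

    # Coarser granularity: one record per store per day.
    store_daily = (
        transactions
        .groupby(["store_id", "sale_date"], as_index=False)["amount"]
        .sum()
        .rename(columns={"amount": "total_sales"})
    )
    print(store_daily)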

There are subtleties to assessing the granularity of a dataset that involve applying organizational knowledge. These are examples of creating custom metadata related to granularity. For example, at first glance, a dataset might appear to contain records that represent customers. However, those records might in fact correspond to all known contacts of your company (only a subset of whom are actual, paying customers). Moreover, this contacts dataset might contain multiple entries for the same person, resulting from that person signing up to receive information through multiple channels (e.g., Facebook and via a direct visit to your website). In this case, a more appropriate description of the granularity of the dataset might be “registration events.”

Accuracy

The accuracy of a dataset refers to its quality. In other words, the values populating record fields in the dataset should be consistent and accurate. For example, consider a customer actions dataset. This dataset contains records corresponding to when customers added items to their shopping carts. In some cases, the reference to the item added to the cart in the record is not accurate; perhaps a UPC code is used but some of the digits are missing, or the UPC code is out of date and has since been reused for a different item. Inaccuracies would, of course, render any analysis on the dataset problematic.

Other common inaccuracies are misspellings of categorical variables, like street or company names; lack of appropriate categories, like ethnicity labels for multiethnic people; underflow and overflow of numerical values; and missing field components, like a timestamp encoded in a 12-hour format but missing an AM/PM indication.

Accuracy assessments can also apply to frequency outliers—values that occur more or less often than you’d expect. Making frequency assessments is primarily a custom metadata concern, given that determining whether the range of values in a dataset is accurate relies on organizational knowledge. Returning to our CPG example, a supply-chain analyst might know that a certain trading partner only reports UPCs within a particular range. In this case, the analyst needs to generate custom metadata around the accuracy of the UPC code distribution in a dataset.
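A small sketch of both kinds of check, using hypothetical shipment records and an assumed (purely illustrative) rule that this partner's UPCs share a particular prefix:

    import pandas as pd

    # Hypothetical shipment records from one trading partner.
    shipments = pd.DataFrame({
        "upc":   ["041800000001", "041800000002", "999999999999", "041800000002"],
        "units": [10, 4, 7, 3],
    })

    # Frequency check: which UPCs appear unusually often (or rarely)?
    print(shipments["upc"].value_counts())

    # Range check based on organizational knowledge: suppose this partner is
    # only expected to report UPCs with the (hypothetical) prefix "0418".
    out_of_range = shipments[~shipments["upc"].str.startswith("0418")]
    print(out_of_range)  # candidate accuracy issues to capture as custom metadata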

Temporality

A data record is a representation of an entity at a particular time (or set of times). Accordingly, even though a dataset might have been an accurate and consistent representation at the time it was created, subsequent changes might render the representation inaccurate or inconsistent. For example, you might use a dataset of customer actions to determine the distribution of items people own. Weeks or months after an initial sale, however, some of these items might be returned. Now the original dataset, although an accurate representation of the original sales transaction, is no longer an accurate representation of the items a person owns.

The time-sensitive nature of representations, and hence of datasets, is an important aspect that should be explicitly noted. Sometimes this is done on a per-record basis (e.g., each customer action contains a timestamp). Data records that do not correspond to events are less likely to contain embedded timestamps; for example, a customer database with names, addresses, demographics, or a reference dataset of UPCs with item descriptions. However, even when time is not explicitly represented in a dataset, it is still important to understand how time may have impacted the records in the dataset. In the first example, a customer might move and his address can become inaccurate. In the second example, UPC numbers might be recycled and item descriptions can become inaccurate. In all such cases, knowing when the dataset was generated can provide valuable insight into potential inaccuracies and the appropriate wrangling actions needed to remediate those inaccuracies.

Scope

The scope of a dataset has two major dimensions. The first dimension concerns the number of distinct attributes represented in a dataset. For example, for each customer action, we might know when it happened (e.g., a timestamp) and some details about it (like which UPC a customer added to a basket). The second dimension concerns population coverage: are “all” of the relevant entities represented by records in the dataset, or have some been randomly, intentionally, or systematically excluded?

Let’s begin our discussion of the importance of scope by addressing the number of distinct attributes in a dataset. Each distinct attribute is generally represented in a dataset by a single field. A dataset with a wide scope will have a large number of fields, whereas a dataset with a narrow scope will have a small number of fields. You could imagine that a customer information dataset with a very wide scope could have hundreds of distinct fields, each representing a different attribute of a customer (age, salary, ethnicity, family size, etc.).

Obviously, increasing the scope of a dataset by adding in more entity characteristics extends the analytical potential of the dataset. However, as with granularity, you want to include only as much detail as you might use, but no more. The level of detail required might depend on your analytics methodology. Some methodologies, like deep learning, call for keeping many redundant attributes and letting statistical methods boil those many attributes down to a smaller number. Other methodologies operate best with a limited number of attributes.

Regarding population coverage, the most common scenario is that not all of the possible entities have been represented. Sometimes, missing records are the result of external factors, like the logging infrastructure failing due to a power outage. Other times you’re missing some of the data due to an operational or logistical error: a chunk of a file was destroyed, a sensor was knocked off power, somebody didn’t give you an entire file, or somebody redacted the dataset. The cause of the missing data can be helpful to know because it can help account for the bias downstream. For example, if you know that a dataset was scrubbed of people under 18 for legal reasons, you should account for the average age in the dataset being higher than the true population that the dataset (imperfectly) represents.

More generally, we might want to introduce the notion of an ideal or “true” source of data in the real world, and an imperfect record of that data that was acquired. When statisticians or scientists speak of “sampling,” they usually are talking about this—acquiring a small but representative subset of the “true” data as a stored dataset. The ideal (true) source of data might be infeasible to capture (think about the temperature of every cubic inch of air at every second hovering over the globe—we sample that to get temperature readings at particular thermometers). The data that is recorded is inherently a sample. This is a big difference between real-world (e.g., IoT) data and electronic data like transactions or web click logs.

It is important to understand any systematic bias in a dataset, because for cases in which systematic bias exists, any analytical inferences made using the biased dataset can be rendered invalid. A canonical example is drug trial analysis, where analysts are concerned with assessing the efficacy of the drug being trialed. The granularity of drug trial datasets is often at the level of patients in the trial. If, however, the scope of the dataset has been intentionally reduced by systematically removing records associated with some patients in the trial (either because they died during the trial or because they began showing abnormal biometrics), any analysis of the dataset is likely to misrepresent the actual impact of the drug.

The most common custom metadata question is: “How can this dataset blend with (join or union to) my other datasets?” To answer this question, we must understand the scope of the datasets. In some cases, the new dataset might represent an extension of an existing dataset. Sometimes, this extension is disjoint, involving records representing entities in the overall population that are missing in your existing datasets. Sometimes, the dataset extension is overlapping, creating a need to deduplicate or harmonize records representing the same entity. In other cases, a new dataset might provide additional record fields (e.g., household disposable income to be matched against your customer list or counts of different kinds of accidents by postal code).

Refined Data Stage Actions: Create Canonical Data and Conduct Ad Hoc Analyses

After you have ingested your raw data and fully understood the metadata aspects of your raw data, the next major stage in data projects involves refining the data and conducting a broad set of exploratory analyses. Three primary actions define this stage, as shown in Figure 2-4: design and preparation of “refined” data, ad hoc reporting analyses, and exploratory modeling and forecasting. As in the raw data stage, these actions can be separated into two groups distinguished by their output. One group is focused on outputting refined data that enables immediate application to a wide range of analyses. The second group is focused on outputting insights and information derived from the data, ranging from simple reporting to complex models and forecasts.

Figure 2-4. Primary action and output actions for the refined data stage

Designing Refined Data

The overarching goal in designing and creating your refined data is to simplify the foreseeable analyses you want to perform. Of course, you won’t foresee every analysis. The likely scenario is that insights and patterns gleaned from an initial set of analyses will inspire new analysis directions you hadn’t previously considered. To support these new directions, you might need to compile new refined datasets (or, at least, modify the ones you already have). We can, and often do, iterate between the actions in the refined data stage.

In the raw data stage, ingestion involves minimal data transformation—just enough to comply with the syntactic constraints of the data storage system. By contrast, the act of designing and preparing “refined” data often involves a significant amount of transformation. These transformations are often guided by the range of analyses that you plan to conduct on the data, and by the metadata-generating activities undertaken during the raw data stage. If you previously identified quality and consistency issues with the dataset’s structure, granularity, accuracy, time, or scope, those issues should be remedied during the refined data stage. We will address each of these metadata-related issues in turn, and discuss how you can design your refined data to resolve or mitigate such issues.

Addressing structural issues

Most visualization and analysis tools expect tabular data, meaning that every record has the same fields instantiated in the same order. Depending on the structure of the raw data, converting data into tabular format can require significant transformations. Furthermore, for modeling and forecasting purposes, you might need to convert categorical data into separate indicator fields, one per category value; for example, you might need to expand a record field encoding gender into multiple fields corresponding to indicators like “is male” or “is female.”
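One common way to produce such indicator fields is the pandas get_dummies function; the customer records and column names below are hypothetical.

    import pandas as pd

    # Hypothetical customer records with a categorical field.
    customers = pd.DataFrame({
        "customer_id": [1, 2, 3],
        "gender": ["female", "male", "female"],
    })

    # Expand the categorical field into separate indicator fields, one per
    # category value, as many modeling tools expect.
    indicators = pd.get_dummies(customers, columns=["gender"], prefix="is")
    print(indicators)  # columns: customer_id, is_female, is_male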

Addressing granularity issues

From a granularity perspective, it is prudent to build refined datasets at the finest resolution of records that you plan to analyze. For example, let’s assume that your sales records show a significant increase in the average amount of a sales transaction, and that you can identify a subset of customers who are driving this shift (this group makes significantly larger purchases relative to other customers, and removing them from the average calculation produces an average sale amount consistent with the recent past). You might want to understand what exactly differentiates these customers from the rest: Are they buying more expensive items? More items than average? Do they shop more often? Do they seem unaware of sales and coupons? Chances are, answering these questions will require the records of actions leading up to the sales transaction; in this case, the records with the finest granularity are the most useful for your analysis.

However, if the majority of analyses focus on records of a coarser granularity (e.g., customer segments or demographic groups, and lifetime purchase totals), it might make sense to store a version of the dataset at that granularity as well. Retaining multiple versions of the same dataset with different granularities can help streamline downstream analyses based on groups of records.

Addressing accuracy issues

Another key objective in designing and creating refined datasets is to remedy known accuracy issues.

The main strategies used to handle inaccuracies are to: (a) remove records with inaccurate values (provided the inaccuracies can be detected); (b) retain records with inaccurate values but mark them as inaccurate (which still allows some analyses to be conducted over the dataset); or (c) replace the inaccurate values with default or estimated values in a process known as imputation. As an example, for numerically distributed values like the dollar amount of sales transactions, you might determine that extremely large values are inaccurate. These extreme values might be replaced by a maximum value to ensure that aggregate calculations like average and standard deviation are not overly biased.
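Here is a compact sketch of all three strategies in pandas, with a purely illustrative maximum-plausible-amount rule standing in for real business logic.

    import numpy as np
    import pandas as pd

    # Hypothetical amounts with one implausibly large value and one missing value.
    sales = pd.DataFrame({"amount": [12.0, 25.0, 18.0, 9_999_999.0, np.nan]})
    MAX_PLAUSIBLE = 10_000.0  # assumed business rule, illustrative only

    # (a) Remove records whose values are detectably inaccurate.
    removed = sales[sales["amount"] <= MAX_PLAUSIBLE]

    # (b) Keep the records but flag them, so analyses can include or exclude them.
    sales["amount_suspect"] = sales["amount"] > MAX_PLAUSIBLE

    # (c) Impute: cap extreme values and fill missing ones with the median.
    sales["amount_clean"] = sales["amount"].clip(upper=MAX_PLAUSIBLE)
    sales["amount_clean"] = sales["amount_clean"].fillna(sales["amount_clean"].median())
    print(sales)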

Conflicting information between fields (e.g., multiple addresses or significant deviation between a date-of-birth field and an age field) or between a field and applicable business logic (e.g., a transaction amount too large to be possible given constraints in the transaction process) are primary accuracy issues that should be addressed during the refined data stage. In some cases, generally when the percent of records with inaccurate values is small and unlikely to be significant, the appropriate remedy is to remove affected records. For many analyses, removing these records will not materially affect the results. In other cases, the best approach might be to reconcile conflicting information; for example, recalculating customer age using date-of-birth and the current date (or the dates of the events you want to analyze).

In many cases, resolving conflicting or inaccurate data fields in your refined data is best done using an explicit reference to time. Consider the multiple addresses problem in a customer dataset. Perhaps each address is (or was) correct, representing the various home addresses a person has had throughout her life. Assigning date ranges to the addresses might resolve the conflicts. Similarly, a transaction amount that violates current business logic might have occurred prior to that logic being enforced, in which case you might want to keep the transaction in the dataset to preserve the integrity of historical analyses.

The most useful notion of “time” can often require some care. For example, there might be a time when an action occurred, and a time when it was recognized. This is particularly common when dealing with bank transactions. In some cases, an abstract version number might be more appropriate than a timestamp. For example, for data generated by software, it might be more important to record the version of the software that was used rather than the time when the software ran. Similarly, in scientific experiments it might be more useful to know the version of a data file that was analyzed rather than the time that the analysis ran. In general, the appropriate choice of time or version often depends on the specifics of an analysis; therefore, it’s best to preserve (and document!) all the timestamps and version numbers that are available for a record.

Addressing scope issues

Stepping back from individual record field values, it is also important to design the scope of your refined datasets so that these datasets include the full required set of records and record fields. For example, suppose that your customer data is split into multiple datasets (one containing contact information, another containing transaction summaries, etc.) but most of your analyses involve all of these fields. It might make sense to create a fully blended dataset with all of these fields to streamline your analyses.
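A minimal sketch of this kind of blending, assuming two hypothetical customer datasets joined on a shared customer_id field:

    import pandas as pd

    # Hypothetical split customer data: contact information in one dataset,
    # transaction summaries in another.
    contacts = pd.DataFrame({
        "customer_id": [1, 2, 3],
        "email": ["ana@example.com", "ben@example.com", "chi@example.com"],
    })
    summaries = pd.DataFrame({
        "customer_id": [1, 2],
        "lifetime_spend": [540.25, 1203.10],
    })

    # Blend into a single refined dataset so downstream analyses do not each
    # repeat the join; a left join keeps customers with no transactions yet.
    refined = contacts.merge(summaries, on="customer_id", how="left")
    print(refined)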

Perhaps the most important scope-related challenge is ensuring your refined datasets have understood population coverage. This means that a dataset should accurately express the relationship between the set of things represented by records in the dataset (people, objects, etc.) and the wider population of those things (e.g., all people and all objects).

Ideally, your dataset contains one and only one record representing each member of the population of things you want to analyze, but more likely, your dataset will contain a subset of records from the complete population. If the subset is a true, random subset of the wider population, your dataset can be analytically used to infer insights about the population as a whole. If, however, the subset of represented things in your dataset is not random—that is, it exhibits some kind of bias—you might be restricted on the kinds of analyses you can validly conduct. It’s beyond the scope of this book to discuss the statistical implications of bias between your dataset and the population of things it partially represents, but this is an important note to be aware of when designing refined datasets.

Refined Stage Analytical Actions

To complete our discussion of the refined data stage, we’ll describe its two core analytical actions: ad hoc reporting analyses, and exploratory modeling and forecasting.

Reporting is the core action for answering specific questions using your data. You can think of business intelligence analytics and dashboarding as specific forms of reporting. These analyses are primarily retrospective—they use historical data to answer questions about the past or present. The answer to those questions might be as simple as a single number or statistic, or as complex as a full report with additional discussion and description of the findings. The nature of the initial query constrains the consumption of the output—there likely won’t be an automated system that can consume the output and take direct action as a result. Instead, the results will provide indirect value by informing and influencing people.

There are many types of ad hoc analyses, ranging from straightforward questions that can be answered in a succinct and definitive manner (e.g., how many customers purchased item X last week, or, what were the top three most viewed documents/pages on our website last month?) to open-ended investigations that might last months or years (e.g., identify the key factors driving the customer trend of switching from desktop to mobile devices). A common starting point for ad hoc analyses is an anomaly in a regular report. Perhaps sales spiked up more than expected or there was a dip in transactions from a particular product line or region of stores. If totally unexpected, the anomaly needs to be assessed from a number of different perspectives. Is there a data reporting or data quality problem? If the data is valid (i.e., the anomaly reflects a change in the world and not just in the representation of the world contained in the dataset), can the anomaly be isolated to a subpopulation? What other changes are correlated with the anomaly? Are all of these changes linked via causal dependencies to one another or to a common root change?

Unlike ad hoc analyses, which are primarily retrospective, modeling and forecasting analyses are primarily concerned with the future. These analyses ask, “What do we expect to happen given what we have observed in the past?” In the case of forecasting, the explicit objective is to predict future events: total sales in the next quarter, percent of customer churn next month, likelihoods of each customer renewing their contracts, and so on. In many cases, these predictions are built on models of how the target prediction depends on and relates to other measurable aspects in your dataset. For some analyses, the useful output is not a prediction (or set of predictions), but the underlying model itself.

In most cases, modeling as an explicit activity is an attempt to understand the relevant factors that drive the behavior (whether it is customer behavior, market movements, shifts in the existence or nature of relationships, frequencies or types of events, etc.) that interests you. Even though modeling is a common prerequisite for distinguishing correlated factors from causal factors, it is important to note that truly causal analysis will also require some amount of carefully designed experimentation; for example, holding some factors constant while perturbing others, and/or modifying the underlying system to decouple or realign factors so that you can assess whether those factors truly drive the behavior.

As an example, recall from Chapter 1 our discussion of Facebook’s user growth and the company’s reliance on a measurable factor of the Facebook user experience, the number of friends a user has, to drive user retention. Simple analysis of the relationship between the number of friends and long-term user retention would show that it is not a causal relationship. There are long-time users of Facebook who have only a few friends, and there are Facebook users who had many friends but have churned from the platform. Nonetheless, these factors are strongly correlated. More important, the product changes that Facebook made to increase the number of friends that users have demonstrably moved the underlying causal factors that do drive user retention.

Production Data Stage Actions: Create Production Data and Build Automated Systems

After you have refined your data and have begun generating valuable insights from that data, you will naturally start to separate out the analyses that need to be regularly refreshed from the ones that were sufficient as one-off analyses. It’s one thing to explore and prototype (which is the focus of activities in the refined data stage), but wrapping those initial outputs in a robust, maintainable framework that can automatically direct people and resources is a whole other ballgame. This takes us into the production data stage.

A solid set of initial insights often leads to statements like: “We should track that measure all the time,” or “We can use those predictions to expedite shipping of certain orders.” The solutions to each of these statements involve “production systems”; that is, systems that operate in a largely automated way and with a well-defined level of robustness. At a minimum, creating production data requires further optimizations to your refined data (Figure 2-5), and then engineering, scheduling, and monitoring the flow of that optimized data into regular reports and data-driven products and services.

Figure 2-5. Primary action and output actions for the production data stage

Creating Optimized Data

So, what does it mean to optimize your data? In some sense, optimizing data is not unlike designing refined data. You can think of optimized data as the ideal form of your data; it is designed to simplify any additional downstream work to use the data. Unlike refined data, however, the intended use of optimized data should be highly specified.

These specifications go beyond just the scope and accuracy of the values in the data. There are also specifications related to the processing and storage resources that will need to be applied to work with the data on an ongoing basis. These constraints will often dictate the structure of the data, as well as the ways in which that data is made available to the production system. In other words, although the goal of refining data is to support the widest set of analyses as efficiently as possible, the goal of optimizing data is to robustly and efficiently support a very narrow set of analyses.

Designing Regular Reports and Automated Products/Services

Building regular reports and data-driven products and services requires more than just wiring the data into the report generation logic or the service providing logic. One major source of additional work comes from monitoring the flow of data and ensuring that requisite structural, temporal, scoping, and accuracy constraints remain satisfied over time. The fact that data is flowing in these systems implies that new (or updated) data will be processed in an ongoing manner. New data will eventually vary from its historical equivalents (maybe you have updated customer interaction events, or the latest week’s sales data). Structural, temporal, scoping, and accuracy constraints define the boundary around permissible variation (e.g., minimum and maximum sales amounts or coordination between record fields like billing address and currency of transaction).

Within the constraints, the reporting and product/service logic must handle the variation. This deviates from exploratory analytics that can, for speed or simplicity, use logic specific to the dataset being analyzed. For production reporting and products/services, the logic must be generalized. Common dataset variations that drive changes to the data wrangling logic include extensions to the value ranges (e.g., current dates or redefining regions or customer segments); new accuracy issues (e.g., previously unseen misspellings); record fields that have been removed or emptied (e.g., for legal compliance purposes, certain information about customers like age or gender might be redacted); appearance of duplicate records; or disappearance of a subset of records (e.g., due to a change in customer segment names, one or more groups might be dropped).
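As a rough sketch of what such boundary checks might look like in code, a validation step can flag batches that fall outside the permissible variation; the constraints, field names, and thresholds below are all hypothetical.

    import pandas as pd

    # Hypothetical boundary of permissible variation for an incoming batch.
    MIN_AMOUNT, MAX_AMOUNT = 0.0, 10_000.0
    REQUIRED_FIELDS = {"store_id", "sale_date", "amount"}

    def validate_batch(batch):
        """Return a list of constraint violations; an empty list means the
        batch stays inside the boundary of permissible variation."""
        problems = []
        missing = REQUIRED_FIELDS - set(batch.columns)
        if missing:
            problems.append("missing fields: %s" % sorted(missing))
        if "amount" in batch.columns:
            bad = batch[(batch["amount"] < MIN_AMOUNT) | (batch["amount"] > MAX_AMOUNT)]
            if len(bad):
                problems.append("%d records with out-of-range amounts" % len(bad))
        if batch.duplicated().any():
            problems.append("duplicate records present")
        return problems

    batch = pd.DataFrame({
        "store_id":  [1, 1],
        "sale_date": ["2016-10-01", "2016-10-01"],
        "amount":    [19.99, 19.99],
    })
    print(validate_batch(batch))  # ['duplicate records present']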

Of course, you could tighten the boundary of permissible variations to exclude things like duplicate records or missing subsets of records. If so, the logic to catch and remedy these variations will likely live in the data optimization action.

Data Wrangling within the Workflow Framework

In this chapter, we have described the characteristic actions and movement of data through the raw, refined, and production stages. Throughout that discussion, we touched on the kinds of data and outputs involved in each stage. But there’s still a piece missing from our discussion: what is data wrangling, and how does it relate to our workflow framework?

Fundamentally, data wrangling is the process involved in transforming or preparing data for analysis. If you refer back to our earlier workflow diagrams, data wrangling occurs between the stages; it is the set of actions that allows you to move from raw data to refined data, or from refined data to optimized, production data. Sometimes (particularly in the transition from the raw stage to the refined stage) data wrangling can resemble traditional ETL processes. We consider ETL to be one type of data wrangling, specifically a type of data wrangling managed and overseen by an organization’s shared services or IT organization. But data wrangling can also be handled by business users in desktop tools like Excel, or by data scientists in coding languages like Python or R.

We see data wrangling as a core task within every action in the framework. This is not to say, however, that data wrangling tasks will be identical throughout this workflow framework. These wrangling tasks will differ, particularly in the kinds of transformations that are applied. In Chapter 3, we dive into the specifics of data wrangling, describe the various types of data wrangling transformations, and explain how data wrangling tasks differ across our workflow framework.
