Chapter 3. The Dynamics of Data Wrangling

In Chapter 2, we introduced a framework capturing the variety of actions involved in working with data. Each of these actions involves some amount of data wrangling. In this chapter, we describe the dynamics of data wrangling, the breadth of transformations and profiling required to wrangle data, and how these aspects of data wrangling vary by action in our framework.

Data Wrangling Dynamics

Data wrangling is a generic phrase capturing the range of tasks involved in preparing your data for analysis. Data wrangling begins with accessing your data. Sometimes, access is gated on getting appropriate permission and making the corresponding changes in your data infrastructure. Access also involves manipulating the locations and relationships between datasets. This kind of data wrangling involves everything from moving datasets around a folder hierarchy, to replicating datasets across warehouses for easier access, to analyzing differences between similar datasets and assessing overlaps and conflicts.

After you have successfully accessed your data, the bulk of your data wrangling work will involve transforming the data itself—manipulating the structure, granularity, accuracy, temporality, and scope of your data to better align with your analysis goals. All of these transformations are best performed with tools that provide meaningful feedback, so that the person applying them can confirm that they were successful. We refer to this feedback as profiling. In many cases, a predefined (and, hence, somewhat generic) set of profiling feedback is sufficient to determine whether an applied transformation was successful. In other cases, customized profiling is required to make this determination. In either event, the bulk of data wrangling involves frequent iterations between transforming and profiling your data.

A final set of data wrangling tasks can be understood as publishing. Publishing is best understood from the perspective of what is published. In some cases, what is published is a transformed version of the input datasets (e.g., in the design and creation of “refined” datasets). In other cases, the published entity is the transformation logic itself (e.g., as a script that generates the range of statistics and insights in a regular report). A final kind of publishing involves creating profiling metadata about the dataset. These profiling reports are critical for managing automated data services and products.

Figure 3-1 illustrates the simple relationship between these data wrangling steps. As just mentioned, wrangling begins with access. From there, the bulk of time and energy is spent transforming and profiling the results of the transformation. Finally, the desired output is published for downstream consumption. Realistically, data wrangling is far more iterative. In addition to iterations between transforming and profiling the data, there are less frequent iterations that return to accessing data. Likewise, during or soon after publishing a result, you might realize that the output is not exactly correct and you need to apply additional transformations or expose some additional profiling results. These iterations are captured in Figure 3-1.

Figure 3-1. A simple diagram illustrating the basic steps of data wrangling

The work begins by obtaining access to your data. You might already have the dataset in a file and need to do little more than double-click to open it. In other cases, you might need to submit a request for access and obtain the necessary credentials. With the data in hand, the bulk of data wrangling involves iterating between applying transformations and assessing the impact of those transformations through profiling. After you have modified the dataset as desired, or authored a robust data transformation script, or produced the profiling statistics and visualizations that showcase aspects of the dataset, any or all of these outputs must be published.

Additional Aspects: Subsetting and Sampling

There are two additional aspects to the dynamics of data wrangling that we believe are vital to finding efficiencies in your data wrangling practice: subsetting data and sampling data. Unlike the general steps and dynamics discussed earlier, which are broadly applicable, these two apply only in certain circumstances.

First, consider the case in which your dataset contains a heterogeneous set of records, differing either in structure (e.g., some records contain more or different fields than the rest) or in granularity (e.g., some records correspond to customers, whereas others correspond to accounts). Faced with this heterogeneity, the best wrangling approach is to split up the original dataset and wrangle each subset separately; then, if necessary, merge the results together again. Process efficiencies are fundamentally rooted in the ability to apply the same processing mechanism (in this case, data transformation logic) to many, similar inputs. At one scale, this amounts to authoring data transformation steps that operate effectively across every record in your dataset. At a wider scale, this amounts to authoring a data transformation script that operates effectively across multiple, similarly structured datasets. This case requires subsetting the data for the most efficient data wrangling process.

Now consider the case in which your dataset is too large to manually review each record or when your dataset is so large that even simple transformations require prohibitively long timeframes to complete (or speeding up these transformations would have prohibitively large resource costs); in other words, when you are working with big data. In this case, the iterative process of transforming and profiling your data is materially hampered by the time required to compute and execute transformations.

Suppose that you make a small change to a derived calculation or you change a rule to group a few customer segments into a wider segment. Now you apply this transformation and must wait a minute, or 10 minutes, or half a day to see what the results might look like. Understandably, data wrangling work will dominate your analysis workflows and you won’t get through many analyses. The critical approach to speeding up your data wrangling is to work with some samples of the entire dataset that you can transform and profile at interactive time scales (ideally within 100 milliseconds, but occasionally up to a few seconds). Unfortunately, working with samples to speed up your data wrangling is not as straightforward as it sounds.

The complexities of data sampling relative to analysis are well covered in statistics and survey-methodology texts like Leslie Kish’s Survey Sampling. For our discussion here, it is sufficient to point out the close connection between sampling and profiling. Again, our objective is to speed up the task of data wrangling by providing profiling feedback from a sample of the dataset (which can be processed at interactive speeds) rather than from the entire dataset.

To understand the importance of sampling, consider a simple transformation that calculates how long each of your customers has been using your product or service, based on the date each one registered for it. Chances are, you have some sense of what these ages should be: you started your business 11 years ago, so no customer should show a duration of more than 11 years. You had a big increase in customers about 3 years ago, so you’d expect to see a corresponding bump in the overall age distribution around the 3-year mark. And so on. These expectations point to a couple of sampling techniques that would be useful for assessing whether your transformation to calculate customer age is working correctly. Namely, you want a sample that contains extreme values (customer records with the earliest and latest registration dates) and that randomly samples over the rest of the records (so that overall distributional trends are visible).
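As a rough sketch of this idea, the following Python snippet (using the pandas library and a hypothetical customers table with a registration_date column) combines the extreme records with a random sample of the rest before computing customer tenure:

    import pandas as pd

    # Hypothetical customer data; in practice this comes from your raw data stage.
    customers = pd.read_csv("customers.csv", parse_dates=["registration_date"])

    # Keep the extreme values: the earliest and latest registration dates.
    extremes = customers.loc[
        [customers["registration_date"].idxmin(), customers["registration_date"].idxmax()]
    ]

    # Randomly sample the remaining records so overall distributional trends stay visible.
    rest = customers.drop(extremes.index)
    rest = rest.sample(n=min(1000, len(rest)), random_state=42)

    sample = pd.concat([extremes, rest])

    # The transformation under test: customer tenure in years as of today.
    sample["tenure_years"] = (
        pd.Timestamp.today() - sample["registration_date"]
    ).dt.days / 365.25

    # Quick profiling check: no tenure should exceed the age of the business (11 years).
    print(sample["tenure_years"].describe())
    print((sample["tenure_years"] > 11).sum(), "records exceed the expected maximum")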

Consider a more complex situation that involves case-based transformations based on record groups; for example, you need to convert transaction amounts to US dollars, and your dataset contains transactions in Euros, GB Pounds, and so on. Each reporting currency requires its own transformation. To assess that all of the currency transformations were applied correctly, you need to profile results covering all of the currencies occurring in your dataset. Samples that cover all groups, or “strata,” are often referred to as stratified samples; they provide representation of each group, even though they might bias the overall trends of the full dataset (by overrepresenting small groups relative to large ones, for example). There are numerous techniques for extracting different kinds of samples from large datasets (e.g., see Synopses for Massive Data: Samples, Histograms, Wavelets, Sketches by Cormode et al.), and some software packages and databases implement these methods for you.
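A minimal sketch of a stratified sample, assuming a hypothetical transactions table with a currency column, draws a fixed number of records from every currency so that each conversion rule can be profiled:

    import pandas as pd

    # Hypothetical input with a "currency" column.
    transactions = pd.read_csv("transactions.csv")

    # Draw up to 100 records from each currency group so every conversion rule is represented.
    stratified = (
        transactions.groupby("currency", group_keys=False)
        .apply(lambda grp: grp.sample(n=min(100, len(grp)), random_state=7))
    )

    # Profiling check: every currency in the full dataset appears in the sample.
    assert set(stratified["currency"]) == set(transactions["currency"])
    print(stratified["currency"].value_counts())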

With an understanding of the basic steps in data wrangling—access, transformation, profiling, and publishing—and how these steps can incorporate aspects of sampling to handle big datasets and split/fix/merge strategies for heterogeneous datasets, we turn our attention now to the core types of transformations and profiling.

Core Transformation and Profiling Actions

The core tasks of data wrangling are transformation and profiling, and the general workflow involves quick iterations (on the order of seconds) between these tasks. Our intent in this section is to provide a basic description of the various types of transformation and profiling. In later chapters, we will dive into explicit examples of transformation and profiling.

Let’s begin our discussion by exploring the transformation tasks involved in data wrangling. Table 3-1 describes the core types of data wrangling transformations that you might need to apply to your data.

Table 3-1. Core data wrangling transformation types

Core transformation type | Description
Structuring | Actions that change the form or schema of your data
Enriching | Actions that add new values to your dataset
Cleansing | Actions that fix irregularities in your dataset

The core types of transformation are structuring, enriching, and cleansing. Structuring primarily involves moving record field values around, and in some cases summarizing those values. Structuring might be as simple as changing the order of fields within a record. More complex transformations that restructure each record independently include breaking record fields into smaller components or combining fields into complex structures. At the interrecord level, some structuring transformations remove subsets of records. Finally, the most complex interrecord structuring transformations involve aggregations and pivots of the data. Aggregations enable a shift in the granularity of the dataset (e.g., moving from individual customers to segments of customers, or from individual sales transactions to monthly or quarterly net revenue calculations). Pivoting involves shifting records into fields or shifting fields into records.
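As a minimal illustration of these interrecord structuring transformations, the following pandas sketch aggregates a small, made-up sales table from individual transactions to monthly totals and pivots the region field into columns; the table and column names are assumptions for illustration:

    import pandas as pd

    sales = pd.DataFrame({
        "date":   pd.to_datetime(["2023-01-05", "2023-01-20", "2023-02-03", "2023-02-15"]),
        "region": ["East", "West", "East", "West"],
        "amount": [120.0, 80.0, 200.0, 150.0],
    })

    # Aggregation: shift granularity from individual transactions to monthly net revenue.
    monthly = sales.groupby(sales["date"].dt.to_period("M"))["amount"].sum()

    # Pivot: shift the region field into columns, one monthly revenue column per region.
    by_region = sales.pivot_table(
        index=sales["date"].dt.to_period("M"),
        columns="region",
        values="amount",
        aggfunc="sum",
    )

    print(monthly)
    print(by_region)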

Whereas structuring transformations move or aggregate existing values within a single dataset, enriching transformations add fundamentally new values to your dataset. The quintessential enriching transformations are joins and unions. Joins blend two datasets together by matching up records between them and concatenating the matches “horizontally” into a wider table that includes attributes from both sides of the match. Unions combine multiple, similarly structured datasets by stacking their records “vertically” into a single, longer table.
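A brief sketch of the contrast between the two operations in pandas, using small made-up customer and order tables:

    import pandas as pd

    customers_2022 = pd.DataFrame({"customer_id": [1, 2], "name": ["Ada", "Bo"]})
    customers_2023 = pd.DataFrame({"customer_id": [3], "name": ["Cy"]})
    orders = pd.DataFrame({"customer_id": [1, 3], "total": [250.0, 99.0]})

    # Union: stack similarly structured datasets "vertically" into a longer table.
    all_customers = pd.concat([customers_2022, customers_2023], ignore_index=True)

    # Join: match records on a shared key and widen the table with attributes from both sides.
    enriched = all_customers.merge(orders, on="customer_id", how="left")

    print(enriched)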

Beyond joins and unions, another common class of enriching transformations inserts metadata into your dataset. The inserted metadata might be dataset independent (e.g., the current time or the username of the person transforming the data) or specific to the dataset (e.g., filenames or locations of each record within the dataset). Yet another class of enriching transformations involves the computation of new data values from the existing data. In broad strokes, these kinds of transformations either derive generic metadata (e.g., time conversion or geo-based calculations like latitude-longitude coordinates from a street address or a sentiment score inferred from a customer support chat log) or custom metadata (e.g., mineral deposit volumes inferred from rock samples or health outcomes inferred from treatment records). Chapter 6 discusses specific examples of enriching transformations.
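As an illustrative sketch of these simpler enrichments, the following pandas snippet adds dataset-independent metadata (a processing timestamp and username), dataset-specific metadata (the source filename), and one derived value; the file and column names are hypothetical:

    import getpass
    from pathlib import Path
    import pandas as pd

    source = Path("customers.csv")          # hypothetical source file
    customers = pd.read_csv(source, parse_dates=["registration_date"])

    # Dataset-independent metadata: when and by whom the data was transformed.
    customers["processed_at"] = pd.Timestamp.now()
    customers["processed_by"] = getpass.getuser()

    # Dataset-specific metadata: where each record came from.
    customers["source_file"] = source.name

    # A derived value computed from existing data: customer tenure in days.
    customers["tenure_days"] = (pd.Timestamp.now() - customers["registration_date"]).dt.days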

The third type of transformation cleans a dataset by fixing quality and consistency issues. Cleaning predominately involves manipulating individual field values within records. The most common variant fixes missing (or NULL) values. We explore specific examples of cleansing transformations in Chapter 7.
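A minimal sketch of this kind of cleansing, using a small made-up orders table, flags missing values and then fills them with reasonable defaults; the imputation choices here are assumptions for illustration:

    import pandas as pd

    orders = pd.DataFrame({
        "order_id": [101, 102, 103],
        "amount":   [25.0, None, 40.0],
        "region":   ["East", "West", None],
    })

    # Flag missing values before touching them, so downstream users can tell
    # original values from imputed ones.
    orders["amount_was_missing"] = orders["amount"].isna()

    # Fill missing amounts with the median and missing regions with an explicit label.
    orders["amount"] = orders["amount"].fillna(orders["amount"].median())
    orders["region"] = orders["region"].fillna("UNKNOWN")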

Switching gears, the core types of profiling are distinguishable by the unit of data they operate on: individual values or sets of values. Table 3-2 provides a description of the two types of core profiling.

Table 3-2. Two types of core profiling

Core profiling type | Description
Individual values profiling | Understanding the validity of individual record fields
Set-based profiling | Understanding the distribution of values for a given field across multiple records

Profiling on individual field values involves two kinds of constraints: syntactic and semantic. Syntactic constraints focus on formatting; for example, a date value should be in MM-DD-YYYY format. Semantic constraints are rooted in context or proprietary business logic; for example, your company is closed for business on New Year’s Day so no transactions should exist on January 1 of any year. Ultimately, this type of profiling boils down to determining the validity of individual record field values, or, by extension, entire records.
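Here is a rough sketch of both kinds of checks in pandas, assuming a hypothetical transactions table whose dates are stored as MM-DD-YYYY strings:

    import pandas as pd

    transactions = pd.DataFrame({
        "transaction_date": ["03-15-2023", "2023/03/16", "01-01-2023"],
        "amount": [10.0, 20.0, 30.0],
    })

    # Syntactic constraint: the date must match MM-DD-YYYY.
    syntactically_valid = transactions["transaction_date"].str.match(r"^\d{2}-\d{2}-\d{4}$")

    # Semantic constraint: the business is closed on New Year's Day, so no
    # transaction should fall on January 1 of any year.
    parsed = pd.to_datetime(transactions["transaction_date"], format="%m-%d-%Y", errors="coerce")
    semantically_valid = ~((parsed.dt.month == 1) & (parsed.dt.day == 1))

    transactions["valid"] = syntactically_valid & semantically_valid
    print(transactions[~transactions["valid"]])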

Set-based profiling focuses on the shape and extent of the distribution of values found within a single record field, or in the range of relationships between multiple record fields. For example, you might expect retail sales to be higher in holiday months than in nonholiday months; thus, you could construct a set-based profile to confirm that sales are distributed across months as expected. We will explore specific examples of set-based profiling and individual profiling in Chapter 4.
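A sketch of such a check, assuming a hypothetical sales table with date and amount fields and treating November and December as the holiday months:

    import pandas as pd

    sales = pd.read_csv("sales.csv", parse_dates=["date"])   # hypothetical input

    monthly_totals = sales.groupby(sales["date"].dt.month)["amount"].sum()

    # Set-based check: holiday months should outsell the average non-holiday month.
    holiday = monthly_totals.loc[monthly_totals.index.isin([11, 12])].mean()
    non_holiday = monthly_totals.loc[~monthly_totals.index.isin([11, 12])].mean()

    if holiday <= non_holiday:
        print("Unexpected distribution: holiday sales do not exceed non-holiday sales")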

So far, we’ve provided an overview of the basic types of transformations and profiling. Soon after doing the work, however, your focus will likely shift to include the second-order concern of doing the work well. In other words, in addition to putting a script of transformation logic together and profiling the steps of that script as you go to make sure they operate correctly on the initial data source(s), you want to optimize that script to run efficiently and robustly. Furthermore, over time, as new data mandates edits to the transformation script or you find more optimal ways to author portions of the transformation script, you’ll likely want to track changes to the script and, possibly, manage multiple versions of it for legacy and compatibility purposes. Supporting these changes might require some additional, customized profiling information, tracking statistics across variations of the dataset.

Data Wrangling in the Workflow Framework

Data wrangling can be a major aspect of every action in your workflow framework (refer to Figure 2-2). In this section, we discuss each action in turn, describing how data wrangling commonly fits into the action.

Ingesting Data

As we discussed in Chapter 2, ingesting data into the raw data stage can involve some amount of data wrangling. Loading the data into the raw data stage location might require some nontrivial transformation of the data to ensure that it conforms to basic structural requirements (e.g., records and field values encoded in particular formats). The constraints involved in loading the data will vary with the infrastructure underlying your raw data stage. Older data warehouses will likely require particular file formats and value encodings, whereas more modern infrastructures like MongoDB or HDFS will permit a wider variety of structures in the ingested data (involving less data wrangling at this stage).

In either event, the explicit goal when loading raw data is to perform the minimal amount of transformations to the data to make it available for metadata analysis and eventual refinement. The general objectives are “don’t lose any data” and “fixing quality problems comes next.” Satisfying these objectives will require limited structuring transformations and enough profiling to ensure that data was not lost or corrupted in the ingestion process.
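The following sketch illustrates the kind of lightweight profiling that supports these objectives; the source file, its expected width, and the use of pandas for loading are all assumptions for illustration:

    import pandas as pd

    raw_path = "source_extract.csv"          # hypothetical source file
    expected_columns = 12                    # known width of the source records (assumption)

    # Count lines in the source (minus the header) before loading.
    with open(raw_path) as f:
        source_records = sum(1 for _ in f) - 1

    loaded = pd.read_csv(raw_path)

    # "Don't lose any data": the loaded record count should match the source.
    assert len(loaded) == source_records, "record count changed during ingestion"

    # Basic corruption check: every record should have the expected number of fields.
    assert loaded.shape[1] == expected_columns, "unexpected number of fields after load"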

Describing Data

Assessing the structure, granularity, accuracy, temporality, and scope of your data is a profiling-heavy activity. Building a broad understanding of your data requires a wide range of profiling views, which in turn require an exploratory range of transformations. Most of the exploratory transformations will involve structuring: breaking out subcomponents of existing values to assess their quality and consistency, filtering the dataset down to subsets of records to assess scope and accuracy, aggregating and pivoting the data to triangulate values against other internal and external references, and so on.

Assessing Data Utility

Assessing the custom metadata of a dataset primarily involves enriching and cleaning transformations. In particular, if the dataset is a new installment to prior datasets, you will need to assess the ability to union the data. Additionally, you will likely want to join the new dataset to existing ones. Attempting this join will reveal issues with linking records between the datasets: perhaps too few links are found, or, equally problematic, there are too many duplicative links. In either case, by treating your existing data as a baseline standard to which the new dataset must adhere or align, you will likely spend a good amount of time cleaning and altering values in the new dataset to tune its overlap with existing data. As the new data is blended in with the old, set-based profiling will provide the basic feedback on the quality of the blend.
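One lightweight way to surface these linking issues is to profile the join keys before blending. The sketch below assumes hypothetical existing and new customer tables keyed by a customer_id column and counts both unmatched and duplicative links:

    import pandas as pd

    existing = pd.read_csv("existing_customers.csv")   # hypothetical baseline dataset
    new = pd.read_csv("new_customers.csv")             # hypothetical new installment

    # Too few links: new records whose key never appears in the existing data.
    unmatched = ~new["customer_id"].isin(existing["customer_id"])
    print(unmatched.sum(), "new records have no match in the existing data")

    # Too many links: keys that would fan out into duplicate rows after a join.
    dup_keys = existing["customer_id"].value_counts()
    print((dup_keys > 1).sum(), "existing keys appear more than once and would duplicate links")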

Designing and Building Refined Data

Building refined datasets that are useful across a broad range of ad hoc analyses and deeper modeling and forecasting explorations requires the full breadth of transformation and profiling types. Structuring the data to align with the granularity and scope of intended analyses will save time. For example, if most of your planned analyses are at a granularity (e.g., weekly total sales) that differs from the granularity of the raw data (e.g., individual sales transactions), it likely makes sense to apply aggregation or pivot transformations. In terms of enrichment, if many of your analyses involve multiple data sources, it makes sense to build blended datasets using enrichment transformations like joins and unions. Similarly, depending on how frequently your analyses require nontrivial derivations like smoothed time-series values or sentiment scores, it might make sense to build these into the refined datasets (as opposed to requiring each analysis to rebuild these enrichments as needed). Finally, cleaning the data is key to building broadly usable datasets. You need to flag inaccurate or inconsistent values as such (at a minimum) or replace them with more accurate and consistent versions. Likewise, many analyses will require missing data to be filled in with reasonable estimated values.

In terms of profiling, all types are required when building refined datasets. To ensure the quality and consistency of individual record values, profiling at this level should be aggressively applied across all record fields. Initially, for many fields, individual value profiling will enforce little more than syntactic constraints because semantic constraints on specific values will be unknown. As more and more ad hoc analyses are completed, and a better understanding of the value of the underlying data emerges, additional semantic profiling checks might be added. Similarly, as a deeper understanding of the dataset emerges through its use in many analyses, the richness of set-based profiling will increase. Initially, set-based constraints might enforce simple range or loose distribution checks. Over time, expected correlations between fields and trends in the changes of the distribution of field values (e.g., steady increasing of median sales prices) might be profiled and enforced.

It is difficult to overemphasize the criticality of building good refined datasets. From a governance perspective, these datasets will be the source of most of the insights that deliver value to you and your organization. The validity and consistency of these insights will depend on the quality of your refined datasets. In no small way, the trust an organization will have in the use of data to drive its decisions and operations will depend on the quality of these refined datasets.

Ad Hoc Reporting

Starting from refined datasets, reporting primarily involves structuring (or restructuring) input data. Perhaps you are exploring the impact of changing region boundaries and want to look at historical data to see if the new regions are more balanced in terms of traffic, sales, costs, and so on. Many of the necessary data transformations will involve pulling subcomponents out of record fields (for finer-grained analysis), filtering out nonrelevant records, and aggregating or pivoting metrics around subgroups of records. Additionally, depending on how your refined datasets are designed, ad hoc reporting might involve enrichment transformations, as well. Relative to the raw data ingested in the prior stage, much of the enrichment focused on joins and unions will already be done (though certainly not all). Beyond blending, data wrangling for ad hoc reporting might also require the derivation of new data values, perhaps involving the insertion of metadata like the current date, or calculating complex values like arbitrary percentiles or winsorized averages.
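As an illustrative sketch of these derivations, the following pandas snippet computes an arbitrary percentile and a winsorized average over a hypothetical refined sales table; the 5 percent clipping level is an assumption:

    import pandas as pd

    sales = pd.read_csv("refined_sales.csv")      # hypothetical refined dataset

    # An arbitrary percentile of the sales amounts.
    p90 = sales["amount"].quantile(0.90)

    # A winsorized average: clip the top and bottom 5 percent before averaging
    # so extreme transactions do not dominate the metric.
    lower, upper = sales["amount"].quantile([0.05, 0.95])
    winsorized_mean = sales["amount"].clip(lower=lower, upper=upper).mean()

    print(f"90th percentile: {p90:.2f}, winsorized mean: {winsorized_mean:.2f}")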

Exploratory Modeling and Forecasting

Like ad hoc reporting that builds from refined datasets, exploratory modeling and forecasting will employ a significant amount of structuring transformations. In addition to filtering, aggregating, and pivoting of records, it is common to pivot categorical record fields into separate indicator fields. This enables modeling techniques like regression. Furthermore, if your modeling analysis is focused on assessing the relative importance of various record fields, the fields might need to be normalized so that their corresponding model weights can be readily compared.
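A brief sketch of both preparations in pandas, using a small made-up customers table with a categorical segment field and two numeric fields:

    import pandas as pd

    customers = pd.DataFrame({
        "segment": ["retail", "wholesale", "retail"],
        "spend":   [120.0, 950.0, 310.0],
        "visits":  [4, 12, 7],
    })

    # Pivot the categorical segment field into separate indicator (0/1) fields.
    modeling_table = pd.get_dummies(customers, columns=["segment"])

    # Normalize the numeric fields (z-scores) so model weights are comparable.
    for col in ["spend", "visits"]:
        modeling_table[col] = (
            modeling_table[col] - modeling_table[col].mean()
        ) / modeling_table[col].std()

    print(modeling_table)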

One of the benefits of exploring your data by using modeling and forecasting techniques is that these methods have the side effect of indicating when certain data points (record field values, or, by extension, entire records) are outliers; that is, they appear anomalous relative to the majority of the data. In some cases, these outlier data points might contain inaccurate values, instigating some additional data wrangling efforts focused on cleaning the data. In other cases, the outliers might represent valid data points that necessitate a change in how you understand the data or the processes that produce the data (e.g., most customers might spend only a few dollars per transaction, so if someone spends a few thousand dollars in a single transaction, what does it mean?).

Beyond building a single model or forecast, it is useful in these exploratory efforts to assess the robustness of the model or forecast—robustness relative to changes in the values of the input data (record field values), robustness relative to missing or deleted records, and so on. These kinds of robustness analyses involve transforming the original data (using structuring and cleaning type transformations) and then passing the modified data through the modeling and forecasting engine.

Building an Optimized Dataset

Similar to designing and building refined datasets, designing and building an optimized dataset involves the breadth of transformation types—structuring, enriching, and cleaning—and the breadth of profiling types—assessing both individual values and sets of values. The primary difference is in the balance of transformation to profiling. Whereas building refined datasets requires a fairly even mix of transformation and profiling, building optimized datasets requires significantly more profiling. It isn’t enough to ensure that the transformation logic is correct for the specific dataset you just created. The transformation scripts that generate optimized datasets need to be automatically applied to regularly updating input data. Hence, you must use profiling to forecast the robustness of the transformation scripts to future variants of the data as well as to track the correctness of the optimized data each time the scripts are applied. Many of these profiling tasks will involve checking the distributions of values for various subsets of records—assessing both the range of values in the subsets as well as the shape of the distribution of values.
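One simple sketch of this kind of automated, set-based profiling compares each scheduled run against a saved baseline of per-subset statistics; the file names, the per-region medians, and the 20 percent drift threshold are all assumptions for illustration:

    import pandas as pd

    optimized = pd.read_csv("optimized_sales.csv")   # output of the latest scheduled run
    baseline = pd.read_csv("baseline_profile.csv")   # per-region stats from a trusted run
                                                     # (columns: region, median_amount)

    # Profile the new output: per-region medians of the sales amount.
    current = optimized.groupby("region")["amount"].median().rename("current_median")

    checks = baseline.set_index("region").join(current)

    # Flag subsets whose distribution has shifted more than 20% from the baseline.
    drift = (checks["current_median"] - checks["median_amount"]).abs() / checks["median_amount"]
    print(checks[drift > 0.20])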

Regular Reporting and Building Data-Driven Products and Services

Analogous to the similarities and differences between building refined versus optimized data, regular reporting (relative to ad hoc reporting) and data-driven products and services (relative to exploratory modeling and forecasting) require similar transformations but more profiling. The driver for the additional profiling is the requirement that the same transformation scripts work efficiently and robustly across evolving input data.

Figure 3-2 summarizes this section, highlighting the general amounts of different types of transformation and profiling across the actions in our workflow framework. In the figure, the bolded fonts represent the most frequently used transformation and profiling tasks for each action.

In the next few chapters, we discuss how you can apply the specific types of transformation and profiling actions to a sample data project.

Figure 3-2. The relative amounts of each type of transformation and profiling across the actions in our workflow framework