Chapter 6. Transformation: Enriching

Enriching transformation actions result in the net addition of information to your dataset. When enriching your dataset, you insert additional records or fields from other related datasets, or you use formulas to calculate new fields.

You might wonder how enriching transformations differ from structuring transformations (discussed in Chapter 5). Although both types of transformations can involve creating new fields or records, structuring transformations create new fields or records based on data already present in the dataset. Enriching transformations, in contrast, create new fields or records using new data—information that was not previously present in the dataset in any form.

There are three primary types of enriching transformations:

  • Unions

  • Joins

  • Deriving new fields

We discuss each type of enriching transformation in this chapter.

Unions

Unions append additional records to an existing dataset. In other words, when you perform a union, you are taking two related datasets and stacking them vertically to create a single dataset.

Why might you want to perform a union? Let’s imagine that you work for an organization that receives monthly orders from your clients. At the end of each quarter, you need to produce a summary analysis that records the total number of orders placed by each client over the previous three months. Because each month’s orders are contained in a separate dataset, you need to combine them into a single dataset so that you can perform your analysis.

The simple case is when records from both the old and new datasets have matching structures (i.e., matching layout of fields for each record). In this case, a union amounts to little more than a concatenation of datasets.

The more complex—and common—case is when the records from the current dataset have a different structure from the records in the new dataset. Generally, the difference is minor: most of the fields are in the same layout and there are only a few fields that differ between the old and new data records. In this case, some simple logic can dictate whether the unmatched fields should be coerced into the same field or kept in separate fields with null or empty values inserted into the records from the other dataset.
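
For example, here is a minimal sketch of a union in pandas (the tool choice and field names are illustrative), covering this mismatched-structure case:

    import pandas as pd

    # January orders: client and order amount
    jan = pd.DataFrame({"client_id": [1, 2], "amount": [100.0, 250.0]})

    # February orders include a promo_code field that January lacks
    feb = pd.DataFrame({"client_id": [1, 3], "amount": [75.0, 300.0],
                        "promo_code": ["SPRING", None]})

    # Union (vertical stack): unmatched fields are kept, with nulls
    # inserted for records from the dataset that lacked them
    quarter = pd.concat([jan, feb], ignore_index=True)

The combined dataset contains all four records; promo_code is null for the January records, which did not carry that field.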

Joins

Joins are the most common enrichment action. The most common form of a join involves linking records from one dataset to records from the other dataset via exact matches of a single field in each of the dataset records. The field used in the match is often referred to as the key field in both datasets. For example, consider a dataset containing customer profile information with a key field corresponding to a customer_id. In a second dataset, suppose that we have customer transaction information along with a customer_id key field. By joining records on customer_id matches, we can now analyze how transaction activity relates to profile information.

In most tools, the link between two datasets is based on exactly matching one or more fields between the records. Some tools additionally support fuzzy matching of fields. This allows records with misspellings or other minor differences to be linked.

There are four types of joining/blending logic, differing in how they handle the situations that arise when matching records between the two datasets. When records from the joining datasets match, we simply append the record fields (or a subset of the record fields). It is tempting to consider only the case in which every record from each dataset has exactly one matching record in the other dataset. But what happens when a record from one dataset matches multiple records from the other? Or when a record from one dataset has no matching record at all?

Let’s take a closer look at the four types of joins; a short code sketch follows the definitions:

Inner
Inner joins produce a record only when there are matching records from each dataset being blended. Note that if there are duplicate keys, the output records are similarly duplicated.
Left outer
This type of join retains all records from the left (or initial) dataset, even if there is no matching record in the right (or incoming) dataset.
Right outer
These joins retain all records from the right (or incoming) dataset, even if there is no matching record in the left (or initial) dataset.
Full outer
Full outer joins retain all records from both datasets, even if they have no corresponding match in the other dataset.
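
Here is a minimal sketch, again in pandas with hypothetical customer data, showing how the four join types behave on the same pair of datasets, including the duplicate-key and no-match cases:

    import pandas as pd

    profiles = pd.DataFrame({"customer_id": [1, 2, 3],
                             "name": ["Ana", "Bo", "Cy"]})
    # customer_id 2 purchased twice; customer_id 4 has no profile
    transactions = pd.DataFrame({"customer_id": [2, 2, 4],
                                 "total": [20.0, 35.0, 50.0]})

    inner = profiles.merge(transactions, on="customer_id", how="inner")
    left = profiles.merge(transactions, on="customer_id", how="left")
    right = profiles.merge(transactions, on="customer_id", how="right")
    full = profiles.merge(transactions, on="customer_id", how="outer")

    # inner: 2 records (the duplicate key duplicates the profile fields)
    # left:  4 records (customers 1 and 3 kept, transaction fields null)
    # right: 3 records (customer 4's transaction kept, profile fields null)
    # full:  5 records (all records from both datasets retained)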

Inserting Metadata

An important form of enrichment involves adding metadata into the dataset. Common metadata to add include the filenames of the source data, byte offsets and/or record numbers, current date and/or time, creation/update/access timestamps, and record and/or record field lineage.
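
A minimal sketch of metadata enrichment, assuming CSV source files and pandas (the metadata field names are illustrative):

    import pandas as pd
    from datetime import datetime, timezone
    from pathlib import Path

    def load_with_metadata(path):
        """Load a CSV file and enrich each record with lineage metadata."""
        df = pd.read_csv(path)
        df["source_file"] = Path(path).name           # filename lineage
        df["record_number"] = range(len(df))          # position in the source
        df["loaded_at"] = datetime.now(timezone.utc)  # load timestamp
        return df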

Derivation of Values

A final kind of enrichment involves the derivation of new values. There are two basic kinds of derivation: generic and proprietary.

Generic

Generic derivations apply to many datasets. For example, most datasets need to explicitly address time—either by encoding temporal information within records or across records as metadata. A common data wrangling action involves deriving additional date-time information; for example, day of the week or season. Or, when data is collected across multiple time zones, it can be useful to derive both a local and global (e.g., UTC) timestamp for each event. These derived values can be crucial to an analysis that deals with weekly or yearly cycles, experiential time, and absolute sequencing of events (e.g., for anomaly detection).
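
As a sketch, the following pandas code derives a global UTC timestamp, the day of the week, and a crude northern-hemisphere season (an assumption; real season logic depends on your domain) from hypothetical local timestamps:

    import pandas as pd

    # Hypothetical event timestamps recorded in different time zones
    events = pd.DataFrame({"ts": ["2023-06-03T14:00:00-07:00",
                                  "2023-12-10T09:30:00+01:00"]})

    # Derive a global (UTC) timestamp alongside the local encoding
    events["ts_utc"] = pd.to_datetime(events["ts"], utc=True)

    # Derive day of week and a crude northern-hemisphere season
    events["day_of_week"] = events["ts_utc"].dt.day_name()
    events["season"] = events["ts_utc"].dt.month.map(
        lambda m: {12: "winter", 1: "winter", 2: "winter",
                   3: "spring", 4: "spring", 5: "spring",
                   6: "summer", 7: "summer", 8: "summer"}.get(m, "autumn"))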

Another common domain for many analyses involves geography or spatial encodings. Sometimes, it is as simple as abstracting an address into a ZIP code or city. Or, utilizing many available services, you could also convert an address to latitude and longitude coordinates. Slightly more involved derivations might convert an address to a more customized region relevant for marketing or sales activity. These regions might be drawn, for example, by calculating the shortest driving distance between a customer and all available nearby stores. None of the three data wrangling tools support geographic and spatial encodings; you would need to create custom code to produce these features.
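
For the simplest case, abstracting an address to a ZIP code, a small custom helper can suffice; this hypothetical sketch handles only well-formed US addresses, and anything beyond it (geocoding, custom regions) requires an external service or more substantial code:

    import re

    def extract_zip(address):
        """Abstract a US street address to its five-digit ZIP code."""
        # Take the last ZIP-shaped token, since house numbers come first
        matches = re.findall(r"\b(\d{5})(?:-\d{4})?\b", address)
        return matches[-1] if matches else None

    extract_zip("1005 Gravenstein Hwy N, Sebastopol, CA 95472")  # "95472"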

A third generic domain for derivation functionality concerns text. Generally referred to as Natural Language Processing, or NLP, functionality in this domain can take raw text and derive an overall sentiment score; extract references to people, events, or things; or identify topics that capture the text’s semantics. You might also want to translate text from one language to another. When it comes to NLP, a key concern is what defines your baseline, or reference, language; for example, are you working with only English text, or are you working with terminology specific to your industry? Choosing an appropriate reference language will determine how you treat jargon or slang, for example.
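
As an illustration, the following sketch scores sentiment with NLTK’s VADER analyzer, which assumes general English as the reference language; industry jargon would need a customized lexicon:

    # Assumes nltk is installed and the lexicon fetched via:
    #   nltk.download("vader_lexicon")
    from nltk.sentiment.vader import SentimentIntensityAnalyzer

    sia = SentimentIntensityAnalyzer()
    scores = sia.polarity_scores("The new release is fast and reliable.")
    # scores holds 'neg', 'neu', 'pos', and a 'compound' summary in [-1, 1]
    print(scores["compound"])  # positive values indicate positive sentiment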

Finally, a fourth generic domain for derivation functionality concerns basic numeric calculations. Some of these calculations are simple; for example, summing the list price and taxes of a transaction to produce a final price. Others are slightly more sophisticated, utilizing aggregation or sequential processing mechanisms. An example would be z-normalizing a numeric field by subtracting the mean value of the field and then dividing by the standard deviation of the field.
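
Both kinds of calculation are easy to express in pandas; this sketch (with illustrative field names) sums list price and taxes, then z-normalizes the result:

    import pandas as pd

    orders = pd.DataFrame({"list_price": [10.0, 25.0, 40.0],
                           "tax": [0.80, 2.00, 3.20]})

    # Simple derivation: sum two fields into a final price
    orders["final_price"] = orders["list_price"] + orders["tax"]

    # Aggregating derivation: z-normalize by subtracting the field's
    # mean and dividing by its standard deviation
    mean, std = orders["final_price"].mean(), orders["final_price"].std()
    orders["price_z"] = (orders["final_price"] - mean) / std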

Some of the generic derivation functionality is domain specific. For example, you might have different region definitions if you are a mining company versus a consumer packaged goods retailer. Similarly, if you are performing NLP derivations, you can target jargon or terminology specific to your industry (like silicon manufacturing, pharmaceuticals, fashion, etc.).

Legal requirements around reporting compliance and privacy drive additional domain-specific derivations. For example, in the healthcare domain, laws pertaining to patient privacy and confidentiality require certain record fields to be removed and replaced with derived ones. Also, given a history of treatments, one might derive record fields indicating whether certain future treatments are viable. Similarly, in finance, there are domain-specific derivations related to the time series of transaction data that drives that industry. Specific models, such as Black-Scholes, can be applied to historical time series of trading data to predict future value. These predictions can be analyzed alongside other predictions to identify the optimal model to use or to build a composite prediction.
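
As a sketch of the kind of model involved, here is the standard Black-Scholes formula for pricing a European call option, using only the Python standard library (the parameter values are illustrative; a real application would estimate them from historical trading data):

    from math import exp, log, sqrt
    from statistics import NormalDist

    def black_scholes_call(S, K, T, r, sigma):
        """Black-Scholes price of a European call option.

        S: spot price, K: strike price, T: years to expiry,
        r: risk-free rate, sigma: annualized volatility.
        """
        N = NormalDist().cdf  # standard normal CDF
        d1 = (log(S / K) + (r + sigma ** 2 / 2) * T) / (sigma * sqrt(T))
        d2 = d1 - sigma * sqrt(T)
        return S * N(d1) - K * exp(-r * T) * N(d2)

    black_scholes_call(S=100, K=105, T=0.5, r=0.02, sigma=0.25)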

Proprietary

Further along the generic-to-specific spectrum are proprietary derivations. In this case, individual organizations might have custom models they use to make predictions, for example. Or, they might have highly customized derivation calculations that best capture, for example, the health of their customer base or the likelihood of a customer leaving or upgrading.

In many big-data data wrangling systems, proprietary functionality is encoded as User-Defined Functions (UDFs). These UDFs are often supported in a variety of computational languages: R, Python, Visual Basic, Java, and so on. While many UDFs contain nontrivial calculations, others simply make calls to external services (like Google’s geocoding API, which converts addresses to latitude and longitude coordinates).
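
As a sketch, here is how a proprietary calculation might be registered as a UDF in PySpark (the churn_score model is a made-up placeholder, not a real scoring method):

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import udf
    from pyspark.sql.types import DoubleType

    spark = SparkSession.builder.appName("udf-sketch").getOrCreate()
    customers = spark.createDataFrame([(12, 3), (48, 0)],
                                      ["tenure_months", "support_tickets"])

    def churn_score(tenure_months, support_tickets):
        # Placeholder for a proprietary customer-health calculation
        return max(0.0, 1.0 - tenure_months / 60.0) + 0.05 * support_tickets

    churn_udf = udf(churn_score, DoubleType())
    customers.withColumn("churn_score",
                         churn_udf("tenure_months", "support_tickets")).show()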
