Chapter 5. Transformation: Structuring

Overview of Structuring

You might remember our discussion of structure as a metadata element from Chapter 2. Structuring as a transformation action involves changing your dataset’s structure or granularity. In other words, structuring consists of any actions that change the form or schema of your data.

At a high level, there are two sets of structuring actions that you might need to apply to your datasets. The first group of structuring transformations involves manipulating individual records and fields. We call this intrarecord structuring. Intrarecord structuring transformations roughly fall into three buckets:

  • Reordering record fields (moving columns)

  • Creating new record fields through extracting values

  • Combining multiple record fields into a single record field

The second group of structuring transformations involves operating on multiple records and fields at once. We call this interrecord structuring. These types of transformations fit roughly into two types:

  • Filtering datasets by removing sets of records

  • Shifting the granularity of the dataset and the fields associated with records through aggregations and pivots

We will discuss each set of structuring transformations in turn so you can understand when you might want to apply these operations to your datasets.

Intrarecord Structuring: Extracting Values

Extraction involves creating a new record field from an existing record field. Frequently, this involves identifying a substring in an existing column and placing that substring into a new column.

Positional Extraction

The simplest form of substring extraction works by specifying the starting position and ending position that correspond to the substring that you want to extract from a set of record fields. This is called positional extraction.

When you’re working with data, extracting substrings based on a consistent position is common when dealing with date-time fields or fixed-width fields. Both of these field types have known elements located at specific positions, so there is generally little to no inconsistency in the structure of the field.

Note

To follow along yourself, download the “Contributions by Individuals” file from election year 2015-2016. You can find the file at http://www.fec.gov/finance/disclosure/ftpdet.shtml#a2015_2016.

You also might need to reference the data dictionary for this file to understand the permissible values in each field. You can find the data dictionary at http://www.fec.gov/finance/disclosure/metadata/DataDictionaryContributionsbyIndividuals.shtml.

Let’s look at an example of a field for which positional extraction might be a valuable structuring technique. In the Individual Contributions dataset, column 14 contains the transaction date for each campaign contribution. In this example, we want to extract the day of the month for each individual contribution into a new column. This will allow us to create a field that we can use to determine if individual campaign contributions were more frequent in certain times of the month.

The following table shows four example record fields from column 14:

column14
03102015
03302015
03302015
03022015

If you look at the source data, you can see that the discrete elements of the field always align: the month occurs first, the day of the month follows after exactly two characters, and the year follows after exactly four characters. Each individual record field is homogeneously structured. This means that positional extraction will allow us to easily identify the block of text that represents the day of the month of the contribution.

To identify the starting and ending positions of the substring that represent the day of the month, we can map each individual character in the source record field to a position. You can see how this will work here:

Character   0   3   1   0   2   0   1   5
Position    1   2   3   4   5   6   7   8

By counting the individual characters, we can see that the day of the month starts at position 3 and ends at position 4.

In a sentence, we could describe our desired transformation by saying, “From source column 14, extract the characters located from position 3 to position 4.” If we were to perform this transformation on the dataset, we would produce the following output:

column14    Day of Month
03102015    10
03302015    30
03302015    30
03022015    02
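If you are scripting this transformation yourself rather than using a wrangling tool, the following is a minimal sketch in Python with pandas. The sample values come from the table above; the output column name day_of_month is our own choice.

import pandas as pd

# Sample of column 14 from the Individual Contributions dataset.
df = pd.DataFrame({"column14": ["03102015", "03302015", "03302015", "03022015"]})

# Positions 3-4 (zero-based slice 2:4) hold the day of the month.
df["day_of_month"] = df["column14"].str.slice(2, 4)
print(df)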

A more complex version of positional substring extraction can pluck a substring from a record field when the starting and ending positions of the substring differ from record to record. Address fields are an excellent example of record fields for which complex positional extraction can be utilized effectively.

To perform this type of positional extraction, you will want to use functions that can search for a particular sequence of characters within the original record field. These functions return either the start position of the searched-for sequence or the length of the sequence. You can then pass the values returned by these functions through to one of the basic positional extraction functions. This will produce a complex nested function.

Looking again at the Individual Contributions dataset, you can see that column 8 contains the name of the person or organization who made each campaign contribution. For records that represent individual people, commas separate the person’s first name and last name. A sample of data from column 8 is shown here:

column8
ARNOLD, ROBERT
BICKLE, DON
ROSSMAN, RICHARD
LLEWELLYN, CHARLES

You can see that the last name element has an inconsistent length in each record. In the first record, the last name is 6 characters long (ARNOLD), whereas in the third record, the last name is 7 characters long (ROSSMAN). Simple positional extraction would not work in this case because the ending positions of the last name differ from record to record. However, because the first name and last names in each record are all separated by a common delimiter—the comma—we can use complex positional extraction functions to identify the position of the comma and then extract the appropriate substring.
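A rough sketch of this nested approach in Python with pandas follows; the last_name and first_name column names are illustrative, and the logic assumes every record contains exactly one comma followed by a space, as in the sample above.

import pandas as pd

# Sample of column 8 from the Individual Contributions dataset.
df = pd.DataFrame({"column8": ["ARNOLD, ROBERT", "BICKLE, DON",
                               "ROSSMAN, RICHARD", "LLEWELLYN, CHARLES"]})

# Search for the comma in each value, then extract everything before it.
df["last_name"] = df["column8"].apply(lambda name: name[:name.find(",")])

# The first name starts two characters after the comma (comma plus a space).
df["first_name"] = df["column8"].apply(lambda name: name[name.find(",") + 2:])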

Pattern Extraction

Pattern-based extraction is another common method that you can use to extract substrings into a new column. This method uses rules to describe the sequence of characters that you want to extract. To explain what we mean, let’s look at another sample of data from the Individual Contributions dataset. According to the FEC’s data dictionary, column 20 contains free text that describes each contribution. A sample of data from this column is below:

column20
P/R DEDUCTION ($296.67 MONTHLY)
P/R DEDUCTION ($326.67 MONTHLY)
* EARMARKED CONTRIBUTION: SEE BELOW
P/R DEDUCTION ($1000.00 MONTHLY)

In this case, we want to extract the monthly contribution amount into a new column. It’s fairly easy to describe the desired transformation in a sentence: “From column 20, extract the first sequence of digits, followed by a period, followed by another sequence of digits.” In this sentence, the pattern that defines the monthly contribution amount reads, “first sequence of digits, followed by a period, followed by another sequence of digits.” You can often use regular expressions to represent patterns like this in code. Regular expressions are also supported by most data wrangling software products.

As you can see, pattern-based extraction can be a useful method to identify substrings that conform to the same generic pattern but are not identical.
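As a sketch, the sentence above translates directly into a regular expression in Python with pandas; the monthly_amount column name is our own, and records without a match are simply left empty.

import pandas as pd

# Sample of column 20 from the Individual Contributions dataset.
df = pd.DataFrame({"column20": [
    "P/R DEDUCTION ($296.67 MONTHLY)",
    "P/R DEDUCTION ($326.67 MONTHLY)",
    "* EARMARKED CONTRIBUTION: SEE BELOW",
    "P/R DEDUCTION ($1000.00 MONTHLY)",
]})

# A sequence of digits, followed by a period, followed by another sequence of digits.
df["monthly_amount"] = df["column20"].str.extract(r"(\d+\.\d+)", expand=False)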

Complex Structure Extraction

Sometimes, when you are wrangling data, you might need to extract elements from within complex hierarchical structures. We commonly see this type of complex structure extraction required when wrangling JSON files or other semistructured formats. (You can refer to our discussion of structure in Chapter 2 for a refresher on the differences between structured and semistructured data.) JSON-formatted data often originates from automated systems; if you are working with machine-generated data, it’s likely that your data contains JSON structures.

Users who are wrangling JSON data generally encounter two types of complex structures: maps and arrays. These structures are common in semistructured data because they allow datasets to include a variable number of records and fields. They are described here:

JSON array

A JSON array represents an ordered sequence of values. JSON arrays are enclosed in square brackets, and elements are separated by commas; string elements are enclosed in double quotes.

Example array: ["Sally","Bob","Alon","Georgia"]

JSON map

A JSON map contains a set of key-value pairs. In each key-value pair, the key represents the name of a property and the value represents the value that describes that property. JSON maps are enclosed in curly brackets, and key-value pairs are separated by commas. Keys are enclosed in double quotes, as are string values.

Example map: {"product":"Trifacta Wrangler","price":"free","category":"wrangling tool"}

In a given dataset, an array in one record might be a different length from an array in another record. You might see this in a dataset containing customer orders, where each record represents a unique customer’s shopping cart. In the first record an array of orders might include two elements, whereas in the next record, an array of orders might include three elements.

Similarly, maps also support variability across records. Looking at the shopping cart example, each cart might contain a variety of possible properties—say, “gift_wrapped”, “shipping_address”, “billing_address”, “billing_name”, and “shipping_name”. Ideally, every record will contain all of the possible properties. However, it’s more likely that some shopping carts only contain a subset of possible properties. Representing the properties and their associated values in a JSON map allows us to avoid creating a very sparsely populated table.

Of course, JSON format, although ideal for storing data efficiently, is often not structured ideally for use in analytics tools. These tools commonly expect tabular data as input. Consequently, when working with a JSON array or map, you might need to pluck a single element into a new column, or fold the multiple elements contained in an array down into multiple records. This will allow you to convert JSON-formatted data into the rectangular structure needed for downstream analytics.
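A minimal sketch of both operations in Python is shown below. The shopping-cart record and its field names (cart_id, gift_wrapped, items) are hypothetical and exist only to illustrate plucking a map value into a column and folding an array into multiple records.

import json
import pandas as pd

# A hypothetical shopping-cart record containing a map with an embedded array.
record = '{"cart_id": "A17", "gift_wrapped": "true", "items": ["book", "pen", "lamp"]}'
parsed = json.loads(record)

# Pluck individual map values into the columns of a one-row table.
cart = pd.DataFrame([{"cart_id": parsed["cart_id"],
                      "gift_wrapped": parsed.get("gift_wrapped")}])

# Fold the array down into one record per element.
items = pd.DataFrame({"cart_id": parsed["cart_id"], "item": parsed["items"]})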

Intrarecord Structuring: Combining Multiple Record Fields

Combining multiple fields is essentially the reverse of extraction. When you are wrangling data, you might need to create a single field that merges values from multiple related fields.

As an example, let’s return to the Individual Contributions dataset. This dataset contains two related columns: column 9 (city) and column 10 (state).

You can see that the city column adds additional detail to the state column. We want to combine the data from these two columns into a single column, separating the city and state with a comma. Combining the data from these two fields can be useful if your downstream analysis needs to treat this data as a single record field.

Our desired output will look like the following column:

City State
MCPHERSON, KS
FREDERICK, MD
BROKEN BOW, NE
HAWTHORNE, CA
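In a scripting environment, this combination is a simple string concatenation. Here is a sketch in Python with pandas, using the sample city and state values from columns 9 and 10; the city_state column name is our own.

import pandas as pd

# Sample of columns 9 (city) and 10 (state) from the Individual Contributions dataset.
df = pd.DataFrame({"column9": ["MCPHERSON", "FREDERICK", "BROKEN BOW", "HAWTHORNE"],
                   "column10": ["KS", "MD", "NE", "CA"]})

# Merge the two fields, separating city and state with a comma and a space.
df["city_state"] = df["column9"] + ", " + df["column10"]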

Interrecord Structuring: Filtering Records and Fields

Filtering involves removing records or fields from a dataset. Although filtering is often utilized in cleaning transformations designed to address dataset quality (which we discuss further in Chapter 7), you also can use it to alter the granularity of a dataset by changing the types of records and fields represented in a dataset.

For example, the Individual Contributions dataset contains a column that represents the type of entity that made each donation. Based on the FEC data dictionary, this field contains seven distinct values: CAN, CCM, COM, IND, ORG, PAC, and PTY. Based on this column, we could say that the granularity of the dataset is fairly coarse. After all, records can belong to any one of seven distinct groups.

Let’s assume that we are interested in analyzing only campaign contributions that originated from individuals (represented in the entity column by “IND”). We will need to filter our dataset so that it includes only records that contain the value “IND” in column 7. Performing this operation will produce a dataset with a finer granularity because each record will now belong to a single category of values from the entity type column. This type of filtering is called record-based filtering.

Another type of filtering that is commonly used as a structuring operation is field-based filtering. This type of filtering affects the number of fields, or columns, in your dataset; for example, you might remove columns that are not relevant to your downstream analysis.
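The following sketch shows both kinds of filtering in Python with pandas, assuming the contributions data has already been loaded into a DataFrame whose columns are named column7 (entity type) and column15 (amount); the sample values are illustrative.

import pandas as pd

# Illustrative sample with the entity type (column 7) and amount (column 15).
df = pd.DataFrame({"column7": ["IND", "PAC", "IND", "ORG"],
                   "column15": [750, 45000, 1000, 500]})

# Record-based filtering: keep only contributions made by individuals.
individuals = df[df["column7"] == "IND"]

# Field-based filtering: drop columns that are not needed downstream.
amounts_only = individuals.drop(columns=["column7"])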

Interrecord Structuring: Aggregations and Pivots

Aggregations and pivots are structuring operations that enable a shift in the granularity of a dataset. For example, you might start with a dataset of sales transactions and want total sales amounts by week or by store or by region—a fairly straightforward aggregation involving the summation of record fields. A more complex pivot might involve extracting the items purchased out of the transaction records and building a dataset in which each record corresponds to an item.

For example, consider a dataset composed of individual sales transactions, where each transaction record contains a field listing the products that were purchased. You can pivot this dataset such that each product becomes a record with fields describing the product and an aggregated count field indicating the number of transactions involving this product. Alternatively, you could pivot the same dataset to count the number of transactions per product where the product was purchased alone, with one additional product, with two additional products, and so on.

To organize our discussion in this section, we can sort aggregations and pivots into three progressively more complex groups. These groups are characterized by how the records and record fields of the input dataset (prior to applying the transformation) relate to those of the output dataset.

Simple Aggregations

In the first group, simple aggregations, each input record maps to one and only one output record, whereas each output record combines one or more input records. For simple aggregations, the output record fields are simple aggregations (sum, mean, min, list concatenation, etc.) of the input record fields.

We can perform a basic aggregation on the Individual Contributions dataset. Perhaps we want to manipulate the granularity of the dataset so that each row summarizes the campaign contributions made to each campaign committee. We are interested in creating three new columns:

  • One column that contains the average contribution made to each campaign committee
  • One column that contains the sum of contributions made to each campaign committee
  • One column that counts the number of contributions made to each campaign committee

In this example, we will be performing this basic aggregation on the following limited sample of data from the Individual Contributions dataset:

Column 1    Column 15
C00004606   750
C00004606   1000
C00452383   225
C00452383   50

Remember, based on the FEC’s data dictionary, column 1 contains the campaign committee and column 15 contains each contribution amount. After aggregating the values in column 15 by campaign committee, we end up with the following output:

Column 1    Sum of Column 15    Mean of Column 15    Count of Column 15
C00004606   1750                875                  2
C00452383   275                 137.50               2
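One way to express this aggregation in Python with pandas is sketched below, using the four sample records from above; the sum, mean, and count column names are produced by the aggregation functions themselves.

import pandas as pd

# Sample of columns 1 (committee) and 15 (amount) from the Individual Contributions dataset.
df = pd.DataFrame({"column1": ["C00004606", "C00004606", "C00452383", "C00452383"],
                   "column15": [750, 1000, 225, 50]})

# Aggregate contribution amounts per campaign committee.
summary = df.groupby("column1")["column15"].agg(["sum", "mean", "count"]).reset_index()
print(summary)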

Column-to-Row Pivots

In the second group, column-to-row pivots, each input record maps to multiple output records, and each output record maps to one and only one input record. Input record field values become the defining characteristics of the output records. In other words, the output records contain a subset of the input record fields.

This type of column-to-row pivot is commonly referred to as “unpivoting” or “denormalizing” data. It is particularly useful when your source data contains multiple columns that represent the same type of data. For example, you may have a transactions file that contains the total sales numbers per region, per year. The data could be formatted as shown in the following table:

Region 2015 2016
East 2300 2453
West 9866 8822
Midwest 2541 2575

Note that in this example, the sales figures for each year are contained in a different column. We want to restructure this dataset so that a single row contains the sales for a single unique combination of region and year. The result of this column-to-row pivot will look like the following table:

Region Year Sales
East 2015 2300
East 2016 2453
West 2015 9866
West 2016 8822
Midwest 2015 2541
Midwest 2016 2575

Generally, column-to-row pivots are used when you want to create a dataset that allows you to more easily summarize values. Compared to the original sales dataset in our example, the resulting dataset is structured to facilitate calculations across years and regions.
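In Python with pandas, this kind of unpivot is a single call to melt; the sketch below reproduces the regional sales example, with the Year and Sales column names chosen to match the output table above.

import pandas as pd

# Regional sales with one column per year.
sales = pd.DataFrame({"Region": ["East", "West", "Midwest"],
                      "2015": [2300, 9866, 2541],
                      "2016": [2453, 8822, 2575]})

# Unpivot the per-year columns into a single Year/Sales pair of columns.
long_form = sales.melt(id_vars="Region", var_name="Year", value_name="Sales")
print(long_form)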

Row-to-Column Pivots

In the final group, output records are sourced from multiple input records, and individual input records might support multiple output records. Output record fields might involve simple aggregations (e.g., sum or max) or more complex expansions based on the field values. This type of pivot is called a row-to-column pivot.

As an example, let’s return to the Individual Contributions dataset. We want to create a refined dataset that shows the sum of contributions made to each campaign committee, broken out by contribution type. In this case, we want to create one new column for each contribution type.

A subset of the Individual Contributions dataset contains the following data:

Column 1    Column 7    Column 15
C00004606   IND         750
C00004606   IND         1000
C00492116   PAC         45000
C00492116   PAC         15000
C00492116   IND         250
C00452383   IND         50

Based on the FEC’s data dictionary, column 1 represents the campaign committee, column 7 represents the contribution type (individual, political action committee, corporate, and so on), and column 15 represents the contribution amount. After performing a row-to-column pivot, our subset of the Individual Contributions dataset will look like the following table:

Column 1    Sum of IND Contributions    Sum of PAC Contributions
C00004606   1750                        0
C00492116   250                         60000
C00452383   50                          0

As a result of the row-to-column pivot, we have created a new column for each of the unique values in source column 7. Each row in this new dataset summarizes contributions made to a single campaign committee.
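A sketch of this row-to-column pivot in Python with pandas follows, using the subset shown above; the column1, column7, and column15 names mirror the dataset’s positional columns, and combinations that never occur are filled with 0.

import pandas as pd

# Subset of the Individual Contributions dataset: committee, entity type, amount.
df = pd.DataFrame({"column1": ["C00004606", "C00004606", "C00492116",
                               "C00492116", "C00492116", "C00452383"],
                   "column7": ["IND", "IND", "PAC", "PAC", "IND", "IND"],
                   "column15": [750, 1000, 45000, 15000, 250, 50]})

# Pivot entity types into columns, summing the contribution amounts.
wide = df.pivot_table(index="column1", columns="column7",
                      values="column15", aggfunc="sum", fill_value=0)
print(wide)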
