Chapter 8. Roles and Responsibilities

There are a number of common job titles and roles for people who work with data. In our experience, the most common are data engineer, data architect, data scientist, and analyst. In this chapter, we provide a general overview of each of these roles, which actions in the workflow they tend to be responsible for, and some best practices for managing the hand-off between these roles and for driving the long-term success of your data practices.

Skills and Responsibilities

We’ll provide a basic overview of the four common job roles that we encounter when working with data. Of course, in smaller organizations or in personal projects, a single person can end up developing and applying all of the skills and responsibilities that we’ll discuss. However, it’s more common to split them into separate job roles.

Our discussion is organized along two axes (see Figure 8-1). The first axis captures the primary kind of output produced by someone in the role; the second captures the skills and methods used to produce that output. We’ll discuss each role in turn.

Figure 8-1. Relative positions of the four key data wrangling user profiles based on output (internal or external) and skills (technically focused or business focused)

Data Engineer

Data engineers are responsible for the creation and upkeep of the systems that store, process, and move data, as shown in Figure 8-2. In addition to instantiating and maintaining these systems, many data engineers focus on their efficiency and extensibility, ensuring that there is sufficient capacity for existing workloads as well as exploratory or future ones.

Figure 8-2. Blue highlights identify the primary actions for a data engineer in our data wrangling workflow framework

The ability to create and maintain these systems requires some background in system administration. More importantly, because these systems are primarily designed to work with data, data engineers need a fairly deep background in common data processing algorithms and in how those algorithms are implemented across various systems and tools.

Data Architect

As Figure 8-3 illustrates, data architects are responsible for the data in the “refined” and “production” stage locations (occasionally they are responsible for the “raw” data, as well). Their objective is to make this data accessible and broadly usable for a wide range of analyses. In addition to staging the data itself, data architects often create catalogs for this data to improve its discoverability and usability. Further optimizations to the data and the catalog involve the creation of naming conventions and standard documentation practices, and then applying and enforcing these practices.

Figure 8-3. Red highlights identify the primary actions for a data architect in our data wrangling workflow framework

In terms of skills and methods, data architects often work through a user-requirements-gathering process that moves from documented data needs and requests, to an abstract model of where to source the data and how to organize it for broad usability, to the concrete staging of the data through the design of data schemas and alignment conventions between them. Designing the structure of these datasets, ensuring their quality, and cataloging the relationships between them builds on fluency in the data access and manipulation languages of a broad set of data tools and warehouses (e.g., variants of SQL, modern tools like Sqoop and Kafka, and analytics systems like those built by SAP and IBM). Additionally, data architects often employ standard database conventions like First, Second, and Third Normal Forms.
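
As a concrete illustration, here is a minimal SQL sketch of the kind of refined-stage schema a data architect might design, normalized along the lines of Third Normal Form. The table and column names (customers, products, orders) are hypothetical, and exact DDL syntax varies across SQL dialects.

    -- Hypothetical refined-stage schema in Third Normal Form: customer
    -- attributes live only in customers, product attributes only in
    -- products, and orders references both through foreign keys.
    CREATE TABLE customers (
      customer_id   INTEGER PRIMARY KEY,
      customer_name VARCHAR(100) NOT NULL,
      region        VARCHAR(50)
    );

    CREATE TABLE products (
      product_id   INTEGER PRIMARY KEY,
      product_name VARCHAR(100) NOT NULL,
      unit_price   DECIMAL(10,2) NOT NULL
    );

    CREATE TABLE orders (
      order_id    INTEGER PRIMARY KEY,
      customer_id INTEGER NOT NULL REFERENCES customers (customer_id),
      product_id  INTEGER NOT NULL REFERENCES products (product_id),
      quantity    INTEGER NOT NULL,
      order_date  DATE NOT NULL
    );

The foreign keys serve as the alignment conventions between schemas described above: downstream analyses can join orders to customers and products without guessing at matching columns.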

Data Scientist

Data scientists are responsible for finding and verifying deep or complex sets of insights (Figure 8-4). These insights might derive from existing data using advanced statistical analyses or from the application of machine learning algorithms. In other cases, data scientists are responsible for conducting experiments, like modern A/B tests. In some organizations, data scientists are also responsible for “productionalizing” these insights.

Figure 8-4. Orange highlights identify the primary actions for a data scientist in our data wrangling workflow framework

There are two primary types of data scientists: those who are more statistics focused and those who are more engineering focused. Both are generally tasked with finding deep and complex insights. Where they differ is in the latter responsibilities listed above: statistics-focused data scientists often concentrate on designing and analyzing experiments such as A/B tests, whereas engineering-focused data scientists often concentrate on prototyping and building data-driven services and products.

The ability to identify and validate deep or complex insights requires familiarity with the mathematical and statistical methods that can reveal and test these insights. Additionally, data scientists need the skills to operate the tools that apply these methods, such as R, SAS, Python, SPSS, and so on. For statistics-focused data scientists, a background in the theory and practice of designing and analyzing experiments is a common skill. For engineering-focused data scientists, software engineering skills are required: not just familiarity with a variety of programming languages, but also with best practices for building complex applications.
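
To illustrate the experimentation side, the following is a minimal SQL sketch that summarizes a hypothetical A/B test table (ab_test_events, with one row per user, the variant they were assigned, and a 0/1 converted flag) into per-variant conversion counts and rates. A statistics-focused data scientist would typically feed these summaries into a significance test in a tool such as R or Python.

    -- Summarize a hypothetical A/B test table into per-variant results.
    SELECT
      variant,
      COUNT(*)             AS users,
      SUM(converted)       AS conversions,
      AVG(converted * 1.0) AS conversion_rate
    FROM ab_test_events
    GROUP BY variant;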

Analyst

Analysts are responsible for finding and delivering actionable insights from data, as depicted in Figure 8-5. Whereas data scientists are often tasked with exploratory, open-ended analyses, analysts are responsible for providing a business or organization with critical information. In some cases, this information takes the form of top-line metrics, or key performance indicators (KPIs), that drive or orient the organization. Sometimes these metrics are delivered in reports (both regular and ad hoc); other times they are delivered as general talking points to help justify a decision or course of action. The line between an analyst and a data scientist can be blurry. In many situations, analysts will pursue deeper analyses. For example, in addition to identifying correlated indicators of KPI trends, they might perform causal analysis on these indicators to better understand the underlying dynamics of the system.

Figure 8-5. Green highlights identify the primary actions for a data analyst in our data wrangling workflow framework

Along those lines, in addition to good mathematics and statistics backgrounds, analysts are often steeped in domain expertise. For most organizations, this amounts to a deep understanding of the business and marketplace. More generally, analysts are strong systems thinkers: they are able to connect insights that might be related and then propose ways to measure the extent of their relationships.
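
For example, a top-line KPI of the kind described above often starts as a simple aggregate query over a refined dataset. The sketch below computes a hypothetical weekly revenue KPI by region using the refined customers, products, and orders tables sketched earlier in this chapter; date-literal syntax varies by SQL dialect.

    -- Hypothetical weekly revenue KPI by region over the refined schema.
    SELECT
      c.region,
      COUNT(DISTINCT o.customer_id)  AS active_customers,
      SUM(o.quantity * p.unit_price) AS total_revenue
    FROM orders o
    JOIN customers c ON c.customer_id = o.customer_id
    JOIN products  p ON p.product_id  = o.product_id
    WHERE o.order_date >= DATE '2017-01-01'
      AND o.order_date <  DATE '2017-01-08'
    GROUP BY c.region
    ORDER BY total_revenue DESC;

An analyst might then track how this number moves week over week and look for indicators that correlate with the trend.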

Roles Across the Data Workflow Framework

The workflow framework we described in Chapter 2 comprises the following actions:

  1. Ingesting data

  2. Describing data

  3. Assessing data utility

  4. Designing and building refined data

  5. Ad hoc reporting

  6. Exploratory modeling and forecasting

  7. Designing and building optimized data

  8. Regular reporting

  9. Building products and services

Data engineers, with their focus on data systems, generally drive data ingestion and data description in the raw data stage. Analysts, who possess the requisite business and organizational knowledge, are often responsible for generating proprietary metadata. In some organizations, where the data is particularly complex or messy, generating metadata might also be the responsibility of data scientists.

Moving to the refined data stage, data architects are often responsible for designing and building the refined datasets. Data engineers might be involved if the data storage and processing infrastructure requires modification and monitoring to produce the refined datasets. With the refined datasets in hand, analysts are typically responsible for ad hoc reporting, whereas data scientists focus on exploratory modeling and forecasting.

The production data stage parallels the refined data stage. Data architects and data engineers are responsible for designing and building the optimized datasets. Analysts, with the help of the data engineers, drive the reporting efforts. Data scientists, also with the help of data engineers, work to deliver the data for products and services.

Though we have been selective in our associations between job roles and actions, the reality in most organizations is that people help out wherever they can. Although data engineers typically have the deepest data systems knowledge, data architects the best data cataloging and design knowledge, analysts the most comprehensive domain understanding, and data scientists the deepest statistics and machine learning background, these skills are not exclusive to those roles, and many data projects can make sufficient progress with only cursory knowledge in some of these areas.

Organizational Best Practices

In the remainder of this chapter, we discuss some best practices for coordinating the efforts across these job roles. These best practices come from a combination of our own efforts to wrangle data efficiently and from our observations of how high-functioning organizations manage their data projects.

Perhaps the most important best practice is providing wide access to your data. Of course, we are not suggesting that you provide broad access to overwrite capabilities. Rather, within legal boundaries, everyone in your organization should have the ability to analyze the data you have. Per our discussion of driving broad data-driven value creation, your organization will benefit from opening up access and allowing as many people as possible to find valuable insights. This initiative is more popularly referred to as data democratization or self-service data analytics.

Some argue against wider access to data on the grounds that the infrastructure to support it is costly (we have seen that the generated value more than compensates for the additional costs) and that people will often find conflicting insights (which can slow down the organization while you sort them out). For this last concern, we offer two suggestions. First, build robust refined datasets and drive the majority of your analytics efforts to source from them. This will mitigate superficial conflicts. Second, embrace the remaining conflicts and build practices that address them directly. With superficial conflicts minimized, the conflicts that remain should largely represent different views on how to measure or interpret the data. It is to your organization’s benefit to uncover these differing perspectives and to find constructive ways to reconcile them. Robust insights will survive these interrogations, and your organization should trust their use in making decisions and driving operations all the more as a result.

In conjunction with providing wider access to your data, you should implement mechanisms that track how your data is used. This will improve your ability to resolve conflicting insights. It will also help you determine the costs and benefits of altering your refined datasets (e.g., by adding a dataset that many people are using but currently sourcing from the raw stage, or by adding blended datasets that many analyses rely on).

As your organization relies more and more on complex data enrichments resulting from inferences or predictions (e.g., you might regularly predict the likelihood that a customer will churn, and then use this churn prediction to drive business operations), the ability to track the movement of your data becomes critical to enhancing and protecting the utility of those inferences and predictions. In particular, the inferred values will begin to shift the operations of the organization, which in turn will shift the data that is collected. As the data shifts, it does so partly in response to the inferences themselves. However, if the inference-producing logic assumed that the data did not have this feedback aspect, the inferences might become inaccurate or biased over time. These and related issues are discussed in Sculley et al.’s paper “Hidden Technical Debt in Machine Learning Systems.”1

Building on the idea of providing wider access to your data is finding a common data manipulation language for everyone to use. This is critical to support collaboration on analyses. At a minimum, people who want to work together on the same analysis need to share the basic tools of the analysis. Over time, people picking up old analyses to refresh or extend them will benefit from being able to immediately rerun the prior analysis and then to work within the same language to make their modifications. Today, many organizations rely on Excel or some flavor of SQL as common data manipulation languages and tools.

Another aspect of using a common data manipulation language is the ability to easily transition exploratory analyses into production versions or systems. Historically, many organizations have allowed exploratory analyses to be conducted in one set of tools and languages (Excel, R, Python, SAS, etc.), whereas production versions of these analyses (for regular reporting or for data-driven services and products) are often built in a more basic software engineering framework. More effective organizations have shifted toward tools and languages that support the “productionalization” (or “operationalization”) of exploratory logic more directly. Most of these newer tools can simply wrap exploratory scripts in a scheduling and monitoring framework.

One final best practice: consider building a rotation program that allows people to take on the different roles associated with working with data. Superficially, it will increase the breadth of skills that your organization can take advantage of. More fundamentally, a rotation program can build empathy and trust across these job roles, leading to cleaner hand-offs of projects that span multiple groups.

Having now covered a framework for understanding data projects, how data wrangling works and fits into these projects, and how different job roles also work together in these projects, we turn our attention to data wrangling tools and languages. In line with our best practice recommendation around finding a common language, we’ll cover the two most common wrangling tools and languages today: Excel and SQL. We’ll also discuss a more recently developed data wrangling tool: Trifacta Wrangler. In Chapter 9, we provide a basic overview of these three tools. In subsequent chapters, we provide hands-on examples illustrating how they can be used to perform the variety of transformations and profiling involved in wrangling data.

1 Published at the Neural Information Processing Systems (NIPS) conference in 2015.
