In the current century, one of the core functions of all companies is to retrieve data from its source and get it into the hands of your company's analysts, decision makers, and data scientists. This data flow allows businesses to make decisions, supported by empirical evidence, quickly and with confidence. The capability also gives businesses a robust process for delivering the data flows with a significant advantage over their competition.
Creating robust data flows requires that end users find the datasets and trust the raw data source. End users need to know what transformations were applied to the dataset to build trust. They also need to know who to talk to if their needs change. Alteryx gives data engineers and end users a single unified place to create data pipelines and discover data resources. It also provides the context that gives end users confidence when making decisions based on any of those datasets.
This book will describe how to build and deploy data engineering pipelines with Alteryx. We will learn how to examine to apply DataOps methods to build high-quality datasets. We will also learn the techniques required for monitoring the pipelines when they are running in an automated production environment.
This chapter will introduce the Alteryx platform as a whole and the major software components within the platform. Then, we will see how those components fit together to create a data pipeline, and how Alteryx can improve your development speed and build confidence throughout your data team.
Once we understand the Alteryx platform, we will look into Alteryx Designer and familiarize ourselves with the interface. Next, we will set a baseline for building an Alteryx workflow and use Alteryx to create standalone data pipelines.
Next, we will investigate the server-based components of the Alteryx platform, Alteryx Server, and Alteryx Connect. We will learn how Alteryx Server can automate the pipeline execution, scale the efforts and work of your data engineering team, and serve as a central location where workflows are stored and shared. We will also learn how Alteryx Connect is used to find data sources throughout an enterprise, build user confidence with data cataloging, and build trust in the data sources by maintaining the lineage.
Finally, we will see how this book can help your data engineering work and link each part of the data engineering pipeline with the Alteryx platform applications.
In this chapter, we will cover the following topics:
The Alteryx platform is the Alteryx software suite that combines processing, managing datasets, and analysis. While a lot of focus in the Alteryx community tends to be on the business user analyst, a data engineer's benefits are extensive. Alteryx as a whole allows for both code-free and code-friendly workflow development, giving it the flexibility to quickly transform a dataset while having the depth to make complex transformations using whatever tool or process makes the most sense.
In this section, we will learn about the following:
The Alteryx platform is a collection of four software products:
Important Note
Alteryx software products have Alteryx as part of the name. Generally, the name Alteryx is dropped from the name in discussions and that will often happen throughout this book.
Because the data science deployment falls into Machine Learning Operations (MLOps), it isn't a core component of the Data Operations (DataOps) process. Thus, while you might have some interactions with the model deployment as a data engineer, we will be focusing on extracting and processing the raw datasets rather than the model management and implementation that Promote supports. As such, the Promote software will be beyond the scope of this book.
Now that we know what the Alteryx platform is and what software is available, we can look at how Alteryx will fit into a business case.
The Alteryx platform is all about creating a process where iteration is easy. All too often, when integrating a new data source, you won't always know the answer to the following questions until late in the process:
Trying to develop a workflow to answer these questions with a pipeline focused on writing code, common areas of frustration appear when trying to iterate through ideas and tests. These frustrations include the following:
The visual nature of Alteryx lets you quickly think through the pipeline, and see what transformation is happening where. When errors appear in the process, the tool will highlight the error in context.
It is also easy to trace specific records back through the process visually. This tracing renders straightforward the process of identifying when a transformation takes place that results in a logical error.
The Alteryx platform's key benefits to a data engineer arise in three major cases:
These benefits fall under an overarching theme of making it easier to get new datasets to the end user. For example, suppose the development time, debugging, and documentation can all be made simpler. In that case, responding to requests from analysts and data scientists becomes something to take pride in rather than dreading.
The Alteryx platform supports the speed of development with two fundamental features:
The visual development process helps a data engineer by allowing them to lay out the pipeline onto the Alteryx canvas. Of course, you can create the pipeline from scratch, which is often the case if little information about the end destination is available. Still, you can build the pipeline from a data flow chart with the principal steps preplanned.
This translation process uses the transformation tools that provide the building blocks for a workflow. By aligning those tools with a logical grid across (or down) the Designer canvas, you can see each step in the pipeline. Such an arrangement allows you to focus on each step to identify when the data might diverge for a particular process and add any intermediate checks.
The other benefit is speed – the fact that the Alteryx engine performs the operations quickly. One of the reasons for this performance is that transformations take place in memory and with the minimum memory footprint required for any particular change.
For example, when a column with millions of records has a formula applied, only the cells (the row and column combination) that are processed are needed in memory. The result is that the transformations that Alteryx does are fast.
The location of the dataset is often the only limit to Alteryx's in-memory performance. For example, opening a large Snowflake or Microsoft SQL Server table in Alteryx can become bottlenecked by network transfers. In these cases, the InDB tools can perform calculations on the remote database to minimize the problem and reduce the volume of data transferred locally.
The next significant benefit is the inherent iterative workflow that Alteryx development uses. When building a data pipeline, the sequencing of the transformations is vital to the dataset result.
This iterative process allows you to do the following:
The iterative process allows the data engineer to test changes quickly without worrying about how long it will take to compile or if you haven't noticed a typo in the SQL script.
Each tool in Alteryx will automatically document itself with annotations. For example, a formula tool will list the calculations taking place.
This self-documenting provides a good starting point for the documentation of the overall workflow. You can supplement these annotations by adding additional context. The further context can be renaming specific tools to reflect what they are doing (which also appears in the workflow logs). Add comment sections to the canvas or grouping processes with tool containers.
We now understand why the Alteryx platform is a powerful tool for data engineering and some of its key benefits. Next, we need to gain a deeper insight into the benefits that using Alteryx Designer can bring to your data engineering development.
We have covered at a high level what the benefits of the Alteryx platform are. This section will look a bit closer at Alteryx Designer and why it is suitable for data engineering.
As mentioned previously, Designer is the desktop workflow creation tool in the Alteryx platform. You create the data pipelines and perform advanced analytics in Designer. Designer can also create preformatted reports, upload datasets to API endpoints, or load data into your database of choice.
Here, we will answer some of the questions that revolve around Designer:
Answering the preceding questions will give you a basic understanding of why Designer is a good tool for building your data pipelines and the basis for the DataOps principles we will talk about later.
Alteryx Designer utilizes a drag-and-drop interface for building a workflow. Each tool represents a specific transformation or process. This action and visibility of the process allow for a high development speed and emphasize an iterative workflow to get the best results. Throughout the workflow, you can check the impact of the tool's changes on the records and compare them to the tool's input records.
If you open a new Designer workflow, you will see the following main interface components:
These components are shown in the following screenshot:
Each of these sections provides a different set of information to you while building a workflow.
The Canvas gives a visual representation of the progress of a workflow, the configuration page allows for quick reference and the changing of any settings, and the results window provides a preview of the changes made to the dataset.
This easy viewing of the entire pipeline in the canvas, the data changes at each transformation, and the speedy confirmation of settings in the workflow allow for rapid iteration and testing. As a data engineer, getting a dataset to the stakeholder accurately and quickly is the central goal of your efforts. These Designer features are focused on making that possible.
The default orientation for a workflow is left to right, but you can also customize this to work from top to bottom. Throughout this book, I will describe everything in this context, but be aware that you can change it.
Accessing Online Help
When working in the Designer interface, you can access the online help by pushing the F1 button on your keyboard. Additionally, if you have a particular tool selected when you push the F1 button, you will navigate to the help menu for that specific tool.
Let's build a simple workflow using the tools in the Favorites tool bin. We will complete the following steps and create the completed workflow shown in Figure 1.2:
You can look at the example workflow in the book's GitHub repository here: https://github.com/PacktPublishing/Data-Engineering-with-Alteryx/tree/main/Chapter%2001.
Using an Input Data tool, we can connect to the Cust_wTransactions.xls dataset. This dataset is one of the Alteryx Sample datasets, and you can find this in the Alteryx Program folder, located at C:ProgramFilesAlteryxSamplesdataSampleDataCust_wTransactions.xls.
In step 2 of the process, we create a field with the following steps:
The workflow of the preceding steps can be seen in the following screenshot:
The third step in the process is to summarize the results to find the average speed per customer in each city as follows:
The configuration for the summary described is shown in the following screenshot:
The final step in our workflow is to view the results of the processing. We can use the Browse tool to view all the records in a dataset and see the full results.
The process we have looked at works well on smaller datasets or data in local files. It is less effective when working with large data sources or when the data is already in a database. In those situations, using InDB tools is a better toolset to use. We will get an understanding of how to use those tools in the next section.
The InDB tools are a great way to process datasets without copying the data across the network to your local machine. In the following screenshot, we have an example workflow that uses a sample Snowflake database to process 4.1 GB of data in less than 2 minutes:
You can look at the example workflow in the book's GitHub repository here: https://github.com/PacktPublishing/Data-Engineering-with-Alteryx/tree/main/Chapter%2001.
This workflow entails three steps:
When looking at the visual layout, we see the generation of the query, where the logic branches off, and how we merge the logic back onto the dataset. The automated annotations all provide information about what is happening at each step. At the same time, the tool containers group the individual logic steps together.
We will look at how to use the InDB tools in more detail in later chapters, but this workflow shows how complicated queries are run on large datasets while still providing good performance in your workflow.
Building better documentation into your workflow improves the usability of the workflow. Therefore, adding this documentation is considered the best practice to employ when developing a workflow. We will explore how we can apply the documentation in the next section.
Applying Designer best practices makes your data engineering more usable for you and other team members. Having the documentation and best practices implemented throughout a workflow embeds the knowledge of what the workflow components are doing in context. It means that additional team members, or you in the future, will be able to open a workflow and understand what each small section is trying to achieve.
The best practices fall into three areas:
These three areas all focus on making a workflow decipherable at a glance and quickly understandable. They give new data engineers the information they need to understand the workflow when adopting or reviewing a project.
With a completed workflow, the next step will be making the workflow run automatically. We also need to make the datasets that the workflow creates searchable and the lineage traceable. We will use Alteryx Server and Alteryx Connect to achieve this, which we will look at next.
Once you have successfully created a data pipeline, the following process is to automate its use. In this section, we will use Alteryx to automate a pipeline and create discoverability and trust in the data.
The two products we will focus on are Alteryx Server and Alteryx Connect. Server is the workflow automation, scaling, and sharing platform, while Connect is for data cataloging, trust, and discoverability.
Server has three main capabilities that are of benefit to a data engineer:
Connect is a service for data cataloging and discovery. Data assets can be labeled by what the data represents, the field contents, or the source. This catalog enables the discovery of new resources. Additionally, the Data Nexus allows a data field's lineage to be traced and builds trust with users to know where a field originated from and what transformations have taken place.
Once we have created a pipeline, we may want to have the dataset extracted on a regular schedule. Having this process automated allows for more robust implementation and makes using the dataset simpler to use.
Orchestrating a data pipeline with Alteryx Server is a three-step process:
This three-step process is deceptively simple and, for this introduction, only covers the most straightforward use cases. Later, in Chapter 10, Monitoring DataOps and Managing Changes, we will walk through some techniques to orchestrate more complex, multistep data pipelines. Still, those examples fundamentally come back to these three steps mentioned above.
In the following screenshot, we can see how we can define the time frame for our schedule on the Server Schedule page:
On this page, we can define the frequency of a schedule, the time the schedule will occur, and provide a reference name for the schedule.
The final piece of your data engineering puzzle is how will users find and trust the dataset you have created? While you will often generate datasets on request, you also find that users will come to you looking for datasets you have already made, and they don't know they exist.
Connect is a data cataloging and discoverability tool for you to surface the datasets in your organization and allow users to find them, request access, and understand what the fields are. It is a central place for data definitions and allows searching in terms of how content is defined.
Now that you know the basics of using Alteryx, we can investigate how Alteryx applies to data engineering. Data engineering is a broad topic and has many different definitions, depending on who is using it. So, for the context of this book, here is how I define data engineering:
Data engineering is the process of taking data from any number of disparate sources and transforming them into a usable format for an end user.
It sounds simple enough, but this definition encapsulates many variables and complexity:
Chapter 2, Data Engineering with Alteryx, will expand on what this definition means. It will also explain how Alteryx products cover all the steps needed to deliver that definition.
So far in this introduction, we have talked about how the parts of Alteryx can help the data engineering process independently. However, each Alteryx element also works together to build a complete, end-to-end data engineering process.
There is a common set of processes that are required when completing a data engineering project. These processes are shown in the next diagram along with what Alteryx software is usually associated with that process:
The preceding screenshot shows Designer overlapping the data sources and transformation aspects of the processes, Server overlays the automation (which performs some of the transformations), and Connect covers the discovery section of the process.
Chapter 2, Data Engineering with Alteryx, will introduce a complete data engineering example and the DataOps principles that support data engineering in Alteryx. Finally, Chapter 3, DataOps and Its Benefits, will take the principles introduced and expand on why those principles will benefit data engineering and your organization.
I want to share two example use cases where Alteryx provides an excellent platform for data engineering from my consulting work.
In the first example, my client uses Alteryx Designer to create a series of workflows to collect reference information from a third party. They automate this process on Server to extract the information from the source text files and load them into their data warehouse daily. These resources are then shared with people throughout the company and made discoverable.
The other use case is where a medium-sized business uses Alteryx to collect the core company information from scattered business APIs; finance and billing, social media and web analytics, CRM, and customer engagement. Next, the company automatically consolidates the business resources into the core reporting database. The company then discovers the centralized data sources in Connect while Alteryx populates an additional data catalog for the Business Intelligence tool.
In this chapter, we have learned the parts that make up the Alteryx platform. We have also learned how they can benefit you as a data engineer with faster development, an iterative workflow, and extendable self-documentation.
We examined an example of how to build a workflow with Designer and learned what the InDB tools can do. Finally, we introduced Server and Connect. We learned how Server can automate and scale your data engineering developments. Then we learned that Connect provides a place for user discovery of the datasets you have created.
In the next chapter, we will expand on what a data engineer is for Alteryx and how you can use Alteryx products for data engineering. Then we will introduce DataOps and why this is a guiding principle for data engineering in Alteryx.