In Chapter 2, Data Engineering with Alteryx, we introduced Alteryx products, the DataOps framework, and how Alteryx products are accommodated within the DataOps framework. This chapter will look at the key benefits of applying DataOps and what rewards you will gain by implementing DataOps in your organization.
We will explore the principles of DataOps and investigate how they apply to Alteryx development. We will also look at the specifics of which tools in the Alteryx platform can implement the principles of DataOps.
Throughout this chapter, you will learn how Alteryx can help you apply the DataOps process and its principles in an Alteryx pipeline.
In this chapter, we will cover the following topics:
As a reminder, DataOps is the methodology that defines the systems and structures for building data pipelines. The most significant benefits of DataOps are as follows:
These benefits have allowed me to deliver data pipelines to my customers and end-users faster. The speed of delivery enables end users to analyze their datasets and immediately provide the feedback needed to tune that dataset to the result they need. Additionally, the inclusion of testing, reporting, and monitoring has provided confidence in the dataset when delivering the completed project for them to maintain in the future.
For example, when building an integration pipeline for a customer, returning incremental workflow improvements faster allows feedback to be implemented in the next cycle. Next, by running a test suite that identifies our metrics during development, the end user is confident that the dataset delivers on its requirements. Additionally, we can use the test suite in production to ensure that the dataset continues to deliver on the requirements as expected.
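As a hedged sketch of what such a test suite might look like (the field names, rules, and messages here are hypothetical illustrations, not part of any Alteryx tool), a handful of agreed metric checks over a delivered dataset could be expressed as:

```python
def run_test_suite(rows):
    """Validate agreed-upon metrics for a delivered dataset.

    `rows` is a list of records; the field names and rules below are
    illustrative placeholders for whatever the end user agreed on.
    """
    failures = []
    if not rows:
        failures.append("dataset is empty")
    ids = [r["customer_id"] for r in rows]
    if any(i is None for i in ids):
        failures.append("customer_id contains nulls")
    if len(ids) != len(set(ids)):
        failures.append("customer_id is not unique")
    if any(r["order_total"] < 0 for r in rows):
        failures.append("order_total has negative values")
    return failures

good_rows = [
    {"customer_id": 1, "order_total": 10.0},
    {"customer_id": 2, "order_total": 25.5},
]
print(run_test_suite(good_rows))  # -> []
```

The same function can run during development for fast feedback and again in production, where a non-empty failure list would halt or flag the pipeline run.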
These benefits are delivered through automation and reduced duplication. How we can achieve these benefits is explained in the next section.
With DataOps, we can achieve faster cycle times by applying a consistent process in developing the data pipelines. Furthermore, implementing this consistency means that a lot of the setup portion of a project is standardized, and we can minimize the duplication of processes across projects.
The faster cycle time also helps both end users and data engineers to enjoy the process more as they see tangible results sooner, and all parties can progress to the next step in their analysis.
We can achieve actionable insights by standardizing repetitive parts of the process, for example, having a standardized and documented process for connecting external sources to internal databases. Additionally, any data tasks, such as everyday cleansing and control table updates, can be made into a reusable macro to simplify that part of the process.
Implementing robust monitoring processes, such as applying Statistical Process Controls (SPCs), enables end users to be confident in the results of the data pipeline. The SPCs are an automated system for identifying when the variations in your dataset fall outside acceptable limits. These standardized processes allow teams to detect when errors might emerge and catch the outliers in the datasets so that they can be processed appropriately.
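A minimal sketch of the SPC idea, using a Shewhart-style three-sigma individuals chart over a hypothetical daily row-count metric (the metric and history are illustrative; a real deployment would pull them from pipeline run logs):

```python
import statistics

def spc_limits(history, sigmas=3.0):
    """Compute lower/upper control limits from historical metric values."""
    mean = statistics.fmean(history)
    sd = statistics.pstdev(history)
    return mean - sigmas * sd, mean + sigmas * sd

def out_of_control(history, new_value, sigmas=3.0):
    """Flag a new observation that falls outside the acceptable limits."""
    lcl, ucl = spc_limits(history, sigmas)
    return not (lcl <= new_value <= ucl)

# Hypothetical history: row counts from the last seven pipeline runs.
daily_row_counts = [1010, 995, 1003, 990, 1008, 1001, 997]
print(out_of_control(daily_row_counts, 1004))  # False: within limits
print(out_of_control(daily_row_counts, 1500))  # True: flag for review
```

A flagged run would not necessarily be wrong, but it tells the team that today's variation is outside what the process has historically produced and deserves a look before the data reaches end users.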
The core workflows in Alteryx give an overview of the process at the transformation level. For example, watching the data flow through a workflow, checking the transformations at each point, and monitoring the process controls provide an overview of the entire process.
When you add that to Connect's data lineage features, you can see the entire data flow from source through transformation, stopping at any intermediate staging files prior to final ingestion and consumption. This complete view of the data flow means we can assess the impact of any process changes. Again, this helps deliver the DataOps principles by improving data pipeline transparency and interaction between teams and developers.
Finally, having a centralized platform for the analytics process allows for better management of the entire data estate. Having a view of who is interacting with data assets, which users should have access to those assets, and a centralized process for managing those processes allows for tight control of security and governance of the data pipeline.
With these benefits explained, we can learn what the DataOps principles are. We can also link the application of the DataOps principles to the benefits they will bring when we implement them in an organization.
The DataOps principles are the set of guidelines that help deliver datasets and pipelines more efficiently. According to the DataOps manifesto (https://dataopsmanifesto.org/en/), 18 different principles are recommended. Those principles fall into three main pillars that form the basis of DataOps:
We can see the pillars and a summary of the principles in Figure 3.1, and we will look at each principle in detail later in this chapter:
But for now, we will investigate the themes of each pillar:
Let's look at each of the pillars and the principles that fit into those pillars.
As mentioned, the People pillar is all about the culture that DataOps wants to drive. It focuses on putting the end user or customer at the center of every decision made when delivering a data pipeline.
We can apply the following principles to achieve this culture:
The DataOps principles that constitute the People pillar, described above, focus on creating a communicative culture. With a team of data engineers, data analysts, and data scientists, you can collectively solve any request made to you by the organization.
Getting data to the end user reliably can be challenging. Being able to serve data requests quickly and efficiently is made possible by the Delivery pillar. By following the Delivery pillar, you can make decisions quickly:
You are better able to serve and judge a data request by following these six principles. By implementing the improve cycle times, simplicity, and reuse principles, you can quickly create new datasets and minimize excess work. When a process is needlessly rebuilt or overcomplicated, it is harder to deliver a good data product. Improving cycle times means that a larger project is delivered in smaller chunks to ensure smooth completion.
The value working analytics principle focuses not on creating a pipeline or what tools you use to achieve the dataset, but on the actual data product delivered.
Finally, the last three principles are the functional parts of the data pipeline delivery. When you can orchestrate a process from ingestion to insight, you control all the different environments involved. You can manage the tool versions and server environment configuration. The Analytics is code principle highlights the fact that the process used to deliver a data product can be examined and checked. The Analytics is manufacturing principle highlights the fact that the pipeline you create is an analytic insights product, which means you can continuously improve the process.
The Confidence pillar is about creating the conditions to support the confidence of your end users in your data pipeline. For example, suppose end users have doubts regarding the datasets due to inconsistent processing or unexplained errors. In that case, the issues erode the value of your pipeline, and those users will not engage with your datasets. The following principles help build the systems needed to ensure that your end users are confident that the data you are providing is accurate and valuable to them:
These six principles combine to provide the systems and controls to support your data pipeline deployments. In addition, they provide the mechanisms for managing the data and pipelines while also emphasizing that processes should be continuously improved when put into production.
If end users don't have confidence in a dataset, or developers aren't confident in the environments they are developing in, the result can be stagnating datasets and poor pipeline performance. However, you can manage and minimize those risks if you implement these pillars.
Now that we know the principles and the pillars of DataOps, how does DataOps work with Alteryx? Like any tool you can use with DataOps, some principles integrate into the Alteryx platform very easily. In contrast, other principles need to be considered carefully for the best results.
Building a culture with Alteryx to support the People pillar enables the flexibility to approach any problem. You can continually deliver the minimum viable product to the customer and adapt to customers' demands with fast development speeds. Alteryx Development follows an iterative process that matches the DataOps principles. Additionally, lean manufacturing, which also forms the basis of DataOps principles, encourages the continuous improvement that Alteryx development supports.
Another benefit of the Alteryx platform for the People pillar is the code-friendly capabilities of the software. While Alteryx is primarily a code-free, GUI-based tool, features such as the Python tool, the R tool, and the Run command tool allow external processes to be integrated into a workflow easily. This flexibility enables you to leverage a range of tools within the same platform.
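As an illustration of the kind of logic the Python tool can host, here is a small pandas sketch; the data and the hand-off points are stand-ins for the tool's incoming and outgoing anchors, not Alteryx's actual I/O calls:

```python
import pandas as pd

# Hypothetical stand-in for records arriving from an upstream Alteryx
# tool; inside the Python tool, this frame would come from the
# workflow's incoming anchor rather than being built by hand.
df = pd.DataFrame({
    "region": ["North", "South", "North"],
    "sales": [100, 80, 120],
})

# Any Python or pandas logic can sit mid-workflow; here, a simple
# aggregation whose result would be passed on to the next tool.
summary = df.groupby("region", as_index=False)["sales"].sum()
print(summary)
```

The same pattern applies to the R tool or the Run command tool: the external code does one focused piece of work, and the visual workflow around it stays readable.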
Leveraging the team sport and self-organizing principles with Alteryx is about applying the mentality that delivers the best results for an Alteryx developer. This mentality is to work iteratively and take advantage of the community resources available. Additionally, with multiple team members working together, each person can look at the problem and identify any issues and questions that might impede a project's progress.
The principles in the Delivery pillar are focused on getting data to end users as quickly and efficiently as possible. Alteryx can help by leveraging its code-free base practically and delivering quickly through fast iterations and simple, easy-to-understand workflows. Additionally, you can easily reuse processes by packaging them into macros.
How you deliver a project in Alteryx has a strong correlation with the DataOps principles. For example, development is cycled quickly through each iteration, the visual workflow can be easily understood and followed, and you can package any repeated sections or processes for reuse.
The principle of valuing working analytics, where you deliver a complete project in smaller working chunks, is leveraged at each stage in workflow development. For example, when building an Alteryx workflow, the datasets are checked frequently with the browse anywhere feature, examined in-depth with browse tools, and can be sense checked using summary functions, charts, and test checks. In addition, a working pipeline is needed to deliver a dataset to your end user and that pipeline enables you to verify the process used.
The final three principles are fulfilled in the Alteryx platform as part of the underlying make-up of the software. Alteryx manages process orchestration either inside its ecosystem or by leveraging outside tools to deliver the end-to-end pipeline. Each Alteryx workflow contains the details of the specific software version that ran the data pipeline. Because these specifics are held in the XML of the Alteryx file recording the pipeline, you can treat an Alteryx workflow as you would any other code and manage it with standard software version control.
One challenge with treating an Alteryx workflow as code is that, because it is built in a visual interface, small cosmetic changes that an Alteryx data engineer makes can produce diffs in version control software that are challenging to interpret. The challenge lies in deciding whether a given change in the XML is essential to the workflow or just visual.
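One hedged way to tame those noisy diffs, sketched below with a made-up workflow schema (the element and attribute names are hypothetical placeholders, not Alteryx's actual XML), is to strip layout-only attributes before comparing versions:

```python
import xml.etree.ElementTree as ET

def strip_layout(xml_text, layout_attrs=("x", "y", "width", "height")):
    """Remove layout-only attributes so a diff reflects logical changes.

    The attribute names are hypothetical; a real implementation would
    target whichever attributes the workflow format uses for canvas
    positioning.
    """
    root = ET.fromstring(xml_text)
    for elem in root.iter():
        for attr in layout_attrs:
            elem.attrib.pop(attr, None)
    return ET.tostring(root, encoding="unicode")

# The tool moved on the canvas, but the logic is identical.
before = '<Workflow><Tool id="1" x="10" y="20"><Filter expr="a &gt; 0"/></Tool></Workflow>'
after = '<Workflow><Tool id="1" x="90" y="55"><Filter expr="a &gt; 0"/></Tool></Workflow>'
print(strip_layout(before) == strip_layout(after))  # True: only the layout moved
```

Running a normalization step like this before committing, or as a diff filter, lets reviewers concentrate on changes that actually alter the pipeline's behavior.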
Quality and performance monitoring for Alteryx pipelines is not created automatically within the Alteryx platform. The workflow runtime information is recorded automatically, with the run details saved in the Alteryx Server database, but it must be deliberately extracted and analyzed. You can employ sub-workflows or sub-processes in the data pipeline to check the quality of the records.
Alteryx supports the principle of reducing heroism through the readability of its workflows. This readability means that other data engineers can take over a pipeline relatively quickly, even when the handover is unplanned, because an Alteryx workflow is largely self-documenting. When documentation best practices are adhered to, the handover process is more straightforward still.
Alteryx leverages the reproducible development and disposable environment principles because Alteryx does not change the underlying data without explicitly forcing a write process. Because all workflows are non-destructive, developers can test and iterate ideas and try novel processes without affecting the underlying data infrastructure. Additionally, when creating a new environment for an Alteryx developer, the base install contains the functionality required to run any workflow.
The limitation of this in the Alteryx environment is that it doesn't include any custom reusable macros. However, suppose a consistent process is implemented to save and access these reusable processes, such as saving to a consistent network location or cloud storage provider. In that case, you can easily add those macros to any new development environment.
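A minimal sketch of that idea, with hypothetical paths, copying a team's shared macros into a freshly created development environment:

```python
import shutil
import tempfile
from pathlib import Path

def install_shared_macros(shared_dir, local_macro_dir):
    """Copy team macros from a shared location into a new environment.

    The paths are hypothetical placeholders; .yxmc is the Alteryx
    macro file extension.
    """
    src, dest = Path(shared_dir), Path(local_macro_dir)
    dest.mkdir(parents=True, exist_ok=True)
    copied = []
    for macro in src.glob("*.yxmc"):
        shutil.copy2(macro, dest / macro.name)
        copied.append(macro.name)
    return sorted(copied)

# Demo with temporary directories standing in for a network share
# and a local machine's macro folder.
with tempfile.TemporaryDirectory() as shared, tempfile.TemporaryDirectory() as local:
    (Path(shared) / "clean_dates.yxmc").write_text("<macro/>")
    print(install_shared_macros(shared, Path(local) / "Macros"))  # -> ['clean_dates.yxmc']
```

Run on a schedule, or as part of machine setup, a script like this keeps every developer's environment stocked with the same reusable components.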
Deploying DataOps with Alteryx starts with understanding the mindset and identifying where you are using Alteryx. We have discussed where we can achieve this, but we also need to see how each Alteryx program is used in DataOps projects.
The Alteryx platform components of Designer, Server, and Connect deliver the parts needed to support the DataOps framework pillars and fulfill the principles that make up those pillars. Each of the different software components fits across the People, Delivery, and Confidence pillars, serving the various functions needed to deliver in a DataOps framework.
In Figure 3.2, we have the same pillars from earlier, but with an overlay of where each Alteryx product interacts with the different pillars:
In the diagram, we can see the three DataOps pillars and how Designer supports the People and Delivery pillars, Connect enables the Confidence pillar, and Server supports the principles across all pillars.
Building a DataOps pipeline will always start with development in Alteryx Designer. First, the quick iteration that I have been speaking about allows for continuous satisfaction and allows the data engineer to embrace any changes during the project. Next, Designer enables the team sport and self-organization principles. Each person in the team can interact using the same visual design process while bringing their expertise with any specific language or tool and leveraging that in Alteryx.
For the Delivery pillar, you can find the repeated processes that appear in a data pipeline and extract them into a reusable macro. We can make these macros available to other team members so that they don't have to recreate work. It also increases the simplicity of the process, as the visual clutter of a repeated, standardized process can be collapsed into a single tool.
Finally, you can build confidence in the datasets with Designer by designing performance and monitoring solutions. You can then ensure the quality of datasets by creating insight dashboards that monitor data processing.
The next component for supporting DataOps is Server. This allows the automation of pipelines created in Designer. This central processing of workflows also manages the process for simple version control and the sharing of workflows and parameterized applications throughout the organization.
Once you save the workflows to Server, any insights you develop in relation to quality control can be automatically populated and monitored in the Server Gallery web interface. Automating the running of workflows in Server removes the heroism, or key-person risk, that might otherwise be present in an Alteryx data pipeline.
Finally, centralizing workflows to a single server gallery allows silos to be broken across organizations and cross-company teams to be established. Additionally, those cross-company teams can self-organize using collections for sharing and management when Server administrators set the permissions to allow it.
The final orchestrating capability is incorporated into the Alteryx platform by using Connect to share all resources across the organization from Alteryx or elsewhere. In addition, the Connect platform allows for the discovery of reusable workflow components, existing datasets and reports, and any other content assets that might answer the end user's questions without needing a new project.
Showing the lineage of the datasets strengthens users' confidence. When users see the transformations and where other colleagues use the datasets, they gain community validation for a dataset. In addition, the Connect software can reveal the quality monitoring workflows, insights, and dashboards that are already available.
The People pillar is supported by putting users who need access to an asset in contact with its owners. It can build a community knowledge base around those existing assets for users and facilitate communication to answer different questions.
We have talked about the concepts of how Alteryx works with DataOps and what each Alteryx program delivers within DataOps. We can now investigate the steps to deploy a DataOps project with Alteryx.
Now that we know what the DataOps principles are and what parts of Alteryx build the DataOps pipeline, how would you implement DataOps in your company? The best way to demonstrate this is to introduce an example process that we will use throughout the rest of this book.
The data pipeline example we will be following is this:
As a company, we want to enrich our marketing efforts by integrating regularly updated public datasets. We need to identify the source of these datasets (and make sure we have the legal authority to use them) and transform them to match our company areas. Then, we have to make them available to both the data science team for machine learning and to our operational teams across the organization.
This problem statement works well in identifying the process we need to implement, shown in Figure 3.3:
With the method shown in the preceding diagram, we can explore what each step represents:
These general steps are common to any data pipeline. Of course, this process does not address every requirement, and there is plenty of flexibility to go outside it when needed. Nor are all steps necessary in every situation; for example, the dataset might be internal and only require modification for a new application. However, this provides a reasonable basis for implementing a DataOps project.
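Under hypothetical names, sources, and area mappings, the general shape of this pipeline can be sketched as a small orchestration skeleton; each stub stands in for an Alteryx workflow stage:

```python
def source(url):
    """Acquire the public dataset (stubbed; a real step would download
    the data and confirm we have the legal authority to use it)."""
    return [{"area_code": "A1", "population": 5200},
            {"area_code": "B2", "population": 3100}]

def transform(rows, area_map):
    """Map the public dataset's area codes onto the company's areas."""
    return [{"company_area": area_map.get(r["area_code"], "unknown"),
             "population": r["population"]} for r in rows]

def publish(rows, destinations):
    """Hand the curated rows to each consuming team (stubbed; returns
    a per-destination row count as a delivery receipt)."""
    return {dest: len(rows) for dest in destinations}

# Hypothetical source URL and area mapping.
raw = source("https://example.org/public-dataset")
curated = transform(raw, {"A1": "North", "B2": "South"})
print(publish(curated, ["data_science", "operations"]))
# -> {'data_science': 2, 'operations': 2}
```

The value of laying the pipeline out this way is that each stage has a single responsibility, so the test suites and monitoring discussed earlier can attach to well-defined boundaries.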
In this chapter, we learned what the DataOps principles are and how they work within the three pillars of People, Delivery, and Confidence, as well as the specifics of building a data pipeline with Alteryx and how to implement these principles.
In the following chapters, we will look at the different processes for accessing data sources on the basis of the DataOps framework. Then, we will learn how to transform those datasets with Alteryx and publish them to other storage locations. Finally, we will use Alteryx to build reports and visualizations as part of the value-extraction process.
In the next chapter, we will start by accessing internal data sources, whether those data sources are files or database connections. We will then look at how we can access external data sources, either open public or private secured data sources. Then we will look at some of the initial processes for making sure our sourced data is in a raw format for us to use.