“You’re only given a little spark of madness. You mustn’t lose it.”
—Robin Williams,
“Little spark of madness” stand-up, 1977
Nowadays, data is one of the main assets of any company. As a result, each team of analysts is faced with the need to organize data science processes. Snowflake is a smart choice as a data source for storing structured and semistructured data.
In other words, elastic storage and computes allow you to store an unlimited amount of data at no extra cost with the ability to search for insights through data analysis and model building.
Additionally, the platform has integration possibilities with the most popular data analytical solutions.
Snowflake-supported advanced analytics solutions
Apache Spark introduction
Snowflake and Spark connector
Snowflake and Databricks
Snowflake and the Analytics Ecosystem
Popular Analytics Solutions That Work with Snowflake
Tool | Description |
---|---|
Alteryx | Alteryx1 is a self-service data analytics platform. |
Apache Spark | Apache Spark2 is an open source cluster computing framework. |
Databricks | Databricks3 is a cloud-based big data processing company founded by the creators of Apache Spark. |
DataRobot | DataRobot4 is a predictive analytics platform to rapidly build and deploy predictive models in the cloud or in an enterprise. |
H2O.io | H2O.io5 is an open source machine learning and artificial intelligence platform. |
R Studio | R Studio6 is an open source integrated development environment for R. |
Qubola | Qubola7 is cloud-native data platform based on Apache Spark, Apache Airflow, and Presto. |
Snowflake and Apache Spark
Let’s look at an example of how the interaction between the database and the analytical platform works. Apache Spark is the de facto industry standard for big data engineering and analytics. Spark is an open source analytics framework based on a distributed computing cluster. It usually uses engineering data pipelines, including streaming mode, ad hoc analysis, machine learning, graph analysis, and other types of analytics.
Machine learning is becoming increasingly popular in many companies because it can significantly impact many of the company’s business processes. Spark can work for model training and production. We can build a machine learning model using Spark MLlib, which is an internal machine learning library using distributed, highly scaling algorithms. Additionally, Spark works well with Pandas,8 Scikit-learn,9 TensorFlow,10 and other popular statistical, machine learning, and deep learning libraries.
Moreover, you can use Apache MLflow11 for organizing the lifecycle of the model and Apache Airflow12 or similar solutions for building data pipelines.
Data scientists and analysts prefer to use SQL, R, and Python. Data engineers usually use languages such as Python, Java, and Scala. Spark provides an API with a different number of languages, including all these languages. It depends on the assets and knowledge base of your team members; each can be writing Spark code that executes on a distributed cluster of machines. Your company can choose the optimal strategy for deployment because Apache Spark can be set up and deployed on-premises or in the cloud. Most popular cloud providers such as Amazon AWS, Microsoft Azure, and Google supply Spark as a component or service. AWS supports Spark as part of an EMR13 service. Microsoft supports Spark in the Azure Hadoop–based HDInsight14 as well as the Azure Databricks platform. Additionally, Google Cloud Dataproc15 is a managed service based on Hadoop with Spark.
Components of Apache Spark
Component | Description |
---|---|
Spark SQL | Spark SQL is the module for working with structured data by using SQL. It is compliant with the SQL ANSI 2011 specification and also supports Hive QL. |
Spark Streaming | Spark Streaming is a Spark Structured Streaming API that allows you to build a scalable data pipeline solution that works in near-real-time mode. |
MLlib | MLlib is the implementation of machine learning algorithms in a scalable and distributed manner. |
GraphX | GraphX is the module for graph analytics. For instance, you can use graph analysis as part of the implementation of fraud analytics or customer churn analysis. |
Spark Core with Dataframe API | Spark Core is the distributed scalable engine that deploys on different types of big data clusters such as Mesos, Hadoop YARN, and Kubernetes. It provides a high-level abstraction including RDD and Dataframe16 for operating with structured and semi/unstructured data by using the Python, R, and Scala/Java languages. |
Datasource API | The Datasource API provides the ability to deelop connectors for connecting to Apache Spark. Snowflake has already developed such a library. |
Spark Modules | Spark has an ecosystem for the development and distribution of Spark-compatible libraries. |
Connector for Apache Spark
Snowflake provides Apache Spark Connector,17 which allows you to use Snowflake and Spark together. Let’s dive into how it works.
The connector supports two modes of data transfer, depending on whether the internal stage is used or external. Stages can be based on an AWS S3 or Azure Blob Storage container. Internal stages automatically create and drop during data transfer inside the Snowflake session. However, you can choose the external stages if you prefer to manage the data transfer yourself.
Note
Best practice is to use internal transfer. Use external transfer only if there is a need to store data for more than 36 hours.
Interaction Between Spark and Snowflake
Interaction | Description |
---|---|
Spark dataframe | A Spark dataframe is the data structure in the distributed memory of Apache Spark that can be automatically created and based on the data from Snowflake’s table. The schema of the table and the dataframe schema must match. Otherwise, you must specify the mapping.18 In the opposite direction, it is the data structure that stores the data that is written to the table. |
Snowflake JDBC driver | The Snowflake JDBC driver is a high-performance optimized driver developed by the Snowflake corporation. |
Snowflake Spark Connector | Snowflake Spark Connector is a connector that implements the Spark Datasource API for Snowflake and is published as the Maven19 package. |
Snowflake internal/external stages | Snowflake internal/external stages are Snowflake stages used by the data transfer process. |
Snowflake table | A Snowflake table is the source/target table into the Snowflake DB. |
Working with Databricks
Databricks provides the Databricks Unified Analytics Platform,20 which is a data science and data engineering platform with tools for data engineers to build data pipelines, for data scientists to build machine learning models, and for business users to consume real-time dashboards.
In addition to the Spark engine, the platform provides many additional enterprise-level components for building a complete process for gathering insights from data sets, as well as designing and testing machine learning models.
Components of Databricks Unified Analytics Platform
Component | Description |
---|---|
Databricks Workspace | The Databricks Workspace is an environment that provides a hierarchy with notebooks, libraries, dashboards, and experiments with appropriate access to them. |
Databricks Runtime | The Databricks Runtime is a number of components that improve the usability, performance, and security of big data analytics. Apache Spark is an open source analytics framework based on a distributed computing cluster. Delta Lake21 is an open source storage layer with ACID, scalable metadata, and versioning features. It also includes schema evolution features for building data lakes. It can be built on S3, Azure Data Lake Storage, and HDFS.22 ML Libraries is a list of the most popular machine learning libraries such as TensorFlow. ML Flow23 is an open source platform for managing the machine learning lifecycle, including the following: ML Flow Project provides a code packaging format. ML Flow Models provides a model packaging format. You can deploy it to Docker or Azure ML for serving Apache Spark. ML Flow Tracking is a component for tracking experiments, including code, parameters, and metrics. |
Databricks Cloud Services | The platform can be deployed on AWS and Azure. Databricks is software as a service. This means the platform provides the benefits of a fully managed service and reduces infrastructure complexity. |
Using Snowflake and Databricks Together
- 1.
Sign into Azure at azure.microsoft.com.24
Note The minimum requirement for a Databricks cluster is two nodes. Your Azure account has to be “pay as you go.” Please check your account’s limits and quotas25 and pricing details.26
- 2.
Log into your Azure account.
- 3.Create a new Databricks service using Home ➤ Azure Databricks ➤ Create Azure Databricks service. See Figure 12-4.
- 4.
Open a Databricks environment.
- 5.
Create a new small Spark cluster by selecting Cluster ➤ Create cluster. See Figure 12-6.
Set Cluster Name to db_cluster.
Databricks Runtime 5.3 ML means the set of core runtimes, including Apache Spark and Scala; machine learning libraries like Pandas, PyTorch, and TensorFlow; and other popular data science packages.
Next, use Python version 3.
- 6.
Create a new notebook using Azure Databricks ➤ Create a blank notebook, call it snowflake_test, and attach the existing cluster. See Figure 12-8.
- 7.
Connect to Snowflake.
Replace the substitutions according to your Snowflake credentials before executing Listing 12-1.
Connecting to Snowflake’s TPCH_SF1.SNOWFLAKE_SAMPLE_DATA
- 8.
Read data from Snowflake.
- 9.
Write data into Snowflake.
Summary
In this chapter, we covered how Snowflake works with modern analytics solutions. You learned about which popular advanced analytics platforms have deep integration with Snowflake and saw how this is done in practice by running through a quick example of Databricks.
In the next chapter, you will learn about how to migrate a legacy data warehouse system into Snowflake.