Nowadays, companies buy and use many systems from different vendors. Data engineers eventually face the problem of supporting multiple data formats when building analytical solutions and maintaining data sources, while data scientists struggle to deliver valuable insight from semistructured data.
Historically, loading semistructured data into a relational repository required converting it to another format first. With the NoSQL revolution, such databases came to be used alongside relational databases, and ultimately relational engines began to support semistructured data natively.
Another concept that came from big data is the so-called schema-on-read approach: you first load the data as it is, without thinking about the schema, and define the schema later, once the data is already in the database and you start working with it. Snowflake is a full ANSI SQL RDBMS that supports SQL over semistructured data while applying the schema-on-read approach. In addition, Snowflake automatically converts the data into columnar storage, which is better suited for analytical workloads.
Working with JSON, XML, and AVRO
Working with ORC and Parquet
Supported File Formats
Snowflake supports many popular data formats. Table 9-1 lists the semistructured file formats that Snowflake can ingest. For all of them, Snowflake provides the following:
- A storage engine that supports the most common formats and optimizes how they are stored internally
- Flexible schema data types and the ability to track changes
- SQL access to this data
Table 9-1. Snowflake-Supported File Formats
Format | Description |
---|---|
JSON | JSON is a lightweight data-interchange format based on a subset of JavaScript. |
AVRO | AVRO is a data serialization format that stores data in a compact binary form along with its schema. |
ORC | Optimized Row Columnar (ORC) is a column-oriented format originally developed for Apache Hive. |
Parquet | Parquet is a widely used column-oriented format that came from the Hadoop ecosystem. |
XML | Extensible Markup Language (XML) is a markup language for encoding documents in a format that is both human- and machine-readable. |
Advanced Data Types
In Snowflake, the primary universal data type is VARIANT. You use it when working with semistructured data such as XML, JSON, AVRO, Parquet, and ORC. For efficiency and performance, the Snowflake engine stores semistructured data in a binary, column-oriented, compressed representation.
This process is completely transparent to the end user. The VARIANT type is a universal container that can store values of other types, including OBJECT and ARRAY. There is a restriction on the maximum size of a value: in compressed form, it must not exceed 16 MB.
Any data type in Snowflake can be converted to VARIANT. The database uses explicit and implicit type conversions. For explicit conversion, use the function TO_VARIANT(<expr>) or the cast <expr>::VARIANT, where <expr> is an expression of any data type. Implicit conversion happens when you do not request a conversion explicitly; for example, when comparing values of different data types, var:json_path >= 7 is implicitly cast to var:json_path >= 7::VARIANT.
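As a quick sketch of both conversion styles (these run in any worksheet and need no tables):

```sql
-- Explicit conversions to VARIANT: the TO_VARIANT() function and the :: cast operator
select to_variant(42)             as int_as_variant,
       'hello'::variant           as string_as_variant,
       to_variant(current_date()) as date_as_variant;
```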
Note
The VARIANT null value is distinct from the SQL NULL value. VARIANT null is a real value that may appear in semistructured data in place of SQL NULL. Use the test function IS_NULL_VALUE to distinguish between them.
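A minimal check that tells the two apart (the JSON literal here is illustrative):

```sql
-- val:a holds a VARIANT null (a real JSON null); a missing key yields SQL NULL,
-- for which IS_NULL_VALUE itself returns NULL
select is_null_value(parse_json('{"a": null}'):a)       as variant_null,
       is_null_value(parse_json('{"a": null}'):missing) as sql_null;
```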
OBJECT is a collection of key-value pairs, where each key is a nonempty string and each value has the VARIANT type.
ARRAY is an array whose index is an integer (from 0 to 2^31-1) and whose values have the VARIANT type.
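Both types can be built directly in SQL with the constructor functions, for example:

```sql
select object_construct('city', 'New York', 'zip', 10001) as obj,  -- an OBJECT
       array_construct(1, 'two', 3.0)                     as arr,  -- an ARRAY
       array_construct(1, 'two', 3.0)[0]                  as first_element;
```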
The typical workflow for semistructured data in Snowflake is as follows:
- 1.
Create a file format and load the file into Snowflake.
- 2.
Create a table with a column of type VARIANT, OBJECT, or ARRAY.
- 3.
Parse the JSON or XML using Snowflake SQL extension functions, e.g., PARSE_JSON or PARSE_XML.
- 4.
Extract values from the structure and determine the data types using specific SQL functions, e.g., FLATTEN or GET.
- 5.
Convert the structure to a lateral view using the LATERAL keyword.
- 6.
Work with the relational views as usual.
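A minimal sketch of steps 2 through 6 (the table name and JSON document are hypothetical):

```sql
-- Step 2: a table with a VARIANT column
create or replace table raw_json (val variant);

-- Step 3: parse a JSON document into the table
insert into raw_json
select parse_json('{"city": "New York", "zips": [10001, 10002]}');

-- Steps 4-6: extract attributes and explode the array into relational rows
select r.val:city::string as city,
       z.value::int       as zip
from raw_json r,
     lateral flatten(input => r.val:zips) z;
```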
Working with XML
One of the most common formats for exchanging data between companies is XML.
It is widely used in the enterprise world as the main format for exchanging information between corporate services.
For instance, let's use an open database called US New York Demographic Statistics, which provides statistical information about the residents of New York. Using a specific request to this database, we can download a sample in XML format.
Example of XML
- A message field that contains basic response information
- A summary of the request, including the following:
  - A response tag, including the row tag
  - A row tag, which contains attribute tags
  - Many attribute tags with data
Snowflake allows you to load such data directly into the database, applying encryption on the fly, and provides a set of functions that extends standard SQL, making it easy to work with the structure of XML documents.
Using Snowflake SQL for XML
- 1.
Log into your Snowflake account and choose the Worksheets tab.
- 2.
Choose your database, schema, and warehouse using session commands, as follows:
use warehouse "COMPUTE_WH_TEST";
use "DEMO_DB"."PUBLIC";
- 3.
Create a table called demo_xml with a VARIANT attribute by using a DDL statement:
create or replace table demo_xml (val variant);
- 4.
Download the XML file onto your computer using the request in the previous link.
- 5.
Open the Snowflake UI and choose the Databases tab. Click the table and then click the Load Data button. See Figure 9-1.
- 6.
In the window that opens, select your Snowflake warehouse, as shown in Figure 9-2.
- 7.
Choose your XML file and upload it into the Snowflake cloud. See Figure 9-3.
- 8.
Create a new file format for the table. See Figure 9-4.
- 9.
Choose the FF_XML format. See Figure 9-5.
- 10.
Load the sample file into Snowflake using the default load options. See Figure 9-6.
- 11.
Check the data by using a SQL statement. See Figure 9-7.
select * from demo_xml;
- 12.
Try to access the root element using the $ operator. See Figure 9-8.
select val:"$" from demo_xml;
- 13.
Try to extract the row attribute using the XMLGET function. See Figure 9-9.
select xmlget(val, 'row', 0):"$" from demo_xml;
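XMLGET calls can be nested to drill deeper into the document. For instance, assuming the outer row element nests further row elements with child tags such as jurisdiction_name (the tag names here are illustrative):

```sql
-- Drill into the first nested <row> element and read one child tag as a string
select xmlget(xmlget(xmlget(val, 'row', 0), 'row', 0),
              'jurisdiction_name'):"$"::string as jurisdiction
from demo_xml;
```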
Extracting Values from an Array of the Table
Creating a View Based on Extracting Values from the Array of the Table
Extracting Values from the Array of the Table
In the view stats_by_zip_with_neighborhoods, we combined a regular relational table and a view based on XML data, applying the schema and extracting only the necessary attributes on the fly. This view can easily be connected to a BI tool.
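A sketch of how such a combination can be assembled (the view bodies, tag names, and the neighborhoods table here are illustrative, not the exact definitions used above):

```sql
-- A relational view over the XML rows: flatten the nested <row> elements
create or replace view stats_by_zip as
select xmlget(r.value, 'jurisdiction_name'):"$"::string as zip,
       xmlget(r.value, 'count_participants'):"$"::int   as participants
from demo_xml,
     lateral flatten(input => xmlget(val, 'row', 0):"$") r;

-- Join the XML-backed view with a regular relational table
create or replace view stats_by_zip_with_neighborhoods as
select n.neighborhood, s.zip, s.participants
from stats_by_zip s
join neighborhoods n on n.zip = s.zip;
```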
Working with JSON
These days, JSON is the most popular format for exchanging data. Let's take a look at how Snowflake works with JSON, using a sample of financial data from a provider called World Trading Data. Perform a request to find out information about the companies Apple, Microsoft, and HSBC Holdings.
Example of JSON Data from NASDAQ
- A message field that contains basic response information
- A summary of the request, including the following:
  - The attribute symbols_requested
  - The attribute symbols_returned
  - The attribute data, which is a container for the data
Using Snowflake SQL for JSON
- 1.
Log into your Snowflake account and choose Worksheets.
- 2.
Choose your database, schema, and warehouse using session commands, as follows:
use warehouse "COMPUTE_WH_TEST";
use "DEMO_DB"."PUBLIC";
- 3.
Create the table stock_json by using the following DDL statement:
create or replace table stock_json (val variant);
val is a field with the type VARIANT.
- 4.
Insert the sample JSON into Snowflake's table by using the parse_json function:
insert into stock_json select parse_json('<JSON>');
Replace the <JSON> placeholder with the JSON retrieved earlier before executing.
- 5.
Check the data in the table. See Figure 9-12.
select * from stock_json;
- 6.
Use the notation <field>:<attribute>[::type] to extract data from a specific attribute, as shown in Listing 9-6. See Figure 9-13.
Querying JSON by Using SQL
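A sketch of this notation against the response described earlier (the attribute names follow that sample; the name field inside data is an assumption):

```sql
select val:symbols_requested::int as symbols_requested,
       val:symbols_returned::int  as symbols_returned,
       val:data[0]:name::string   as first_company  -- first element of the data array
from stock_json;
```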
- 7.
For extracting nested elements from the array, use the built-in table function FLATTEN, as shown in Listing 9-7.18 See Figure 9-14.
table(flatten(<array>:<elements>))
select f.*from stock_json s,table(flatten(val:data)) f;Listing 9-7Extracting Elements of Arrays of JSON by Using SQL
In Listing 9-8, in the query, FLATTEN extract column names from val:data of the JSON document and converts it into rows.
- 8.
To extract nested elements from the array, use the built-in table function FLATTEN: table(flatten(<array>:<elements>)). See Figure 9-15.
Extracting Elements of Arrays in JSON by Using SQL
Note
To count the number of elements in an array, you can use the function array_size(<array>:<elements>).
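For example, against the sample loaded above (data being the array attribute of the response, which should hold one element per requested company):

```sql
select array_size(val:data) as companies_returned
from stock_json;
```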
Working with AVRO
An AVRO file stores serialized data in a compact binary form together with its schema, which is defined in JSON. It is often used as a data transport format with Apache Kafka.
- 1.
Create a new stage together with a new AVRO file format.
- 2.
Upload the AVRO file into the stage in Snowflake.
- 3.
Create a target table.
- 4.
Copy the data from the file into the target table.
- 5.
Query the data in the table using the Snowflake SQL extensions.
- 6.
To perform these steps, you can use the Snowflake UI or commands.
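These steps can be sketched in SQL as follows (the stage, format, and table names are illustrative, and the symbol attribute is an assumption; the PUT of the local file is issued from SnowSQL):

```sql
-- Step 1: file format and stage
create or replace file format ff_avro type = 'avro';
create or replace stage avro_stage file_format = ff_avro;

-- Step 3: target table with a single VARIANT column
create or replace table stock_avro (val variant);

-- Step 2 happens outside the worksheet, e.g.:
-- put file:///tmp/stock_sample2.avro @avro_stage;

-- Step 4: each AVRO record becomes one VARIANT row
copy into stock_avro from @avro_stage;

-- Step 5: query with the SQL extensions
select val:symbol::string from stock_avro;
```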
Additionally, you can use the AVRO tools, a Java package of utilities for working with the AVRO format, including serializing JSON files using AVRO schemas.
Working with AVRO
- 1.
On your local computer, create a new JSON sample file and save it as stock_sample2.json, as shown in Listing 9-9.
JSON Sample File
- 2.
Create an Avro Schema for this sample file and save it as stock_sample2.avsc, as shown in Listing 9-10.
AVRO Schema File
- 3.
Download the latest version of the AVRO tools and generate an AVRO sample file, as shown in Listing 9-11.
AVRO File Generation
- 4.
Create the target table and the stage for the AVRO file, and save the script as meta_avro.sql, as shown in Listing 9-12.
Creating Metadata for Loading an AVRO File
- 5.
Run the script.
- 6.
Create the script for uploading the AVRO data file. Save the script as put_avro_file.sql, as shown in Listing 9-13.
Uploading the Data and Copying It into the Target Table
- 7.
Upload the file into the Snowflake cloud.
Requesting the Data Loaded from an AVRO File
Working with Parquet
A Parquet file is a compressed, column-oriented binary file. It is used to store big data for analytical workloads.
- 1.
Create a new stage together with a new Parquet file format.
- 2.
Upload the Parquet file into the stage in Snowflake.
- 3.
Create a target table.
- 4.
Copy the data from the file in the stage into the target table, mapping the fields.
- 5.
Query the data in the table.
Use a similar approach for working with ORC files.
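In SQL, the field mapping in step 4 is done by selecting from the staged file, where $1 refers to the whole row as a VARIANT (the names and attributes below are illustrative):

```sql
create or replace file format ff_parquet type = 'parquet';
create or replace stage parquet_stage file_format = ff_parquet;
create or replace table stock_parquet (symbol string, price double);

-- $1:<field> maps a Parquet column to a table column
copy into stock_parquet
from (select $1:symbol::string,
             $1:price::double
      from @parquet_stage);
```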
Working with Parquet
Let’s look at how to work with Parquet in Snowflake.
- 1.
On your local computer, create a new CSV sample file and save it as stock_sample3.csv, as shown in Listing 9-15.
CSV Sample File
- 2.
Let’s install the necessary libraries.
- 3.
Make a simple Python script that reads the CSV file and writes it in Parquet format. Save the file as csv_to_parquet.py, as shown in Listing 9-16.
Transforming Data from CSV to Parquet
- 4.
Create the target table and the stage for the Parquet file. Save the script as meta_parquet.sql, as shown in Listing 9-17.
Creating Metadata for Loading the Parquet File
- 5.
Run the script.
- 6.
Create the script for uploading the Parquet file. Save the script as put_parquet_file.sql, as shown in Listing 9-18.
Uploading the Data and Copying It into the Target Table
- 7.
Run the script for uploading and checking the result. See Figure 9-17.
Summary
In this chapter, we briefly covered how Snowflake can work with different data formats. Moreover, you learned about which semistructured data formats are supported in Snowflake and saw how this is done in practice by running the examples with JSON, XML, AVRO, and Parquet.
In the next chapter, you will learn about Snowflake’s data sharing capabilities.