Nowadays, companies buy and use many systems from different vendors. Data engineers eventually face the problem of supporting multiple data formats when building analytical solutions and maintaining data sources, while data scientists struggle to deliver valuable insight from semistructured data.
Historically, loading semistructured data into a relational repository required converting it to another format first. With the NoSQL revolution, such databases came to be used alongside relational databases, and ultimately relational engines began to support semistructured data natively.
Another concept that came from big data is the so-called schema-on-read approach: you first load the data as it is, without thinking about the schema, and define the schema later, once the data is already in the database and you start working with it. Snowflake is a full ANSI SQL RDBMS that supports SQL over semistructured data while applying the schema-on-read approach. In addition, Snowflake automatically converts the data into columnar storage, which is better suited for analytical workloads.
Working with JSON, XML, and AVRO
Working with ORC and Parquet
Supported File Formats
Snowflake supports many popular data formats. Table 9-1 lists the semistructured file formats that Snowflake can ingest. For all of them, Snowflake provides the following:
- A storage engine that supports the most common formats and optimizes how they are stored internally
- Flexible schema data types and the ability to track changes
- SQL access to this data
Table 9-1. Snowflake-Supported File Formats
Format | Description |
---|---|
JSON | JSON is a lightweight data-interchange format based on a subset of JavaScript. |
AVRO | AVRO is a data serialization format that stores data in a compact binary form along with its schema. |
ORC | Optimized Row Columnar (ORC) is a column-oriented format originally developed for Apache Hive. |
Parquet | Parquet is a widely used column-oriented format that came from the Hadoop ecosystem. |
XML | Extensible Markup Language (XML) is a markup language for encoding documents in a format that is both human- and machine-readable. |
Advanced Data Types
In Snowflake, the primary universal data type is VARIANT. You use it when working with semistructured data such as XML, JSON, AVRO, Parquet, and ORC. For efficiency and performance, the Snowflake engine stores semistructured data in a binary, column-oriented, compressed representation.
This process is completely transparent to the end user. The VARIANT type is a universal container that can store values of other types, including OBJECT and ARRAY. There is a restriction on the maximum size of a value: in compressed form, it must not exceed 16 MB.
Any data type in Snowflake can be converted to VARIANT. The database uses explicit and implicit type conversions. For explicit conversion, use the function TO_VARIANT(<expr>) or the cast <expr>::VARIANT, where <expr> is an expression of any data type. Implicit conversion happens when you do not request a conversion explicitly; for example, when comparing values of different data types, var:json_path >= 7 is implicitly cast to var:json_path >= 7::VARIANT.
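As a quick sketch of both conversion styles (these run in any worksheet and need no tables):

```sql
-- Explicit conversions to VARIANT: the TO_VARIANT() function and the :: cast operator
select to_variant(42)             as int_as_variant,
       'hello'::variant           as string_as_variant,
       to_variant(current_date()) as date_as_variant;
```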
Note
The VARIANT null value is distinct from the SQL NULL value. VARIANT null is a real value that may appear in semistructured data in place of SQL NULL. Use the test function IS_NULL_VALUE to distinguish between them.
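A minimal check that tells the two apart (the JSON literal here is illustrative):

```sql
-- val:a holds a VARIANT null (a real JSON null); a missing key yields SQL NULL,
-- for which IS_NULL_VALUE itself returns NULL
select is_null_value(parse_json('{"a": null}'):a)       as variant_null,
       is_null_value(parse_json('{"a": null}'):missing) as sql_null;
```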
OBJECT is a collection of key-value pairs, where each key is a nonempty string and each value has the VARIANT type.
ARRAY is an array whose index is an integer (from 0 to 2^31-1) and whose values have the VARIANT type.
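Both types can be built directly in SQL with the constructor functions, for example:

```sql
select object_construct('city', 'New York', 'zip', 10001) as obj,  -- an OBJECT
       array_construct(1, 'two', 3.0)                     as arr,  -- an ARRAY
       array_construct(1, 'two', 3.0)[0]                  as first_element;
```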
The typical workflow for semistructured data in Snowflake is as follows:
- 1.
Create a file format and load the file into Snowflake.
- 2.
Create a table with a column of type VARIANT, OBJECT, or ARRAY.
- 3.
Parse the JSON or XML using Snowflake SQL extension functions, e.g., PARSE_JSON or PARSE_XML.
- 4.
Extract values from the structure and determine the data types using specific SQL functions, e.g., FLATTEN or GET.
- 5.
Convert the structure to a lateral view using the LATERAL keyword.
- 6.
Work with the relational views as usual.
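A minimal sketch of steps 2 through 6 (the table name and JSON document are hypothetical):

```sql
-- Step 2: a table with a VARIANT column
create or replace table raw_json (val variant);

-- Step 3: parse a JSON document into the table
insert into raw_json
select parse_json('{"city": "New York", "zips": [10001, 10002]}');

-- Steps 4-6: extract attributes and explode the array into relational rows
select r.val:city::string as city,
       z.value::int       as zip
from raw_json r,
     lateral flatten(input => r.val:zips) z;
```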
Working with XML
One of the most common formats for exchanging data between companies is XML.
It is widely used in the enterprise world as the main format for exchanging information between corporate services.
For instance, let's use an open database called US New York Demographic Statistics, which provides statistical information about the residents of New York. Using a specific request to this database, we can download a sample in XML format.
Example of XML
- A message field that contains basic response information
- A summary of the request, including the following:
  - A response tag, including the row tag
  - A row tag, which contains attribute tags
  - Many attribute tags with data
Snowflake allows you to load such data directly into the database, applying encryption on the fly, and provides a set of functions that extends standard SQL, making it easy to work with the structure of XML documents.
Using Snowflake SQL for XML
- 1.
Log into your Snowflake account and choose the Worksheets tab.
- 2.
Choose your database, schema, and warehouse using session commands, as follows:
use warehouse "COMPUTE_WH_TEST";
use "DEMO_DB"."PUBLIC";
- 3.
Create a table called demo_xml with a VARIANT attribute by using a DDL statement:
create or replace table demo_xml (val variant);
- 4.
Download the XML file onto your computer using the request in the previous link.
- 5.
Open the Snowflake UI and choose the Databases tab. Click the table and then click the Load Data button. See Figure 9-1.
- 6.
In the window that opens, select your Snowflake warehouse, as shown in Figure 9-2.
- 7.
Choose your XML file and upload it into the Snowflake cloud. See Figure 9-3.
- 8.
Create a new file format for the table. See Figure 9-4.
- 9.
Choose the FF_XML format. See Figure 9-5.
- 10.
Load the sample file into Snowflake using the default load options. See Figure 9-6.
- 11.
Check the data by using a SQL statement. See Figure 9-7.
select * from demo_xml;
- 12.
Try to access the root element using the $ operator. See Figure 9-8.
select val:"$" from demo_xml;
- 13.
Try to extract the row attribute using the XMLGET function. See Figure 9-9.
select xmlget(val, 'row', 0):"$" from demo_xml;
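XMLGET calls can be nested to drill deeper into the document. For instance, assuming the outer row element nests further row elements with child tags such as jurisdiction_name (the tag names here are illustrative):

```sql
-- Drill into the first nested <row> element and read one child tag as a string
select xmlget(xmlget(xmlget(val, 'row', 0), 'row', 0),
              'jurisdiction_name'):"$"::string as jurisdiction
from demo_xml;
```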
Extracting Values from an Array of the Table
Creating a View Based on Extracting Values from the Array of the Table
Extracting Values from the Array of the Table
In the view stats_by_zip_with_neighborhoods, we combined a regular relational table and a view based on XML data, applying the schema and extracting only the necessary attributes on the fly. This view can easily be connected to a BI tool.
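A sketch of how such a combination can be assembled (the view bodies, tag names, and the neighborhoods table here are illustrative, not the exact definitions used above):

```sql
-- A relational view over the XML rows: flatten the nested <row> elements
create or replace view stats_by_zip as
select xmlget(r.value, 'jurisdiction_name'):"$"::string as zip,
       xmlget(r.value, 'count_participants'):"$"::int   as participants
from demo_xml,
     lateral flatten(input => xmlget(val, 'row', 0):"$") r;

-- Join the XML-backed view with a regular relational table
create or replace view stats_by_zip_with_neighborhoods as
select n.neighborhood, s.zip, s.participants
from stats_by_zip s
join neighborhoods n on n.zip = s.zip;
```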
Working with JSON
These days, JSON is the most popular format for exchanging data. Let's take a look at how Snowflake works with JSON, using a sample of financial data from a provider called World Trading Data. Perform a request to find out information about the companies Apple, Microsoft, and HSBC Holdings.
Example of JSON Data from NASDAQ
- A message field that contains basic response information
- A summary of the request, including the following:
  - The attribute symbols_requested
  - The attribute symbols_returned
  - The attribute data, which is a container for the data
Using Snowflake SQL for JSON
- 1.
Log into your Snowflake account and choose Worksheets.
- 2.
Choose your database, schema, and warehouse using session commands, as follows:
use warehouse "COMPUTE_WH_TEST";
use "DEMO_DB"."PUBLIC";
- 3.
Create the table stock_json by using the following DDL statement:
create or replace table stock_json (val variant);
val is a field with the type VARIANT.
- 4.
Insert the sample JSON into Snowflake's table by using the parse_json function:
insert into stock_json select parse_json('<JSON>');
Replace the <JSON> placeholder with the JSON retrieved earlier before executing.
- 5.
Check the data in the table. See Figure 9-12.
select * from stock_json;
- 6.
Use the notation <field>:<attribute>[::type] to extract data from a specific attribute, as shown in Listing 9-6. See Figure 9-13.
Querying JSON by Using SQL
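A sketch of this notation against the response described earlier (the attribute names follow that sample; the name field inside data is an assumption):

```sql
select val:symbols_requested::int as symbols_requested,
       val:symbols_returned::int  as symbols_returned,
       val:data[0]:name::string   as first_company  -- first element of the data array
from stock_json;
```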
- 7.
For extracting nested elements from the array, use the built-in table function FLATTEN, as shown in Listing 9-7.18 See Figure 9-14.
table(flatten(<array>:<elements>))
select f.*from stock_json s,table(flatten(val:data)) f;Listing 9-7Extracting Elements of Arrays of JSON by Using SQL
In Listing 9-8, in the query, FLATTEN extract column names from val:data of the JSON document and converts it into rows.
- 8.
To extract nested elements from the array, use the built-in table function FLATTEN: table(flatten(<array>:<elements>)). See Figure 9-15.
Extracting Elements of Arrays in JSON by Using SQL
Note
To count the number of elements in an array, you can use the function array_size(<array>:<elements>).
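For example, against the sample loaded above (data being the array attribute of the response, which should hold one element per requested company):

```sql
select array_size(val:data) as companies_returned
from stock_json;
```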
Working with AVRO
An AVRO file stores serialized data in a compact binary form together with its schema, which is defined in JSON. It is often used as a data transport format with Apache Kafka.
- 1.
Create a new stage together with a new AVRO file format.
- 2.
Upload the AVRO file into the stage in Snowflake.
- 3.
Create a target table.
- 4.
Copy the data from the file into the target table.
- 5.
Query the data in the table using the Snowflake SQL extensions.
- 6.
To perform these steps, you can use the Snowflake UI or commands.
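These steps can be sketched in SQL as follows (the stage, format, and table names are illustrative, and the symbol attribute is an assumption; the PUT of the local file is issued from SnowSQL):

```sql
-- Step 1: file format and stage
create or replace file format ff_avro type = 'avro';
create or replace stage avro_stage file_format = ff_avro;

-- Step 3: target table with a single VARIANT column
create or replace table stock_avro (val variant);

-- Step 2 happens outside the worksheet, e.g.:
-- put file:///tmp/stock_sample2.avro @avro_stage;

-- Step 4: each AVRO record becomes one VARIANT row
copy into stock_avro from @avro_stage;

-- Step 5: query with the SQL extensions
select val:symbol::string from stock_avro;
```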
Additionally, you can use the AVRO tools, a Java package of utilities for working with the AVRO format, including serializing JSON files using AVRO schemas.
Working with AVRO
- 1.
On your local computer, create a new JSON sample file and save it as stock_sample2.json, as shown in Listing 9-9.
JSON Sample File
- 2.
Create an Avro Schema for this sample file and save it as stock_sample2.avsc, as shown in Listing 9-10.
AVRO Schema File
- 3.
Download the latest version of the AVRO tools and generate an AVRO sample file, as shown in Listing 9-11.
AVRO File Generation
- 4.
Create the target table and the stage for the AVRO file, and save the script as meta_avro.sql, as shown in Listing 9-12.
Creating Metadata for Loading an AVRO File
- 5.
Run the script.
- 6.
Create the script for uploading the AVRO data file. Save the script as put_avro_file.sql, as shown in Listing 9-13.
Uploading the Data and Copying It into the Target Table
- 7.
Upload the file into the Snowflake cloud.
Requesting the Data Loaded from an AVRO File
Working with Parquet
A Parquet file is a compressed, column-oriented binary file. It is used to store big data for analytical workloads.
- 1.
Create a new stage together with a new Parquet file format.
- 2.
Upload the Parquet file into the stage in Snowflake.
- 3.
Create a target table.
- 4.
Copy the data from the file in the stage into the target table, mapping the fields.
- 5.
Query the data in the table.
Use a similar approach for working with ORC files.
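In SQL, the field mapping in step 4 is done by selecting from the staged file, where $1 refers to the whole row as a VARIANT (the names and attributes below are illustrative):

```sql
create or replace file format ff_parquet type = 'parquet';
create or replace stage parquet_stage file_format = ff_parquet;
create or replace table stock_parquet (symbol string, price double);

-- $1:<field> maps a Parquet column to a table column
copy into stock_parquet
from (select $1:symbol::string,
             $1:price::double
      from @parquet_stage);
```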
Working with Parquet
Let’s look at how to work with Parquet in Snowflake.
- 1.
On your local computer, create a new CSV sample file and save it as stock_sample3.csv, as shown in Listing 9-15.
CSV Sample File
- 2.
Let’s install the necessary libraries.
- 3.
Make a simple Python script that reads the CSV file and writes it in Parquet format. Save the file as csv_to_parquet.py, as shown in Listing 9-16.
Transforming Data from CSV to Parquet
- 4.
Create the target table and the stage for the Parquet file. Save the script as meta_parquet.sql, as shown in Listing 9-17.
Creating Metadata for Loading the Parquet File
- 5.
Run the script.
- 6.
Create the script for uploading the Parquet file. Save the script as put_parquet_file.sql, as shown in Listing 9-18.
Uploading the Data and Copying It into the Target Table
- 7.
Run the script for uploading and checking the result. See Figure 9-17.
Summary
In this chapter, we briefly covered how Snowflake can work with different data formats. Moreover, you learned about which semistructured data formats are supported in Snowflake and saw how this is done in practice by running the examples with JSON, XML, AVRO, and Parquet.
In the next chapter, you will learn about Snowflake’s data sharing capabilities.