Working with External Data Sources

Apache Spark relies on a big data pipeline to supply its data. The pipeline starts with source systems, and ingesting data from those systems can be arbitrarily complex for the following reasons:

Nature of the data (relational or non-relational)
Dirty data (yes, it's more the rule than the exception)
Source data at a different level of normalization (SAP data, for example, has an extremely high degree of normalization)
Lack of consistency in the data (data needs to be harmonized so that it speaks the same language)

In this chapter, we will explore how Apache Spark connects to various data sources. This chapter is divided into the following recipes:

Loading data from the local filesystem
Loading data from HDFS
Loading data from Amazon S3
Loading data from Apache Cassandra
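
As a rough preview of the first three recipes, the sketch below shows how a file-based source is usually identified by the scheme of its URI. The host name, port, bucket, and file paths are hypothetical placeholders, and `storage_scheme` is an illustrative helper rather than part of Spark. Cassandra is left out because it is typically reached through a separate connector instead of a file path.

```python
from urllib.parse import urlparse

# Placeholder locations: the host name, port, bucket, and file paths are
# hypothetical and only illustrate the URI shape for each storage system.
SOURCES = {
    "local filesystem": "file:///tmp/words.txt",
    "HDFS": "hdfs://namenode:8020/user/hduser/words.txt",
    "Amazon S3": "s3a://example-bucket/words.txt",
}


def storage_scheme(uri: str) -> str:
    """Return the URI scheme, which identifies the storage system."""
    return urlparse(uri).scheme


for name, uri in SOURCES.items():
    print(f"{name}: scheme={storage_scheme(uri)}")
    # With a running SparkSession named `spark` (assumed, not created here),
    # each URI could be read in the same way:
    #   lines = spark.read.text(uri)
```

The read call looks the same for all three locations; only the scheme and the cluster configuration behind it (for example, S3 credentials and the S3 filesystem libraries) differ.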
