Working with External Data Sources

Apache Spark depends on the big data pipeline to supply its data. The pipeline starts with source systems, and ingesting data from them can be arbitrarily complex for the following reasons:

  • Nature of the data (relational, non-relational)
  • Dirty data (yes, it's more the rule than the exception)
  • Source data at different levels of normalization (SAP data, for example, has an extremely high degree of normalization)
  • Lack of consistency in the data (data needs to be harmonized so that it speaks the same language)

In this chapter, we will explore how Apache Spark connects to various data sources. This chapter is divided into the following recipes:

  • Loading data from the local filesystem
  • Loading data from HDFS
  • Loading data from Amazon S3
  • Loading data from Apache Cassandra
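
As a rough orientation before the recipes, the first three sources are usually distinguished by the URI scheme passed to Spark's reader. The sketch below is a minimal illustration, not the book's code: `source_uri`, `read_lines`, and the `SCHEMES` table are hypothetical names, and the choice of the `s3a` scheme for Amazon S3 is an assumption that depends on the Hadoop libraries installed in your environment.

```python
# Hypothetical mapping (an assumption, not from the chapter): storage kind
# to the URI scheme commonly used with Spark's Hadoop-based readers.
SCHEMES = {"local": "file", "hdfs": "hdfs", "s3": "s3a"}


def source_uri(kind: str, path: str, authority: str = "") -> str:
    """Build a URI such as hdfs://namenode:8020/data/words.

    `authority` is host:port for HDFS or the bucket name for S3;
    it stays empty for the local filesystem.
    """
    if kind not in SCHEMES:
        raise ValueError(f"unsupported source kind: {kind}")
    # Strip leading slashes so exactly one slash follows the authority.
    return f"{SCHEMES[kind]}://{authority}/{path.lstrip('/')}"


def read_lines(spark, uri: str):
    """Read a text source into a DataFrame.

    `spark` is assumed to be an existing SparkSession created elsewhere.
    """
    return spark.read.text(uri)
```

For example, `source_uri("hdfs", "/data/words", "namenode:8020")` yields a URI that could be handed to `read_lines`. Apache Cassandra is typically reached through a separate connector rather than a path URI, so it is left out of this sketch.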