Chapter 7. Big Data Analytics Mini Project

Modern data architectures are moving toward a data lake: a solution that can ingest data from many sources and transform and analyze it at big data scale. In the last few chapters, we saw the various AWS components that support big data analytics; Amazon now offers a data lake solution that packages the most commonly needed components together with a web application to jump-start a data lake build-out. In this chapter, we will solve a real-life use case by leveraging the AWS Data Lake solution, and we will cover the following topics:

  • Overview of AWS Data Lake solution
  • AWS Data Lake architecture
  • A mini project on AWS Data Lake
  • Advanced AWS Data Lake features

Overview of AWS Data Lake solution

A data lake is an architectural pattern that has become a popular way to store and analyze data, because it allows an enterprise to easily ingest data in any format, structured or unstructured, into a central repository. A modern data lake provides more agility and flexibility than traditional data management systems.

In Chapter 2, Exploring Any Data, we looked at various services that make up an AWS big data ecosystem. These services are building blocks for a data lake and are broadly classified into four major categories: collect, store, analyze, and orchestrate.

To jump-start the build-out of a new data lake, AWS offers a data lake solution with the key building blocks already packaged and deployed, along with an intuitive web application. This pre-packaged implementation allows customers to quickly realize the data lake concept and puts a real web interface in front of the data lake for registering new data feeds with metadata, and for cataloging, searching, and provisioning them. The following are the key features provided by the AWS Data Lake solution:

  • Reference implementation: The AWS Data Lake solution provides an out-of-the-box implementation, including metadata management, that you can customize to your project's needs.
  • User interface: The solution comes packaged with a web-based user interface hosted on S3. Use it to manage data lake users, policies, and packages. You can also use it to search for data packages and create manifests for provisioning them to target databases.
  • APIs: The AWS Data Lake solution comes with APIs and a CLI to automate integration with other AWS services and to help onboard data into, or extract data from, the data lake. The solution also includes per-user API key management.
  • Central storage layer: Data lake storage uses an Amazon S3 bucket, secured with AWS Key Management Service (KMS) to encrypt data at rest.
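To make the central storage layer concrete, the following sketch builds the default-encryption configuration that S3 expects for SSE-KMS. The bucket name and KMS key alias are hypothetical placeholders, and the boto3 call itself (which requires AWS credentials) is shown in comments rather than executed:

```python
# Sketch: default SSE-KMS encryption for a data lake bucket.
# The bucket name and key alias below are illustrative assumptions,
# not values defined by the AWS Data Lake solution.
DATA_LAKE_BUCKET = "my-data-lake-bucket"  # hypothetical bucket name
KMS_KEY_ALIAS = "alias/data-lake-key"     # hypothetical KMS key alias

def sse_kms_configuration(kms_key_id):
    """Build the ServerSideEncryptionConfiguration payload that
    s3.put_bucket_encryption expects for default SSE-KMS encryption."""
    return {
        "Rules": [
            {
                "ApplyServerSideEncryptionByDefault": {
                    "SSEAlgorithm": "aws:kms",
                    "KMSMasterKeyID": kms_key_id,
                }
            }
        ]
    }

config = sse_kms_configuration(KMS_KEY_ALIAS)

# With AWS credentials configured, applying it would look like:
# import boto3
# s3 = boto3.client("s3")
# s3.put_bucket_encryption(
#     Bucket=DATA_LAKE_BUCKET,
#     ServerSideEncryptionConfiguration=config,
# )
```

With default encryption in place, every object written to the bucket is encrypted at rest with the named KMS key, so individual producers do not need to pass encryption headers on each upload.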

Data lake core concept - package

Before we get further into the AWS Data Lake solution, we must be familiar with a core concept called the package. A package is a logical unit that groups all files sharing the same structure. For example, say your web application delivers user web-click data daily as a TSV file stored under a folder in S3. To use this dataset in the AWS Data Lake, you first create a package with a name such as user web clicks and define its structure (fields with data types) and the location of the files.
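The package concept above can be sketched as a small metadata record. This is only an illustration of the idea, not the solution's actual metadata format: the field names, S3 path, and validation helper are all assumptions made for the example.

```python
# Sketch of a package definition for the daily user web clicks feed.
# The schema fields and the S3 location are illustrative assumptions,
# not the AWS Data Lake solution's actual metadata format.
user_web_clicks_package = {
    "name": "user web clicks",
    "format": "TSV",
    "location": "s3://my-data-lake-bucket/web-clicks/",  # hypothetical path
    "fields": [
        {"name": "user_id",    "type": "string"},
        {"name": "page_url",   "type": "string"},
        {"name": "click_time", "type": "timestamp"},
    ],
}

def validate_package(package):
    """Check that a package declares a name, a file location, and at
    least one field with both a name and a data type."""
    if not package.get("name") or not package.get("location"):
        return False
    fields = package.get("fields", [])
    return bool(fields) and all("name" in f and "type" in f for f in fields)
```

Because every file in the package shares this structure, downstream consumers can treat the whole folder of daily TSV files as a single queryable dataset.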
