AWS Data Lake architecture

Let's look at the data lake architecture built with the AWS Data Lake solution. The services provided by a data lake can be grouped into the following four categories:

  • Managed ingestion to onboard data from various sources and in any format
  • Centralized storage that can scale with business needs
  • Processing and analysis at big data scale in various programming languages
  • Governance and security for your data packages

The next diagram shows the overall architecture that will form a data lake in AWS. Let's review each feature category in detail:


Figure 7.1: AWS data lake architecture

Managed data ingestion to AWS Data Lake

Amazon has several tools that can ingest data into S3 and Redshift; I will discuss the most common options here: Direct Connect, Snowball, and Kinesis. Let's review each of these options at a high level:

  • Direct Connect: With Direct Connect, you can establish private connectivity between AWS and your enterprise data center, providing an easy way to move data files from your applications to the S3 storage layer of your data lake
  • Snowball: Snowball (also known as AWS Import/Export Snowball) lets you quickly import hundreds of terabytes of data into AWS using Amazon-provided secure appliances
  • Kinesis and Kinesis Firehose: The Kinesis services enable you to build custom applications that process or analyze streaming data; Kinesis Firehose can additionally deliver streaming data directly into S3
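
As a concrete sketch of the Kinesis option, the snippet below shapes an application event into the `Data`/`PartitionKey` record structure that Kinesis `put_record` calls expect. The stream name and event fields are illustrative assumptions, and the actual boto3 call is shown only as a comment:

```python
import json

def to_kinesis_record(event: dict, partition_field: str) -> dict:
    """Shape an application event into the Data/PartitionKey structure
    used by Kinesis put_record calls (field names are illustrative)."""
    return {
        "Data": json.dumps(event).encode("utf-8"),
        "PartitionKey": str(event[partition_field]),
    }

# With boto3 (not executed here), the record could be sent as:
#   kinesis = boto3.client("kinesis")
#   kinesis.put_record(StreamName="clickstream", **to_kinesis_record(evt, "user_id"))

record = to_kinesis_record({"user_id": 42, "action": "login"}, "user_id")
print(record["PartitionKey"])  # "42"
```

Partitioning by a stable field such as a user ID keeps related events on the same shard, which preserves their ordering within the stream.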

Centralized data storage for AWS Data Lake

The AWS Data Lake architecture is based on S3 as the primary persistent data store. This separation of compute and storage lets you resize compute on demand with no data loss. S3 is also highly durable and low-cost, with several connectivity options. You can share data across multiple data lakes by sharing S3 buckets.

Optionally, if your use case demands SQL-based access, you can add Redshift to the storage layer. It is a fast and fully managed petabyte-scale data warehouse that costs less than $1,000 per terabyte per year.
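
One common convention for laying out the centralized S3 store is Hive-style date partitioning, which lets engines such as EMR and Athena prune data by date. The table, bucket, and file names below are illustrative assumptions, and the upload call is shown only as a comment:

```python
from datetime import date

def lake_key(table: str, dt: date, filename: str) -> str:
    """Build a Hive-style partitioned S3 object key
    (e.g. orders/dt=2018-03-01/part-0000.json)."""
    return f"{table}/dt={dt.isoformat()}/{filename}"

key = lake_key("orders", date(2018, 3, 1), "part-0000.json")
print(key)  # orders/dt=2018-03-01/part-0000.json

# Upload (not executed here) with:
#   boto3.client("s3").put_object(Bucket="my-data-lake", Key=key, Body=data)
```

Because the date is encoded in the key itself, a query restricted to one day only needs to read the objects under that single prefix.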

Processing and analyzing data within the AWS Data Lake

For data processing and analysis in an AWS Data Lake, the following services are available:

  • QuickSight: A fast, cloud-powered business intelligence (BI) service and the theme of this book.
  • Machine learning: Amazon Machine Learning provides visualization tools and wizards for creating machine learning models and running them against your big data.
  • EMR: Amazon EMR provides a distributed compute framework that makes it an easy, fast, and cost-effective way to process data on S3 at scale and on demand. AWS also offers spot instances at a lower cost; these are best suited for application tests and use cases without hard SLAs.
  • Lambda: AWS Lambda allows you to run code without provisioning or managing servers, and can be triggered on the arrival of data in S3 or by streaming sources such as Kinesis.
  • Athena: A query service that makes it easy to analyze data directly from files in S3 using standard SQL statements. Athena is serverless, so there is no additional infrastructure to provision, which makes it really stand out.
  • Glue: A service that enables ETL and makes it easy to transform and move data from S3 to your consumers. It is integrated with S3, Redshift, and other JDBC-compliant data sources, and auto-suggests schemas and transformations, which improves developer productivity. You can also view and edit the code it generates in languages such as Python, running on Apache Spark, with the ability to share the code with your peers. Glue schedules the ETL jobs and auto-provisions and scales the infrastructure based on the job requirements.
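
To make the Lambda option concrete, the sketch below is a minimal handler for an S3 put notification, which extracts the bucket and key of each newly arrived object; the actual processing step is elided. The event shape follows the standard S3 notification format, and the bucket and key names are illustrative assumptions:

```python
def handler(event, context):
    """Minimal Lambda handler for an S3 put notification: collect the
    bucket and key of each newly arrived object (processing elided)."""
    arrivals = []
    for rec in event.get("Records", []):
        s3 = rec["s3"]
        arrivals.append((s3["bucket"]["name"], s3["object"]["key"]))
    return arrivals

# A sample S3 notification event, as Lambda would receive it:
sample_event = {
    "Records": [
        {"s3": {"bucket": {"name": "my-data-lake"},
                "object": {"key": "orders/dt=2018-03-01/part-0000.json"}}}
    ]
}
print(handler(sample_event, None))
```

Wiring this handler to an S3 bucket notification means every new file landing in the lake triggers processing automatically, with no servers to manage.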

Governing and securing the AWS Data Lake

Governance is key in building a managed data lake, and this is an area where the AWS solution has recently invested by introducing a catalog and a web interface. Let's review the key services used for governance:

  • Catalog and user interface: The AWS Data Lake solution comes with a data catalog that is searchable through the web application. The catalog can be populated via the web interface or the API with information about the various packages in the data lake, and this information is stored in DynamoDB. Once datasets are registered, they are automatically indexed in Elasticsearch and become searchable through the web interface.
  • User and access management: The AWS Data Lake solution provides a web interface to manage users for the data lake. As a data lake administrator, you can decide which users get access to the data lake and at what level (member or administrator). You can also grant API access to specific users.
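
To illustrate the catalog idea, the sketch below builds a package entry in the string-typed attribute form that DynamoDB's low-level API uses. The attribute names, table name, and package values are illustrative assumptions, not the solution's actual schema, and the `put_item` call is shown only as a comment:

```python
from datetime import datetime, timezone

def catalog_item(package_id: str, name: str, owner: str, s3_prefix: str) -> dict:
    """Illustrative catalog entry for a data lake package, in the
    attribute-value form DynamoDB's low-level API expects."""
    return {
        "package_id": {"S": package_id},
        "name": {"S": name},
        "owner": {"S": owner},
        "s3_prefix": {"S": s3_prefix},
        "created_at": {"S": datetime.now(timezone.utc).isoformat()},
    }

item = catalog_item("pkg-001", "orders", "data-team", "s3://my-data-lake/orders/")

# Stored (not executed here) with:
#   boto3.client("dynamodb").put_item(TableName="data-lake-packages", Item=item)
```

Keeping entries in this shape means the same record can be written to DynamoDB and also fed to the search index that backs the web interface.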

This concludes the AWS Data Lake architecture; next, we will build a real-life use case using AWS Data Lake.
