Getting started with AWS Data Pipeline

Creating your own pipeline is a fairly simple process once you get to know the intricacies of the pipeline dashboard. In this section, we will explore the AWS Data Pipeline dashboard, its various functions, and its editor by creating a simple Hello World example pipeline. To start off, here are a few prerequisite steps that you need to complete first, beginning with a simple Amazon S3 bucket for storing all our data pipeline logs.

AWS Data Pipeline is only available in the EU (Ireland), Asia Pacific (Sydney), Asia Pacific (Tokyo), US East (N. Virginia), and US West (Oregon) regions. For the purpose of the scenarios in this chapter, we will be using the US East (N. Virginia) region only.

From the AWS Management Console, launch the Amazon S3 console by either filtering the service name from the Filter option or navigating to this URL: https://s3.console.aws.amazon.com/s3/home?region=us-east-1

Next, select the Create bucket option and provide a suitable value in the Bucket name field. Leave the rest of the fields at their default values and select Create to complete the process.
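If you prefer to script this prerequisite, here is a minimal boto3 (Python SDK) sketch of the same step. The bucket name used here is just a placeholder; S3 bucket names must be globally unique, so substitute your own:

# Minimal sketch: create the Data Pipeline log bucket with boto3.
import boto3

s3 = boto3.client('s3', region_name='us-east-1')

# In us-east-1, no CreateBucketConfiguration (LocationConstraint) is required.
s3.create_bucket(Bucket='my-datapipeline-logs')  # placeholder bucket name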

With the log bucket created, the next prerequisite step involves creating a couple of IAM Roles that AWS Data Pipeline requires in order to access your resources, and that define which particular actions it can perform on them. Since we are going to use the AWS Data Pipeline console for our first pipeline build, Data Pipeline provides two default IAM Roles that you can leverage out of the box:

  • DataPipelineDefaultRole: An IAM Role that grants AWS Data Pipeline access to all your AWS resources, including EC2, IAM, Redshift, S3, SNS, SQS and EMR. You can customize it to restrict the AWS services that Data Pipeline can access. Here is a snippet of the policy that is created:
{ 
    "Version": "2012-10-17", 
    "Statement": [ 
        { 
            "Effect": "Allow", 
            "Action": [ 
                "cloudwatch:*", 
                "datapipeline:DescribeObjects", 
                "datapipeline:EvaluateExpression", 
                "dynamodb:BatchGetItem", 
                "dynamodb:DescribeTable", 
                "dynamodb:GetItem", 
                ... 
                "ec2:RunInstances", 
                "ec2:StartInstances", 
                "ec2:StopInstances", 
                ... 
                "elasticmapreduce:*", 
                "iam:GetInstanceProfile", 
                "iam:GetRole", 
                "iam:GetRolePolicy", 
                ...   
                "rds:DescribeDBInstances", 
                "rds:DescribeDBSecurityGroups", 
                "redshift:DescribeClusters", 
                "redshift:DescribeClusterSecurityGroups", 
                "s3:CreateBucket", 
                "s3:DeleteObject", 
                "s3:Get*", 
                "s3:List*", 
                "s3:Put*", 
                ... 
                "sns:ListTopics", 
                "sns:Publish", 
                "sns:Subscribe", 
                ... 
                "sqs:GetQueue*", 
                "sqs:PurgeQueue", 
                "sqs:ReceiveMessage" 
            ], 
            "Resource": [ 
                "*" 
            ] 
        }, 
        { 
            "Effect": "Allow", 
            "Action": "iam:CreateServiceLinkedRole", 
            "Resource": "*", 
            "Condition": { 
                "StringLike": { 
                    "iam:AWSServiceName": [ 
                        "elasticmapreduce.amazonaws.com", 
                        "spot.amazonaws.com" 
                    ] 
                } 
            } 
        } 
    ] 
}
  • DataPipelineDefaultResourceRole: This Role allows applications, scripts, or code running on the Data Pipeline's resources (EC2/EMR instances) to access your AWS resources:
{ 
    "Version": "2012-10-17", 
    "Statement": [ 
        { 
            "Effect": "Allow", 
            "Action": [ 
                "cloudwatch:*", 
                "datapipeline:*", 
                "dynamodb:*", 
                "ec2:Describe*", 
                "elasticmapreduce:AddJobFlowSteps", 
                "elasticmapreduce:Describe*", 
                "elasticmapreduce:ListInstance*", 
                "elasticmapreduce:ModifyInstanceGroups", 
                "rds:Describe*", 
                "redshift:DescribeClusters", 
                "redshift:DescribeClusterSecurityGroups", 
                "s3:*", 
                "sdb:*", 
                "sns:*", 
                "sqs:*" 
            ], 
            "Resource": [ 
                "*" 
            ] 
        } 
    ] 
} 
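If you are working outside the console (for example, with the AWS CLI or an SDK), these two roles are not created for you automatically. The following boto3 sketch shows one way of creating them yourself; it assumes you have saved complete versions of the two policy documents (the snippets above are truncated) as local JSON files, and the role and policy names simply mirror the console defaults:

# Sketch: create the two default Data Pipeline roles with boto3.
# Assumes complete policy documents are saved locally as
# datapipeline_role_policy.json and resource_role_policy.json.
import json
import boto3

iam = boto3.client('iam')

def trust_policy(service):
    # Trust policy allowing the given AWS service to assume the role
    return json.dumps({
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {"Service": service},
            "Action": "sts:AssumeRole"
        }]
    })

# Role assumed by the AWS Data Pipeline service itself
iam.create_role(RoleName='DataPipelineDefaultRole',
                AssumeRolePolicyDocument=trust_policy('datapipeline.amazonaws.com'))
iam.put_role_policy(RoleName='DataPipelineDefaultRole',
                    PolicyName='DataPipelineDefaultPolicy',
                    PolicyDocument=open('datapipeline_role_policy.json').read())

# Role assumed by the EC2/EMR resources that the pipeline launches
iam.create_role(RoleName='DataPipelineDefaultResourceRole',
                AssumeRolePolicyDocument=trust_policy('ec2.amazonaws.com'))
iam.put_role_policy(RoleName='DataPipelineDefaultResourceRole',
                    PolicyName='DataPipelineDefaultResourcePolicy',
                    PolicyDocument=open('resource_role_policy.json').read())

# EC2 instances need the resource role wrapped in an instance profile
iam.create_instance_profile(InstanceProfileName='DataPipelineDefaultResourceRole')
iam.add_role_to_instance_profile(InstanceProfileName='DataPipelineDefaultResourceRole',
                                 RoleName='DataPipelineDefaultResourceRole')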

With the prerequisites out of the way, let's now move on to creating our very first pipeline:

  1. From the AWS Management Console, filter out Data Pipeline using the Filter option or, alternatively, navigate to the URL provided here: https://console.aws.amazon.com/datapipeline/home?region=us-east-1. Select the Get started now option.
  2. This will bring up the Create Pipeline wizard as displayed. Start by providing a suitable name for the pipeline using the Name field followed by an optional Description.
  3. Next, select the Build using Architect option from the Source field.

AWS Data Pipeline provides different ways of creating pipelines. You can leverage one of the several pre-built templates using the Build using a template option, or opt for a more customized approach by selecting the Import a definition option, where you can create and upload your own data pipeline definitions. Finally, you can use the data pipeline architect mode to drag, drop, and customize your pipeline using a simple, intuitive dashboard, which is what we are going to do in this use case:

  4. Moving on, you can also schedule the runs of your pipeline by selecting the appropriate option provided under the Schedule section. For now, select the On pipeline activation option, as we want our pipeline to start its execution only when it is first activated.
  5. Next, browse and select the correct S3 bucket for storing the pipeline's logs using the S3 location for logs option. This should be the same bucket that was created during the prerequisite section of this scenario.
  6. Optionally, you can also provide your own custom IAM Roles for Data Pipeline by selecting the Custom option provided under the Security/Access section. In this case, we have gone ahead and selected the default IAM Roles.
  7. Once all the required fields are populated, select the Edit in Architect option to continue.
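As a quick aside before we move into the architect: the wizard's create step corresponds to a single Data Pipeline API call. Here is a rough boto3 sketch, with placeholder name and uniqueId values:

# Sketch: create an (empty) pipeline through the API, mirroring the console wizard.
import boto3

dp = boto3.client('datapipeline', region_name='us-east-1')

response = dp.create_pipeline(
    name='HelloWorldPipeline',          # placeholder pipeline name
    uniqueId='hello-world-pipeline-1',  # idempotency token of your choosing
    description='Simple Hello World pipeline'
)
pipeline_id = response['pipelineId']
print(pipeline_id)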

With this step completed, you should see the architect view of your current pipeline as depicted. By default, you will only have a single box called Configuration displayed.

  8. Select the Configuration box to view the various configuration options required by your pipeline to run. This information should be visible in the right-hand side navigation pane under the Others section, as shown in the following screenshot:

You can use this Configuration section to edit your pipeline's Resource Role, Pipeline Log Uri, Schedule Type, and many other such settings.

  9. To add Resources and Activities to your pipeline, select the Add drop-down list as shown. Here, select ShellCommandActivity to get started. We will use this activity to echo a simple Hello World message for starters.
  10. Once the ShellCommandActivity option is selected, you should be able to see its corresponding configuration items in the adjoining navigation pane under the Activities tab.
  11. Type in a suitable Name for your activity. Next, from the Type section, select the Add an optional field drop-down list and select the Command option as shown. In the new Command field, type echo "This is just a Hello World message!".
  12. With the activity in place, the final step left is to provide and associate a resource with the pipeline. The resource, either an EC2 instance or an EMR cluster, is what will execute the ShellCommandActivity (an API equivalent of these steps is sketched a little further below).
  13. To create and associate a resource, from the Activities section, select the Add an optional field option once again and, from the drop-down list, select the Runs On option. Using the Runs On option, you can create and select Resources for executing your pipeline's tasks.
  14. Select the Create new: Resource option to get started. This will create a new resource named DefaultResource1, as depicted in the following screenshot:
  15. Select the newly created resource or, alternatively, select the Resources option from the navigation pane to view and add resource-specific configurations.
  16. Fill in the following information, as depicted in the previous screenshot, in the Resources section of your pipeline:
    • Name: Provide a suitable name for your new resource.
    • Type: Select the Ec2Resource option from the drop-down list.
    • Role/Resource Role: You can choose to provide different IAM Roles; however, I have opted for the default pipeline roles.
    • Instance Type: Type in t1.micro in the adjoining field. If you do not provide or select the instance type, the resource will launch an m1.medium instance by default.
    • Terminate After: Select the appropriate time after which the instance should be terminated. In this case, I have opted to terminate the instance after 10 minutes.

Here's a screenshot of what the final pipeline would look like once the Resources section is filled out:
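For reference, the activity, resource, and configuration we just assembled in the architect can also be expressed as pipeline objects and pushed through the API. The following boto3 sketch is only a rough equivalent of the preceding steps: the object IDs and names are arbitrary, and the field keys (scheduleType, pipelineLogUri, command, runsOn, instanceType, terminateAfter, and so on) are based on the format you see when exporting a pipeline definition from the console:

# Sketch: the Hello World pipeline expressed as pipeline objects.
import boto3

dp = boto3.client('datapipeline', region_name='us-east-1')
pipeline_id = 'df-0123456789ABC'  # placeholder: use the ID returned by create_pipeline

pipeline_objects = [
    {
        # Global settings (the Configuration box in the architect)
        'id': 'Default',
        'name': 'Default',
        'fields': [
            {'key': 'scheduleType', 'stringValue': 'ondemand'},  # run on activation
            {'key': 'pipelineLogUri', 'stringValue': 's3://my-datapipeline-logs/'},
            {'key': 'role', 'stringValue': 'DataPipelineDefaultRole'},
            {'key': 'resourceRole', 'stringValue': 'DataPipelineDefaultResourceRole'},
        ],
    },
    {
        # The ShellCommandActivity that echoes our message
        'id': 'HelloWorldActivity',
        'name': 'HelloWorldActivity',
        'fields': [
            {'key': 'type', 'stringValue': 'ShellCommandActivity'},
            {'key': 'command', 'stringValue': 'echo "This is just a Hello World message!"'},
            {'key': 'runsOn', 'refValue': 'HelloWorldResource'},
        ],
    },
    {
        # The EC2 resource the activity runs on
        'id': 'HelloWorldResource',
        'name': 'HelloWorldResource',
        'fields': [
            {'key': 'type', 'stringValue': 'Ec2Resource'},
            {'key': 'instanceType', 'stringValue': 't1.micro'},
            {'key': 'terminateAfter', 'stringValue': '10 Minutes'},
        ],
    },
]

# Validate the definition first, then push it to the pipeline
dp.validate_pipeline_definition(pipelineId=pipeline_id, pipelineObjects=pipeline_objects)
dp.put_pipeline_definition(pipelineId=pipeline_id, pipelineObjects=pipeline_objects)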

  17. Once the pipeline is ready, click on Save to save the changes you have made. Selecting the Save option automatically compiles your pipeline and checks for any errors as well. If any errors are found, they will be displayed in the Errors/Warnings section. If no errors are reported, click on Activate to activate your pipeline.

The pipeline takes a few minutes to transition from the WAITING_FOR_RUNNER state to the FINISHED state. This process involves first spinning up the EC2 instance, or resource, that we defined in the pipeline. Once the resource is up and running, Data Pipeline automatically installs the Task Runner on it, since the resource is managed by Data Pipeline itself. Once installed, the Task Runner starts polling Data Pipeline for pending activities and executes them.
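If you are following along with the API instead of the console, activation and the status check described above look roughly like this. The '@pipelineState' field key and its FINISHED/ERROR values are assumptions based on the Data Pipeline object model, so treat the loop as a convenience sketch:

# Sketch: activate the pipeline and poll its overall state.
import time
import boto3

dp = boto3.client('datapipeline', region_name='us-east-1')
pipeline_id = 'df-0123456789ABC'  # placeholder: use your pipeline's ID

dp.activate_pipeline(pipelineId=pipeline_id)

while True:
    description = dp.describe_pipelines(pipelineIds=[pipeline_id])
    fields = description['pipelineDescriptionList'][0]['fields']
    state = next((f['stringValue'] for f in fields if f['key'] == '@pipelineState'), 'UNKNOWN')
    print('Pipeline state:', state)
    if state in ('FINISHED', 'ERROR'):
        break
    time.sleep(30)  # poll every 30 seconds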

Once the pipeline's status turns to FINISHED, expand the pipeline's component name and select the Attempts tab, as shown. If not specified otherwise, Data Pipeline will try to execute your pipeline for a default of three attempts before it finally stops the execution.

For each attempt, you can view the corresponding Activity Logs, Stdout as well as the Stderr messages:

Select the Stdout option to view your Hello World message! Et voila! Your first pipeline is up and running!
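The Attempts view also has an API-side equivalent. The sketch below lists the pipeline's attempt objects and prints their fields; the exact field keys that hold the stdout and stderr locations vary by object type, so this simply dumps everything for inspection:

# Sketch: list the pipeline's attempt objects and print their fields.
import boto3

dp = boto3.client('datapipeline', region_name='us-east-1')
pipeline_id = 'df-0123456789ABC'  # placeholder: use your pipeline's ID

# Query for attempt objects, then describe them to inspect their fields
attempt_ids = dp.query_objects(pipelineId=pipeline_id, sphere='ATTEMPT').get('ids', [])
if attempt_ids:
    for obj in dp.describe_objects(pipelineId=pipeline_id,
                                   objectIds=attempt_ids)['pipelineObjects']:
        print(obj['name'])
        for field in obj['fields']:
            print('  ', field['key'], field.get('stringValue', field.get('refValue')))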

Feel free to try out a few other options for your pipeline by simply selecting the pipeline name and clicking on the Edit Pipeline option. You can also export your pipeline's definition by selecting the pipeline name and, from the Actions tab, opting for the Export option.
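The same export is also available through the API. The following boto3 sketch retrieves the definition as pipeline objects and dumps them as JSON, which is a handy starting point for the definition files discussed next:

# Sketch: export the pipeline definition through the API instead of the console.
import json
import boto3

dp = boto3.client('datapipeline', region_name='us-east-1')
pipeline_id = 'df-0123456789ABC'  # placeholder: use your pipeline's ID

definition = dp.get_pipeline_definition(pipelineId=pipeline_id)
print(json.dumps(definition['pipelineObjects'], indent=2, default=str))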

Pipeline definitions are a far better and easier way of creating pipelines if you are a fan of working with JSON and CLI interfaces. They offer better flexibility and usability compared to the standard pipeline dashboard, which can take some time for beginners to get used to. With this in mind, in the next section we will explore a few basics of how you can get started by creating your very own pipeline definition file.
