Executing remote commands using AWS Data Pipeline

One of the best parts of working with Data Pipeline is the versatility of tasks you can achieve with just this one tool. In this section, we will look at a relatively simple pipeline definition that you can use to execute remote scripts and commands on EC2 instances.

How does this setup work? To start with, we need an S3 bucket (it can be in any AWS region) that will act as a repository for all our shell scripts. Once the bucket is created, simply create and upload the following shell script to it (a quick CLI sketch for this step follows the script). Note that in this case the shell script is named simplescript.sh, and the same name is used in the pipeline definition that follows:

#!/bin/bash 
# Print some basic details about the EC2 instance this script runs on 
echo "----------------------------------" 
echo "Your username is: $USER" 
echo "----------------------------------" 
echo "The current date and time: $(date)" 
echo "----------------------------------" 
echo "Users currently logged on to this system: " 
who 
echo "----------------------------------" 
echo "Installed AWS CLI version: " 
aws --version 
echo "----------------------------------" 

The script is pretty self-explanatory: it prints a series of messages about the EC2 instance it runs on. You can substitute any other shell script here, for example one that takes backups of particular files, or archives existing files into a tar.gz and pushes it to an awaiting S3 bucket, and so on; a sketch of such a variant follows.
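
Purely as an illustration, a hypothetical archive-and-upload variant might look like the following; the source directory and the my-backup-bucket name are assumptions you would replace with your own:

#!/bin/bash 
# Hypothetical example: archive a directory and push it to S3 for safekeeping 
SOURCE_DIR="/var/log/myapp"              # directory to back up (assumption) 
BACKUP_BUCKET="s3://my-backup-bucket"    # destination bucket (assumption) 
ARCHIVE="backup-$(date +%Y%m%d-%H%M%S).tar.gz" 

tar -czf "/tmp/${ARCHIVE}" "${SOURCE_DIR}" 
aws s3 cp "/tmp/${ARCHIVE}" "${BACKUP_BUCKET}/${ARCHIVE}" 
rm -f "/tmp/${ARCHIVE}" 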

With the script file uploaded to the correct S3 bucket, the final step is to copy and paste the following pipeline definition into a file and upload it to Data Pipeline for execution:

{ 
  "objects": [ 
    { 
      "failureAndRerunMode": "CASCADE", 
      "resourceRole": "DataPipelineDefaultResourceRole", 
      "role": "DataPipelineDefaultRole", 
      "pipelineLogUri": "s3://<DATAPIPELINE_LOG_BUCKET>", 
      "scheduleType": "ONDEMAND", 
      "name": "Default", 
      "id": "Default" 
    }, 
    { 
      "name": "CliActivity", 
      "id": "CliActivity", 
      "runsOn": { 
        "ref": "Ec2Instance" 
      }, 
      "type": "ShellCommandActivity", 
      "command": "(sudo yum -y update aws-cli) && (#{myCustomScriptCmd})" 
    }, 
    { 
      "instanceType": "t1.micro", 
      "name": "Ec2Instance", 
      "id": "Ec2Instance", 
      "type": "Ec2Resource", 
      "terminateAfter": "15 Minutes" 
    } 
  ], 
  "parameters": [ 
    { 
      "watermark": "aws [options] <command> <subcommand> [parameters]", 
      "description": "AWS CLI command", 
      "id": "myCustomScriptCmd", 
      "type": "String" 
    } 
  ], 
  "values": { 
    "myCustomScriptCmd": "aws s3 cp s3://<S3_BUCKET_SCRIPT_LOCATION>/simplescript.sh . && sh simplescript.sh" 
  } 
} 

Remember to swap out <DATAPIPELINE_LOG_BUCKET> and <S3_BUCKET_SCRIPT_LOCATION> with your actual values, and to save the file with a .json extension.

This particular pipeline definition relies on a ShellCommandActivity to first update the AWS CLI on the remote EC2 instance, and then copy the shell script locally from the S3 bucket and execute it.

To upload the pipeline definition, use the AWS Data Pipeline console to create a new pipeline. In the Create Pipeline wizard, provide a suitable Name and Description for the new pipeline. Once done, select the Import a definition option from the Source field, as shown in the following screenshot:

Once the definition loads, you should see the custom AWS CLI command in the Parameters section. With the pipeline definition successfully loaded, you can now choose to run the pipeline either on a schedule or on activation. In my case, I have selected to run the pipeline on activation, as this is for demo purposes.

Ensure that logging is enabled for the new pipeline and that the correct S3 bucket for storing the pipeline's logs is specified. With all the necessary fields filled in, click on Activate to start the pipeline.
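
If you would rather script these steps than click through the console, here is a minimal sketch using the AWS CLI; the pipeline name, the remote-commands.json filename, and the df-EXAMPLE pipeline ID are placeholders of my own choosing:

# Create an empty pipeline and note the pipeline ID that is returned 
aws datapipeline create-pipeline --name remote-commands --unique-id remote-commands-demo 

# Upload the definition file (replace df-EXAMPLE with the ID returned above) 
aws datapipeline put-pipeline-definition --pipeline-id df-EXAMPLE --pipeline-definition file://remote-commands.json 

# Activate the pipeline so that it starts executing 
aws datapipeline activate-pipeline --pipeline-id df-EXAMPLE 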

Once again, the pipeline will transition from the WAITING_FOR_RUNNER state to the FINISHED state. This usually takes a good minute or two to complete.
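
If you want to poll the same status from the terminal, a minimal sketch, again using the df-EXAMPLE pipeline ID placeholder:

# List the pipeline's runs along with their current status 
aws datapipeline list-runs --pipeline-id df-EXAMPLE 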

From the Data Pipeline console, expand the existing pipeline and select the Attempts tab, as shown in the following screenshot. Here, click on Stdout to view the output of the script's execution:
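
The same output is also written to the S3 location you configured as the pipeline's log URI, so you can optionally dig it out of the bucket instead; a rough sketch, reusing the log bucket placeholder from the definition:

# Browse the attempt logs that Data Pipeline wrote to the configured log bucket 
aws s3 ls s3://<DATAPIPELINE_LOG_BUCKET>/ --recursive 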

Once you have viewed the output, you can optionally select the pipeline and click on the Mark Finished option as well. This stops the pipeline from making any further execution attempts.

Simple, isn't it? You can use a similar approach to back up your files and execute commands on managed instances. In the next section, we will look at one last pipeline definition example, one that helps us take periodic backups of content stored in one Amazon S3 bucket to another, using both the Data Pipeline console and the AWS CLI!
