Backing up data using AWS Data Pipeline

One of the most common use cases for AWS Data Pipeline is scheduling and running backup jobs. You can use Data Pipeline to back up data stored in EC2 instances, EBS volumes, databases, and even S3 buckets. In this section, we will walk through a simple, parameterized pipeline definition that you can use to schedule and perform backups of files stored in an Amazon S3 bucket.

First up, let's have a look at the pipeline definition file itself:

You can find the complete copy of code at https://github.com/yoyoclouds/Administering-AWS-Volume2.

To start with, we once again provide a list of objects that describe the pipeline components, starting with the pipeline configuration object, as highlighted in the following code:

  "objects": [
{
"failureAndRerunMode": "CASCADE",
"resourceRole": "DataPipelineDefaultResourceRole",
"role": "DataPipelineDefaultRole",
"pipelineLogUri": "#{myDataPipelineLogs}",
"scheduleType": "ONDEMAND",
"name": "Default",
"id": "Default"
},

Next, we provide the definition for other pipeline objects, including the data nodes:

    {
      "filePath": "#{myInputS3FilePath}",
      "name": "inputS3Bucket",
      "id": "InputS3FilePath",
      "type": "S3DataNode"
    },
    {
      "filePath": "#{myOutputS3FilePath}/#{format(@scheduledStartTime, 'YYYY-MM-dd-HH-mm-ss')}.bak",
      "name": "outputS3Bucket",
      "id": "OutputS3FilePath",
      "type": "S3DataNode"
    },

In this case, we use the #{VARIABLE_NAME} syntax to declare a set of variables that make the pipeline definition more reusable. Once the data nodes are configured, we also have to define a set of actions that trigger SNS alerts based on the pipeline's success or failure. Here is a snippet of the failure alert:

    {
      "role": "DataPipelineDefaultRole",
      "subject": "Failure",
      "name": "SNSAlertonFailure",
      "id": "OnFailSNSAlert",
      "message": "File was not copied over successfully. Please check the Data Pipeline logs.",
      "type": "SnsAlarm",
      "topicArn": "#{mySNSTopicARN}"
    },
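The complete definition in the GitHub repository also contains a matching success alert, along with the CopyActivity and Ec2Resource objects that actually copy the file between the two data nodes on a transient EC2 instance. The following is a minimal sketch of what these objects could look like; the ids and field values here are illustrative and may differ from the repository's version:

    {
      "myComment": "Illustrative sketch; refer to the GitHub repository for the exact definition",
      "role": "DataPipelineDefaultRole",
      "subject": "Success",
      "name": "SNSAlertonSuccess",
      "id": "OnSuccessSNSAlert",
      "message": "File was copied over successfully.",
      "type": "SnsAlarm",
      "topicArn": "#{mySNSTopicARN}"
    },
    {
      "name": "BackupEC2Resource",
      "id": "BackupEC2Resource",
      "type": "Ec2Resource",
      "instanceType": "#{myEC2InstanceType}",
      "terminateAfter": "#{myEC2InstanceTermination} Minutes"
    },
    {
      "name": "CopyFileActivity",
      "id": "CopyFileActivity",
      "type": "CopyActivity",
      "input": { "ref": "InputS3FilePath" },
      "output": { "ref": "OutputS3FilePath" },
      "runsOn": { "ref": "BackupEC2Resource" },
      "onSuccess": { "ref": "OnSuccessSNSAlert" },
      "onFail": { "ref": "OnFailSNSAlert" }
    },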

With the objects defined, the second section sets up the parameters, where each of the variables declared in the objects section is described and typed:

  "parameters": [
{
"watermark": "s3://mysourcebucket/filename",
"description": "Source File Path:",
"id": "myInputS3FilePath",
"type": "AWS::S3::ObjectKey",
"myComment": "The File path from the Input S3 Bucket"
},
{
"watermark": "s3://mydestinationbucket/filename",
"description": "Destination (Backup) File Path:",
"id": "myOutputS3FilePath",
"myComment": "The File path for the Output S3 Bucket",
"type": "AWS::S3::ObjectKey"
},
{
"watermark": "arn:aws:sns:us-east-1:28619EXAMPLE:ExampleTopic",
"description": "SNS Topic ARN:",
"id": "mySNSTopicARN",
"type": "string",
"myComment": "The SNS Topic's ARN for notifications"
},
. . . .
]
}

With the definition in mind, let us first look at uploading it to AWS Data Pipeline using the web console:

  1. Log in to the AWS Data Pipeline console by navigating to this URL: https://console.aws.amazon.com/datapipeline/home?region=us-east-1.
We have deployed all of our pipelines so far in the US East (N. Virginia) region. You can opt to change the region as per your requirements.
  2. Once done, select the Create Pipeline option to get started. In the Create Pipeline page, fill in a suitable Name and Description for the new pipeline:
  3. Next, select the Import a definition option and click on Load local file, as shown. Copy and upload the JSON definition file here.
  4. With the file uploaded, fill out the Parameters section as explained here:
    • S3 bucket path to data pipeline logs: Browse and provide the bucket path for storing the pipeline's logs.
    • Source file path: Browse and select the file that you wish to back up from an Amazon S3 bucket.
    • Destination (backup) file path: Browse and select the Amazon S3 bucket path where you want to store the backed-up file. You can optionally provide a backup folder name as well. Each file backed up to this location follows a standard naming convention: YYYY-MM-dd-HH-mm-ss.bak.
    • SNS Topic ARN: Provide a valid SNS topic ARN here. This ARN is used to notify the user whether the pipeline's execution was a success or a failure.
    • EC2 instance type: You can optionally provide a different EC2 instance type as a resource here. By default, the t1.micro instance type is used.
    • EC2 instance termination: Once again, you can provide a different instance termination value here. By default, it is set to 20 minutes. The termination time should be adjusted based on the approximate time taken to back up a file: the larger the file, the more time required to copy it, and vice versa.
  5. Once the parameter fields are populated, select the Edit in Architect option to view the overall components of the pipeline definition. You should see the following depiction:
  6. Click on Save to validate the pipeline for any errors. Once done, select Activate to start the pipeline's execution process.
  7. The pipeline takes a few minutes to transition from the WAITING_FOR_RUNNER state to the FINISHED state. Once done, check for the backed-up file in your destination S3 folder.

You can further tweak this particular pipeline definition to back up entire S3 folder paths rather than just an individual file, as performed here (a sketch of this follows the scheduling example below). Additionally, you can change when the pipeline executes by swapping the ONDEMAND scheduleType for a time-based Schedule object, as depicted in the following code snippet:

{ 
  "id" : "Default", 
  "type" : "Schedule", 
  "period" : "1 hours", 
  "startDateTime" : "2018-03-01T00:00:00", 
  "endDateTime" : "2018-04-01T00:00:00" 
} 

The preceding snippet executes the pipeline every hour, starting from March 1, 2018 at 00:00:00 until April 1, 2018 at 00:00:00.
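Note that when you move away from ONDEMAND, the Default configuration object also needs to reference the Schedule object using a cron-style scheduleType. Here is a minimal sketch of how that could look, assuming the schedule is given its own id of DefaultSchedule; the ids and values are illustrative and not part of the repository's definition:

    {
      "id": "Default",
      "name": "Default",
      "scheduleType": "cron",
      "schedule": { "ref": "DefaultSchedule" },
      "role": "DataPipelineDefaultRole",
      "resourceRole": "DataPipelineDefaultResourceRole",
      "pipelineLogUri": "#{myDataPipelineLogs}"
    },
    {
      "id": "DefaultSchedule",
      "name": "Every 1 hour",
      "type": "Schedule",
      "period": "1 hours",
      "startDateTime": "2018-03-01T00:00:00",
      "endDateTime": "2018-04-01T00:00:00"
    },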

To know more on how you can use the Schedule object, visit https://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-object-schedule.html.
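Coming back to the first tweak mentioned earlier, backing up an entire folder rather than a single file only requires the S3DataNode objects to use a directoryPath instead of a filePath. Here is a minimal sketch, assuming a hypothetical myInputS3FolderPath parameter in place of myInputS3FilePath:

    {
      "directoryPath": "#{myInputS3FolderPath}",
      "name": "inputS3Folder",
      "id": "InputS3FolderPath",
      "type": "S3DataNode"
    },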

Now that the pipeline is up and running using the console, let us also have a look at a few simple AWS CLI commands that you can use to achieve the same results:

  1. To start with, create a blank pipeline using the following command:
# aws datapipeline create-pipeline \
--name <NAME_OF_PIPELINE> \
--unique-id <UNIQUE_TOKEN>

The <UNIQUE_TOKEN> can be any string of characters and is used to ensure idempotency during repeated calls to the create-pipeline command.

  2. Once the pipeline is created, you will be presented with the pipeline's ID, as depicted in the following screenshot. Make a note of this ID as it will be required in the next steps:
  3. Next, we need to create three separate JSON files with the following content in them:
    • pipeline.json: Copy and paste only the object definitions into this file.
    • parameters.json: Copy and paste the parameter definitions here.
    • values.json: Create a new file that contains the values for the parameters, as shown in the following code snippet. Remember to substitute the values in <> with those of your own:
{ 
  "values": 
    { 
      "myDataPipelineLogs": "s3://<BUCKET_NAME>", 
      "myOutputS3FilePath": "s3://<BUCKET_NAME>/<FOLDER>", 
      "myInputS3FilePath": "s3://<BUCKET_NAME>/<FILE_NAME>", 
      "mySNSTopicARN": "<SNS_ARN_FOR_NOTIFICATIONS>", 
      "myEC2InstanceType": "t1.micro", 
      "myEC2InstanceTermination": "20" 
    } 
} 
  4. Once done, save all three files and type in the following command to attach the pipeline definition to the newly created pipeline:
# aws datapipeline put-pipeline-definition \
--pipeline-id <PIPELINE_ID> \
--pipeline-definition file://pipeline.json \
--parameter-objects file://parameters.json \
--parameter-values-uri file://values.json

Here is a screenshot of the command's output for your reference:

  5. With the pipeline definition uploaded, the final step left is to activate the pipeline using the following command:
# aws datapipeline activate-pipeline \
--pipeline-id <PIPELINE_ID>
  6. Once the pipeline is activated, you can view its status and last run times using the following command:
# aws datapipeline list-runs \
--pipeline-id <PIPELINE_ID>
  7. Once the pipeline's execution completes, you can deactivate and delete the pipeline using the following set of commands:
# aws datapipeline deactivate-pipeline \
--pipeline-id <PIPELINE_ID>
# aws datapipeline delete-pipeline \
--pipeline-id <PIPELINE_ID>

Here is a screenshot of the command's output for your reference:

With this, we come to the end of yet another interesting chapter. But before we wind things up, here is a quick look at some important next steps that you should try out on your own.
