Working with data pipeline definition files

The AWS Data Pipeline console provides us with three different options for creating a new pipeline. You can use the Architect mode, which is exactly what we worked with in the earlier section, or alternatively use one of the pre-defined templates as a boilerplate and build your pipeline from it. Last but not least, the console also provides you with the ability to upload your very own pipeline definition file, which is essentially a collection of pipeline objects and conditions written in JSON format. In this section, we will learn how to write our own pipeline definitions and later use them to build a custom pipeline as well.

To start, a pipeline definition file is built from two components, objects and fields:

  • Objects: An object is an individual component required to build a pipeline. These can be data nodes, conditions, activities, resources, schedules, and so on.
  • Fields: Each object is described by one or more fields. The fields are made up of key-value pairs that are enclosed in double quotes and separated by a colon.

Here is a skeleton structure of a pipeline definition file:

{ 
  "objects" : [ 
    { 
       "key1" : "value1", 
       "key2" : "value2" 
    }, 
    { 
       "key3" : "value3" 
    } 
  ] 
} 

Here is a look at the pipeline definition file obtained by exporting the Hello World pipeline that we created earlier:

{ 
  "objects": [ 
    { 
      "failureAndRerunMode": "CASCADE", 
      "resourceRole": "DataPipelineDefaultResourceRole", 
      "role": "DataPipelineDefaultRole", 
      "pipelineLogUri": "s3://us-east-datapipeline-logs-01/logs/", 
      "scheduleType": "ONDEMAND", 
      "name": "Default", 
      "id": "Default" 
    }, 
    { 
      "name": "myActivity", 
      "id": "ShellCommandActivityId_2viZe", 
      "runsOn": { 
        "ref": "ResourceId_EhxAF" 
      }, 
      "type": "ShellCommandActivity", 
      "command": "echo "This is just a Hello World message!"" 
    }, 
    { 
      "resourceRole": "DataPipelineDefaultResourceRole", 
      "role": "DataPipelineDefaultRole", 
      "name": "myEC2Resource", 
      "id": "ResourceId_EhxAF", 
      "type": "Ec2Resource", 
      "terminateAfter": "10 Minutes" 
    } 
  ], 
  "parameters": [] 
} 

You can find the complete copy of the code at https://github.com/yoyoclouds/Administering-AWS-Volume2.

Each object generally contains id, name, and type fields that describe it and its functionality. For example, the Resource object in the Hello World scenario contains the following values:

{ 
      "name": "myEC2Resource", 
      "id": "ResourceId_EhxAF", 
      "type": "Ec2Resource", 
       ... 
} 

You can also find the same fields in both the ShellCommandActivity and the Default configuration objects.

A pipeline object can refer to other objects within the same pipeline using the "ref" : "ID_of_referred_resource" field. Here is an example of the ShellCommandActivity referencing the EC2 resource by its resource ID:

{ 
      "name": "myActivity", 
      "id": "ShellCommandActivityId_2viZe", 
      "runsOn": { 
        "ref": "ResourceId_EhxAF" 
      }, 
      "type": "ShellCommandActivity", 
      "command": "echo "This is just a Hello World message!"" 
    }, 
    { 
      "resourceRole": "DataPipelineDefaultResourceRole", 
      "role": "DataPipelineDefaultRole", 
      "name": "myEC2Resource", 
      "id": "ResourceId_EhxAF", 
      "type": "Ec2Resource", 
      "terminateAfter": "10 Minutes" 
    } 

You can additionally create custom, user-defined fields (their names must begin with the prefix my) and use them to reference other pipeline components, using the same ref syntax described in the previous code:

{ 
  "id": "ResourceId_EhxAF", 
  "type": "Ec2Resource", 
  "myCustomField": "This is a custom field.", 
  "myCustomReference": {"ref": "ShellCommandActivityId_2viZe"} 
} 
You can find the detailed references for data nodes, resources, activities, and other objects at https://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-pipeline-objects.html.
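
For instance, a data node object follows the same id, name, and type pattern as the objects we have seen so far. Here is a minimal sketch of an S3DataNode, where the id, name, bucket, and file path are all placeholder values:

{ 
  "id": "S3InputDataNodeId", 
  "name": "myS3InputData", 
  "type": "S3DataNode", 
  "filePath": "s3://my-example-bucket/input/data.csv" 
} 

You could then reference such a data node from an activity's input field using the same ref syntax shown earlier.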

Last but not least, you can also leverage a parameterized template to customize the pipeline definition. Using this method, you can have one common pipeline definition and pass different values to it at the time of pipeline creation.

To parameterize a pipeline definition, you need to specify a variable using the following syntax:

 "#{VARIABLE_NAME}"

With the variable created, you can define its value in a separate parameters object, which can be stored either in the same pipeline definition file or in a separate JSON file altogether. Consider the following example, where we pass the same Hello World message to the ShellCommandActivity, this time using a variable:

{ 
      "name": "myActivity", 
      "id": "ShellCommandActivityId_2viZe", 
      "runsOn": { 
        "ref": "ResourceId_EhxAF" 
      }, 
      "type": "ShellCommandActivity", 
      "command": "#{myVariable}" 
}

Once the variable is defined, we describe its type, description, and default value in a separate parameters object, as shown in the following code:

{ 
  "parameters": [ 
    { 
      "id": "myVariable", 
      "description": "Shell command to run", 
      "type": "String", 
      "default": "echo "Default message!"" 
    } 
  ] 
} 

In this case, the variable myVariable is a simple String type, and we have also provided it with a default value that is used if no value is supplied for the variable at the time of the pipeline's creation.
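
One way to supply such a value (a minimal sketch; the command shown here simply reuses the earlier Hello World message for illustration) is a values section, which can sit alongside the objects and parameters sections in the same definition file, or be kept in a separate parameter-values JSON file:

{ 
  "values": { 
    "myVariable": "echo \"This is just a Hello World message!\"" 
  } 
} 

When such a value is supplied, it overrides the default defined in the parameters object.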

To learn more about how to use variables and parameters in your pipeline definitions, visit https://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-custom-templates.html.

With this, we come to the end of this section. In the next section, we will look at how you can leverage AWS Data Pipeline to execute scripts and commands on remote EC2 instances using a parameterized pipeline definition.
