Generator of pipelines executing containerized apps

Introduction

The National Studies on Air Pollution and Health (NSAPH) organization publishes containerized applications that produce certain types of data. These applications are published on the NSAPH Data Production GitHub.

The Pipeline Generator generates a CWL pipeline that executes the app and ingests the data it produces into the Dorieh data warehouse.

The process of data ingestion consists of two steps:

  1. Generation of the pipeline for data ingestion

  2. Execution of the pipeline

Prerequisites

Docker or Python virtual environment

You need either Option 1 or Option 2, not both!

Option 1: Docker

The first step, generation of the pipeline, has minimal requirements. The easiest way to generate the pipeline is to use a docker container, which only requires Docker to be installed on the host system where the step is executed. See Docker installation instructions for details.

Option 2: Python virtual environment

Alternatively, instead of Docker, one can set up a Python virtual environment. Once the virtual environment is set up, install the Dorieh packages in it with the following command:

pip install git+https://github.com/NSAPH-Data-Platform/nsaph-core-platform.git@develop
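
If you do not yet have a virtual environment, a typical sequence for creating one and installing Dorieh is shown below (a sketch assuming Python 3 and a POSIX shell; the directory name .venv is arbitrary):

# create and activate a fresh virtual environment
python3 -m venv .venv
source .venv/bin/activate
# install the Dorieh packages from the develop branch
pip install git+https://github.com/NSAPH-Data-Platform/nsaph-core-platform.git@develop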

Set up DBMS Server

Dorieh uses the PostgreSQL DBMS to manage its data warehouse. The data warehouse is assumed to be set up and operational before data can be ingested; generating the pipeline does not require the data warehouse.
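
If you need a server for local testing, one possible way to stand one up is with Docker (a sketch; the container name, credentials, and database name below are placeholders, not values required by Dorieh):

# start a throwaway PostgreSQL server; all values are placeholders
docker run -d --name dorieh-postgres \
    -e POSTGRES_USER=postgres \
    -e POSTGRES_PASSWORD=mysecretpassword \
    -e POSTGRES_DB=nsaph \
    -p 5432:5432 \
    postgres:16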

Define connection

Dorieh uses a database.ini-style file to manage connections to the data warehouse. The format is described in the documentation.

If a file with database connections does not exist, you need to create one, for example a file named database.ini somewhere on your local file system.
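
A minimal sketch of such a file, assuming standard PostgreSQL connection parameters; the section name (here dorieh) is the connection name you will later pass to the pipeline, and all values are placeholders:

[dorieh]
host=localhost
port=5432
database=nsaph
user=postgres
password=mysecretpassword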

Using pipeline generator

Generate pipeline and metadata

The generator takes 3 command line parameters:

  1. GitHub URL or a local path for the containerized app. In the root directory of the path, the generator will look for a file named app.config.yaml.

  2. Name of the output file for the generated pipeline.

  3. Git branch of the app repository to use.

If you use a local Python virtual environment, then run:

python -m dorieh.platform.apprunner.app_run_generator $GitHubURL $outputfile $branch

Example:

python -m dorieh.platform.apprunner.app_run_generator https://github.com/NSAPH-Data-Processing/zip2zcta_master_xwalk.git pipeline.cwl master

This generates a pipeline that executes the ZIP to ZCTA Crosswalk Producer app using the master branch and outputs the result into the current directory in a file named pipeline.cwl.

Alternatively, to do the same using Docker container, execute:

docker run -v $(pwd):/tmp/work forome/dorieh python -m dorieh.platform.apprunner.app_run_generator https://github.com/NSAPH-Data-Processing/zip2zcta_master_xwalk.git /tmp/work/pipeline.cwl master

In both cases, the generator will produce 3 files:

  • pipeline.cwl: main workflow file

  • ingest.cwl: subworkflow used for data ingestion

  • common.yaml: metadata required for ingestion. The name common is derived from the domain key in the app.config.yaml file in the app repository.

Execute generated pipeline

If you installed the Dorieh packages in your local Python virtual environment, you can execute the pipeline from the working directory with the following command, for example using the CWL reference implementation built into Dorieh (cwl-runner):

cwl-runner pipeline.cwl --database $path_to_your_connection_def_file --connection_name $connection_name

for example:

cwl-runner pipeline.cwl --database ../../database.ini --connection_name dorieh

A better way is to use a production-grade CWL implementation such as Toil. To do this, you need to install Toil on your local system.
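
Toil with CWL support can typically be installed with pip (a sketch; see the Toil documentation for authoritative instructions):

pip install 'toil[cwl]'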

You do not need to install the Dorieh packages to execute the pipeline. The runtime engine will use the Dorieh container, where all requirements are preinstalled.

For Toil, it is a good idea to first create a working directory, e.g. one named work. Otherwise, Toil will create a default directory somewhere in your temporary space.

The command to execute the pipeline with Toil would be:

toil-cwl-runner --retryCount 0 --cleanWorkDir never --jobStore j1 --outdir results --workDir work pipeline.cwl --database ../../database.ini --connection_name nsaph-docker

Specifying a job store lets you restart the pipeline from the point of failure if pipeline execution fails for any reason.
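
For example, if the run above fails, it can typically be resumed with Toil's --restart option pointing at the same job store (a sketch; consult the Toil documentation for details):

toil-cwl-runner --restart --jobStore j1 --outdir results --workDir work pipeline.cwl --database ../../database.ini --connection_name nsaph-docker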

Appendix 1: Metadata description

File app.config.yaml

Keys:

  • metadata: a relative path to the app metadata file

  • dorieh-metadata: a relative path to the metadata required to create a database table

  • docker: information about the docker container, including:

    • image: the tag for the container image that executes the app

    • run: the command to be run within the container. This is an optional field

    • outputdir: the directory where to look for the results of the app's execution
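
A sketch of an app.config.yaml combining the keys above; the image tag, command, and paths are hypothetical placeholders, and the domain key (mentioned earlier as determining the name of the generated metadata file) is included for completeness:

domain: common                        # determines the name of the generated metadata file (common.yaml)
metadata: metadata.yml                # relative path to the app metadata file
dorieh-metadata: dorieh-metadata.yaml # relative path to the database table metadata
docker:
  image: nsaph/example-app:latest     # hypothetical image tag
  run: python run_app.py              # optional command to run inside the container
  outputdir: /app/output              # hypothetical path where the app writes its results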

File metadata.yml

This is the file referenced from app.config.yaml by the metadata key.

It should contain the following keys:

  • dataset_name

  • description

  • fields:

    • table

      • columns

Each column should have name, type and description keys.
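
A sketch of such a file, following the key structure above; the dataset name, column names, and types are hypothetical placeholders:

dataset_name: example_dataset              # hypothetical name
description: A short description of the dataset
fields:
  table:
    columns:
      - name: zip                          # hypothetical column
        type: varchar
        description: 5-digit ZIP code
      - name: zcta
        type: varchar
        description: ZCTA identifier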

File dorieh-metadata.yaml

This is a header for a knowledge domain that will be created. A detailed description is provided in the Data modeling section. It is important to define correct values for quoting, schema, and primary_key for each table.
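
An illustrative sketch of such a domain header, assuming the structure described in the Data modeling section; the domain name, schema, table name, and all values below are assumptions, not prescribed by this document:

common:                       # knowledge domain name; matches the domain key in app.config.yaml
  schema: common
  quoting: 3                  # assumed quoting mode; verify against the Data modeling section
  tables:
    zip2zcta:                 # hypothetical table name
      primary_key:
        - zip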