Generator of pipelines executing containerized apps

Introduction

The National Studies on Air Pollution and Health (NSAPH) organization publishes containerized applications that produce certain types of data. These applications are published on the NSAPH Data Production GitHub.

The Pipeline Generator generates a CWL pipeline that executes the app and ingests the data it produces into the Dorieh data warehouse.

The process of data ingestion consists of two steps:

  1. Generation of the pipeline for data ingestion

  2. Execution of the pipeline

Prerequisites

Docker or Python virtual environment

You need either Option 1 or Option 2, not both!

Option 1: Docker

The first step, generation of the pipeline, has minimal requirements. The easiest way to generate the pipeline is to use a docker container, which only requires Docker to be installed on the host system where the step is executed. See Docker installation instructions for details.

Option 2: Python virtual environment

Alternatively, instead of Docker, one can set up a Python virtual environment. Once the virtual environment is set up, install the Dorieh packages in it with the following command:

pip install git+https://github.com/NSAPH-Data-Platform/nsaph-core-platform.git@develop
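
If you do not yet have a virtual environment, a typical sequence for creating one and installing Dorieh is shown below (a sketch assuming Python 3 and a POSIX shell; the directory name .venv is arbitrary):

# create and activate a fresh virtual environment
python3 -m venv .venv
source .venv/bin/activate
# install the Dorieh packages from the develop branch
pip install git+https://github.com/NSAPH-Data-Platform/nsaph-core-platform.git@develop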

Set up DBMS Server

Dorieh uses the PostgreSQL DBMS to manage its data warehouse. The data warehouse is assumed to be set up and operational before data can be ingested; generating the pipeline does not require the data warehouse.
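
If you need a server for local testing, one possible way to stand one up is with Docker (a sketch; the container name, credentials, and database name below are placeholders, not values required by Dorieh):

# start a throwaway PostgreSQL server; all values are placeholders
docker run -d --name dorieh-postgres \
    -e POSTGRES_USER=postgres \
    -e POSTGRES_PASSWORD=mysecretpassword \
    -e POSTGRES_DB=nsaph \
    -p 5432:5432 \
    postgres:16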

Define connection

Dorieh uses a database.ini-style file to manage connections to the data warehouse. The format is described in the documentation.

If a file with database connections does not exist, you need to create one, for example a file named database.ini somewhere on your local file system.
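
A minimal sketch of such a file, assuming standard PostgreSQL connection parameters; the section name (here dorieh) is the connection name you will later pass to the pipeline, and all values are placeholders:

[dorieh]
host=localhost
port=5432
database=nsaph
user=postgres
password=mysecretpassword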

Using pipeline generator

Generate pipeline and metadata

The generator takes 3 command line parameters:

  1. GitHub URL or a local path for the containerized app. In the root directory of the path, the generator will look for a file named app.config.yaml.

  2. Name of the output file for the generated pipeline.

  3. Git branch of the app repository to use.

If you use a local Python virtual environment, then run:

python -m dorieh.platform.apprunner.app_run_generator $GitHubURL $outputfile $branch

Example:

python -m dorieh.platform.apprunner.app_run_generator https://github.com/NSAPH-Data-Processing/zip2zcta_master_xwalk.git pipeline.cwl master

This generates a pipeline that executes the ZIP to ZCTA Crosswalk Producer app using the master branch and outputs the result into the current directory in a file named pipeline.cwl.

Alternatively, to do the same using Docker container, execute:

docker run -v $(pwd):/tmp/work forome/dorieh python -m dorieh.platform.apprunner.app_run_generator https://github.com/NSAPH-Data-Processing/zip2zcta_master_xwalk.git /tmp/work/pipeline.cwl master

In both cases, the generator will produce 3 files:

  • pipeline.cwl: main workflow file

  • ingest.cwl: subworkflow used for data ingestion

  • common.yaml: metadata required for ingestion. The name common is derived from the domain key in the app.config.yaml file in the app repository.

Execute generated pipeline

If you installed the Dorieh packages in your local Python virtual environment, you can execute the pipeline from the working directory with the following command, for example using the CWL reference implementation built into Dorieh (cwl-runner):

cwl-runner pipeline.cwl --database $path_to_your_connection_def_file --connection_name $connection_name

for example:

cwl-runner pipeline.cwl --database ../../database.ini --connection_name dorieh

A better way is to use a production-grade CWL implementation such as Toil. To do this, you need to install Toil on your local system.
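
Toil with CWL support can typically be installed with pip (a sketch; see the Toil documentation for authoritative instructions):

pip install 'toil[cwl]'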

You do not need to install the Dorieh packages to execute the pipeline. The runtime engine will use the Dorieh container, where all requirements are preinstalled.

For Toil, it is a good idea to first create a working directory, e.g. one named work. Otherwise, Toil will create a default directory somewhere in your temporary space.

The command to execute the pipeline with Toil would be:

toil-cwl-runner --retryCount 0 --cleanWorkDir never --jobStore j1 --outdir results --workDir work pipeline.cwl --database ../../database.ini --connection_name nsaph-docker

Specifying a job store lets you restart the pipeline from the point of failure if pipeline execution fails for any reason.
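
For example, if the run above fails, it can typically be resumed with Toil's --restart option pointing at the same job store (a sketch; consult the Toil documentation for details):

toil-cwl-runner --restart --jobStore j1 --outdir results --workDir work pipeline.cwl --database ../../database.ini --connection_name nsaph-docker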

Appendix 1: Metadata description

File app.config.yaml

Keys:

  • metadata: a relative path to the app metadata file

  • dorieh-metadata: a relative path to the metadata required to create a database table

  • docker: information about the docker container, including:

    • image: the tag for the container image that executes the app

    • run: the command to be run within the container. This is an optional field

    • outputdir: the directory where to look for the results of the app's execution
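
A sketch of an app.config.yaml combining the keys above; the image tag, command, and paths are hypothetical placeholders, and the domain key (mentioned earlier as determining the name of the generated metadata file) is included for completeness:

domain: common                        # determines the name of the generated metadata file (common.yaml)
metadata: metadata.yml                # relative path to the app metadata file
dorieh-metadata: dorieh-metadata.yaml # relative path to the database table metadata
docker:
  image: nsaph/example-app:latest     # hypothetical image tag
  run: python run_app.py              # optional command to run inside the container
  outputdir: /app/output              # hypothetical path where the app writes its results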

File metadata.yml

This is the file referenced from app.config.yaml by the metadata key.

It should contain the following keys:

  • dataset_name

  • description

  • fields:

    • table

      • columns

Each column should have name, type and description keys.
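
A sketch of such a file, following the key structure above; the dataset name, column names, and types are hypothetical placeholders:

dataset_name: example_dataset              # hypothetical name
description: A short description of the dataset
fields:
  table:
    columns:
      - name: zip                          # hypothetical column
        type: varchar
        description: 5-digit ZIP code
      - name: zcta
        type: varchar
        description: ZCTA identifier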

File dorieh-metadata.yaml

This is a header for a knowledge domain that will be created. A detailed description is provided in the Data modeling section. It is important to define correct values for quoting, schema, and primary_key for each table.
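
An illustrative sketch of such a domain header, assuming the structure described in the Data modeling section; the domain name, schema, table name, and all values below are assumptions, not prescribed by this document:

common:                       # knowledge domain name; matches the domain key in app.config.yaml
  schema: common
  quoting: 3                  # assumed quoting mode; verify against the Data modeling section
  tables:
    zip2zcta:                 # hypothetical table name
      primary_key:
        - zip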