Generator of pipelines executing containerized apps
Introduction
The National Studies on Air Pollution and Health (NSAPH) organization publishes containerized applications that produce certain types of data. These applications are published on the NSAPH Data Production GitHub.
The Pipeline Generator generates a CWL pipeline that executes the app and ingests the data it produces into the Dorieh data warehouse.
The process of data ingestion consists of two steps:
1. Generation of the pipeline for data ingestion
2. Execution of the pipeline
Prerequisites
Docker or Python virtual environment
You need either Option 1 or Option 2, not both!
Option 1: Docker
The first step, generation of the pipeline, has minimal requirements. The easiest way to generate the pipeline is to use a Docker container, which only requires Docker to be installed on the host system where the step is executed. See the Docker installation instructions for details.
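To check that Docker is available on the host, the standard Docker commands below should work (nothing here is specific to Dorieh):

```shell
# Verify the Docker client version and that the daemon is reachable
docker --version
docker run --rm hello-world
```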
Option 2: Python virtual environment
Alternatively, instead of Docker, one can set up a Python virtual environment. Once the virtual environment is set up, install the Dorieh packages in it with the following command:
```shell
pip install git+https://github.com/NSAPH-Data-Platform/nsaph-core-platform.git@develop
```
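If the virtual environment does not exist yet, a minimal sketch of creating and activating one before running the install above (the directory name `.venv` is just an example):

```shell
# Create a virtual environment (Linux/macOS; on Windows use .venv\Scripts\activate)
python3 -m venv .venv
source .venv/bin/activate
```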
Set up the DBMS server
Dorieh uses the PostgreSQL DBMS to manage its data warehouse. The data warehouse is assumed to be set up and operational in order to ingest data. Generating the pipeline does not require the data warehouse.
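If you only need a disposable PostgreSQL server for testing, one option is to run the official image with Docker; the container name, credentials, and database name below are illustrative assumptions:

```shell
# Stand up a throwaway PostgreSQL 15 instance on the default port
docker run -d --name dorieh-pg \
  -e POSTGRES_PASSWORD=mysecret \
  -e POSTGRES_DB=nsaph \
  -p 5432:5432 \
  postgres:15
```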
Define connection
Dorieh uses a database.ini-type file to manage connections to the data warehouse. The format is described in the documentation.
If the file with database connections does not exist, you need to create one, for example a file named database.ini somewhere on your local file system.
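A minimal sketch of such a file, assuming a PostgreSQL server on localhost; the section name is the connection name that will later be passed to the pipeline (`dorieh` in the examples below), and all values are illustrative:

```ini
; Hypothetical connection definition; adjust host, credentials,
; and database name to your environment
[dorieh]
host=localhost
port=5432
database=nsaph
user=postgres
password=*****
```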
Using pipeline generator
Generate pipeline and metadata
The generator takes 3 command line parameters:
1. GitHub URL or a local path to the containerized app. In the root directory of the path, the generator will look for a file named `app.config.yaml`.
2. Name of the output file for the generated pipeline.
3. Branch of the app repository to use.
If you use a local Python virtual environment, then run:
```shell
python -m dorieh.platform.apprunner.app_run_generator $GitHubURL $outputfile $branch
```
Example:
```shell
python -m dorieh.platform.apprunner.app_run_generator https://github.com/NSAPH-Data-Processing/zip2zcta_master_xwalk.git pipeline.cwl master
```
This generates a pipeline executing the ZIP to ZCTA Crosswalk Producer app using the master branch and outputs the result into the current directory in a file named `pipeline.cwl`.
Alternatively, to do the same using a Docker container, execute:
```shell
docker run -v $(pwd):/tmp/work forome/dorieh python -m dorieh.platform.apprunner.app_run_generator https://github.com/NSAPH-Data-Processing/zip2zcta_master_xwalk.git /tmp/work/pipeline.cwl master
```
In both cases, the generator will produce 3 files:
- `pipeline.cwl`: main workflow file
- `ingest.cwl`: subworkflow used for data ingestion
- `common.yaml`: metadata required for ingestion. The name `common` is derived from the `domain` key in the `app.config.yaml` file in the app repository.
Execute generated pipeline
If you installed the Dorieh packages in your local Python virtual environment, you can execute the pipeline with the following command in the working directory, for example using the CWL reference implementation included with Dorieh (cwl-runner):
```shell
cwl-runner pipeline.cwl --database $path_to_your_connection_def_file --connection_name $connection_name
```
For example:
```shell
cwl-runner pipeline.cwl --database ../../database.ini --connection_name dorieh
```
A better way is to use a production-grade CWL implementation such as Toil. To do this, you need to install Toil on your local system.
You do not need to install the Dorieh packages to execute the pipeline. The runtime engine will use the Dorieh container, where all requirements are preinstalled.
For Toil, it is good advice to first create a working directory, e.g. one named `work`. Otherwise, Toil will create a default directory somewhere in your temporary space.
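For example, before the first run:

```shell
# Create the working directory (and a directory for results) up front
mkdir -p work results
```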
The command to execute the pipeline with Toil would be:
```shell
toil-cwl-runner --retryCount 0 --cleanWorkDir never --jobStore j1 --outdir results --workDir work pipeline.cwl --database ../../database.ini --connection_name nsaph-docker
```
Specifying `jobStore` will let you restart the pipeline from the point of failure if pipeline execution fails for any reason.
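For instance, after a failure the pipeline can be resumed with Toil's --restart option and the same job store. This is a sketch; consult the Toil documentation for details:

```shell
# Resume the failed run from job store j1 instead of starting over
toil-cwl-runner --restart --jobStore j1 --outdir results --workDir work \
  pipeline.cwl --database ../../database.ini --connection_name nsaph-docker
```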
Appendix 1: Metadata description
File app.config.yaml
Keys:
- `metadata`: a relative path to the app metadata file
- `dorieh-metadata`: a relative path to the metadata required to create a database table
- `docker`: information about the Docker container, including:
  - `image`: the tag for the container image that executes the app
  - `run`: the command to be run within the container. This is an optional field.
- `outputdir`: directory where to look for the results of the execution of the app
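For illustration, a hypothetical `app.config.yaml` following the keys above. All values (paths, image tag, command) are assumptions, not taken from a real app; the `domain` key shown here is the one mentioned earlier that determines the name of the generated metadata file (e.g. `common.yaml`):

```yaml
# Hypothetical example; all values are illustrative assumptions
metadata: metadata.yml                 # relative path to the app metadata file
dorieh-metadata: dorieh-metadata.yaml  # relative path to the table metadata
domain: common                         # name used for the generated <domain>.yaml
docker:
  image: nsaph/example-app:latest      # image that executes the app
  run: python run.py                   # optional command to run in the container
outputdir: data/output                 # where to look for the app's results
```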
File metadata.yml
This is the file referenced from `app.config.yaml` by the `metadata` key.
It should contain the following keys:
- `dataset_name`
- `description`
- `fields`
- `table`
- `columns`

Each column should have `name`, `type`, and `description` keys.
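A hypothetical `metadata.yml` sketch using the keys above; the exact nesting of `fields`, `table`, and `columns` should be checked against a real app, and all names and types here are illustrative assumptions:

```yaml
# Hypothetical example; names and types are illustrative assumptions
dataset_name: zip2zcta_crosswalk
description: Crosswalk between ZIP codes and ZCTA identifiers
table: zip2zcta
columns:
  - name: zip
    type: varchar
    description: ZIP code
  - name: zcta
    type: varchar
    description: ZIP Code Tabulation Area identifier
```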
File dorieh-metadata.yaml
This is a header for a knowledge domain that will be created.
A detailed description is provided in the Data modeling section. It is important to define correct values for `quoting`, `schema`, and `primary_key` for each table.
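A hypothetical sketch of such a header, assuming the domain name `common` from the example above; the exact set of keys and their values is defined in the Data modeling section, so treat this only as an illustration:

```yaml
# Hypothetical example; schema name, quoting value, and primary key
# are illustrative assumptions
common:
  schema: common        # database schema for the domain's tables
  quoting: 3            # CSV quoting mode used during ingestion
  tables:
    zip2zcta:
      primary_key:
        - zip
        - zcta
```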