Example of a workflow: aggregating a climate variable

What the sample workflow is doing

In this example we will run a simple Common Workflow Language (CWL) workflow. The workflow produces a CSV file with a climate variable aggregated over either US postal ZIP codes or US counties. The CSV file consists of three columns: variable value, date, and zcta (ZIP Code Tabulation Area). If US counties are used, the third column contains the county FIPS code instead.

tmmx,date,zcta
295.36935960591137,2020-10-03,35592
293.6454457364341,2020-10-03,35616
...

The workflow consists of 3 steps:

  1. Downloads a NetCDF file with gridMET data from the Atmospheric Composition Analysis Group

  2. Downloads a shapefile set for the given geography type (ZCTA or county) and date

  3. Aggregates the NetCDF data over the polygons corresponding to the given geography

It accepts 1 required and 3 optional input arguments:

  • date - a required argument specifying the date for which aggregations will be computed.

  • band - an optional argument specifying a climate variable. By default it is tmmx, the maximum daily temperature. Other options can be found on the Google Earth Engine website.

  • geography - a geography type, zcta (ZIP code) or county; by default it is ZCTA.

  • ram - the amount of memory to be used for aggregation; by default it is 2 GB. Specifying more RAM improves the accuracy of the aggregation.

This architecture is reflected in the workflow diagram.

The source code for the workflow is in the examples directory. See more details in the CWL Workflow Specifications.

Prepare to run a workflow

We suggest that you create a Python virtual environment for trying this workflow, or use an existing one. If you are creating a new virtual environment, run the following commands:

python3 -m venv $path
source $path/bin/activate

where $path is a path to a directory that will be created and where the new virtual environment will reside.

To run the workflow you need to install a CWL implementation. We suggest using Toil. To install Toil, run the following command in your Python virtual environment:

pip install "toil[cwl,aws]"

Running the workflow in Python virtual environment

To run the workflow in the Python virtual environment, you need to install the dorieh package:

pip install dorieh

There is one required argument to the workflow - the date for which we will be aggregating climate data. Once the package is installed, you can run the following command:

toil-cwl-runner --retryCount 1 --cleanWorkDir never --outdir tmmx --workDir . \
https://raw.githubusercontent.com/NSAPH-Data-Platform/dorieh/main/examples/climate-example.cwl \
--date 2020-10-03

(Replace the date with any date you fancy)
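The optional arguments described above can be supplied on the same command line. For example, here is a sketch of a run that aggregates a different band over counties with more memory; the values pr, county, and 4GB are illustrative assumptions, so check the inputs section of climate-example.cwl for the exact accepted values:

toil-cwl-runner --retryCount 1 --cleanWorkDir never --outdir pr_county --workDir . \
https://raw.githubusercontent.com/NSAPH-Data-Platform/dorieh/main/examples/climate-example.cwl \
--date 2020-10-03 --band pr --geography county --ram 4GB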

Running the workflow using Docker

Dorieh Docker image

A prebuilt Docker image with Dorieh is available from DockerHub. Pull it to your local machine using

docker pull forome/dorieh

command. The image is built for both Intel/AMD and ARM CPUs. The ARM architecture is used in AWS Graviton2 processors that, according to AWS, deliver up to 40% better price performance. ARM CPUs are also used by the latest Mac computers.
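If you want to confirm which variant of the image was pulled for your machine, docker image inspect can report the platform (an optional check, not required for the workflow):

docker image inspect --format '{{.Os}}/{{.Architecture}}' forome/dorieh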

There are two ways to use Docker instead of installing the dorieh package in your Python virtual environment. The recommended way is to specify DockerRequirement (see also the Using Containers section of the CWL User Guide). An alternative is to manually run the commands inside a running Docker container. While much more cumbersome, this alternative does not require installing any CWL implementation, or even Python.

Using DockerRequirement for your workflow

When using DockerRequirement, you still need an engine that supports CWL, e.g., Toil. Therefore, unless you have already done so, you need to create a Python virtual environment and install a CWL implementation there, for example with the following command:

pip install "toil[cwl,aws]"

Then you need to add DockerRequirement to your workflow. In the climate example workflow, uncomment the following 3 lines (lines 34-36):

#hints:
#  DockerRequirement:
#    dockerPull: forome/dorieh

So you have:

hints:
  DockerRequirement:
    dockerPull: forome/dorieh

You can now run the workflow with the same command:

toil-cwl-runner --retryCount 1 --cleanWorkDir never --outdir tmmx --workDir . \
https://raw.githubusercontent.com/NSAPH-Data-Platform/dorieh/main/examples/climate-example.cwl \
--date 2020-10-03

even without having the dorieh package installed in your Python virtual environment.
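After a successful run, the aggregated CSV is written under the directory given by --outdir (tmmx in the commands above). A quick way to preview it, assuming those defaults (the exact file name inside the directory depends on the run):

head -n 5 tmmx/*.csv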

Using your Docker container manually

You can also use a Docker container more like a virtual machine. Because docker exec needs a running container, first create and start one from the image, giving it a name (dorieh below is an arbitrary name):

docker run -d --name dorieh forome/dorieh sleep infinity

The sleep infinity command simply keeps the container running; adjust it if the image defines its own long-running entrypoint. You can then run the commands inside the container, using

docker exec -it dorieh ${commands}

This way you can use any machine that has Docker, without needing either Python or a CWL implementation. However, you will need to copy your files manually between the host and the container.
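As an illustrative sketch of that manual copying (the file names and paths below are hypothetical, not part of the image), docker cp moves files between the host and the named container:

# copy an input file from the host into the running container
docker cp ./input.yaml dorieh:/tmp/input.yaml

# run a command inside the container (example command only)
docker exec -it dorieh ls -l /tmp

# copy results back from the container to the host
docker cp dorieh:/tmp/output.csv ./output.csv

# stop and remove the container when you are done
docker rm -f dorieh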