Tutorial: Building a Bronze–Silver–Gold Climate Pipeline with Dorieh

Introduction:

This tutorial walks you through building a small but complete Dorieh‑based data processing workflow. It models a realistic process of incrementally adding features to a workflow, keeping it runnable at every step, with the ultimate goal of producing a production-ready dataset.

The eventual pipeline:

  • Downloads gridded daily climate data.

  • Aggregates it over ZIP Code Tabulation Areas (ZCTAs).

  • Loads the result into PostgreSQL.

  • Builds Bronze, Silver, and Gold layers using the Dorieh data‑modeling DSL.

  • Generates documentation and data lineage diagrams.

The same design patterns apply directly to health and claims data pipelines; here we use open climate data so anyone can reproduce the example.

The concepts in this tutorial are used in the “Data Provenance for Secondary Use of Health Data through Dataflow Programming” book in the chapter named “Sample Application: Building ML‑Ready Datasets”.

Prerequisites

You will need:

  • A Unix‑like environment (Linux or macOS).

  • Python 3.9+.

  • A CWL runner, e.g.:

    • toil-cwl-runner (used in examples),

    • or cwltool.

  • PostgreSQL, or Docker to run the provided container image.

  • graphviz (for lineage diagrams) and pandoc (optional, for HTML docs).

Dorieh must be installed as described in the Example.

For convenience, the full list of commands is also provided below:

# Create a virtual environment and activate it.
python3 -m venv $path
source $path/bin/activate

# Install Toil and its dependencies.
pip install "toil[cwl,aws]"

# Install Dorieh and its dependencies.
pip install dorieh
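
Before proceeding, it is worth verifying the installation (a quick sanity check; both commands should print a version or success message):

# Confirm the CWL runner is on the PATH and Dorieh is importable.
toil-cwl-runner --version
python -c "import dorieh; print('dorieh OK')"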

Design overview

We briefly outline the pipeline design before touching any code.

Inputs

  • Climate variable: daily maximum temperature (tmmx) from a gridded dataset (e.g., GRIDMET / TerraClimate via Google Earth Engine / NKN).

Important

As we progress with building the workflow, we will see that we also need an additional input: a shapefile with ZIP Code Tabulation Areas (ZCTAs).

Outputs

  • File output: a compressed CSV with columns: date, zcta, and tmmx (Kelvin).

  • Database outputs:

    • Bronze table: bronze_temperature(date, zcta, tmmx).

    • Silver view: silver_temperature with:

      • temperature_in_C, temperature_in_F

      • us_state, city

    • Gold materialized view: gold_temperature_by_state with mean temperature and temperature span per state/day.

Architecture

We will:

  • Use CWL to orchestrate:

    • climate download → shape download → spatial aggregation → DB init → ingestion → Silver/Gold creation.

  • Use the Dorieh data‑modeling DSL (YAML) to define:

    • Bronze/Silver/Gold schema and transformations.

  • Use Dorieh utilities to generate:

    • Workflow documentation (from CWL).

    • Data dictionaries and lineage diagrams (from YAML).

Directory layout

Create a working directory for this tutorial, for example:

mkdir -p tmp/dorieh/tutorials/climate
cd tmp/dorieh/tutorials/climate

We will place:

  • example1.cwl – the CWL workflow.

  • example1_model.yml – the domain definition for Bronze/Silver/Gold.

  • (Optionally) a local database.ini if you need to adjust it for your environment.

Step 1. Create a minimal CWL workflow skeleton

We will start with a minimal CWL workflow definition containing the initial steps: data acquisition and aggregation. At this stage, placeholders can be used for inputs and outputs; these will be filled in as more details on the required tool parameters are gathered.

Dorieh provides a library of prebuilt CWL tool definitions, streamlining common steps such as downloading remote files and running aggregations. The corresponding tools used here are download.cwl and aggregate_daily.cwl.

The initial workflow skeleton can look like:

#!/usr/bin/env cwl-runner

cwlVersion: v1.2
class: Workflow

requirements:
  SubworkflowFeatureRequirement: {}
  StepInputExpressionRequirement: {}
  InlineJavascriptRequirement: {}
  ScatterFeatureRequirement: {}
  MultipleInputFeatureRequirement: {}
  NetworkAccess:
    networkAccess: True

inputs: {}

steps:
  download:
    run: https://raw.githubusercontent.com/ForomePlatform/dorieh/main/src/cwl/download.cwl

  aggregate:
    run: https://raw.githubusercontent.com/ForomePlatform/dorieh/main/src/cwl/aggregate_daily.cwl

outputs: {}

This skeleton is not yet runnable. It defines two steps but no inputs, outputs, or wiring.

Step 2. Iteratively Defining Steps and Parameters

At this stage of the implementation we need to:

  • For each workflow step, consult the underlying tool’s documentation or CWL file to determine:

    • Required input parameters

    • Output files and directory structure

  • Incrementally add inputs (e.g., band, year, geography), outputs, and wiring between steps (using CWL’s in and out sections).

Dorieh’s cwl_collect_outputs utility helps automate the extraction of output specifications and minimize errors.

To illustrate the process, we will first look at the download step. The tool we are using is Dorieh’s download.cwl.

We see that it takes three input parameters:

  • proxy settings,

  • the year for which we will be performing data aggregation,

  • the meteorological band, or variable, that we would like to aggregate.

If you are using a system with direct access to the Internet, you probably do not need a proxy. However, we need to take care of ‘year’ and ‘band’.

After adding these two input parameters to the inputs section and the download step, the workflow file will look like:

#!/usr/bin/env cwl-runner

cwlVersion: v1.2
class: Workflow

requirements:
  SubworkflowFeatureRequirement: {}
  StepInputExpressionRequirement: {}
  InlineJavascriptRequirement: {}
  ScatterFeatureRequirement: {}
  MultipleInputFeatureRequirement: {}
  NetworkAccess:
    networkAccess: True

inputs:
  band:
    type: string
  year:
    type: string

steps:
  download:
    run: https://raw.githubusercontent.com/ForomePlatform/dorieh/main/src/cwl/download.cwl
    in:
      year: year
      band: band

  aggregate:
    run: https://raw.githubusercontent.com/ForomePlatform/dorieh/main/src/cwl/aggregate_daily.cwl

outputs: {}

Though we have now defined the inputs for the first step, the workflow is still not runnable because we have not yet defined the outputs. As mentioned, Dorieh’s cwl_collect_outputs utility assists with this task. You can run:

python -m dorieh.platform.util.cwl_collect_outputs download https://raw.githubusercontent.com/ForomePlatform/dorieh/refs/heads/main/src/cwl/download.cwl

The command should produce the following output:

    out:
      - log
      - data
      - errors
## Generated by dorieh.platform.util.cwl_collect_outputs from download.cwl:
  download_log:
    type: File?
    outputSource: download/log
  download_data:
    type: File?
    outputSource: download/data
  download_errors:
    type: File
    outputSource: download/errors

From the standard output of this command, copy the “out” section to the step and the rest to the “outputs” section. You can exclude the outputs you want to ignore. The resulting workflow file should look like:

cwlVersion: v1.2
class: Workflow

requirements:
  SubworkflowFeatureRequirement: {}
  StepInputExpressionRequirement: {}
  InlineJavascriptRequirement: {}
  ScatterFeatureRequirement: {}
  MultipleInputFeatureRequirement: {}
  NetworkAccess:
    networkAccess: True
inputs:
  band:
    type: string
  year:
    type: string

steps:
  download:
    run: https://raw.githubusercontent.com/ForomePlatform/dorieh/main/src/cwl/download.cwl
    in:
      year: year
      band: band
    out:
      - log
      - data
      - errors

  aggregate:
    run: https://raw.githubusercontent.com/ForomePlatform/dorieh/main/src/cwl/aggregate_daily.cwl

outputs:
  ## Generated by dorieh.platform.util.cwl_collect_outputs from download.cwl:
  download_log:
    type: File?
    outputSource: download/log
  download_data:
    type: File?
    outputSource: download/data
  download_errors:
    type: File
    outputSource: download/errors

Now we will repeat the process for the aggregate step. Looking at the tool we are using, aggregate_daily.cwl, we notice that besides the input parameters we have already discussed, it also requires shape files. Dorieh provides another CWL tool, get_shapes.cwl, to download the shapefiles for US geographies (ZCTAs and counties) from the US Census website.

Therefore, we need to add a third step to the existing two:

  get_shapes:
    run: https://raw.githubusercontent.com/ForomePlatform/dorieh/main/src/cwl/get_shapes.cwl

This brings us to the following workflow definition:

cwlVersion: v1.2
class: Workflow

requirements:
  SubworkflowFeatureRequirement: {}
  StepInputExpressionRequirement: {}
  InlineJavascriptRequirement: {}
  ScatterFeatureRequirement: {}
  MultipleInputFeatureRequirement: {}
  NetworkAccess:
    networkAccess: True
inputs:
  band:
    type: string
  year:
    type: string

steps:
  download:
    run: https://raw.githubusercontent.com/ForomePlatform/dorieh/main/src/cwl/download.cwl
    in:
      year: year
      band: band
    out:
      - log
      - data
      - errors

  get_shapes:
    run: https://raw.githubusercontent.com/ForomePlatform/dorieh/main/src/cwl/get_shapes.cwl

  aggregate:
    run: https://raw.githubusercontent.com/ForomePlatform/dorieh/main/src/cwl/aggregate_daily.cwl

outputs:
  ## Generated by dorieh.platform.util.cwl_collect_outputs from download.cwl:
  download_log:
    type: File?
    outputSource: download/log
  download_data:
    type: File?
    outputSource: download/data
  download_errors:
    type: File
    outputSource: download/errors

After repeating the process described above for the remaining two steps (get_shapes and aggregate) we finally have a runnable workflow definition:

#!/usr/bin/env cwl-runner

cwlVersion: v1.2
class: Workflow

requirements:
  SubworkflowFeatureRequirement: {}
  StepInputExpressionRequirement: {}
  InlineJavascriptRequirement: {}
  ScatterFeatureRequirement: {}
  MultipleInputFeatureRequirement: {}
  NetworkAccess:
    networkAccess: True
inputs:
  band:
    type: string
  year:
    type: string
  geography:
    type: string

steps:
  download:
    run: https://raw.githubusercontent.com/ForomePlatform/dorieh/main/src/cwl/download.cwl
    in:
      year: year
      band: band
    out:
      - log
      - data
      - errors

  get_shapes:
    run: https://raw.githubusercontent.com/ForomePlatform/dorieh/main/src/cwl/get_shapes.cwl
    in:
      geo: geography
      year: year
    out:
      - shape_files

  aggregate:
    run: https://raw.githubusercontent.com/ForomePlatform/dorieh/main/src/cwl/aggregate_daily.cwl
    in:
      geography: geography
      year: year
      band: band
      input: download/data
      shape_files: get_shapes/shape_files
    out:
      - log
      - data
      - errors

outputs:
  ## Generated by dorieh.platform.util.cwl_collect_outputs from download.cwl:
  download_log:
    type: File?
    outputSource: download/log
  download_data:
    type: File?
    outputSource: download/data
  download_errors:
    type: File
    outputSource: download/errors
  ## Generated by dorieh.platform.util.cwl_collect_outputs from get_shapes.cwl:
  get_shapes_shape_files:
    type: File[]
    outputSource: get_shapes/shape_files
  ## Generated by dorieh.platform.util.cwl_collect_outputs from aggregate_daily.cwl:
  aggregate_log:
    type: File?
    outputSource: aggregate/log
  aggregate_data:
    type: File?
    outputSource: aggregate/data
  aggregate_errors:
    type: File
    outputSource: aggregate/errors

At this point, the workflow can aggregate a full year of data, but that is slow for development. Next we add a “toy slice” mode.

If you prefer to test the current workflow, and if you do not mind waiting hours for its completion, you can run it with the following command:

toil-cwl-runner --retryCount 3 --cleanWorkDir never --outdir outputs example1.cwl --workDir . --band tmmx --year 2019 --geography zcta

Step 3. Parameterize for a single day (“toy” run)

To speed up development and debugging, we can parameterize the workflow so that it can run on “toy” datasets—for example, filtering a single day (e.g., 2019-01-15) instead of an entire year. Running your pipeline on a minimal dataset ensures quick feedback, uncovers errors early, and avoids wasting compute resources.

First, we need to modify the inputs section of the workflow:

inputs:
  band:
    type: string
  date:
    type: string    # e.g. "2019-01-15"
  geography:
    type: string    # "zcta" or "county"

We must keep in mind that such parameterization often requires additional input transformations (e.g., extracting the year from a date); hence, we need to insert transformation steps or chain values via CWL’s valueFrom mechanism.
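
To illustrate the mechanism before we apply it, here is a minimal sketch of a valueFrom expression on a step input. The step and tool names are hypothetical, not part of our workflow; StepInputExpressionRequirement and InlineJavascriptRequirement, which our workflow already declares, are required for this to work:

  some_step:
    run: some_tool.cwl          # hypothetical tool that expects a `year` input
    in:
      year:
        source: date            # the workflow-level input, e.g. "2019-01-15"
        valueFrom: $(self.split("-")[0])   # evaluates to "2019"
    out: []

In our workflow we instead use a dedicated parse_date.cwl step, which makes the transformation an explicit node in the workflow graph.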

In our case, to allow the workflow to run on a toy dataset, we will first replace the input parameter year with date. Our downstream steps, however, require a year, e.g., to download the right shape file. Hence, we will add an additional step to extract the year from the date:

  extract_year:
    run: https://raw.githubusercontent.com/ForomePlatform/dorieh/main/src/cwl/parse_date.cwl
    in:
      date: date
    out:
      - year

With these changes in place, the workflow is now:

#!/usr/bin/env cwl-runner

cwlVersion: v1.2
class: Workflow

requirements:
  SubworkflowFeatureRequirement: {}
  StepInputExpressionRequirement: {}
  InlineJavascriptRequirement: {}
  ScatterFeatureRequirement: {}
  MultipleInputFeatureRequirement: {}
  NetworkAccess:
    networkAccess: True
inputs:
  band:
    type: string
  date:
    type: string
  geography:
    type: string

steps:
  extract_year:
    run: https://raw.githubusercontent.com/ForomePlatform/dorieh/main/src/cwl/parse_date.cwl
    in:
      date: date
    out:
      - year

  download:
    run: https://raw.githubusercontent.com/ForomePlatform/dorieh/main/src/cwl/download.cwl
    in:
      year: extract_year/year
      band: band
    out:
      - log
      - data
      - errors

  get_shapes:
    run: https://raw.githubusercontent.com/ForomePlatform/dorieh/main/src/cwl/get_shapes.cwl
    in:
      geo: geography
      year: extract_year/year
    out:
      - shape_files

  aggregate:
    run: https://raw.githubusercontent.com/ForomePlatform/dorieh/main/src/cwl/aggregate_daily.cwl
    in:
      geography: geography
      year: extract_year/year
      dates: date
      band: band
      input: download/data
      shape_files: get_shapes/shape_files
      strategy:
        valueFrom: default
    out:
      - log
      - data
      - errors

outputs:
  ## Generated by dorieh.platform.util.cwl_collect_outputs from download.cwl:
  download_log:
    type: File?
    outputSource: download/log
  download_data:
    type: File?
    outputSource: download/data
  download_errors:
    type: File
    outputSource: download/errors
  ## Generated by dorieh.platform.util.cwl_collect_outputs from get_shapes.cwl:
  get_shapes_shape_files:
    type: File[]
    outputSource: get_shapes/shape_files
  ## Generated by dorieh.platform.util.cwl_collect_outputs from aggregate_daily.cwl:
  aggregate_log:
    type: File?
    outputSource: aggregate/log
  aggregate_data:
    type: File?
    outputSource: aggregate/data
  aggregate_errors:
    type: File
    outputSource: aggregate/errors

It can be run with the following command:

toil-cwl-runner --retryCount 3 --cleanWorkDir never --outdir outputs example1.cwl --workDir . --band tmmx --date 2019-01-15 --geography zcta

If successful, you should find a gzipped CSV file named tmmx_zcta_polygon_2019.csv.gz in the outputs directory, containing the following columns:

  • date

  • zcta

  • tmmx

This is exactly what we will ingest into the Bronze layer.
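
To spot-check the file before ingesting it anywhere, you can peek at the first few rows (a quick sketch; the exact file location depends on your --outdir setting):

gunzip -c outputs/tmmx_zcta_polygon_2019.csv.gz | head -n 5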

If the workflow fails, follow the troubleshooting steps described in the Troubleshooting documentation.

Step 4. Add database integration (PostgreSQL)

Start or check PostgreSQL

If you already have a PostgreSQL database running, you will need to create a database.ini file as described in the Documentation.

Otherwise, you can start a PostgreSQL container using Docker:

git clone https://github.com/ForomePlatform/dorieh.git
cd dorieh/docker/pg-hll
docker compose up -d

In the latter case use the provided database.ini file.
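
For reference, database.ini is an INI file with one section per named connection; the section name is what you pass as connection_name. A minimal sketch with placeholder values (consult the Dorieh documentation for the exact set of supported keys):

[localhost]
host=localhost
port=5432
database=<your database>
user=<your user>
password=<your password>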

Add PostgreSQL integration to the workflow

Back in your tutorial directory (tmp/dorieh/tutorials/climate), add two new workflow inputs to example1.cwl:

inputs:
  band:
    type: string
  date:
    type: string
  geography:
    type: string
  database:
    type: File
    default:
      class: File
      location: https://raw.githubusercontent.com/ForomePlatform/dorieh/main/examples/with-postgres/database.ini
  connection_name:
    type: string
    default: "localhost"

Add the Database initialization step

Another required action is to ensure that the database contains the latest Dorieh code for PostgreSQL. This is done by adding an initdb step:

  initdb:
    run: https://raw.githubusercontent.com/ForomePlatform/dorieh/main/src/cwl/initcoredb.cwl
    in:
      database: database
      connection_name: connection_name
    out:
      - log
      - err

Optionally, though we recommend it, add the outputs of the initdb step to the pipeline outputs:

outputs:
  # ... existing outputs ...

  initdb_log:
    type: File
    outputSource: initdb/log
  initdb_err:
    type: File
    outputSource: initdb/err

Defining Data Model

However, to load the data into a database, we also need to define the database schema. It is possible to infer the schema automatically using Dorieh tools such as the Project Loader and the Introspector, but for a Medallion architecture the schema should be defined explicitly; we will then use it with the Data Loader tool.

The data model definition language is described in the Data Model documentation.

Initially, we will define a simple schema for the Bronze layer:

tutorial:
  header: true
  quoting: 3
  index: "unless excluded"
  description: "Data model for data transformation tutorial"
  tables:
    bronze_temperature:
      description: |
        Maximum daily temperature for US Zip Code Tabulation Areas
      columns:
        - tmmx:
            type: float
            description: Maximum temperature variable from the TerraClimate dataset in K
            reference: https://developers.google.com/earth-engine/datasets/catalog/IDAHO_EPSCOR_GRIDMET#bands
        - date:
            type: date
        - zcta:
            description: Zip Code for a Zip Code Tabulation Area
            type: int
      primary_key:
        - zcta
        - date

It defines a data domain named “tutorial” containing a table named “bronze_temperature” with three columns: tmmx, date, and zcta. We will assume that the file name is example1_model.yml.
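
For intuition, this Bronze definition corresponds roughly to DDL of the following shape (a sketch only; the actual SQL, schema naming, and index statements are generated by the Dorieh tooling and may differ):

CREATE TABLE tutorial.bronze_temperature (
    tmmx FLOAT,   -- daily maximum temperature, Kelvin
    date DATE,
    zcta INT,     -- ZIP Code Tabulation Area
    PRIMARY KEY (zcta, date)
);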

Adding Ingestion Step

Now we can add the actual ingestion step to the workflow using the Dorieh ingest tool:

  ingest:
    run: https://raw.githubusercontent.com/ForomePlatform/dorieh/main/src/cwl/ingest.cwl
    in:
      depends_on: initdb/log
      registry:
        default:
          class: File
          location: "https://raw.githubusercontent.com/ForomePlatform/dorieh/main/doc/tutorial/example1_model.yml"
      domain:
        valueFrom: "tutorial"
      table:
        valueFrom: "bronze_temperature"
      input: aggregate/data
      database: database
      connection_name: connection_name
    out:
      - log
      - errors

After adding ingestion to steps and the logs it produces to the outputs, the resulting workflow file should look like:

#!/usr/bin/env cwl-runner

cwlVersion: v1.2
class: Workflow

requirements:
  SubworkflowFeatureRequirement: {}
  StepInputExpressionRequirement: {}
  InlineJavascriptRequirement: {}
  ScatterFeatureRequirement: {}
  MultipleInputFeatureRequirement: {}
  NetworkAccess:
    networkAccess: True
inputs:
  band:
    type: string
  date:
    type: string
  geography:
    type: string
  database:
    type: File
    default:
      class: File
      location: https://raw.githubusercontent.com/ForomePlatform/dorieh/refs/heads/main/examples/with-postgres/database.ini
  connection_name:
    type: string
    default: "localhost"

steps:
  extract_year:
    run: https://raw.githubusercontent.com/ForomePlatform/dorieh/main/src/cwl/parse_date.cwl
    in:
      date: date
    out:
      - year

  download:
    run: https://raw.githubusercontent.com/ForomePlatform/dorieh/main/src/cwl/download.cwl
    in:
      year: extract_year/year
      band: band
    out:
      - log
      - data
      - errors

  get_shapes:
    run: https://raw.githubusercontent.com/ForomePlatform/dorieh/main/src/cwl/get_shapes.cwl
    in:
      geo: geography
      year: extract_year/year
    out:
      - shape_files

  aggregate:
    run: https://raw.githubusercontent.com/ForomePlatform/dorieh/main/src/cwl/aggregate_daily.cwl
    in:
      geography: geography
      year: extract_year/year
      dates: date
      band: band
      input: download/data
      shape_files: get_shapes/shape_files
      strategy:
        valueFrom: default
    out:
      - log
      - data
      - errors

  initdb:
    run: https://raw.githubusercontent.com/ForomePlatform/dorieh/main/src/cwl/initcoredb.cwl
    in:
      database: database
      connection_name: connection_name
    out:
      - log
      - err

  ingest:
    run: https://raw.githubusercontent.com/ForomePlatform/dorieh/main/src/cwl/ingest.cwl
    in:
      depends_on: initdb/log
      registry:
        default:
          class: File
          location: "https://raw.githubusercontent.com/ForomePlatform/dorieh/main/doc/tutorial/climate/example1_model.yml"
      domain:
        valueFrom: "tutorial"
      table:
        valueFrom: "bronze_temperature"
      input: aggregate/data
      database: database
      connection_name: connection_name
    out:
      - log
      - errors

outputs:
  ## Generated by dorieh.platform.util.cwl_collect_outputs from download.cwl:
  download_log:
    type: File?
    outputSource: download/log
  download_data:
    type: File?
    outputSource: download/data
  download_errors:
    type: File
    outputSource: download/errors
  ## Generated by dorieh.platform.util.cwl_collect_outputs from get_shapes.cwl:
  get_shapes_shape_files:
    type: File[]
    outputSource: get_shapes/shape_files
  ## Generated by dorieh.platform.util.cwl_collect_outputs from aggregate_daily.cwl:
  aggregate_log:
    type: File?
    outputSource: aggregate/log
  aggregate_data:
    type: File?
    outputSource: aggregate/data
  aggregate_errors:
    type: File
    outputSource: aggregate/errors

  initdb_log:
    type: File
    outputSource: initdb/log
  initdb_err:
    type: File
    outputSource: initdb/err

  ingest_log:
    type: File
    outputSource: ingest/log
  ingest_err:
    type: File
    outputSource: ingest/errors

The same command as before can be used to run the workflow:

toil-cwl-runner --retryCount 3 --cleanWorkDir never --outdir outputs example1.cwl --workDir . --band tmmx --date 2019-01-15 --geography zcta

If you choose to perform this run, then besides the CSV file, you will also see the data in the database, in the table bronze_temperature.
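
Before moving on to the Silver and Gold layers, a quick sanity check on the Bronze table can save debugging time later (run these in psql; the expected ranges are indicative only):

-- Expect one row per ZCTA for the chosen date:
SELECT COUNT(*) FROM bronze_temperature;

-- Daily maxima in Kelvin should fall in a plausible range (roughly 230-330):
SELECT MIN(tmmx), MAX(tmmx) FROM bronze_temperature;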

Step 5. Building Medallion Layers (Bronze, Silver, Gold)

Medallion architecture defines three layers:

  • Bronze Layer: Load as-is, minimally processed data into the database from pipeline outputs.

    • In this climate data example, the “raw” data is not strictly straight-from-source: NetCDF data cannot be ingested directly into most DBMSs unless specialized extensions are installed, so an initial aggregation is necessary for technical compatibility. Hence, we need to transform the data into a more conventional tabular format before ingestion; this is exactly the operation performed by the aggregation step.

  • Silver Layer: Clean, harmonize, and enrich data.

    • Built from the Bronze layer (no external inputs are allowed).

    • Add derived columns (e.g., Celsius/Fahrenheit conversions, state & city annotation via ZIP lookup).

    • Define as a view on top of the bronze table in your data model YAML.

  • Gold Layer: Produce analytic/ML-ready outputs.

    • Built from the Silver layer.

    • Perform groupings and summaries (e.g., aggregating by state and date).

    • Use materialized views where performance and reusability are required.

The Silver layer is usually responsible for normalization, harmonization, and cleansing of the data. In our example the data is already clean and normalized, and it does not require harmonization because it comes from a single source. However, even when source data is already clean, enriching it with derived fields or external metadata (such as regional names) increases its analytic utility and facilitates downstream ML feature engineering. Hence, we illustrate enrichment by adding columns that express the temperature in different units and annotate ZIP codes with US state abbreviations and, for areas that lie within a city, the city name.

Based on the discussion above, we will add a corresponding step for the Silver layer:

  build_silver:
    run: https://raw.githubusercontent.com/ForomePlatform/dorieh/main/src/cwl/create.cwl
    in:
      depends_on: ingest/log
      registry:
        default:
          class: File
          location: "https://raw.githubusercontent.com/ForomePlatform/dorieh/main/doc/tutorial/climate/example1_model.yml"
      domain:
        valueFrom: "tutorial"
      table:
        valueFrom: "silver_temperature"
      database: database
      connection_name: connection_name
    out:
      - log
      - errors

This step builds a view named silver_temperature. We also need to describe it in the data model file. A best practice is to keep your Silver and Gold layer table/view definitions together in a versioned domain YAML file, checked into source control along with your workflow scripts.

Hence, we will add the following definition to example1_model.yml:

    silver_temperature:
      description: |
        Maximum daily temperature for US Zip Code Tabulation Areas, enriched and harmonized
      create:
        type: view
        from: bronze_temperature
      columns:
        - tmmx
        - date
        - zcta
        - temperature_in_C:
            type: float
            description: Temperature in Celsius
            source: (tmmx - 273.15)
        - temperature_in_F:
            type: float
            description: Temperature in Fahrenheit
            source: ((tmmx - 273.15)*9/5 + 32)
        - us_state:
            type: VARCHAR(2)
            description: US State
            source:  "public.zip_to_state(EXTRACT(YEAR FROM date)::INT, zcta)"
        - city:
            type: VARCHAR(128)
            description: >
              Name of a representative city for the ZIP Code Tabulation Area (ZCTA); 
              for ZCTAs spanning multiple cities, this is the city covering the largest 
              portion of the area or population.
            source:  "public.zip_to_city(EXTRACT(YEAR FROM date)::INT, zcta)"

This Silver view retains all three Bronze columns and adds four new ones (a rough SQL equivalent follows the list):

  • Temperature in degrees Celsius, for the benefit of readers outside the United States, computed by subtracting 273.15 from the Kelvin value.

  • Temperature in degrees Fahrenheit, for the benefit of US readers, computed by the standard conversion formula.

  • A two-letter abbreviation of the state in which the area lies, computed by calling the Dorieh built-in function zip_to_state.

  • The name of the city for the ZIP code, for urban areas, computed by calling the Dorieh built-in function zip_to_city.
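
For intuition, the view definition above corresponds roughly to SQL of this shape (a sketch only; Dorieh generates the actual statement, and the schema qualification may differ):

CREATE VIEW tutorial.silver_temperature AS
SELECT
    tmmx,
    date,
    zcta,
    (tmmx - 273.15)                AS temperature_in_C,
    ((tmmx - 273.15) * 9 / 5 + 32) AS temperature_in_F,
    public.zip_to_state(EXTRACT(YEAR FROM date)::INT, zcta) AS us_state,
    public.zip_to_city(EXTRACT(YEAR FROM date)::INT, zcta)  AS city
FROM tutorial.bronze_temperature;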

After you have made these changes to both files (example1.cwl and example1_model.yml), you are welcome to run the pipeline again using the same command as above. It will now also build the Silver layer.

Alternatively, we can add the Gold layer before testing.

In the Gold layer, we will add just one table, which computes state-level statistics and is ready for analysis. The table, named gold_temperature_by_state, is defined by the following block:

    gold_temperature_by_state:
      description: |
        Temperature variations by US State
      create:
        type: materialized view
        from: silver_temperature
        group by:
          - us_state
          - date
      columns:
        - us_state
        - date
        - t_span:
            type: float
            description: Temperature variation in Celsius
            source: MAX(tmmx) - MIN(tmmx)
        - t_mean_in_C:
            type: float
            description: Mean Temperature in Celsius
            source: AVG(temperature_in_C)
        - t_mean_in_F:
            type: float
            description: Mean Temperature in Fahrenheit
            source: AVG(temperature_in_F)
      primary_key:
        - us_state
        - date

The Gold table contains, for each US state and date, the mean temperature and the temperature variation across the state. The variation is computed over daily maximum temperatures, so it does not reflect change during the day, only the geographic diversity within the state.
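
Again for intuition, this materialized view corresponds roughly to the following SQL (a sketch only; Dorieh generates the actual statement):

CREATE MATERIALIZED VIEW tutorial.gold_temperature_by_state AS
SELECT
    us_state,
    date,
    MAX(tmmx) - MIN(tmmx) AS t_span,
    AVG(temperature_in_C) AS t_mean_in_C,
    AVG(temperature_in_F) AS t_mean_in_F
FROM tutorial.silver_temperature
GROUP BY us_state, date;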

We now need to add a step that builds the Gold layer to the workflow itself. The step is nearly identical to build_silver; the only difference is the target table name:

  build_gold:
    run: https://raw.githubusercontent.com/ForomePlatform/dorieh/main/src/cwl/create.cwl
    in:
      depends_on: build_silver/log
      registry:
        default:
          class: File
          location: "https://raw.githubusercontent.com/ForomePlatform/dorieh/main/doc/tutorial/climate/example1_model.yml"
      domain:
        valueFrom: "tutorial"
      table:
        valueFrom: "gold_temperature_by_state"
      database: database
      connection_name: connection_name
    out:
      - log
      - errors

The final version of the workflow is:

#!/usr/bin/env cwl-runner

cwlVersion: v1.2
class: Workflow

requirements:
  SubworkflowFeatureRequirement: {}
  StepInputExpressionRequirement: {}
  InlineJavascriptRequirement: {}
  ScatterFeatureRequirement: {}
  MultipleInputFeatureRequirement: {}
  NetworkAccess:
    networkAccess: True
inputs:
  band:
    type: string
  date:
    type: string
  geography:
    type: string
  database:
    type: File
    default:
      class: File
      location: https://raw.githubusercontent.com/ForomePlatform/dorieh/refs/heads/main/examples/with-postgres/database.ini
  connection_name:
    type: string
    default: "localhost"

steps:
  extract_year:
    run: https://raw.githubusercontent.com/ForomePlatform/dorieh/main/src/cwl/parse_date.cwl
    in:
      date: date
    out:
      - year

  download:
    run: https://raw.githubusercontent.com/ForomePlatform/dorieh/main/src/cwl/download.cwl
    in:
      year: extract_year/year
      band: band
    out:
      - log
      - data
      - errors

  get_shapes:
    run: https://raw.githubusercontent.com/ForomePlatform/dorieh/main/src/cwl/get_shapes.cwl
    in:
      geo: geography
      year: extract_year/year
    out:
      - shape_files

  aggregate:
    run: https://raw.githubusercontent.com/ForomePlatform/dorieh/main/src/cwl/aggregate_daily.cwl
    in:
      geography: geography
      year: extract_year/year
      dates: date
      band: band
      input: download/data
      shape_files: get_shapes/shape_files
      strategy:
        valueFrom: default
    out:
      - log
      - data
      - errors

  initdb:
    run: https://raw.githubusercontent.com/ForomePlatform/dorieh/main/src/cwl/initcoredb.cwl
    in:
      database: database
      connection_name: connection_name
    out:
      - log
      - err

  ingest:
    run: https://raw.githubusercontent.com/ForomePlatform/dorieh/main/src/cwl/ingest.cwl
    in:
      depends_on: initdb/log
      registry:
        default:
          class: File
          location: "https://raw.githubusercontent.com/ForomePlatform/dorieh/main/doc/tutorial/climate/example1_model.yml"
      domain:
        valueFrom: "tutorial"
      table:
        valueFrom: "bronze_temperature"
      input: aggregate/data
      database: database
      connection_name: connection_name
    out:
      - log
      - errors

  build_silver:
    run: https://raw.githubusercontent.com/ForomePlatform/dorieh/main/src/cwl/create.cwl
    in:
      depends_on: ingest/log
      registry:
        default:
          class: File
          location: "https://raw.githubusercontent.com/ForomePlatform/dorieh/main/doc/tutorial/climate/example1_model.yml"
      domain:
        valueFrom: "tutorial"
      table:
        valueFrom: "silver_temperature"
      database: database
      connection_name: connection_name
    out:
      - log
      - errors

  build_gold:
    run: https://raw.githubusercontent.com/ForomePlatform/dorieh/main/src/cwl/create.cwl
    in:
      depends_on: build_silver/log
      registry:
        default:
          class: File
          location: "https://raw.githubusercontent.com/ForomePlatform/dorieh/main/doc/tutorial/climate/example1_model.yml"
      domain:
        valueFrom: "tutorial"
      table:
        valueFrom: "gold_temperature_by_state"
      database: database
      connection_name: connection_name
    out:
      - log
      - errors

outputs:
  ## Generated by dorieh.platform.util.cwl_collect_outputs from download.cwl:
  download_log:
    type: File?
    outputSource: download/log
  download_data:
    type: File?
    outputSource: download/data
  download_errors:
    type: File
    outputSource: download/errors
  ## Generated by dorieh.platform.util.cwl_collect_outputs from get_shapes.cwl:
  get_shapes_shape_files:
    type: File[]
    outputSource: get_shapes/shape_files
  ## Generated by dorieh.platform.util.cwl_collect_outputs from aggregate_daily.cwl:
  aggregate_log:
    type: File?
    outputSource: aggregate/log
  aggregate_data:
    type: File?
    outputSource: aggregate/data
  aggregate_errors:
    type: File
    outputSource: aggregate/errors

  initdb_log:
    type: File
    outputSource: initdb/log
  initdb_err:
    type: File
    outputSource: initdb/err

  ingest_log:
    type: File
    outputSource: ingest/log
  ingest_err:
    type: File
    outputSource: ingest/errors

  build_silver_log:
    type: File
    outputSource: build_silver/log
  build_silver_err:
    type: File
    outputSource: build_silver/errors

  build_gold_log:
    type: File
    outputSource: build_gold/log
  build_gold_err:
    type: File
    outputSource: build_gold/errors

The final Medallion data model is:

tutorial:
  header: true
  quoting: 3
  index: "unless excluded"
  description: "Data model for data transformation tutorial"
  tables:
    bronze_temperature:
      description: |
        Maximum daily temperature for US Zip Code Tabulation Areas
      columns:
        - tmmx:
            type: float
            description: Maximum temperature variable from the TerraClimate dataset in K
            reference: https://developers.google.com/earth-engine/datasets/catalog/IDAHO_EPSCOR_GRIDMET#bands
        - date:
            type: date
        - zcta:
            description: Zip Code for a Zip Code Tabulation Area
            type: int
      primary_key:
        - zcta
        - date
    silver_temperature:
      description: |
        Maximum daily temperature for US Zip Code Tabulation Areas, enriched and harmonized
      create:
        type: view
        from: bronze_temperature
      columns:
        - tmmx
        - date
        - zcta
        - temperature_in_C:
            type: float
            description: Temperature in Celsius
            source: (tmmx - 273.15)
        - temperature_in_F:
            type: float
            description: Temperature in Fahrenheit
            source: ((tmmx - 273.15)*9/5 + 32)
        - us_state:
            type: VARCHAR(2)
            description: US State
            source:  "public.zip_to_state(EXTRACT(YEAR FROM date)::INT, zcta)"
        - city:
            type: VARCHAR(128)
            description: >
              Name of a representative city for the ZIP Code Tabulation Area (ZCTA);
              for ZCTAs spanning multiple cities, this is the city covering the largest
              portion of the area or population.
            source:  "public.zip_to_city(EXTRACT(YEAR FROM date)::INT, zcta)"

    gold_temperature_by_state:
      description: |
        Temperature variations by US State
      create:
        type: materialized view
        from: silver_temperature
        group by:
          - us_state
          - date
      columns:
        - us_state
        - date
        - t_span:
            type: float
            description: Temperature variation in Celsius
            source: MAX(tmmx) - MIN(tmmx)
        - t_mean_in_C:
            type: float
            description: Mean Temperature in Celsius
            source: AVG(temperature_in_C)
        - t_mean_in_F:
            type: float
            description: Mean Temperature in Fahrenheit
            source: AVG(temperature_in_F)
      primary_key:
        - us_state
        - date

Step 6. Testing the Pipeline

At this point example1.cwl orchestrates:

  1. Data download.

  2. Shape download.

  3. Spatial aggregation.

  4. DB initialization.

  5. Bronze ingestion.

  6. Silver view creation.

  7. Gold materialized view creation.

The pipeline is now ready to be tested. You can run it with the same command as before:

toil-cwl-runner --retryCount 3 --cleanWorkDir never --outdir outputs example1.cwl --workDir . --band tmmx --date 2019-01-15 --geography zcta

If everything completes successfully, you should see the following tables in your PostgreSQL database:

  • bronze_temperature

  • silver_temperature (view)

  • gold_temperature_by_state (materialized view)

You can also run the following SQL queries to verify the data:

-- Inspect Bronze
SELECT * FROM bronze_temperature
ORDER BY date, zcta
LIMIT 10;

-- Inspect Silver
SELECT date, zcta, temperature_in_C, us_state, city
FROM silver_temperature
ORDER BY date, zcta
LIMIT 10;

-- Inspect Gold: which state was hottest on 2019‑01‑15?
SELECT us_state, t_mean_in_C, t_span
FROM gold_temperature_by_state
WHERE date = '2019-01-15'
ORDER BY t_mean_in_C DESC
LIMIT 10;
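
If you prefer the command line, these queries can be run through psql using the connection parameters from your database.ini (placeholder values shown; assumes the psql client is installed):

psql -h localhost -p 5432 -U <your user> -d <your database> \
     -c "SELECT COUNT(*) FROM gold_temperature_by_state;"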

Next Steps

In the next steps, we will learn how to: