Tutorial: Building a Bronze–Silver–Gold Climate Pipeline with Dorieh

Introduction:

This tutorial walks you through building a small but complete Dorieh‑based data processing workflow. It models a realistic process of incrementally adding features to a workflow, keeping it runnable at every step, with the ultimate goal of producing a production-ready dataset.

The eventual pipeline:

  • Downloads gridded daily climate data.

  • Aggregates it over ZIP Code Tabulation Areas (ZCTAs).

  • Loads the result into PostgreSQL.

  • Builds Bronze, Silver, and Gold layers using the Dorieh data‑modeling DSL.

  • Generates documentation and data lineage diagrams.

The same design patterns apply directly to health and claims data pipelines; here we use open climate data so anyone can reproduce the example.

The concepts in this tutorial are used in the “Data Provenance for Secondary Use of Health Data through Dataflow Programming” book in the chapter named “Sample Application: Building ML‑Ready Datasets”.

Prerequisites

You will need:

  • A Unix‑like environment (Linux or macOS).

  • Python 3.9+.

  • A CWL runner, e.g.:

    • toil-cwl-runner (used in examples),

    • or cwltool.

  • PostgreSQL, or Docker to run the provided container image.

  • graphviz (for lineage diagrams) and pandoc (optional, for HTML docs).

Dorieh must be installed as described in the Example.

For convenience, the full list of commands is also provided below:

# Create a virtual environment and activate it.
python3 -m venv $path
source $path/bin/activate

# Install Toil and its dependencies.
pip install "toil[cwl,aws]"

# Install Dorieh and its dependencies.
pip install dorieh
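
Before proceeding, it is worth verifying the installation (a quick sanity check; both commands should print a version or success message):

# Confirm the CWL runner is on the PATH and Dorieh is importable.
toil-cwl-runner --version
python -c "import dorieh; print('dorieh OK')"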

Design overview

We briefly outline the pipeline design before touching any code.

Inputs

  • Climate variable: daily maximum temperature (tmmx) from a gridded dataset (e.g., GRIDMET / TerraClimate via Google Earth Engine / NKN).

Important

As we progress with building the workflow, we will see that we also need an additional input: a shapefile with ZIP Code Tabulation Areas (ZCTAs).

Outputs

  • File output: a compressed CSV with columns: date, zcta, and tmmx (Kelvin).

  • Database outputs:

    • Bronze table: bronze_temperature(date, zcta, tmmx).

    • Silver view: silver_temperature with:

      • temperature_in_C, temperature_in_F

      • us_state, city

    • Gold materialized view: gold_temperature_by_state with mean temperature and temperature span per state/day.

Architecture

We will:

  • Use CWL to orchestrate:

    • climate download → shape download → spatial aggregation → DB init → ingestion → Silver/Gold creation.

  • Use the Dorieh data‑modeling DSL (YAML) to define:

    • Bronze/Silver/Gold schema and transformations.

  • Use Dorieh utilities to generate:

    • Workflow documentation (from CWL).

    • Data dictionaries and lineage diagrams (from YAML).

Directory layout

Create a working directory for this tutorial, for example:

mkdir -p tmp/dorieh/tutorials/climate
cd tmp/dorieh/tutorials/climate

We will place:

  • example1.cwl – the CWL workflow.

  • example1_model.yml – the domain definition for Bronze/Silver/Gold.

  • (Optionally) a local database.ini if you need to adjust it for your environment.

Step 1. Create a minimal CWL workflow skeleton

We will start with a minimal CWL workflow definition containing the initial steps: data acquisition and aggregation. At this stage, placeholders can be used for inputs and outputs; these will be filled in as more details on the required tool parameters are gathered.

Dorieh provides a library of prebuilt CWL tool definitions, streamlining common steps such as downloading remote files and running aggregations. The corresponding tools used here are download.cwl and aggregate_daily.cwl.

The initial workflow skeleton can look like:

#!/usr/bin/env cwl-runner

cwlVersion: v1.2
class: Workflow

requirements:
  SubworkflowFeatureRequirement: {}
  StepInputExpressionRequirement: {}
  InlineJavascriptRequirement: {}
  ScatterFeatureRequirement: {}
  MultipleInputFeatureRequirement: {}
  NetworkAccess:
    networkAccess: True

inputs: {}

steps:
  download:
    run: https://raw.githubusercontent.com/ForomePlatform/dorieh/main/src/cwl/download.cwl

  aggregate:
    run: https://raw.githubusercontent.com/ForomePlatform/dorieh/main/src/cwl/aggregate_daily.cwl

outputs: {}

This skeleton is not yet runnable. It defines two steps but no inputs, outputs, or wiring.

Step 2. Iteratively Defining Steps and Parameters

At this stage of the implementation we need to:

  • For each workflow step, consult the underlying tool’s documentation or CWL file to determine:

    • Required input parameters

    • Output files and directory structure

  • Incrementally add inputs (e.g., band, year, geography), outputs, and wiring between steps (using CWL’s in and out sections).

Dorieh’s cwl_collect_outputs utility helps automate the extraction of output specifications and minimize errors.

To illustrate the process, we will first look at the download step. The tool we are using is Dorieh’s download.cwl.

We see that it takes three input parameters:

  • proxy settings,

  • the year for which we will be performing data aggregation,

  • the meteorological band, or variable, that we would like to aggregate.

If you are using a system with direct access to the Internet, you probably do not need a proxy. However, we need to take care of ‘year’ and ‘band’.

After adding these two input parameters to the inputs section and the download step, the workflow file will look like:

#!/usr/bin/env cwl-runner

cwlVersion: v1.2
class: Workflow

requirements:
  SubworkflowFeatureRequirement: {}
  StepInputExpressionRequirement: {}
  InlineJavascriptRequirement: {}
  ScatterFeatureRequirement: {}
  MultipleInputFeatureRequirement: {}
  NetworkAccess:
    networkAccess: True

inputs:
  band:
    type: string
  year:
    type: string

steps:
  download:
    run: https://raw.githubusercontent.com/ForomePlatform/dorieh/main/src/cwl/download.cwl
    in:
      year: year
      band: band

  aggregate:
    run: https://raw.githubusercontent.com/ForomePlatform/dorieh/main/src/cwl/aggregate_daily.cwl

outputs: {}

Though we have now defined the inputs for the first step, the workflow is still not runnable because we have not yet defined the outputs. As mentioned, Dorieh’s cwl_collect_outputs utility assists with this task. You can run:

python -m dorieh.platform.util.cwl_collect_outputs download https://raw.githubusercontent.com/ForomePlatform/dorieh/refs/heads/main/src/cwl/download.cwl

The command should produce the following output:

    out:
      - log
      - data
      - errors
## Generated by dorieh.platform.util.cwl_collect_outputs from download.cwl:
  download_log:
    type: File?
    outputSource: download/log
  download_data:
    type: File?
    outputSource: download/data
  download_errors:
    type: File
    outputSource: download/errors

From the standard output of this command, copy the “out” section to the step and the rest to the “outputs” section. You can exclude the outputs you want to ignore. The resulting workflow file should look like:

cwlVersion: v1.2
class: Workflow

requirements:
  SubworkflowFeatureRequirement: {}
  StepInputExpressionRequirement: {}
  InlineJavascriptRequirement: {}
  ScatterFeatureRequirement: {}
  MultipleInputFeatureRequirement: {}
  NetworkAccess:
    networkAccess: True
inputs:
  band:
    type: string
  year:
    type: string

steps:
  download:
    run: https://raw.githubusercontent.com/ForomePlatform/dorieh/main/src/cwl/download.cwl
    in:
      year: year
      band: band
    out:
      - log
      - data
      - errors

  aggregate:
    run: https://raw.githubusercontent.com/ForomePlatform/dorieh/main/src/cwl/aggregate_daily.cwl

outputs:
  ## Generated by dorieh.platform.util.cwl_collect_outputs from download.cwl:
  download_log:
    type: File?
    outputSource: download/log
  download_data:
    type: File?
    outputSource: download/data
  download_errors:
    type: File
    outputSource: download/errors

Now we will repeat the process for the aggregate step. Looking at the tool we are using, aggregate_daily.cwl, we notice that besides the input parameters we have already discussed, it also requires shape files. Dorieh provides another CWL tool, get_shapes.cwl, to download the shapefiles for US geographies (ZCTAs and counties) from the US Census website.

Therefore, we need to add a third step to the existing two:

  get_shapes:
    run: https://raw.githubusercontent.com/ForomePlatform/dorieh/main/src/cwl/get_shapes.cwl

This brings us to the following workflow definition:

cwlVersion: v1.2
class: Workflow

requirements:
  SubworkflowFeatureRequirement: {}
  StepInputExpressionRequirement: {}
  InlineJavascriptRequirement: {}
  ScatterFeatureRequirement: {}
  MultipleInputFeatureRequirement: {}
  NetworkAccess:
    networkAccess: True
inputs:
  band:
    type: string
  year:
    type: string

steps:
  download:
    run: https://raw.githubusercontent.com/ForomePlatform/dorieh/main/src/cwl/download.cwl
    in:
      year: year
      band: band
    out:
      - log
      - data
      - errors

  get_shapes:
    run: https://raw.githubusercontent.com/ForomePlatform/dorieh/main/src/cwl/get_shapes.cwl

  aggregate:
    run: https://raw.githubusercontent.com/ForomePlatform/dorieh/main/src/cwl/aggregate_daily.cwl

outputs:
  ## Generated by dorieh.platform.util.cwl_collect_outputs from download.cwl:
  download_log:
    type: File?
    outputSource: download/log
  download_data:
    type: File?
    outputSource: download/data
  download_errors:
    type: File
    outputSource: download/errors

After repeating the process described above for the remaining two steps (get_shapes and aggregate) we finally have a runnable workflow definition:

#!/usr/bin/env cwl-runner

cwlVersion: v1.2
class: Workflow

requirements:
  SubworkflowFeatureRequirement: {}
  StepInputExpressionRequirement: {}
  InlineJavascriptRequirement: {}
  ScatterFeatureRequirement: {}
  MultipleInputFeatureRequirement: {}
  NetworkAccess:
    networkAccess: True
inputs:
  band:
    type: string
  year:
    type: string
  geography:
    type: string

steps:
  download:
    run: https://raw.githubusercontent.com/ForomePlatform/dorieh/main/src/cwl/download.cwl
    in:
      year: year
      band: band
    out:
      - log
      - data
      - errors

  get_shapes:
    run: https://raw.githubusercontent.com/ForomePlatform/dorieh/main/src/cwl/get_shapes.cwl
    in:
      geo: geography
      year: year
    out:
      - shape_files

  aggregate:
    run: https://raw.githubusercontent.com/ForomePlatform/dorieh/main/src/cwl/aggregate_daily.cwl
    in:
      geography: geography
      year: year
      band: band
      input: download/data
      shape_files: get_shapes/shape_files
    out:
      - log
      - data
      - errors

outputs:
  ## Generated by dorieh.platform.util.cwl_collect_outputs from download.cwl:
  download_log:
    type: File?
    outputSource: download/log
  download_data:
    type: File?
    outputSource: download/data
  download_errors:
    type: File
    outputSource: download/errors
  ## Generated by dorieh.platform.util.cwl_collect_outputs from get_shapes.cwl:
  get_shapes_shape_files:
    type: File[]
    outputSource: get_shapes/shape_files
  ## Generated by dorieh.platform.util.cwl_collect_outputs from aggregate_daily.cwl:
  aggregate_log:
    type: File?
    outputSource: aggregate/log
  aggregate_data:
    type: File?
    outputSource: aggregate/data
  aggregate_errors:
    type: File
    outputSource: aggregate/errors

At this point, the workflow can aggregate a full year of data, but that is slow for development. Next we add a “toy slice” mode.

If you prefer to test the current workflow, and if you do not mind waiting hours for its completion, you can run it with the following command:

toil-cwl-runner --retryCount 3 --cleanWorkDir never --outdir outputs example1.cwl --workDir . --band tmmx --year 2019 --geography zcta

Step 3. Parameterize for a single day (“toy” run)

To speed up development and debugging, we can parameterize the workflow so that it can run on “toy” datasets—for example, filtering a single day (e.g., 2019-01-15) instead of an entire year. Running your pipeline on a minimal dataset ensures quick feedback, uncovers errors early, and avoids wasting compute resources.

First, we need to modify the inputs section of the workflow:

inputs:
  band:
    type: string
  date:
    type: string    # e.g. "2019-01-15"
  geography:
    type: string    # "zcta" or "county"

We must keep in mind that such parameterization often requires additional input transformations (e.g., extracting the year from a date); hence, we need to insert transformation steps or chain values via CWL’s valueFrom mechanism.
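
To illustrate the mechanism before we apply it, here is a minimal sketch of a valueFrom expression on a step input. The step and tool names are hypothetical, not part of our workflow; StepInputExpressionRequirement and InlineJavascriptRequirement, which our workflow already declares, are required for this to work:

  some_step:
    run: some_tool.cwl          # hypothetical tool that expects a `year` input
    in:
      year:
        source: date            # the workflow-level input, e.g. "2019-01-15"
        valueFrom: $(self.split("-")[0])   # evaluates to "2019"
    out: []

In our workflow we instead use a dedicated parse_date.cwl step, which makes the transformation an explicit node in the workflow graph.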

In our case, to allow the workflow to run on a toy dataset, we will first replace the input parameter year with date. Our downstream steps, however, require a year, e.g., to download the right shape file. Hence, we will add an additional step to extract the year from the date:

  extract_year:
    run: https://raw.githubusercontent.com/ForomePlatform/dorieh/main/src/cwl/parse_date.cwl
    in:
      date: date
    out:
      - year

With these changes in place, the workflow is now:

#!/usr/bin/env cwl-runner

cwlVersion: v1.2
class: Workflow

requirements:
  SubworkflowFeatureRequirement: {}
  StepInputExpressionRequirement: {}
  InlineJavascriptRequirement: {}
  ScatterFeatureRequirement: {}
  MultipleInputFeatureRequirement: {}
  NetworkAccess:
    networkAccess: True
inputs:
  band:
    type: string
  date:
    type: string
  geography:
    type: string

steps:
  extract_year:
    run: https://raw.githubusercontent.com/ForomePlatform/dorieh/main/src/cwl/parse_date.cwl
    in:
      date: date
    out:
      - year

  download:
    run: https://raw.githubusercontent.com/ForomePlatform/dorieh/main/src/cwl/download.cwl
    in:
      year: extract_year/year
      band: band
    out:
      - log
      - data
      - errors

  get_shapes:
    run: https://raw.githubusercontent.com/ForomePlatform/dorieh/main/src/cwl/get_shapes.cwl
    in:
      geo: geography
      year: extract_year/year
    out:
      - shape_files

  aggregate:
    run: https://raw.githubusercontent.com/ForomePlatform/dorieh/main/src/cwl/aggregate_daily.cwl
    in:
      geography: geography
      year: extract_year/year
      dates: date
      band: band
      input: download/data
      shape_files: get_shapes/shape_files
      strategy:
        valueFrom: default
    out:
      - log
      - data
      - errors

outputs:
  ## Generated by dorieh.platform.util.cwl_collect_outputs from download.cwl:
  download_log:
    type: File?
    outputSource: download/log
  download_data:
    type: File?
    outputSource: download/data
  download_errors:
    type: File
    outputSource: download/errors
  ## Generated by dorieh.platform.util.cwl_collect_outputs from get_shapes.cwl:
  get_shapes_shape_files:
    type: File[]
    outputSource: get_shapes/shape_files
  ## Generated by dorieh.platform.util.cwl_collect_outputs from aggregate_daily.cwl:
  aggregate_log:
    type: File?
    outputSource: aggregate/log
  aggregate_data:
    type: File?
    outputSource: aggregate/data
  aggregate_errors:
    type: File
    outputSource: aggregate/errors

It can be run with the following command:

toil-cwl-runner --retryCount 3 --cleanWorkDir never --outdir outputs example1.cwl --workDir . --band tmmx --date 2019-01-15 --geography zcta

If successful, you should find a gzipped CSV file named tmmx_zcta_polygon_2019.csv.gz in the outputs directory, containing the following columns:

  • date

  • zcta

  • tmmx

This is exactly what we will ingest into the Bronze layer.
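
To spot-check the file before ingesting it anywhere, you can peek at the first few rows (a quick sketch; the exact file location depends on your --outdir setting):

gunzip -c outputs/tmmx_zcta_polygon_2019.csv.gz | head -n 5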

If the workflow fails, follow the troubleshooting steps described in the Troubleshooting documentation.

Step 4. Add database integration (PostgreSQL)

Start or check PostgreSQL

If you already have a PostgreSQL database running, you will need to create a database.ini file as described in the Documentation.

Otherwise, you can start a PostgreSQL container using Docker:

git clone https://github.com/ForomePlatform/dorieh.git
cd dorieh/docker/pg-hll
docker compose up -d

In the latter case use the provided database.ini file.
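
For reference, database.ini is an INI file with one section per named connection; the section name is what you pass as connection_name. A minimal sketch with placeholder values (consult the Dorieh documentation for the exact set of supported keys):

[localhost]
host=localhost
port=5432
database=<your database>
user=<your user>
password=<your password>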

Add PostgreSQL integration to the workflow

Back in your tutorial directory (tmp/dorieh/tutorials/climate), add two new workflow inputs to example1.cwl:

inputs:
  band:
    type: string
  date:
    type: string
  geography:
    type: string
  database:
    type: File
    default:
      class: File
      location: https://raw.githubusercontent.com/ForomePlatform/dorieh/main/examples/with-postgres/database.ini
  connection_name:
    type: string
    default: "localhost"

Add the Database initialization step

Another required action is to ensure that the database contains the latest Dorieh code for PostgreSQL. This is done by adding an initdb step:

  initdb:
    run: https://raw.githubusercontent.com/ForomePlatform/dorieh/main/src/cwl/initcoredb.cwl
    in:
      database: database
      connection_name: connection_name
    out:
      - log
      - err

Optionally, though we recommend it, add the outputs of the initdb step to the pipeline outputs:

outputs:
  # ... existing outputs ...

  initdb_log:
    type: File
    outputSource: initdb/log
  initdb_err:
    type: File
    outputSource: initdb/err

Defining Data Model

However, to load the data into a database, we also need to define the database schema. It is possible to infer the schema automatically using Dorieh tools such as the Project Loader and the Introspector, but for a Medallion architecture the schema should be defined explicitly; we will then use it with the Data Loader tool.

The data model definition language is described in the Data Model documentation.

Initially, we will define a simple schema for the Bronze layer:

tutorial:
  header: true
  quoting: 3
  index: "unless excluded"
  description: "Data model for data transformation tutorial"
  tables:
    bronze_temperature:
      description: |
        Maximum daily temperature for US Zip Code Tabulation Areas
      columns:
        - tmmx:
            type: float
            description: Maximum temperature variable from the TerraClimate dataset in K
            reference: https://developers.google.com/earth-engine/datasets/catalog/IDAHO_EPSCOR_GRIDMET#bands
        - date:
            type: date
        - zcta:
            description: Zip Code for a Zip Code Tabulation Area
            type: int
      primary_key:
        - zcta
        - date

It defines a data domain named “tutorial” containing a table named “bronze_temperature” with three columns: tmmx, date, and zcta. We will assume that the file name is example1_model.yml.
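
For intuition, this Bronze definition corresponds roughly to DDL of the following shape (a sketch only; the actual SQL, schema naming, and index statements are generated by the Dorieh tooling and may differ):

CREATE TABLE tutorial.bronze_temperature (
    tmmx FLOAT,   -- daily maximum temperature, Kelvin
    date DATE,
    zcta INT,     -- ZIP Code Tabulation Area
    PRIMARY KEY (zcta, date)
);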

Adding Ingestion Step

Now we can add the actual ingestion step to the workflow using the Dorieh ingest tool:

  ingest:
    run: https://raw.githubusercontent.com/ForomePlatform/dorieh/main/src/cwl/ingest.cwl
    in:
      depends_on: initdb/log
      registry:
        default:
          class: File
          location: "https://raw.githubusercontent.com/ForomePlatform/dorieh/main/doc/tutorial/example1_model.yml"
      domain:
        valueFrom: "tutorial"
      table:
        valueFrom: "bronze_temperature"
      input: aggregate/data
      database: database
      connection_name: connection_name
    out:
      - log
      - errors

After adding ingestion to steps and the logs it produces to the outputs, the resulting workflow file should look like:

#!/usr/bin/env cwl-runner

cwlVersion: v1.2
class: Workflow

requirements:
  SubworkflowFeatureRequirement: {}
  StepInputExpressionRequirement: {}
  InlineJavascriptRequirement: {}
  ScatterFeatureRequirement: {}
  MultipleInputFeatureRequirement: {}
  NetworkAccess:
    networkAccess: True
inputs:
  band:
    type: string
  date:
    type: string
  geography:
    type: string
  database:
    type: File
    default:
      class: File
      location: https://raw.githubusercontent.com/ForomePlatform/dorieh/refs/heads/main/examples/with-postgres/database.ini
  connection_name:
    type: string
    default: "localhost"

steps:
  extract_year:
    run: https://raw.githubusercontent.com/ForomePlatform/dorieh/main/src/cwl/parse_date.cwl
    in:
      date: date
    out:
      - year

  download:
    run: https://raw.githubusercontent.com/ForomePlatform/dorieh/main/src/cwl/download.cwl
    in:
      year: extract_year/year
      band: band
    out:
      - log
      - data
      - errors

  get_shapes:
    run: https://raw.githubusercontent.com/ForomePlatform/dorieh/main/src/cwl/get_shapes.cwl
    in:
      geo: geography
      year: extract_year/year
    out:
      - shape_files

  aggregate:
    run: https://raw.githubusercontent.com/ForomePlatform/dorieh/main/src/cwl/aggregate_daily.cwl
    in:
      geography: geography
      year: extract_year/year
      dates: date
      band: band
      input: download/data
      shape_files: get_shapes/shape_files
      strategy:
        valueFrom: default
    out:
      - log
      - data
      - errors

  initdb:
    run: https://raw.githubusercontent.com/ForomePlatform/dorieh/main/src/cwl/initcoredb.cwl
    in:
      database: database
      connection_name: connection_name
    out:
      - log
      - err

  ingest:
    run: https://raw.githubusercontent.com/ForomePlatform/dorieh/main/src/cwl/ingest.cwl
    in:
      depends_on: initdb/log
      registry:
        default:
          class: File
          location: "https://raw.githubusercontent.com/ForomePlatform/dorieh/main/doc/tutorial/climate/example1_model.yml"
      domain:
        valueFrom: "tutorial"
      table:
        valueFrom: "bronze_temperature"
      input: aggregate/data
      database: database
      connection_name: connection_name
    out:
      - log
      - errors

outputs:
  ## Generated by dorieh.platform.util.cwl_collect_outputs from download.cwl:
  download_log:
    type: File?
    outputSource: download/log
  download_data:
    type: File?
    outputSource: download/data
  download_errors:
    type: File
    outputSource: download/errors
  ## Generated by dorieh.platform.util.cwl_collect_outputs from get_shapes.cwl:
  get_shapes_shape_files:
    type: File[]
    outputSource: get_shapes/shape_files
  ## Generated by dorieh.platform.util.cwl_collect_outputs from aggregate_daily.cwl:
  aggregate_log:
    type: File?
    outputSource: aggregate/log
  aggregate_data:
    type: File?
    outputSource: aggregate/data
  aggregate_errors:
    type: File
    outputSource: aggregate/errors

  initdb_log:
    type: File
    outputSource: initdb/log
  initdb_err:
    type: File
    outputSource: initdb/err

  ingest_log:
    type: File
    outputSource: ingest/log
  ingest_err:
    type: File
    outputSource: ingest/errors

The same command as before can be used to run the workflow:

toil-cwl-runner --retryCount 3 --cleanWorkDir never --outdir outputs example1.cwl --workDir . --band tmmx --date 2019-01-15 --geography zcta

If you choose to perform this run, then besides the CSV file, you will also see the data in the database, in the table bronze_temperature.
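
Before moving on to the Silver and Gold layers, a quick sanity check on the Bronze table can save debugging time later (run these in psql; the expected ranges are indicative only):

-- Expect one row per ZCTA for the chosen date:
SELECT COUNT(*) FROM bronze_temperature;

-- Daily maxima in Kelvin should fall in a plausible range (roughly 230-330):
SELECT MIN(tmmx), MAX(tmmx) FROM bronze_temperature;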

Step 5. Building Medallion Layers (Bronze, Silver, Gold)

Medallion architecture defines three layers:

  • Bronze Layer: Load as-is, minimally processed data into the database from pipeline outputs.

    • In this climate data example, the “raw” data is not strictly straight-from-source: NetCDF data cannot be ingested directly into most DBMSs unless specialized extensions are installed, so an initial aggregation is necessary for technical compatibility. Hence, we need to transform the data into a more conventional tabular format before ingestion; this is exactly the operation performed by the aggregation step.

  • Silver Layer: Clean, harmonize, and enrich data.

    • Built from the Bronze layer (no external inputs are allowed).

    • Add derived columns (e.g., Celsius/Fahrenheit conversions, state & city annotation via ZIP lookup).

    • Define as a view on top of the bronze table in your data model YAML.

  • Gold Layer: Produce analytic/ML-ready outputs.

    • Built from the Silver layer.

    • Perform groupings and summaries (e.g., aggregating by state and date).

    • Use materialized views where performance and reusability are required.

The Silver layer is usually responsible for normalization, harmonization, and cleansing of the data. In our example the data is already clean and normalized, and it does not require harmonization because it comes from a single source. However, even when source data is already clean, enriching it with derived fields or external metadata (such as regional names) increases its analytic utility and facilitates downstream ML feature engineering. Hence, we illustrate enrichment by adding columns that express the temperature in different units and annotate ZIP codes with US state abbreviations and, for areas that lie within a city, the city name.

Based on the discussion above, we will add a corresponding step for the Silver layer:

  build_silver:
    run: https://raw.githubusercontent.com/ForomePlatform/dorieh/main/src/cwl/create.cwl
    in:
      depends_on: ingest/log
      registry:
        default:
          class: File
          location: "https://raw.githubusercontent.com/ForomePlatform/dorieh/main/doc/tutorial/climate/example1_model.yml"
      domain:
        valueFrom: "tutorial"
      table:
        valueFrom: "silver_temperature"
      database: database
      connection_name: connection_name
    out:
      - log
      - errors

This step builds a view named silver_temperature. We also need to describe it in the data model file. A best practice is to keep your Silver and Gold layer table/view definitions together in a versioned domain YAML file, checked into source control along with your workflow scripts.

Hence, we will add the following definition to example1_model.yml:

    silver_temperature:
      description: |
        Maximum daily temperature for US Zip Code Tabulation Areas, enriched and harmonized
      create:
        type: view
        from: bronze_temperature
      columns:
        - tmmx
        - date
        - zcta
        - temperature_in_C:
            type: float
            description: Temperature in Celsius
            source: (tmmx - 273.15)
        - temperature_in_F:
            type: float
            description: Temperature in Fahrenheit
            source: ((tmmx - 273.15)*9/5 + 32)
        - us_state:
            type: VARCHAR(2)
            description: US State
            source:  "public.zip_to_state(EXTRACT(YEAR FROM date)::INT, zcta)"
        - city:
            type: VARCHAR(128)
            description: >
              Name of a representative city for the ZIP Code Tabulation Area (ZCTA); 
              for ZCTAs spanning multiple cities, this is the city covering the largest 
              portion of the area or population.
            source:  "public.zip_to_city(EXTRACT(YEAR FROM date)::INT, zcta)"

This Silver view retains all three Bronze columns and adds four new ones (a rough SQL equivalent follows the list):

  • Temperature in degrees Celsius, for the benefit of readers outside the United States, computed by subtracting 273.15 from the Kelvin value.

  • Temperature in degrees Fahrenheit, for the benefit of US readers, computed by the standard conversion formula.

  • A two-letter abbreviation of the state in which the area lies, computed by calling the Dorieh built-in function zip_to_state.

  • The name of the city for the ZIP code, for urban areas, computed by calling the Dorieh built-in function zip_to_city.
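
For intuition, the view definition above corresponds roughly to SQL of this shape (a sketch only; Dorieh generates the actual statement, and the schema qualification may differ):

CREATE VIEW tutorial.silver_temperature AS
SELECT
    tmmx,
    date,
    zcta,
    (tmmx - 273.15)                AS temperature_in_C,
    ((tmmx - 273.15) * 9 / 5 + 32) AS temperature_in_F,
    public.zip_to_state(EXTRACT(YEAR FROM date)::INT, zcta) AS us_state,
    public.zip_to_city(EXTRACT(YEAR FROM date)::INT, zcta)  AS city
FROM tutorial.bronze_temperature;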

After you have made these changes to both files (example1.cwl and example1_model.yml), you are welcome to run the pipeline again using the same command as above. It will now also build the Silver layer.

Alternatively, we can add the Gold layer before testing.

In the Gold layer, we will add just one table, which computes state-level statistics and is ready for analysis. The table, named gold_temperature_by_state, is defined by the following block:

    gold_temperature_by_state:
      description: |
        Temperature variations by US State
      create:
        type: materialized view
        from: silver_temperature
        group by:
          - us_state
          - date
      columns:
        - us_state
        - date
        - t_span:
            type: float
            description: Temperature variation in Celsius
            source: MAX(tmmx) - MIN(tmmx)
        - t_mean_in_C:
            type: float
            description: Mean Temperature in Celsius
            source: AVG(temperature_in_C)
        - t_mean_in_F:
            type: float
            description: Mean Temperature in Fahrenheit
            source: AVG(temperature_in_F)
      primary_key:
        - us_state
        - date

The Gold table contains, for each US state and date, the mean temperature and the temperature variation across the state. The variation is computed over daily maximum temperatures, so it does not reflect change during the day, only the geographic diversity within the state.
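
Again for intuition, this materialized view corresponds roughly to the following SQL (a sketch only; Dorieh generates the actual statement):

CREATE MATERIALIZED VIEW tutorial.gold_temperature_by_state AS
SELECT
    us_state,
    date,
    MAX(tmmx) - MIN(tmmx) AS t_span,
    AVG(temperature_in_C) AS t_mean_in_C,
    AVG(temperature_in_F) AS t_mean_in_F
FROM tutorial.silver_temperature
GROUP BY us_state, date;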

We now need to add a step that builds the Gold layer to the workflow itself. The step is nearly identical to build_silver; the only difference is the target table name:

  build_gold:
    run: https://raw.githubusercontent.com/ForomePlatform/dorieh/main/src/cwl/create.cwl
    in:
      depends_on: build_silver/log
      registry:
        default:
          class: File
          location: "https://raw.githubusercontent.com/ForomePlatform/dorieh/main/doc/tutorial/climate/example1_model.yml"
      domain:
        valueFrom: "tutorial"
      table:
        valueFrom: "gold_temperature_by_state"
      database: database
      connection_name: connection_name
    out:
      - log
      - errors

The final version of the workflow is:

#!/usr/bin/env cwl-runner

cwlVersion: v1.2
class: Workflow

requirements:
  SubworkflowFeatureRequirement: {}
  StepInputExpressionRequirement: {}
  InlineJavascriptRequirement: {}
  ScatterFeatureRequirement: {}
  MultipleInputFeatureRequirement: {}
  NetworkAccess:
    networkAccess: True
inputs:
  band:
    type: string
  date:
    type: string
  geography:
    type: string
  database:
    type: File
    default:
      class: File
      location: https://raw.githubusercontent.com/ForomePlatform/dorieh/refs/heads/main/examples/with-postgres/database.ini
  connection_name:
    type: string
    default: "localhost"

steps:
  extract_year:
    run: https://raw.githubusercontent.com/ForomePlatform/dorieh/main/src/cwl/parse_date.cwl
    in:
      date: date
    out:
      - year

  download:
    run: https://raw.githubusercontent.com/ForomePlatform/dorieh/main/src/cwl/download.cwl
    in:
      year: extract_year/year
      band: band
    out:
      - log
      - data
      - errors

  get_shapes:
    run: https://raw.githubusercontent.com/ForomePlatform/dorieh/main/src/cwl/get_shapes.cwl
    in:
      geo: geography
      year: extract_year/year
    out:
      - shape_files

  aggregate:
    run: https://raw.githubusercontent.com/ForomePlatform/dorieh/main/src/cwl/aggregate_daily.cwl
    in:
      geography: geography
      year: extract_year/year
      dates: date
      band: band
      input: download/data
      shape_files: get_shapes/shape_files
      strategy:
        valueFrom: default
    out:
      - log
      - data
      - errors

  initdb:
    run: https://raw.githubusercontent.com/ForomePlatform/dorieh/main/src/cwl/initcoredb.cwl
    in:
      database: database
      connection_name: connection_name
    out:
      - log
      - err

  ingest:
    run: https://raw.githubusercontent.com/ForomePlatform/dorieh/main/src/cwl/ingest.cwl
    in:
      depends_on: initdb/log
      registry:
        default:
          class: File
          location: "https://raw.githubusercontent.com/ForomePlatform/dorieh/main/doc/tutorial/climate/example1_model.yml"
      domain:
        valueFrom: "tutorial"
      table:
        valueFrom: "bronze_temperature"
      input: aggregate/data
      database: database
      connection_name: connection_name
    out:
      - log
      - errors

  build_silver:
    run: https://raw.githubusercontent.com/ForomePlatform/dorieh/main/src/cwl/create.cwl
    in:
      depends_on: ingest/log
      registry:
        default:
          class: File
          location: "https://raw.githubusercontent.com/ForomePlatform/dorieh/main/doc/tutorial/climate/example1_model.yml"
      domain:
        valueFrom: "tutorial"
      table:
        valueFrom: "silver_temperature"
      database: database
      connection_name: connection_name
    out:
      - log
      - errors

  build_gold:
    run: https://raw.githubusercontent.com/ForomePlatform/dorieh/main/src/cwl/create.cwl
    in:
      depends_on: build_silver/log
      registry:
        default:
          class: File
          location: "https://raw.githubusercontent.com/ForomePlatform/dorieh/main/doc/tutorial/climate/example1_model.yml"
      domain:
        valueFrom: "tutorial"
      table:
        valueFrom: "gold_temperature_by_state"
      database: database
      connection_name: connection_name
    out:
      - log
      - errors

outputs:
  ## Generated by dorieh.platform.util.cwl_collect_outputs from download.cwl:
  download_log:
    type: File?
    outputSource: download/log
  download_data:
    type: File?
    outputSource: download/data
  download_errors:
    type: File
    outputSource: download/errors
  ## Generated by dorieh.platform.util.cwl_collect_outputs from get_shapes.cwl:
  get_shapes_shape_files:
    type: File[]
    outputSource: get_shapes/shape_files
  ## Generated by dorieh.platform.util.cwl_collect_outputs from aggregate_daily.cwl:
  aggregate_log:
    type: File?
    outputSource: aggregate/log
  aggregate_data:
    type: File?
    outputSource: aggregate/data
  aggregate_errors:
    type: File
    outputSource: aggregate/errors

  initdb_log:
    type: File
    outputSource: initdb/log
  initdb_err:
    type: File
    outputSource: initdb/err

  ingest_log:
    type: File
    outputSource: ingest/log
  ingest_err:
    type: File
    outputSource: ingest/errors

  build_silver_log:
    type: File
    outputSource: build_silver/log
  build_silver_err:
    type: File
    outputSource: build_silver/errors

  build_gold_log:
    type: File
    outputSource: build_gold/log
  build_gold_err:
    type: File
    outputSource: build_gold/errors

The final Medallion data model is:

tutorial:
  header: true
  quoting: 3
  index: "unless excluded"
  description: "Data model for data transformation tutorial"
  tables:
    bronze_temperature:
      description: |
        Maximum daily temperature for US Zip Code Tabulation Areas
      columns:
        - tmmx:
            type: float
            description: Maximum temperature variable from the TerraClimate dataset in K
            reference: https://developers.google.com/earth-engine/datasets/catalog/IDAHO_EPSCOR_GRIDMET#bands
        - date:
            type: date
        - zcta:
            description: Zip Code for a Zip Code Tabulation Area
            type: int
      primary_key:
        - zcta
        - date
    silver_temperature:
      description: |
        Maximum daily temperature for US Zip Code Tabulation Areas, enriched and harmonized
      create:
        type: view
        from: bronze_temperature
      columns:
        - tmmx
        - date
        - zcta
        - temperature_in_C:
            type: float
            description: Temperature in Celsius
            source: (tmmx - 273.15)
        - temperature_in_F:
            type: float
            description: Temperature in Fahrenheit
            source: ((tmmx - 273.15)*9/5 + 32)
        - us_state:
            type: VARCHAR(2)
            description: US State
            source:  "public.zip_to_state(EXTRACT(YEAR FROM date)::INT, zcta)"
        - city:
            type: VARCHAR(128)
            description: >
              Name of a representative city for the ZIP Code Tabulation Area (ZCTA);
              for ZCTAs spanning multiple cities, this is the city covering the largest
              portion of the area or population.
            source:  "public.zip_to_city(EXTRACT(YEAR FROM date)::INT, zcta)"

    gold_temperature_by_state:
      description: |
        Temperature variations by US State
      create:
        type: materialized view
        from: silver_temperature
        group by:
          - us_state
          - date
      columns:
        - us_state
        - date
        - t_span:
            type: float
            description: Temperature variation in Celsius
            source: MAX(tmmx) - MIN(tmmx)
        - t_mean_in_C:
            type: float
            description: Mean Temperature in Celsius
            source: AVG(temperature_in_C)
        - t_mean_in_F:
            type: float
            description: Mean Temperature in Fahrenheit
            source: AVG(temperature_in_F)
      primary_key:
        - us_state
        - date

Step 6. Testing the Pipeline

At this point example1.cwl orchestrates:

  1. Data download.

  2. Shape download.

  3. Spatial aggregation.

  4. DB initialization.

  5. Bronze ingestion.

  6. Silver view creation.

  7. Gold materialized view creation.

The pipeline is now ready to be tested. You can run it with the same command as before:

toil-cwl-runner --retryCount 3 --cleanWorkDir never --outdir outputs example1.cwl --workDir . --band tmmx --date 2019-01-15 --geography zcta

If everything completes successfully, you should see the following tables in your PostgreSQL database:

  • bronze_temperature

  • silver_temperature (view)

  • gold_temperature_by_state (materialized view)

You can also run the following SQL queries to verify the data:

-- Inspect Bronze
SELECT * FROM bronze_temperature
ORDER BY date, zcta
LIMIT 10;

-- Inspect Silver
SELECT date, zcta, temperature_in_C, us_state, city
FROM silver_temperature
ORDER BY date, zcta
LIMIT 10;

-- Inspect Gold: which state was hottest on 2019‑01‑15?
SELECT us_state, t_mean_in_C, t_span
FROM gold_temperature_by_state
WHERE date = '2019-01-15'
ORDER BY t_mean_in_C DESC
LIMIT 10;
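
If you prefer the command line, these queries can be run through psql using the connection parameters from your database.ini (placeholder values shown; assumes the psql client is installed):

psql -h localhost -p 5432 -U <your user> -d <your database> \
     -c "SELECT COUNT(*) FROM gold_temperature_by_state;"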

Next Steps

In the next steps, we will learn how to: