# Tutorial: Building a Bronze–Silver–Gold Climate Pipeline with Dorieh

```{contents}
---
local:
---
```

```{seealso}
[Example: aggregating a climate variable](../../Example-climate-workflow.md)
```

## Introduction

This tutorial walks you through building a small but complete Dorieh‑based data processing workflow. It models a realistic process of incrementally adding features to a workflow, keeping it runnable at every step while working toward a production-ready dataset.

The eventual pipeline:

* Downloads gridded daily climate data.
* Aggregates it over ZIP Code Tabulation Areas (ZCTAs).
* Loads the result into PostgreSQL.
* Builds Bronze, Silver, and Gold layers using the Dorieh data‑modeling DSL.
* Generates documentation and data lineage diagrams.

The same design patterns apply directly to health and claims data pipelines; here we use open climate data so anyone can reproduce the example. The concepts in this tutorial are also used in the book *Data Provenance for Secondary Use of Health Data through Dataflow Programming*, in the chapter “Sample Application: Building ML‑Ready Datasets”.

## Prerequisites

You will need:

* A Unix‑like environment (Linux or macOS).
* Python 3.9+.
* A CWL runner, e.g.:
  * toil-cwl-runner (used in the examples), or
  * cwltool.
* PostgreSQL, or Docker to run the provided container image.
* graphviz (for lineage diagrams) and pandoc (optional, for HTML docs).

Dorieh must be installed as described in the [Example](../../Example-climate-workflow.md#prepare-to-run-a-workflow). For convenience, the full list of commands is also provided below:

```bash
# Create a virtual environment and activate it.
python3 -m venv $path
source $path/bin/activate

# Install Toil and its dependencies.
pip install "toil[cwl,aws]"

# Install Dorieh and its dependencies.
pip install dorieh
```

## Design overview

We briefly outline the pipeline design before touching any code.

### Inputs

* Climate variable: daily maximum temperature (tmmx) from a gridded dataset (e.g., GRIDMET / TerraClimate via Google Earth Engine / NKN).

```{Important}
As we build the workflow, we will see that we also need an additional input: a shapefile with ZIP Code Tabulation Areas (ZCTAs).
```

### Outputs

* **File output**: a compressed CSV with columns: date, zcta, and tmmx (Kelvin).
* **Database outputs**:
  * Bronze table: bronze_temperature(date, zcta, tmmx).
  * Silver view: silver_temperature with:
    * temperature_in_C, temperature_in_F
    * us_state, city
  * Gold materialized view: gold_temperature_by_state with:
    * mean temperature and temperature span per state/day.

### Architecture

We will:

* Use **CWL** to orchestrate:
  * climate download → shape download → spatial aggregation → DB init → ingestion → Silver/Gold creation.
* Use the **Dorieh data‑modeling DSL (YAML)** to define:
  * Bronze/Silver/Gold schema and transformations.
* Use Dorieh utilities to generate:
  * Workflow documentation (from CWL).
  * Data dictionaries and lineage diagrams (from YAML).

## Directory layout

Create a working directory for this tutorial, for example:

```bash
cd tmp
mkdir -p dorieh/tutorials/climate
cd dorieh/tutorials/climate
```

In this directory we will place:

* example1.cwl – the CWL workflow.
* example1_model.yml – the domain definition for Bronze/Silver/Gold.
* (Optionally) a local database.ini if you need to adjust it for your environment.
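Before starting on the workflow itself, it may be worth a quick sanity check that the pieces listed under Prerequisites are available in the active environment (a minimal sketch; adjust for the runner and optional tools you chose):

```bash
# Confirm the Python environment can see Dorieh
python -c "import dorieh; print('dorieh is installed')"

# Confirm the CWL runner and the optional documentation tools are on PATH
which toil-cwl-runner
dot -V                         # graphviz, needed later for lineage diagrams
pandoc --version | head -n 1   # optional, for HTML documentation
```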
## Step 1. Create a minimal CWL workflow skeleton

We will start with a minimal CWL workflow definition containing the main processing steps—data acquisition and aggregation. At this stage, placeholders can be used for inputs and outputs; these will be filled in as more details on the required tool parameters are gathered.

Dorieh provides a library of prebuilt CWL tool definitions, streamlining common steps such as downloading remote files and running aggregations. The corresponding tools are:

* [download.cwl](../../pipeline/download.md)
* [aggregate.cwl](../../pipeline/aggregate_daily.md)

The initial workflow skeleton can look like this:

:::{toggle} Expand Code Block

```{literalinclude} steps/step1/example1.cwl
:linenos:
:language: yaml
```

:::

This skeleton is not yet runnable. It defines the steps but no inputs, outputs, or wiring.

## Step 2. Iteratively Defining Steps and Parameters

In this step we need to:

* For each workflow step, consult the underlying tool’s documentation or CWL file to determine:
  * required input parameters,
  * output files and directory structure.
* Incrementally add inputs (e.g., band, year, geography), outputs, and wiring between steps (using CWL’s `in` and `out` sections).

Dorieh’s `cwl_collect_outputs` utility helps automate the extraction of output specifications and minimize errors.

To illustrate the process, we will first look at the download step. The tool we are using is [Dorieh download.cwl](../../pipeline/download.md). We see that it takes three input parameters:

* proxy settings,
* the year for which we will be performing the data aggregation,
* the meteorological band, or variable, that we would like to aggregate.

If you are using a system with direct access to the Internet, you probably do not need a proxy. However, we need to take care of `year` and `band`. After adding these two input parameters to the `inputs` section and the `download` step, the workflow file will look like this:

:::{toggle} Expand Code Block

```{literalinclude} steps/step2/example1-1.cwl
:linenos:
:language: yaml
```

:::

Though we have now defined the inputs for the first step, the workflow is still not runnable because we have not yet defined the outputs. As mentioned, Dorieh’s `cwl_collect_outputs` utility assists with this task. You can run:

```bash
python -m dorieh.platform.util.cwl_collect_outputs download https://raw.githubusercontent.com/ForomePlatform/dorieh/refs/heads/main/src/cwl/download.cwl
```

The command should produce the following output:

```text
out:
  - log
  - data
  - errors

## Generated by dorieh.platform.util.cwl_collect_outputs from download.cwl:
  download_log:
    type: File?
    outputSource: download/log
  download_data:
    type: File?
    outputSource: download/data
  download_errors:
    type: File
    outputSource: download/errors
```

From the standard output of this command, copy the `out` section into the step and the rest into the workflow’s `outputs` section. You can exclude any outputs you want to ignore. The resulting workflow file should look like this:

:::{toggle} Expand Code Block

```{literalinclude} steps/step2/example1-2.cwl
:linenos:
:language: yaml
```

:::

Now we will repeat the process for the `aggregate` step. Looking at the tool we are using, [aggregate.cwl](../../pipeline/aggregate_daily.md), we notice that besides the input parameters we have already discussed, it also requires shapefiles. Dorieh provides another CWL tool, [get_shapes.cwl](../../pipeline/get_shapes.md), which downloads shapefiles for US geographies (ZCTAs and counties) from the US Census website.
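The same output-collection trick works for this tool as well. Once the step is added (see below), a command along these lines prints the `out` list and workflow-level outputs to copy into the CWL file (a sketch; the first argument must match the step name used in your workflow, and the output names are whatever the command actually reports):

```bash
python -m dorieh.platform.util.cwl_collect_outputs get_shapes \
    https://raw.githubusercontent.com/ForomePlatform/dorieh/main/src/cwl/get_shapes.cwl
```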
Therefore, we need to add a third step to the existing two:

```yaml
get_shapes:
  run: https://raw.githubusercontent.com/ForomePlatform/dorieh/main/src/cwl/get_shapes.cwl
```

This brings us to the following workflow definition:

:::{toggle} Expand Code Block

```{literalinclude} steps/step2/example1-3.cwl
:linenos:
:language: yaml
```

:::

After repeating the process described above for the remaining two steps (`get_shapes` and `aggregate`), we finally have a runnable workflow definition:

:::{toggle} Expand Code Block

```{literalinclude} steps/step2/example1-4.cwl
:linenos:
:language: yaml
```

:::

At this point, the workflow can aggregate a full year of data, but that is slow for development. Next we add a “toy slice” mode. If you prefer to test the current workflow, and if you do not mind waiting hours for it to complete, you can run it with the following command:

```shell
toil-cwl-runner --retryCount 3 --cleanWorkDir never --outdir outputs example1.cwl --workDir . --band tmmx --year 2019 --geography zcta
```

## Step 3. Parameterize for a single day (“toy” run)

To speed up development and debugging, we can parameterize the workflow so that it can run on “toy” datasets—for example, processing a single day (e.g., 2019-01-15) instead of an entire year. Running your pipeline on a minimal dataset ensures quick feedback, uncovers errors early, and avoids wasting compute resources.

First, we need to modify the `inputs` section of the workflow:

```yaml
inputs:
  band:
    type: string
  date:
    type: string    # e.g. "2019-01-15"
  geography:
    type: string    # "zcta" or "county"
```

We must keep in mind that such parameterization often requires additional input transformations (e.g., extracting the year from a date); hence, we need to insert transformation steps, chaining outputs via CWL’s `valueFrom` mechanism.

In our case, to allow the workflow to run on a toy dataset, we will first replace the input parameter `year` with `date`. Our downstream steps, however, require a year, e.g., to download the right shapefile. Hence, we will add an additional step to extract the year from the date:

```yaml
extract_year:
  run: https://raw.githubusercontent.com/ForomePlatform/dorieh/main/src/cwl/parse_date.cwl
  in:
    date: date
  out:
    - year
```

With these changes in place, the workflow is now:

:::{toggle} Expand Code Block

```{literalinclude} steps/step3/example1.cwl
:linenos:
:language: yaml
```

:::

It can be run with the following command:

```shell
toil-cwl-runner --retryCount 3 --cleanWorkDir never --outdir outputs example1.cwl --workDir . --band tmmx --date 2019-01-15 --geography zcta
```

If successful, you should find a gzipped CSV file named `tmmx_zcta_polygon_2019.csv.gz` in the output directory, containing the following columns:

* date
* zcta
* tmmx

This is exactly what we will ingest into the **Bronze layer**. If the workflow fails, follow the troubleshooting steps described in the [Troubleshooting](../../pipelines.md#troubleshooting-workflows-run-by-toil) documentation.

## Step 4. Add database integration (PostgreSQL)

### Start or check PostgreSQL

If you already have a PostgreSQL database running, you will need to create a `database.ini` file as described in the [Documentation](../../DBConnections.md). Otherwise, you can start a PostgreSQL container using Docker:

```shell
git clone https://github.com/ForomePlatform/dorieh.git
cd dorieh/docker/pg-hll
docker compose up -d
```

In the latter case, use the provided [`database.ini`](https://github.com/ForomePlatform/dorieh/blob/main/examples/with-postgres/database.ini) file.
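If you are writing your own `database.ini` for an existing PostgreSQL instance, it is an INI-style file whose section name is the connection name passed to the workflow (we use `localhost` below). A minimal sketch only; the key names and values shown here are placeholders for your environment, and the [DBConnections documentation](../../DBConnections.md) is the authoritative reference for the format:

```ini
[localhost]
host=localhost
port=5432
database=dorieh
user=postgres
password=changeme
```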
### Add PostgreSQL integration to the workflow

Back in your tutorial directory (dorieh/tutorials/climate), add two new workflow inputs to example1.cwl:

```yaml
inputs:
  band:
    type: string
  date:
    type: string
  geography:
    type: string
  database:
    type: File
    default:
      class: File
      location: https://raw.githubusercontent.com/ForomePlatform/dorieh/main/examples/with-postgres/database.ini
  connection_name:
    type: string
    default: "localhost"
```

### Add the database initialization step

Another required action is to ensure that the database contains the latest Dorieh code for PostgreSQL. This is done by adding an `initdb` step:

```yaml
initdb:
  run: https://raw.githubusercontent.com/ForomePlatform/dorieh/main/src/cwl/initcoredb.cwl
  in:
    database: database
    connection_name: connection_name
  out:
    - log
    - err
```

Optionally (though we recommend it), add the outputs of the `initdb` step to the pipeline outputs:

```yaml
outputs:
  # ... existing outputs ...
  initdb_log:
    type: File
    outputSource: initdb/log
  initdb_err:
    type: File
    outputSource: initdb/err
```

### Defining the Data Model

To load the data into a database, we also need to define the database schema. It is possible to infer the schema automatically using Dorieh tools such as [Project Loader](../../ProjectLoader.md) and [Introspector](../../members/introspector). For a Medallion architecture, however, the schema should be defined explicitly, and we will use it with the [Data Loader](../../DataLoader.md) tool.

```{seealso}
[Data modelling vs data introspection](../../adding_data.md#data-modelling-vs-data-introspection)
```

The data model definition language is described in the [Data Model](../../Datamodels.md) documentation. Initially, we will define a simple schema for the Bronze layer:

:::{toggle} Expand Code Block

```{literalinclude} steps/step4/example1_model.yml
:linenos:
:language: yaml
```

:::

It defines a data domain named `tutorial` and, within it, a table named `bronze_temperature` with three columns: `tmmx`, `date`, and `zcta`. We will assume the file is named `example1_model.yml`.

### Adding the Ingestion Step

Now we can add the actual ingestion step to the workflow using the Dorieh [ingest tool](../../pipeline/ingest.md):

```yaml
ingest:
  run: https://raw.githubusercontent.com/ForomePlatform/dorieh/main/src/cwl/ingest.cwl
  in:
    depends_on: initdb/log
    registry:
      default:
        class: File
        location: "https://raw.githubusercontent.com/ForomePlatform/dorieh/main/doc/tutorial/example1_model.yml"
    domain:
      valueFrom: "tutorial"
    table:
      valueFrom: "bronze_temperature"
    input: aggregate/data
    database: database
    connection_name: connection_name
  out:
    - log
    - errors
```

After adding ingestion to **steps** and the logs it produces to the **outputs**, the resulting workflow file should look like this:

:::{toggle} Expand Code Block

```{literalinclude} steps/step4/example1.cwl
:linenos:
:language: yaml
```

:::

The same command as before can be used to run the workflow:

```shell
toil-cwl-runner --retryCount 3 --cleanWorkDir never --outdir outputs example1.cwl --workDir . --band tmmx --date 2019-01-15 --geography zcta
```

If you choose to perform this run, then besides the CSV file you will also see the data in the database, in the table `bronze_temperature`.
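A quick way to confirm that the ingestion actually landed in the database (a sketch; depending on your `search_path` you may need to qualify the table with the domain schema, e.g. `tutorial.bronze_temperature`):

```sql
-- Row count and date range of the ingested slice; for a single-day "toy" run
-- both dates should equal the date passed to the workflow.
SELECT COUNT(*) AS rows_loaded,
       MIN(date) AS first_date,
       MAX(date) AS last_date
FROM bronze_temperature;
```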
## Step 5. Building Medallion Layers (Bronze, Silver, Gold)

The Medallion architecture defines three layers:

* **Bronze Layer**: Load as-is, minimally processed data into the database from pipeline outputs.
  * In this climate data example, the “raw” data is not strictly straight-from-source: an initial aggregation is necessary for technical compatibility, because NetCDF data cannot be ingested directly into most DBMSs unless specialized extensions are installed. Hence, we need to transform the data into a more conventional tabular format before ingestion, which is exactly the operation performed by the aggregation step.
* **Silver Layer**: Clean, harmonize, and enrich the data.
  * Built from the Bronze layer (no external inputs are allowed).
  * Add derived columns (e.g., Celsius/Fahrenheit conversions, state and city annotation via ZIP lookup).
  * Defined as a view on top of the Bronze table in the data model YAML.
* **Gold Layer**: Produce analytic/ML-ready outputs.
  * Built from the Silver layer.
  * Perform groupings and summaries (e.g., aggregating by state and date).
  * Use materialized views where performance and reusability are required.

The Silver layer is usually responsible for normalization, harmonization, and cleansing of the data. In our example the data is already clean and normalized, and it does not require harmonization because it comes from a single source. However, even if the source data is already clean, enriching it with derived fields or external metadata (such as regional names) increases its analytic utility and facilitates downstream ML feature engineering. Hence, we illustrate enrichment by adding columns that express the temperature in different units and that annotate ZIP codes with US state abbreviations and, for areas that lie within a city, the city name.

Based on the discussion above, we will add a corresponding step for the Silver layer:

```yaml
build_silver:
  run: https://raw.githubusercontent.com/ForomePlatform/dorieh/main/src/cwl/create.cwl
  in:
    depends_on: ingest/log
    registry:
      default:
        class: File
        location: "https://raw.githubusercontent.com/ForomePlatform/dorieh/main/doc/tutorial/climate/example1_model.yml"
    domain:
      valueFrom: "tutorial"
    table:
      valueFrom: "silver_temperature"
    database: database
    connection_name: connection_name
  out:
    - log
    - errors
```

This step builds a table named `silver_temperature`. We also need to describe the table in the data model file. A best practice is to keep your Silver and Gold layer table/view definitions together in a versioned domain YAML file, checked into source control along with your workflow scripts. Hence, we will add the following table definition to `example1_model.yml`:

```yaml
silver_temperature:
  description: |
    Maximum daily temperature for US Zip Code Tabulation Areas,
    enriched and harmonized
  create:
    type: view
    from: bronze_temperature
  columns:
    - tmmx
    - date
    - zcta
    - temperature_in_C:
        type: float
        description: Temperature in Celsius
        source: (tmmx - 273.15)
    - temperature_in_F:
        type: float
        description: Temperature in Fahrenheit
        source: ((tmmx - 273.15)*9/5 + 32)
    - us_state:
        type: VARCHAR(2)
        description: US State
        source: "public.zip_to_state(EXTRACT(YEAR FROM date)::INT, zcta)"
    - city:
        type: VARCHAR(128)
        description: >
          Name of a representative city for the ZIP Code Tabulation Area (ZCTA);
          for ZCTAs spanning multiple cities, this is the city covering the
          largest portion of the area or population.
        source: "public.zip_to_city(EXTRACT(YEAR FROM date)::INT, zcta)"
```

This Silver table retains all three Bronze columns and adds four new ones:

* Temperature expressed in degrees Celsius, for the benefit of readers outside the United States. It is computed with the trivial conversion formula.
* Temperature expressed in degrees Fahrenheit, for the benefit of readers in the United States. It is computed with the standard conversion formula.
* The two-letter abbreviation of the state in which the area lies. It is computed by calling the Dorieh built-in function `zip_to_state`.
* The name of the city for the ZIP code, for urban areas, also computed by calling the Dorieh built-in function `zip_to_city`.
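For intuition, the view Dorieh builds from this definition is roughly equivalent to the following SQL (a sketch of the logic only, not necessarily the exact DDL the `create` tool emits):

```sql
CREATE VIEW silver_temperature AS
SELECT
    tmmx,
    date,
    zcta,
    tmmx - 273.15                                            AS temperature_in_C,
    (tmmx - 273.15) * 9 / 5 + 32                             AS temperature_in_F,
    public.zip_to_state(EXTRACT(YEAR FROM date)::INT, zcta)  AS us_state,
    public.zip_to_city(EXTRACT(YEAR FROM date)::INT, zcta)   AS city
FROM bronze_temperature;
```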
After you have made these changes to both files (`example1.cwl` and `example1_model.yml`), you are welcome to run the pipeline again, using the same command as above. It will now also build the Silver layer. Alternatively, we can add the Gold layer before testing.

In the Gold layer, we will add just one table, which computes statistics over whole states and is ready for analysis. The table, named `gold_temperature_by_state`, is defined by the following block:

```yaml
gold_temperature_by_state:
  description: |
    Temperature variations by US State
  create:
    type: materialized view
    from: silver_temperature
    group by:
      - us_state
      - date
  columns:
    - us_state
    - date
    - t_span:
        type: float
        description: Temperature variation in Celsius
        source: MAX(tmmx) - MIN(tmmx)
    - t_mean_in_C:
        type: float
        description: Mean Temperature in Celsius
        source: AVG(temperature_in_C)
    - t_mean_in_F:
        type: float
        description: Mean Temperature in Fahrenheit
        source: AVG(temperature_in_F)
  primary_key:
    - us_state
    - date
```

The Gold table contains, for every US state and date, the mean temperature and the variation in temperature on that day. The variation is computed over daily maximum temperatures across ZCTAs, so it reflects the geographic diversity within the state rather than change over the course of a day.

We now need to add a step that builds the Gold layer to the workflow itself. The step is essentially the same as for Silver; the differences are the target table name and the step it depends on:

```yaml
build_gold:
  run: https://raw.githubusercontent.com/ForomePlatform/dorieh/main/src/cwl/create.cwl
  in:
    depends_on: build_silver/log
    registry:
      default:
        class: File
        location: "https://raw.githubusercontent.com/ForomePlatform/dorieh/main/doc/tutorial/climate/example1_model.yml"
    domain:
      valueFrom: "tutorial"
    table:
      valueFrom: "gold_temperature_by_state"
    database: database
    connection_name: connection_name
  out:
    - log
    - errors
```

The final version of the workflow is:

:::{toggle} Expand Code Block

```{literalinclude} example1.cwl
:linenos:
:language: yaml
```

:::

And the final Medallion data model is:

:::{toggle} Expand Code Block

```{literalinclude} example1_model.yml
:linenos:
:language: yaml
```

:::

## Step 6. Testing the Pipeline

At this point example1.cwl orchestrates:

1. Data download.
2. Shape download.
3. Spatial aggregation.
4. DB initialization.
5. Bronze ingestion.
6. Silver view creation.
7. Gold materialized view creation.

The pipeline is now ready to be tested. You can run it with the same command as before:

```shell
toil-cwl-runner --retryCount 3 --cleanWorkDir never --outdir outputs example1.cwl --workDir . --band tmmx --date 2019-01-15 --geography zcta
```

If everything completes successfully, you should see the following tables in your PostgreSQL database:

* `bronze_temperature`
* `silver_temperature` (view)
* `gold_temperature_by_state` (materialized view)

You can also run the following SQL queries to verify the data:

```sql
-- Inspect Bronze
SELECT * FROM bronze_temperature ORDER BY date, zcta LIMIT 10;

-- Inspect Silver
SELECT date, zcta, temperature_in_C, us_state, city
FROM silver_temperature
ORDER BY date, zcta
LIMIT 10;

-- Inspect Gold: which state was hottest on 2019-01-15?
SELECT us_state, t_mean_in_C, t_span
FROM gold_temperature_by_state
WHERE date = '2019-01-15'
ORDER BY t_mean_in_C DESC
LIMIT 10;
```
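One practical note: because the Gold layer is a materialized view, its contents are fixed at the time it is built. If you later ingest additional dates without re-running the creation step, you can bring it up to date with standard PostgreSQL (whether re-running the workflow recreates or refreshes it depends on the behavior of the `create` tool):

```sql
REFRESH MATERIALIZED VIEW gold_temperature_by_state;
```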
## Next Steps

Next, we will learn how to:

* [Document a workflow](documenting-a-workflow.md)
* [Construct data dictionaries and lineage graphs](constructing-lineage.md)