Tutorial: Building a Bronze–Silver–Gold Climate Pipeline with Dorieh
Introduction
This tutorial walks you through building a small but complete Dorieh‑based data processing workflow. It models a realistic process of incrementally adding features to a workflow: the workflow stays runnable at every step, while the ultimate goal is a production‑ready dataset.
The eventual pipeline:
Downloads gridded daily climate data.
Aggregates it over ZIP Code Tabulation Areas (ZCTAs).
Loads the result into PostgreSQL.
Builds Bronze, Silver, and Gold layers using the Dorieh data‑modeling DSL.
Generates documentation and data lineage diagrams.
The same design patterns apply directly to health and claims data pipelines; here we use open climate data so anyone can reproduce the example.
The concepts in this tutorial are used in the “Data Provenance for Secondary Use of Health Data through Dataflow Programming” book in the chapter named “Sample Application: Building ML‑Ready Datasets”.
Prerequisites
You will need:
A Unix‑like environment (Linux or macOS).
Python 3.9+.
A CWL runner, e.g.:
toil-cwl-runner (used in examples),
or cwltool.
PostgreSQL, or Docker to run the provided container image.
graphviz (for lineage diagrams) and pandoc (optional, for HTML docs).
Dorieh must be installed as described in the example; for convenience, the full list of commands is also provided below:
# Create a virtual environment and activate it.
python3 -m venv $path
source $path/bin/activate
# Install Toil and its dependencies.
pip install "toil[cwl,aws]"
# Install Dorieh and its dependencies.
pip install dorieh
Design overview
We briefly outline the pipeline design before touching any code.
Inputs
Climate variable: daily maximum temperature (tmmx) from a gridded dataset (e.g., GRIDMET / TerraClimate via Google Earth Engine / NKN).
Important
As we progress with building the workflow, we will see that we also need an additional input: a shapefile with ZIP Code Tabulation Areas (ZCTAs).
Outputs
File output: a compressed CSV with columns: date, zcta, and tmmx (Kelvin).
Database outputs:
Bronze table: bronze_temperature(date, zcta, tmmx).
Silver view: silver_temperature with:
temperature_in_C, temperature_in_F
us_state, city
Gold materialized view: gold_temperature_by_state with mean temperature and temperature span per state/day.
Architecture
We will:
Use CWL to orchestrate:
climate download → shape download → spatial aggregation → DB init → ingestion → Silver/Gold creation.
Use the Dorieh data‑modeling DSL (YAML) to define:
Bronze/Silver/Gold schema and transformations.
Use Dorieh utilities to generate:
Workflow documentation (from CWL).
Data dictionaries and lineage diagrams (from YAML).
Directory layout
Create a working directory for this tutorial, for example:
cd tmp
mkdir -p dorieh/tutorials/climate
cd dorieh/tutorials/climate
We will place:
example1.cwl – the CWL workflow.
example1_model.yml – the domain definition for Bronze/Silver/Gold.
(Optionally) a local database.ini if you need to adjust it for your environment.
Step 1. Create a minimal CWL workflow skeleton
We will start with a minimal CWL workflow definition containing the main steps—data acquisition, shape file retrieval, and aggregation. At this stage, placeholders can be used for inputs and outputs; these will be filled in as more details on the required tool parameters are gathered.
Dorieh provides a library of prebuilt CWL tool definitions, streamlining common steps such as downloading remote files and running aggregations. Here we will use two of them: download.cwl and aggregate_daily.cwl.
The initial workflow skeleton can look like:
#!/usr/bin/env cwl-runner

cwlVersion: v1.2
class: Workflow

requirements:
  SubworkflowFeatureRequirement: {}
  StepInputExpressionRequirement: {}
  InlineJavascriptRequirement: {}
  ScatterFeatureRequirement: {}
  MultipleInputFeatureRequirement: {}
  NetworkAccess:
    networkAccess: True

inputs: {}

steps:
  download:
    run: https://raw.githubusercontent.com/ForomePlatform/dorieh/main/src/cwl/download.cwl

  aggregate:
    run: https://raw.githubusercontent.com/ForomePlatform/dorieh/main/src/cwl/aggregate_daily.cwl

outputs: {}
This skeleton is not yet runnable. It defines two steps but no inputs, outputs, or wiring between them.
Step 2. Iteratively Defining Steps and Parameters
At this implementation step we need to:
For each workflow step, consult the underlying tool’s documentation or CWL file to determine:
Required input parameters
Output files and directory structure
Incrementally add inputs (e.g., band, year, geography), outputs, and wiring between steps (using CWL’s in and out sections).
Dorieh’s cwl_collect_outputs utility helps automate the extraction of output specifications and minimizes errors.
To illustrate the process, we will first look at the download step. The tool we are using is Dorieh’s download.cwl.
It takes three input parameters:
proxy settings,
the year, for which we will be performing data aggregation,
the meteorological band (or variable) that we would like to aggregate.
If you are using a system with direct access to the Internet, you probably do not need a proxy; however, we do need to take care of year and band.
After adding these two input parameters to the inputs section and to the download step, the workflow file will look like:
#!/usr/bin/env cwl-runner

cwlVersion: v1.2
class: Workflow

requirements:
  SubworkflowFeatureRequirement: {}
  StepInputExpressionRequirement: {}
  InlineJavascriptRequirement: {}
  ScatterFeatureRequirement: {}
  MultipleInputFeatureRequirement: {}
  NetworkAccess:
    networkAccess: True

inputs:
  band:
    type: string
  year:
    type: string

steps:
  download:
    run: https://raw.githubusercontent.com/ForomePlatform/dorieh/main/src/cwl/download.cwl
    in:
      year: year
      band: band

  aggregate:
    run: https://raw.githubusercontent.com/ForomePlatform/dorieh/main/src/cwl/aggregate_daily.cwl

outputs: {}
Though we have now defined the inputs for the first step, the workflow is still not runnable because we have not yet defined the outputs. As mentioned, Dorieh’s cwl_collect_outputs utility assists with this task. You can run:
python -m dorieh.platform.util.cwl_collect_outputs download https://raw.githubusercontent.com/ForomePlatform/dorieh/refs/heads/main/src/cwl/download.cwl
The command should produce the following output:
out:
- log
- data
- errors
## Generated by dorieh.platform.util.cwl_collect_outputs from download.cwl:
download_log:
  type: File?
  outputSource: download/log
download_data:
  type: File?
  outputSource: download/data
download_errors:
  type: File
  outputSource: download/errors
From the standard output of this command, copy the “out” section to the step and the rest to the “outputs” section. You can exclude the outputs you want to ignore. The resulting workflow file should look like:
cwlVersion: v1.2
class: Workflow

requirements:
  SubworkflowFeatureRequirement: {}
  StepInputExpressionRequirement: {}
  InlineJavascriptRequirement: {}
  ScatterFeatureRequirement: {}
  MultipleInputFeatureRequirement: {}
  NetworkAccess:
    networkAccess: True

inputs:
  band:
    type: string
  year:
    type: string

steps:
  download:
    run: https://raw.githubusercontent.com/ForomePlatform/dorieh/main/src/cwl/download.cwl
    in:
      year: year
      band: band
    out:
      - log
      - data
      - errors

  aggregate:
    run: https://raw.githubusercontent.com/ForomePlatform/dorieh/main/src/cwl/aggregate_daily.cwl

outputs:
  ## Generated by dorieh.platform.util.cwl_collect_outputs from download.cwl:
  download_log:
    type: File?
    outputSource: download/log
  download_data:
    type: File?
    outputSource: download/data
  download_errors:
    type: File
    outputSource: download/errors
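Conceptually, cwl_collect_outputs reads the tool’s outputs section and derives both the step-level out list and the workflow-level outputs wiring. The following Python sketch illustrates the idea under simplified assumptions: it starts from an already-parsed outputs dictionary, and the function name and structure are illustrative, not Dorieh’s actual implementation (which also fetches and parses the CWL file).

```python
def collect_outputs(step_name, tool_outputs):
    """Build the step's 'out' list and the workflow-level 'outputs' wiring
    from a tool's parsed outputs section (illustrative sketch)."""
    # The step's 'out' section is just the list of output names.
    out_list = list(tool_outputs)
    # Workflow-level outputs are prefixed with the step name and point
    # back to the step via outputSource.
    wf_outputs = {
        f"{step_name}_{name}": {
            "type": spec["type"],
            "outputSource": f"{step_name}/{name}",
        }
        for name, spec in tool_outputs.items()
    }
    return out_list, wf_outputs


# The outputs section of download.cwl, as collected above
download_outputs = {
    "log": {"type": "File?"},
    "data": {"type": "File?"},
    "errors": {"type": "File"},
}

out_list, wf_outputs = collect_outputs("download", download_outputs)
```

Running this produces exactly the wiring shown in the listing above: out_list is ["log", "data", "errors"], and wf_outputs maps download_data to outputSource download/data, and so on.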
Now we will repeat the process for the aggregate step. Looking at the
tool we are using, aggregate_daily.cwl, we notice that besides the
input parameters we already discussed, it also requires shapefiles.
Dorieh provides another CWL tool to download the shapefiles for US
geographies (ZCTAs and counties) from the US Census website:
get_shapes.cwl.
Therefore, we need to add a third step to the existing two:
get_shapes:
  run: https://raw.githubusercontent.com/ForomePlatform/dorieh/main/src/cwl/get_shapes.cwl
This brings us to the following workflow definition:
cwlVersion: v1.2
class: Workflow

requirements:
  SubworkflowFeatureRequirement: {}
  StepInputExpressionRequirement: {}
  InlineJavascriptRequirement: {}
  ScatterFeatureRequirement: {}
  MultipleInputFeatureRequirement: {}
  NetworkAccess:
    networkAccess: True

inputs:
  band:
    type: string
  year:
    type: string

steps:
  download:
    run: https://raw.githubusercontent.com/ForomePlatform/dorieh/main/src/cwl/download.cwl
    in:
      year: year
      band: band
    out:
      - log
      - data
      - errors

  get_shapes:
    run: https://raw.githubusercontent.com/ForomePlatform/dorieh/main/src/cwl/get_shapes.cwl

  aggregate:
    run: https://raw.githubusercontent.com/ForomePlatform/dorieh/main/src/cwl/aggregate_daily.cwl

outputs:
  ## Generated by dorieh.platform.util.cwl_collect_outputs from download.cwl:
  download_log:
    type: File?
    outputSource: download/log
  download_data:
    type: File?
    outputSource: download/data
  download_errors:
    type: File
    outputSource: download/errors
After repeating the process described above for the remaining two steps
(get_shapes and aggregate) we finally have a runnable workflow
definition:
#!/usr/bin/env cwl-runner

cwlVersion: v1.2
class: Workflow

requirements:
  SubworkflowFeatureRequirement: {}
  StepInputExpressionRequirement: {}
  InlineJavascriptRequirement: {}
  ScatterFeatureRequirement: {}
  MultipleInputFeatureRequirement: {}
  NetworkAccess:
    networkAccess: True

inputs:
  band:
    type: string
  year:
    type: string
  geography:
    type: string

steps:
  download:
    run: https://raw.githubusercontent.com/ForomePlatform/dorieh/main/src/cwl/download.cwl
    in:
      year: year
      band: band
    out:
      - log
      - data
      - errors

  get_shapes:
    run: https://raw.githubusercontent.com/ForomePlatform/dorieh/main/src/cwl/get_shapes.cwl
    in:
      geo: geography
      year: year
    out:
      - shape_files

  aggregate:
    run: https://raw.githubusercontent.com/ForomePlatform/dorieh/main/src/cwl/aggregate_daily.cwl
    in:
      geography: geography
      year: year
      band: band
      input: download/data
      shape_files: get_shapes/shape_files
    out:
      - log
      - data
      - errors

outputs:
  ## Generated by dorieh.platform.util.cwl_collect_outputs from download.cwl:
  download_log:
    type: File?
    outputSource: download/log
  download_data:
    type: File?
    outputSource: download/data
  download_errors:
    type: File
    outputSource: download/errors
  ## Generated by dorieh.platform.util.cwl_collect_outputs from get_shapes.cwl:
  get_shapes_shape_files:
    type: File[]
    outputSource: get_shapes/shape_files
  ## Generated by dorieh.platform.util.cwl_collect_outputs from aggregate_daily.cwl:
  aggregate_log:
    type: File?
    outputSource: aggregate/log
  aggregate_data:
    type: File?
    outputSource: aggregate/data
  aggregate_errors:
    type: File
    outputSource: aggregate/errors
At this point, the workflow can aggregate a full year of data, but that is slow for development. Next we add a “toy slice” mode.
If you would like to test the current workflow, and do not mind waiting hours for it to complete, you can run it with the following command:
toil-cwl-runner --retryCount 3 --cleanWorkDir never --outdir outputs example1.cwl --workDir . --band tmmx --year 2019 --geography zcta
Step 3. Parameterize for a single day (“toy” run)
To speed up development and debugging, we can parameterize the workflow so that it can run on “toy” datasets—for example, filtering a single day (e.g., 2019-01-15) instead of an entire year. Running your pipeline on a minimal dataset ensures quick feedback, uncovers errors early, and avoids wasting compute resources.
First, we need to modify the inputs section of the workflow:
inputs:
  band:
    type: string
  date:
    type: string     # e.g. "2019-01-15"
  geography:
    type: string     # "zcta" or "county"
We must keep in mind that such parameterization often requires additional input transformations (e.g., extracting the year from a date); hence, we need to insert transformation steps, chaining outputs via CWL’s valueFrom mechanism.
In our case, to allow the workflow to run on a toy dataset, we will
first replace the input parameter year with date. Our downstream
steps, however, require a year, e.g., to download the right shape
file. Hence, we will add an additional step to extract the year from
the date:
extract_year:
  run: https://raw.githubusercontent.com/ForomePlatform/dorieh/main/src/cwl/parse_date.cwl
  in:
    date: date
  out:
    - year
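The essence of this transformation is just deriving the year from an ISO date string. A minimal Python sketch of that logic (illustrative only; parse_date.cwl wraps the actual Dorieh implementation):

```python
from datetime import date


def extract_year(iso_date: str) -> str:
    """Validate an ISO-formatted date and return its year as a string,
    which is what the downstream steps (download, get_shapes) expect."""
    return str(date.fromisoformat(iso_date).year)


year = extract_year("2019-01-15")  # "2019"
```

Using a real date parser rather than simple string slicing has the advantage of rejecting malformed dates early, before any expensive download starts.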
With these changes in place, the workflow is now:
#!/usr/bin/env cwl-runner

cwlVersion: v1.2
class: Workflow

requirements:
  SubworkflowFeatureRequirement: {}
  StepInputExpressionRequirement: {}
  InlineJavascriptRequirement: {}
  ScatterFeatureRequirement: {}
  MultipleInputFeatureRequirement: {}
  NetworkAccess:
    networkAccess: True

inputs:
  band:
    type: string
  date:
    type: string
  geography:
    type: string

steps:
  extract_year:
    run: https://raw.githubusercontent.com/ForomePlatform/dorieh/main/src/cwl/parse_date.cwl
    in:
      date: date
    out:
      - year

  download:
    run: https://raw.githubusercontent.com/ForomePlatform/dorieh/main/src/cwl/download.cwl
    in:
      year: extract_year/year
      band: band
    out:
      - log
      - data
      - errors

  get_shapes:
    run: https://raw.githubusercontent.com/ForomePlatform/dorieh/main/src/cwl/get_shapes.cwl
    in:
      geo: geography
      year: extract_year/year
    out:
      - shape_files

  aggregate:
    run: https://raw.githubusercontent.com/ForomePlatform/dorieh/main/src/cwl/aggregate_daily.cwl
    in:
      geography: geography
      year: extract_year/year
      dates: date
      band: band
      input: download/data
      shape_files: get_shapes/shape_files
      strategy:
        valueFrom: default
    out:
      - log
      - data
      - errors

outputs:
  ## Generated by dorieh.platform.util.cwl_collect_outputs from download.cwl:
  download_log:
    type: File?
    outputSource: download/log
  download_data:
    type: File?
    outputSource: download/data
  download_errors:
    type: File
    outputSource: download/errors
  ## Generated by dorieh.platform.util.cwl_collect_outputs from get_shapes.cwl:
  get_shapes_shape_files:
    type: File[]
    outputSource: get_shapes/shape_files
  ## Generated by dorieh.platform.util.cwl_collect_outputs from aggregate_daily.cwl:
  aggregate_log:
    type: File?
    outputSource: aggregate/log
  aggregate_data:
    type: File?
    outputSource: aggregate/data
  aggregate_errors:
    type: File
    outputSource: aggregate/errors
It can be run with the following command:
toil-cwl-runner --retryCount 3 --cleanWorkDir never --outdir outputs example1.cwl --workDir . --band tmmx --date 2019-01-15 --geography zcta
If successful, you should find a gzipped CSV file named
tmmx_zcta_polygon_2019.csv.gz in the outputs directory, containing the following columns:
date
zcta
tmmx
This is exactly what we will ingest into the Bronze layer.
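As a quick sanity check, the file can be inspected with Python’s standard library alone. The sketch below builds a tiny in-memory stand-in for the gzipped CSV (the row values are made up); to inspect your real output, pass the path to tmmx_zcta_polygon_2019.csv.gz to gzip.open instead of the buffer.

```python
import csv
import gzip
import io

# In-memory stand-in for tmmx_zcta_polygon_2019.csv.gz (values are made up)
buf = io.BytesIO()
with gzip.open(buf, "wt", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["date", "zcta", "tmmx"])
    writer.writerow(["2019-01-15", "02138", "265.4"])

# Read it back the same way you would read the real file, e.g.:
#   gzip.open("outputs/tmmx_zcta_polygon_2019.csv.gz", "rt", newline="")
buf.seek(0)
with gzip.open(buf, "rt", newline="") as f:
    rows = list(csv.DictReader(f))
```

Each row is then a dict keyed by the header, e.g. rows[0]["zcta"], which is convenient for spot-checking a few values before ingestion.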
If the workflow fails, follow the troubleshooting steps described in the Troubleshooting documentation.
Step 4. Add database integration (PostgreSQL)
Start or check PostgreSQL
If you already have a PostgreSQL database running, you will need to
create a database.ini file as described in the
Documentation.
Otherwise, you can start a PostgreSQL container using Docker:
git clone https://github.com/ForomePlatform/dorieh.git
cd dorieh/docker/pg-hll
docker compose up -d
In the latter case, use the provided database.ini file.
Add PostgreSQL integration to the workflow
Back in your tutorial directory (dorieh/tutorials/climate), add two new workflow inputs to example1.cwl:
inputs:
  band:
    type: string
  date:
    type: string
  geography:
    type: string
  database:
    type: File
    default:
      class: File
      location: https://raw.githubusercontent.com/ForomePlatform/dorieh/main/examples/with-postgres/database.ini
  connection_name:
    type: string
    default: "localhost"
Add the Database initialization step
Another required action is to ensure that the database contains the
latest Dorieh code for PostgreSQL. This is done by adding an initdb
step:
initdb:
  run: https://raw.githubusercontent.com/ForomePlatform/dorieh/main/src/cwl/initcoredb.cwl
  in:
    database: database
    connection_name: connection_name
  out:
    - log
    - err
Optionally (though we recommend it), add the outputs of the
initdb step to the pipeline outputs:
outputs:
  # ... existing outputs ...
  initdb_log:
    type: File
    outputSource: initdb/log
  initdb_err:
    type: File
    outputSource: initdb/err
Defining the Data Model
To load the data into a database, we also need to define the database schema. It is possible to infer the schema automatically using Dorieh tools such as Project Loader and Introspector, but for the Medallion architecture the schema should be defined explicitly; we will then use it with the Data Loader tool.
See also
The data model definition language is described in the Data Model documentation.
Initially, we will define a simple schema for the Bronze layer:
tutorial:
  header: true
  quoting: 3
  index: "unless excluded"
  description: "Data model for data transformation tutorial"
  tables:
    bronze_temperature:
      description: |
        Maximum daily temperature for US Zip Code Tabulation Areas
      columns:
        - tmmx:
            type: float
            description: Maximum temperature variable from the TerraClimate dataset in K
            reference: https://developers.google.com/earth-engine/datasets/catalog/IDAHO_EPSCOR_GRIDMET#bands
        - date:
            type: date
        - zcta:
            description: Zip Code for a Zip Code Tabulation Area
            type: int
      primary_key:
        - zcta
        - date
It defines a data domain named “tutorial” and, in it, a table named
“bronze_temperature” with three columns: tmmx, date, and zcta. We will
assume that the file name is example1_model.yml.
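For intuition, the table definition above corresponds roughly to the DDL rendered by the Python sketch below. This is an illustrative approximation only: Dorieh’s Data Loader generates the actual SQL (including indexes implied by the index directive), so the exact statement it issues may differ.

```python
# Columns and primary key as declared in example1_model.yml
columns = {"tmmx": "FLOAT", "date": "DATE", "zcta": "INT"}
primary_key = ["zcta", "date"]

# Render a simple CREATE TABLE statement from the declarations
column_defs = ",\n    ".join(f"{name} {sql_type}" for name, sql_type in columns.items())
ddl = (
    "CREATE TABLE bronze_temperature (\n"
    f"    {column_defs},\n"
    f"    PRIMARY KEY ({', '.join(primary_key)})\n"
    ");"
)
print(ddl)
```

The composite primary key on (zcta, date) is what guarantees that each area has at most one temperature reading per day.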
Adding Ingestion Step
Now we can add the actual ingestion step to the workflow using the Dorieh ingest tool:
ingest:
  run: https://raw.githubusercontent.com/ForomePlatform/dorieh/main/src/cwl/ingest.cwl
  in:
    depends_on: initdb/log
    registry:
      default:
        class: File
        location: "https://raw.githubusercontent.com/ForomePlatform/dorieh/main/doc/tutorial/climate/example1_model.yml"
    domain:
      valueFrom: "tutorial"
    table:
      valueFrom: "bronze_temperature"
    input: aggregate/data
    database: database
    connection_name: connection_name
  out:
    - log
    - errors
After adding ingestion to steps and the logs it produces to the outputs, the resulting workflow file should look like:
#!/usr/bin/env cwl-runner

cwlVersion: v1.2
class: Workflow

requirements:
  SubworkflowFeatureRequirement: {}
  StepInputExpressionRequirement: {}
  InlineJavascriptRequirement: {}
  ScatterFeatureRequirement: {}
  MultipleInputFeatureRequirement: {}
  NetworkAccess:
    networkAccess: True

inputs:
  band:
    type: string
  date:
    type: string
  geography:
    type: string
  database:
    type: File
    default:
      class: File
      location: https://raw.githubusercontent.com/ForomePlatform/dorieh/refs/heads/main/examples/with-postgres/database.ini
  connection_name:
    type: string
    default: "localhost"

steps:
  extract_year:
    run: https://raw.githubusercontent.com/ForomePlatform/dorieh/main/src/cwl/parse_date.cwl
    in:
      date: date
    out:
      - year

  download:
    run: https://raw.githubusercontent.com/ForomePlatform/dorieh/main/src/cwl/download.cwl
    in:
      year: extract_year/year
      band: band
    out:
      - log
      - data
      - errors

  get_shapes:
    run: https://raw.githubusercontent.com/ForomePlatform/dorieh/main/src/cwl/get_shapes.cwl
    in:
      geo: geography
      year: extract_year/year
    out:
      - shape_files

  aggregate:
    run: https://raw.githubusercontent.com/ForomePlatform/dorieh/main/src/cwl/aggregate_daily.cwl
    in:
      geography: geography
      year: extract_year/year
      dates: date
      band: band
      input: download/data
      shape_files: get_shapes/shape_files
      strategy:
        valueFrom: default
    out:
      - log
      - data
      - errors

  initdb:
    run: https://raw.githubusercontent.com/ForomePlatform/dorieh/main/src/cwl/initcoredb.cwl
    in:
      database: database
      connection_name: connection_name
    out:
      - log
      - err

  ingest:
    run: https://raw.githubusercontent.com/ForomePlatform/dorieh/main/src/cwl/ingest.cwl
    in:
      depends_on: initdb/log
      registry:
        default:
          class: File
          location: "https://raw.githubusercontent.com/ForomePlatform/dorieh/main/doc/tutorial/climate/example1_model.yml"
      domain:
        valueFrom: "tutorial"
      table:
        valueFrom: "bronze_temperature"
      input: aggregate/data
      database: database
      connection_name: connection_name
    out:
      - log
      - errors

outputs:
  ## Generated by dorieh.platform.util.cwl_collect_outputs from download.cwl:
  download_log:
    type: File?
    outputSource: download/log
  download_data:
    type: File?
    outputSource: download/data
  download_errors:
    type: File
    outputSource: download/errors
  ## Generated by dorieh.platform.util.cwl_collect_outputs from get_shapes.cwl:
  get_shapes_shape_files:
    type: File[]
    outputSource: get_shapes/shape_files
  ## Generated by dorieh.platform.util.cwl_collect_outputs from aggregate_daily.cwl:
  aggregate_log:
    type: File?
    outputSource: aggregate/log
  aggregate_data:
    type: File?
    outputSource: aggregate/data
  aggregate_errors:
    type: File
    outputSource: aggregate/errors

  initdb_log:
    type: File
    outputSource: initdb/log
  initdb_err:
    type: File
    outputSource: initdb/err

  ingest_log:
    type: File
    outputSource: ingest/log
  ingest_err:
    type: File
    outputSource: ingest/errors
The same command as before can be used to run the workflow:
toil-cwl-runner --retryCount 3 --cleanWorkDir never --outdir outputs example1.cwl --workDir . --band tmmx --date 2019-01-15 --geography zcta
If you chose to perform this run, besides the CSV file, you will also
see the data in the database, in the table bronze_temperature.
Step 5. Building Medallion Layers (Bronze, Silver, Gold)
Medallion architecture defines three layers:
Bronze Layer: load as-is, minimally processed data from pipeline outputs into the database.
In this climate data example, the “raw” data is not strictly straight-from-source: an initial aggregation is necessary for technical compatibility, because NetCDF data cannot be ingested directly into most DBMSs unless specialized extensions are installed. Hence, we need to transform the data to a more conventional tabular format before ingestion; this is exactly the operation performed by the aggregation step.
Silver Layer: Clean, harmonize, and enrich data.
Built from Bronze layer (no external inputs are allowed).
Add derived columns (e.g., Celsius/Fahrenheit conversions, state & city annotation via ZIP lookup).
Define as a view on top of the bronze table in your data model YAML.
Gold Layer: Produce analytic/ML-ready outputs.
Built from Silver Layer.
Perform groupings and summaries (e.g., aggregating by state and date).
Use materialized views where performance and reusability are required.
The Silver layer is usually responsible for normalization, harmonization, and cleansing of the data. In our example the data does not require cleansing, as it is already clean and normalized, and it does not require harmonization because it comes from a single source. However, even when source data is already clean, enriching it with derived fields or external metadata (such as regional names) increases analytic utility and facilitates downstream ML feature engineering. Hence, we illustrate enrichment by adding columns that express the temperature in different units and that annotate ZIP codes with US state abbreviations and, for areas lying within a city, the city names.
Based on the discussion above, for the silver layer we will add a corresponding step:
build_silver:
  run: https://raw.githubusercontent.com/ForomePlatform/dorieh/main/src/cwl/create.cwl
  in:
    depends_on: ingest/log
    registry:
      default:
        class: File
        location: "https://raw.githubusercontent.com/ForomePlatform/dorieh/main/doc/tutorial/climate/example1_model.yml"
    domain:
      valueFrom: "tutorial"
    table:
      valueFrom: "silver_temperature"
    database: database
    connection_name: connection_name
  out:
    - log
    - errors
This step builds a view named silver_temperature. We also need to
describe it in the data model file. A best practice is to keep
your Silver and Gold layer table/view definitions together in a
versioned domain YAML file, checked into source control along with
your workflow scripts. Hence, we will add the following definition to
example1_model.yml:
silver_temperature:
  description: |
    Maximum daily temperature for US Zip Code Tabulation Areas, enriched and harmonized
  create:
    type: view
    from: bronze_temperature
  columns:
    - tmmx
    - date
    - zcta
    - temperature_in_C:
        type: float
        description: Temperature in Celsius
        source: (tmmx - 273.15)
    - temperature_in_F:
        type: float
        description: Temperature in Fahrenheit
        source: ((tmmx - 273.15)*9/5 + 32)
    - us_state:
        type: VARCHAR(2)
        description: US State
        source: "public.zip_to_state(EXTRACT(YEAR FROM date)::INT, zcta)"
    - city:
        type: VARCHAR(128)
        description: >
          Name of a representative city for the ZIP Code Tabulation Area (ZCTA);
          for ZCTAs spanning multiple cities, this is the city covering the largest
          portion of the area or population.
        source: "public.zip_to_city(EXTRACT(YEAR FROM date)::INT, zcta)"
This silver view retains all three bronze columns and adds four new ones:
The temperature in degrees Celsius, computed by the trivial conversion formula, for the benefit of readers outside the United States.
The temperature in degrees Fahrenheit, computed by the well-known conversion formula, for the benefit of readers in the United States.
The two-letter abbreviation of the state in which the area lies, computed by calling the Dorieh built-in function zip_to_state.
The name of the city for the ZIP code, for urban areas, computed by calling the Dorieh built-in function zip_to_city.
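The unit conversions used in the source expressions above are the standard Kelvin-to-Celsius and Kelvin-to-Fahrenheit formulas. In Python:

```python
def kelvin_to_celsius(tmmx: float) -> float:
    # Mirrors the SQL expression (tmmx - 273.15)
    return tmmx - 273.15


def kelvin_to_fahrenheit(tmmx: float) -> float:
    # Mirrors the SQL expression ((tmmx - 273.15)*9/5 + 32)
    return (tmmx - 273.15) * 9 / 5 + 32


freezing_c = kelvin_to_celsius(273.15)    # freezing point of water, 0 °C
boiling_f = kelvin_to_fahrenheit(373.15)  # boiling point of water, 212 °F
```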
After you have made these changes to both files (example1.cwl and
example1_model.yml), you are welcome to run the pipeline again,
using the same command as above. It will now build the Silver layer.
Alternatively, we can add the Gold layer before testing.
In the Gold layer, we will add just one table that computes
state-level data ready for analysis. The table, named
gold_temperature_by_state, is defined by the following block:
gold_temperature_by_state:
  description: |
    Temperature variations by US State
  create:
    type: materialized view
    from: silver_temperature
    group by:
      - us_state
      - date
  columns:
    - us_state
    - date
    - t_span:
        type: float
        description: Temperature variation in Celsius
        source: MAX(tmmx) - MIN(tmmx)
    - t_mean_in_C:
        type: float
        description: Mean Temperature in Celsius
        source: AVG(temperature_in_C)
    - t_mean_in_F:
        type: float
        description: Mean Temperature in Fahrenheit
        source: AVG(temperature_in_F)
  primary_key:
    - us_state
    - date
The gold table contains, for every US state and date, the mean temperature and the variation (span) of the temperature on that day. The span is computed over daily maximum temperatures across the state, so it does not reflect change during the day, only the geographic diversity within the state.
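The grouping logic of the materialized view can be sketched in plain Python, which may help when validating results; the rows below are made up for illustration:

```python
from collections import defaultdict

# (us_state, tmmx in Kelvin) rows for a single date; values are made up
rows = [
    ("MA", 270.0),
    ("MA", 274.0),
    ("CA", 280.0),
    ("CA", 290.0),
    ("CA", 285.0),
]

# GROUP BY us_state
by_state = defaultdict(list)
for state, tmmx in rows:
    by_state[state].append(tmmx)

gold = {
    state: {
        # MAX(tmmx) - MIN(tmmx): spread of daily maxima across the state
        "t_span": max(values) - min(values),
        # AVG(temperature_in_C); the conversion is affine, so this equals
        # AVG(tmmx) - 273.15
        "t_mean_in_C": sum(values) / len(values) - 273.15,
    }
    for state, values in by_state.items()
}
```

Note that because Kelvin and Celsius differ only by an offset, the span is identical in either unit, which is why the view can compute t_span directly from tmmx.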
We now need to add a step that builds the Gold layer to the workflow itself. The step is nearly identical to the Silver one; only the target table name differs:
build_gold:
  run: https://raw.githubusercontent.com/ForomePlatform/dorieh/main/src/cwl/create.cwl
  in:
    depends_on: build_silver/log
    registry:
      default:
        class: File
        location: "https://raw.githubusercontent.com/ForomePlatform/dorieh/main/doc/tutorial/climate/example1_model.yml"
    domain:
      valueFrom: "tutorial"
    table:
      valueFrom: "gold_temperature_by_state"
    database: database
    connection_name: connection_name
  out:
    - log
    - errors
The final version of the workflow is:
#!/usr/bin/env cwl-runner

cwlVersion: v1.2
class: Workflow

requirements:
  SubworkflowFeatureRequirement: {}
  StepInputExpressionRequirement: {}
  InlineJavascriptRequirement: {}
  ScatterFeatureRequirement: {}
  MultipleInputFeatureRequirement: {}
  NetworkAccess:
    networkAccess: True

inputs:
  band:
    type: string
  date:
    type: string
  geography:
    type: string
  database:
    type: File
    default:
      class: File
      location: https://raw.githubusercontent.com/ForomePlatform/dorieh/refs/heads/main/examples/with-postgres/database.ini
  connection_name:
    type: string
    default: "localhost"

steps:
  extract_year:
    run: https://raw.githubusercontent.com/ForomePlatform/dorieh/main/src/cwl/parse_date.cwl
    in:
      date: date
    out:
      - year

  download:
    run: https://raw.githubusercontent.com/ForomePlatform/dorieh/main/src/cwl/download.cwl
    in:
      year: extract_year/year
      band: band
    out:
      - log
      - data
      - errors

  get_shapes:
    run: https://raw.githubusercontent.com/ForomePlatform/dorieh/main/src/cwl/get_shapes.cwl
    in:
      geo: geography
      year: extract_year/year
    out:
      - shape_files

  aggregate:
    run: https://raw.githubusercontent.com/ForomePlatform/dorieh/main/src/cwl/aggregate_daily.cwl
    in:
      geography: geography
      year: extract_year/year
      dates: date
      band: band
      input: download/data
      shape_files: get_shapes/shape_files
      strategy:
        valueFrom: default
    out:
      - log
      - data
      - errors

  initdb:
    run: https://raw.githubusercontent.com/ForomePlatform/dorieh/main/src/cwl/initcoredb.cwl
    in:
      database: database
      connection_name: connection_name
    out:
      - log
      - err

  ingest:
    run: https://raw.githubusercontent.com/ForomePlatform/dorieh/main/src/cwl/ingest.cwl
    in:
      depends_on: initdb/log
      registry:
        default:
          class: File
          location: "https://raw.githubusercontent.com/ForomePlatform/dorieh/main/doc/tutorial/climate/example1_model.yml"
      domain:
        valueFrom: "tutorial"
      table:
        valueFrom: "bronze_temperature"
      input: aggregate/data
      database: database
      connection_name: connection_name
    out:
      - log
      - errors

  build_silver:
    run: https://raw.githubusercontent.com/ForomePlatform/dorieh/main/src/cwl/create.cwl
    in:
      depends_on: ingest/log
      registry:
        default:
          class: File
          location: "https://raw.githubusercontent.com/ForomePlatform/dorieh/main/doc/tutorial/climate/example1_model.yml"
      domain:
        valueFrom: "tutorial"
      table:
        valueFrom: "silver_temperature"
      database: database
      connection_name: connection_name
    out:
      - log
      - errors

  build_gold:
    run: https://raw.githubusercontent.com/ForomePlatform/dorieh/main/src/cwl/create.cwl
    in:
      depends_on: build_silver/log
      registry:
        default:
          class: File
          location: "https://raw.githubusercontent.com/ForomePlatform/dorieh/main/doc/tutorial/climate/example1_model.yml"
      domain:
        valueFrom: "tutorial"
      table:
        valueFrom: "gold_temperature_by_state"
      database: database
      connection_name: connection_name
    out:
      - log
      - errors

outputs:
  ## Generated by dorieh.platform.util.cwl_collect_outputs from download.cwl:
  download_log:
    type: File?
    outputSource: download/log
  download_data:
    type: File?
    outputSource: download/data
  download_errors:
    type: File
    outputSource: download/errors
  ## Generated by dorieh.platform.util.cwl_collect_outputs from get_shapes.cwl:
  get_shapes_shape_files:
    type: File[]
    outputSource: get_shapes/shape_files
  ## Generated by dorieh.platform.util.cwl_collect_outputs from aggregate_daily.cwl:
  aggregate_log:
    type: File?
    outputSource: aggregate/log
  aggregate_data:
    type: File?
    outputSource: aggregate/data
  aggregate_errors:
    type: File
    outputSource: aggregate/errors

  initdb_log:
    type: File
    outputSource: initdb/log
  initdb_err:
    type: File
    outputSource: initdb/err

  ingest_log:
    type: File
    outputSource: ingest/log
  ingest_err:
    type: File
    outputSource: ingest/errors

  build_silver_log:
    type: File
    outputSource: build_silver/log
  build_silver_err:
    type: File
    outputSource: build_silver/errors

  build_gold_log:
    type: File
    outputSource: build_gold/log
  build_gold_err:
    type: File
    outputSource: build_gold/errors
The final Medallion data model is:
tutorial:
  header: true
  quoting: 3
  index: "unless excluded"
  description: "Data model for data transformation tutorial"
  tables:
    bronze_temperature:
      description: |
        Maximum daily temperature for US Zip Code Tabulation Areas
      columns:
        - tmmx:
            type: float
            description: Maximum temperature variable from the TerraClimate dataset in K
            reference: https://developers.google.com/earth-engine/datasets/catalog/IDAHO_EPSCOR_GRIDMET#bands
        - date:
            type: date
        - zcta:
            description: Zip Code for a Zip Code Tabulation Area
            type: int
      primary_key:
        - zcta
        - date

    silver_temperature:
      description: |
        Maximum daily temperature for US Zip Code Tabulation Areas, enriched and harmonized
      create:
        type: view
        from: bronze_temperature
      columns:
        - tmmx
        - date
        - zcta
        - temperature_in_C:
            type: float
            description: Temperature in Celsius
            source: (tmmx - 273.15)
        - temperature_in_F:
            type: float
            description: Temperature in Fahrenheit
            source: ((tmmx - 273.15)*9/5 + 32)
        - us_state:
            type: VARCHAR(2)
            description: US State
            source: "public.zip_to_state(EXTRACT(YEAR FROM date)::INT, zcta)"
        - city:
            type: VARCHAR(128)
            description: >
              Name of a representative city for the ZIP Code Tabulation Area (ZCTA);
              for ZCTAs spanning multiple cities, this is the city covering the largest
              portion of the area or population.
            source: "public.zip_to_city(EXTRACT(YEAR FROM date)::INT, zcta)"

    gold_temperature_by_state:
      description: |
        Temperature variations by US State
      create:
        type: materialized view
        from: silver_temperature
        group by:
          - us_state
          - date
      columns:
        - us_state
        - date
        - t_span:
            type: float
            description: Temperature variation in Celsius
            source: MAX(tmmx) - MIN(tmmx)
        - t_mean_in_C:
            type: float
            description: Mean Temperature in Celsius
            source: AVG(temperature_in_C)
        - t_mean_in_F:
            type: float
            description: Mean Temperature in Fahrenheit
            source: AVG(temperature_in_F)
      primary_key:
        - us_state
        - date
Step 6. Testing the Pipeline
At this point example1.cwl orchestrates:
Data download.
Shape download.
Spatial aggregation.
DB initialization.
Bronze ingestion.
Silver view creation.
Gold materialized view creation.
The pipeline is now ready to be tested. You can run it with the same command as before:
toil-cwl-runner --retryCount 3 --cleanWorkDir never --outdir outputs example1.cwl --workDir . --band tmmx --date 2019-01-15 --geography zcta
If everything completes successfully, you should see the following tables in your PostgreSQL database:
bronze_temperature (table)
silver_temperature (view)
gold_temperature_by_state (materialized view)
You can also run the following SQL queries to verify the data:
-- Inspect Bronze
SELECT * FROM bronze_temperature
ORDER BY date, zcta
LIMIT 10;
-- Inspect Silver
SELECT date, zcta, temperature_in_C, us_state, city
FROM silver_temperature
ORDER BY date, zcta
LIMIT 10;
-- Inspect Gold: which state was hottest on 2019‑01‑15?
SELECT us_state, t_mean_in_C, t_span
FROM gold_temperature_by_state
WHERE date = '2019-01-15'
ORDER BY t_mean_in_C DESC
LIMIT 10;
Next Steps
In the next steps, we will learn how to: