Tutorial: Constructing data dictionaries and lineage graphs
This tutorial demonstrates how to construct a data dictionary and lineage graph for a climate analysis workflow built in the Building a workflow Tutorial.
What we can generate
Dorieh provides an integrated dictionary and lineage utility that analyzes your workflow and domain definition file (such as example1_model.yml) to automatically generate a full suite of human- and machine-readable artifacts:
a comprehensive data dictionary, and
graphical lineage diagrams at both the table and the column level.
To anchor this tutorial, here again is the content of the example1_model.yml file:
tutorial:
  header: true
  quoting: 3
  index: "unless excluded"
  description: "Data model for data transformation tutorial"
  tables:
    bronze_temperature:
      description: |
        Maximum daily temperature for US Zip Code Tabulation Areas
      columns:
        - tmmx:
            type: float
            description: Maximum temperature variable from the TerraClimate dataset in K
            reference: https://developers.google.com/earth-engine/datasets/catalog/IDAHO_EPSCOR_GRIDMET#bands
        - date:
            type: date
        - zcta:
            description: Zip Code for a Zip Code Tabulation Area
            type: int
      primary_key:
        - zcta
        - date
    silver_temperature:
      description: |
        Maximum daily temperature for US Zip Code Tabulation Areas, enriched and harmonized
      create:
        type: view
        from: bronze_temperature
      columns:
        - tmmx
        - date
        - zcta
        - temperature_in_C:
            type: float
            description: Temperature in Celsius
            source: (tmmx - 273.15)
        - temperature_in_F:
            type: float
            description: Temperature in Fahrenheit
            source: ((tmmx - 273.15)*9/5 + 32)
        - us_state:
            type: VARCHAR(2)
            description: US State
            source: "public.zip_to_state(EXTRACT(YEAR FROM date)::INT, zcta)"
        - city:
            type: VARCHAR(128)
            description: >
              Name of a representative city for the ZIP Code Tabulation Area (ZCTA);
              for ZCTAs spanning multiple cities, this is the city covering the largest
              portion of the area or population.
            source: "public.zip_to_city(EXTRACT(YEAR FROM date)::INT, zcta)"

    gold_temperature_by_state:
      description: |
        Temperature variations by US State
      create:
        type: materialized view
        from: silver_temperature
        group by:
          - us_state
          - date
      columns:
        - us_state
        - date
        - t_span:
            type: float
            description: Temperature variation in Celsius
            source: MAX(tmmx) - MIN(tmmx)
        - t_mean_in_C:
            type: float
            description: Mean Temperature in Celsius
            source: AVG(temperature_in_C)
        - t_mean_in_F:
            type: float
            description: Mean Temperature in Fahrenheit
            source: AVG(temperature_in_F)
      primary_key:
        - us_state
        - date
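The source expressions in the model are plain arithmetic on the Kelvin value tmmx. As a quick sanity check, they can be mirrored in Python. This is only an illustrative sketch: the function names and the sample readings are hypothetical and are not part of Dorieh or of the dataset.

```python
import math


def kelvin_to_celsius(tmmx_k: float) -> float:
    """Mirror of the model's temperature_in_C source: (tmmx - 273.15)."""
    return tmmx_k - 273.15


def kelvin_to_fahrenheit(tmmx_k: float) -> float:
    """Mirror of the model's temperature_in_F source: ((tmmx - 273.15)*9/5 + 32)."""
    return (tmmx_k - 273.15) * 9 / 5 + 32


# Hypothetical sample readings in K, not taken from the dataset.
readings_k = [290.0, 300.0]

print(round(kelvin_to_celsius(300.0), 2))     # 26.85
print(round(kelvin_to_fahrenheit(300.0), 2))  # 80.33

# t_span is computed as MAX(tmmx) - MIN(tmmx) on Kelvin values. A difference
# of two temperatures is the same in K and in Celsius, which is why the
# model can describe t_span as a variation in Celsius.
span_k = max(readings_k) - min(readings_k)
readings_c = [kelvin_to_celsius(k) for k in readings_k]
span_c = max(readings_c) - min(readings_c)
print(math.isclose(span_k, span_c))  # True
```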
The data dictionary and lineage tools extract and synthesize information from both your workflow and data model to provide:
A main table-level lineage diagram
Visualizes the sequence of transformations and dependencies among all tables in your pipeline (Bronze, Silver, Gold, etc.)
Can use a variety of graphical formats: png, gif, ps2, svg, cmapx, jpeg
Interactive SVG format: If SVG is selected, each table node is clickable, linking directly to its detailed documentation
Table documentation pages
Human-readable descriptions (drawn from YAML comments in your data model)
The SQL or DDL used to create the table
A summary of columns for the table, each with a link to its own detailed description
Column documentation pages
Human-readable description for the column
A column-level lineage diagram that shows exactly which upstream columns and tables contributed data to the current column
Interactive SVG: All elements are clickable, supporting rapid navigation
An index/glossary of all columns across all tables
Tracks column propagation and transformation through the pipeline—a powerful tool for auditing or code review, especially in complex medallion architectures where columns may be re-used or re-named across layers
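To make the idea of table-level and column-level lineage concrete, here is a minimal sketch that derives both from a hand-written Python mirror of part of example1_model.yml. It is not Dorieh's implementation: the MODEL dictionary, the function names, the naive identifier matching, and the "<table>.html" link pattern are all assumptions made for illustration. The only external fact relied on is that Graphviz turns a node's URL attribute into a hyperlink in SVG output.

```python
import re

# Trimmed, hand-written mirror of example1_model.yml (illustration only).
# Each column maps to its `source` expression, or None when it has none.
MODEL = {
    "bronze_temperature": {
        "from": None,
        "columns": {"tmmx": None, "date": None, "zcta": None},
    },
    "silver_temperature": {
        "from": "bronze_temperature",
        "columns": {
            "tmmx": None, "date": None, "zcta": None,
            "temperature_in_C": "(tmmx - 273.15)",
            "temperature_in_F": "((tmmx - 273.15)*9/5 + 32)",
            "us_state": "public.zip_to_state(EXTRACT(YEAR FROM date)::INT, zcta)",
        },
    },
    "gold_temperature_by_state": {
        "from": "silver_temperature",
        "columns": {
            "us_state": None, "date": None,
            "t_span": "MAX(tmmx) - MIN(tmmx)",
            "t_mean_in_C": "AVG(temperature_in_C)",
            "t_mean_in_F": "AVG(temperature_in_F)",
        },
    },
}


def table_edges(model):
    """Return (upstream, downstream) pairs, one per `create.from` link."""
    return [(t["from"], name) for name, t in model.items() if t["from"]]


def column_parents(model, table, column):
    """Return the upstream (table, column) pairs a column depends on.

    A column without `source` is treated as a pass-through of the same-named
    upstream column. Otherwise, identifiers found in the expression are
    matched against upstream column names (a naive heuristic, not a parser).
    """
    upstream = model[table]["from"]
    if upstream is None:
        return []
    source = model[table]["columns"][column]
    if source is None:
        return [(upstream, column)]
    tokens = set(re.findall(r"[A-Za-z_][A-Za-z_0-9]*", source))
    return [(upstream, c) for c in model[upstream]["columns"] if c in tokens]


def trace(model, table, column):
    """Walk column lineage back to the root tables, listing every ancestor."""
    ancestors = []
    for parent in column_parents(model, table, column):
        ancestors.append(parent)
        ancestors.extend(trace(model, *parent))
    return ancestors


def to_dot(edges, link=lambda table: f"{table}.html"):
    """Render table edges as Graphviz DOT; the link pattern is hypothetical."""
    lines = ["digraph lineage {"]
    for table in sorted({t for edge in edges for t in edge}):
        lines.append(f'  "{table}" [URL="{link(table)}"];')
    lines += [f'  "{a}" -> "{b}";' for a, b in edges]
    lines.append("}")
    return "\n".join(lines)


print(table_edges(MODEL))
print(trace(MODEL, "gold_temperature_by_state", "t_mean_in_F"))
print(to_dot(table_edges(MODEL)))
```

Tracing gold_temperature_by_state.t_mean_in_F in this sketch leads to silver_temperature.temperature_in_F and then to bronze_temperature.tmmx, which is the kind of chain the column-level diagrams visualize.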
Output Formats and Modes
Markdown (.md) files are generated by default; they can be browsed directly or easily converted to HTML for rich, interactive documentation.
Other export formats, such as YAML and Open Biological and Biomedical Ontologies (OBO), are supported for interoperability with downstream systems.
The tool supports two modes:
Standalone: Produce self-contained HTML (via Pandoc)—ideal for documentation, archiving, or sharing with external parties.
Sphinx: Seamless integration into team or lab-wide Sphinx documentation systems.
Running the tool
We will be using the Data Dictionary Generation tool.
Note
Here we assume you are running from a docs/ directory alongside example1_model.yml. This directory should have been created when walking through the documentation tutorial. If you used a different layout, adjust the paths accordingly.
Run the following commands to produce draft documentation:
cd docs
python -m dorieh.platform.dictionary.domain_dictionary ../example1_model.yml --fmt svg --lod min -o example1.dot --mode standalone
This command produces both Markdown and standalone HTML files that can be easily examined.
Important
If you plan to build documentation with Sphinx and MyST, instead run:
cd docs
python -m dorieh.platform.dictionary.domain_dictionary ../example1_model.yml --fmt svg --lod min -o example1.dot --mode sphinx
Hint
We use the --lod min option to generate lineage for derived columns only. See Data Dictionary Generation tool for details.
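The standalone and Sphinx invocations differ only in the value passed to --mode. If you script the documentation build, a small helper can assemble the argument list. The sketch below is an assumption-laden convenience wrapper, not part of Dorieh: the function name dictionary_command is hypothetical, and it uses only the module path and options that appear in the commands of this tutorial.

```python
import shlex
import sys


def dictionary_command(model_path, mode, fmt="svg", lod="min", output="example1.dot"):
    """Build the argument list for the dictionary tool (hypothetical helper).

    Only options shown in this tutorial are used: --fmt, --lod, -o, --mode.
    """
    if mode not in ("standalone", "sphinx"):
        raise ValueError(f"unsupported mode: {mode}")
    return [
        sys.executable, "-m", "dorieh.platform.dictionary.domain_dictionary",
        model_path,
        "--fmt", fmt,
        "--lod", lod,
        "-o", output,
        "--mode", mode,
    ]


cmd = dictionary_command("../example1_model.yml", "standalone")
print(shlex.join(cmd))

# To actually run it from the docs/ directory (requires Dorieh to be installed):
#   import subprocess
#   subprocess.run(cmd, check=True)
```

Passing the arguments as a list rather than a single shell string avoids quoting problems if the model path contains spaces.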
Exploring the Artifacts
Start with the high-level lineage DAG, where each node is a table. If you use Standalone (HTML) mode, open the generated example1.dot.html file in a browser.
Click a table to view its documentation, including all columns and their descriptions.
Click on a column name to access its detail page—showing a column-level lineage diagram for tracing value origins across the workflow.
All navigation is cross-linked for rapid provenance discovery.
Enriching the Data Dictionary
As mentioned above, the data dictionary tool automatically extracts any code (Python, SQL, DDL) that is used to create tables and columns.
For example, you can see the SQL/DDL block in the generated page for the table silver_temperature, or the compute code for the columns silver_temperature.temperature_in_F and gold_temperature_by_state.t_mean_in_F in the column lineage diagram.
To make the documentation more comprehensive, the data modeling language supports the following keys:
description: Verbose description
reference: URL with external documentation
For the following elements:
Additionally, description is supported for the Invalid Record element.
An example of using these keys is shown below:
columns:
  - tmmx:
      type: float
      description: Maximum temperature variable from the TerraClimate dataset in K
      reference: https://developers.google.com/earth-engine/datasets/catalog/IDAHO_EPSCOR_GRIDMET#bands
See how this is reflected in the generated documentation:
Note
| Table | |
| Qualified name | bronze_temperature.tmmx |
| Datatype | float |
| Reference | https://developers.google.com/earth-engine/datasets/catalog/IDAHO_EPSCOR_GRIDMET#bands |

Maximum temperature variable from the TerraClimate dataset in K
And in the lineage graph

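Since the generated pages are only as informative as the description and reference keys in the model, it can help to check which columns still lack them before regenerating the dictionary. The following is a minimal sketch of such an audit over a hand-written Python mirror of the bronze_temperature columns. The function name missing_keys and the rule of skipping columns listed by name only are assumptions of this sketch, not Dorieh features.

```python
# Hand-written mirror of the bronze_temperature columns from example1_model.yml.
COLUMNS = {
    "bronze_temperature": [
        {"tmmx": {
            "type": "float",
            "description": "Maximum temperature variable from the TerraClimate dataset in K",
            "reference": "https://developers.google.com/earth-engine/datasets/catalog/IDAHO_EPSCOR_GRIDMET#bands",
        }},
        {"date": {"type": "date"}},
        {"zcta": {"type": "int",
                  "description": "Zip Code for a Zip Code Tabulation Area"}},
    ],
}


def missing_keys(tables, keys=("description", "reference")):
    """List (table, column, key) for every documentation key that is absent."""
    gaps = []
    for table, columns in tables.items():
        for entry in columns:
            if isinstance(entry, str):
                # Assumption: a column listed by name only is a pass-through
                # and is documented on the upstream table.
                continue
            (name, spec), = entry.items()
            gaps += [(table, name, key) for key in keys if key not in spec]
    return gaps


for table, column, key in missing_keys(COLUMNS):
    print(f"{table}.{column}: no {key}")
```

For the mirror above, the audit reports that date has neither key and that zcta has no reference, while tmmx is fully documented.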
Lineage Diagram for Medicare data
This small climate example uses the same tooling as our Medicare pipeline. For a large, production-scale use case, including many more tables and complex transformations, see the Medicare Data Dictionary.