Tutorial: Constructing data dictionaries and lineage graphs

This tutorial demonstrates how to construct a data dictionary and lineage graph for the climate analysis workflow built in the Building a workflow tutorial.

What we can generate 

Dorieh provides an integrated dictionary and lineage utility that analyzes your workflow and domain definition file (such as example1_model.yml) to automatically generate a full suite of human- and machine-readable artifacts:

a comprehensive data dictionary, and
graphical lineage diagrams at both the table and the column level.

To anchor this tutorial, here is the content of the example1_model.yml file:

tutorial:
  header: true
  quoting: 3
  index: "unless excluded"
  description: "Data model for data transformation tutorial"
  tables:
    bronze_temperature:
      description: |
        Maximum daily temperature for US Zip Code Tabulation Areas
      columns:
        - tmmx:
            type: float
            description: Maximum temperature variable from the TerraClimate dataset in K
            reference: https://developers.google.com/earth-engine/datasets/catalog/IDAHO_EPSCOR_GRIDMET#bands
        - date:
            type: date
        - zcta:
            description: Zip Code for a Zip Code Tabulation Area
            type: int
      primary_key:
        - zcta
        - date
    silver_temperature:
      description: |
        Maximum daily temperature for US Zip Code Tabulation Areas, enriched and harmonized
      create:
        type: view
        from: bronze_temperature
      columns:
        - tmmx
        - date
        - zcta
        - temperature_in_C:
            type: float
            description: Temperature in Celsius
            source: (tmmx - 273.15)
        - temperature_in_F:
            type: float
            description: Temperature in Fahrenheit
            source: ((tmmx - 273.15)*9/5 + 32)
        - us_state:
            type: VARCHAR(2)
            description: US State
            source:  "public.zip_to_state(EXTRACT(YEAR FROM date)::INT, zcta)"
        - city:
            type: VARCHAR(128)
            description: >
              Name of a representative city for the ZIP Code Tabulation Area (ZCTA); 
              for ZCTAs spanning multiple cities, this is the city covering the largest 
              portion of the area or population.
            source:  "public.zip_to_city(EXTRACT(YEAR FROM date)::INT, zcta)"

    gold_temperature_by_state:
      description: |
        Temperature variations by US State
      create:
        type: materialized view
        from: silver_temperature
        group by:
          - us_state
          - date
      columns:
        - us_state
        - date
        - t_span:
            type: float
            description: Temperature variation in Celsius
            source: MAX(tmmx) - MIN(tmmx)
        - t_mean_in_C:
            type: float
            description: Mean Temperature in Celsius
            source: AVG(temperature_in_C)
        - t_mean_in_F:
            type: float
            description: Mean Temperature in Fahrenheit
            source: AVG(temperature_in_F)
      primary_key:
        - us_state
        - date

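The `create.from` keys in the model above already imply the table-level lineage. As a minimal sketch (not Dorieh's implementation), the snippet below copies the three table names and their parents by hand into a hypothetical `TABLE_PARENTS` dictionary and renders the resulting edges as Graphviz DOT text. The helper names `table_edges` and `to_dot` are illustrative assumptions.

```python
# Hand-copied subset of example1_model.yml: each table mapped to the
# table named in its `create.from` key (None for the source table).
# TABLE_PARENTS is a hypothetical name used only for this sketch.
TABLE_PARENTS = {
    "bronze_temperature": None,
    "silver_temperature": "bronze_temperature",
    "gold_temperature_by_state": "silver_temperature",
}


def table_edges(parents):
    """Return (upstream, downstream) pairs for every derived table."""
    return [(parent, table) for table, parent in parents.items() if parent]


def to_dot(edges, name="lineage"):
    """Render the edges as a Graphviz DOT digraph (plain text)."""
    lines = [f"digraph {name} {{", "  rankdir=LR;"]
    lines += [f'  "{src}" -> "{dst}";' for src, dst in edges]
    lines.append("}")
    return "\n".join(lines)


edges = table_edges(TABLE_PARENTS)
dot_text = to_dot(edges)
print(dot_text)
```

The printed text is ordinary DOT and can be rendered with Graphviz if it is installed. How Dorieh builds its own diagrams may differ.
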
The data dictionary and lineage tools extract and synthesize information from both your workflow and data model to provide:

A main table-level lineage diagram
- Visualizes the sequence of transformations and dependencies among all tables in your pipeline (Bronze, Silver, Gold, etc.)
- Can use a variety of graphical formats: png, gif, ps2, svg, cmapx, jpeg
- Interactive SVG format: If SVG is selected, each table node is clickable, linking directly to its detailed documentation
Table documentation pages
- Human-readable descriptions (drawn from the description keys in your YAML data model)
- The SQL or DDL used to create the table
- A summary of columns for the table, each with a link to its own detailed description
Column documentation pages
- Human-readable description for the column
- A column-level lineage diagram that shows exactly which upstream columns and tables contributed data to the current column
- Interactive SVG: All elements are clickable, supporting rapid navigation
An index/glossary of all columns across all tables
- Tracks column propagation and transformation through the pipeline—a powerful tool for auditing or code review, especially in complex medallion architectures where columns may be re-used or re-named across layers
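
To make the idea of column-level lineage concrete, here is a small sketch that traces a column back through the tutorial model. It is an illustration under stated assumptions, not Dorieh's algorithm: the `MODEL` dictionary is copied by hand from example1_model.yml, and upstream columns are found by naively matching identifiers in each `source` expression against the parent table's column names. The names `MODEL`, `direct_upstream` and `trace` are hypothetical.

```python
import re

# Hand-copied subset of example1_model.yml. Each table maps to
# (parent table, {column: source expression}); a source of None means
# the column is carried over from the parent unchanged.
MODEL = {
    "bronze_temperature": (None, {"tmmx": None, "date": None, "zcta": None}),
    "silver_temperature": ("bronze_temperature", {
        "tmmx": None, "date": None, "zcta": None,
        "temperature_in_C": "(tmmx - 273.15)",
        "temperature_in_F": "((tmmx - 273.15)*9/5 + 32)",
        "us_state": "public.zip_to_state(EXTRACT(YEAR FROM date)::INT, zcta)",
        "city": "public.zip_to_city(EXTRACT(YEAR FROM date)::INT, zcta)",
    }),
    "gold_temperature_by_state": ("silver_temperature", {
        "us_state": None, "date": None,
        "t_span": "MAX(tmmx) - MIN(tmmx)",
        "t_mean_in_C": "AVG(temperature_in_C)",
        "t_mean_in_F": "AVG(temperature_in_F)",
    }),
}

IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def direct_upstream(table, column):
    """Return the (table, column) pairs that feed `column` directly."""
    parent, columns = MODEL[table]
    if parent is None:
        return []  # source table: nothing upstream
    source = columns[column]
    if source is None:
        return [(parent, column)]  # passed through unchanged
    parent_columns = MODEL[parent][1]
    # Naive assumption: any identifier that is also a parent column name
    # is an input. A real SQL parser would be more robust.
    names = [n for n in IDENTIFIER.findall(source) if n in parent_columns]
    return [(parent, n) for n in dict.fromkeys(names)]  # dedupe, keep order


def trace(table, column):
    """Return all upstream (table, column) pairs, nearest first."""
    result = []
    queue = direct_upstream(table, column)
    while queue:
        node = queue.pop(0)
        if node not in result:
            result.append(node)
            queue.extend(direct_upstream(*node))
    return result


print(trace("gold_temperature_by_state", "t_mean_in_F"))
```

For `gold_temperature_by_state.t_mean_in_F` the trace passes through `silver_temperature.temperature_in_F` and ends at `bronze_temperature.tmmx`, which is the chain a column-level lineage diagram for that column would be expected to show.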

Output Formats and Modes 

Markdown (.md) files are generated by default; they can be browsed directly or easily converted to HTML for rich, interactive documentation.

Other export formats, such as YAML and Open Biological and Biomedical Ontologies (OBO), are supported for interoperability with downstream systems.

The tool supports two modes:

Standalone: Produce self-contained HTML (via Pandoc)—ideal for documentation, archiving, or sharing with external parties.
Sphinx: Seamless integration into team or lab-wide Sphinx documentation systems.
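
If you want to convert an individual Markdown page to HTML by hand, a plain Pandoc call is one option. The sketch below only builds and prints the command line; the file name `silver_temperature.md` and the helper `pandoc_command` are assumptions made for illustration, since the exact names of the generated pages depend on your model.

```python
import shlex
from pathlib import Path


def pandoc_command(markdown_file, html_file=None):
    """Build a pandoc command that converts one Markdown file to HTML."""
    md = Path(markdown_file)
    # Default output: same name with the last suffix replaced by .html
    html = Path(html_file) if html_file else md.with_suffix(".html")
    # --standalone asks pandoc for a complete HTML document, not a fragment
    return ["pandoc", str(md), "--standalone", "--output", str(html)]


# Hypothetical file name; substitute a page that the tool actually produced.
cmd = pandoc_command("silver_temperature.md")
print(shlex.join(cmd))
# To run it, pass the list to subprocess.run(cmd, check=True)
# on a machine where pandoc is installed.
```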

Running the tool 

We will be using the Data Dictionary Generation tool.

Note

Here we assume you are running from a docs/ directory alongside example1_model.yml. This directory should have been created when walking through the documentation tutorial. If you used a different layout, adjust the paths accordingly.

Run the following commands to produce draft documentation:

cd docs
python -m dorieh.platform.dictionary.domain_dictionary ../example1_model.yml --fmt svg --lod min -o example1.dot --mode standalone

This command produces both Markdown and standalone HTML files that can be easily examined.

Important

If you plan to build documentation with Sphinx and MyST, instead run:

cd docs
python -m dorieh.platform.dictionary.domain_dictionary ../example1_model.yml --fmt svg --lod min -o example1.dot --mode sphinx

Hint

We use the --lod min option to generate lineage for derived columns only. See Data Dictionary Generation tool for details.
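
If you prefer to drive the tool from a build script, the same commands can be assembled in Python. The following is a minimal sketch: the module path and the flags (`--fmt`, `--lod`, `-o`, `--mode`) are taken from the commands above, while the wrapper function `dictionary_command` is a hypothetical helper. It only builds and prints the command so that it can be inspected before running.

```python
import shlex
import sys

# Module path as used in the commands shown in this tutorial.
MODULE = "dorieh.platform.dictionary.domain_dictionary"


def dictionary_command(model, output, mode="standalone", fmt="svg", lod="min"):
    """Build the argument list for the data dictionary tool."""
    if mode not in ("standalone", "sphinx"):
        raise ValueError(f"unsupported mode: {mode!r}")
    return [
        sys.executable, "-m", MODULE, model,
        "--fmt", fmt, "--lod", lod,
        "-o", output, "--mode", mode,
    ]


cmd = dictionary_command("../example1_model.yml", "example1.dot", mode="sphinx")
print(shlex.join(cmd))
# To execute it from the docs/ directory (requires Dorieh to be installed):
#   import subprocess
#   subprocess.run(cmd, check=True, cwd="docs")
```

Using `sys.executable` keeps the call inside the Python environment that is running the script, which is usually the one where Dorieh is installed.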

Exploring the Artifacts 

Start with the high-level lineage DAG, where each node is a table. If you use Standalone (HTML) mode, open the generated example1.dot.html file in a browser.

Click a table to view its documentation, including all columns and their descriptions.

Click on a column name to access its detail page—showing a column-level lineage diagram for tracing value origins across the workflow.

All navigation is cross-linked for rapid provenance discovery.
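
As a small convenience, the entry page can be opened from Python with the standard `webbrowser` module. This sketch assumes the standalone output was written to docs/example1.dot.html, as in the commands above; the helper names `report_uri` and `open_report` are hypothetical.

```python
import webbrowser
from pathlib import Path


def report_uri(docs_dir="docs", name="example1.dot.html"):
    """Return a file:// URI for the generated entry page."""
    # resolve() makes the path absolute, which as_uri() requires.
    return (Path(docs_dir) / name).resolve().as_uri()


def open_report(docs_dir="docs", name="example1.dot.html"):
    """Open the entry page in the default browser."""
    return webbrowser.open(report_uri(docs_dir, name))


uri = report_uri()
print(uri)
# Call open_report() to launch the browser once the file exists.
```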

Enriching the Data Dictionary 

As mentioned above, the data dictionary tool automatically extracts any code (Python, SQL, DDL) that is used to create tables and columns.

For example, you can see the SQL/DDL block in the generated page for the table silver_temperature, or the compute code for the columns silver_temperature.temperature_in_F and gold_temperature_by_state.t_mean_in_F in their column lineage diagrams.

To make the documentation more comprehensive, the data modeling language supports the following keys:

description: Verbose description
reference: URL with external documentation

For the following elements:

Additionally, description is supported for the Invalid Record element.

An example of using these keys is shown below:

      columns:
        - tmmx:
            type: float
            description: Maximum temperature variable from the TerraClimate dataset in K
            reference: https://developers.google.com/earth-engine/datasets/catalog/IDAHO_EPSCOR_GRIDMET#bands

See how this is reflected in the generated documentation:

Note


Table	bronze_temperature
Qualified name	bronze_temperature.tmmx
Datatype	float
Reference	https://developers.google.com/earth-engine/datasets/catalog/IDAHO_EPSCOR_GRIDMET#bands

Maximum temperature variable from the TerraClimate dataset in K

And in the lineage graph
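
Before regenerating the dictionary, it can help to check which columns still lack these keys. The sketch below is an illustration only: `BRONZE_COLUMNS` is a hand-copied version of the bronze_temperature column list, written in the list-of-single-key-dictionaries shape that a YAML loader such as PyYAML's `yaml.safe_load` would typically return for it, and `missing_key` is a hypothetical helper.

```python
# Hand-copied columns of bronze_temperature from example1_model.yml.
BRONZE_COLUMNS = [
    {"tmmx": {
        "type": "float",
        "description": "Maximum temperature variable from the TerraClimate dataset in K",
        "reference": "https://developers.google.com/earth-engine/datasets/catalog/IDAHO_EPSCOR_GRIDMET#bands",
    }},
    {"date": {"type": "date"}},
    {"zcta": {"description": "Zip Code for a Zip Code Tabulation Area", "type": "int"}},
]


def missing_key(columns, key):
    """Return the names of columns whose definition lacks `key`."""
    missing = []
    for entry in columns:
        if isinstance(entry, str):
            # Bare names (e.g. "- tmmx" in silver_temperature) carry no
            # definition of their own; this sketch assumes they are
            # documented in the upstream table and skips them.
            continue
        for name, spec in entry.items():
            if key not in (spec or {}):
                missing.append(name)
    return missing


print(missing_key(BRONZE_COLUMNS, "description"))  # ['date']
print(missing_key(BRONZE_COLUMNS, "reference"))    # ['date', 'zcta']
```

In this model, only `date` has no description, so it is the first candidate for enrichment.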

Lineage Diagram for Medicare data 

This small climate example uses the same tooling as our Medicare pipeline. For a large, production-scale use case, including many more tables and complex transformations, see the Medicare Data Dictionary.
