Example: Medicare Processing Pipeline (with PostgreSQL)

This example demonstrates a full Dorieh data processing pipeline run against a publicly available synthetic Medicare-like dataset. Because the data are synthetic and published openly on Zenodo, no institutional data access agreement is required to follow this example. It covers:

  • Downloading synthetic sample data from Zenodo

  • Running the Medicare CWL workflow with Toil against a local PostgreSQL instance

  • (Optional) Exploring the results in a pre-built Apache Superset dashboard

The source files for this example live in examples/with-postgres/medicare/.


Prerequisites

Before starting, make sure you have:

  1. PostgreSQL running and a database.ini ready — follow the steps in Using Dorieh with PostgreSQL Backend. If you want a quick non-production setup, use the bundled Docker Compose file:

    cd $WORKDIR/dorieh/docker/pg-hll/
    docker compose up -d
    
  2. Toil installed and tested — see Installing Dorieh and testing the installation.

  3. The dorieh repository cloned under $WORKDIR:

    git clone https://github.com/ForomePlatform/dorieh.git $WORKDIR/dorieh
    

Then move into the Medicare example directory:

cd $WORKDIR/dorieh/examples/with-postgres/medicare

Step 1 — Download synthetic sample data

The synthetic Medicare-like dataset is hosted on Zenodo and is publicly available — no data use agreement or institutional access is required. It:

  • Mimics the raw fixed-width files (.dat) delivered by ResDAC

  • Reproduces the File Transfer Summary (FTS) layout metadata

  • Contains no real PHI / PII — it is safe for testing, demonstrations, and sharing

Download and unpack it:

mkdir -p data
pushd data

curl -fLo medicare-synthetic-database.zip \
  'https://zenodo.org/records/18915558/files/medicare-synthetic-database-v1.zip?download=1'

unzip medicare-synthetic-database.zip

popd

The extracted directory tree follows the layout expected by the ingestion pipeline — one sub-directory per year, each containing .fts metadata files and the corresponding .dat data files.


Step 2 — Run the Medicare processing pipeline

Activate your Toil virtual environment:

source $TOIL_VENV/bin/activate

Run the Medicare CWL workflow:

toil-cwl-runner \
  --jobStore j1 \
  --retryCount 1 \
  --cleanWorkDir never \
  --outdir outputs \
  --workDir . \
  https://raw.githubusercontent.com/ForomePlatform/dorieh/refs/heads/main/src/workflows/medicare.cwl \
  --input data/ \
  --database https://raw.githubusercontent.com/ForomePlatform/dorieh/refs/heads/main/examples/with-postgres/database.ini \
  --connection_name dorieh

After the workflow completes:

  • Processed data is stored in your PostgreSQL backend under the medicare schema (beneficiaries, enrollments, admissions, and QC tables).

  • Any additional output files are written to the outputs/ directory.

For a detailed description of the pipeline steps, see Medicare: Building a Data Warehouse from ResDac Files.


Step 3 (Optional) — Explore results in Apache Superset

A pre-built Medicare Quality Control Superset dashboard is included with the example. The steps below show how to launch PostgreSQL together with Superset using Docker Compose and import the dashboard.

3.1 — Stop the simple PostgreSQL container (if running)

If you started the basic docker-compose.yml stack earlier, bring it down first:

cd $WORKDIR/dorieh/docker/pg-hll/
docker compose down

3.2 — Start PostgreSQL + Superset

From the same directory, start the extended stack:

cd $WORKDIR/dorieh/docker/pg-hll/
docker compose -f docker-compose-superset.yml up --build
# Or, to run detached:
# docker compose -f docker-compose-superset.yml up -d

This brings up:

  • A PostgreSQL instance (with the HLL extension)

  • The Superset web UI

  • Superset background worker services

3.3 — Log in to Superset

Once the services are running, open http://localhost:8088/ in your browser.

Default credentials (defined in docker-compose-superset.yml):

Field

Value

Username

admin

Password

admin

If initialization did not complete and you cannot log in, create the admin user manually:

cd $WORKDIR/dorieh/docker/pg-hll/
docker compose exec superset superset fab create-admin \
  --username admin \
  --firstname Admin \
  --lastname User \
  --email admin@example.com \
  --password admin

3.4 — Add the Dorieh database connection

In the Superset UI:

  1. Open Settings → Database Connections (under Data).

  2. Click + DATABASE and choose PostgreSQL.

  3. At the bottom of the dialog click “Connect this database with a SQL Alchemy URL string instead”.

  4. Fill in:

    • Display Name: DORIEH

    • SQLAlchemy URI: postgresql+psycopg2://dorieh:dorieh_secret@postgres:5432/dorieh

  5. Click TEST CONNECTION — if it succeeds, click CONNECT.

3.5 — Import the Medicare QC dashboard

Back in your terminal, run the helper script:

cd $WORKDIR/dorieh/docker/pg-hll/
./import_superset_dashboard.sh

The script copies the dashboard definition into the Superset container and imports it. Internally it runs:

docker compose cp \
  ../../examples/with-postgres/medicare/superset/MedicareQCDashboard.json \
  superset:/tmp/MedicareQCDashboard.json
docker compose exec superset superset import_dashboards \
  --path /tmp/MedicareQCDashboard.json --username admin
docker compose -f docker-compose-superset.yml restart superset superset-worker

3.6 — Explore the dashboard

In the Superset UI:

  1. Click the Dashboards tab.

  2. Open “Medicare QC (Clean)”.

  3. Use the available charts and filters to examine data quality across enrollments and admissions for the synthetic dataset.

The dashboard surfaces the QC metrics described in Creating QC Tables, including:

  • Percent of beneficiaries with consistent date-of-birth / date-of-death records

  • Percent of valid admission records (passed primary-key, foreign-key, and duplicate checks)

  • Approximate distinct-beneficiary counts computed with HLL sketches


What the pipeline produces

After a successful run the following tables and views are available in the medicare schema of your PostgreSQL database:

Object

Type

Description

medicare.ps

View

Federated patient summary across all raw years

medicare.beneficiaries

Table

One row per unique beneficiary (deduplicated)

medicare.enrollments

Table

Yearly enrollment records per beneficiary/state

medicare.admissions

Table

Validated inpatient admission records

medicare_audit.admissions

Table

Records that failed validation (with reason)

medicare.qc_enrollments

Materialized View

Aggregate QC dimensions for enrollments

medicare.qc_admissions

Materialized View

Aggregate QC dimensions for admissions