Example: Medicare Processing Pipeline (with PostgreSQL)
See also
Medicare: Building a Data Warehouse from ResDac Files Using HLL for Approximate Count Distinct Using Dorieh with PostgreSQL Backend
This example demonstrates a full Dorieh data processing pipeline run against a publicly available synthetic Medicare-like dataset. Because the data are synthetic and published openly on Zenodo, no institutional data access agreement is required to follow this example. It covers:
Downloading synthetic sample data from Zenodo
Running the Medicare CWL workflow with Toil against a local PostgreSQL instance
(Optional) Exploring the results in a pre-built Apache Superset dashboard
The source files for this example live in
examples/with-postgres/medicare/.
Prerequisites
Before starting, make sure you have:
PostgreSQL running and a
database.iniready — follow the steps in Using Dorieh with PostgreSQL Backend. If you want a quick non-production setup, use the bundled Docker Compose file:cd $WORKDIR/dorieh/docker/pg-hll/ docker compose up -d
Toil installed and tested — see Installing Dorieh and testing the installation.
The
doriehrepository cloned under$WORKDIR:git clone https://github.com/ForomePlatform/dorieh.git $WORKDIR/dorieh
Then move into the Medicare example directory:
cd $WORKDIR/dorieh/examples/with-postgres/medicare
Step 1 — Download synthetic sample data
The synthetic Medicare-like dataset is hosted on Zenodo and is publicly available — no data use agreement or institutional access is required. It:
Mimics the raw fixed-width files (
.dat) delivered by ResDACReproduces the File Transfer Summary (FTS) layout metadata
Contains no real PHI / PII — it is safe for testing, demonstrations, and sharing
Download and unpack it:
mkdir -p data
pushd data
curl -fLo medicare-synthetic-database.zip \
'https://zenodo.org/records/18915558/files/medicare-synthetic-database-v1.zip?download=1'
unzip medicare-synthetic-database.zip
popd
The extracted directory tree follows the layout expected by the ingestion
pipeline — one sub-directory per year, each containing .fts metadata files
and the corresponding .dat data files.
Step 2 — Run the Medicare processing pipeline
Activate your Toil virtual environment:
source $TOIL_VENV/bin/activate
Run the Medicare CWL workflow:
toil-cwl-runner \
--jobStore j1 \
--retryCount 1 \
--cleanWorkDir never \
--outdir outputs \
--workDir . \
https://raw.githubusercontent.com/ForomePlatform/dorieh/refs/heads/main/src/workflows/medicare.cwl \
--input data/ \
--database https://raw.githubusercontent.com/ForomePlatform/dorieh/refs/heads/main/examples/with-postgres/database.ini \
--connection_name dorieh
After the workflow completes:
Processed data is stored in your PostgreSQL backend under the
medicareschema (beneficiaries, enrollments, admissions, and QC tables).Any additional output files are written to the
outputs/directory.
For a detailed description of the pipeline steps, see Medicare: Building a Data Warehouse from ResDac Files.
Step 3 (Optional) — Explore results in Apache Superset
A pre-built Medicare Quality Control Superset dashboard is included with the example. The steps below show how to launch PostgreSQL together with Superset using Docker Compose and import the dashboard.
3.1 — Stop the simple PostgreSQL container (if running)
If you started the basic docker-compose.yml stack earlier, bring it down
first:
cd $WORKDIR/dorieh/docker/pg-hll/
docker compose down
3.2 — Start PostgreSQL + Superset
From the same directory, start the extended stack:
cd $WORKDIR/dorieh/docker/pg-hll/
docker compose -f docker-compose-superset.yml up --build
# Or, to run detached:
# docker compose -f docker-compose-superset.yml up -d
This brings up:
A PostgreSQL instance (with the HLL extension)
The Superset web UI
Superset background worker services
3.3 — Log in to Superset
Once the services are running, open http://localhost:8088/ in your browser.
Default credentials (defined in docker-compose-superset.yml):
Field |
Value |
|---|---|
Username |
|
Password |
|
If initialization did not complete and you cannot log in, create the admin user manually:
cd $WORKDIR/dorieh/docker/pg-hll/
docker compose exec superset superset fab create-admin \
--username admin \
--firstname Admin \
--lastname User \
--email admin@example.com \
--password admin
3.4 — Add the Dorieh database connection
In the Superset UI:
Open Settings → Database Connections (under Data).
Click + DATABASE and choose PostgreSQL.
At the bottom of the dialog click “Connect this database with a SQL Alchemy URL string instead”.
Fill in:
Display Name:
DORIEHSQLAlchemy URI:
postgresql+psycopg2://dorieh:dorieh_secret@postgres:5432/dorieh
Click TEST CONNECTION — if it succeeds, click CONNECT.
3.5 — Import the Medicare QC dashboard
Back in your terminal, run the helper script:
cd $WORKDIR/dorieh/docker/pg-hll/
./import_superset_dashboard.sh
The script copies the dashboard definition into the Superset container and imports it. Internally it runs:
docker compose cp \
../../examples/with-postgres/medicare/superset/MedicareQCDashboard.json \
superset:/tmp/MedicareQCDashboard.json
docker compose exec superset superset import_dashboards \
--path /tmp/MedicareQCDashboard.json --username admin
docker compose -f docker-compose-superset.yml restart superset superset-worker
3.6 — Explore the dashboard
In the Superset UI:
Click the Dashboards tab.
Open “Medicare QC (Clean)”.
Use the available charts and filters to examine data quality across enrollments and admissions for the synthetic dataset.
The dashboard surfaces the QC metrics described in Creating QC Tables, including:
Percent of beneficiaries with consistent date-of-birth / date-of-death records
Percent of valid admission records (passed primary-key, foreign-key, and duplicate checks)
Approximate distinct-beneficiary counts computed with HLL sketches
What the pipeline produces
After a successful run the following tables and views are available in the
medicare schema of your PostgreSQL database:
Object |
Type |
Description |
|---|---|---|
|
View |
Federated patient summary across all raw years |
|
Table |
One row per unique beneficiary (deduplicated) |
|
Table |
Yearly enrollment records per beneficiary/state |
|
Table |
Validated inpatient admission records |
|
Table |
Records that failed validation (with reason) |
|
Materialized View |
Aggregate QC dimensions for enrollments |
|
Materialized View |
Aggregate QC dimensions for admissions |