# Example: Medicare Processing Pipeline (with PostgreSQL)

```{contents}
---
local:
---
```

```{seealso}
[Medicare: Building a Data Warehouse from ResDac Files](Medicare.md)
[Using HLL for Approximate Count Distinct](UsingHLL.md)
[Using Dorieh with PostgreSQL Backend](https://github.com/ForomePlatform/dorieh/tree/main/examples/with-postgres)
```

This example demonstrates a full Dorieh data processing pipeline run against a
**publicly available synthetic** Medicare-like dataset.  Because the data are
synthetic and published openly on [Zenodo](https://zenodo.org/records/18915558),
no institutional data access agreement is required to follow this example.
It covers:

* Downloading synthetic sample data from Zenodo
* Running the Medicare CWL workflow with [Toil](https://toil.readthedocs.io/en/latest/)
  against a local PostgreSQL instance
* (Optional) Exploring the results in a pre-built
  [Apache Superset](https://superset.apache.org/) dashboard

The source files for this example live in
[`examples/with-postgres/medicare/`](https://github.com/ForomePlatform/dorieh/tree/main/examples/with-postgres/medicare).

---

## Prerequisites

Before starting, make sure you have:

1. **PostgreSQL running** and a `database.ini` ready — follow the steps in
   [Using Dorieh with PostgreSQL Backend](https://github.com/ForomePlatform/dorieh/tree/main/examples/with-postgres).
   If you want a quick non-production setup, use the bundled Docker Compose file:

   ```bash
   cd $WORKDIR/dorieh/docker/pg-hll/
   docker compose up -d
   ```

2. **Toil installed** and tested — see
   [Installing Dorieh and testing the installation](https://foromeplatform.github.io/dorieh/examples.html#installing-dorieh-and-testing-the-installation).

3. The **`dorieh` repository cloned** under `$WORKDIR`:

   ```bash
   git clone https://github.com/ForomePlatform/dorieh.git $WORKDIR/dorieh
   ```

Then move into the Medicare example directory:

```bash
cd $WORKDIR/dorieh/examples/with-postgres/medicare
```

---

## Step 1 — Download synthetic sample data

The synthetic Medicare-like dataset is hosted on
[Zenodo](https://zenodo.org/records/18915558) and is **publicly available** —
no data use agreement or institutional access is required.  It:

* Mimics the raw fixed-width files (`.dat`) delivered by ResDAC
* Reproduces the File Transfer Summary (FTS) layout metadata
* Contains **no real PHI / PII** — it is safe for testing, demonstrations,
  and sharing

Download and unpack it:

```bash
mkdir -p data
pushd data

curl -fLo medicare-synthetic-database.zip \
  'https://zenodo.org/records/18915558/files/medicare-synthetic-database-v1.zip?download=1'

unzip medicare-synthetic-database.zip

popd
```

The extracted directory tree follows the layout expected by the ingestion
pipeline — one sub-directory per year, each containing `.fts` metadata files
and the corresponding `.dat` data files.

---

## Step 2 — Run the Medicare processing pipeline

Activate your Toil virtual environment:

```bash
source $TOIL_VENV/bin/activate
```

Run the Medicare CWL workflow:

```bash
toil-cwl-runner \
  --jobStore j1 \
  --retryCount 1 \
  --cleanWorkDir never \
  --outdir outputs \
  --workDir . \
  https://raw.githubusercontent.com/ForomePlatform/dorieh/refs/heads/main/src/workflows/medicare.cwl \
  --input data/ \
  --database https://raw.githubusercontent.com/ForomePlatform/dorieh/refs/heads/main/examples/with-postgres/database.ini \
  --connection_name dorieh
```

After the workflow completes:

* Processed data is stored in your PostgreSQL backend under the `medicare`
  schema (beneficiaries, enrollments, admissions, and QC tables).
* Any additional output files are written to the `outputs/` directory.

For a detailed description of the pipeline steps, see
[Medicare: Building a Data Warehouse from ResDac Files](Medicare.md).

---

## Step 3 (Optional) — Explore results in Apache Superset

A pre-built **Medicare Quality Control** Superset dashboard is included with
the example.  The steps below show how to launch PostgreSQL together with
Superset using Docker Compose and import the dashboard.

### 3.1 — Stop the simple PostgreSQL container (if running)

If you started the basic `docker-compose.yml` stack earlier, bring it down
first:

```bash
cd $WORKDIR/dorieh/docker/pg-hll/
docker compose down
```

### 3.2 — Start PostgreSQL + Superset

From the same directory, start the extended stack:

```bash
cd $WORKDIR/dorieh/docker/pg-hll/
docker compose -f docker-compose-superset.yml up --build
# Or, to run detached:
# docker compose -f docker-compose-superset.yml up -d
```

This brings up:

* A PostgreSQL instance (with the HLL extension)
* The Superset web UI
* Superset background worker services

### 3.3 — Log in to Superset

Once the services are running, open `http://localhost:8088/` in your browser.

Default credentials (defined in `docker-compose-superset.yml`):

| Field    | Value   |
|----------|---------|
| Username | `admin` |
| Password | `admin` |

If initialization did not complete and you cannot log in, create the admin
user manually:

```bash
cd $WORKDIR/dorieh/docker/pg-hll/
docker compose exec superset superset fab create-admin \
  --username admin \
  --firstname Admin \
  --lastname User \
  --email admin@example.com \
  --password admin
```

### 3.4 — Add the Dorieh database connection

In the Superset UI:

1. Open **Settings → Database Connections** (under *Data*).
2. Click **+ DATABASE** and choose **PostgreSQL**.
3. At the bottom of the dialog click
   **"Connect this database with a SQL Alchemy URL string instead"**.
4. Fill in:
   * **Display Name:** `DORIEH`
   * **SQLAlchemy URI:** `postgresql+psycopg2://dorieh:dorieh_secret@postgres:5432/dorieh`
5. Click **TEST CONNECTION** — if it succeeds, click **CONNECT**.

### 3.5 — Import the Medicare QC dashboard

Back in your terminal, run the helper script:

```bash
cd $WORKDIR/dorieh/docker/pg-hll/
./import_superset_dashboard.sh
```

The script copies the dashboard definition into the Superset container and
imports it.  Internally it runs:

```bash
docker compose cp \
  ../../examples/with-postgres/medicare/superset/MedicareQCDashboard.json \
  superset:/tmp/MedicareQCDashboard.json
docker compose exec superset superset import_dashboards \
  --path /tmp/MedicareQCDashboard.json --username admin
docker compose -f docker-compose-superset.yml restart superset superset-worker
```

### 3.6 — Explore the dashboard

In the Superset UI:

1. Click the **Dashboards** tab.
2. Open **"Medicare QC (Clean)"**.
3. Use the available charts and filters to examine data quality across
   enrollments and admissions for the synthetic dataset.

The dashboard surfaces the QC metrics described in
[Creating QC Tables](Medicare.md#creating-qc-tables), including:

* Percent of beneficiaries with consistent date-of-birth / date-of-death
  records
* Percent of valid admission records (passed primary-key, foreign-key, and
  duplicate checks)
* Approximate distinct-beneficiary counts computed with
  [HLL sketches](UsingHLL.md)

---

## What the pipeline produces

After a successful run the following tables and views are available in the
`medicare` schema of your PostgreSQL database:

| Object                    | Type              | Description                                      |
|---------------------------|-------------------|--------------------------------------------------|
| `medicare.ps`             | View              | Federated patient summary across all raw years   |
| `medicare.beneficiaries`  | Table             | One row per unique beneficiary (deduplicated)    |
| `medicare.enrollments`    | Table             | Yearly enrollment records per beneficiary/state  |
| `medicare.admissions`     | Table             | Validated inpatient admission records            |
| `medicare_audit.admissions` | Table           | Records that failed validation (with reason)     |
| `medicare.qc_enrollments` | Materialized View | Aggregate QC dimensions for enrollments          |
| `medicare.qc_admissions`  | Materialized View | Aggregate QC dimensions for admissions           |