Health Data in Dorieh (Medicare and Medicaid)

Overview of Health Data Processing 

Dorieh includes data processing pipelines to ingest and process health datasets provided by the Centers for Medicare & Medicaid Services (CMS) via ResDAC.

These pipelines build a data warehouse from the raw data files delivered by ResDAC (both Medicare and Medicaid), preparing the data for analysis and visualization. The data warehouse can also serve as a feature store for building ML/AI models.

The workflow performs:

Ingestion of raw fixed-width and SAS-based files into a relational database
Data cleansing and deduplication (where possible)
Creation of standardized and federated tables
Computation of quality control (QC) metrics
Optimization for efficient querying

For more details, refer to:

Medicare processing workflow and data model (schema)
Medicaid processing workflow and data model (schema)
Tips on querying Medicaid data

The Medicare processing workflow includes a pipeline that automatically generates Quality Control (QC) tables.

These tables can be visualized in the included Apache Superset dashboard.

Project Structure 

The top-level directories at the repository root are:

- doc
- src

The doc/ directory contains this documentation.
The src/ directory contains source code, organized as follows:
- cwl
- python

CWL Workflows 

The cwl/ folder contains reusable Common Workflow Language (CWL) tools and workflows. Each CMS data processing step (e.g., ingest, combine, transform) is implemented as a modular CWL tool.

CWL tools are documented individually. Tools are combined into full workflows, such as:

Medicare Pipeline
Medicaid Pipeline

Python Utilities 

The python/ directory provides CLI tools and reusable modules, documented in:

CMS Python Package Overview.

These include:

Parsing fixed-width data layouts from FTS files
Working with SAS7BDAT files
Generating dynamic schemas (YAML models)
Data loading and validation
PostgreSQL utilities (indexing, vacuuming)
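
As an illustration of the first item, the sketch below slices one fixed-width record into named fields. It is not the Dorieh parser: the `Column` type, the `parse_record` function, and the three-column layout are all hypothetical, standing in for the column positions that an FTS file would describe.

```python
from typing import Dict, List, NamedTuple, Optional


class Column(NamedTuple):
    """One field of a fixed-width record (hypothetical layout entry)."""
    name: str
    start: int  # 1-based position of the first character
    width: int  # number of characters in the field


def parse_record(line: str, layout: List[Column]) -> Dict[str, Optional[str]]:
    """Slice one fixed-width line into named fields; blank fields become None."""
    record: Dict[str, Optional[str]] = {}
    for col in layout:
        begin = col.start - 1  # convert to a 0-based slice offset
        raw = line[begin:begin + col.width].strip()
        record[col.name] = raw or None
    return record


# Hypothetical layout, for illustration only (not taken from a real FTS file).
LAYOUT = [
    Column("bene_id", 1, 10),
    Column("state", 11, 2),
    Column("year", 13, 4),
]

row = parse_record("A000000001MA2012", LAYOUT)
```

Keeping the layout as data rather than code is what lets a parser of this kind be driven by a generated description instead of hand-written slicing for every file.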

Data Model for Health Data

We define a YAML-based data model (schema) to describe each processing table.

This model is used to:

Automatically generate SQL DDL statements
Control how data is read from files and loaded into the database
Standardize naming conventions, indexing, and transformations

Schemas are:

Automatically generated via FTS parsing (for 2011+ ResDAC files)
Programmatically introspected from SAS7BDAT files (for 1999–2010 Medicare data)
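
To make the idea of generating DDL from a declarative model concrete, here is a minimal sketch. The model is written as the Python dictionary that a YAML parser might return. Its keys (`table`, `columns`, `primary_key`, `indices`), the table and column names, and the `generate_ddl` function are assumptions made for this example and do not reproduce the actual Dorieh schema format.

```python
from typing import Any, Dict, List

# Hypothetical model, shown as the dict a YAML parser might produce.
MODEL: Dict[str, Any] = {
    "table": "enrollment",
    "columns": [
        {"name": "bene_id", "type": "VARCHAR(15)"},
        {"name": "year", "type": "INT"},
        {"name": "state", "type": "CHAR(2)"},
    ],
    "primary_key": ["bene_id", "year"],
    "indices": ["state"],
}


def generate_ddl(model: Dict[str, Any]) -> List[str]:
    """Build CREATE TABLE and CREATE INDEX statements from a model dict."""
    table = model["table"]
    lines = ["    {} {}".format(c["name"], c["type"]) for c in model["columns"]]
    if model.get("primary_key"):
        key = ", ".join(model["primary_key"])
        lines.append("    PRIMARY KEY ({})".format(key))
    statements = ["CREATE TABLE {} (\n{}\n);".format(table, ",\n".join(lines))]
    # One single-column index per entry in the (assumed) "indices" list.
    for column in model.get("indices", []):
        statements.append(
            "CREATE INDEX {0}_{1}_idx ON {0} ({1});".format(table, column)
        )
    return statements


ddl = generate_ddl(MODEL)
```

Because the same model can also drive file reading and loading, a change to a column definition only has to be made in one place.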

Medicare Tables 

Medicaid Tables 

SQL Utilities 

Stored Procedures 

Stored procedures help scale the population of large tables, such as eligibility, by batching inserts by beneficiary or by year and state. Batching avoids the resource exhaustion that a single very large transaction can cause.
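
The batching idea can be sketched on the client side as follows. This is only an analogy: it uses Python with an in-memory SQLite database rather than stored procedures in the data warehouse, and the table names, column names, and the `populate_in_batches` function are hypothetical.

```python
import sqlite3


def populate_in_batches(conn: sqlite3.Connection) -> int:
    """Copy rows into eligibility one (year, state) batch at a time.

    Each batch is committed separately, so no single transaction has to
    cover the whole source table. Returns the number of batches processed.
    """
    batches = conn.execute(
        "SELECT DISTINCT year, state FROM raw_enrollment ORDER BY year, state"
    ).fetchall()
    for year, state in batches:
        conn.execute(
            "INSERT INTO eligibility (bene_id, year, state) "
            "SELECT bene_id, year, state FROM raw_enrollment "
            "WHERE year = ? AND state = ?",
            (year, state),
        )
        conn.commit()  # end the transaction for this batch
    return len(batches)


# Tiny in-memory demonstration with made-up rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_enrollment (bene_id TEXT, year INT, state TEXT)")
conn.execute("CREATE TABLE eligibility (bene_id TEXT, year INT, state TEXT)")
conn.executemany(
    "INSERT INTO raw_enrollment VALUES (?, ?, ?)",
    [
        ("a1", 2011, "MA"),
        ("a2", 2011, "MA"),
        ("a3", 2011, "NY"),
        ("a1", 2012, "MA"),
    ],
)
conn.commit()
n_batches = populate_in_batches(conn)
```

If one batch fails, the batches committed before it remain in place, so a rerun only needs to cover the remaining year and state combinations.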

Date Parsing Functions 

Custom SQL functions parse the non-standard date formats commonly found in legacy Medicare files and ResDAC data.
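
The general approach of such a function, trying several candidate formats in turn, can be sketched in Python. This is not the SQL implementation shipped with Dorieh. The `parse_date` function and the list of formats are assumptions chosen for illustration (for example, `DDMONYYYY` is the form written by the SAS `DATE9.` format); the formats actually handled by the SQL functions may differ.

```python
from datetime import date, datetime
from typing import Optional

# Assumed candidate formats, for illustration only.
FORMATS = (
    "%Y%m%d",    # 20120131
    "%d%b%Y",    # 31JAN2012 (month names assume an English locale)
    "%m/%d/%Y",  # 01/31/2012
    "%Y-%m-%d",  # 2012-01-31
)


def parse_date(value: Optional[str]) -> Optional[date]:
    """Try each format in order; return None for blank or unparseable input."""
    if value is None or not value.strip():
        return None
    text = value.strip()
    for fmt in FORMATS:
        try:
            return datetime.strptime(text, fmt).date()
        except ValueError:
            continue  # this format did not match, try the next one
    return None
```

Returning None instead of raising keeps a bulk load running when a field is malformed; the affected rows can then be counted as part of quality control.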