Health Data in Dorieh (Medicare and Medicaid)
Overview of Health Data Processing
Dorieh includes data processing pipelines to ingest and process health datasets provided by the Centers for Medicare & Medicaid Services (CMS) via ResDAC.
These pipelines build a data warehouse from ResDAC-delivered raw data files (both Medicare and Medicaid), preparing the data for analysis and visualization. The data warehouse can also be used as a feature store for building ML/AI models.
The workflow performs:
- Ingestion of raw fixed-width and SAS-based files into a relational database
- Data cleansing and deduplication (where possible)
- Creation of standardized and federated tables
- Computation of quality control (QC) metrics
- Optimization for efficient querying
For more details, refer to:
- Medicare processing workflow and data model (schema)
- Medicaid processing workflow and data model (schema)
- Tips on querying Medicaid data
The Medicare processing workflow includes a pipeline that automatically generates Quality Control (QC) tables.
These tables can be visualized in the included Apache Superset dashboard.
Project Structure
The top-level directories at the repository root are:
- doc
- src
The doc/ directory contains this documentation. The src/ directory contains source code, organized as follows:
- cwl
- python
CWL Workflows
The cwl/ folder contains reusable Common Workflow Language (CWL) tools and workflows. Each CMS data processing step (e.g., ingest, combine, transform) is implemented as a modular CWL tool. CWL tools are documented individually and are combined into full workflows.
Python Utilities
The python/ directory provides CLI tools and reusable modules. These include:
- Parsing fixed-width data layouts from FTS files
- Working with SAS7BDAT files
- Generating dynamic schemas (YAML models)
- Data loading and validation
- PostgreSQL utilities (indexing, vacuuming)
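To make the fixed-width parsing step concrete, here is a minimal sketch of slicing one record once a layout is known. It is an illustration only, not Dorieh's actual parser: the `Column` type, the `parse_record` function, and the sample layout are hypothetical, and real column names and positions would come from the FTS file.

```python
from typing import NamedTuple


class Column(NamedTuple):
    name: str
    start: int   # 1-based position of the first character
    length: int  # width of the field in characters


def parse_record(line: str, layout: list[Column]) -> dict[str, str]:
    """Slice one fixed-width record into named, whitespace-stripped fields."""
    return {
        col.name: line[col.start - 1 : col.start - 1 + col.length].strip()
        for col in layout
    }


# Hypothetical layout; real names and positions come from the FTS file.
LAYOUT = [
    Column("bene_id", 1, 8),
    Column("state", 9, 2),
    Column("birth_date", 11, 8),
]

record = parse_record("A0000001MA19400115", LAYOUT)
```

All values are kept as strings here; type conversion and validation would be a separate step.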
Data Model for Health Data
We define a YAML-based data model (schema) to describe each processing table.
This model is used to:
- Automatically generate SQL DDL statements
- Control how data is read from files and loaded into the database
- Standardize naming conventions, indexing, and transformations

Schemas are:
- Automatically generated via FTS parsing (for 2011+ ResDAC files)
- Programmatically introspected from SAS7BDAT files (for 1999–2010 Medicare data)
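The idea of generating DDL from a declarative model can be sketched as follows. This is a simplified illustration under stated assumptions: the model is written as the Python dictionary a YAML loader would return, and its keys (`columns`, `type`, `index`, `primary_key`) as well as the `generate_ddl` function are hypothetical and do not reflect the actual Dorieh schema format.

```python
def generate_ddl(schema: str, table: str, model: dict) -> list[str]:
    """Build CREATE TABLE and CREATE INDEX statements from a table model."""
    columns = model["columns"]
    column_defs = ",\n".join(
        f"    {name} {spec['type']}" for name, spec in columns.items()
    )
    pk = ", ".join(model["primary_key"])
    statements = [
        f"CREATE TABLE {schema}.{table} (\n{column_defs},\n    PRIMARY KEY ({pk})\n);"
    ]
    # One index per column flagged with "index" in the model.
    for name, spec in columns.items():
        if spec.get("index"):
            statements.append(
                f"CREATE INDEX {table}_{name}_idx ON {schema}.{table} ({name});"
            )
    return statements


# Hypothetical model, written as the dict a YAML loader would return.
MODEL = {
    "columns": {
        "bene_id": {"type": "VARCHAR(15)"},
        "year": {"type": "INT"},
        "state": {"type": "CHAR(2)", "index": True},
    },
    "primary_key": ["bene_id", "year"],
}

ddl = generate_ddl("medicaid", "beneficiaries", MODEL)
```

Keeping column types and index flags in one place means the table definition and its indexes cannot drift apart, which is the main appeal of a model-driven approach.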
Medicare Tables
See also:
Main Tables:
Federated / intermediate SQL Views:
- medicare.ps: Union of raw data for patient summaries
- medicare.ip: Union of raw data for inpatient admissions
- medicare._ps
- medicare._beneficiaries
- medicare._enrollments
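A federated view that unions raw data can be thought of as a `UNION ALL` over per-file or per-year source tables. The sketch below builds such a statement; the `union_view_sql` function, the per-year table names, and the column list are hypothetical, since the actual source tables and view definitions are not shown here.

```python
def union_view_sql(view: str, tables: list[str], columns: list[str]) -> str:
    """Return a CREATE VIEW statement that unions the same columns from every table."""
    if not tables:
        raise ValueError("at least one source table is required")
    column_list = ", ".join(columns)
    selects = [f"SELECT {column_list} FROM {table}" for table in tables]
    body = "\nUNION ALL\n".join(selects)
    return f"CREATE VIEW {view} AS\n{body};"


# Hypothetical per-year raw tables and columns, for illustration only.
sources = [f"medicare.ps_{year}" for year in (2011, 2012, 2013)]
sql = union_view_sql("medicare.ps", sources, ["bene_id", "year", "state"])
```

`UNION ALL` is used rather than `UNION` because it does not spend time removing duplicates; deduplication is assumed to happen in a later step.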
Medicaid Tables
See also:
Tables:
- medicaid.beneficiaries
- medicaid.enrollments
- medicaid.eligibility
- medicaid.admissions
Federated / intermediate SQL Views:
- medicaid.monthly
- medicaid._eligibility
SQL Utilities
Stored Procedures
Stored procedures help scale the population of large tables, such as eligibility, by batching inserts by beneficiary or by year and state. Committing each batch separately avoids the resource exhaustion that a single large transaction can cause.
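The batching pattern can be illustrated with a short sketch. It is not the actual stored procedure: it uses Python's built-in `sqlite3` module as a stand-in for PostgreSQL so that it runs anywhere, and the table names, columns, and `populate_in_batches` function are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_eligibility (bene_id TEXT, year INT, state TEXT)")
conn.execute("CREATE TABLE eligibility (bene_id TEXT, year INT, state TEXT)")
conn.executemany(
    "INSERT INTO raw_eligibility VALUES (?, ?, ?)",
    [("a", 2011, "MA"), ("b", 2011, "NY"), ("c", 2012, "MA"), ("d", 2012, "MA")],
)
conn.commit()


def populate_in_batches(conn: sqlite3.Connection) -> int:
    """Copy rows one (year, state) batch at a time, committing after each batch."""
    batches = conn.execute(
        "SELECT DISTINCT year, state FROM raw_eligibility ORDER BY year, state"
    ).fetchall()
    for year, state in batches:
        conn.execute(
            "INSERT INTO eligibility "
            "SELECT bene_id, year, state FROM raw_eligibility "
            "WHERE year = ? AND state = ?",
            (year, state),
        )
        conn.commit()  # each batch is its own short transaction
    return len(batches)


n_batches = populate_in_batches(conn)
n_rows = conn.execute("SELECT COUNT(*) FROM eligibility").fetchone()[0]
```

Because each batch commits independently, a failure partway through leaves earlier batches in place, so a rerun would need to skip or replace batches that were already loaded.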
Date Parsing Functions
Custom SQL functions parse the non-standard date formats commonly found in legacy Medicare files and ResDAC data.
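The general approach of such functions, trying several candidate formats in turn, can be sketched in Python. This is not a port of the SQL functions: the `parse_date` function and the list of formats are assumptions chosen for illustration, because the formats the actual functions handle are not listed here.

```python
from datetime import date, datetime
from typing import Optional

# Assumed formats for illustration only; the formats handled by the
# actual SQL functions are not listed in this document.
FORMATS = ("%Y%m%d", "%d%b%Y", "%m/%d/%Y", "%Y-%m-%d")


def parse_date(value: Optional[str]) -> Optional[date]:
    """Try each known format in turn; return None for blank or unparseable input."""
    if value is None or not value.strip():
        return None
    text = value.strip()
    for fmt in FORMATS:
        try:
            return datetime.strptime(text, fmt).date()
        except ValueError:
            continue  # not this format, try the next one
    return None
```

Returning `None` instead of raising mirrors how a database function might map unparseable values to NULL, so that one bad value does not abort a bulk load. Note that month-name parsing with `%b` depends on the active locale.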