Health Data in Dorieh (Medicare and Medicaid)

Overview of Health Data Processing

Dorieh includes data processing pipelines that ingest and process health datasets provided by the Centers for Medicare & Medicaid Services (CMS) via ResDAC.

These pipelines build a data warehouse from the raw data files delivered by ResDAC (both Medicare and Medicaid) and prepare the data for analysis and visualization. The data warehouse can also serve as a feature store for building ML/AI models.

The workflow performs:

  • Ingestion of raw fixed-width and SAS-based files into a relational database

  • Data cleansing and deduplication (where possible)

  • Creation of standardized and federated tables

  • Computation of quality control (QC) metrics

  • Optimization for efficient querying

For more details, refer to:

The Medicare processing workflow includes a pipeline that automatically generates Quality Control (QC) tables.

These tables can be visualized in the included Apache Superset dashboard.

Project Structure

The top-level directories at the repository root are doc and src:

  • The doc/ directory contains this documentation.

  • The src/ directory contains source code, organized as follows:

    • cwl

    • python

CWL Workflows

The cwl/ folder contains reusable Common Workflow Language (CWL) tools and workflows. Each CMS data processing step (e.g., ingest, combine, transform) is implemented as a modular CWL tool.

CWL tools are documented individually. Tools are combined into full workflows, such as:

Python Utilities

The python/ directory provides CLI tools and reusable modules, documented in:

These include:

  • Parsing fixed-width data layouts from FTS files

  • Working with SAS7BDAT files

  • Generating dynamic schemas (YAML models)

  • Data loading and validation

  • PostgreSQL utilities (indexing, vacuuming)
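
To make the fixed-width parsing step concrete, here is a minimal sketch of slicing one record according to a column layout. The `Column` structure, the `parse_record` helper, and the sample layout are hypothetical and chosen only for illustration; they are not Dorieh's actual API and do not reproduce any real FTS file.

```python
from typing import Dict, List, NamedTuple, Optional


class Column(NamedTuple):
    """One field of a fixed-width layout (hypothetical structure)."""
    name: str
    start: int  # 1-based starting position of the field
    width: int  # number of characters in the field


def parse_record(line: str, layout: List[Column]) -> Dict[str, Optional[str]]:
    """Slice one fixed-width line into named, whitespace-trimmed fields."""
    record: Dict[str, Optional[str]] = {}
    for col in layout:
        begin = col.start - 1  # convert to a 0-based slice index
        raw = line[begin:begin + col.width].strip()
        record[col.name] = raw or None  # treat blank fields as missing
    return record


# Illustrative layout and record; not taken from any real FTS file.
LAYOUT = [
    Column("bene_id", 1, 8),
    Column("state", 9, 2),
    Column("year", 11, 4),
]

row = parse_record("A0000001MA2012", LAYOUT)
```

In a real pipeline the layout would come from the parsed FTS file rather than being written by hand.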

Data Model for Health Data

We define a YAML-based data model (schema) to describe each processing table.

This model is used to:

  • Automatically generate SQL DDL statements

  • Control how data is read from files and loaded into the database

  • Standardize naming conventions, indexing, and transformations

Schemas are:

  • Automatically generated via FTS parsing (for 2011+ ResDAC files)

  • Programmatically introspected from SAS7BDAT files (for 1999–2010 Medicare data)
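
The idea of deriving SQL DDL from a declarative table model can be sketched as follows. This is an illustration under stated assumptions: the dictionary stands in for a parsed YAML document, and its keys (`schema`, `table`, `columns`, `index`, `primary_key`), the table name, and the `create_table_ddl` function are hypothetical rather than the actual Dorieh model format.

```python
# A dict standing in for a parsed YAML table definition (hypothetical shape).
MODEL = {
    "schema": "medicare",
    "table": "example_table",
    "columns": [
        {"name": "bene_id", "type": "VARCHAR(15)"},
        {"name": "year", "type": "INT"},
        {"name": "state", "type": "CHAR(2)", "index": True},
    ],
    "primary_key": ["bene_id", "year"],
}


def create_table_ddl(model: dict) -> list:
    """Return a CREATE TABLE statement plus one CREATE INDEX per indexed column."""
    qualified = f'{model["schema"]}.{model["table"]}'
    lines = [f'    {c["name"]} {c["type"]}' for c in model["columns"]]
    if model.get("primary_key"):
        lines.append(f'    PRIMARY KEY ({", ".join(model["primary_key"])})')
    statements = [f"CREATE TABLE {qualified} (\n" + ",\n".join(lines) + "\n);"]
    for c in model["columns"]:
        if c.get("index"):
            idx = f'{model["table"]}_{c["name"]}_idx'
            statements.append(f'CREATE INDEX {idx} ON {qualified} ({c["name"]});')
    return statements


ddl = create_table_ddl(MODEL)
print("\n".join(ddl))
```

Keeping naming and indexing rules in the model, rather than in hand-written SQL, is what lets the same definition drive both table creation and data loading.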

Medicare Tables

See also:

Main Tables:

Federated / intermediate SQL Views:

Medicaid Tables

See also:

Tables:

Federated / intermediate SQL Views:

  • medicaid.monthly

  • medicaid._eligibility

SQL Utilities

Stored Procedures

📄 Procedures help scale the population of large tables, such as eligibility, by batching inserts by beneficiary or by year and state. Each batch runs as a smaller transaction, which avoids the resource exhaustion that a single very large transaction can cause.
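
The batching pattern can be sketched in a few lines. This example uses Python's built-in sqlite3 module with an in-memory database purely as a stand-in for the PostgreSQL stored procedures; the `raw_rows` table, its columns, and the `populate_in_batches` function are hypothetical.

```python
import sqlite3

# In-memory stand-in for the real database; table names are illustrative.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_rows (bene_id TEXT, year INT, state TEXT)")
conn.execute("CREATE TABLE eligibility (bene_id TEXT, year INT, state TEXT)")
conn.executemany(
    "INSERT INTO raw_rows VALUES (?, ?, ?)",
    [("A1", 2011, "MA"), ("A2", 2011, "NY"), ("A3", 2012, "MA"), ("A1", 2012, "MA")],
)
conn.commit()


def populate_in_batches(conn: sqlite3.Connection) -> int:
    """Copy rows one (year, state) partition at a time, committing after each."""
    partitions = conn.execute(
        "SELECT DISTINCT year, state FROM raw_rows ORDER BY year, state"
    ).fetchall()
    for year, state in partitions:
        conn.execute(
            "INSERT INTO eligibility "
            "SELECT bene_id, year, state FROM raw_rows WHERE year = ? AND state = ?",
            (year, state),
        )
        conn.commit()  # each batch is its own small transaction
    return len(partitions)


batches = populate_in_batches(conn)
total = conn.execute("SELECT COUNT(*) FROM eligibility").fetchone()[0]
```

Because every partition is committed separately, a failure partway through leaves the earlier batches in place and the job can resume from the failed partition.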

Date Parsing Functions

📄 Custom SQL functions to parse non-standard date formats commonly found in legacy Medicare files and ResDAC data.
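
The general approach of tolerant date parsing can be illustrated in Python: try a list of candidate formats in order and return nothing when none matches. The format list and the `parse_date` function below are assumptions made for illustration; they are not the project's SQL functions, and the formats actually present in the source files may differ. Month abbreviations are matched using the current locale, which is assumed here to be English.

```python
from datetime import date, datetime
from typing import Optional

# Candidate formats are assumptions chosen for illustration only.
FORMATS = ("%Y%m%d", "%d%b%Y", "%m/%d/%Y")


def parse_date(value: Optional[str]) -> Optional[date]:
    """Return the first successful parse, or None for blank or unparseable input."""
    if value is None:
        return None
    text = value.strip()
    if not text:
        return None
    for fmt in FORMATS:
        try:
            return datetime.strptime(text, fmt).date()
        except ValueError:
            continue  # try the next candidate format
    return None
```

Returning None instead of raising keeps a bulk load running when a single field is malformed; the unparseable values can then be counted as part of quality control.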

Documentation Indices