Dorieh CMS Package (Manipulating Health Data)

Pipelines to process CMS data: Medicaid and Medicare

Overview of health data (Medicare and Medicaid)

We use health data provided by the Centers for Medicare & Medicaid Services (CMS).

The data processing pipelines included in this package create a data warehouse of health data (Medicare and Medicaid). They ingest raw data into the database, cleanse it and, when possible, deduplicate it, analyze data quality, and optimize the tables for efficient queries.

Please see the following documents for details:

Medicare processing now includes a pipeline that automatically creates QC tables. These tables are used by an Apache Superset dashboard that visualizes the QC results.

Project Structure

The top-level directories are:

- doc
- src

The doc directory contains documentation.

The src directory contains the software source code. The directories under src are:

- cwl
- python

CWL

The CWL folder contains reusable workflows, packaged as tools that can and should be used by all Dorieh pipelines.

Each CMS data processing step is packaged as a standalone tool that can be run individually. Each tool is individually documented. The tools are combined into workflows represented by the medicaid.cwl and medicare.cwl files.

Python

Python packages and modules are described in the Python Package Description document.

Included are utilities to:

  • Parse FTS format and generate database schema

Data Model for health data

The data model, written in YAML format, is used to generate the database schema and the processing code that ingests data into the database. Read more about modeling in the Data Modeling document.
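To illustrate the general idea of deriving a schema from a declarative model, here is a minimal sketch. It does not use the actual Dorieh model format: the dictionary layout, the table name, the column names and the model_to_ddl function are all hypothetical and stand in for a parsed YAML document.

```python
# Hypothetical, simplified model. This is NOT the actual Dorieh YAML schema;
# it only stands in for the result of parsing a YAML model file.
model = {
    "table": "enrollment",
    "columns": [
        {"name": "bene_id", "type": "VARCHAR(15)"},
        {"name": "year", "type": "INT"},
        {"name": "state", "type": "CHAR(2)"},
    ],
    "primary_key": ["bene_id", "year"],
}


def model_to_ddl(model: dict) -> str:
    """Render a CREATE TABLE statement from the simplified model above."""
    lines = [f'    {c["name"]} {c["type"]}' for c in model["columns"]]
    pk = model.get("primary_key")
    if pk:
        lines.append(f'    PRIMARY KEY ({", ".join(pk)})')
    body = ",\n".join(lines)
    return f'CREATE TABLE {model["table"]} (\n{body}\n);'


ddl = model_to_ddl(model)
print(ddl)
```

Keeping the table definition in data rather than in hand-written SQL means the same description can drive both schema creation and ingestion code, which is the role the YAML model plays in this package.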

The model for raw data is automatically generated by parsing FTS files or analyzing SAS data.

The following models are defined here:

SQL

The procedures file addresses the problem that creating the Medicaid eligibility table in a single transaction requires too much time and memory. The stored procedures in this file split the population of this table either by beneficiary or by year and state. Splitting by beneficiary (i.e., using one database transaction per beneficiary) works best.
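The one-transaction-per-beneficiary pattern can be sketched as follows. This is an illustration only, not the stored procedures from this package: it uses an in-memory SQLite database as a stand-in for the real database, and the table names, column names and the populate_by_beneficiary function are assumptions made for the example.

```python
import sqlite3

# In-memory SQLite stands in for the real database (assumption).
# isolation_level=None puts the connection in autocommit mode, so the
# explicit BEGIN/COMMIT statements below fully control the transactions.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.executescript("""
    CREATE TABLE raw_enrollment (bene_id TEXT, year INT, state TEXT);
    CREATE TABLE eligibility (bene_id TEXT, year INT, state TEXT);
    INSERT INTO raw_enrollment VALUES
        ('A', 2010, 'MA'), ('A', 2011, 'MA'), ('B', 2010, 'NY');
""")


def populate_by_beneficiary(conn: sqlite3.Connection) -> int:
    """Copy rows into eligibility, committing once per beneficiary."""
    ids = [row[0] for row in conn.execute(
        "SELECT DISTINCT bene_id FROM raw_enrollment ORDER BY bene_id")]
    for bene_id in ids:
        conn.execute("BEGIN")
        try:
            conn.execute(
                "INSERT INTO eligibility "
                "SELECT bene_id, year, state FROM raw_enrollment "
                "WHERE bene_id = ?",
                (bene_id,),
            )
            conn.execute("COMMIT")
        except Exception:
            # Only the current beneficiary's rows are undone.
            conn.execute("ROLLBACK")
            raise
    return len(ids)


n_transactions = populate_by_beneficiary(conn)
row_count = conn.execute("SELECT COUNT(*) FROM eligibility").fetchone()[0]
```

Because each transaction touches only one beneficiary's rows, the amount of uncommitted work held at any moment stays small, and a failure partway through leaves all previously committed beneficiaries in place.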

The functions file contains helper functions that parse dates in the non-standard formats encountered in the raw Medicare files that we have.
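A common way to handle such dates is to try a list of candidate formats in order. The sketch below shows that approach in Python; it is not the code from the functions file, and both the list of formats and the parse_raw_date name are assumptions chosen for illustration, not the formats actually found in the raw files.

```python
from datetime import date, datetime
from typing import Optional

# Candidate formats are assumptions for this example; the formats that
# actually occur in the raw files may differ.
CANDIDATE_FORMATS = ("%Y%m%d", "%d%b%Y", "%m/%d/%Y", "%Y-%m-%d")


def parse_raw_date(value: Optional[str]) -> Optional[date]:
    """Return a date for the first matching format, or None if none match."""
    if value is None:
        return None
    text = value.strip()
    if not text:
        return None
    for fmt in CANDIDATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date()
        except ValueError:
            continue  # try the next candidate format
    return None
```

Returning None for unparseable values, rather than raising, lets an ingestion step load the rest of the record and leave the bad date for data quality analysis.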

Documentation Indices