Dorieh Data Platform: Documentation Home

User and Development Documentation

Index

Introduction to Data Platform

Dorieh Data Platform is intended for the development and deployment of ETL/ELT pipelines that include complex data processing and data cleansing workflows. Complex workflows require a workflow language, and we have chosen Common Workflow Language (CWL).

We have tested deployment with the following CWL implementations:

The data produced by the data processing workflows is eventually stored in CSV files, a PostgreSQL DBMS, or Parquet files. Dorieh also supports storing results in FST and HDF5 files.

Some of the included data processing workflows use the “Extract, Load, Transform” (ELT) paradigm rather than the more traditional “Extract, Transform, Load” (ETL). This means that these workflows perform calculations, translations, filtering, cleansing, de-duplication, validation, and data analysis or summarization inside a DBMS, using the DBMS's internal tools.
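
The ELT idea can be illustrated with a minimal sketch. It is not taken from the Dorieh code base: it uses Python's built-in sqlite3 module as a stand-in for PostgreSQL, and the table names (raw_obs, obs_summary), the column names and the sample rows are all hypothetical. The raw rows are loaded unchanged, and the de-duplication, filtering and summarization are then expressed purely in SQL.

```python
import sqlite3


def run_elt_demo():
    """Load raw rows first, then clean and summarize them inside the database."""
    conn = sqlite3.connect(":memory:")  # stand-in for a PostgreSQL connection

    # Load: raw records go into a staging table untouched, duplicates and all.
    conn.execute("CREATE TABLE raw_obs (site TEXT, value TEXT)")
    raw_rows = [("A", "1.5"), ("A", "1.5"), ("B", "n/a"), ("B", "4.0"), ("A", "2.5")]
    conn.executemany("INSERT INTO raw_obs VALUES (?, ?)", raw_rows)

    # Transform: de-duplicate, drop non-numeric values and aggregate, all in SQL.
    conn.execute(
        """
        CREATE TABLE obs_summary AS
        SELECT site, AVG(CAST(value AS REAL)) AS mean_value, COUNT(*) AS n
        FROM (SELECT DISTINCT site, value FROM raw_obs WHERE value GLOB '[0-9]*')
        GROUP BY site
        """
    )
    result = conn.execute(
        "SELECT site, mean_value, n FROM obs_summary ORDER BY site"
    ).fetchall()
    conn.close()
    return result


print(run_elt_demo())
```

In an ETL design the same cleaning would happen in application code before the insert; here the staging table keeps the raw input available for re-processing inside the database.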

The data platform supports tools written in widely used languages such as Python, C/C++, Java, R, and PL/pgSQL.

The aims of this data platform, and how reproducible research can benefit from such a product, are discussed in the What is Data Platform section.

The data platform is deployed as a set of Docker containers orchestrated by Docker-Compose. Conda (package manager) environment files and Python requirements are used to build Docker containers satisfying the dependencies. Specific parameters can be customized via environment files and shell script callbacks.

Building Blocks

Dorieh Utilities

The dorieh.utils package is intended to hold Python code that is useful across multiple portions of the Dorieh pipelines.

The included utilities are developed to be as independent of specific infrastructure and execution environment as possible.

Included utilities:

Core Platform

The Core Platform provides generic functionality for Dorieh Data Platform, with APIs and command-line utilities that depend on the infrastructure and the environment. For instance, its components assume the presence of a PostgreSQL DBMS (version 13 or later) and a CWL runtime environment.

Some mapping (or crosswalk) tables are also included in the Core Platform module. These tables provide mappings between different territorial codes, such as USPS ZIP codes, Census ZCTA codes, FIPS codes for US states and counties, and SSA codes for US states and counties. For more information, see Mapping between different territorial codes.
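
As a rough illustration of how a crosswalk table is typically used, the sketch below joins records keyed by one code system to a mapping table that supplies the other. It does not reflect the actual Dorieh schema: the table and column names are hypothetical, sqlite3 stands in for PostgreSQL, and the codes are placeholders rather than real SSA or FIPS values.

```python
import sqlite3


def attach_fips(records, crosswalk):
    """Translate SSA-style codes to FIPS-style codes with a SQL join."""
    conn = sqlite3.connect(":memory:")  # stand-in for a PostgreSQL connection
    conn.execute("CREATE TABLE crosswalk (ssa_code TEXT PRIMARY KEY, fips_code TEXT)")
    conn.execute("CREATE TABLE records (record_id INTEGER, ssa_code TEXT)")
    conn.executemany("INSERT INTO crosswalk VALUES (?, ?)", crosswalk)
    conn.executemany("INSERT INTO records VALUES (?, ?)", records)

    # LEFT JOIN keeps records whose code has no mapping; fips_code is then None.
    rows = conn.execute(
        """
        SELECT r.record_id, r.ssa_code, c.fips_code
        FROM records AS r
        LEFT JOIN crosswalk AS c ON c.ssa_code = r.ssa_code
        ORDER BY r.record_id
        """
    ).fetchall()
    conn.close()
    return rows


# Illustrative placeholder codes only; not real SSA or FIPS values.
demo_crosswalk = [("S1", "F10"), ("S2", "F20")]
demo_records = [(1, "S2"), (2, "S1"), (3, "S9")]
print(attach_fips(demo_records, demo_crosswalk))
```

The left join is a deliberate choice in this sketch: unmapped codes surface as missing values instead of silently dropping records.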

See also: Managing database connections.

Dorieh GIS Utilities

Per USGS, a Geographic Information System (GIS) is a computer system that analyzes and displays geographically referenced information. It uses data that is attached to a unique location.

The dorieh.gis library contains several modules for working with US Census shapefiles. They fall into two categories:

  • Utilities to download appropriate shapefiles for a given geography type and year

  • Utilities to aggregate raster data over given shapefiles
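
To make the second category concrete, here is a minimal, pure-Python sketch of aggregating raster values per shape (a "zonal mean"). It is not the dorieh.gis API. It assumes the shapes have already been rasterized into a grid of zone ids with the same dimensions as the data grid; the function name zonal_mean, the nodata marker and the sample grids are hypothetical.

```python
from collections import defaultdict


def zonal_mean(values, zones, nodata=None):
    """Average raster cell values per zone id.

    values and zones are equally shaped lists of rows. zones holds the id of
    the shape covering each cell, or None where no shape covers it.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for value_row, zone_row in zip(values, zones):
        for value, zone in zip(value_row, zone_row):
            if zone is None or value == nodata:
                continue  # skip cells outside every shape and missing data
            totals[zone] += value
            counts[zone] += 1
    return {zone: totals[zone] / counts[zone] for zone in totals}


# Hypothetical 2x3 raster; -9999.0 marks a missing measurement.
demo_values = [[1.0, 3.0, 5.0], [2.0, -9999.0, 6.0]]
demo_zones = [["z1", "z1", "z2"], ["z1", "z2", None]]
print(zonal_mean(demo_values, demo_zones, nodata=-9999.0))
```

Real aggregation over shapefiles usually also has to handle cells that are only partly covered by a shape, which this sketch ignores.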

Dorieh Documentation Utilities

Documentation utilities simplify the creation of consistent documentation for the Dorieh platform:

  • cwl2md Generates Markdown documentation for a CWL tool or workflow

  • collector Generates automatic reStructuredText templates for all Python modules

  • copy_section Copies a specified section from one Markdown document to another, so that summaries can be collected in one file

Data Processing and Loading Pipelines

See the dedicated Pipelines page for additional details.

Fully tested and supported pipelines are listed on the Pipelines page. At this moment, we have published processing pipelines for all Data Domains except Demographics. However, health data processing pipelines cannot be tested without access to the same health data that was used for their development.

To include additional data in a deployed data platform instance, see the Adding more data section.

Pipelines can be tested with the DBT Pipeline Testing Framework.

Working with NSAPH containerized apps

The National Studies on Air Pollution and Health (NSAPH) organization publishes containerized applications that produce certain types of data. These applications are published on the NSAPH Data Production GitHub.

The Pipeline Generator generates a CWL pipeline that executes the app and ingests the data it produces into the Dorieh data warehouse.

The process of data ingestion consists of two steps:

  1. Generation of the pipeline for data ingestion

  2. Execution of the pipeline

Deployment

Dorieh can be deployed either as a Python virtual environment or as a Docker container.

The Python package can be installed with a simple pip command:

pip install dorieh

or, if FST support is desired:

pip install dorieh[FST]
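
After installation, one way to confirm that the package is visible to the active Python environment is to query the installed distribution metadata. This is a small sketch using only the standard library; it assumes that the distribution name matches the name used with pip above, and the helper name installed_version is hypothetical.

```python
from importlib import metadata


def installed_version(distribution="dorieh"):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(distribution)
    except metadata.PackageNotFoundError:
        return None


version = installed_version()
if version is None:
    print("dorieh is not installed in this environment")
else:
    print(f"dorieh {version} is installed")
```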

To run workflows one also needs a CWL implementation.

We have tested deployment with the following CWL implementations:

We suggest using Toil. To install Toil, run the following command in your Python virtual environment:

pip install "toil[cwl,aws]"

A prebuilt Docker image with Dorieh is available from DockerHub. Pull it to your local machine with the following command:

docker pull forome/dorieh

The image is built for Intel/AMD and ARM CPUs. The ARM architecture is used in AWS Graviton2 processors that, according to AWS, deliver up to 40% better price performance. ARM CPUs are also used by the latest Mac computers.

If you would like to modify the container, please refer to the README in the docker directory.

Using the Database

For a sample database query, see Sample Query.

A discussion of querying health data can be found in this document.

Terms and Acronyms

The included Glossary explains acronyms and other terms used throughout this documentation.

Additionally, General Index and Python Module Index provide direct access to the Dorieh components.

Building Platform documentation

The documentation contains general documentation pages in Markdown format and a build script that goes over all other repositories in the platform and creates a combined GitHub Pages site. The script supports links between repositories.

To build documentation:

  1. Clone Dorieh project:

     git clone https://github.com/NSAPH-Data-Platform/dorieh.git
    
  2. Change into the project directory:

     cd dorieh
    
  3. Create a virtual environment (e.g., named .dorieh):

     python -m venv .dorieh
    
  4. Activate the virtual environment and run the build_documentation shell script:

     source .dorieh/bin/activate && ./build_documentation.sh
    

To integrate Markdown with Sphinx processing, we use MyST Parser.

See also the Documentation Utilities package.