Dorieh Data Platform: Documentation Home
User and Development Documentation
Introduction to Data Platform
Dorieh Data Platform is intended for development and deployment of ETL/ELT pipelines that include complex data processing and data cleansing workflows. Complex workflows require a workflow language, and we have chosen Common Workflow Language (CWL).
We have tested deployment with the following CWL implementations:
Toil
CWL reference implementation, primarily using the cwlref-runner package
CWL-Airflow, which provides an Airflow graphical user interface (GUI) for running workflows
The data produced by the data processing workflows is eventually stored in CSV files, a PostgreSQL DBMS, or Parquet files. Dorieh also supports storing results in FST and HDF5 files.
Some of the included data processing workflows use the “Extract, Load, Transform” (ELT) paradigm rather than the more traditional “Extract, Transform, Load” (ETL). This means that these workflows perform calculations, translations, filtering, cleansing, de-duplication, validation, and data analysis or summarization inside a DBMS, using the DBMS's internal tools.
The data platform supports tools written in widely used languages such as Python, C/C++, Java, R, and PL/pgSQL.
The aims of this data platform, and how reproducible research can benefit from such a product, are discussed in the What is Data Platform section.
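To make the ELT idea concrete, here is a minimal sketch that loads raw records into a staging table unchanged and then filters, converts, and de-duplicates them with SQL inside the database. It is only an illustration: it uses Python's built-in SQLite as a stand-in for PostgreSQL, and the table names, column names, and sample rows are hypothetical, not part of Dorieh.

```python
import sqlite3


def run_elt(rows):
    """Load raw rows first, then clean them with SQL inside the database."""
    conn = sqlite3.connect(":memory:")  # stand-in for a PostgreSQL connection
    # Extract + Load: raw records go into a staging table unchanged.
    conn.execute("CREATE TABLE staging (key TEXT, value TEXT)")
    conn.executemany("INSERT INTO staging VALUES (?, ?)", rows)
    # Transform: filtering, type conversion and de-duplication happen in SQL.
    conn.execute(
        """
        CREATE TABLE clean AS
        SELECT DISTINCT key, CAST(value AS REAL) AS value
        FROM staging
        WHERE value IS NOT NULL AND value <> ''
        """
    )
    result = conn.execute("SELECT key, value FROM clean ORDER BY key").fetchall()
    conn.close()
    return result


# Hypothetical raw input containing a duplicate row and an empty value.
raw = [("a", "1.5"), ("a", "1.5"), ("b", ""), ("c", "2")]
print(run_elt(raw))
```

In a traditional ETL pipeline the cleaning step would run in application code before the insert; here the raw rows are loaded first and the database engine does the work.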
The data platform is deployed as a set of Docker containers orchestrated by Docker-Compose. Conda (package manager) environment files and Python requirements are used to build Docker containers satisfying the dependencies. Specific parameters can be customized via environment files and shell script callbacks.
Building Blocks
Dorieh Utilities
The dorieh.utils package is intended to hold Python code that is useful across multiple parts of the Dorieh pipelines.
The included utilities are developed to be as independent of specific infrastructure and execution environment as possible.
Included utilities:
Interpolation code
Reading FST files from Python: the pyfst module
Reading FWF files: the fwf module
Various I/O wrappers: the io_utils module
An API and CLI framework: the context module
Helper wrappers to get currently allocated memory: the profile_utils module
QC Framework
Core Platform
The core platform provides generic functionality for Dorieh Data Platform, with APIs and command line utilities that depend on the infrastructure and the environment. For instance, its components assume the presence of a PostgreSQL DBMS (version 13 or later) and a CWL runtime environment.
Some mapping (or crosswalk) tables are also included in the Core Platform module. These tables include mappings between different territorial codes, such as USPS ZIP codes, Census ZCTA codes, FIPS codes for US states and counties, and SSA codes for US states and counties. See Mapping between different territorial codes for more information.
See also: Managing database connections.
Dorieh GIS Utilities
Per USGS, a Geographic Information System (GIS) is a computer system that analyzes and displays geographically referenced information. It uses data that is attached to a unique location.
The dorieh.gis library contains several modules for working with US Census shapefiles.
They fall into two categories:
Utilities to download appropriate shapefiles for a given geography type and year
Utilities to aggregate raster data over given shapefiles
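As a rough illustration of what aggregating raster data over shapes means, the sketch below averages raster cell values per zone. It is not the dorieh.gis implementation: it assumes the shapes have already been rasterized into a grid of zone labels aligned with the data raster, and the function name, arguments, and sample grids are hypothetical.

```python
from collections import defaultdict


def zonal_mean(raster, zones, nodata=None):
    """Average raster values per zone.

    raster: 2D list of numbers (or None for missing cells).
    zones:  2D list of the same shape; each cell holds a zone label,
            or None where the cell lies outside every shape.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for raster_row, zone_row in zip(raster, zones):
        for value, zone in zip(raster_row, zone_row):
            # Skip cells outside all shapes and cells without data.
            if zone is None or value is None or value == nodata:
                continue
            sums[zone] += value
            counts[zone] += 1
    return {zone: sums[zone] / counts[zone] for zone in sums}


# Hypothetical 2x2 raster and matching zone labels.
raster = [[1, 2], [3, 4]]
zones = [["A", "A"], ["B", None]]
print(zonal_mean(raster, zones))
```

Real shapefile workflows would first intersect polygon geometries with the raster grid; this sketch only shows the aggregation step that follows.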
Dorieh Documentation Utilities
Documentation utilities to simplify creation of consistent documentation for Dorieh platform
cwl2md Generates Markdown documentation for a CWL tool or workflow
collector Generates automatic reStructuredText templates for all Python modules
copy_section Copies a specified section from one markdown document to another. This way we can collect summaries in one file
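The idea behind copying a section can be sketched in a few lines of Python. The following is not the actual copy_section code; it is a minimal illustration, with a hypothetical function name, of how a section could be cut out of a Markdown document by its heading so that it can be appended to another file. It assumes ATX-style headings (lines starting with #) and does not handle # characters inside fenced code blocks.

```python
import re

HEADING = re.compile(r"^(#+)\s+(.*?)\s*$")


def extract_section(markdown_text, title):
    """Return the section headed by `title`, including its subsections.

    The section ends at the next heading of the same or a higher level.
    An empty string is returned when no heading matches `title`.
    """
    section = []
    level = None  # heading level of the section once it has been found
    for line in markdown_text.splitlines():
        match = HEADING.match(line)
        if level is None:
            if match and match.group(2) == title:
                level = len(match.group(1))
                section.append(line)
        else:
            if match and len(match.group(1)) <= level:
                break
            section.append(line)
    return "\n".join(section)


doc = "# A\ntext\n## B\nbody\n### C\nmore\n## D\nend"
print(extract_section(doc, "B"))
```

The returned text can then be written into the target document, which is how summaries from several files could be collected in one place.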
Data Processing and Loading Pipelines
See the dedicated Pipelines page for additional details.
Fully tested and supported pipelines are listed in the Pipelines page. At this moment, we have published processing pipelines for all Data Domains except Demographics. However, it is not possible to test health data processing pipelines without access to the same health data that was used for their development.
To include additional data in a deployed data-platform instance, see the Adding more data section.
Pipelines can be tested with the DBT Pipeline Testing Framework.
Working with NSAPH containerized apps
The National Studies on Air Pollution and Health (NSAPH) organization publishes containerized applications to produce certain types of data. These applications are published on the NSAPH Data Production GitHub.
The Pipeline Generator generates a CWL pipeline to execute the app and ingest the data it produces into the Dorieh data warehouse.
The process of data ingestion consists of two steps:
Generation of the pipeline for data ingestion
Execution of the pipeline
Deployment
Dorieh can be deployed either as a Python virtual environment or as a Docker container.
The Python package can be installed using a simple pip command:
pip install dorieh
or, if FST support is desired:
pip install dorieh[FST]
To run workflows one also needs a CWL implementation.
We have tested deployment with the following CWL implementations:
Toil
CWL reference implementation, primarily using the cwlref-runner package
CWL-Airflow, which provides an Airflow graphical user interface (GUI) for running workflows
We suggest using Toil. To install Toil, run the following command in your Python virtual environment:
pip install "toil[cwl,aws]"
A prebuilt Docker image with Dorieh is available from DockerHub. Pull it to your local machine using the following command:
docker pull forome/dorieh
The image is built for Intel/AMD and ARM CPUs. The ARM architecture is used in AWS Graviton2 processors that, according to AWS, deliver up to 40% better price performance. ARM CPUs are also used by recent Mac computers.
If you would like to modify the container, please refer to the README in the docker directory.
Using the Database
For an example of how to query the database, see Sample Query.
A discussion of querying health data can be found in this document.
Terms and Acronyms
The included Glossary provides some information about acronyms and other terms used throughout this documentation.
Additionally, the General Index and the Python Module Index provide direct access to the Dorieh components.
Building Platform documentation
The documentation consists of general documentation pages in Markdown format and a build script that goes over all other repositories in the platform and creates a combined GitHub Pages site. The script supports links between repositories.
To build documentation:
Clone the Dorieh project:
git clone https://github.com/NSAPH-Data-Platform/dorieh.git
Change into the project directory:
cd dorieh
Create a virtual environment (e.g., named .dorieh):
python -m venv .dorieh
Activate the environment and run the build_documentation shell script:
source .dorieh/bin/activate && ./build_documentation.sh
To integrate Markdown with Sphinx processing, we use MyST Parser.
See the Documentation Utilities package.