The data_loader Module

Implements parallel loading of data into a PostgreSQL database. It is also responsible for loading DDL and creating views, both virtual and materialized.

Usage

Dorieh Data Loader

API

Domain Data Loader

Provides a command-line interface for loading data from a single file or a set of column-formatted files into the NSAPH PostgreSQL database.

Input (aka source) files can be either in FST or in CSV format.

class DataLoader(context: LoaderConfig = None)

Class for data loader

set_table(table: str = None)

print_ddl()

print_table_ddl(table: str)

static execute_sql(sql: str, connxn)

insert_from_select()

is_parallel() → bool

get_connections() → List[connection]

get_connection()

get_files() → List[Tuple[Any, Callable]]

has_been_ingested(file: str, table)

reset()

drop()

run()

commit()

rollback()

close()

load()

import_data_from_file(data_file)

Configuration

Common options for data manipulation

class DBConnectionConfig(subclass, doc)

Configuration class for connection to a database

Creates a new object

Parameters:

subclass – A concrete class containing configuration information. Configuration options must be defined as class members whose names start with a single ‘_’ character and whose values are instances of the Argument class.
description – Optional text to use as the description. If not specified, it is extracted from the subclass documentation.

autocommit: Use autocommit

db: Path to a database connection parameters file

connection: Section in the database connection parameters file

location: URI or path to file(s) or directory containing data (e.g., in Parquet format). Wildcards are supported

verbose: Generate verbose output

dryrun: Dry run: do not modify the database
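
The convention described under Parameters (options declared as class members whose names start with a single underscore and whose values are Argument instances) can be illustrated with a small self-contained sketch. The Argument dataclass, the ConnectionOptions class and the collect_options helper below are hypothetical stand-ins written for this example, not the library's own classes; only the option names are taken from the list above.

```python
from dataclasses import dataclass
from typing import Any, Dict


@dataclass
class Argument:
    """Hypothetical stand-in for the library's Argument class."""
    name: str
    help: str
    default: Any = None


class ConnectionOptions:
    """Illustrative configuration class; not part of the library."""
    # Options: single leading underscore, value is an Argument instance
    _autocommit = Argument("autocommit", "Use autocommit", default=False)
    _db = Argument("db", "Path to a database connection parameters file")
    _connection = Argument("connection", "Section in the parameters file")
    # Not an option: no leading underscore and not an Argument
    note = "not an option"


def collect_options(cls) -> Dict[str, Argument]:
    """Gather class members that follow the '_name = Argument(...)' convention."""
    options = {}
    for key, value in vars(cls).items():
        single_underscore = key.startswith("_") and not key.startswith("__")
        if single_underscore and isinstance(value, Argument):
            options[value.name] = value
    return options
```

Calling collect_options(ConnectionOptions) in this sketch yields a mapping keyed by autocommit, db and connection; the plain note attribute is skipped because it does not follow the convention.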

class DBTableConfig(subclass, doc)

Creates a new object

Parameters:

subclass – A concrete class containing configuration information. Configuration options must be defined as class members whose names start with a single ‘_’ character and whose values are instances of the Argument class.
description – Optional text to use as the description. If not specified, it is extracted from the subclass documentation.

table: Name of the table to manipulate

class CommonConfig(subclass, doc)

Abstract base class for configurators used for data loading

Creates a new object

Parameters:

subclass – A concrete class containing configuration information. Configuration options must be defined as class members whose names start with a single ‘_’ character and whose values are instances of the Argument class.
description – Optional text to use as the description. If not specified, it is extracted from the subclass documentation.

domain: Name of the domain

registry: Path to the domain registry. The registry is a directory or an archive containing YAML files with domain definitions. By default, the built-in registry is used

Domain Loader Configurator

Intended to configure loading of a single file or a set of column-formatted files into the NSAPH PostgreSQL database. Input (aka source) files can be either in FST or in CSV format.

The configurator assumes that the database schema is defined as a YAML or JSON file. A separate tool is available to introspect source files and infer a possible database schema.

class Parallelization(*values)

class DataLoaderAction(*values)

class LoaderConfig(doc)

Configurator class for data loader

Creates a new object

Parameters:

subclass – A concrete class containing configuration information. Configuration options must be defined as class members whose names start with a single ‘_’ character and whose values are instances of the Argument class.
description – Optional text to use as the description. If not specified, it is extracted from the subclass documentation.

action: DataLoaderAction | None: If this option is given, then the whole domain schema will be dropped

data: Path to a data file or directory. Can be a single CSV, gzipped CSV or FST file or a directory recursively containing CSV files. Can also be a tar, tar.gz (or tgz) or zip archive containing CSV files

reset: Force recreating table(s) if it/they already exist

page: Explicit page size for the database

log: Explicit interval for logging

limit: Load at most specified number of records

buffer: Buffer size for converting FST files

threads: Number of threads writing into the database

parallelization: Type of parallelization, if any

pattern: Pattern for files in a directory or an archive, e.g., “**/maxdata_*_ps_*.csv”
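
To show what a pattern such as “**/maxdata_*_ps_*.csv” selects, here is a minimal sketch using the standard library's pathlib. It assumes the option follows ordinary glob semantics, where “**” matches any number of directory levels; the loader's own matching rules may differ. The directory layout, the file names and the find_matching_files helper are invented for the illustration.

```python
import tempfile
from pathlib import Path
from typing import List


def find_matching_files(root: Path, pattern: str) -> List[str]:
    """Return paths (relative to root) of regular files matching a glob pattern."""
    return sorted(
        p.relative_to(root).as_posix()
        for p in root.glob(pattern)
        if p.is_file()
    )


def demo() -> List[str]:
    """Build a throwaway directory tree and apply the example pattern to it."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "2010" / "ps").mkdir(parents=True)
        # Matches: nested file with '_ps_' in its name
        (root / "2010" / "ps" / "maxdata_AL_ps_2010.csv").write_text("id\n1\n")
        # Does not match: '_ip_' instead of '_ps_'
        (root / "2010" / "ps" / "maxdata_AL_ip_2010.csv").write_text("id\n1\n")
        # Matches: '**' also covers the top-level directory
        (root / "maxdata_NY_ps_2011.csv").write_text("id\n2\n")
        # Does not match: wrong name and extension
        (root / "readme.txt").write_text("not data\n")
        return find_matching_files(root, "**/maxdata_*_ps_*.csv")
```

In this sketch demo() returns the two “_ps_” files, one nested and one at the top level, and ignores the others.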

incremental: Commit every file and skip over files that have already been ingested
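
The idea behind incremental loading (commit after each file and skip files that were already ingested on a previous run) can be sketched as follows. This is an illustration under stated assumptions, not the loader's implementation: it uses the standard library's sqlite3 in place of PostgreSQL so that it is self-contained, and the ingested_files bookkeeping table, the data table and both functions are hypothetical. The has_been_ingested helper only borrows its name from the DataLoader method listed above.

```python
import sqlite3
from typing import Dict, List, Tuple


def has_been_ingested(conn: sqlite3.Connection, file: str) -> bool:
    """Check the (hypothetical) bookkeeping table for a file name."""
    row = conn.execute(
        "SELECT 1 FROM ingested_files WHERE name = ?", (file,)
    ).fetchone()
    return row is not None


def load_incrementally(
    conn: sqlite3.Connection, files: Dict[str, List[Tuple[int, str]]]
) -> List[str]:
    """Load each file's rows, committing per file; return names actually loaded."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS ingested_files (name TEXT PRIMARY KEY)"
    )
    conn.execute("CREATE TABLE IF NOT EXISTS data (id INTEGER, value TEXT)")
    conn.commit()
    loaded = []
    for name, rows in files.items():
        if has_been_ingested(conn, name):
            continue  # already loaded by an earlier run
        try:
            conn.executemany("INSERT INTO data VALUES (?, ?)", rows)
            conn.execute("INSERT INTO ingested_files VALUES (?)", (name,))
            conn.commit()  # one commit per file
        except sqlite3.Error:
            conn.rollback()  # discard the partially loaded file
            raise
        loaded.append(name)
    return loaded
```

Because the data rows and the bookkeeping record are committed together, a failure in the middle of a file leaves no trace of it, and a rerun picks up from the first file that was not recorded.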

sloppy: Do not update existing tables and views

validate(attr, value)

Subclasses can override this method to implement custom handling of command line arguments

Parameters:

attr – Command line argument name
value – Value returned by argparse

Returns:

value to use
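
A subclass override of validate might look like the sketch below. BaseConfig is a hypothetical stand-in that only mirrors the validate(attr, value) signature documented above, and the ThreadedConfig class with its rule for the threads option is invented for illustration; neither is the library's actual code.

```python
class BaseConfig:
    """Hypothetical stand-in for a configurator base class."""

    def validate(self, attr, value):
        # Default behaviour in this sketch: accept the value unchanged
        return value


class ThreadedConfig(BaseConfig):
    """Illustrative subclass that normalizes and checks one option."""

    def validate(self, attr, value):
        value = super().validate(attr, value)
        if attr == "threads":
            threads = int(value)  # convert, e.g., "4" to 4
            if threads < 1:
                raise ValueError("threads must be a positive integer")
            return threads
        # All other options pass through untouched
        return value
```

The method returns the value to use, so an override can either pass the parsed value through, replace it with a normalized one, or reject it by raising an exception.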

Models Specification

Data Modelling for Dorieh Data Platform