Project (Directory) Loading Utility
Overview
Project Loader is a command line tool that introspects a directory containing
CSV (or CSV-like, e.g. FST, JSON, SAS) files and ingests it into a database.
The directory can be structured, i.e. have nested subdirectories. All files
matching a given name pattern at any nesting level are included in the
data set. The tool can also load a single file if a file rather than a
directory is given as the --data argument.
In the database, a schema is created based on the given project name. For each file in the data set, a table is created. The table name is constructed from the relative path of the incoming data file, with OS path separators (e.g. ‘/’) replaced by underscores (‘_’).
Before actually ingesting data into the database, it is a good idea to do a dry run and visually examine the database schema created by the Introspection utility.
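The naming rule can be illustrated with a minimal sketch. The function name `table_name` is hypothetical and is not part of the loader. Dropping the file extension is an assumption made for this illustration; the text above only specifies that path separators become underscores.

```python
from pathlib import PurePosixPath


def table_name(data_root: str, file_path: str) -> str:
    """Illustrative only: derive a table name from a file's relative path."""
    rel = PurePosixPath(file_path).relative_to(PurePosixPath(data_root))
    # Assumption: everything after the first dot in the file name is dropped.
    stem = rel.name.split(".", 1)[0]
    # Join the directory components and the stem with underscores,
    # which is equivalent to replacing '/' with '_'.
    return "_".join(list(rel.parent.parts) + [stem])


nested = table_name("/data/incoming", "/data/incoming/2020/claims/part1.csv.gz")
top_level = table_name("/data/incoming", "/data/incoming/part1.csv")
print(nested, top_level)
```

The actual names produced by the tool can be checked in the registry file generated by a dry run.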
Loading into the database is performed using Data Loader functionality.
Configuration options
Configuration options are provided by the LoaderConfig object. Usually they are given as command line arguments, but they can also be set via an API call.
Some configuration options can be provided in the registry YAML file. By default, if the registry does not exist, a new YAML file will be created with the following parameters:
header: True ## i.e. CSV files are expected to have header line
quoting: QUOTE_MINIMAL, ## i.e. only strings with whitespaces are expected to be quoted
index: "unless excluded" ## We will build indices for every column unless it is explicitly excluded
See Domain options for the descriptions of these parameters.
Once a registry file has been created, it can be manually edited by the user. Manual modifications are preserved in subsequent runs.
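The precedence between defaults and an existing registry can be sketched as follows. This is a simplified illustration, not the loader's implementation: the flat dictionary layout and the function name `effective_settings` are assumptions, and only the three default values come from the list above.

```python
# Default values as listed in the documentation above.
DEFAULTS = {
    "header": True,
    "quoting": "QUOTE_MINIMAL",
    "index": "unless excluded",
}


def effective_settings(registry_settings=None):
    """Illustrative only: settings found in a registry win over defaults."""
    settings = dict(DEFAULTS)          # start from a copy of the defaults
    if registry_settings:
        settings.update(registry_settings)  # manual edits take precedence
    return settings


fresh = effective_settings()                    # no registry yet
edited = effective_settings({"header": False})  # user edited one value
print(fresh, edited)
```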
Usage from command line
python -u -m dorieh.platform.loader.project_loader
[-h] [--drop]
[--data DATA [DATA ...]]
[--pattern PATTERN [PATTERN ...]]
[--reset]
[--incremental]
[--sloppy]
[--page PAGE]
[--log LOG]
[--limit LIMIT]
[--buffer BUFFER]
[--threads THREADS]
[--parallelization {lines,files,none}]
[--dryrun]
[--autocommit]
[--db DB]
[--connection CONNECTION]
[--verbose]
[--table TABLE]
--domain DOMAIN
[--registry REGISTRY]
optional arguments:
-h, --help show this help message and exit
--drop Drops domain schema, default: False
--data DATA [DATA ...]
Path to a data file or directory. Can be a single CSV,
gzipped CSV or FST file or a directory recursively
containing CSV files. Can also be a tar, tar.gz (or
tgz) or zip archive containing CSV files, default:
None
--pattern PATTERN [PATTERN ...]
pattern for files in a directory or an archive, e.g.
`**/maxdata_*_ps_*.csv`, default: None
--reset Force recreating table(s) if it/they already exist,
default: False
--incremental Commit every file and skip over files that have
already been ingested, default: False
--sloppy Do not update existing tables, default: False
--page PAGE Explicit page size for the database, default: None
--log LOG Explicit interval for logging, default: None
--limit LIMIT Load at most specified number of records, default:
None
--buffer BUFFER Buffer size for converting fst files, default: None
--threads THREADS Number of threads writing into the database, default:
1
--parallelization {lines,files,none}
Type of parallelization, if any, default: lines
--dryrun Dry run: do not load any data, default: False
--autocommit Use autocommit, default: False
--db DB Path to a database connection parameters file,
default: database.ini
--connection CONNECTION
Section in the database connection parameters file,
default: nsaph2
--verbose Verbose output, default: False
--table TABLE, -t TABLE
Name of the table to manipulate, default: None
--domain DOMAIN Name of the domain
--registry REGISTRY Path to domain registry. Registry is a directory or an
archive containing YAML files with domain definition.
Default is to use the built-in registry, default: None
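To get a feel for what a recursive pattern such as `**/maxdata_*_ps_*.csv` selects, the following sketch uses Python's standard `pathlib` globbing on a throwaway directory. It assumes that the loader's pattern matching behaves like `Path.glob`, which is not confirmed here; the helper `find_data_files` and the sample file names are made up for the illustration.

```python
import tempfile
from pathlib import Path


def find_data_files(root, pattern):
    """Illustrative only: relative paths of files under root matching pattern."""
    root = Path(root)
    return sorted(
        p.relative_to(root).as_posix()
        for p in root.glob(pattern)
        if p.is_file()
    )


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    # Build a small nested directory tree with matching and non-matching files.
    (root / "2012" / "ny").mkdir(parents=True)
    (root / "2012" / "ny" / "maxdata_ny_ps_2012.csv").touch()
    (root / "2012" / "readme.txt").touch()
    (root / "maxdata_al_ps_2011.csv").touch()
    matched = find_data_files(root, "**/maxdata_*_ps_*.csv")

print(matched)
```

With `pathlib`, the leading `**/` matches the top-level directory as well as every nested subdirectory, so both CSV files are selected and the text file is skipped.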
Sample command
The following command creates a schema named my_schema and loads tables from all files with the extension .csv found recursively under the directory /data/incoming/valuable/data/:
python -u -m dorieh.platform.loader.project_loader --domain my_schema --data /data/incoming/valuable/data/ --registry my_temp_schema.yaml --reset --pattern *.csv --db database.ini --connection postgres
It uses the database.ini file (see Managing database connections for details) in the current directory (where the program is started) and the section named postgres inside it.
It creates a temporary file, my_temp_schema.yaml, also in the current directory. If such a file already exists, it is loaded and the settings found in it override the defaults. The --reset option deletes all existing tables with the same names and recreates them.
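How a named section is selected from an INI file can be sketched with Python's standard `configparser`. The keys shown (`host`, `database`, `user`) and their values are hypothetical placeholders chosen for this illustration; the keys actually expected by the loader are described in Managing database connections.

```python
import configparser

# Hypothetical file content; real key names may differ.
SAMPLE_INI = """
[postgres]
host = localhost
database = example_db
user = example_user
"""


def read_connection_section(ini_text, section):
    """Illustrative only: return one section of an INI document as a dict."""
    parser = configparser.ConfigParser()
    parser.read_string(ini_text)
    if not parser.has_section(section):
        raise KeyError(f"No section [{section}] in connection parameters")
    return dict(parser[section])


params = read_connection_section(SAMPLE_INI, "postgres")
print(params)
```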
The following is the same command, but with parallel execution using 4 threads writing into the database and with an increased page size for database writes. It is suited to hosts with more RAM.
python -u -m dorieh.platform.loader.project_loader --domain my_schema --data /data/incoming/valuable/data/ --reset --registry my_temp_schema.yaml --pattern *.csv --db database.ini --connection postgres --threads 4 --page 10000
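As a rough mental model of the page size, records can be thought of as being written in batches of at most `--page` rows. The sketch below shows only that batching idea, under the assumption that a page is a group of records written together; it is not the loader's code, and the generator name `pages` is made up.

```python
from itertools import islice


def pages(records, page_size):
    """Illustrative only: yield successive lists of at most page_size records."""
    it = iter(records)
    while True:
        page = list(islice(it, page_size))
        if not page:
            return  # input exhausted
        yield page


# 25,000 records split into pages of 10,000.
sizes = [len(p) for p in pages(range(25_000), 10_000)]
print(sizes)
```

Larger pages mean fewer round trips to the database but more records held in memory at once, which is why a bigger page size fits hosts with more RAM.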
To load a single file one can use a command like this:
python -u -m dorieh.platform.loader.project_loader --domain my_schema --data /data/incoming/valuable/test_file.csv --registry my_temp_schema.yaml --reset --db database.ini --connection postgres
Dry runs (introspect only)
To just introspect files in a directory and generate a YAML schema for
the project (see domain schema specification for
the description of the format) without modifying the database,
use a dry run. On the command line, pass the --dryrun option.
A dry run creates a “registry” file that can be manually examined and modified. The following variant of the command described above performs a dry run:
python -u -m dorieh.platform.loader.project_loader --domain my_schema --data /data/incoming/valuable/data/ --registry my_temp_schema.yaml --dryrun --pattern *.csv
This command creates a file named my_temp_schema.yaml.
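When the dry run is launched from a Python script rather than typed into a shell, the command can be assembled as an argument list. The sketch below only builds and prints the command; the helper name `dry_run_command` is hypothetical, and actually executing it (for example with `subprocess.run(cmd, check=True)`) requires the dorieh package to be installed.

```python
import shlex
import sys


def dry_run_command(domain, data, registry, pattern):
    """Illustrative only: argument list for a dry run of the project loader."""
    return [
        sys.executable, "-u", "-m", "dorieh.platform.loader.project_loader",
        "--domain", domain,
        "--data", data,
        "--registry", registry,
        "--dryrun",
        "--pattern", pattern,
    ]


cmd = dry_run_command(
    "my_schema", "/data/incoming/valuable/data/", "my_temp_schema.yaml", "*.csv"
)
# shlex.join shows the equivalent, safely quoted shell command line.
print(shlex.join(cmd))
```

Passing the pattern as a separate list element means that no shell expands `*.csv` before the loader sees it. When typing the command in a shell, quoting the pattern has the same effect.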
API Usage
Example of API usage retrieving command line arguments:
loader = ProjectLoader()
loader.run()
More advanced usage:
config = LoaderConfig(__doc__).instantiate()
config.pattern = "**/*.csv.gz"
loader = ProjectLoader(config)
loader.run()