The pg_export_parquet Module

This utility uses the PyArrow library to write Parquet files. It can write very large files containing hundreds of billions of records.

To write such huge files, use hard partitioning (the --hard flag) to work around a memory leak in the Arrow library.
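The idea behind hard partitioning is to execute one separate SQL statement per partition value and finish each partition's Parquet write before starting the next, so the writer's memory can be released between partitions. A minimal sketch of that control flow, with hypothetical helper names (run_query, write_partition are not part of this module):

```python
def hard_partition_export(partition_values, run_query, write_partition):
    """Export one partition at a time: each partition gets its own SQL
    statement and its own self-contained Parquet write, so memory held
    by the writer is released before the next partition starts."""
    for value in partition_values:
        rows = run_query(value)        # separate SQL statement per partition
        write_partition(value, rows)   # write and close before moving on
```

This trades one long-running query for many smaller ones, which is why it is only recommended for very large exports.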

It can export either a table or the result of an SQL query.

Usage

pg_export_parquet.py [-h] [--sql SQL] [--schema SCHEMA] [--table TABLE]
                        [--partition PARTITION [PARTITION ...]] --output
                        OUTPUT --db DB --connection CONNECTION
                        [--batch_size BATCH_SIZE] [--hard]


options:
  -h, --help            show this help message and exit
  --sql SQL, -s SQL     SQL query or a path to a file containing the SQL query
  --schema SCHEMA       Export all columns for all tables in the given schema
  --table TABLE, -t TABLE
                        Export all columns of a given table (fully qualified
                        name required)
  --partition PARTITION [PARTITION ...], -p PARTITION [PARTITION ...]
                        Columns to be used for partitioning
  --output OUTPUT, --destination OUTPUT, -o OUTPUT
                        Path to the directory where the files will be exported
  --db DB               Path to a database connection parameters file
  --connection CONNECTION, -c CONNECTION
                        Section in the database connection parameters file
  --batch_size BATCH_SIZE, -b BATCH_SIZE
                        The size of a single batch
  --hard                Hard partitioning: execute separate SQL statement for
                        each partition (for writing huge files).
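The connection parameters file referenced by --db is presumably an INI-style file with one section per connection; the section name is what --connection selects. The keys below are an assumption (standard PostgreSQL connection parameters), not confirmed by the module:

```ini
[prod]
host = db.example.com
port = 5432
dbname = sales
user = exporter
password = secret
```

A hypothetical invocation exporting a table partitioned by a year column, with hard partitioning enabled for a very large result (table, column, and paths are placeholders):

    pg_export_parquet.py --table public.orders --partition year --hard \
        --db connections.ini --connection prod --output /data/parquet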

API

A command-line utility to export the results of an SQL query as Parquet files.

result_set(conn: connection, sql: str, cursor_name: str, batch_size: int)
index_of(text: str, token: str)

class PgPqBase(cnxn: connection, sql: str, destination: str)
    setup_schema()
    static type_pg2pq(vtype: str)
    metadata_sql() → str
    get_metadata() → List[Tuple]
    set_partitioning(columns: List[str])
    abstract export()
    classmethod run()
    classmethod export_sql(arguments, sql)
    classmethod export_table(arguments, table: str)
    classmethod export_schema(arguments)

class PgPqSingleQuery(cnxn: connection, sql: str, destination: str, mode: str, schema: Optional[Schema] = None)
    static normalize_value(v)
    transform(row: Dict) → Dict
    set_partitioning(columns: List[str], values: Optional[List] = None)
    export()
    batch(data: List[Dict])
    batches()

class PgPqPartitionedQuery(cnxn: connection, sql: str, destination: str, partition: List[str])
    static qualify_column(sql: str, column: str, idx_select: int)
    set_partition_values()
    set_partitioning(partition: List[str])
    export()
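The result_set function takes a cursor_name and a batch_size, which suggests it streams rows through a server-side cursor in fixed-size batches rather than loading the whole result into memory. The batching pattern itself, independent of the database driver, can be sketched as follows (a generic sketch, not the module's actual code):

```python
from typing import Dict, Iterable, Iterator, List

def batches(rows: Iterable[Dict], batch_size: int) -> Iterator[List[Dict]]:
    """Group an iterable of rows into lists of at most batch_size rows."""
    batch: List[Dict] = []
    for row in rows:
        batch.append(row)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # flush the final, possibly short, batch
        yield batch
```

Each yielded batch can then be converted to an Arrow record batch and appended to the Parquet output, keeping memory usage proportional to batch_size rather than to the full result set.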