The pg_export_parquet Module
This utility uses the PyArrow library to write Parquet files. It is capable of writing very large files containing hundreds of billions of records.
To write such huge files, use hard partitioning to work around a memory leak in the Arrow library.
It can export either a table or the result of an SQL query.
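Conceptually, the export streams query results from PostgreSQL in fixed-size batches and appends each batch to a Parquet file, which keeps memory usage roughly constant regardless of result size. The sketch below illustrates that approach only; it is not the module's actual implementation, it assumes psycopg2 as the database driver, and the function and file names are illustrative.

    import psycopg2                      # assumption: driver chosen for illustration only
    import pyarrow as pa
    import pyarrow.parquet as pq

    def export_query(dsn, sql, output_path, batch_size=100_000):
        """Stream a query result into a Parquet file in fixed-size batches."""
        conn = psycopg2.connect(dsn)
        try:
            # A named (server-side) cursor avoids loading the full result set at once.
            with conn.cursor(name="parquet_export") as cur:
                cur.execute(sql)
                writer = None
                while True:
                    rows = cur.fetchmany(batch_size)
                    if not rows:
                        break
                    columns = [desc[0] for desc in cur.description]
                    batch = pa.RecordBatch.from_pylist(
                        [dict(zip(columns, row)) for row in rows]
                    )
                    if writer is None:
                        # Infer the Arrow schema from the first batch (the real tool
                        # may determine the schema more robustly).
                        writer = pq.ParquetWriter(output_path, batch.schema)
                    writer.write_batch(batch)
                if writer is not None:
                    writer.close()
        finally:
            conn.close()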
Usage
pg_export_parquet.py [-h] [--sql SQL] [--schema SCHEMA] [--table TABLE]
                     [--partition PARTITION [PARTITION ...]]
                     --output OUTPUT --db DB --connection CONNECTION
                     [--batch_size BATCH_SIZE] [--hard]
options:
-h, --help show this help message and exit
--sql SQL, -s SQL SQL query or a path to a file containing an SQL query
--schema SCHEMA Export all columns for all tables in the given schema
--table TABLE, -t TABLE
Export all columns of a given table (fully qualified name required)
--partition PARTITION [PARTITION ...], -p PARTITION [PARTITION ...]
Columns to be used for partitioning
--output OUTPUT, --destination OUTPUT, -o OUTPUT
Path to the directory where the files will be exported
--db DB Path to a database connection parameters file
--connection CONNECTION, -c CONNECTION
Section in the database connection parameters file
--batch_size BATCH_SIZE, -b BATCH_SIZE
The size of a single batch
--hard Hard partitioning: execute a separate SQL statement for
each partition (for writing huge files).
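For example, an export of a single table partitioned by one column, and a hard-partitioned export of an SQL query, could look like the following. The connection file name, section name, paths, and batch size are illustrative, not defaults of the tool.

    pg_export_parquet.py --table public.events --partition event_date \
        --db connections.ini --connection prod --output /data/export/events

    pg_export_parquet.py --sql big_query.sql --partition region --hard \
        --batch_size 500000 --db connections.ini --connection prod \
        --output /data/export/big_query

The second invocation uses --hard, so a separate query is issued for each region value; as noted above, this is the way to write very large exports without running into the Arrow memory leak.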
API
A command-line utility to export the results of an SQL query as Parquet files