Date: 2025-06-08
I recently had to dump a large table (18 GB) to a CSV file. I didn’t actually need a CSV file; I needed a Parquet file to analyze locally with Polars. But since the database was on AWS RDS, I couldn’t use the pg_parquet extension to dump directly to Parquet.
Dumping to CSV and then converting that into Parquet seemed the fastest way to get what I wanted. I also considered pg_dump with the -Fd format and the -j flag to dump the table in parallel. I could’ve then loaded the dump into my local database and dumped it to Parquet using pg_parquet.

Unfortunately this wasn’t viable, as I actually needed to dump the table with a lot of relevant data joined in from other tables. I would’ve had to dump the joined tables as well and perform the join locally.
My first attempt was:
psql -c "\copy (<my massive query>) to 'myfile.csv' csv header" <db url>After observing the progress a bit, I realised it would take 18 hours to dump in this manner. Far too long.
I thought I could split the work into batches and dump the CSV files in parallel to speed things up. Unfortunately I didn’t find any good tooling to do this.
Batching is fastest if it’s done by some indexed column. I figured I could write a quick script to come up with a list of ID ranges that split the entire table into chunks of 500k rows:
import os

import psycopg2  # assuming a psycopg2 connection here; any DB-API driver works the same way

connection = psycopg2.connect(os.environ["DATABASE_URL"])

# Find the full ID range of the table
with connection.cursor() as cursor:
    cursor.execute('select min(id), max(id) from mytable')
    min_id, max_id = cursor.fetchone()

batch_size = 500000

# Inclusive (start, end) ID ranges covering min(id)..max(id) in steps of batch_size
batches = [
    (id_start, min(max_id, id_start + batch_size - 1))
    for id_start in range(min_id, max_id + 1, batch_size)
]

Now I could construct the commands needed to dump N CSV files:
for index, (id_start, id_end) in enumerate(batches):
    # One \copy command per batch, each writing its own CSV file
    command = f"psql \"$DATABASE_URL\" -c \"\\copy (SELECT * FROM mytable WHERE id >= {id_start} AND id <= {id_end}) TO './my_table_{index}.csv' CSV HEADER\""
    print(command)

Just running the script would produce psql commands to
dump the table into N CSV files:
$ python batches.py
psql "$DATABASE_URL" -c "\copy (SELECT * FROM mytable WHERE id >= 0 AND id <= 500000)"
psql "$DATABASE_URL" -c "\copy (SELECT * FROM mytable WHERE id >= 500001 AND id <= 1000000)"
...Now that we have a list of commands we can use GNU parallel to execute the batches in parallel:
python batches.py | parallel -j8

I was able to keep an eye on the progress by opening another
connection and watching the COPY progress (using SQL):

SELECT * FROM pg_stat_progress_copy \watch 1

Once the dump finished, I was able to use DuckDB to shove the CSV files into a Parquet file using the following SQL script:
CREATE TABLE mytable (
id BIGINT,
...<other columns>
);
INSERT INTO mytable SELECT * FROM read_csv('./my_table_0.csv');
INSERT INTO mytable SELECT * FROM read_csv('./my_table_1.csv');
INSERT INTO mytable SELECT * FROM read_csv('./my_table_2.csv');
INSERT INTO mytable SELECT * FROM read_csv('./my_table_3.csv');
...
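-- (read_csv also accepts glob patterns, so a single
--  INSERT INTO mytable SELECT * FROM read_csv('./my_table_*.csv');
--  should work just as well as the per-file inserts)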
COPY mytable TO 'mytable.parquet';

One simple command and a few seconds of patience and I had my Parquet file:
duckdb -init import.sql

I later realised I could’ve also used a materialized view on the Postgres server to work around the fact that I actually needed to dump the output of a large query.
I could have created a materialized view as:
CREATE MATERIALIZED VIEW myview AS SELECT * FROM mytable INNER JOIN ... WHERE ...;

I could’ve then dumped this in parallel using:
pg_dump -Fd -j8 -t myview --file "mydump"

I could’ve restored that locally as:
pg_restore --exit-on-error -Fd -d "mylocaldb" "mydump"

After restoring the dump locally, I could’ve dumped a Parquet file to disk using:
psql "$DATABASE_URL" -c "COPY (ELECT * FROM myview) TO STDOUT (FORMAT parquet)" > mytable.parquetOf course, this assumes you have write access to create the materialized view. Without that, my initial solution would’ve been the most straightforward.