Date: 2025-06-08
I recently had to dump a large table (18 GB) to a CSV file. I didn’t actually need a CSV file; I needed a Parquet file to analyze locally with Polars. But since the database was on AWS RDS, I couldn’t use the pg_parquet extension to dump directly to Parquet.
Dumping to CSV and then converting that into Parquet seemed the fastest way to get what I wanted. I also considered pg_dump with the -Fd format and the -j flag to dump the table in parallel. I could’ve then loaded the dump into my local database and dumped it to Parquet using pg_parquet.

Unfortunately this wasn’t viable, as I actually needed to dump the table with a lot of relevant data joined in from other tables. I would’ve had to dump the joined tables as well and perform the join locally.
My first attempt was:
psql -c "\copy (<my massive query>) to 'myfile.csv' csv header" <db url>After observing the progress a bit, I realised it would take 18 hours to dump in this manner. Far too long.
I thought I could split the work into batches and dump the CSV files in parallel to speed things up. Unfortunately I didn’t find any good tooling to do this.
Batching is fastest if it’s done by some indexed column. I figured I could write a quick script to come up with a list of ID ranges that split the entire table into chunks of 500k rows:
import os

import psycopg2  # assuming a psycopg2 connection here; any DB-API driver works the same way

connection = psycopg2.connect(os.environ["DATABASE_URL"])

# Find the full ID range of the table
with connection.cursor() as cursor:
    cursor.execute('select min(id), max(id) from mytable')
    min_id, max_id = cursor.fetchone()

batch_size = 500000

# Inclusive (start, end) ID ranges covering min(id)..max(id) in steps of batch_size
batches = [
    (id_start, min(max_id, id_start + batch_size - 1))
    for id_start in range(min_id, max_id + 1, batch_size)
]

Now I could construct the commands needed to dump N CSV files:
for index, (id_start, id_end) in enumerate(batches):
    # One \copy command per batch, each writing its own CSV file
    command = f"psql \"$DATABASE_URL\" -c \"\\copy (SELECT * FROM mytable WHERE id >= {id_start} AND id <= {id_end}) TO './my_table_{index}.csv' CSV HEADER\""
    print(command)

Just running the script would produce psql commands to
dump the table into N CSV files:
$ python batches.py
psql "$DATABASE_URL" -c "\copy (SELECT * FROM mytable WHERE id >= 0 AND id <= 500000)"
psql "$DATABASE_URL" -c "\copy (SELECT * FROM mytable WHERE id >= 500001 AND id <= 1000000)"
...Now that we have a list of commands we can use GNU parallel to execute the batches in parallel:
python batches.py | parallel -j8

I was able to keep an eye on the progress by opening another
connection and watching the COPY progress (using SQL):

SELECT * FROM pg_stat_progress_copy \watch 1

Once the dump finished, I was able to use DuckDB to shove the CSV files into a Parquet file using the following SQL script:
CREATE TABLE mytable (
id BIGINT,
...<other columns>
);
INSERT INTO mytable SELECT * FROM read_csv('./my_table_0.csv');
INSERT INTO mytable SELECT * FROM read_csv('./my_table_1.csv');
INSERT INTO mytable SELECT * FROM read_csv('./my_table_2.csv');
INSERT INTO mytable SELECT * FROM read_csv('./my_table_3.csv');
...
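-- (read_csv also accepts glob patterns, so a single
--  INSERT INTO mytable SELECT * FROM read_csv('./my_table_*.csv');
--  should work just as well as the per-file inserts)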
COPY mytable TO 'mytable.parquet';

One simple command and a few seconds of patience and I had my Parquet file:
duckdb -init import.sql

I later realised I could’ve also used a materialized view on the Postgres server to work around the fact that I actually needed to dump the output of a large query.
I could have created a materialized view as:
CREATE MATERIALIZED VIEW myview AS SELECT * FROM mytable INNER JOIN ... WHERE ...;

I could’ve then dumped this in parallel using:
pg_dump -Fd -j8 -t myview --file "mydump"

I could’ve restored that locally as:
pg_restore --exit-on-error -Fd -d "mylocaldb" "mydump"

After restoring the dump locally, I could’ve dumped a Parquet file to disk using:
psql "$DATABASE_URL" -c "COPY (ELECT * FROM myview) TO STDOUT (FORMAT parquet)" > mytable.parquetOf course, this assumes you have write access to create the materialized view. Without that, my initial solution would’ve been the most straightforward.