DuckDB httpfs column pushdown over Parquet on S3

When querying a Parquet file on S3 with httpfs, DuckDB pushes column projections and row-group filters down to the HTTP range requests it issues, so it fetches only the bytes it needs: the footer first, then the column chunks of the row groups that can contain matching rows.

[Interactive pushdown simulator. Controls: column statistics in footer, row-group size, predicate selectivity (rows matched), columns projected. Example state: 64 MB row groups, 5% of rows matched, 2 of 12 columns projected, valid stats. Result: DuckDB reads 2% of the bytes of a full scan (all columns, all row groups), saving 98% of the I/O.]

The model assumes the predicate column is clustered. Bigger row groups make skipping coarser, so smaller groups plus valid stats win.

INSTALL httpfs;
LOAD httpfs;
 
SET s3_region = 'eu-west-1';
 
-- Only fetches the 'ts' and 'value' columns from each row group
-- Row groups whose max ts is below '2026-01-01' are skipped entirely (if stats valid)
SELECT ts, value
FROM read_parquet('s3://my-bucket/events/*.parquet')
WHERE ts >= '2026-01-01'
  AND value > 100
LIMIT 1000;
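The decision behind that query can be sketched in plain Python. This is an illustrative model, not DuckDB's implementation: the `RowGroup` class, the `plan_reads` helper and the byte sizes are all made up for the example. It shows the two savings separately: row groups are dropped using the max statistic of `ts`, and within the surviving row groups only the projected columns are read.

```python
from dataclasses import dataclass


@dataclass
class RowGroup:
    """Hypothetical footer entry: min/max of ts plus compressed bytes per column."""
    ts_min: str
    ts_max: str
    column_bytes: dict  # column name -> compressed size of that column chunk


def plan_reads(row_groups, projected, ts_lower_bound):
    """Pick the (row group index, column, size) chunks needed for ts >= ts_lower_bound."""
    chunks = []
    for index, rg in enumerate(row_groups):
        # If even the largest ts is below the bound, no row can match: skip the group.
        if rg.ts_max < ts_lower_bound:
            continue
        for column in projected:
            chunks.append((index, column, rg.column_bytes[column]))
    return chunks


# Made-up sizes; ISO date strings compare correctly as plain strings.
row_groups = [
    RowGroup("2025-11-01", "2025-11-30", {"ts": 8, "value": 8, "payload": 84}),
    RowGroup("2025-12-01", "2025-12-31", {"ts": 8, "value": 8, "payload": 84}),
    RowGroup("2026-01-01", "2026-01-31", {"ts": 8, "value": 8, "payload": 84}),
]
chunks = plan_reads(row_groups, ["ts", "value"], "2026-01-01")
bytes_read = sum(size for _, _, size in chunks)
bytes_total = sum(sum(rg.column_bytes.values()) for rg in row_groups)
print(chunks)
print(f"{bytes_read} of {bytes_total} bytes")
```

With these toy numbers only the third row group survives, and only its `ts` and `value` chunks are read: 16 of 300 bytes.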

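The catch: row-group skipping only pays off if:

Column statistics (min/max) are present in the Parquet footer. Spark and pandas/pyarrow write them by default; in pyarrow the switch is write_statistics=True
Row group size is reasonable: very large row groups (512MB+) make skipping coarser and reduce pushdown effectiveness
The predicate is selective (it matches a small fraction of rows) and the predicate column is clustered, so the matching rows fall in few row groups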
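The effect of clustering and row-group size on skipping can be seen with a small simulation. This is a toy model in plain Python with a hypothetical `skippable_fraction` helper: it slices a column into fixed-size groups, takes each group's max as its statistic, and counts how many groups a `>=` predicate could skip. Row counts stand in for row-group size here.

```python
import random


def skippable_fraction(values, rows_per_group, lower_bound):
    """Fraction of row groups whose max is below lower_bound (skippable for >=)."""
    groups = [values[i:i + rows_per_group]
              for i in range(0, len(values), rows_per_group)]
    skipped = sum(1 for group in groups if max(group) < lower_bound)
    return skipped / len(groups)


random.seed(0)
clustered = list(range(10_000))                      # predicate column sorted on write
shuffled = random.sample(clustered, len(clustered))  # same values, random order
bound = 9_500                                        # predicate matches the top 5% of rows

for rows_per_group in (100, 1_000, 5_000):
    print(rows_per_group,
          skippable_fraction(clustered, rows_per_group, bound),
          skippable_fraction(shuffled, rows_per_group, bound))
```

On the clustered column the skippable fraction falls from 0.95 to 0.9 to 0.5 as the groups grow, even though the predicate always matches 5% of rows. On the shuffled column nearly every group contains at least one matching value, so the same statistics allow almost no skipping.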

Check your file's statistics:

import pyarrow.parquet as pq
 
pf = pq.ParquetFile("events.parquet")
print(pf.metadata.row_group(0).column(0).statistics)
# => <pyarrow._parquet.Statistics object>
#    has_min_max: True, min: 2026-01-01, max: 2026-01-31

If has_min_max is False, DuckDB cannot skip any row group on that column: every row group is fetched for the projected columns, regardless of your WHERE clause.
