When querying a Parquet file on S3 with httpfs, DuckDB pushes column projection and row-group filtering down into the HTTP range requests it issues, so it only fetches the bytes it actually needs.
```sql
INSTALL httpfs;
LOAD httpfs;
SET s3_region = 'eu-west-1';
-- Only fetches the 'ts' and 'value' columns from each row group
-- Row groups where ts < '2026-01-01' are skipped entirely (if stats valid)
SELECT ts, value
FROM read_parquet('s3://my-bucket/events/*.parquet')
WHERE ts >= '2026-01-01'
AND value > 100
LIMIT 1000;
```

The catch: pushdown only works if:
- Column statistics are present in the Parquet footer (Spark writes them by default, and pandas/pyarrow do too via write_statistics=True; see the writer sketch after this list).
- Row group size is reasonable: very large row groups (512 MB+) reduce pushdown effectiveness, because DuckDB can only skip or fetch whole row groups.
- The predicate column is selective, and ideally sorted or clustered, so that row group min/max ranges don't all overlap the filter.
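If you control the writer, the first two points can be handled when the file is produced. A minimal pyarrow sketch, assuming a hypothetical stand-in for the events data and an illustrative row group cap (note that pyarrow's row_group_size is a row count, not bytes):

```python
from datetime import datetime

import pyarrow as pa
import pyarrow.parquet as pq

# Hypothetical stand-in for the events data referenced above.
table = pa.table({
    "ts": pa.array([datetime(2026, 1, 5), datetime(2026, 1, 12)]),
    "value": pa.array([42, 137], type=pa.int64()),
})

# write_statistics=True is pyarrow's default; stating it makes the intent explicit.
# row_group_size caps rows per row group so the reader has reasonably small
# units it can skip based on the footer's min/max statistics.
pq.write_table(
    table,
    "events.parquet",
    write_statistics=True,
    row_group_size=1_000_000,
)
```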
Check your file's statistics:
```python
import pyarrow.parquet as pq
pf = pq.ParquetFile("events.parquet")
print(pf.metadata.row_group(0).column(0).statistics)
# => <pyarrow._parquet.Statistics object>
# has_min_max: True, min: 2026-01-01, max: 2026-01-31
```

If has_min_max is False, you're doing a full scan regardless of your WHERE clause.
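The single print above only covers the first column of the first row group. A short loop over the footer metadata, again a sketch assuming the same local events.parquet, flags every column chunk that lacks min/max statistics:

```python
import pyarrow.parquet as pq

pf = pq.ParquetFile("events.parquet")
meta = pf.metadata

# Walk every column chunk in every row group and report the ones whose
# footer statistics are absent or lack min/max values (i.e. unprunable).
for rg in range(meta.num_row_groups):
    for col in range(meta.num_columns):
        chunk = meta.row_group(rg).column(col)
        stats = chunk.statistics
        if stats is None or not stats.has_min_max:
            print(f"row group {rg}, column {chunk.path_in_schema}: no min/max stats")
```

If this reports the column in your WHERE clause, rewriting the file with statistics enabled (as in the writer sketch above) restores pushdown.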