So with traditional Parquet this is usually handled through “sane” partitioning.
Heavily simplified version — Each partition is a separate file containing a bunch of table rows. And partition splits are determined by the values in those rows.
If you’ve got data with a date column (sign-up date, order date, something like that), you’d partition on a YYYY-MM field you derive early on.
Each time you run a query filtering by YYYY-MM, your OLAP query tool no longer needs to read a bunch of files from disk or S3. If you only want to look at 2023-12, then you only need to read one file to run the query.
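Rough sketch of the idea in plain Python with CSV files (real setups would write Parquet via something like pyarrow or Spark — the column names and `order_month=` directory naming here are just illustrative, loosely echoing Hive-style partition paths):

```python
import csv
import os
import tempfile
from collections import defaultdict

# toy table rows; "order_date" is the column we derive the partition key from
rows = [
    {"order_id": "1", "order_date": "2023-11-03", "amount": "40"},
    {"order_id": "2", "order_date": "2023-12-14", "amount": "25"},
    {"order_id": "3", "order_date": "2023-12-20", "amount": "60"},
]

base = tempfile.mkdtemp()

# group rows by a YYYY-MM key derived from the date column
parts = defaultdict(list)
for r in rows:
    parts[r["order_date"][:7]].append(r)

# write one file per partition value
for ym, group in parts.items():
    path = os.path.join(base, f"order_month={ym}.csv")
    with open(path, "w", newline="") as f:
        w = csv.DictWriter(f, fieldnames=["order_id", "order_date", "amount"])
        w.writeheader()
        w.writerows(group)

# a query filtered to 2023-12 only has to open that one partition file;
# the 2023-11 file is never read
with open(os.path.join(base, "order_month=2023-12.csv")) as f:
    dec_rows = list(csv.DictReader(f))

total = sum(int(r["amount"]) for r in dec_rows)
print(len(dec_rows), total)  # 2 rows, total 85
```

Same trick, just with Parquet instead of CSV: the query planner looks at the partition values encoded in the file paths and prunes everything that can’t match the filter before reading any data.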
Edit — OLAP kinda stuff is all about getting the data “slices” nicely organised for queries people will run later.