This is awesome, thanks for creating this. I've had to write some absolutely wonky scripts to dump a PostgreSQL database into Parquet, or read a Parquet file into PostgreSQL. Usually it was some terrible combination of psycopg and pyarrow, which worked, but was ad-hoc and slightly different every time.
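For anyone curious what that wonky-script pattern looks like, here's a minimal sketch (connection string and table name are made up, and a real version would need batching for large tables):

```python
# Rough sketch of the ad-hoc psycopg + pyarrow dump described above.
import psycopg2
import pyarrow as pa
import pyarrow.parquet as pq

conn = psycopg2.connect("dbname=mydb")          # hypothetical DSN
cur = conn.cursor()
cur.execute("SELECT * FROM events")             # hypothetical table
colnames = [desc[0] for desc in cur.description]
rows = cur.fetchall()

# Transpose the row-oriented result set into columns and write a Parquet file.
columns = {name: [row[i] for row in rows] for i, name in enumerate(colnames)}
pq.write_table(pa.table(columns), "events.parquet")
```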
A lot of other commenters are talking about `pg_duckdb` which maybe also could've solved my problem, but this looks quite simple and clean.
I hope for some kind of near-term future where there's a standard-ish, analytics-friendly data archival format. I think Parquet is the closest thing we have now.
Looks like he got a master's degree from UIUC and did some research on FFT implementations. Seems to have been successful. What makes you say he was 'considered a "bad student"'?
(This is a genuine question. I've never met Alex in person, but if an applicant to my lab spent their free time diving into SIMD implementations and breaking records for computing mathematical constants, I'd rush to hire them. Not that either of those two things is a requirement, of course.)
"However, ever since grade school, I've always sucked in terms of grades and standardized tests. I graduated from Palo Alto High School in the bottom quartile among all the college-bound students. My GPA was barely a 3.0 at graduation, so it was somewhat miraculous that I got accepted into Northwestern University at all."[1]
I know he has a master's and all, but he is spectacular. Hundreds of thousands of people have master's degrees. He is more impressive than 99% of professors, and academia doesn't even acknowledge him as their peer.
The fact that he got into Northwestern is proof in itself that the “system” didn’t consider him to be a bad student. It actually seems like the system did a pretty good job of identifying sheer intellectual horsepower and potential despite the self-professed low GPA and standardized test scores.
In the grand hierarchy of college admissions committees ranking people, "getting into northwestern" means roughly "the 20,000th best student in his class year."
and that was a "catastrophic" outcome? When I think of "catastrophic", it would be something like ending up institutionalized, or dead, not ranked in the top 1% of college applicants.
I'll put it this way. He clearly loves teaching. He's better at it than almost any teacher I ever had. He's clearly very intelligent. He loves doing work that interests him, as opposed to lucrative professional tasks.
I'm not in his head, but it sounds like his ideal job would be professor.
Let's presume that second paragraph is true. He doesn't have that job, let alone have it at a top university. He hasn't been given a PhD (a de facto minimum requirement for that job). UIUC is great, but it's a tier below other schools, which presumably didn't admit him (or made him pay too much).
But saying he's "ranked in the top 1%" is very disingenuous to this conversation. The 99th percentile basketball player doesn't make a college team. And I'm saying he belongs on the NBA All-Star team. And yes, it would be a "catastrophically bad" process if Anthony Davis couldn't have made a college team. But the difference between the 99th percentile and the 99.9999th percentile is enormous, and mistaking one for the other in any context is terrible.
If you double the number of keys and you double the number of bins (load factor stays constant), then the problem becomes much worse very quickly.
If you double the number of keys and you double the size of each bin (load factor stays constant), then the problem diminishes as you suggest. BUT, larger bins are more sensitive to changes in load factor.
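To make that concrete, here's a little back-of-the-envelope check (my own illustration, not from the post: it approximates the count in one bin as Poisson(load_factor * capacity); the capacities and load factors are arbitrary):

```python
# Overflow probability of a single bin under a Poisson approximation.
import math

def poisson_overflow(capacity, load_factor, extra=2000):
    lam = load_factor * capacity
    # P(X > capacity): sum the upper tail, computing each pmf term in log space.
    tail = 0.0
    for k in range(capacity + 1, capacity + extra):
        log_pmf = k * math.log(lam) - lam - math.lgamma(k + 1)
        tail += math.exp(log_pmf)
    return tail

for capacity in (10, 100, 1000):
    p70 = poisson_overflow(capacity, 0.70)
    p80 = poisson_overflow(capacity, 0.80)
    print(capacity, p70, p80, p80 / p70)
```

Bigger bins overflow far less often at a fixed load factor, but the ratio between the 0.70 and 0.80 columns grows enormously with capacity, which is exactly the sensitivity being described.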
This is my post from 2018 (I didn't submit it to HN), and it could definitely use a "here's what practical systems do" update! I'll put it on the TODO list...
Your point about systems dealing with a relatively small number of large objects vs. small objects also makes sense: this is essentially the "cost" of an overflow (4kb spills once in a blue moon? Oh well, handle that as a special case. 4TB spills once in a blue moon? The system might crash). This is more obvious, as you also point out, in load balancing.
One aspect I found very counter-intuitive: before this investigation, I would've guessed that having a large number of large bins makes overflow increasingly unlikely. This is only partially true: more bins is obviously good, but larger bins are actually more sensitive to changes in load factor!
Overall, I think you are right that this is not really a concern in modern systems today. Compared to Dynamo, I still think Vimeo's solution (linked at the bottom of the post) is both intuitive and low-complexity. But regardless, more of an interesting mathematical diversion than a practical systems concern these days.
I'm curious to hear a bit more of your opinion. For example, I'm surprised that syscall latency is something near the top of your list. I think the usual wisdom in the DB community is that the cost models are mostly fine, but the cardinality estimation is really bad.
In terms of the deferred / alternative planning, do you think adaptive query execution is a reasonable way to achieve this? It certainly allows for information early in the query execution to impact later plans. My worry with these approaches is that if you get the first couple of joins wrong (which is not uncommon), unless you have something like Yannakakis/SIPs, you still can't recover.
I am obviously biased on the whole "ML for query optimization" thing. One thing I would note is that every "ML for planning" approach I've seen does, under the hood, use ML for cost discovery/estimation. These approaches are just trying to balance the data they collect (exploration) with the quality of the plans they produce (exploitation). Interestingly, if you use ML in a way that is completely removed from planning, you actually get worse query plans despite more accurate estimates: https://people.csail.mit.edu/tatbul/publications/flowloss_vl... (again, I've got a horse in this race, so my opinion should come with a side of salt :D)
As a DBA I have a very hard time properly tuning parameters like random_page_cost, and the default of 4.0 is no longer appropriate for most database servers.
I don't want to be tuning this: it takes a lot of time to do properly, and I haven't retested it in a long time. I just set something that has worked in the past, which is probably bad.
I completely agree that Postgres should be able to figure this out on its own. This is just an example, there are more such parameters which should be adjusted to the hardware but most people will leave the defaults.
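To be clear, the change itself is trivial; it's knowing the right number that's hard. Something like this (1.1 is a common rule of thumb for SSD-backed servers, not a measured value; the DSN is made up):

```python
# Set random_page_cost cluster-wide and reload the config.
import psycopg2

conn = psycopg2.connect("dbname=postgres")   # hypothetical DSN; needs sufficient privileges
conn.autocommit = True                       # ALTER SYSTEM can't run inside a transaction block
cur = conn.cursor()
cur.execute("ALTER SYSTEM SET random_page_cost = 1.1")
cur.execute("SELECT pg_reload_conf()")
cur.execute("SHOW random_page_cost")
print("random_page_cost is now", cur.fetchone()[0])
```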
Syscall latency was a much bigger deal back when we had spinning disks, but it still matters today (you can get dramatically different plans depending on slightly different costs of pulling a page from disk), and I find it silly that we've never even measured those costs. A bigger impact might be for SQL functions...all of the Postgres SQL functions have configured cost values, but they could easily be measured. Also, a simple cost model for functions can be a dramatic oversimplification. For example, some PostGIS functions have O(n) or O(n^x) behavior depending on the size and complexity of the input geometry. If we could measure exact costs, or model costs with statistical distributions, or possibly predict with ML, that would be a huge improvement.
My opinion on ML is that there is nothing in the execution planning side that couldn't be modeled and solved as a linear program, with extremely fast and mathematically-optimal results. By trying to use ML for the planning part, you're really just using ML to reverse engineer LP solvers, and it is a poor use of the compute resources.
The reason why some ML planners might have better results than typical SQL query planners is because typical SQL engines are optimized towards OLTP workloads that require small transactions executed very quickly. In order to do that, they purposefully don't explore the true planning space...they might explore 3-10 alternative ways of executing, whereas there might be hundreds or thousands of ways to do the same thing. While Postgres has explicitly chosen to not implement planning pragmas to override planner behavior, it would be really cool if you could have multiple planners optimized for different types of workloads, and be able to explicitly choose a query planner that takes 3 seconds to plan for a query that takes 1hr to execute and for which a better plan could save several minutes. I would even love a fairly naive query planner which does index scans for a deterministic and exact cardinality before planning joins.
BTW, I really like your blog and your research focuses. You're working on exceptionally hard problems that have a huge impact on computing.
Where I think ML would be much better than what we (postgres) do is in iteratively improving selectivity estimation. In today's postgres there's zero feedback from noticing at runtime that the collected statistics lead to bad estimates. In a better world we'd use that knowledge to improve future selectivity estimates.
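You can approximate a crude version of that feedback loop from the outside (nothing like doing it in the planner, of course). A sketch that pulls estimated vs. actual row counts out of EXPLAIN (ANALYZE, FORMAT JSON); the DSN and query are hypothetical:

```python
# Compare the planner's row estimates to actual row counts, node by node.
import json
import psycopg2

def estimation_errors(cur, query):
    cur.execute("EXPLAIN (ANALYZE, FORMAT JSON) " + query)
    raw = cur.fetchone()[0]
    doc = raw if isinstance(raw, list) else json.loads(raw)
    errors = []

    def walk(node):
        errors.append((node["Node Type"], node["Plan Rows"], node["Actual Rows"]))
        for child in node.get("Plans", []):
            walk(child)

    walk(doc[0]["Plan"])
    return errors

conn = psycopg2.connect("dbname=mydb")       # hypothetical DSN
cur = conn.cursor()
for node_type, est, actual in estimation_errors(cur, "SELECT * FROM a JOIN b USING (id)"):
    print(f"{node_type}: estimated {est}, actual {actual}")
```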
> In order to do that, they purposefully don't explore the true planning space...they might explore 3-10 alternative ways of executing, whereas there might be hundreds or thousands of ways to do the same thing.
FWIW, often postgres' planner explores many more plan shapes than that (although not as complete plans, different subproblems are compared on a cost basis).
> While Postgres has explicitly chosen to not implement planning pragmas to override planner behavior, it would be really cool if you could have multiple planners optimized for different types of workloads,
FWIW, it's fully customizable by extensions. There's a hook to take over planning, and that can still invoke postgres' normal planner if the query isn't applicable. Obviously that's not the same as actually providing pragmas.
My guess is that the reason for putting syscall latency high is that it should be easy to fix. Cardinality tracking is a hard problem, but running a loop on install that measures the cost of a couple dozen syscalls really could be done automatically.
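As a toy version of that measure-once idea, something like the following estimates the random-vs-sequential read ratio on a given machine (file name and sizes are arbitrary, and without O_DIRECT or dropping caches it mostly measures the page cache, so it's a sketch rather than a benchmark):

```python
# Time random 8 kB reads against sequential ones on a scratch file.
import os
import random
import time

PAGE = 8192
PATH = "/tmp/probe.dat"                      # hypothetical scratch file, ~32 MB
with open(PATH, "wb") as f:
    f.write(b"\x00" * (PAGE * 4096))

fd = os.open(PATH, os.O_RDONLY)
n_pages = os.fstat(fd).st_size // PAGE

def timed(offsets):
    start = time.perf_counter()
    for off in offsets:
        os.pread(fd, PAGE, off)
    return time.perf_counter() - start

seq = timed([i * PAGE for i in range(n_pages)])
rnd = timed([random.randrange(n_pages) * PAGE for _ in range(n_pages)])
os.close(fd)
print("random/sequential read time ratio:", rnd / seq)
```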
PostgreSQL has coarse-grain query hints (like `enable_hashjoin`), and the excellent `pg_hint_plan` extension allows you to specify a complete or partial execution plan: https://github.com/ossc-db/pg_hint_plan
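A minimal example of steering a single query with it (the extension has to be installed and loaded; the tables, aliases, and column here are hypothetical):

```python
# Force a join order and join method for one query via a pg_hint_plan comment.
import psycopg2

conn = psycopg2.connect("dbname=mydb")       # hypothetical DSN
cur = conn.cursor()
cur.execute("LOAD 'pg_hint_plan'")           # or preload it via session_preload_libraries
cur.execute("""
    /*+ Leading((t1 t2)) HashJoin(t1 t2) IndexScan(t1) */
    EXPLAIN SELECT *
    FROM t1 JOIN t2 USING (id)
    WHERE t1.flag
""")
for (line,) in cur.fetchall():
    print(line)
```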
It is certainly possible that the plans are similar, and that improvements to the execution engine are being measured. The join order benchmark was designed to test optimizer quality. It is worth noting that in addition to trying to measure the number of pages read from disk, the PG optimizer also tries to reduce the number of tuples examined by the CPU, the number of predicate evaluations, etc. All these numbers are rolled up into a "cost," which is the function that the optimizer minimizes.
It is also true that measuring cold cache and warm cache performance can produce different results, and this experiment is certainly in the warm cache scenario. But, the cold cache scenario suffers from the problem you mention as well: an improvement to PG's B-tree that saves a few IOs will dominate any kind of CPU-based improvement (at least at the data size of the join order benchmark).
FWIW, the plan for the query with the P90 latency changes from a plan that uses loop and merge join in PG8.4 to a plan that uses hash join in PG16 (where it is no longer the P90 query), which is at least some evidence of optimizer improvements.
> It is certainly possible that the plans are similar, and that improvements to the execution engine are being measured. The join order benchmark was designed to test optimizer quality.
I don't think that it's possible to test optimizer quality in isolation -- not really, not if it's to be in any way practical. Many (most?) individual improvements to the optimizer are hopelessly intertwined with executor enhancements. This is usually fairly obvious, because the same commit changes both the optimizer and the executor together. Sometimes it's much less obvious, though, because it's more a case of the optimizer and executor coevolving.
It's probably still the case that a lot of the improvements seen here are pure executor enhancements, but I can't say that I'm very confident about that. (As the main person behind those B-Tree IO savings you might expect me to have more confidence than that, but I find it horribly difficult to tease these issues apart while keeping the discussion high level/relevant.)
I totally agree -- I picked the latest version for each major version using the semver interpretation of the version numbers, which is not how PostgreSQL has traditionally done major version numbers (i.e., PG 8.2 and 8.1 are different major versions, but I interpreted the version numbers as if they were minor versions).
The main reason I did this was to reduce the number of versions I had to test, but I agree a more complete analysis would test each (true) major version.
"Tail latency has improved by YMMV for everything else" => yes, I think that's a valid (but conservative) read. Of course, in many (most?) applications, tail latency is very important. Tail latency also tends to be the thing that optimizer engineers target (i.e., reduce the runtime of the longest running query).
This is the main motivation behind learned "steering" query optimizers: even if a DBA finds the right hint for a query, it is difficult to track that hint through data changes, future DB versions, and even query changes (e.g., if you add another condition to the WHERE clause, should you keep the hints or drop them?). Turns out, with a little bit of systems work, ML models can do a reasonably good job of selecting a good hint for a particular query plan.
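Stripped of the ML, the core loop behind steering is roughly the sketch below: try a handful of candidate hint sets and remember which one wins for a given query; the learned versions predict the winner from the plan instead of brute-forcing it. Hint sets, DSN, and query here are all made up for illustration:

```python
# Naive "steering": time a query under a few hint sets and pick the fastest.
import time
import psycopg2

CANDIDATE_HINTS = [
    "",                                      # the planner's own choice
    "/*+ Set(enable_nestloop off) */ ",
    "/*+ Set(enable_hashjoin off) */ ",
]

def best_hint(cur, query):
    timings = []
    for hint in CANDIDATE_HINTS:
        start = time.perf_counter()
        cur.execute(hint + query)
        cur.fetchall()
        timings.append((time.perf_counter() - start, hint))
    return min(timings)                      # (elapsed seconds, winning hint)

conn = psycopg2.connect("dbname=mydb")       # hypothetical DSN
cur = conn.cursor()
cur.execute("LOAD 'pg_hint_plan'")
print(best_hint(cur, "SELECT count(*) FROM t1 JOIN t2 USING (id)"))
```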