Definitely they should include D4M and GraphQL [1], [2].
Not only can D4M cater for structured relational data, it's also suitable for unstructured and sparse data in spreadsheets, matrices, and graphs. It's essentially a generalization of SQL, but for all things data.
There's also integration of D4M with SciDB [3].
[1] D4M: Dynamic Distributed Dimensional Data Model:
The object storage stuff is new, but it's mostly confirmed that the older architecture works: MPP with shared (S3) storage, and everything above that on local SSD and compute, delivers the best performance. Even Snowflake finally came out with "interactive" warehouses built on this architecture.
Parquet, Iceberg, and other open formats seem good, but they may hit a complexity wall. There's already some inconsistency between platforms, e.g. with delete vectors.
Incremental view maintenance interests me as well, and I would like to see it more available on different platforms. It's ironic that people use dbt etc. to test every little edit of their manually coded delta pipelines, but don't look at IVM.
It should be this way. Clients should have some protocol to communicate the schema they expect to the database, probably with some versioning scheme. The database should be able to serve multiple mutually compatible views over the schema (staying robust to column renames, for example). The database should manage and prevent the destruction of in-use views of that schema. After an old view has been made incompatible, old clients needing that view should be locked out.
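Roughly the kind of handshake I'm imagining, as a toy sketch (every type and name here is invented for illustration, not any real database's API):

```rust
use std::collections::HashMap;

// A made-up registry: which client schema versions the current schema can still serve.
struct SchemaRegistry {
    // schema version -> versions it remains compatible with
    compatible: HashMap<u32, Vec<u32>>,
    current: u32,
}

enum Admission {
    // Serve the client a view matching the schema version it asked for.
    ServeView { version: u32 },
    // The view this client needs has been retired; lock it out.
    Reject { needs_upgrade_to: u32 },
}

impl SchemaRegistry {
    fn admit(&self, client_expects: u32) -> Admission {
        let still_supported = self
            .compatible
            .get(&self.current)
            .map_or(false, |vs| vs.contains(&client_expects));
        if client_expects == self.current || still_supported {
            Admission::ServeView { version: client_expects }
        } else {
            Admission::Reject { needs_upgrade_to: self.current }
        }
    }
}

fn main() {
    let registry = SchemaRegistry {
        // v3 still serves v2 clients (e.g. after a column rename), but v1 is retired.
        compatible: HashMap::from([(3, vec![2, 3])]),
        current: 3,
    };
    for v in [2, 3, 1] {
        match registry.admit(v) {
            Admission::ServeView { version } => println!("client on v{v}: serving view v{version}"),
            Admission::Reject { needs_upgrade_to } => {
                println!("client on v{v}: locked out, upgrade to v{needs_upgrade_to}")
            }
        }
    }
}
```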
> The database should manage and prevent the destruction of in-use views of that schema. After an old view has been made incompatible, old clients needing that view should be locked out.
This is the interesting part where the article's process matters: how do you make incompatible changes without breaking clients?
You’re right that it would run on a blockchain, but that fact would primarily exist to power some marketing. Everybody would end up interacting with it through a single centralized web site and API, because it’s the only usable way to get it to work.
Thank you for writing this. This comes up constantly, and it'll be great to have another reference to cite.
Another interesting thing about TPC-C is how the cross-warehouse contention was designed. About 10% of new order transactions need to do a cross-warehouse transaction. If you can keep up with the workload, then the rate of contention is relatively low; most of the workload isn't pushing on concurrency control. If, however, you fall behind, and transactions start to take too long, then the contention will pile up.
When you run without the keying time, it turns out that concurrency control begins to dominate. For distributed databases, concurrency control and deadlock detection are fundamentally more expensive than they can be for single-node databases -- so it makes sense that a classically single-node database would absolutely trounce distributed databases. I like to think of TPC-C "nowait" as really a benchmark of concurrency control because, due (I believe) to Amdahl's law, the majority of its execution time ends up in the contended portion of the workload.
Also very interesting that, as Justin points out, the workload sets up the warehouses so there is never cross-node contention. That's wild! I'm glad they didn't go and benchmark against even more distributed databases (like YugabyteDB, Spanner, or CockroachDB) and call it a fair fight.
Folks, for the love of god, please please stop running TPC-C without the “keying time” and calling it “the industry-standard TPCC benchmark”.
I understand there are practical reasons why you might want to just choose a concurrency and let it rip at a fixed warehouse size and say, “I ran TPC-C”, but you didn’t!
TPC-C, when run properly, is effectively an open-loop benchmark where the load scales with the dataset size: there’s a fixed number of workers per warehouse (2?) that each issue transactions at some rate. It’s designed to have a low level of built-in contention that occurs based on the frequency of cross-warehouse transactions; I don’t remember the exact rate, but I think it’s something like 10%.
The benchmark has an interesting property: if the system can keep up with the transaction load by processing transactions quickly, it remains a low-contention workload, but if it falls behind and transactions start to pile up, then the number of contending transactions in flight will increase. This leads to a non-linear degradation mode even beyond what normally happens with an open-loop benchmark — you hit some limit and the performance falls off a cliff, because now you have to do even more work than just catching up on the query backlog.
When you run without think time, you make the benchmark closed-loop. Also, because you’re varying the number of workers without changing the dataset size (because you have to vary something to make your pretty charts), you’re changing the rate at which any given transaction lands on the same warehouse as another. So you’ve got more contending transactions generally, but worse than that, because of Amdahl’s law the uncontended transactions will fly through, so most of the time for most workers will be spent sitting and waiting on contended keys.
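To make that concrete, here's a minimal sketch of the driver-side difference (hypothetical names, not a real TPC-C kit): with a keying/think delay, each per-warehouse worker's throughput is capped no matter how fast the database is, so load only grows by adding warehouses; drop the delay and every worker becomes a closed loop hammering its warehouse as fast as responses come back.

```rust
use std::thread;
use std::time::{Duration, Instant};

// Stand-in for issuing one TPC-C-style transaction against `warehouse_id`.
// (Hypothetical; a real driver would talk to the database here.)
fn run_transaction(warehouse_id: usize) {
    let _ = warehouse_id;
}

fn worker(warehouse_id: usize, keying_time: Option<Duration>, deadline: Instant) {
    while Instant::now() < deadline {
        if let Some(wait) = keying_time {
            // Spec-style pacing: the worker "keys in" the next transaction,
            // so per-warehouse throughput is capped regardless of how fast
            // the database responds. Load only scales by adding warehouses.
            thread::sleep(wait);
        }
        // With keying_time == None ("nowait"), this loop is closed-loop:
        // the worker issues the next transaction the instant the previous
        // one finishes, so slow transactions pile contention on top of contention.
        run_transaction(warehouse_id);
    }
}

fn main() {
    let warehouses = 4;
    let deadline = Instant::now() + Duration::from_secs(5);
    let keying_time = Some(Duration::from_millis(500)); // set to None for "nowait"

    let handles: Vec<_> = (0..warehouses)
        .map(|w| thread::spawn(move || worker(w, keying_time, deadline)))
        .collect();
    for h in handles {
        h.join().unwrap();
    }
}
```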
percona/sysbench-tpcc has been subsequently updated to include a stronger disclaimer that it's "TPC-C-like" and doesn't comply with multiple TPC-C requirements. Fingers crossed that this helps stop vendors from doing non-TPC-C benchmarking without realizing it.
Some time ago I worked on CockroachDB, implementing planning for complex online schema changes.
We really wanted a model that could convincingly handle and reasonably schedule arbitrary combinations of schema change statements that are valid in Postgres. Unlike MySQL, Postgres offers transactional schema changes. Unlike Postgres, Cockroach strives to implement online schema changes in a protocol inspired by F1 [0]. Also, you want to make sure you can safely roll back (until you’ve reached the point where you know it can’t fail, after which only metadata updates are allowed).
The model we came up with was to decompose all the things that can possibly change into “elements” [1], and each element had a schedule of state transitions that move the element through a sequence of states from public to absent or vice versa [2]. Each state transition has operations [3].
Anyway, you end up wanting to define rules that say that certain element states have to be entered before others if the elements are related in some way, or that some transitions should happen at the same time. To express these rules I created a little datalog-like framework I called rel [4]. It lets you embed a rules engine in Go, add indexes to it so the implementation is sufficiently efficient, and know that all your lookups are indexed statically. You write the rules in Go [5]. To be honest, it could be more ergonomic.
The rules are written in Go, but for testing and visibility they produce a datomic-inspired format [6]. There are a lot of rules now!
The internal implementation isn’t too far off from the search implementation presented here [7]. Here’s unify [8]. The thing has some indexes and index selection for acceleration. It also has inverted indexes for set containment queries.
It was fun to make a little embedded logic language and to have had a reason to!
While that’s sort of true, there’s a lot of language-specific work that goes into making the UX of a debugger pleasant (think container abstractions, coroutines, vtables, and interfaces). In particular, async Rust and Tokio get pretty interesting for a debugger to deal with.
Also, there’s usually some language- (and compiler-) specific garbage that makes the DWARF hard to use and requires special treatment.
My biggest issue with rust after two years is just as you highlight: the mod/crate divide is bad!
I want it to be easier to have more crates. The overhead of converting a module tree into a new crate is high. Modules get to have hierarchy, but crates end up being flat. Some of this is a direct result of the flat crate namespace.
A lot of the toil ends up coming from the need to muck with toml files and the fact that rust-analyzer can’t do it for me. I want to have refactoring tools to turn module trees into crates easily.
I feel like when I want to do that, I have to play this game of copying files then playing whack-a-mole until I get all the dependencies right. I wish dependencies were expressed in the code files themselves a la go. I think go did a really nice job with the packaging and dependency structure. It’s what I miss most.
It's a surprising choice that Rust made to have the unit of compilation and unit of distribution coincide. I say surprising, because one of the tacit design principles I've seen and really appreciated in Rust is the disaggregation of orthogonal features.
For example, classical object-oriented programming uses classes both as an encapsulation boundary (where invariants are maintained and information is hidden) and as a data boundary, whereas in Rust these are split between the module system and structs, respectively. This allows for complex invariants cutting across types, whereas a private member of a class can only ever be accessed within that class, not even by its siblings within the same module.
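A small made-up example of what that buys you: the module, not the struct, is the privacy boundary, so an invariant spanning two types can be maintained in one place.

```rust
mod accounts {
    // Both fields below are invisible outside `accounts`,
    // but `transfer` can see them both.
    pub struct Account {
        balance: i64,
    }

    pub struct Ledger {
        total_transferred: i64,
    }

    impl Account {
        pub fn new(balance: i64) -> Self {
            Account { balance }
        }
    }

    impl Ledger {
        pub fn new() -> Self {
            Ledger { total_transferred: 0 }
        }
    }

    // An invariant that cuts across two types, maintained inside one module.
    pub fn transfer(from: &mut Account, to: &mut Account, ledger: &mut Ledger, amount: i64) {
        from.balance -= amount;
        to.balance += amount;
        ledger.total_transferred += amount;
    }
}

fn main() {
    let mut a = accounts::Account::new(100);
    let mut b = accounts::Account::new(0);
    let mut ledger = accounts::Ledger::new();
    accounts::transfer(&mut a, &mut b, &mut ledger, 40);
    // a.balance = 0; // error: field `balance` of struct `Account` is private
}
```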
Another example is the trait object (dyn Trait), which allows the client of a trait to decide whether dynamic dispatch is necessary, instead of baking it into the specification of the type with virtual functions.
Notice also the compositionality: if you do want to mandate dynamic dispatch, you can use the module system to either only ever issue trait objects, or opaquely hide one in a struct. So there is no loss of expressivity.
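A quick sketch of that compositionality with toy types: the trait says nothing about dispatch, the caller picks static or dynamic, and a module can still choose to hand out only trait objects.

```rust
// The trait itself says nothing about dispatch.
trait Shape {
    fn area(&self) -> f64;
}

struct Circle {
    r: f64,
}

impl Shape for Circle {
    fn area(&self) -> f64 {
        std::f64::consts::PI * self.r * self.r
    }
}

// The caller can pick static dispatch (monomorphized)...
fn print_area_static(s: &impl Shape) {
    println!("{}", s.area());
}

// ...or dynamic dispatch through a trait object, with no change to the trait.
fn print_area_dyn(s: &dyn Shape) {
    println!("{}", s.area());
}

// And a module can mandate dynamic dispatch by only ever handing out trait objects.
mod factory {
    use super::{Circle, Shape};

    pub fn make_shape(r: f64) -> Box<dyn Shape> {
        Box::new(Circle { r })
    }
}

fn main() {
    let c = Circle { r: 1.0 };
    print_area_static(&c);
    print_area_dyn(&c);
    print_area_dyn(factory::make_shape(2.0).as_ref());
}
```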
The history here is very interesting: Rust went through a bunch of design iteration early, and then it just kinda sat around for a long time, and then made other choices that made modifying earlier choices harder. And then we did manage to have some significant change (for the good) in Rust 2018.
Rust's users find the module system even more difficult than the borrow checker. I've tried to figure out why, and figure out how to explain it better, for years now. Never really cracked that nut. The modules chapter of TRPL is historically the least liked, even though I re-wrote it many times. I wonder if they've tried again lately; I should look into that.
> Another example is the trait object (dyn Trait), which allows the client of a trait to decide whether dynamic dispatch is necessary, instead of baking it into the specification of the type with virtual functions.
Here I'd disagree: this is separating the two features cleanly. Baking it into the type means you only get one choice. This is also how you can implement traits on foreign types so easily, which matters a lot.
Sorry if my comment wasn't clear: I'm saying that I think in both the module and trait object case, Rust has done a good job of cleanly separating features, unlike in classic (Java or C++) style OOP.
I'm surprised the module system creates controversy. It's a bit confusing to get one's head around at first, especially when traits are involved, but the visibility rules make a ton of sense. It quite cleanly solves the problem of how submodules should interact with visibility. I've started using the Rust conventions in my Python projects.
I have only two criticisms:
First, the ergonomics aren't quite there when you do want an object-oriented approach (a "module-struct"), which is maybe the more common use case. However, I don't know if this is a solvable design problem, so I prefer the tradeoff Rust made.
Second, and perhaps a weaker criticism, the pub visibility qualifiers like pub(crate) seem extraneous when re-exports like pub use exist. I appreciate that maybe these are necessary for ergonomics, but they do complicate the design.
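For what it's worth, here's a toy side-by-side of the two mechanisms (made-up module names):

```rust
mod db {
    mod pool {
        pub struct Pool;

        impl Pool {
            pub fn new() -> Self {
                Pool
            }
        }
    }

    // (a) Re-export: `pool` stays private, but `db` chooses to expose `Pool`
    // under its own path.
    pub use self::pool::Pool;

    // (b) Visibility qualifier: callable from anywhere in this crate,
    // invisible to downstream crates.
    pub(crate) fn migrate() {}
}

fn main() {
    let _p = db::Pool::new();
    db::migrate();
}
```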
There is one other piece of historical Rust design I am curious about, which is the choice to include stack unwinding in thread panics. It seems at odds with the principal systems programming use case for Rust. But I don't understand the design problem well enough to have an opinion.
> Rust's users find the module system even more difficult than the borrow checker. I've tried to figure out why, and figure out how to explain it better, for years now.
The module system in Rust is conceptually huge, and I feel it needs a 'Rust modules: the good parts' resource to guide people.
(1) There are five different ways to use `pub`. That's pretty overwhelming, and in practice I almost never see `pub(in foo)` used.
(2) It's possible to have nested modules in a single file, or across multiple files. I almost never see modules with braces, except `mod tests`.
(3) It's possible to have either foo.rs or foo/mod.rs. It's also possible to have both foo.rs and foo/bar.rs, which feels inconsistent.
(4) `use` order doesn't matter, which can make imports hard to reason about. Here's a silly example:
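Something along these lines, with made-up module names in a single main.rs:

```rust
// `use` items are resolved against the crate (and external crates), not
// against earlier `use` lines, so the order of the two imports below is
// irrelevant and they read "backwards".
mod foo {
    pub mod bar {
        pub fn hello() -> &'static str {
            "foo::bar"
        }
    }
}

mod bar {
    pub mod foo {
        pub fn hello() -> &'static str {
            "bar::foo"
        }
    }
}

// These two lines can be swapped freely; neither one feeds into the other.
// If an external crate named `foo` were also on the dependency list, you
// couldn't tell which is which just by reading top to bottom.
use bar::foo::hello as hello_from_bar;
use foo::bar::hello as hello_from_foo;

fn main() {
    println!("{}", hello_from_foo());
    println!("{}", hello_from_bar());
}
```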
Fully agree with 1. I do use 2 sometimes (if I'm making a tree of modules for organization, and a module only contains imports of other modules, I'll use the curly-brace form to save the need of making a file). And I'm not sure why 4 makes it harder; wouldn't it be more confusing if order mattered? Maybe I need to see a full example :)
In `use foo::bar; use bar::foo;`, am I importing an external crate called foo that has a submodule bar::foo, or vice versa?
This bit me when trying to write a static analysis tool for Rust that finds missing imports: you essentially need to loop over imports repeatedly until you reach a fixpoint. Maybe it bites users rarely in practice.
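The core of it ends up looking something like this toy version (invented data model, nothing like the real resolution rules):

```rust
use std::collections::{HashMap, HashSet};

// Toy model: each `use` maps an alias to a path whose first segment is either
// a known root (an external crate, say) or an alias created by another `use`.
fn resolve(uses: &[(String, String)], roots: &HashSet<String>) -> HashMap<String, String> {
    let mut resolved: HashMap<String, String> = HashMap::new();
    loop {
        let mut changed = false;
        for (alias, path) in uses {
            if resolved.contains_key(alias) {
                continue;
            }
            let mut segments = path.split("::");
            let first = segments.next().unwrap();
            let full = if roots.contains(first) {
                Some(path.clone())
            } else {
                // The first segment may be an alias we haven't resolved yet,
                // possibly declared *later* in the file.
                resolved.get(first).map(|prefix| {
                    let rest: Vec<&str> = segments.collect();
                    if rest.is_empty() {
                        prefix.clone()
                    } else {
                        format!("{}::{}", prefix, rest.join("::"))
                    }
                })
            };
            if let Some(full) = full {
                resolved.insert(alias.clone(), full);
                changed = true;
            }
        }
        // A full pass with no progress means we've reached the fixpoint.
        if !changed {
            return resolved;
        }
    }
}

fn main() {
    let roots: HashSet<String> = HashSet::from(["dep".to_string()]);
    // The first import depends on the second, which is fine in Rust because
    // `use` order doesn't matter; hence the repeated passes above.
    let uses = vec![
        ("Leaf".to_string(), "branch::Leaf".to_string()),
        ("branch".to_string(), "dep::branch".to_string()),
    ];
    println!("{:?}", resolve(&uses, &roots));
}
```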
Hard agree. In retrospect, I think the model of Delphi, where you must manually assemble a `pkg` in order to export to the world, should have been used instead.
It would also have solved the problem where you end up making a lot of things `pub` not because the logic dictated it, but because it's the only way to share across crates.
It should have been all modules (even main.rs with mandatory `lib.rs` or whatever) and `crate` should have been a re-exported interface.
> Hard agree. In retrospect, I think the model of Delphi, where you must manually assemble a `pkg` in order to export to the world, should have been used instead.
How would you compare that to, say, Go? I think the unit of distribution in Go is a module, and the unit of compilation is a package. That being said, by using `internal` packages and interfaces you can similarly create the same sort of opaque encapsulation.
Literally all three of their required features which were not “standard” for Swiss Tables were requirements for the Go implementation. See https://go.dev/blog/swisstable for a nice post on that project.
Many hash table designs, including Abseil’s Swiss Tables, forbid modifying the map during iteration. The Go language spec explicitly allows modifications during iteration, with the following semantics:
- If an entry is deleted before it is reached, it will not be produced.
- If an entry is updated before it is reached, the updated value will be produced.
- If a new entry is added, it may or may not be produced.
- Vector databases and hybrid search?
- Object storage for all the things? Lake houses. Parquet and beyond.
- Continuously materialized views? I'm not sure this one has made a splash yet, but I think about Naiad (Materialize) and Noria (Readyset).
- NewSQL went mostly mainstream (Spanner wasn't included in the last one, but there's been more here with things like CockroachDB, TiDB, etc.)