Data denormalization is broken (hackernoon.com)
171 points by endswapper on Sept 30, 2016 | 83 comments


I'm usually the guy arguing against microservices, but I think this is a good example of why they shine. It's much easier to manage the updating of these synthesized numbers if you have a discrete service that limits the things you can do at a higher level. So instead of worrying about who is updating the data store directly, you push it up to a higher service-level interface where you have a finite list of operations that are supported, and the scope is small enough that one architect can understand the system in its entirety and make smart decisions about how to compute/store/cache these synthesized data.

That said, if you have fewer than 10 engineers you shouldn't even be thinking about this. Way too many young engineers get hung up on scalability when they don't even have anything on the horizon that will require scaling. Couple that with a general ignorance of SQL because it's old and boring, and not realizing how much performance you can eke out of a well-tuned RDBMS with proper schema and queries. A SQL DB is an incredible hedge against an uncertain future: keep your data tidy and you can keep pivoting until you find something you actually have to scale.


Speaking of RDBMSes, you can achieve this "limited API" by just creating an RDBMS user with limited privileges (can query a subset of tables; can make changes through a fixed set of stored procedures.)
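
For example, a minimal Postgres-flavored sketch (the role, table, and function names are all hypothetical); the stored procedure would typically be SECURITY DEFINER so it can write to tables the role itself cannot touch:

  CREATE ROLE app_reporting LOGIN PASSWORD 'changeme';
  -- read access to a subset of tables only
  GRANT SELECT ON orders, customers TO app_reporting;
  -- writes happen only through a fixed set of stored procedures
  GRANT EXECUTE ON FUNCTION record_payment(int, numeric) TO app_reporting;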

Personally, I never understood why everyone thought it was necessary to create HTTP API layers in front of every database, rather than just standardizing a public-Internet SQL wire-protocol between different RDBMSes and giving them all the ability to enforce per-user constraints on things like maximum query execution cost.


> I never understood why everyone thought it was necessary to create HTTP API layers in front of every database

Mostly because you usually need to do some kind of logic that's either difficult or impossible to represent in SQL views or functions. For example, how would your SQL view access something like redis? or any kind of external cache? or an HTTP service? This is the kind of stuff that's pretty common in HTTP Controller functions, and would be a nightmare to move into the database.

(Note: I'm generally in favor of using SQL views/functions for data-facing logic, just, y'know, not a silver bullet.)


> For example, how would your SQL view access something like redis? or any kind of external cache? or an HTTP service?

Smug comeback: https://wiki.postgresql.org/wiki/Foreign_data_wrappers
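
As a rough sketch of what that looks like with the third-party redis_fdw extension (option and column names are from memory and may vary by version):

  CREATE EXTENSION redis_fdw;
  CREATE SERVER redis_server
    FOREIGN DATA WRAPPER redis_fdw
    OPTIONS (address '127.0.0.1', port '6379');
  CREATE USER MAPPING FOR PUBLIC SERVER redis_server;
  CREATE FOREIGN TABLE redis_cache (key text, val text)
    SERVER redis_server
    OPTIONS (database '0');
  -- the cache can now participate in joins and views like any other table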

But your point still stands; no other RDBMS has those, so it can't really be an assumption.

Still, there is a mismatch of paradigms here. The people who built the RDBMSes assumed that everyone would import data into them from wherever else. You, the user (or you, the application developer) shouldn't need to query a third-party web service when doing your database query; something else was supposed to do that query in expectation of your need, and shove the results into the database, so you could just treat that data as another table of useful information to JOIN on.

People who front databases with web APIs tend to treat the database itself as a dumb store—one among many—that's supposed to quickly return "raw" data from single tables so that the web service can chew it together with other data to compute answers to questions. But, by doing so, they're in effect creating their own bespoke Database Management System: the web service becomes the thing that does efficient OnLine Analytical Processing of manifold backing datasets. (Or, at least, attempts to.)

A DBMS's job is to ingest data and process queries to spit out answers. It's not your job; it's not your app's job; it's the DBMS's job. The DBMS has decades of optimization for that use-case; and in fact, that's the meaning of the word "database management system", when defined in opposition to a "flat-file database" like BDB or LevelDB. The fact that there's a daemon sitting there between you and your databases allows the daemon to index and join and expose views and run checks and enforce rules and on and on and on. It allows the daemon to do asynchronous, streaming imports of data-sets from remote sources, so that its data is always up-to-date. It allows the daemon to be there for things like crawler agents to dump their results into, rather than dumping them to something like an S3 bucket where you later have to be the one to import them in. In other words, being a daemon allows the DBMS to be the service that you just come to and ask questions—arbitrarily-complex analytical queries—and get answers.

But that only works if the database management system itself is given access methods to all the databases that contain the data to answer your question. And, in modern practice, we've essentially forgotten that that is a thing that we ever wanted to do. Who would teach a database management system anything? Database connector libraries don't belong in DBMSes; they belong scattered throughout our toppling edifice of RESTful API microservices!

(And if you argue that we still need those RESTful APIs for things like browser-clients wanting to do queries using AJAX: there's no reason DBMSes—not just document stores but full-fledged OLAP DBMSes—can't speak HTTP as well. CouchDB is a great but strangely-unique example. Adapter gateways also exist, like PostgREST.)


>So instead of worrying about who is updating the data store directly, you push it up to a higher service-level interface where you have a finite list of operations that are supported, and the scope is small enough that one architect can understand the system in its entirety and make smart decisions about how to compute/store/cache these synthesized data.

Nothing about this requires a microservice.

A module / package / etc barrier in a "monolithic" app through a limited exposed interface is totally enough.


Yes yes, I know, I've made all the same arguments myself. But you have to admit, the isolation of a code base with a discrete test suite and a discrete team that has to think about nothing else gives a level of awareness of the subtleties of the potential issues that you don't get in a monolith. SOA is about scaling people more than anything.


That's not to mention simple app caching techniques for computed values.


Interestingly, this article reflects a really common misconception about database normalization:

The relational algebra that normal forms are expressed in has no concept of aggregate functions.

The example the article gives uses fields like numUnreadRooms or lastSeenTimestamp. These are derivable from the contents of other tables with SQL aggregates like COUNT() or MAX(), but the relational algebra itself would treat them as facts of the relation itself, dependent upon the primary key. And they'd be updated by the application itself.

In other words, having numUnreadRooms or lastSeenTimestamp in your schema does not make it denormalized.

I was under a similar misconception for the first 5-6 years of my career, and then when I realized that it's perfectly fine to store these aggregates inline, my database queries suddenly started performing acceptably.

As for how to update them - in non-broken databases, the solution is usually triggers. A trigger is a stored procedure that runs when a database operation makes some condition true, and it runs entirely in the DB, with full transactional awareness. So you could listen for inserts and deletes and then update the count on them, and if your insert failed or the transaction was rolled back, your trigger would never be executed. This keeps data integrity in the database, but still gives you efficient queries and an application interface that's consistent.
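
A minimal Postgres sketch of the unread-counter case (the schema is invented for the example); because the trigger runs inside the same transaction as the INSERT, a rollback undoes the counter bump as well:

  -- hypothetical schema: messages(room_id, recipient, ...),
  --                      room_user(room_id, user_id, num_unread)
  CREATE FUNCTION bump_unread() RETURNS trigger AS $$
  BEGIN
    UPDATE room_user
       SET num_unread = num_unread + 1
     WHERE room_id = NEW.room_id
       AND user_id = NEW.recipient;
    RETURN NEW;
  END;
  $$ LANGUAGE plpgsql;

  CREATE TRIGGER messages_bump_unread
    AFTER INSERT ON messages
    FOR EACH ROW EXECUTE PROCEDURE bump_unread();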


Materialized views are a superior, or at least less error-prone, alternative to triggers, since they don't depend on the programmer's ability to get state transitions right (whenever you manually update aggregates).
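
For example, a sketch in Postgres syntax (hypothetical schema): you declare the aggregate once as a query, then refresh it instead of hand-maintaining counters.

  CREATE MATERIALIZED VIEW unread_rooms AS
  SELECT ru.user_id, count(DISTINCT ru.room_id) AS num_unread_rooms
    FROM room_user ru
    JOIN messages m ON m.room_id = ru.room_id
   WHERE m.sent_at > ru.last_seen_at
   GROUP BY ru.user_id;

  -- Postgres won't keep this current on its own; something has to re-run:
  REFRESH MATERIALIZED VIEW unread_rooms;

You trade trigger bookkeeping for a refresh schedule.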


What are the performance characteristics of materialized views? My understanding is that they require running & storing the full query. For a SELECT COUNT(*) FROM TableName this is trivial, but if it's SELECT COUNT(*) FROM BigTableName WHERE ComplexWhereCondition it can get very expensive.

For the simple count/max aggregates on a single relation that the blog post is talking about, the trigger logic seems straightforward - you just need a trigger on a single table, for insert/delete/truncate, + update for max or if there are conditions. If there's more than one relation involved or if the condition is complex, then it can get pretty complicated.


> My understanding is that they require running & storing the full query.

This is right. Having a single materialized view containing nested queries is not a good idea. But you can factor out the nested queries into their own materialized views. The propagation of updates between materialized views can achieve an effect similar to that of complex triggers.

I should've been more careful with my original comment, though. I'm not saying triggers are worthless. But, whenever a materialized view can do the job, it's usually preferable.


PipelineDB (based on Postgres) seems to get the best of both worlds by allowing you to specify a query, like materialized views, but refreshing those results based on inserts/updates, like triggers: https://www.pipelinedb.com/

Not affiliated, not even a user, just found it interesting when it was posted here.


Yeah, the whole time I was reading this I was thinking - wouldn't you just do an insert trigger to increment your unread message counter?


Except you have the same problem, but are simply moving the complexity from the application to the database. It's not just a single insert trigger. You need one or more delete triggers. You need update triggers for things like unread or soft delete flags. It's also more difficult to maintain 5-10 triggers at the database level where there are few tools to see how multiple triggers for the same aggregation interact with each other.

The point is you're suffering the same problem. Whether you do it at the application level or the database level, it's a mess.


Right, but the point is that by moving it to the database level, you give the applications above it some guarantees about the state of the DB. You can be sure that some other application hasn't gone and written to the DB without firing the trigger. You can be sure that you don't get race conditions from an application not properly using transactions when updating the derived data. And you can be sure that if a transaction is rolled back, the derived fields will still be in a consistent state.

It doesn't free you from programming, from reasoning about your code, or from ensuring that this code is free of bugs. But it does free you from worrying about whether the bugs exist in multiple places or whether there are subtle race conditions involved in the code interacting with itself.

Whether this is important depends on how many applications are accessing the same database. There's a strong argument for one app = one DB, if you can get away with it, and then none of this is relevant. This is not always possible.


Other application? There is no other application, old man. What you talkin' 'bout?

... (2 years later) ... Oh, now I get it!


TLDR: cache invalidation is hard, even if you don't call it cache invalidation.

More seriously there are plenty of ways to make this less painful, eg, adding a "lastMessageTime" to the Room table, which would reduce the work to a simple intersection of RoomUser and Room. Facebook uses a more complicated system involving "graph" relations and a global ID space, but it's more or less the same thing.
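
i.e., roughly this (column names invented for the sketch; the field names follow the article's):

  SELECT count(*)
    FROM RoomUser ru
    JOIN Room r ON r.id = ru.roomId
   WHERE ru.userId = :user
     AND r.lastMessageTime > ru.lastSeenTimestamp;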

Also, "normalized" is not a binary thing. There are several levels from 1NF (flat file, fully denorm) on through 3NF and BCNF (what the SQL books usually call "normalized") on up to 4th, 5th, and 6th normal forms, some of which are academic party tricks but sometimes see use in production.


To add to this: every serious production system tends toward a slurry of 2NF-3NF-BCNF models, mostly for performance reasons. The case of lastMessageTime is a simpler one: it might drift out of date, but generally you get false positives and never a false negative. Note that people complain about Facebook message jewels showing up when there's nothing new but seldom complain about not getting notified.

Another curious example is columnar databases. They are basically a 4NF storage layout with a veneer of 3NF syntax for ease of querying.


I don't think, as the article claims, data denormalization is broken. I think most programmers' knowledge of SQL is scrawny --- unexpectedly so for a language that's over 30 years old and underpins most applications. SQL was obviously meant to be simple, almost declarative, and to resemble plain-English sentences. And yet it's the first thing that developers abstract away with an Object Relational Mapping, much to their loss.

The article's example is one-to-one chat with millions of users. It says that counting unread messages is too slow unless you denormalize it.

First of all, Rooms isn't a core table. If each Room has just two people, as the article says, then all you need is one table, messages:

    a  |   b   |        sent         |      body      |        read
  -----+-------+---------------------+----------------+---------------------
   ed  | tom   | 09/29/2016 14:22:10 | Hi             | 09/30/2016 02:02:15
   tom | ed    | 09/30/2016 02:03:15 | What up?       | 09/30/2016 02:04:03
   sue | sally | 09/30/2016 08:19:20 | Catch ya later |
So the count of unread rooms is:

  select count(1) from (select distinct a from messages where b = :user and read is null) t;
If the columns are indexed, I would think PostgreSQL could handle such queries in a split second, even with millions of rows.
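
For what it's worth, a partial index is the natural fit here (Postgres syntax; the index name is arbitrary):

  create index messages_unread_idx on messages (b, a) where read is null;

That way the unread-count query only ever touches rows that are still unread, and can often be answered by an index-only scan.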

At Facebook's scale, where you have hundreds of millions of users, you will have hundreds of billions of rows, maybe trillions. Okay, so it might start to get slow at that point. But let's stop for a second and remember that 99.999% of web applications will never reach Facebook's scale.

In Facebook's case, I read a long time ago that they make a database for each user. If so, the aforementioned table and query would still work --- in fact without the need for the b column.


I agree with most of what you said, but one thing bothers me:

> And yet [SQL]'s the first thing that developers abstract away with an Object Relational Mapping, much to their loss.

I see that sentiment repeated often, but I really don't understand it. A decent ORM does two things:

1) It makes writing simple queries better – more maintainable, easier to compose, easier to secure against injection, smoother to write without constantly translating between your domain language and SQL.

2) It gets out of your way when you need to take the reins. If it is difficult to make your ORM run raw SQL queries and map the results back into domain models and primitives like array, hash, whatevs – it's time to find a better ORM.

For me and many decent devs I've worked with, an ORM is not about hiding SQL. It's about automating the simple, routine stuff and letting me focus on the important, complicated parts – including highly-optimized SQL queries, at times.


People tend to talk around each other when discussing "ORM"s because the term conflates a few different things.

The most fundamental thing that is sometimes included as part of the "ORM" definitions is translating the data received from a query into objects the host programming environment understands, eg. a PostgreSQL varchar into a Ruby String. This is basically what you're talking about in (2). Everyone (I think?) agrees this is useful.

Next up the chain of usefulness is providing a version of the query interface that uses concepts native to the host environment, like chainable methods. This is what you're talking about in (1). Some people debate the usefulness of this over SQL strings, but most will agree that it is compelling because it lets you build queries out of composable parts, and to separate the parts of the final SQL string that are part of the query syntax vs. part of the query data (ie. for protection against injection).

Past this, there are a bunch of features like mapping to a domain model and validation and lifecycle management and on and on and on that are very debatable. If we would separate those first two things ("data translation" and "query composition" maybe?) from this other grab-bag of things ("ORM stuff"?) when talking about this, I think the discussions would be more enlightening.


If someone's primary perspective of the database is through the lens of an OO graph, they will write code that performs abysmally. Graph navigation from row to row is antithetical to all the power relational algebra gives you. That's the biggest risk: viewing the DB primarily as an object store.

For the app I work on, the DB is only viewed as an object store for very small sets of data - configuration, specifically. It's impossible to map it to objects in any realistically performant way for our user's data, not least because that would require the data round-tripping through the application. It's much, much more efficient to execute my multi-million row updates and inserts on the database server directly rather than drag the data down to the app, tweak it slightly, then persist it back.


This is true.

ORMs force the thinking "objects first" (ie. work with objects, the SQL will be handled for you). This translates into a design of "per object" architecture. Business logic is abstracted for handling single rows only - one at a time - encapsulation being the design idiom. However, this is a trap that immediately becomes apparent when you realise that you've forgone the powerful set-based capabilities of SQL. SQL doesn't "need to see" a single row of a table as anything special to warrant ring-fencing in the name of encapsulation. The whole (related) data model is a single algebraic representation of state.


On the other hand, isn't that just a sign of a weak ORM? For example, in the platform I use, when I "fetch" some records by ID, it doesn't actually issue a SELECT; it just collects the ID and returns a proxy object. Only when I access an attribute of that object, it'll issue a SELECT for all the IDs in question.

One could extend that for more complex operations - instead of immediately executing the operations, it'd collect the operations and convert them into efficient queries. In a way, this is already done in "fluent interfaces" like LINQ, but with a more flexible language (Lisp?), I'm sure one could use standard control flow instead of a custom layer.


> For example, in the platform I use, when I "fetch" some records by ID, it doesn't actually issue a SELECT; it just collects the ID and returns a proxy object. Only when I access an attribute of that object, it'll issue a SELECT for all the IDs in question.

In my opinion this is the least desirable behavior. It abstracts the data layer away to the point where you don't think about what's happening under the hood when you type foo = Klazz.new("id"); foo.bar; foo.baz;

Was that one, two, or three separate queries? Did a query run at all? Does it matter? After all, you have an object to work on, right in your language of choice.

Sure the ORM may be smart about it and do the right thing - I have no doubt that most decent ones do. But that misses the point - as the person using this interface you get divorced from the direct awareness that this data has an origin, and there is a cost to access it. After all, you're just instantiating an object.

Patterns that use interfaces that make it so simple to do a thing that you don't need awareness that you're doing it tend to degrade quickly after the original engineers who did have that understanding leave a project.


That's a fair point; on the other hand, aren't SQL queries themselves divorced from the actual data layer? Knowing whether the ORM issued one or two queries doesn't tell you if you're fetching data from disk or memory, whether it's making an index scan or full table scan, whether the join fits in working memory or will hit the disk, etc.

Isn't knowing what queries are issued giving you a false sense of knowledge about how the data is actually fetched?

And much the same way you can request a query plan, the ORM could provide the list of queries issued for a given piece of code.


This is what the Django ORM does when used with some degree of care.


I know it uses lazy results, but can you actually loop over the results without triggering a query? Could you write something like:

  records = ORM.search(type='X')
  for record in records:
    record.a = record.b + 30
and only trigger a single UPDATE query?


There is an update method that can be used with F expressions to achieve this.

This is the example in their docs:

    Entry.objects.all().update(n_pingbacks=F('n_pingbacks') + 1)


To write it in pure SQL would ensure the single query you're looking for and be less to type:

  update records
  set a = b + 30
  where type = 'X'


I've never quite worked out why graphs and relations can't be unified in some way. They both rely on first order logic, after all.


The select N+1 problem is a problem because of the overhead on each select; if that were free, graph navigation and relational set operations would match.

Reminds me of, "in theory, there's no difference between theory and practice, in practice there is".


The way you're using an ORM is fine. Do you think, though, that ORMs encourage people to delay learning the highly optimized SQL that you acknowledge is sometimes needed?

I myself have yet to embrace anything beyond what they call a database-abstraction layer, where instead of pg_query and mysql_query, you just have db_query, but you still have to write the SQL. So:

  db_query('select id, name from table where foo = :foo').values({ foo: 'bar' });
Is that so much worse than the more involved ORMs that let you write:

  db_select([ 'id', 'name' ]).from('table').where({ foo: 'bar' });


I think SQLAlchemy is the only ORM I've used that has support for most/all of:

- Subqueries
- HAVING clause
- USE index hint
- WHERE ... IN
- SELECT DISTINCT
- ON DUPLICATE KEY UPDATE
- SELECT ... FOR UPDATE

The trouble comes when you're using an ORM that has some but not all of the features, then the abstraction falls apart and you need to go down to using SQL anyway.


Compared to the example that you gave, it's not. Where the latter becomes more interesting is when the identifiers are no longer string literals, but actual identifiers (i.e. once SQL records are mapped to some kind of tuple, structure or object type in the language). Then you get things such as static type verification, code completion, refactoring etc.

Note that this doesn't require a full fledged ORM. True object-relational mapping requires dealing with things such as object identity.


A lot of the time, when people speak of "ORM", what they actually mean is a "strongly typed DSL for SQL, using DTOs". They are not actually the same thing. What you describe seems to be more of the latter.


Thus my fear and loathing of "ORM".

Service gets rows from table - essentially a set of name-value pairs or N sets of similar name-value pairs. Service writes data to consumer as JSON - also name-value pairs.

We could just (usually) run a prepared statement that safely inserts any search criteria into the query, then convert the DB name-values (possibly with column aliases) directly to the JSON.

In some cases, we might need to validate, or run server side business logic on some parts of a tuple, so, under duress, we could write a wrapper class that manages the name-values of a tuple, which are actually used - and not just shoveled over the fence, for business logic.

Or, we could define a layer of junk throw-away classes, or hell, maybe even 2 layers of junk throw-away classes, and maybe some other classes specific to each query or CRUD access type. Then we can shovel stuff into and back out of them, and immediately throw them away. Lazy loading apparently doesn't apply to all the gratuitous value mapping and copying that is always done, even if not looked at, etc.

The length of the code bloat, and the depth of hiding the database, in this style of code is appalling. (Not that you shouldn't have an access layer, but you can keep your HQL and rubbish like that, thank you very much)

IDE users like languages and tools that like IDEs :-(

Contrariwise, I have worked (sometimes) with languages since the mid 80s that don't insist on every identifier binding specified at compile time (vs mandatory early-as-possible - compile time - binding, so the IDE can autocomplete and rename, er, "refactor" everything for baby, regardless of how much repetitive drivel has to be shoveled into the project)

THAT'S what I think about, when somebody says "ORM" :-(

(sorry, I guess it's been a long week)


So I'm doing a lot of academic reading around databases, and people have been thinking about this sort of thing for a while. Here's one approach:

> In this paper, we study the complementary approach, addressing index maintenance as part of query processing using continuous physical reorganization, i.e., cracking the database into manageable pieces. The motivation is that by automatically organizing data the way users request it, we can achieve fast access and the much desired self-organized behavior.

> The cracking approach is based on the hypothesis that index maintenance should be a byproduct of query processing, not of updates. Each query is interpreted not only as a request for a particular result set, but also as an advice to crack the physical database store into smaller pieces.

http://stratos.seas.harvard.edu/files/IKM_CIDR07.pdf

The references are also pretty good.

edit:

Here's one more specifically about (de)normalization:

http://stratos.seas.harvard.edu/publications/main-memory-ada...

But it's much newer and lighter on info.


Interesting read, thank you.

Where can I track recent interesting papers in databases?


Unfortunately I'm as clueless on this as you - I'm mostly going on the papers that my professor suggests (CS165 and CS265 at Harvard; the syllabus is public). Though VLDB and CIDR seem to be the two conferences that publish the most famous papers. Eventually you start learning the names of the authors, etc.


Martin Kleppmann was presenting architectures that solve this problem:

http://www.confluent.io/blog/turning-the-database-inside-out...

http://www.confluent.io/blog/bottled-water-real-time-integra...

He identifies not just secondary indices and materialized views as possible candidates, but also other applications such as caching and search indexing, as falling within the same class of problem.

Here, rather than trying to figure out a tradeoff between normalized and denormalized views, Samza and Kafka are used as the heart of what the above article was talking about as "generalized denormalization engine". It's done by capturing data change as a stream to incrementally build whatever denormalized view you want.

The downside is that this is still fairly clunky if you have the expectation that this should be built into the database. Kleppmann's ideas break it out into a whole infrastructure to support that. On the other hand, if you do have a full-time ops team, it sounds like it can work very well at scale.


I skipped around a bit in the article, but didn't see them consider having database updates simply trigger a cache invalidation (i.e. set it to null) of the denormalized data instead of doing a full recomputation every time the data changes. That way you only pay the price when you need it, and it still allows for caching. You can prewarm the cache with a cron while still allowing for on-demand calculation if the cache is cold.
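
A sketch of that idea with a cached column (hypothetical table and column names; the same pattern works with an external cache):

  -- On write: mark the cached value stale instead of recomputing it.
  UPDATE user_summary
     SET num_unread_rooms = NULL
   WHERE user_id = :recipient;

  -- On read, only if the cache is cold: recompute once and store it.
  UPDATE user_summary
     SET num_unread_rooms = (SELECT count(DISTINCT ru.room_id)
                               FROM room_user ru
                               JOIN messages m ON m.room_id = ru.room_id
                              WHERE ru.user_id = :user
                                AND m.sent_at > ru.last_seen_at)
   WHERE user_id = :user
     AND num_unread_rooms IS NULL;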


I think TFA was more about "when to invalidate/recompute" than "how to invalidate/update" -- that you need to take care to enumerate all of the times the denormalized value may get out of sync with the rest of the data.

(This is a reasonable idea though, depending on access patterns. Can increase latency, can decrease computation.)


I've been thinking about this a lot recently, and I'm convinced we need a Storm[1]-like architecture for databases. Something which would allow views / denormalization / indexing to run in separate processes. Preferably in the same language the app developers speak. And at the same time be part of the database transaction logic - triggered idempotently by the database so you always see a consistent view of the data no matter how you update & query it. (You can do this efficiently with MVCC if you proxy queries through the indexer / denormalizer / view logic / whatever you want to call it).

I've scattered some more thoughts about this here, but its a big topic: https://josephg.com/blog/composing-databases/

[1] http://storm.apache.org/


You may find this article about "turning the database inside-out" using Apache Samza to be quite interesting: http://www.confluent.io/blog/turning-the-database-inside-out...

From my experience, a decoupling of the functions that a database management system provides (storage, optimization, handling of streams, etc.) and the ability to think about those capabilities as modules within a framework is a powerful concept.


"triggered idempotently by the database so you always see a consistent view of the data no matter how you update & query it"

As I sit here trying to build the thing in my head that would do that, I see some major performance issues arising. Specifying how you want the data denormalized isn't that big a deal. Taking a frozen normalized database and denormalizing it according to that spec is a somewhat bigger deal; the performance implications are non-obvious because the denormalization can invoke arbitrary amounts of data in arbitrary combinations in arbitrary orders, even before we let the denormalization specification be Turing complete. Mathematically it all works but if in practice my denormalization pass is randomly accessing huge swathes of the database every time I insert a row that's not going to work very well.

If we're trying to avoid slow queries when the user asks for data, and we're trying to denormalize in advance, we're basically trying to have our already-presumably-pretty-busy database pre-emptively guess all possible slow queries it might run in the future, and run them now.

And then run them again when a small change comes in, having possibly never used the first set of denormalized, expensive queries I just ran.

That seems... likely to generate performance problems, to understate my case by a lot. Even accounting for possible efficiencies the denormalizer might be able to take advantage of, there's still the bound of what has to be written out.

OK, so that doesn't really work. But it doesn't need to; the database could do this more lazily. But, doing it fully lazily is just the same as the case where you use the normalized data, so we can't do it fully lazily. Somehow the database needs to figure out what it should and should not precompute. Well, that doesn't sound like that big a deal, right? No, it's a nightmare to even begin to specify what that might be, in the face of full Turing chaos. I can almost see how one might try to apply some basic "learning" algorithms to the task of pre-computing things partially, but it's getting pretty complicated in the backend there. There's an awful lot of symbolic computation there where I really want things to just be computing for performance.

Furthermore, if one insists on consistency in the denormalized fields... yeow, that's complicated. Again, it's not hard to specify the math but making it happen would be crazy insane. Oh, and I don't mean distributed consistency... I mean even consistency on a single busy host is going to be a challenge. A feasible challenge, certainly, but a challenge that shouldn't be ignored.

It seems to me there might be a use case for this to be a bullet-point feature for a database, for developers to very, very carefully use on very high-use fields. But it wouldn't be very hard for a real-world denormalization engine to become a performance nightmare. Or to end up taking more "index tuning" to be usable than the code we already have.


When an update happens the trick is going to be figuring out an update mechanism for the derived data without needing to run more queries. You're probably right - in the general case thats really hard to do automatically. But lots of aggregate values can be calculated based on an update without reference to the original set. `SUM()` and `MEAN()` are easy. `MAX()` and `MEDIAN()` are hard. In the case of the notification count query from blog post, you can do that efficiently with 3 map reduces:

1. Given a list of `(message, room, timestamp)` triples, generate `(room, max(timestamp))` tuples. In general max is expensive to calculate, but in this case the message collection is immutable, and the rule to update max(timestamp) when an edit happens doesn't need any extra processing based on an insert.

2. Given `(room, message_timestamp, read_timestamp)` generate `(room, message_timestamp > read_timestamp)`. Easy.

3. Then a reduce to calculate the sum.
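
(For reference, the same three steps collapsed into a single relational aggregate, with made-up table and column names; the point of the architecture here is that each step could instead be maintained incrementally in its own process.)

  SELECT count(*)
    FROM (SELECT room_id, max(sent_at) AS last_message_at   -- step 1
            FROM messages
           GROUP BY room_id) latest
    JOIN room_user ru ON ru.room_id = latest.room_id
   WHERE ru.user_id = :user
     AND latest.last_message_at > ru.read_timestamp;        -- steps 2 and 3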

One of the things I really like about the multi-process architecture is the diversity of solutions we should have access to here. Right now your data lives in one vendor's system. It's really hard to use a map-reduce job if your primary data store doesn't support it. And map reduces might be too complicated for simple workloads - so with small data sets you might want something simpler. With this architecture different vendors could provide different indexing systems and you could swap them out as your requirements / data size change.

Finally, as for consistency, I've got a bunch of ideas of different ways to make it work. It's not so bad because it's not a generic graph or consensus problem. It's just a DAG. You can give each primary store a unique name and version using vector clocks. (If each primary store is itself distributed that's fine - just use whatever versioning system the primary store uses as the value for that database in the vector clock.) Then when a query comes in that needs to hit both an index and a primary store, pick the most recent database version and query asking for the result set at that version. You'll need some MVCC to make it work - but that's a solvable problem. If the named version is too new for the index, the query stalls until the index sees the corresponding update from the primary store. If the version is too old (and it has expired out of the MVCC working set) then we punt back to whatever is running the query and it retries. The whole system is much simpler than a distributed consensus problem because each node in the DAG can be thought of as authoritative, even if they're individually implemented as a small redundant set.


If I've understood correctly, it feels like what the author wants is a DSL for event-driven FRP[0]. If so, I'm 100% on board. It's something I've wanted for a long time.

I've found it easy enough to implement a naive version. The tricky part, unsurprisingly, is making it performant.

[0] https://en.wikipedia.org/wiki/Functional_reactive_programmin...


Author here. Yep, I basically want my database to be a giant Excel spreadsheet that includes formula cells (a.k.a. denormalized fields).


To some degree, this looks like premature optimization. That query (on the normalized tables), with the proper indexes and decent hardware, should be plenty fast, even with millions of rows. I would potentially start my optimization by considering if messages could be 'archived' to other tables or considering how long I could cache the query results vs. how often the client would poll.

I've seen small-medium sized systems go way overboard with complicated mechanisms for handling performance problems that they don't, and may never, have.


You have to grant blog posts about issues that only appear at scale a certain amount of leeway with their examples. In the wild, if you're writing normalized schemas, traversing 5 tables isn't that difficult and traversing a couple dozen comes up more often than you'd like. It's not hard to run past what a query optimizer can handle.


So I agree, premature optimization is often unnecessary.

But let's be crazy for a minute. How much CPU/RAM/disk IO does that query on normalized tables end up using? How much electricity does that query use?

Maybe from a pure efficiency perspective, it takes less energy to query the pre-computed value, than to constantly be recalculating it on every query.


Craziness comes later, usually a few years after several developers have added non-normalized tables and it has all grown organically, with multiple apps depending on the data.

When it comes to an RDBMS schema, you want it optimized and enforced from day one, or the price paid later will be significant.


One technique that can be useful is imposing domain-specific restrictions, then passing responsibility down the chain.

For example, most people only have a handful of recent conversations; can you get away with ignoring distant-past conversations when it comes to read/unread marking? If so, you could send recent normalized data to the frontend and let it derive the unread state client-side. React has some nice tools for efficiently deriving state from server-sent data.

This isn't always the best solution, but it's something to consider.


Unless the amount of data you need to transmit is large, I always prefer that approach as it enables other UI features without any additional network trips by the client. Also, it lets one use the same mechanism for multiple uses (in this case, get latest messages), so it adds less complexity to the app.

In the article's example application, you can transmit the top X oldest messages after the last seen timestamp for each room, then have the client derive the unreadness of each room. You can then use that same info to do things like preview the oldest unread message for each room and hide latency when opening a room, as you already have a screen's worth of messages in hand. One doesn't need to get more recent messages, as UIs generally don't have room to display large numbers and should switch to a "lots" indicator.


Sure, denormalize working storage in the app, fight to keep the "system of record" / "operational data store" clean.


Hmm. So just make a request for "(my) messages since timestamp-x" from a client, build the join-query between "message rooms for me" and "messages since timestamp-x" on the server, then send a list of messages back to the client for the client code to sort out?

That actually sounds good enough, as far as I can see: sometimes the client code pulls up 3 days (or max-N rows, or whatever) worth of messages when started (no big deal for me, might be for my daughter...), but usually only asks for new messages in the last minute (or whatever polling interval), after which the database and any application/middleware server are left alone.


Smart people already have developed general solutions to this problem- This problem is the main reason "graph queries" were independently invented by both Facebook and Netflix.

See here for an exact solution to the problem posed in OP, using the OmNext graph query language: https://github.com/omcljs/om/wiki/Components,-Identity-&-Nor...


"Smart people already have developed general solutions to this problem"

Yes, they did, 50 years ago :) The article behind link refers to several-decades-old concepts of identity and normalization. That was not invented by Facebook or Netflix.


META:

When did we all start saying "performant", instead of "fast [enough]"? Was it about 5 years ago??? Maybe 10?

English, like SQL, is passé.


I hate the word performant. People may argue that "fast" is a simpleton of a word, however "performant" is ambiguous as to what characteristic is performing well.


I can see why the normalized approach is difficult to make performant, but it seems much simpler.


You're correct on both counts. A normalized approach is simpler, and is more difficult to make performant. So you start there, but as you scale, it becomes more and more untenable for your users as your latency goes up the wazoo.

Thus, eventually your app hits a scale where you have to rewrite your code to explicitly denormalize things and maintain the denormalized views yourself. This makes any feature addition/backend update horrendously complicated, and everyone's life difficult. You slowly start losing consistency which causes "glitches in the matrix" for your users. To stop that you start adding in transactional semantics yourself at the application level, checking and cross-checking everything, and doing abort/restarts, and then finally, in an analogue to Greenspun's tenth rule[1], your application starts to contain an ad hoc, informally-specified, bug-ridden, slow implementation of a distributed database.

I think there's a CircleCI blog post parody hidden inside this comment, but I'm not clever enough to materialize it[2].

[1]: Any sufficiently complicated C or Fortran program contains an ad hoc, informally-specified, bug-ridden, slow implementation of half of Common Lisp., https://en.wikipedia.org/wiki/Greenspun%27s_tenth_rule

[2]: A reference to https://circleci.com/blog/its-the-future/


Broadly: Normalised data is more performant on write, because for each piece of state, there will be one and only one place to store it.

It's less performant on read because you are constructing logical deductions out of a lattice of tiny pebbles of truth. It takes time to find, filter, sort and project those.

Denormalisation "works" by skipping to a read-optimised schema for those applications where reads greatly outnumber writes. Which is very many.

As an aside, "denormalised" doesn't have to mean "unstructured chaos". Do As Thou Wilt isn't necessarily the law of denormalising. There are well-studied, widely-used denormalisation techniques under the heading of "dimensional modelling".

It's just that most of these discussions are dominated by web folk, not data warehouse folk.

The shoutout to Event Sourcing is promising, given the interaction with the CQRS pattern. In short: read schema and write schema have different demand and performance characteristics. So split them up.


I don't know that you can always conclude that normalized data is more performant for writes.

Take for example the case where your normalized write consists of 5 insert statements into 5 tables, instead of 1 insert into 1 table that you'd do with a denormalized write.


As always, consult your doctor, accountant, lawyer and profiler.


What we need to solve this is a cell aggregation dependency function in the database.

Where we define a column and a query that will be used to define the values in the cells of the column. The database would use its planner to figure out what the dependencies are and update the cell every time one of the dependencies changes.

That dependency thing is very complex but very necessary. If you don't have it, you might as well just make a normal view.


Author here. Yep, I think you're getting to the heart of what characterizes a "denormalization engine".


What are the tradeoffs of making the data denormalization happen in database triggers? Fundamentally, you want to update certain fields whenever certain other fields get changed, and that's exactly what database triggers do. I'm not enough of a database wizard to know if there's good reasons why this isn't more popular, though.


It's not very efficient if you have a large number of triggers. There was an early research prototype called Triggerman (http://ieeexplore.ieee.org/document/754942/) which attempted to scale triggers, but most implementations are still clunky (probably because of the transactional guarantees required by triggers).

Materialized views (which the article talks about) are a related concept, but only a few databases really support them.

Both of these are also related to the work on 'Stream Databases', where the problems of efficient recomputation are central. There is a ton of work on this in academic research (http://www.springer.com/gp/book/9783540286073), but it hasn't been able to gain a foothold in the data management market (several of the academic research projects transitioned into startups).


Too many people argue "code doesn't belong in the database", but honestly, with everyone circling around to FRP it makes me think maybe stored procedures and database triggers are going to make a comeback. I never really bought into the whole "but you can't version stored procedures!" bit because that's precisely what schemas and the search path in PostgreSQL are for (triggers are another matter entirely, but I avoid those whenever possible - though it's exactly what I would use for a task like this).


Triggers are synchronous and not much different from the application updating this value when the underlying data is updated, except for the reliability: whenever the data is updated, the trigger is guaranteed to fire.

Postgres triggers are row by row so depending on your use cases they can be quite slow.

Ideally I'd cache all these values and invalidate the global cache through the application. This would require one set of services or each set of services managing their own cache. Anytime a room is updated, boot and reload the cache for that room.

Let the db stay normalized.


Materialized views were invented for moving computations into the updates.

No triggers needed.

Not sure which databases support it, though.


There's a lot of confusion in this article. The author doesn't seem to be able to distinguish the logical view of data (in which we distinguish between primitive and derived data) from its physical representation (in which derived data may be physically stored redundantly for efficiency reasons).


ProxySQL has a mirroring feature that can act like a trigger on a separate thread: https://github.com/sysown/proxysql/blob/master/doc/mirroring... Add a cronjob to refresh results if you are worried about skews over time. Problem solved.


I identify with the problems this article eloquently describes. I think GraphQL could be a first piece of the solution.


I'm a fan of GraphQL, but that's like saying "object-oriented programming is the solution to people leaving variables on the heap after they go out of scope". No, the solution is managed memory. OOP is a higher level of abstraction that happens to work better with managed memory.

See the analogy?


One way to solve this particular problem of "number of unread rooms" is not to persist it in the database at all. Send push notifications of new messages and have the client app compute it client-side.


Wasn't this essentially what CouchDB did?


Can someone explain why you wouldn't use a "read_at" flag on the message entity in this example?


which User read the Message?


Yup, that's the job!


I very much enjoyed reading this article, it explains a lot of the pain and calls out a lot of the current broken designs.

While the conclusions are correct, I feel like they could be simplified. If you try to use the vocabulary of the broken system you will get complicated phrases like "denormalization engines" (which I very much hope does not catch on despite agreeing with it).

Instead, if you look at correct systems - like Elm, Rx, Datomic.com , or https://github.com/amark/gun (my own, obviously I am biased) you get less complicated ideas:

- Event Driven.
- Data Streams.
- Data Flow.
- Data Pipeline.
- Reactive Programming.

All of these, despite "FRP" itself being jargony, solve the majority of the article's complaints. Largely because the solution he comes to at the end is the same thing as FRP.

Don't we wish we had all listened to Alan Kay now?



