Reducing database queries to a minimum with DataLoaders

nwsm · on March 26, 2021

In C# this would just be called a Repository pattern, and often you would handle caching externally, in a datatstore separate from your database or at the request layer (though there are obviously mature in-memory cache options).

I haven't used GraphQL much but things like this make it seem that it's not a great general use case framework- if you don't really need it, save yourself from a bunch of extra logic in your application code.

beaconstudios · on March 26, 2021

DataLoaders are only necessary in GraphQL because GQL doesn't by default traverse relations with a map operation. The N+1 query problem crops up when your traversal from, say, User to [Tweet] then needs to be executed N times for N users. Ideally the transition would be programmed for [User] => [[Tweet]] and you could use a WHERE IN query. Data Loaders do this, but in a hackier way that relies heavily on NodeJS' implicit async handling using the event loop.

tveita · on March 27, 2021

In my experience the dataloader approach is simpler once you need even a modest amount of logic to decide what should be loaded. Basically a dynamic vs. static scheduling decision.

E.g. you have IDs for a heterogeneous collection of items, all of them have an owning user, some need additional lookups in type specific tables, some have comments which again have an attached user each. Each lookup should be done towards a cache like Memcached and then a SQL database for any entries not in cache.

If you want to batch those cache calls together as best possible you end up with some fairly ugly code. But with dataloader you can write nice and sequential code dealing with a single item at a time - try loading the item from cache, if not try the database, if found load related data, if not return not found.

exogen · on March 26, 2021

One nice thing about GraphQL though is that you have the context of the whole query available in resolvers, parsed into a tree (like an AST) for you. You're right that the most obvious/default approach would lead one down the path of making an N+1 query in this scenario, but it's really up to the GraphQL resolvers to decide how to approach it.

In some projects of mine I've taken advantage of this extra available context to make the most efficient bulk database query or REST API call based on the full query, instead of just the subset of input typically used by the resolver.

tehlike · on March 26, 2021

Postgraphile and hasura kind of addresses that using json_agg in a neat way. Give it a try.

beaconstudios · on March 26, 2021

thanks, I'll check it out! In hindsight, hasura might have been a good time investment over rolling my own graphQL service.

tough · on March 26, 2021

What did you use to roll your own? I've only done typegraphql-typeorm stuff before, trying now with prisma-nexus

beaconstudios · on March 27, 2021

i'm back too late to edit; I'll just double-reply. These two functions take a foreign key and a function to retrieve the data you're after and return you a dataloader with the correct indexing applied. Quite nifty, and you'd need the equivalent to both of these eventually.

    const oneToOne = (foreign, fn) => new Dataloader(async (keys) => {
      const results = await fn(keys);
      const index = {};
      for(const result of results) {
        index[result.get(foreign)] = result;
      }

      return keys.map(key => key in index ? index[key] : null);
    });

    const oneToMany = (foreign, fn) => new Dataloader(async (keys) => {
      const results = await fn(keys);
      const index = {};
      for(const result of results) {
        const foreignKey = result.get(foreign);
        if(!(foreignKey in index)) {
          index[foreignKey] = [];
        }

        index[foreignKey].push(result);
      }

      return keys.map(key => key in index ? index[key] : []);
    });

tough · on March 27, 2021

Thanks for such thorough answer! Will keep in mind when looking into dataloader,

looks nice.

beaconstudios · on March 26, 2021

I use bookshelf js for reading and knex for writing. The bookshelf models are a hangover from my previous REST API; I'd just use knex for reading now as a nicer syntax sugar over SQL.

I also have a couple of small helper functions for generating data loaders; I'll edit this comment and post them in a min when I'm back on my laptop.

mikewhy · on March 26, 2021

> It is important to emphasise that this learning does not only apply to databases. Imagine there was a RESTful service that had some sort of rate limiting for an API. Calls to this could be abstracted into a DataLoader

GraphQL isn't exclusive to databases.

beaconstudios · on March 26, 2021

neither is what I was saying, only the part about using a WHERE IN. You would hope to have the equivalent batch requesting on a REST service as well.

trollied · on March 26, 2021

I often find things like this to be workarounds for architectural flaws. For example, the 2rd part of the article looks like it would be better suited to a traditional ETL pipeline, with a staging table & upsert/merge in the database.

hinkley · on March 26, 2021

I wonder if there’s some HTTP3 style transform we can do where requests that show up in clumps get sent to the backend together after the ones that have already been cached are culled.

Particularly with normalized data, I’ve found it quite common that 80% of my data for some workflows is highly predictable but that last 20% that makes the interaction unique either causes the request to become whole cloth or results in very complicated code, sometimes with very bad (worse than nothing) cold cache behavior.