How to level up beyond ETLs (ezzeriesa.com)
84 points by kurinikku on Sept 6, 2023 | 34 comments


Not sure if it’s just me, but the whole mention of the Stitch Fix blog post and ‘democratising data pipelines’ really threw me off.

> The idea here is the engineering team creates ‘Lego blocks’ that consumers then assemble into end-to-end data pipelines.

Might be my personal biases, but it doesn’t really seem to set the scene for the rest of the content?

The rest of it was good-enough common sense - a “non-exhaustive list of stuff to always be working on” as a data engineer, or something along those lines.


The best-practices book definitely changes quickly in a new area; I can imagine a new / different set emerging on the back of data pipelines for generative models.


Isn't this very, very basic? I'm not seeing the "beyond" element; it's more like "How to level up to almost not zero".


Fair point. Perhaps the title could be 'How to keep things interesting when you're not happy serving customer requests all day'. On the flip side, I'm not suggesting that you don't serve customer requests either; it's more of a way to keep your head up while doing the day-to-day.


I was a bit confused myself.

Then I realised this is probably a good reference for “things to think about” as a decent-ish data engineer.


This is a joke. You don’t “level up beyond ETL” by speeding up SQL queries. In fact, there’s pretty much nothing about ETL in that article.


Thought this too. I liked the article; it had some interesting points. But I never found the ETL part. Kept wondering when I'd scroll and hit the ETL section.


Thanks! There's ETL in the traditional sense of moving production data to the data warehouse (hence the 'reverse ETL' moving data back into prod), as well as the term increasingly being used for data pipelines in general.

On the former, we had custom tooling at the time of writing but now use GCP Dataflow (which I can say I'm rather partial to).


Maybe somebody trying out GPT for generating blog content? :mewonders:


Perhaps start by explaining what ETL stands for and what it is.


It seems like the article was written by someone just starting to get into the data engineering subfield: they thought they were going to be writing Python (pyspark is my guess) to support some kind of ML effort, but they got saddled with a bunch of SQL/data-warehousing stuff to support business intelligence/analytics instead. I'd say what you suggest normally makes sense, especially when you're pulling in abbreviations that are not related to the topic at hand or you're introducing new people to the field. But ETL is a pretty basic concept in data engineering and it's a web search away (it should be the top result), so I'm not sure starting with definitions would really add all that much to their article.

It sounds to me like the author got thrown to the wolves in an environment that looks like what data engineering was before "big data" and ML took off (and before it was even really called data engineering). There are a lot of enterprises still working in this mode because they are not Google and don't have the same level of sophistication and automation when it comes to this stuff.

There is some bad information no doubt in the article, but if we're being charitable, it feels like it's someone who took a wrong turn somewhere and is struggling to find their feet in an unfamiliar place without proper guidance and mentorship. It's at least a bit admirable that they're trying on their own.

The article has no direct bearing on ETL, aside from the focus on SQL queries and data validation, which hints that they might be talking about ELT (Extract-Load-Transform) as the level beyond ETL - but it's not clearly explained. It's clear to me that they are at the start of their journey and they are gonna learn things the hard way without guidance from someone more experienced.


> There is some bad information no doubt in the article

Could you share more specific details? Happy to look over / revise where needed.

More broadly, there's the issue of the gap between what you think the role is and what the role actually is once you join. There are definitely cases where this is accidental. The best way I can think of to close the gap is maybe to do a short-term contract, but that may be challenging under time constraints etc.


> Could you share more specific details? Happy to look over / revise where needed.

Sure thing! I'd say first off, the solutions may look different for a small company/startup vs. a large enterprise. It can help if you explain the scale you are solving for.

On the enterprise side of things, they tend to buy solutions rather than build them in-house. Things like Informatica, Talend, etc. are common for large enterprises whose primary products are not data or software related. They just don't have the will, expertise, or the capital to invest in building and maintaining these solutions in-house so they just buy them off the shelf. On the surface, these are very expensive products, but even in the face of that it can still make sense for large enterprises in terms of the bottom line to buy rather than build.

For startups and smaller companies, have you looked at something like `dbt` (https://github.com/dbt-labs/dbt-core)? I understand the desire to write some code, but often there are already existing solutions for the problems you might be encountering.

ORMs should typically only exist on the consumer side of the equation, if at all. A lot of business intelligence folks / business analysts are just going to use tools like Tableau and hook up to the data warehouse via a connector to visualize their data. You might have some consumers that are more sophisticated and want to write some custom post-processing or aggregation code, and they could certainly use ORMs if they choose, but it isn't something you should enforce on them. It's also a poor place to validate data since, as mentioned, there are different ways/tools to access the data, and not all of them are going to go through your Python SDK.

Indeed, in a large enough company you are going to have producers and consumers using different tools and programming languages, so it's a little bit presumptuous to write an SDK in Python there.

Another thing to talk about, and this probably mostly applies to larger companies - have you looked at an architecture like a distributed data mesh (https://martinfowler.com/articles/data-mesh-principles.html)? This might be something to bring to the CTO rather than try to push for yourself, but it can completely change the landscape of what you are doing.

> More broadly, there's the issue of the gap between what you think the role is and what the role actually is once you join. There are definitely cases where this is accidental. The best way I can think of to close the gap is maybe to do a short-term contract, but that may be challenging under time constraints etc.

Yeah, this definitely sucks and it's not an enviable position to be in. I guess you have a choice: look for another job, or try to stick it out with the company that did this to you. It's possible there is a genuine existential crisis for the company and a good reason why they did the bait-and-switch. Maybe it pays to stay, especially if you have equity in the company. On the other hand, it could also be the result of questionable practices at the company. It's hard to make that call.


Perhaps the first thing I’d clarify is that not all the ‘bad’ things described happened to me personally, and for the ones that did, I employed artistic licence in the recollection.

We did start integrating dbt towards the end of my time in the role. Our data stack was built in 2018, so a fair bit of time before data infra-as-a-service became a thing. The idea was that dbt would help our internal consumers self-serve more easily. That said, I did see complaints about dbt pricing recently; as they say, there’s no free lunch.

Re: ORMs, I respectfully disagree. I’ve come across many teams that treat their Python/Rust/Go codebase with ownership and craft; I have not seen the same said about SQL queries. It’s almost like a ‘tragedy of the commons’ problem - columns keep getting added, logic gets patched, more CTEs get layered on to abstract things out, but in the end it all adds to the obfuscation.

ORMs don’t fix everything, but they do help constrain the ‘degrees of freedom’, keep logic repeatable and consistent, and are generally better than writing your own string-manipulation functions. An idea I had that I’ve continued exploring (I wrote the post early last year) was to use static analysis tools like Meta’s UPM to allow refactoring of tables / DAGs (keeping interfaces the same but with ‘flatter’ DAGs and fewer duplicate transforms).
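To make the point concrete, here’s a toy sketch of what I mean, using SQLAlchemy Core as the query builder (the table and column names are made up for illustration) - filters get composed from typed expressions instead of spliced strings:

    from sqlalchemy import (Table, Column, Integer, String, MetaData,
                            create_engine, select)

    metadata = MetaData()
    # Hypothetical events table
    events = Table(
        "events", metadata,
        Column("user_id", Integer),
        Column("event_type", String),
        Column("country", String),
    )

    def events_query(event_type, country=None):
        # Compose the query from typed expressions; no string concatenation.
        query = select(events).where(events.c.event_type == event_type)
        if country is not None:
            query = query.where(events.c.country == country)
        return query

    engine = create_engine("sqlite:///:memory:")
    metadata.create_all(engine)
    with engine.connect() as conn:
        print(conn.execute(events_query("signup", country="SG")).fetchall())

The win is that every call site goes through the same vetted building blocks, so the ‘degrees of freedom’ stay bounded.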

Interestingly enough, I currently work in ML and am impressed to see how much modeling can be done in the cloud compared to my earlier stint in the space (which had a dedicated engineering team focused on features and inference). On the flip side, I similarly see an explosion of SQL strings, some parts handled with more care than others.

I’ve not looked into a data mesh, but a friend did mention pushing his org to embrace it - note to self to follow up and see how that’s going. Looks like there are a couple of ‘dimensions’ to it; my broader take is that keeping things sensible is both a technical and an organizational challenge.

I look forward to future blog posts on ‘how we refactored our SQL queries’, maybe there’s a startup idea there somewhere.


> Re: ORMs, I respectfully disagree. I’ve come across many teams that treat their Python/Rust/Go codebase with ownership and craft; I have not seen the same said about SQL queries. It’s almost like a ‘tragedy of the commons’ problem - columns keep getting added, logic gets patched, more CTEs get layered on to abstract things out, but in the end it all adds to the obfuscation.

> ORMs don’t fix everything, but they do help constrain the ‘degrees of freedom’, keep logic repeatable and consistent, and are generally better than writing your own string-manipulation functions. An idea I had that I’ve continued exploring (I wrote the post early last year) was to use static analysis tools like Meta’s UPM to allow refactoring of tables / DAGs (keeping interfaces the same but with ‘flatter’ DAGs and fewer duplicate transforms).

I get what you're saying, but think about a large org with a lot of different teams and heterogeneous data stores - it's gonna be pretty hard to implement a top-down directive telling everyone to use such-and-such ORM library, or to ensure a common level of ownership and craft. This is where SQL is the lingua franca: it's usually the native language of the data stores themselves and a common factor between most/all of them. This is also where tools like Trino / PrestoSQL can come in and provide a compatibility layer at the SQL level, while also providing really nice features such as joins across different kinds of data stores, query optimization, caching, access control, and compute resource allocation.
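As a rough sketch of what that looks like in practice (using the trino-python-client package; the host, catalogs, and table names here are all made up), a single query can join tables living in two entirely different stores:

    import trino  # trino-python-client

    conn = trino.dbapi.connect(host="trino.example.com", port=8080, user="analyst")
    cur = conn.cursor()
    # One SQL dialect; the two tables live in different backing stores.
    cur.execute("""
        SELECT u.id, count(*) AS order_count
        FROM postgresql.public.users AS u
        JOIN hive.sales.orders AS o ON o.user_id = u.id
        GROUP BY u.id
    """)
    print(cur.fetchall())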

In general it's hard to get things to flow "top down" in larger orgs, so it's better to address as much as you can from the bottom up. This includes things like domain models - it's gonna be tough to get everyone to accept a single domain model, because different teams have different levels of focus and granularity as they zoom into specific subsets, so they will tend to interpret the data in their own ways. That's not to say any of them are wrong; there's a reason the whole data lake concept of "store raw unstructured data" came in, where the consumer enforces a schema on read. This gives them the power to look at the data from their own perspective and interpretation. The more interpretation and assumptions you bake into the data before it reaches the consumers, the more problems you tend to run into.

That's not to say that you can't have a shared domain model between different teams. There are, unsurprisingly, products out there that give the enterprise the capability to collaboratively define and refine shared domain models, which can then be used as a lens/schema to look at the data. Crucially, the domain model may shift over time; decoupling it from the actual schema of the stored data allows the model to evolve without having to go back and fix the stored data, because no assumptions or interpretations have been baked into the stored data itself.


Extract Transform Load


This is correct from my understanding, and glancing over the article and links, it is precisely what they're talking about.

It's the pipeline of taking data from one system, transforming it, and loading it into another. There is a whole industry of software products that facilitate this, such as MuleSoft. It is referred to as ETL.
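In Python terms, a toy version might look like this (the file, table, and column names are made up):

    import csv
    import sqlite3

    # Extract: pull raw rows out of a source system (here, a CSV export).
    with open("orders.csv") as f:
        rows = list(csv.DictReader(f))

    # Transform: clean and reshape the data for the target schema.
    cleaned = [
        (r["order_id"], r["customer_id"], float(r["amount"]))
        for r in rows
        if r["amount"]  # drop rows with a missing amount
    ]

    # Load: write the transformed rows into the destination store.
    conn = sqlite3.connect("warehouse.db")
    conn.execute("CREATE TABLE IF NOT EXISTS orders"
                 " (order_id TEXT, customer_id TEXT, amount REAL)")
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", cleaned)
    conn.commit()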

I was surprised to see your succinct correct answer greyed out.


Thanks for your explanation.

Maybe it was greyed out because just expanding the acronym is not very helpful if the words alone don't make it clear what the thing is - like in this case.


Right, though the OP article and the article linked inside it both did not define or describe ETL. If it's in your title, please explain it; I was lost.

Acronyms that aren't defined only hurt.


Writing better code does not solve the listed problem of low context.

"The tricky part is when context goes ‘over the wall’ - the consumer finds an issue with the data but the producer is not familiar with the domain to see it’s a problem. Over time, information transfer between the two sides becomes the bottleneck and issues pile up."

Data engineering teams tend to have the least context and the most responsibility to fix everything.


It can often feel like a thankless job :cry:


What an honest and refreshing perspective! I loved the quote from Donald Knuth:

    If you find that you’re spending almost all your time on theory, start turning some attention to practical things; it will improve your theories. If you find that you’re spending almost all your time on practice, start turning some attention to theoretical things; it will improve your practice.


Totally. That being said, I'm in the 'soft skills' phase now :joy:


Personally, I've never seen SQL unit tests/asserts demonstrate value in practice. The vast majority of tests are simple uniqueness or non-null tests that would only catch the sloppiest mistakes. I feel like it's emblematic of a Data Engineering culture that is more interested in the Engineering part than the actual Data products they are supposed to be building.


> Personally, I've never seen SQL unit tests/asserts demonstrate value in practice.

Do you mind expanding why? As someone currently looking for ways to integrate standard software engineering practices into (analytical) SQL code, I am curious about your reasoning.

> The vast majority of tests are simple uniqueness or non-null tests that would only catch the sloppiest mistakes

But don't they still happen? Perhaps they're not useful now, but they may reveal clear problems when the original implementation is stretched in ways that weren't anticipated -- which is a common argument in favor of testing.

To add to this, I've been keeping an eye on sqlmesh, and I like the distinction they make between testing [1] (asserting logic) and auditing [2] (validating expectations on the data). Their built-in audits go beyond "not null" and "unique", including statistical checks, for example.

[1] https://sqlmesh.readthedocs.io/en/stable/concepts/tests/

[2] https://sqlmesh.readthedocs.io/en/stable/concepts/audits/
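To make the distinction concrete - this is just a plain-Python sketch of the two concepts, not sqlmesh's actual API - a test asserts the transform logic against fixed mock inputs, while an audit evaluates expectations against the real data after each run:

    import sqlite3

    def transform(conn):
        # The logic under test (hypothetical): de-duplicate users,
        # keeping the earliest created_at per user_id.
        return conn.execute("""
            SELECT user_id, MIN(created_at) FROM raw_users
            GROUP BY user_id ORDER BY user_id
        """).fetchall()

    def test_transform_deduplicates():
        # "Test": fixed mock input, assert on the exact output of the logic.
        conn = sqlite3.connect(":memory:")
        conn.execute("CREATE TABLE raw_users (user_id INTEGER, created_at TEXT)")
        conn.executemany("INSERT INTO raw_users VALUES (?, ?)",
                         [(1, "2023-01-02"), (1, "2023-01-01"), (2, "2023-01-03")])
        assert transform(conn) == [(1, "2023-01-01"), (2, "2023-01-03")]

    def audit_user_id_not_null(conn):
        # "Audit": an expectation checked against the real table.
        bad = conn.execute(
            "SELECT COUNT(*) FROM raw_users WHERE user_id IS NULL"
        ).fetchone()[0]
        assert bad == 0, f"{bad} rows with NULL user_id"

    test_transform_deduplicates()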


Sure thing. Thinking back, the "data issues" I've encountered roughly fall into two buckets - something changed in the source data, and something broke in the code.

My first reservation with tests is that the vast majority of issues fall into the first bucket. Audits are a solution to the source-data-change problems (though trying to anticipate the myriad ways in which source data can change feels almost impossible, such that audits end up becoming more of a "let's not fuck it up THAT way again"), but Tests are not. Yet writing and debugging tests takes as much or more effort.

My other reservation is that, generally, I see a lot of VERY basic tests that only check for a limited, very simple subset of possible errors. So even for data issues that fall into the latter bucket, the tests aren't catching anything. Things like adding a condition to a WHERE clause that references a column from a LEFT JOINed table, or using "= TRUE" instead of "IS TRUE", generally aren't things that test cases or asserts will catch (in my experience).
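For anyone who hasn't been bitten by the LEFT JOIN one: filtering in the WHERE clause on a column from the LEFT JOINed table silently turns the query into an INNER JOIN, because the NULLs from unmatched rows fail the predicate. A tiny repro (hypothetical tables, sqlite for convenience):

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE users (id INTEGER, name TEXT);
        CREATE TABLE orders (user_id INTEGER, status TEXT);
        INSERT INTO users VALUES (1, 'ann'), (2, 'bob');  -- bob has no orders
        INSERT INTO orders VALUES (1, 'shipped');
    """)

    # Intended: all users, with their shipped orders where present.
    ok = conn.execute("""
        SELECT u.name, o.status FROM users u
        LEFT JOIN orders o ON o.user_id = u.id AND o.status = 'shipped'
    """).fetchall()

    # Bug: the WHERE predicate drops bob entirely - an INNER JOIN in disguise.
    bug = conn.execute("""
        SELECT u.name, o.status FROM users u
        LEFT JOIN orders o ON o.user_id = u.id
        WHERE o.status = 'shipped'
    """).fetchall()

    print(ok)   # [('ann', 'shipped'), ('bob', None)]
    print(bug)  # [('ann', 'shipped')]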

I understand that ensuring code quality requires redundant layers of checks, so tests should act as an additional safety net for developer skill and code reviews. But I also think that teams have limited bandwidth, and I would prefer efforts be focused on higher ROI activities than writing tests.*

*Which is not to say I write SQL without testing it. I prefer to think of each change I make as requiring a unique set of checks, based on what could go wrong with the specific changes I'm making, and manually test outputs accordingly - i.e. running with old and new versions of the code and comparing to see if row counts change (or don't change) as expected.


I don't disagree with anything you said, but I think you are underindexing on the importance of being a resilient data organization. Few things reduce customer trust like having the same error happen multiple times, especially when consumers are the ones identifying the problem.

Additionally, changes in the source data are quite common and very hard to detect. I agree that changes in code are another source of the issue, but that is usually under the control of the data team, which is a totally different type of error compared to things outside your locus of control.

Automated statistical tests don't work because variance is common even in stable data streams; these add too much noise. It's critical for an organization to be able to patch a hole after it's been found, and this is where I think SQL tests shine. They need to be really easy to implement and monitor, but they serve a critical purpose for ensuring continued operational quality and downstream trust.


Hm I think you and I might actually be saying the same thing, just using different terms? Tests as I understand them (based on reading the article shared by the person I was responding to[1]) run on mock data, rather than real source data. Those are the ones that I feel have limited ROI.

I definitely agree there is a lot of value in tests that check against actual source data (which that person's link refers to as audits[2]).

[1] https://sqlmesh.readthedocs.io/en/stable/concepts/tests/

[2] https://sqlmesh.readthedocs.io/en/stable/concepts/audits/


Ah yes, I think we are in agreement :). Understanding terms of art becomes harder every year - the differentiation between tests and audits makes sense but adds so much friction as a noob that I wish they had just stuck to prior art.


> clear problems when the original implementation is stretched in ways that weren't anticipated -- which is a common argument in favor of testing.

On this point in particular, I'm not sure this happens very often in data engineering (especially with data transforms), since I don't think tables experience the type of scope creep that app components or APIs might. Once a table is shipped, almost every subsequent change is either adding/removing columns (which I think should be written in a way that makes it impossible to change the grain), fixing bugs (in which case tests are not relevant), or internally refactoring for performance (tests can help, but usually only cover very basic issues).

The latter case is actually one where I think automated, generic testing would be helpful, but I'm not aware of any existing tools that make it easy. Ideally, I would want a test suite to run new and old versions of the code in parallel, and confirm that outputs are unchanged.
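The core of it seems simple enough to hand-roll, though; a rough sketch (the queries here are stand-ins for old and new versions of a transform):

    import sqlite3

    OLD_SQL = "SELECT user_id, SUM(amount) AS total FROM orders GROUP BY user_id"
    NEW_SQL = "SELECT user_id, SUM(amount) AS total FROM orders GROUP BY 1"

    def assert_refactor_preserves_output(conn):
        # Run both versions against the same inputs and diff the outputs.
        old = conn.execute(OLD_SQL).fetchall()
        new = conn.execute(NEW_SQL).fetchall()
        assert len(old) == len(new), f"row counts differ: {len(old)} vs {len(new)}"
        assert sorted(old) == sorted(new), "row contents differ"

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (user_id INTEGER, amount REAL)")
    conn.executemany("INSERT INTO orders VALUES (?, ?)",
                     [(1, 5.0), (1, 2.5), (2, 9.0)])
    assert_refactor_preserves_output(conn)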


Totally. Wouldn't recommend all queries have it, but there are some critical tables that should.


We got a lot of value using SQL unit tests on very complex queries with lots of edge cases. There was one which involved breaking up events into sessions (a session being a single uninterrupted 'seating' using the product, an interruption being a gap of 30 minutes or more). Re: edge cases, we'd check that sessions are broken up as expected when the user changes, when the browser changes, when a session goes over midnight UTC, etc.
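For a flavour of the logic under test, here's a simplified sketch (not our actual query - column names are made up and the midnight-UTC handling is omitted): a new session starts after a 30-minute gap, and user/browser changes split sessions via the partitioning.

    import sqlite3

    # Simplified sessionization: sessions break on a 30-minute (1800s) gap;
    # partitioning by user and browser splits sessions on those changes too.
    SESSIONIZE_SQL = """
    WITH flagged AS (
        SELECT user_id, browser, ts,
               CASE WHEN LAG(ts) OVER w IS NULL
                         OR ts - LAG(ts) OVER w > 1800
                    THEN 1 ELSE 0 END AS new_session
        FROM events
        WINDOW w AS (PARTITION BY user_id, browser ORDER BY ts)
    )
    SELECT user_id, browser, ts,
           SUM(new_session) OVER (PARTITION BY user_id, browser ORDER BY ts)
               AS session_id
    FROM flagged
    """

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE events (user_id INTEGER, browser TEXT, ts INTEGER)")
    conn.executemany("INSERT INTO events VALUES (?, ?, ?)",
                     [(1, "firefox", 0), (1, "firefox", 600),  # same session
                      (1, "firefox", 9000),                    # new session: >30 min gap
                      (2, "chrome", 100)])                     # different user
    for row in conn.execute(SESSIONIZE_SQL):
        print(row)  # each event tagged with a per-user/browser session number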

The table was ~250mm rows per day. The alternative would be doing it in memory; Python unit tests do feel a bit more natural, but for a table that size we decided it was best to do it on disk and let the DBMS deal with it. Perhaps Spark is an idea, but then there's the trade-off of customers losing context.

It also helped us 'refactor with confidence' - we can happily change the query to incorporate new use cases while knowing the core logic is still sound.


"While offering to rewrite the stack in Rust might be hard sell"

... proceeds to rewrite the stack in Python.


Once that's done, we'll bump it up to Mojo next!



