Hacker Newsnew | past | comments | ask | show | jobs | submitlogin
Sqlfluff the SQL Linter for Humans (sqlfluff.com)
210 points by rbanffy on Oct 6, 2021 | hide | past | favorite | 92 comments


There surely is a need for a sql linter, but if it wants to be strict about syntax style, it should better familiarize itself with not-noob styles first.

Eg. some people like to align the clauses vertically and use a leading, not trailing, comma.

    select a
         , b
         , count(*) as count
      from table
     where cond1
       and cond2
     group by a, b
     order by count desc


Yes! I write SQL this way. I grant the comma thing is more a matter of taste, but aligning clauses vertically and maximizing rivers is super useful to me and it forever mystifies me that it isn't standard. This is an extreme example, but I like to keep join clauses vertically aligned as well:

    select a
         , b
         , count(*) as count
      from table
      join table2 on table.key1                    = table2.key1
                 and table.key2                    = table2.key2
                 and DATEADD('day', 1, table.key3) = table2.key3
     where cond1
       and cond2
     group by a, b
     order by count desc
      
For me this lets me read the SQL query almost like a picture at first glance, identifying blocks of rivers as semantically connected groups of statements. Some SQL formatters out there love to put join clauses at the same indentation level as the join itself, drives me nuts.


This looks nearly the same, but it's much easier to write and generates less spurious diffs:

    select a
        , b
        , count(*) as count
    from table
    join table2
        on table.key1 = table2.key1
        and table.key2 = table2.key2
        and DATEADD('day, 1, table.key3) = table2.key3
    where cond1
        and cond2
    group by a, b
    order by count desc
The largest difference is that the equal operators aren't aligned anymore, but on my opinion, aligning them is really not worth the cost.


Nice changes, I like it! I will say I really like aligning the join clauses, but I'll try writing things this way for a little while and see if it sticks.


The only thing I require from an SQL formatter is to make typos and boolean mistakes super easy to spot, especially as it grows in size. At first glance it might seem strange, but I like indentation on even the first term of an AND.

    SELECT
        a,
        b,
        count(*) as count
    FROM
        table
    JOIN
        table2
    ON
            table.key1 = table2.key1
        AND
            table.key2 = table2.key2
        AND
            -- 1 day apart
            DATEADD('day', 1, table.key3) = table2.key3
    WHERE
            cond1
        AND
            cond2
        AND
            (
                cond3
            OR
                cond4
            )
    GROUP BY
        (a, b)
    ORDER BY
        count desc,
        a asc
    LIMIT
        1000
    ;
Yes, that's a lot of whitespace, but it helps for complex queries.


Well, there's an inherent conflict between making individual queries easier to read and making your entire code easier to read.

How much whitespace you need is completely dependent on what exactly you are writing.


Do you format your code this way by hand or does your IDE help you? Also, I can only imagine how noisy your diffs would be when you make changes and have to indent everything.


I'll use PyCharm's SQL formatter sometimes to massage spaghetti code that someone else wrote, but mostly I just write it this way by hand.

I will say my use case is probably different than others, I'm a Data Scientist so I write a ton of throwaway SQL and stuff that will only be seen by me. If I'm dealing with an existing SQL codebase or production stuff, definitely try to keep it more easily diff-able (like left justifying the keywords, really only caring about rivers in the join/where clauses), or just follow whatever patterns are already there.


> it forever mystifies me that it isn't standard

I can give a few thoughts on why here:

- it doesn't work for collaborative efforts unless everyone is using the same linter or IDE with the same configuration. Not friendly to edits outside of that ecosystem (e.g. open source). - there are other ways of writing SQL (e.g. dbt style guide) which are still very readable while also being much faster to write by hand without the need for an auto-formatter. - some of these patterns such as the from and join indentations have nothing to do with readability and are purely stylistic. Nothing wrong with that, but there's nothing objectively better about this pattern - it just comes down to how you're used to reading SQL.


> and it forever mystifies me that it isn't standard

I can take a stab at justifying that - commas are used to form lists (eg, 1, 2, 3) and text (like "order by count") are supposed to be left justified.

You may be right that there are good reasons to format SQL while ignoring the commas and rules for writing English text. I personally think you are right. But if most people believed that then there'd be enough support to rewrite SQL with a decent modern syntax. Say allowing training commas before a from.

So the existence of SQL is, in a weird way, evidence that your view is uncommon.


That formatting makes my eyes bleed.


If SQL would just allow a trailing comma on the last column, none of this leading comma shit would be necessary. I still use trailing commas, because code readability is more important that writeability IMO, but the fact that I'm forced to sympathize with such an ugly shit sandwich is really frustrating.

Can't sympathize with the clause alignment though. And I can't figure out why someone who does leading commas for editability would chose a style that forces you to use different indentation levels for different lines. Oh wait, you wanna make that join a left join? Time to go re-indenting my whole damn query...

It's not even an improvement in readability...who actually reads text like it's a column format? It's not a spreadsheet.


Yes, despite knowing how to do it right those damn comma issues always come up at the worst time and to add insult to injury they're reported by most of the tools I have available like they're totally insoluble to the computer (even though far more taxing things are routinely fixed automatically).


The code you get with a trailing comma is valid SQL just like the one you get without, and both have different meanings.

The language is just too lenient with syntax, what is quite on the line for a "human legible" language from the late 70's and earlier 80's, but a really undesirable characteristic in practice. Unfortunately, nobody is doing a "better SQL", just radically different things that often not even support the same DB engines. And those have a real hard time getting popular, so we are stuck with the late 70's speculative innovations.


I hate you and and the awful layout you have just introduced me to... Also, that is how I will be formatting all my SQL from now on.


alt styling

    SELECT 
        a, b, COUNT(*) as count
    FROM
        table
    WHERE
        cond1 AND cond2
    GROUP BY
        a, b
    ORDER BY
        count DESC


That's the one I'm familiar with. The capitalized keywords help give you that Relational Database feel.


Capitalized keywords and section declarations on their own lines helps with quick comprehension for me at least


I've seen this one in the wild and have objections to it.

  - why capitalize KEYWORDS. We're no longer in 1970s. We use colors
  - why waste so much space. Why not just put table on the same line with from. With joins this just bloats up
  - for complicated conditions they have to be put on their own lines. Do I give each AND a separate line and waste ever more space?


- Half my SQL queries are written inside some other script, like Python or bash, so capitalization is nice.

- Space is not wasted when it it makes code more readable.

- Yes, and also give the columns in select their own lines. So nice to read. Space well spend!


Good points. Maybe there is a difference between a SQL script embedded occasionally somewhere in python, for which this style probably makes sense.

For heavy SQL codebases found in data analytics, there are better styles.


My editors and code review tools do not format or highlight sql for me when it's embedded in other files - eg java. I find capitalized keywords to be very convenient. Of course at that point, I can't run sqfluff on it either but this thread has already devolved fairly quickly.


I’ve always formatted my SQL like this, but I don’t remember learning it from anyone else or seeing anyone else do it.


Or the ultimate pedant (pedultimate?) order which matches the SQL order of operations (which for some queries can be illuminating):

    from ...
    where ...
    group ...
    having ...
    select ...
    order by ...
    limit ...


Is that legal SQL? I'd love to write in that order.

When working in Big Query, it would be so nice to first declare the table, so the online IDE can autocomplete columns.


You might like QUEL, precursor to SQL (and the original language of Ingres and Postgres) where range variables came first.

https://en.wikipedia.org/wiki/QUEL_query_languages#Usage


Standard SQL and most dialects enforce that in the syntax anyway no?

Edit: misread and select was in the middle in that comment


No, all databases support the common order of SELECT ... FROM ... WHERE ....


You're right, I somehow missed that select was in a different order.


Most of the SQLFluff style was based around the Mozilla SQL style guide and the dbt labs style guide: https://docs.telemetry.mozilla.org/concepts/sql_style.html https://github.com/dbt-labs/corp/blob/master/dbt_style_guide...

The GitLab SQL style guide isn't far off either: https://about.gitlab.com/handbook/business-technology/data-t...

Rather than supporting _all_ ways of formatting SQL, the aim of the tool was to fairly flexibly support the most mainstream ones with a view to becoming more opinionated in future to converge on a more unified style in future. Similar to what black has done with python.


This whole discussion here shows that the best time to introduce a linter/autoformatter is at the same time as the language it's supposed to lint/format (like Go did). Otherwise you will have interminable discussions and everyone will want their favorite style supported. I'm afraid that ship has sailed for SQL though...


Leading comma sort of works for SQL, since 'a' will often be a much less likely to change ID or something. Really I just wish it allowed a final trailing comma though, so each could be on its own line:

    select
      a,
      b,
    from ...
without the comma diff when you add one to the end.


BigQuery supports trailing commas in select statements: https://cloud.google.com/bigquery/docs/reference/standard-sq...


Leading comma is configurable, but I wouldn't know if the linter allows for that layout. Can't say I've seen it before, is it specific to a particular development environment?


I've seen it a lot in Data Science/Analytics environments, as it allows for faster query iterations than trailing commas.


Yes you can find sources advocating for leading commas and it's commonly used. I have "Google BigQuery: The Definitive Guide" on my desktop, it was written by senior Google employees involved in the steering and use of the product inside, for what it's worth, and it advocates for leading comma and SHOUTING THE KEYWORDS.

I personally disobey both recommendations (I mean, the book is worth it for other reasons) but both those conventions are very commonly used.

I meant more the the "center" layout that the commenter above was describing, it'd seem quite laborious to maintain by hand in most environments, if not using a particular one that supports it out of the box.


jetbrain IDEs have it configured by default for most situations, so its generally just an auto-format away if you're actually writing sql and not embedding sql queries in some other code.


It’s an informal way some people write SQL if they’re developing a script.

I’ve done it this way before because it was quicker to edit, and and remove lines. But it’s not a format I’d particularly like to share with others because it looks weird.


Please do not use leading commas. It doesn’t make any sense. Where else do you ever use them? Do you use them with an IN clause? What about in JSON?


could a synonym for "more not-noob" perhaps be "more esoteric"?


I think this is primarily aimed at analytics and data modelling SQL, the type of SQL in dbt [], though I'd be happily corrected by the team.

As far as I understand, this has been quite a long time coming, and is far from trivial due to the diversity of SQL dialect.

My use case would be to have this attached to a dbt project that ensures the dbt SQL files all confirm to a passable degree of consistency when working across a team. Especially as a CICD automated bot action thing on github.

[] https://www.getdbt.com/


Most of the rules seem reasonable, but what's the justification for this?

Rule_L031: Avoid table aliases in from clauses and join conditions.

I can't see what makes aliases clarifying in `select` or `where` but not in `join`. Also, inlined subqueries in a join must be given an alias, so there are cases where this rule can't be applied consistently.


I think it depends on if your table names are readable or not. If you table names are "sales" and "orders" then I would argue that makes the sql a lot more readable i.e

sales.amount orders.due_date

vs

s.amount ord.due_date

In the second example I have to flip back and forth between the join statement in order to know where the columns are coming from.

If you have awful table names though like qq123xc then I think aliasing would be appropriate


The moment you have the same table used more than once as a table source (in a JOIN) then this goes out the window.

Moreover, there's nothing wrong with saying `SELECT ... FROM sales sales ...`.


Wholeheartedly agree with this. The problem isn't aliasing, it's lazy aliasing which is more I think what GP was referring to (e.g. emp.id instead of employee.id).


Yes, that's often important. Project at work has lots of tables with very long names, which I alias frequently to obvious short names:

    schema.adjective_noun_adjective_noun as anan


Aliases rarely add clarity in my experience. An alias is just an extra lookup when I need context. It’s a waste of mental swap space.

Inlined subqueries are creating entirely new projections so they need descriptive names. If you are doing this you should consider a WITH anyway.


This rule is no good.


I wanted to use this linter, but they don't support many things that are used in real production code. By example, a simple postgres function like this one can't be parsed:

  CREATE OR REPLACE FUNCTION public.setof_test()
   RETURNS SETOF text
   LANGUAGE sql
   STABLE STRICT
  AS $function$
    select unnest(array['hi', 'test'])
  $function$
  ;


The code looks really great [1], very declarative, looks like it should be pretty easy to add better PostgreSQL support. I also wonder if how easy it would be to plug it into VS Code, and build in an auto formatter (♥ω♥ ). At a glance it looks like it's parsing AST, which is really hard to find for SQL, so auto formatter could be possible to build from this. I notice SQLite could use some improving too, seems like a great project to coalesce around.

[1] https://github.com/sqlfluff/sqlfluff/blob/main/src/sqlfluff/...


Auto-formatter is already built in for some simple violations: https://docs.sqlfluff.com/en/stable/cli.html#sqlfluff-fix



Looks like we didn't support SETOF.

Added this here: https://github.com/sqlfluff/sqlfluff/pull/1522

Demonstrating how easy it is to add this sort of thing to the project thanks to how the code is structured! :-)


I really personally dislike how you fail to shout the keywords in the above function "that actually matter" - all the set dressing and obvious stuff is capitalized - but then the meat of the function is all in lowercase hiding in the middle.


This is a linter which complains about spaces and newlines, but lets "WHERE col = NULL" pass (which can never be meaningful) just fine?

Obviously this depends on the rules configuration, but just judging that first smell test, it seems very ill thought out.


In MS SQL Server, with ANSI_NULLS set to OFF, it will select all rows having a NULL value for col. AFAIK not used that often.


That is a serious gotcha of SQL Server. It (and Oracle) has some serious problems with unique constraints because of that kind of stuff.


Seems like a good thing to check for! So we added a rule for this: https://github.com/sqlfluff/sqlfluff/pull/1527

Including autofixing if wanted.

Will be in the next release.


I guess it depends on what you want out of a linter e.g. if you want it to check just syntax or semantics as well. Is “= NULL” syntactically correct?


A linter is supposed to tell you "you are probably making a mistake."

"= NULL" is as syntactically correct in SQL as "if (x = null)" in C.

If it wasn't, code highlighting or compiling would break, both of which are much much more obvious.


The compiler checks syntax - if that’s all a linter did it wouldn’t be very useful.


`= NULL` is perfectly legal syntax - but it isn't doing what you think it is.


That's what parent is saying. He is contrasting a compiler (which this isn't) to a linter (which this arguably also isn't).


I wondered how much good you could do with an SQL linter that didn't have any access to the underlying schema, and the answer seems to be "mostly whitespace and capitalization issues".


I think a not-so-small-part of the success of JS/CSS linting was that large companies like AirBnB, GitHub, and Google made their rules public and helped with sane defaults. Almost every project I came across used one of these "defaults" and added custom rules as needed.

Sqlfluff seems to ship with 48 "built-in" rules and opinions on spacing and capitalization. If it catches on, I think the larger community will mostly benefit after other companies "doing work" with it release their own collection of rules.


Makes you wonder if this tool would be better a code formatter rather than linter?


It seems to have a formatter option in the `fix` argument you can give it.

The authors should probably be selling it on that point since it seems to be its strength, it doesn't have many rules along the lines of the static analysis that people look for in linters.

Some people (tech leads, CTOs) care a lot about formatting but I wouldn't make teammates go through a linter for that, it's a rather surface concern.


The reason that those people care about formatting is that it matters when the code is used to communicate with other people, not just with machines. Having a consistent style within the team, means the team can communicate more easily - even if the net effect for an individual working on a codebase alone is minimal.


Yes but realistically, save for the code that someone who's really inexperienced or sloppy would write, fixating over whether someone is writing

     select a, b, c
     from foo
     where bar = 'quux'
rather than

     SELECT
         a,
         b,
         c
     FROM
         foo
     WHERE
         bar = 'quux'
is petty and immaterial to what the code communicates.

Worse if it's about details within the roughly the same style.

Edit: but if there's a code formatter that can be seamlessly added to the development environment, it makes sense to use it. It's just that if there's no tool support, such things are of little contribution.


In trivial examples when we are all relaxed then yes, you would be right. It’s non-trivial occasions when we are stressed and trying to read someone elses code at two in the morning when I really value strict code style guidelines.

The ability to read complex code when we are in situations where we are not functioning at 100% is vastly under valued when one talks about developers capacity for comprehension.


Code isn't meant to be read by computers - it's meant to be read by humans (including yourself in three years when you've forgotten everything). Code sugaring and style adherence are vital portions of retaining a maintainable and flexible codebase.


Yeah I know the spirit. I'm saying it amounts to absolutely nothing in readability. If there's any formatting at all, it's structure and not particular whitespace and comma placement conventions that make or break readability, the rest is code quality theater.


It does actually matter quite a bit in readability in my experience - when you have a bunch of noobs, and a bunch of senior folks, and a bunch of interns, and a bunch of contractors all slinging code around for review to each other.

Many folks will get sloppy (in syntax and cleanliness) or misunderstand things - which requires the reviewer be able to clearly and quickly see what they are trying to do to catch that - especially as deadlines approach. The more it slips, the more slipping becomes the norm, and the messier and harder to understand everything gets - which makes later work harder as well.

If what you're showing is the standard, I guarantee 25% or more of the requested changes (which already won't meet whatever standard ANYONE sets strictly when first proposed) will be even worse. If that even worse becomes standard, etc, etc.

Part of their job is to set the 'reasonable' ideal, and attempt to enforce it. It won't happen universally, or even necessarily often, but it pushes things more towards maintainability and obviousness, which is opposite of the normal trend in any group of people. The larger the group, the more of a problem it is, and the harder they need to work to do it.


I've been using sqlfluff with dbt for a while now, and not only has it been a delight to use, the maintainers are incredibly responsive.


Yeah, we can see that in this post today. I saw two issues implemented and fixed while reading it.


Sorry - didn't get a chance to go through the documentation much yet... A problem we are looking to catch early in the development process is for things we don't want folks to do, even if they are technically correct. For example, dropping a column or altering the datatype/nullability in such a way that the table goes into reorg pending. I know we can write our own regex to look for that specific syntax, but I've not yet found a tool with those kinds of rules either build out already, or easy to add/maintain.


It's pretty new functionality in sqlfluff, but it now supports user defined plugins for org-specific rules if you want to forbid something more obscure. Documentation is sketchy, but you can see the proof of concept here: https://github.com/sqlfluff/sqlfluff/tree/main/plugins/sqlfl...


Nice, need to forbid things like grants and setting owners that are being handled by a centralized routine.


Thanks, I'll take a closer look there...


After relying on Prettier for a while, it feels very strange to me to have a linter that complains over minor formatting differences instead of just a tool that automatically fixes them.


It does both. Linting and fixing are done at different times.


I am spec’ing a data pipeline right now and the team is so excited to try this out. Here’s a great intro video about the tool and their motivations: https://m.youtube.com/watch?v=veYB9uh0RCM&t=1105s

My goal is to use it with a Rockset database, so I’ll see how far I can get with one of the existing dialects, or how hard it is to extend an existing one to make one for Rockset


Fascinating to see the focus on SQL and continuous integration, https://docs.sqlfluff.com/en/stable/inthewild.html#inthewild.... DBT has really changed things in that regard. @tomhallett's YT video seems to describe using this with a pre-release build of DBT to tightly couple it to sqlfluff.


I’ve been using this for a while and I’ve much enjoyed both the autoformatting as well as the codebase (when submitting PRs).


We use this at my company to lint dbt sql code within our shared repositories. We are able to easily craft a set of rules that fit our style guides and ignore the rest.


I used `pip install sqlformat` in the past but it seems no longer working for python3.8+, could not even find the executable anymore.


Is there a vim plugin for this yet? :)



This is a nice idea, but my problem with most of these tools is that they work with pure SQL/JS/whatever files, which isn't always the case.

For example, if you use something like myBatis, you'll probably build the SQL dynamically, which could be stored in a mapper XML file: https://mybatis.org/mybatis-3/dynamic-sql.html

Not only that, but certain frameworks have you embedding SQL inside of the main application code, when you're not using a DSL like HQL: https://www.tutorialspoint.com/hibernate/hibernate_query_lan... and https://www.tutorialspoint.com/hibernate/hibernate_native_sq...

Furthermore, there seems to be this odd separation between "regular SQL" and "procedural SQL" in some DBVSes, such as in Oracle - there you might have to use "DECLARE ... BEGIN ... END;" in certain contexts, or maybe "EXECUTE IMMEDIATE". Then again, Oracle isn't among the supported dialects, which is perfectly understandable, because of both how the DBVS is positioned, as well as because of how large an undertaking supporting it would be: https://docs.sqlfluff.com/en/stable/dialects.html

In short, the only passable tools that i've found for checking SQL are those that are integrated within the IDE, or using a specialized IDE with any plugins for specific technologies, like JetBrains DataGrip (commercial project): https://www.jetbrains.com/datagrip/

But then you run into the issue of code style preferences: some companies have style guides where you have to use reserved words in uppercase with the dev created content in lowercase, like:

  SELECT some_column, another_column, yet_another_column AS one_more 
  FROM some_table 
  WHERE some_column > SYSDATE
while certain tools (like SQL Developer) are more than happy to use autocomplete which doesn't work well with that:

  SELECT (start typing in "som") FROM some_table ...
  Oracle will autocomplete to: SOME_COLUMN
And then you run into people's personal preferences, for example some format the columns that they want to select like this:

  SELECT
      some_column
    , another_column
    , yet_another_column
  FROM ...
which makes sense when you want to edit single lines, but could have been avoided by the SQL standard allowing dangling commas, like some other languages do, for example:

  SELECT
    some_column,
    another_column,
    yet_another_column,
  FROM ...
but as it currently stands, that's not possible and the other way of formatting code also tends to break linters. And since the actual formatting of SQL doesn't matter as much as it would in Python, we end up with every IDE and other tool having their own preferences.

I guess the lesson here is to avoid complexity as much as possible and just settle on whatever works.


You're totally right that there isn't any consistency in this space at the moment. It's part of the justification for tools like SQLFluff starting to align people on a common way of writing SQL. None of the incumbent players have an incentive to do that.


Yeah, you need to read more of the sqlfluff documentation. :-)


I haven't used sqlfluff, but they support jinja templating and in those cases they ignore spacing problems. So maybe your use case can be supported?




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: