There surely is a need for a sql linter, but if it wants to be strict about syntax style, it should better familiarize itself with not-noob styles first.
Eg. some people like to align the clauses vertically and use a leading, not trailing, comma.
select a
, b
, count(*) as count
from table
where cond1
and cond2
group by a, b
order by count desc
Yes! I write SQL this way. I grant the comma thing is more a matter of taste, but aligning clauses vertically and maximizing rivers is super useful to me and it forever mystifies me that it isn't standard. This is an extreme example, but I like to keep join clauses vertically aligned as well:
select a
, b
, count(*) as count
from table
join table2 on table.key1 = table2.key1
and table.key2 = table2.key2
and DATEADD('day', 1, table.key3) = table2.key3
where cond1
and cond2
group by a, b
order by count desc
For me this lets me read the SQL query almost like a picture at first glance, identifying blocks of rivers as semantically connected groups of statements. Some SQL formatters out there love to put join clauses at the same indentation level as the join itself, drives me nuts.
This looks nearly the same, but it's much easier to write and generates less spurious diffs:
select a
, b
, count(*) as count
from table
join table2
on table.key1 = table2.key1
and table.key2 = table2.key2
and DATEADD('day, 1, table.key3) = table2.key3
where cond1
and cond2
group by a, b
order by count desc
The largest difference is that the equal operators aren't aligned anymore, but on my opinion, aligning them is really not worth the cost.
Nice changes, I like it! I will say I really like aligning the join clauses, but I'll try writing things this way for a little while and see if it sticks.
The only thing I require from an SQL formatter is to make typos and boolean mistakes super easy to spot, especially as it grows in size. At first glance it might seem strange, but I like indentation on even the first term of an AND.
SELECT
a,
b,
count(*) as count
FROM
table
JOIN
table2
ON
table.key1 = table2.key1
AND
table.key2 = table2.key2
AND
-- 1 day apart
DATEADD('day', 1, table.key3) = table2.key3
WHERE
cond1
AND
cond2
AND
(
cond3
OR
cond4
)
GROUP BY
(a, b)
ORDER BY
count desc,
a asc
LIMIT
1000
;
Yes, that's a lot of whitespace, but it helps for complex queries.
Do you format your code this way by hand or does your IDE help you? Also, I can only imagine how noisy your diffs would be when you make changes and have to indent everything.
I'll use PyCharm's SQL formatter sometimes to massage spaghetti code that someone else wrote, but mostly I just write it this way by hand.
I will say my use case is probably different than others, I'm a Data Scientist so I write a ton of throwaway SQL and stuff that will only be seen by me. If I'm dealing with an existing SQL codebase or production stuff, definitely try to keep it more easily diff-able (like left justifying the keywords, really only caring about rivers in the join/where clauses), or just follow whatever patterns are already there.
- it doesn't work for collaborative efforts unless everyone is using the same linter or IDE with the same configuration. Not friendly to edits outside of that ecosystem (e.g. open source).
- there are other ways of writing SQL (e.g. dbt style guide) which are still very readable while also being much faster to write by hand without the need for an auto-formatter.
- some of these patterns such as the from and join indentations have nothing to do with readability and are purely stylistic. Nothing wrong with that, but there's nothing objectively better about this pattern - it just comes down to how you're used to reading SQL.
> and it forever mystifies me that it isn't standard
I can take a stab at justifying that - commas are used to form lists (eg, 1, 2, 3) and text (like "order by count") are supposed to be left justified.
You may be right that there are good reasons to format SQL while ignoring the commas and rules for writing English text. I personally think you are right. But if most people believed that then there'd be enough support to rewrite SQL with a decent modern syntax. Say allowing training commas before a from.
So the existence of SQL is, in a weird way, evidence that your view is uncommon.
If SQL would just allow a trailing comma on the last column, none of this leading comma shit would be necessary. I still use trailing commas, because code readability is more important that writeability IMO, but the fact that I'm forced to sympathize with such an ugly shit sandwich is really frustrating.
Can't sympathize with the clause alignment though. And I can't figure out why someone who does leading commas for editability would chose a style that forces you to use different indentation levels for different lines. Oh wait, you wanna make that join a left join? Time to go re-indenting my whole damn query...
It's not even an improvement in readability...who actually reads text like it's a column format? It's not a spreadsheet.
Yes, despite knowing how to do it right those damn comma issues always come up at the worst time and to add insult to injury they're reported by most of the tools I have available like they're totally insoluble to the computer (even though far more taxing things are routinely fixed automatically).
The code you get with a trailing comma is valid SQL just like the one you get without, and both have different meanings.
The language is just too lenient with syntax, what is quite on the line for a "human legible" language from the late 70's and earlier 80's, but a really undesirable characteristic in practice. Unfortunately, nobody is doing a "better SQL", just radically different things that often not even support the same DB engines. And those have a real hard time getting popular, so we are stuck with the late 70's speculative innovations.
I've seen this one in the wild and have objections to it.
- why capitalize KEYWORDS. We're no longer in 1970s. We use colors
- why waste so much space. Why not just put table on the same line with from. With joins this just bloats up
- for complicated conditions they have to be put on their own lines. Do I give each AND a separate line and waste ever more space?
My editors and code review tools do not format or highlight sql for me when it's embedded in other files - eg java. I find capitalized keywords to be very convenient. Of course at that point, I can't run sqfluff on it either but this thread has already devolved fairly quickly.
Rather than supporting _all_ ways of formatting SQL, the aim of the tool was to fairly flexibly support the most mainstream ones with a view to becoming more opinionated in future to converge on a more unified style in future. Similar to what black has done with python.
This whole discussion here shows that the best time to introduce a linter/autoformatter is at the same time as the language it's supposed to lint/format (like Go did). Otherwise you will have interminable discussions and everyone will want their favorite style supported. I'm afraid that ship has sailed for SQL though...
Leading comma sort of works for SQL, since 'a' will often be a much less likely to change ID or something. Really I just wish it allowed a final trailing comma though, so each could be on its own line:
select
a,
b,
from ...
without the comma diff when you add one to the end.
Leading comma is configurable, but I wouldn't know if the linter allows for that layout. Can't say I've seen it before, is it specific to a particular development environment?
Yes you can find sources advocating for leading commas and it's commonly used. I have "Google BigQuery: The Definitive Guide" on my desktop, it was written by senior Google employees involved in the steering and use of the product inside, for what it's worth, and it advocates for leading comma and SHOUTING THE KEYWORDS.
I personally disobey both recommendations (I mean, the book is worth it for other reasons) but both those conventions are very commonly used.
I meant more the the "center" layout that the commenter above was describing, it'd seem quite laborious to maintain by hand in most environments, if not using a particular one that supports it out of the box.
jetbrain IDEs have it configured by default for most situations, so its generally just an auto-format away if you're actually writing sql and not embedding sql queries in some other code.
It’s an informal way some people write SQL if they’re developing a script.
I’ve done it this way before because it was quicker to edit, and and remove lines. But it’s not a format I’d particularly like to share with others because it looks weird.
I think this is primarily aimed at analytics and data modelling SQL, the type of SQL in dbt [], though I'd be happily corrected by the team.
As far as I understand, this has been quite a long time coming, and is far from trivial due to the diversity of SQL dialect.
My use case would be to have this attached to a dbt project that ensures the dbt SQL files all confirm to a passable degree of consistency when working across a team. Especially as a CICD automated bot action thing on github.
Most of the rules seem reasonable, but what's the justification for this?
Rule_L031: Avoid table aliases in from clauses and join conditions.
I can't see what makes aliases clarifying in `select` or `where` but not in `join`. Also, inlined subqueries in a join must be given an alias, so there are cases where this rule can't be applied consistently.
I think it depends on if your table names are readable or not. If you table names are "sales" and "orders" then I would argue that makes the sql a lot more readable i.e
sales.amount
orders.due_date
vs
s.amount
ord.due_date
In the second example I have to flip back and forth between the join statement in order to know where the columns are coming from.
If you have awful table names though like qq123xc then I think aliasing would be appropriate
Wholeheartedly agree with this. The problem isn't aliasing, it's lazy aliasing which is more I think what GP was referring to (e.g. emp.id instead of employee.id).
I wanted to use this linter, but they don't support many things that are used in real production code.
By example, a simple postgres function like this one can't be parsed:
CREATE OR REPLACE FUNCTION public.setof_test()
RETURNS SETOF text
LANGUAGE sql
STABLE STRICT
AS $function$
select unnest(array['hi', 'test'])
$function$
;
The code looks really great [1], very declarative, looks like it should be pretty easy to add better PostgreSQL support. I also wonder if how easy it would be to plug it into VS Code, and build in an auto formatter (♥ω♥ ). At a glance it looks like it's parsing AST, which is really hard to find for SQL, so auto formatter could be possible to build from this. I notice SQLite could use some improving too, seems like a great project to coalesce around.
I really personally dislike how you fail to shout the keywords in the above function "that actually matter" - all the set dressing and obvious stuff is capitalized - but then the meat of the function is all in lowercase hiding in the middle.
I wondered how much good you could do with an SQL linter that didn't have any access to the underlying schema, and the answer seems to be "mostly whitespace and capitalization issues".
I think a not-so-small-part of the success of JS/CSS linting was that large companies like AirBnB, GitHub, and Google made their rules public and helped with sane defaults. Almost every project I came across used one of these "defaults" and added custom rules as needed.
Sqlfluff seems to ship with 48 "built-in" rules and opinions on spacing and capitalization. If it catches on, I think the larger community will mostly benefit after other companies "doing work" with it release their own collection of rules.
It seems to have a formatter option in the `fix` argument you can give it.
The authors should probably be selling it on that point since it seems to be its strength, it doesn't have many rules along the lines of the static analysis that people look for in linters.
Some people (tech leads, CTOs) care a lot about formatting but I wouldn't make teammates go through a linter for that, it's a rather surface concern.
The reason that those people care about formatting is that it matters when the code is used to communicate with other people, not just with machines. Having a consistent style within the team, means the team can communicate more easily - even if the net effect for an individual working on a codebase alone is minimal.
Yes but realistically, save for the code that someone who's really inexperienced or sloppy would write, fixating over whether someone is writing
select a, b, c
from foo
where bar = 'quux'
rather than
SELECT
a,
b,
c
FROM
foo
WHERE
bar = 'quux'
is petty and immaterial to what the code communicates.
Worse if it's about details within the roughly the same style.
Edit: but if there's a code formatter that can be seamlessly added to the development environment, it makes sense to use it. It's just that if there's no tool support, such things are of little contribution.
In trivial examples when we are all relaxed then yes, you would be right. It’s non-trivial occasions when we are stressed and trying to read someone elses code at two in the morning when I really value strict code style guidelines.
The ability to read complex code when we are in situations where we are not functioning at 100% is vastly under valued when one talks about developers capacity for comprehension.
Code isn't meant to be read by computers - it's meant to be read by humans (including yourself in three years when you've forgotten everything). Code sugaring and style adherence are vital portions of retaining a maintainable and flexible codebase.
Yeah I know the spirit. I'm saying it amounts to absolutely nothing in readability. If there's any formatting at all, it's structure and not particular whitespace and comma placement conventions that make or break readability, the rest is code quality theater.
It does actually matter quite a bit in readability in my experience - when you have a bunch of noobs, and a bunch of senior folks, and a bunch of interns, and a bunch of contractors all slinging code around for review to each other.
Many folks will get sloppy (in syntax and cleanliness) or misunderstand things - which requires the reviewer be able to clearly and quickly see what they are trying to do to catch that - especially as deadlines approach. The more it slips, the more slipping becomes the norm, and the messier and harder to understand everything gets - which makes later work harder as well.
If what you're showing is the standard, I guarantee 25% or more of the requested changes (which already won't meet whatever standard ANYONE sets strictly when first proposed) will be even worse. If that even worse becomes standard, etc, etc.
Part of their job is to set the 'reasonable' ideal, and attempt to enforce it. It won't happen universally, or even necessarily often, but it pushes things more towards maintainability and obviousness, which is opposite of the normal trend in any group of people. The larger the group, the more of a problem it is, and the harder they need to work to do it.
Sorry - didn't get a chance to go through the documentation much yet... A problem we are looking to catch early in the development process is for things we don't want folks to do, even if they are technically correct. For example, dropping a column or altering the datatype/nullability in such a way that the table goes into reorg pending. I know we can write our own regex to look for that specific syntax, but I've not yet found a tool with those kinds of rules either build out already, or easy to add/maintain.
It's pretty new functionality in sqlfluff, but it now supports user defined plugins for org-specific rules if you want to forbid something more obscure. Documentation is sketchy, but you can see the proof of concept here: https://github.com/sqlfluff/sqlfluff/tree/main/plugins/sqlfl...
After relying on Prettier for a while, it feels very strange to me to have a linter that complains over minor formatting differences instead of just a tool that automatically fixes them.
I am spec’ing a data pipeline right now and the team is so excited to try this out. Here’s a great intro video about the tool and their motivations: https://m.youtube.com/watch?v=veYB9uh0RCM&t=1105s
My goal is to use it with a Rockset database, so I’ll see how far I can get with one of the existing dialects, or how hard it is to extend an existing one to make one for Rockset
Fascinating to see the focus on SQL and continuous integration, https://docs.sqlfluff.com/en/stable/inthewild.html#inthewild.... DBT has really changed things in that regard. @tomhallett's YT video seems to describe using this with a pre-release build of DBT to tightly couple it to sqlfluff.
We use this at my company to lint dbt sql code within our shared repositories. We are able to easily craft a set of rules that fit our style guides and ignore the rest.
Furthermore, there seems to be this odd separation between "regular SQL" and "procedural SQL" in some DBVSes, such as in Oracle - there you might have to use "DECLARE ... BEGIN ... END;" in certain contexts, or maybe "EXECUTE IMMEDIATE". Then again, Oracle isn't among the supported dialects, which is perfectly understandable, because of both how the DBVS is positioned, as well as because of how large an undertaking supporting it would be: https://docs.sqlfluff.com/en/stable/dialects.html
In short, the only passable tools that i've found for checking SQL are those that are integrated within the IDE, or using a specialized IDE with any plugins for specific technologies, like JetBrains DataGrip (commercial project): https://www.jetbrains.com/datagrip/
But then you run into the issue of code style preferences: some companies have style guides where you have to use reserved words in uppercase with the dev created content in lowercase, like:
SELECT some_column, another_column, yet_another_column AS one_more
FROM some_table
WHERE some_column > SYSDATE
while certain tools (like SQL Developer) are more than happy to use autocomplete which doesn't work well with that:
SELECT (start typing in "som") FROM some_table ...
Oracle will autocomplete to: SOME_COLUMN
And then you run into people's personal preferences, for example some format the columns that they want to select like this:
SELECT
some_column
, another_column
, yet_another_column
FROM ...
which makes sense when you want to edit single lines, but could have been avoided by the SQL standard allowing dangling commas, like some other languages do, for example:
SELECT
some_column,
another_column,
yet_another_column,
FROM ...
but as it currently stands, that's not possible and the other way of formatting code also tends to break linters. And since the actual formatting of SQL doesn't matter as much as it would in Python, we end up with every IDE and other tool having their own preferences.
I guess the lesson here is to avoid complexity as much as possible and just settle on whatever works.
You're totally right that there isn't any consistency in this space at the moment. It's part of the justification for tools like SQLFluff starting to align people on a common way of writing SQL. None of the incumbent players have an incentive to do that.
Eg. some people like to align the clauses vertically and use a leading, not trailing, comma.