- Configure vacuum and maintenance_work_mem regularly as your DB size increases; if you allocate too much, or run it too often, it can clog up your memory.
- If you plan on deleting more than 10000 rows regularly, maybe you should look at partitioning; it's surprisingly slow to delete that "much" data, and even more so with foreign keys.
- Index on a boolean is useless; it's an easy mistake that will take memory and disk space for nothing.
- Broad indices are easier to maintain, but if you can have multiple smaller indices with a WHERE condition it will be much faster (see the sketch after this list).
- You can speed up big string indices by a huge margin with an md5/hash index (only relevant for exact matches).
- Postgres as a queue definitely works and scales pretty far.
- Related: be sure to understand the difference between transactions vs explicit locking; a lot of people assume too much from transactions and it will eventually break in prod.
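To illustrate the WHERE-condition point above: a partial index only covers the rows you actually query, so it stays small and fast. A minimal sketch, with made-up table and column names:

```
-- Hypothetical orders table: most queries only care about pending orders,
-- so index just those rows instead of the whole table.
CREATE INDEX CONCURRENTLY orders_pending_created_at_idx
    ON orders (created_at)
    WHERE status = 'pending';
```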
> Index on a boolean is useless; it's an easy mistake that will take memory and disk space for nothing.
However, if the field is highly biased (e.g. 90 or 99% one value), it can be useful to create a partial index on the rarer value. Though even better is to create a partial index on the columns you actually query, filtered by that value, especially if the smaller set is the commonly queried one (e.g. soft deletes).
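A rough sketch of the soft-delete case (names are made up):

```
-- Index the columns your queries use, restricted to the live rows;
-- the deleted flag itself never needs to be a key column.
CREATE INDEX users_live_email_idx
    ON users (email)
    WHERE NOT deleted;
```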
We work in different places. Here the index on closed tickets would be smaller. But you know, some sales guy called and they want this little feature NOW.
Postgres treats nulls as distinct. I've not checked, but I'd assume Postgres simply does not index nulls, as it can't find them again anyway (unless you use the new "NULLS NOT DISTINCT"). I think you need a separate index on the boolean IS NULL (which should probably be a partial index on whichever of IS NULL and IS NOT NULL is better).
Indeed, if it's highly biased, it's useful in combination with a condition.
I was referring to indexing the column without any condition; the last time I checked (years ago) Postgres didn't track the statistical distribution, so the query planner was always discarding the index anyway.
> - Related: be sure to understand the difference between transactions vs explicit locking; a lot of people assume too much from transactions and it will eventually break in prod.
I recently went from:
* somewhat understanding the concept of transactions and combining that with a bunch of manual locking to ensure data integrity in our web-app;
to:
* realizing how powerful modern Postgres actually is and delegating integrity concerns to it via the right configs (e.g., applying "serializable" isolation level), and removing the manual locks.
So I'm curious what situations are there that should make me reconsider controlling locks manually instead of blindly trusting Postgres capabilities.
Any transaction which is run at a transaction isolation level other than SERIALIZABLE will not be affected by SSI. If you want to enforce business rules through SSI, all transactions should be run at the SERIALIZABLE transaction isolation level, and that should probably be set as the default.
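If you do set it as the default, it could look roughly like this (the database name is hypothetical, and you still need to retry on serialization failures):

```
-- Make SERIALIZABLE the default for new connections to this database
ALTER DATABASE myapp SET default_transaction_isolation = 'serializable';

-- Or opt in per transaction; retry on serialization failure (SQLSTATE 40001)
BEGIN ISOLATION LEVEL SERIALIZABLE;
-- ... reads and writes ...
COMMIT;
```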
Given that running everything at SERIALIZABLE probably isn’t practical for you, I think it’s clearer, code-wise, to use explicit locks. That way, you can grep for which queries are related synchronization-wise, vs. SERIALIZABLE being implicit.
Explicit locks can mean just calling LOCK TABLE account_balances IN SHARE ROW EXCLUSIVE MODE; early in the transaction and then doing SELECT ... FOR UPDATE; or similar configurations to enforce business rules where it matters.
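Something along these lines, with made-up column names and values:

```
BEGIN;
-- Block concurrent writers to the table up front so balance checks can't race
LOCK TABLE account_balances IN SHARE ROW EXCLUSIVE MODE;
-- Lock the specific row we're about to change
SELECT balance FROM account_balances WHERE account_id = 42 FOR UPDATE;
UPDATE account_balances SET balance = balance - 100 WHERE account_id = 42;
COMMIT;
```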
I think, in the Postgres-as-a-queue scenario, that doesn't fix the problem that two processes can read the same row at the same time and thus both execute the job.
If you manually SELECT FOR UPDATE SKIP LOCKED LIMIT 1, then the second process will be forced to select the next task without waiting for the lock.
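A minimal sketch of that dequeue pattern, assuming a hypothetical jobs table:

```
BEGIN;
-- Each worker claims one pending job; rows locked by other workers are skipped,
-- not waited on.
SELECT id, payload
  FROM jobs
 WHERE status = 'pending'
 ORDER BY id
 LIMIT 1
   FOR UPDATE SKIP LOCKED;
-- ... process it, then mark it done using the id returned above ...
UPDATE jobs SET status = 'done' WHERE id = $1;
COMMIT;
```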
You can store the md5 (or any hash) in a new column and use it in the index instead of the string column. It will still be a string index, but much shorter. You have to be aware of hash collisions, but in my case it was a multi-column index so the risk was close to zero.
MD5 was maybe not the best choice, but it's built in, so it's available everywhere.
What I did, to avoid maintaining a second column, was to use the function directly in the index:
```
CREATE UNIQUE INDEX CONCURRENTLY "groupid_md5_uniq" ON "crawls" ("group_id", md5("url"));
```
```
SELECT * FROM crawls WHERE group_id = $1 AND md5(url) = md5($2)
```
This simple trick, which did not require an extensive refactor, sped up the query by a factor of a thousand.
> - Index on a boolean is useless; it's an easy mistake that will take memory and disk space for nothing.
I've seen this advice elsewhere as well, but recently tried it and found it wasn't the case on my data set. I have about 5m rows, with an extremely heavy bias toward one column being 'false'. Adding a plain index on this column cut query time in about half. We're in the millisecond range here, but still.
There is no need to add the boolean value to the index in this case, since it is constant (true) within the partial index. You can index a more useful column instead, like id or whatever your queries use, something like this (made-up names):
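```
-- Hypothetical names: the partial WHERE clause does the filtering,
-- so the key column can be whatever the queries actually need (id here).
CREATE INDEX things_flagged_id_idx
    ON things (id)
    WHERE flag;
```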
It does appear smaller, but only by single-digit megabytes on a table with millions of rows. Not a major difference for most use cases, I think. But good to know for the few where it would make a difference.
I know nothing about partial indices in Postgres, but it seems like, for indexing a Boolean, you either index the true or the false values, right? I feel like Postgres could intelligently pick the less frequent value.
Is that correct? I would think that, even with a NOT NULL Boolean field, the physical table has three kinds of rows: those with a true value, those with a false value, and those no longer in the table (with either true or false, but that doesn’t matter).
If so, you can’t, in general, efficiently find the false rows if you know which rows have true or vice versa.
You also can only use an index on rows with true values to efficiently find those with other values if the index can return the true rows in order (so that you can use the logic “there’s a gap in the index ⇒ there are non-true values in that gap”).
> Index on a boolean is useless; it's an easy mistake that will take memory and disk space for nothing.
Why? Is it because an index on the bool alone, with a symmetric distribution, will still leave you with half the table to scan? In other words, does that statement apply to a biased distribution (as mentioned by another response), or to indices on multiple fields, one of which is a boolean?
Yes, it is because it leaves you with half the table to scan while adding the overhead of doing an index scan. And if you have a biased distribution, you probably want a partial index, since those are smaller.
Half the rows to scan in 99% of cases means you’ll still hit every page and incur exactly the same amount of IO (the expensive part) as a full table scan.
Would periodically clustering the table on the boolean index help here? Since then the true rows would be in different pages than the false rows. Unless I misunderstand what clustering does.
The thing is that, since you can only cluster around a single ordering, a boolean column is very rarely the most useful one to use.
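For reference, clustering is a one-shot physical reorder tied to a single index, and it isn't maintained as rows change, so you'd have to re-run it periodically. Hypothetical names:

```
-- Rewrites the table in the order of the chosen index; locks the table while it runs
CLUSTER tickets USING tickets_closed_idx;
```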
But then, given the number of things that very rarely happen in a database, you are prone to have 1 or 2 of them happening every time you do something. Just not that specific thing; but if you keep all of those rules in mind, you will always be surprised.