- Configure vacuum and maintenance_work_mem regularly as your DB size increases; if you allocate too much, or run it too often, it can clog up your memory.
- If you plan on deleting more than 10000 rows regularly, maybe you should look at partitioning; it's surprisingly slow to delete that "much" data, and even more so with foreign keys.
- Index on a boolean is useless; it's an easy mistake that will take memory and disk space for nothing.
- Broad indices are easier to maintain, but if you can have multiple smaller indices with a WHERE condition it will be much faster (see the sketch after this list).
- You can speed up big string indices by a huge margin with an md5/hash index (only relevant for exact matches).
- Postgres as a queue definitely works and scales pretty far.
- Related: be sure to understand the difference between transactions vs explicit locking; a lot of people assume too much from transactions and it will eventually break in prod.
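To illustrate the WHERE-condition point above: a partial index only covers the rows you actually query, so it stays small and fast. A minimal sketch, with made-up table and column names:

```
-- Hypothetical orders table: most queries only care about pending orders,
-- so index just those rows instead of the whole table.
CREATE INDEX CONCURRENTLY orders_pending_created_at_idx
    ON orders (created_at)
    WHERE status = 'pending';
```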
> Index on a boolean is useless; it's an easy mistake that will take memory and disk space for nothing.
However, if the field is highly biased (e.g. 90 or 99% one value), it can be useful to create a partial index on the rarer value. Though even better is to create a partial index on the columns you actually query, filtered by that value, especially if the smaller set is the commonly queried one (e.g. soft deletes).
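A rough sketch of the soft-delete case (names are made up):

```
-- Index the columns your queries use, restricted to the live rows;
-- the deleted flag itself never needs to be a key column.
CREATE INDEX users_live_email_idx
    ON users (email)
    WHERE NOT deleted;
```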
We work in different places. Here the index on closed tickets would be smaller. But you know, some sales guy called and they want this little feature NOW.
Postgres treats nulls as distinct. I've not checked, but I'd assume Postgres simply does not index nulls, as it can't find them again anyway (unless you use the new "NULLS NOT DISTINCT"). I think you need a separate index on the boolean IS NULL (which should probably be a partial index on whichever of IS NULL and IS NOT NULL is better).
Indeed, if it's highly biased, it's useful in combination with a condition.
I was referring to indexing the column without any condition; the last time I checked (years ago) Postgres didn't track the statistical distribution, so the query planner was always discarding the index anyway.
> - Related: be sure to understand the difference between transactions vs explicit locking; a lot of people assume too much from transactions and it will eventually break in prod.
I recently went from:
* somewhat understanding the concept of transactions and combining that with a bunch of manual locking to ensure data integrity in our web-app;
to:
* realizing how powerful modern Postgres actually is and delegating integrity concerns to it via the right configs (e.g., applying "serializable" isolation level), and removing the manual locks.
So I'm curious what situations are there that should make me reconsider controlling locks manually instead of blindly trusting Postgres capabilities.
Any transaction which is run at a transaction isolation level other than SERIALIZABLE will not be affected by SSI. If you want to enforce business rules through SSI, all transactions should be run at the SERIALIZABLE transaction isolation level, and that should probably be set as the default.
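If you do set it as the default, it could look roughly like this (the database name is hypothetical, and you still need to retry on serialization failures):

```
-- Make SERIALIZABLE the default for new connections to this database
ALTER DATABASE myapp SET default_transaction_isolation = 'serializable';

-- Or opt in per transaction; retry on serialization failure (SQLSTATE 40001)
BEGIN ISOLATION LEVEL SERIALIZABLE;
-- ... reads and writes ...
COMMIT;
```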
Given that running everything at SERIALIZABLE probably isn’t practical for you, I think it’s clearer, code-wise, to use explicit locks. That way, you can grep for which queries are related synchronization-wise, vs. SERIALIZABLE being implicit.
Explicit locks can mean just calling LOCK TABLE account_balances IN SHARE ROW EXCLUSIVE MODE; early in the transaction and then doing SELECT ... FOR UPDATE; or similar configurations to enforce business rules where it matters.
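Something along these lines, with made-up column names and values:

```
BEGIN;
-- Block concurrent writers to the table up front so balance checks can't race
LOCK TABLE account_balances IN SHARE ROW EXCLUSIVE MODE;
-- Lock the specific row we're about to change
SELECT balance FROM account_balances WHERE account_id = 42 FOR UPDATE;
UPDATE account_balances SET balance = balance - 100 WHERE account_id = 42;
COMMIT;
```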
I think, in the Postgres-as-a-queue scenario, that doesn't fix the problem that two processes can read the same row at the same time and thus both execute the job.
If you manually SELECT FOR UPDATE SKIP LOCKED LIMIT 1, then the second process will be forced to select the next task without waiting for the lock.
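A minimal sketch of that dequeue pattern, assuming a hypothetical jobs table:

```
BEGIN;
-- Each worker claims one pending job; rows locked by other workers are skipped,
-- not waited on.
SELECT id, payload
  FROM jobs
 WHERE status = 'pending'
 ORDER BY id
 LIMIT 1
   FOR UPDATE SKIP LOCKED;
-- ... process it, then mark it done using the id returned above ...
UPDATE jobs SET status = 'done' WHERE id = $1;
COMMIT;
```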
You can store the md5 (or any hash) in a new column and use it in the index instead of the string column. It will still be a string index, but much shorter. You have to be aware of hash collisions, but in my case it was a multi-column index so the risk was close to zero.
MD5 was maybe not the best choice, but it's built in, so it's available everywhere.
What I did, to avoid maintaining a second column, was to use the function directly in the index:
```
CREATE UNIQUE INDEX CONCURRENTLY "groupid_md5_uniq" ON "crawls" ("group_id", md5("url"));
```
```
SELECT * FROM crawls WHERE group_id = $1 AND md5(url) = md5($2)
```
This simple trick, which did not require an extensive refactor, sped up the query by a factor of a thousand.
> - Index on a boolean is useless; it's an easy mistake that will take memory and disk space for nothing.
I've seen this advice elsewhere as well, but recently tried it and found it wasn't the case on my data set. I have about 5m rows, with an extremely heavy bias toward one column being 'false'. Adding a plain index on this column cut query time in about half. We're in the millisecond range here, but still.
There is no need to add the boolean value to the index in this case, since it is constant (true) within the partial index. You can index a more useful column instead, like id or whatever your queries use, something like this (made-up names):
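```
-- Hypothetical names: the partial WHERE clause does the filtering,
-- so the key column can be whatever the queries actually need (id here).
CREATE INDEX things_flagged_id_idx
    ON things (id)
    WHERE flag;
```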
It does appear smaller, but only by single-digit megabytes on a table with millions of rows. Not a major difference for most use cases, I think. But good to know for the few where it would make a difference.
I know nothing about partial indices in Postgres, but it seems like, for indexing a Boolean, you either index the true or the false values, right? I feel like Postgres could intelligently pick the less frequent value.
Is that correct? I would think that, even with a NOT NULL Boolean field, the physical table has three kinds of rows: those with a true value, those with a false value, and those no longer in the table (with either true or false, but that doesn’t matter).
If so, you can’t, in general, efficiently find the false rows if you know which rows have true or vice versa.
You also can only use an index on rows with true values to efficiently find those with other values if the index can return the true rows in order (so that you can use the logic “there’s a gap in the index ⇒ there are non-true values in that gap”).
> Index on a boolean is useless; it's an easy mistake that will take memory and disk space for nothing.
Why? Is it because an index on the bool alone, with a symmetric distribution, will still leave you with half the table to scan? In other words, does that statement apply to a biased distribution (as mentioned by another response), or to indices on multiple fields, one of which is a boolean?
Yes, it is because it leaves you with half the table to scan while adding the overhead of doing an index scan. And if you have a biased distribution, you probably want a partial index, since those are smaller.
Half the rows to scan in 99% of cases means you’ll still hit every page and incur exactly the same amount of IO (the expensive part) as a full table scan.
Would periodically clustering the table on the boolean index help here? Since then the true rows would be in different pages than the false rows. Unless I misunderstand what clustering does.
The thing is that, since you can only cluster around a single ordering, a boolean column is very rarely the most useful one to use.
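For reference, clustering is a one-shot physical reorder tied to a single index, and it isn't maintained as rows change, so you'd have to re-run it periodically. Hypothetical names:

```
-- Rewrites the table in the order of the chosen index; locks the table while it runs
CLUSTER tickets USING tickets_closed_idx;
```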
But then, given the number of things that very rarely happen in a database, you are prone to have 1 or 2 of them happening every time you do something. Just not that specific thing; but if you keep all of those rules in mind, you will always be surprised.