Even better, rearrange your database to store immutable facts instead of mutable state.
In this example, you might have a goal_completion table to which you append a row when the player completes a goal, instead of changing an existing row. The table could have suitable unique constraints to make sure that each player can only complete a goal once.
This way, it is much harder for your data to be accidentally modified into an incorrect state.
This also gives you a way to store facts about the completion for free, such as a timestamp for each goal completion. And having data like timestamps is really useful for auditing, debugging and testing.
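A minimal sketch of that append-only design, using Python's stdlib SQLite driver as a stand-in for any RDBMS (table and column names here are illustrative, not from the article's actual schema):

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Append-only facts table: rows are inserted, never updated in place.
conn.execute("""
    CREATE TABLE goal_completion (
        player_id    INTEGER NOT NULL,
        goal_id      INTEGER NOT NULL,
        completed_at TEXT    NOT NULL DEFAULT CURRENT_TIMESTAMP,
        UNIQUE (player_id, goal_id)  -- each player completes each goal at most once
    )
""")

def record_completion(conn, player_id, goal_id):
    try:
        with conn:  # commit on success, roll back on error
            conn.execute(
                "INSERT INTO goal_completion (player_id, goal_id) VALUES (?, ?)",
                (player_id, goal_id),
            )
        return True   # we inserted the fact: safe to award the reward
    except sqlite3.IntegrityError:
        return False  # duplicate completion: the constraint rejected it

first = record_completion(conn, 42, 7)  # True: fact recorded
dup   = record_completion(conn, 42, 7)  # False: unique constraint fired
```

The unique constraint, not application logic, is what guarantees "at most once", so concurrent writers can race freely: exactly one insert wins, and the timestamp comes along for free.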
I've heard you sometimes shouldn't actually delete data from a database and should use an "IsDeleted" flag instead, particularly with customer orders and such.
What are the proper circumstances to use immutable facts? Anytime you need an undo or audit features?
The isDeleted flag is good because it gives the appearance that the data is gone, without removing the context of the data itself.
Immutable facts are great if you want to:
1) Avoid a large number of updates/race conditions on your tables. Immutable facts, being immutable, can be written using inserts alone.
2) Keep a trail or context for the data you're storing. This has the benefit of keeping a log that you can later use to trace issues or behaviours, or to generate statistics. If you're keeping state in a database column, without external logs you can't check when or why that state changed. Immutable facts keep track of this for you, but require quite a bit more overhead.
I might be trivializing this but you could create a view "ActiveUsers" / "DeletedUsers". Not saying this make the IsDeleted flag perfect, but it would alleviate that particular problem, IMO.
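A quick sketch of that idea (SQLite via Python here; the table, view, and column names are made up for the example):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, "
    "is_deleted INTEGER NOT NULL DEFAULT 0)"
)
# The views hide the flag from everyday queries.
conn.execute("CREATE VIEW active_users  AS SELECT * FROM users WHERE is_deleted = 0")
conn.execute("CREATE VIEW deleted_users AS SELECT * FROM users WHERE is_deleted = 1")

conn.execute("INSERT INTO users (name) VALUES ('alice'), ('bob')")
conn.execute("UPDATE users SET is_deleted = 1 WHERE name = 'bob'")  # soft delete

active  = [name for (name,) in conn.execute("SELECT name FROM active_users")]
deleted = [name for (name,) in conn.execute("SELECT name FROM deleted_users")]
# active == ['alice'], deleted == ['bob']
```

Most application queries then target `active_users` and never need to remember the flag exists.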
I believe this won't work on PostgreSQL, which will always say one row has been updated, even if the new value is the same as the old value.
There are numerous ways around that, from less fancy to really fancy.
1. use SELECT FOR UPDATE, which will lock the row (I'd say that's the normal way)
session1> BEGIN;
session1> SELECT * FROM goals WHERE player_id = ? FOR UPDATE
session2> BEGIN;
session2> SELECT * FROM goals WHERE player_id = ? FOR UPDATE
# session2 is now hanging
session1> UPDATE goals SET completed = true WHERE goal_id = ?
session1> UPDATE players SET points = points + 1 WHERE player_id = ?
session1> COMMIT;
# session2 now proceeds, sees that the goal has been completed, forfeits awarding the reward
2. use suppress_redundant_updates_trigger (fun)
session> CREATE TRIGGER suppress_goal_t BEFORE UPDATE ON goals FOR EACH ROW EXECUTE PROCEDURE suppress_redundant_updates_trigger();
session> UPDATE goals SET completed = true WHERE goal_id = ?
UPDATE 1
session> UPDATE goals SET completed = true WHERE goal_id = ?
UPDATE 0
# now you can use the approach mentioned in the article
3. use true serialisability
session1> BEGIN;
session1> SET transaction_isolation TO serializable;
session1> SELECT * FROM goals WHERE player_id = ?
session2> BEGIN;
session2> SET transaction_isolation TO serializable;
session2> SELECT * FROM goals WHERE player_id = ?
session1> UPDATE goals SET completed = true WHERE goal_id = ?
session1> UPDATE players SET points = points + 1 WHERE player_id = ?
session1> COMMIT;
session2> UPDATE goals SET completed = true WHERE goal_id = ?
ERROR: could not serialize access due to concurrent update
# session2 now has to rollback the transaction
There's a few more, but I ran out of steam typing ;) Yay, Postgres!
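For completeness, the application-side guard that approach 2 enables (and that the article's conditional UPDATE achieves directly) is just a check on the affected row count. A minimal sketch, using SQLite so it's self-contained; the mutate-and-test happens in a single UPDATE statement:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE goals (goal_id INTEGER PRIMARY KEY, "
    "completed INTEGER NOT NULL DEFAULT 0)"
)
conn.execute("INSERT INTO goals (goal_id) VALUES (1)")

def complete_goal(conn, goal_id):
    # The WHERE clause mutates and tests in one statement: only one caller
    # can flip completed from 0 to 1, and rowcount tells us who won.
    cur = conn.execute(
        "UPDATE goals SET completed = 1 WHERE goal_id = ? AND completed = 0",
        (goal_id,),
    )
    conn.commit()
    return cur.rowcount == 1

first_attempt  = complete_goal(conn, 1)  # True: row updated, award the reward
second_attempt = complete_goal(conn, 1)  # False: 0 rows affected, skip it
```

With the Postgres trigger from approach 2, even a plain `SET completed = true` with no predicate would report `UPDATE 0` on the second run, so the same rowcount check works.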
Really not that much of a subtle bug I'm afraid, the pattern of "mutating the condition itself on success" is error-prone in single-threaded code too.
A good example being asynchronous javascript, where you'd do this to prevent issues like the one exposed in the article:
var some_mutable_condition = false;
var requested = false;
if (!some_mutable_condition && !requested) {
  requested = true; // ensure at most one request is performed
  $.ajax(/* async code that'll set some_mutable_condition to true */);
}
I like collecting tales like this one. They are useful for scaring new programmers who think they understand how to safely do multi-thread programming.
If I remember correctly, one of the ways around this is to do a no-op update before querying. So you'd update a field in the goal table to its original value, which would cause the database to lock the row for the duration of the transaction.
Also, if I remember correctly, this is one of the things that can be solved with an MVCC aware engine.
Right, but that involves changing your database and adding hidden form fields to your app in order for it to work in a multi-process environment (as every production server should be).
My solution is basically the same amount of lines, requires no such changes.
Yes, but optimistic locking with a version is the general and known solution to this problem in all cases and is built into the framework already. Your solution works in just one case, not the general case.
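The framework mechanism referred to here is Rails-style optimistic locking via a version column. A stripped-down sketch of the underlying SQL idea (SQLite via Python; Rails does this for you automatically when a `lock_version` column exists):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE goals (goal_id INTEGER PRIMARY KEY, "
    "completed INTEGER NOT NULL DEFAULT 0, "
    "lock_version INTEGER NOT NULL DEFAULT 0)"
)
conn.execute("INSERT INTO goals (goal_id) VALUES (1)")

def complete_with_version(conn, goal_id, seen_version):
    # Optimistic locking: the UPDATE only succeeds if nobody bumped
    # lock_version since we read the row.
    cur = conn.execute(
        """UPDATE goals
           SET completed = 1, lock_version = lock_version + 1
           WHERE goal_id = ? AND lock_version = ?""",
        (goal_id, seen_version),
    )
    conn.commit()
    return cur.rowcount == 1

# Two "processes" both read version 0; only the first write wins.
win  = complete_with_version(conn, 1, 0)  # True: version is now 1
lose = complete_with_version(conn, 1, 0)  # False: stale version, 0 rows
```

The loser sees 0 rows affected and can reload the row and retry, which generalizes to any kind of update, not just boolean flags.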
Another solution would be to have e.g. a "Completions" table with a unique index on user and date. Then the RDBMS wouldn't allow duplicate goal completions for the same user and day. If you wanted to allow up to X completions/day, just add a counter column and constrain it to the range [0, X). I often find that updating rows in response to transactions is a code smell: each transaction should write one or more rows to one or more tables. Of course, if you have denormalized summary tables, they might have to be updated, but you only do that when you know you have to.
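A sketch of the up-to-X-per-day variant, using a counter column plus a CHECK constraint so the database, not the application, enforces the cap (SQLite via Python; names and the cap value are illustrative, and the check is written as 1..X completions recorded rather than the [0, X) phrasing above):

```python
import sqlite3

MAX_PER_DAY = 2  # the "X" from the comment above

conn = sqlite3.connect(":memory:")
conn.execute(f"""
    CREATE TABLE completions (
        user_id INTEGER NOT NULL,
        day     TEXT    NOT NULL,
        count   INTEGER NOT NULL
                CHECK (count >= 1 AND count <= {MAX_PER_DAY}),
        UNIQUE (user_id, day)
    )
""")

def complete(conn, user_id, day):
    """Record a completion; the constraints enforce the daily cap."""
    try:
        with conn:  # commit on success, roll back on error
            cur = conn.execute(
                "UPDATE completions SET count = count + 1 "
                "WHERE user_id = ? AND day = ?",
                (user_id, day),
            )
            if cur.rowcount == 0:  # first completion today
                conn.execute(
                    "INSERT INTO completions VALUES (?, ?, 1)",
                    (user_id, day),
                )
        return True
    except sqlite3.IntegrityError:
        return False  # CHECK failed: cap reached

results = [complete(conn, 1, "2015-06-01") for _ in range(3)]
# results == [True, True, False] with MAX_PER_DAY = 2
```

The third attempt violates the CHECK constraint, the transaction rolls back, and the counter stays at the cap.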
There's a difference between a transaction that locks for writes and one that locks for reads. By default the reading is not locked in this (common) case.
The common case is going to be many readers with a few or even a single writer. In that scenario, a read locked database would destroy any chance of operating at scale.
For a lot of RDBMS, transaction semantics can be altered with different transaction isolation levels. There are levels, like repeatable read, which work in this way, but the default level is usually weaker.
"I was worried that I’d have to resort to semaphores or some kind of other exotic programming construct."
I'm not saying a semaphore would have been appropriate here, but: for real? A semaphore is exotic? It is one of the oldest synchronization constructs; semaphores were developed by Dijkstra in 1965. The author's CS program failed him.
My point was that a semaphore is freshman- or sophomore-level undergrad stuff, something I would expect every professional developer to understand. Here someone who considers themselves a developer, which I assume means a professional software developer, considers it an exotic construct. Things like this shake my faith in the computer science educational system.
I think you're being a bit dramatic about a little part of one sentence.
There are many application developers out there using Rails, PHP or Django who can go years without invoking locks or semaphores due to the strengths of the APIs and frameworks they use.
As they learn and grow as developers, they will be introduced to new problems and solutions. A semaphore can be exotic, not because they've never heard of it before, but because they've not used them very much outside of that one assignment they did in CS 5 years ago.
How about instead of criticizing people's educations, we encourage them to be mindful about what they don't know, and to learn whatever they can when they get a chance?
"How about instead of criticizing people's educations, we encourage them to be mindful about what they don't know, and to learn whatever they can when they get a chance?"
I'm all for people learning things they were not taught in school and I commend you for that.
But as the field grows there will be more and more material that was traditionally taught but is now dropped. For example, I could imagine a misguided department dropping a once-required OS/systems class to make it an elective, or replacing it with a node.js or Ruby elective instead. I think those kinds of program changes are serious and need discussion.
Not everyone who does this was in a CS Program. When I learned about them, I went with a lock-free model for my own project because semaphores seemed to add more complexity than they solved. I can understand the author's desire not to use them.
I'm a CS student at a university known for its CS program. I know about semaphores, but this has not been covered in any of the first- or second-year classes.
In the example code I gave, by default in Rails with Postgres (which supports MVCC) you'll still get the error. So it's not as simple as using a transactional database: you either use a trick like this one, a row-based version column, or you need to change your isolation level, which has other performance implications.
That introduces external dependencies though. Also, since the update has to occur anyway, I'd bet that my solution is faster than also making a call to another database for a lock.