Hacker News
Practical Lock-Free Buffers (ddj.com)
18 points by Sandman on Aug 29, 2009 | hide | past | favorite | 10 comments


While this provides perfect ordering information, it does so at the cost of bouncing a cacheline (the head of the queue) around the entire machine. That's fine if perfect ordering information is really, really important, but if a lack of probe effect is more important, try the DTrace approach: each thread writes into a private buffer, with a best-effort timestamp (say, the CPU's TSC). Drains merge-sort all the thread's buffers by timestamp.


I don't get it. At the beginning of the article he talks about the dangers of locking IPC (inter-process communication) calls; then, to prevent this, he describes a lock-free mechanism using the CAS primitive, which can be used only in-process, not between processes.

The TransferString function he proposes seems overly complex to me; using locks would make it simpler and possibly even faster. It would make the code look almost like "lock(); memcpy(); unlock();", which is not prone to deadlocking.

I even downloaded the source code to the article, but he uses threads to test it. Anyone care to explain this thing to me?


You can use the CAS primitive between processes. It is guaranteed to be atomic, which is why semaphores are usually built on top of CAS.

He essentially packs 4-byte chunks of strings into integers (assuming 32-bit integers) in an atomic manner. In the strictest sense there are no locks here, but at the end of the day he treats each memory location as its own lock. He also does some busy waiting, which is just a good old spin lock:

    while(!InsertAt(data,insertedAt));
Interesting concept but nothing too new. I would also caution against using this code without understanding it first. I don't see much error checking in there.


I think "lock-free" is a misnomer. If you are using a test-and-set instruction, you have incurred all the performance cost of a lock. While much work has been done recently to improve the performance of instructions like CMPXCHG, you still need to treat them like a UDP broadcast message (avoid if possible).

One case I don't think he handles is if a user does a Lookup() for some value in a slot, then tries to delete that slot. In this case, the wrong value might be deleted if an InsertAt() happens in the meanwhile.


How is CAS guaranteed to be atomic? Is it actually implemented by a machine instruction, rather than the pseudocode in the article?


Yes, look up cmpxchg and friends.

FreeBSD has atomic.h providing a bunch of different atomic ops, e.g: http://www.freebsd.org/cgi/cvsweb.cgi/src/sys/amd64/include/...

I believe FreeBSD also uses this style of buffer for dmesg, which sadly also results in a lot of interlacing if two kernel threads are trying to write to the buffer at once.


Yes, it is implemented by the hardware. On x86 I believe it is the CMPXCHG instruction.

GCC, for example, has the __sync_val_compare_and_swap(...) builtin, which gives you access to the hardware compare-and-swap.


> ... which can be used only in-process, not between processess.

You can use them between processes if you use shared memory.

This is useful in realtime or any other low latency applications where one reader or writer cannot be blocked for more than a couple of hundred microseconds.

It is basically reimplementing the semaphore but it is controlled by the application. There are no system calls, no threads being suspended. Instead of the busy while loop it might be possible to use a fixed count of iterations or a specific timeout value.

This is useful but in a very narrow context.


Thanks, I completely forgot about using shared memory between processes. Lately I've used sockets exclusively, as they provide a portable solution across languages and platforms. But you are right, that is not suitable for realtime or low-latency applications.

Still, I think the implementation in the article is bad, and you should avoid using it in production, because it relies too much on busy waiting and there is no error handling.

And if one doesn't use shared memory, just threads (as in the example code), you gain nothing from this solution. You'd be better off with a simple lock-based implementation, or, if you need low latency, a correct lock-free queue, or better yet a wait-free one.


I did something similar at Arbor; single producer, multiple consumers, high-volume message buffer (individual TCP connections off a monitored ISP core network). When I got in the door, it was SYSV semaphores. Don't ever use SYSV IPC. We needed an event loop, so we could do fine-grained timers. Instead of using locks, we did a distributed commit scheme (using an atomic increment), just as if we were synchronizing over the network.



