RabbitMQ has at-least-once semantics (unless requeue = false, in which case all bets are off: you could get a message multiple times due to cluster partitions^, or not at all due to a drop without requeue. Note that unlike Disque, requeue is set on a rejection/recovery operation, not on a message).
It has transactional publishing (which not all clients support, as the Confirm method is non-standard) and uses per-queue leader election, so all publishes go to the master queue, guaranteeing ordering. Requeued messages, however, have undefined order; messages with the requeued flag unset still have strict ordering relative to each other.
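To make the ordering caveat concrete, here's a toy simulation (my own model, not RabbitMQ client code) in which the broker happens to requeue a rejected message at the tail; messages that are never requeued keep their relative order, while the requeued one is delivered out of its original position:

```python
from collections import deque

# Toy model: a queue of messages; the consumer rejects message "b" once.
# In this sketch the broker requeues at the tail -- real brokers make no
# ordering promise for requeued messages, which is exactly the point.
queue = deque(["a", "b", "c"])
delivered = []       # successful deliveries, in order
rejected_once = set()

while queue:
    msg = queue.popleft()
    if msg == "b" and msg not in rejected_once:
        rejected_once.add(msg)
        queue.append(msg)    # requeue: goes back, order now differs
    else:
        delivered.append(msg)

print(delivered)  # ['a', 'c', 'b'] -- 'a' and 'c' kept their relative order
```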
Messages are only removed when acked. If I understand correctly, Disque will report back a message publish failure in an OOM condition, whereas RabbitMQ will simply stop reading from the connection and let TCP backpressure throttle the publisher, which is a massive hack that occasionally causes problems. I prefer Disque's method, especially since clients should never be deluded into thinking a publish can't fail for other reasons.
Overall, the main differences between this and RabbitMQ, from my reading, are:
* RabbitMQ has stronger ordering guarantees...until your message gets requeued, in which case things get more complex.
* Disque has a clearer clustering story, stating outright that it is AP. However, this means that messages may not always reach their destination, producing (as the grandparent comment calls it) "lonely widows". RabbitMQ can be CP or AP depending on configuration, but its CP mode is somewhat fraught with pitfalls and strange performance issues (as I have personally experienced in production).
* RabbitMQ has the advantage of the whole suite of AMQP features: exchanges, headers, routing rules, as well as some non-standard ones like federation and dead-letter queueing. Of course, complexity comes at a cost, and Disque is considerably simpler.
* RabbitMQ supports per-message durability and per-queue mirroring options. This reflects its history as a single-node application which had clustering added later. I prefer Disque's approach of always-clustered with the option to write to disk (especially having that option only apply on graceful restart).
Overall, I might use Disque for high-throughput applications where I don't mind messages rarely being lost, such as metrics. However, the AP semantics worry me. Dissatisfied as I am with Rabbit's CP mode, it's still my preferred option of these two in most use-cases.
^ (edit) As the sibling comment notes, the default mode for clustering is AP and messages can be lost (this is the same as Disque). A few times in my original post I accidentally defaulted to talking about its CP mode, which is the only mode I've used (and IMO the only mode that should be used).
Thanks for your comment. I think there is some misunderstanding about AP / CP and message semantics. Messages in a (proper) AP system are immutable, replicated multiple times, and never dropped if not acknowledged by a client, so a message with a replication level of N cannot fail to be delivered if just N-1 node failures occur. This is the story of a Disque message:
1. A client produces the message into some node with replication factor of N.
2. The node replicates to additional N-1 nodes (or to N other nodes under memory pressure, to avoid retaining a copy itself).
3. When the replication factor is achieved, the client gets notified; otherwise, after a timeout, an error is reported to the client and a best-effort cluster-wide deletion of the message is performed (contacting only nodes that may have a copy).
4. At this point the message is queued only in a single node, but N nodes have a copy.
5. The message is delivered, but let's imagine it is not acknowledged by the client.
6. Every node that has a copy will, after the retry time elapses, attempt to re-queue it, using a best-effort algorithm to avoid requeueing it multiple times. However, the algorithm is designed so that under partitions the only thing that can happen is multiple nodes putting the message up for delivery again, never the contrary.
7. Eventually the message gets acknowledged. The ack is propagated across the cluster to the nodes that had a copy of the job, and the ack is retained (if there is no memory pressure) to make sure that no node will try to deliver the message again. Once all the nodes report having the acknowledgement, the message is garbage collected and removed from all the nodes holding a copy.
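The lifecycle above can be sketched as a toy simulation (my own simplification, not Disque's actual implementation), covering the replication of steps 1-3 and the ack propagation and garbage collection of step 7:

```python
# Toy sketch of the Disque message lifecycle described above.
N = 3  # replication factor

class Node:
    def __init__(self, name):
        self.name = name
        self.copies = set()   # message ids this node holds
        self.acked = set()    # acks this node has seen

def publish(nodes, msg_id, n):
    # Steps 1-3: replicate to n nodes; fail if n copies are unachievable.
    replicas = nodes[:n]
    if len(replicas) < n:
        raise RuntimeError("replication factor not achievable")
    for node in replicas:
        node.copies.add(msg_id)

def acknowledge(nodes, msg_id):
    # Step 7: propagate the ack, then garbage-collect once every
    # copy-holder has seen it.
    for node in nodes:
        if msg_id in node.copies:
            node.acked.add(msg_id)
    if all(msg_id in node.acked for node in nodes if msg_id in node.copies):
        for node in nodes:
            node.copies.discard(msg_id)

nodes = [Node(f"n{i}") for i in range(5)]
publish(nodes, "job-1", N)
copies_after_publish = sum("job-1" in n.copies for n in nodes)
acknowledge(nodes, "job-1")
copies_after_ack = sum("job-1" in n.copies for n in nodes)
print(copies_after_publish, copies_after_ack)  # 3 0
```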
So basically you can count on Disque trying to deliver the message at any cost UNLESS the specified message time-to-live is reached. You can optionally specify a max life for the message, for example 2 days: if after 2 days no delivery was possible, the message is deleted regardless of its acknowledged state. This is useful because sometimes delivering a message after a given time no longer makes sense.
However, if you set, for example, retry to 60 seconds and TTL to 2 days, it means that all the nodes having a copy will try every minute to deliver a non-acknowledged message again, for 2 days. Just keep in mind that TTL means messages are destroyed after that time.
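If I read the Disque API correctly, both knobs are set per message at publish time via ADDJOB; the queue name and payload here are placeholders:

```
ADDJOB myqueue "payload" 100 REPLICATE 3 RETRY 60 TTL 172800
```

where 100 is the command timeout in milliseconds, RETRY is in seconds, and TTL 172800 is 2 days expressed in seconds.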
So what if, during a partition, I publish another message? Does it get rejected for not being able to reach N servers within the timeout, or has "N" been adjusted down to the number of currently-reachable machines?
I.e. is a partitioned cluster effectively "split-brained", where publishes only appear on one side, or does it stop accepting new messages?
N is a number you specify via the API call with the REPLICATE option. By default it is set to 3, to provide reasonable durability. So if you are on any side of the partition with at least 3 nodes, you can continue without issues.
Two sides of a partition are basically two smaller clusters that operate independently, as far as new messages are concerned. But what about old messages? They'll wait (if there is no memory pressure) to get garbage collected if copies are split between the two sides. However, during the partition the side where the message gets acknowledged will stop re-queueing it ASAP.
Note that even in CP mode, committed messages can be dropped as well (one of the pitfalls I assume you've experienced) when the cluster heals from a partition. Cf. the link in my sibling comment.
When using mirrored queues, Rabbit does ensure all the active mirrors are written to before confirming a publish:
"in the case of publisher confirms, a message will only be confirmed to the publisher when it has been accepted by all of the mirrors"
So if my understanding is correct, wiping the contents of a re-joining mirror shouldn't matter, since no new messages should have been accepted since the partition (unless the "pause" part of pause-minority is only happening after other things like re-election or dropping "dead" slaves, in which case yes pause-minority is useless - this seems doubtful, however).
Hence why I think the problem is slave synchronization.
Basically, when a slave is created (e.g. in response to another slave dying), it only receives NEW messages, not existing messages. So suppose the following sequence of events on a 2-mirror queue:
1. Publish A. Master and slave both contain A.
2. Slave dies; a new slave is created. Master contains A; slave contains nothing.
3. Publish B. Master contains A, B; slave contains B.
4. Master dies; the slave is promoted and a new slave is created. A is lost.
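The sequence above can be sketched as a toy simulation (my own model, not RabbitMQ internals), assuming an unsynchronised new slave receives only messages published after it joins:

```python
# Toy model of an unsynchronised mirrored queue: a newly created slave
# receives only messages published after it joins the queue.
master = ["A"]   # after "Publish A"; the original slave has since died
slave = []       # new slave starts empty: existing messages not copied

def publish(msg):
    master.append(msg)
    slave.append(msg)    # mirrors only see new publishes

publish("B")
# Master dies; the slave is promoted.
promoted = slave
print(promoted)  # ['B'] -- message A is lost
```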
The way around this is setting the policy "ha-sync-mode": "automatic", in which case the act of creating a new slave also replicates the current contents of the master. To the best of my knowledge, if the same Call Me Maybe tests were run with that policy in place, no messages should be lost.
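For reference, such a policy can be applied with rabbitmqctl; the policy name and queue pattern here are placeholders:

```shell
rabbitmqctl set_policy ha-sync \
  "^" '{"ha-mode":"all","ha-sync-mode":"automatic"}' \
  --apply-to queues
```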
But yes, this is precisely what I meant by "fraught with pitfalls". The pause while messages replicate can be disastrous on its own if the queue is large, another issue that has bit me in production.
I do love RabbitMQ, but I wish there was a good CP AMQP broker out there that was planned from the beginning as clustered. Maybe I'll try to write one.