Redis has been working here on a high-traffic site without any trouble for more than a year. Excellent software. It may be the only software I'd consider bug-free, thanks to the attitude of @antirez.
This thread will likely get a million "us too" style posts, but Redis is core infrastructure at Shopify. It's been so solid that we recently waived our defensive requirement that the app keep working if Redis goes down. This allowed us to port our inventory reservation system (a huge point of potential lock contention) completely to the new server-side Lua scripts. We have seen a full order-of-magnitude speed increase from this: a reservation for a complicated order is now measured in µs instead of ms.
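For readers unfamiliar with the technique: Redis runs a server-side Lua script as one atomic step, so a check-and-reserve needs no client-side locking at all. Here is a minimal sketch of the idea with redis-py (not Shopify's actual script; the key name and quantities are made up):

    import redis

    r = redis.Redis()

    # Atomically check available stock and decrement it if there is enough.
    # The whole script executes as a single atomic step inside Redis.
    RESERVE = """
    local available = tonumber(redis.call('GET', KEYS[1]) or '0')
    local wanted = tonumber(ARGV[1])
    if available >= wanted then
        redis.call('DECRBY', KEYS[1], wanted)
        return 1
    end
    return 0
    """

    reserve = r.register_script(RESERVE)
    if reserve(keys=['stock:sku-1234'], args=[3]):   # key and quantity are illustrative
        print('reserved')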
I'll put my "us too" here, as we too now assume Redis to be up, just like our relational store(s) (Percona MySQL): ~16 GB at ~500 ops/s averaged over the last 3 months. We've been running Redis for core features for more than 21 months, and as a store it has been the most stable and the easiest to reason about in terms of the performance we can expect and actually receive. Compared with the other "non-SQL" setups we deploy: if we were starting from scratch, we'd look at Redis to replace ActiveMQ, Solr, and a number of the jobs we currently jam through MySQL.
Thanks Antirez, could you share more insight on TILT mode? Were there other approaches you considered? Why use a value of 30 seconds to leave TILT mode? If the time has shifted, is it likely ever to be correct thereafter?
Hello. Basically, 30 seconds is the exit time only if no other time shifts are detected; otherwise we set the exit time to now+30sec again, and so forth.
30 seconds is 3 times the largest period we have in info collection (INFO itself is sampled every 10 seconds, while PING is sent every 1 second). That way, if there was a problem with the timer, within 30 seconds we are sure the new state will get fresh readings for every kind of request and information we collect, so when TILT mode is exited and the function that evaluates the state is called again, it should see clean values.
Note that from the point of view of Sentinel it is ok if the new time is wrong compared to the real time; we never use absolute time. All we need is a computer clock that advances more or less regularly.
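To make the mechanics concrete, here is a rough Python sketch of that logic as described above. This is my own illustration, not Sentinel's actual source; in particular the 2-second trigger threshold and the function name are assumptions:

    import time

    TILT_PERIOD = 30      # seconds: 3x the longest sampling period (INFO, 10s)

    last_tick = time.time()
    tilt_exit = 0         # wall-clock time at which TILT mode may be exited

    def timer_tick():
        """Called periodically; watches for suspicious jumps of the clock."""
        global last_tick, tilt_exit
        now = time.time()
        delta = now - last_tick
        last_tick = now
        # Clock went backwards, or the timer stalled far too long: distrust
        # all time-dependent state. Each new shift pushes the exit forward.
        if delta < 0 or delta > 2.0:          # 2s trigger is illustrative
            tilt_exit = now + TILT_PERIOD
        if now < tilt_exit:
            return "TILT"    # keep collecting data, but don't act on it
        return "NORMAL"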
I imagine there is some overlap in parts of the logic between Sentinel and repmgr (a similar tool for PostgreSQL): for example, checking whether members of the cluster are in service, and choosing a new master in the event of a failover.
I would love to see a generic tool for handling the clustering/failover problem.
It's true there is some overlap, but Sentinel also uses things that are specific to Redis. For instance, two things are crucial for us:
1) The ability to use the master as a message bus to auto-discover things. This is possible because every Redis instance is also a Pub/Sub server.
2) The idea that after every restart of every Redis instance we have a "runid" that changes.
And in general, the logic of the failover itself, and the fact that failure detection is precise (certain reply codes are treated one way, others another way), make a non-specific solution much harder to implement: the "methods" used to perform the service-specific tasks may end up being complex, or the lack of some feature (such as Pub/Sub) may force you to completely change the logic of the system.
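As a concrete illustration of those two primitives (a sketch with redis-py, not Sentinel's actual code; the master address is made up): you can read the run_id that changes across restarts from INFO, and subscribe to the hello channel Sentinels use to announce themselves.

    import redis

    master = redis.Redis(host='10.0.0.1', port=6379)   # address is illustrative

    # (2) run_id changes on every restart, so comparing it with the last
    # value we saw is enough to detect that an instance was restarted.
    run_id = master.info('server')['run_id']

    # (1) The master doubles as a message bus: Sentinels announce themselves
    # on a well-known channel, so subscribing gives auto-discovery for free.
    p = master.pubsub()
    p.subscribe('__sentinel__:hello')
    for msg in p.listen():
        if msg['type'] == 'message':
            print('hello from a Sentinel:', msg['data'])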
"@cscotta Hi, you misunderstood how it works: Pub/Sub is only used for discovery on startup. Sentinel-to-sentinel p2p for critical stuff."
Pub/Sub is used to make configuration simpler when you start a Sentinel cluster from cold, when everything is working and your master is ok.
This allows us to auto-discover the other Sentinels, to check the slaves, and so forth.
Instead, to understand whether a system is down, to decide which Sentinel performs the failover, and for all the other critical stuff, Sentinel-to-Sentinel messages are used, without caring whether the master's Pub/Sub works.
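In other words, the authoritative view lives in the Sentinels themselves, and a client can ask any of them directly, with no dependency on the master's Pub/Sub. A quick sketch with redis-py; 'mymaster' stands in for whatever name the Sentinel was configured with:

    import redis

    sentinel = redis.Redis(host='127.0.0.1', port=26379)   # default Sentinel port

    # Ask the Sentinel for its current view; this works even if the master
    # (and therefore its Pub/Sub) is completely down.
    addr = sentinel.sentinel_get_master_addr_by_name('mymaster')
    peers = sentinel.sentinel_sentinels('mymaster')
    print('current master:', addr, 'known sentinels:', len(peers))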
Sorry, I was not clear. Sentinel itself uses the hiredis C library to talk with other Redis instances. A bug in the C library crashes the library and the process it is running in.
I've written a little experimental Python client that connects to a Sentinel and keeps an image of the state of the monitored cluster as it changes.
http://bit.ly/NNrQdI
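For anyone who wants to roll their own, the core of such a client is quite small. A minimal sketch with redis-py (not the code behind the link above): take a snapshot of the monitored cluster from a Sentinel, then follow its event stream to keep the image fresh.

    import redis

    s = redis.Redis(host='127.0.0.1', port=26379)   # default Sentinel port

    # Initial image: every monitored master, plus its slaves.
    state = s.sentinel_masters()                    # {master_name: {...}}
    for name in state:
        state[name]['slaves'] = s.sentinel_slaves(name)

    # Sentinel publishes events such as +sdown and +switch-master; follow
    # them to keep the image in sync as the cluster changes.
    p = s.pubsub()
    p.psubscribe('*')
    for msg in p.listen():
        if msg['type'] == 'pmessage':
            print(msg['channel'], msg['data'])      # a real client updates `state` here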