What I'm not getting, from my superficial knowledge, is why Prometheus is getting so much traction over Elasticsearch. Elasticsearch claims to be just as good for metrics and events. The ES database itself is more advanced, with eventual consistency and search capability. It can do log analytics, and it can be the backend for a tracing tool like Jaeger. Why so much investment in Prometheus? Disclaimer: I haven't used Prometheus much myself.
I think you'd be very hard pressed to scale an Elasticsearch cluster to 10s of millions of writes/s without breaking the bank (and even if you had a pile of money to light on fire I don't think an Elasticsearch cluster with the number of nodes you'd require to support that would work very well).
Elasticsearch is a great piece of technology and it's very versatile, which makes it a great fit for a lot of problems (Uber, where M3 was developed, is a heavy consumer of Elasticsearch for logging purposes, for example), but for the types of metrics workloads and scale that M3/Prometheus were designed for, Elasticsearch simply wouldn't work.
At their core these systems are basically specialized column stores; they have completely different read/write patterns from something like ES. The basic query unit, for example, is always going to be the scan: I'm not aware of any monitoring system with any kind of secondary index capability. ES supports a bunch of nice result aggregation stuff on top of Lucene, whereas these systems are primarily /built for/ this use case.
What's interesting about some of the more modern monitoring systems like M3 and Prometheus is that they have a reverse index on top of the column store entries to very quickly find the relevant metrics for a multi-dimensional query.
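The idea in the comment above can be sketched in a few lines. This is a hypothetical toy structure, not M3's or Prometheus's actual implementation: each `label=value` pair maps to a posting set of series IDs, a multi-dimensional query intersects those sets, and only the matching series get scanned (the column-store part).

```python
# Toy inverted index over time-series labels (hypothetical sketch,
# not M3DB's or Prometheus's real code).
from collections import defaultdict

class SeriesIndex:
    def __init__(self):
        self.postings = defaultdict(set)  # "label=value" -> set of series IDs
        self.series = {}                  # series ID -> list of (timestamp, value)

    def add_series(self, series_id, labels, points):
        self.series[series_id] = points
        for k, v in labels.items():
            self.postings[f"{k}={v}"].add(series_id)

    def query(self, **labels):
        """Intersect posting sets to find series matching ALL label pairs,
        then return only those series for scanning."""
        sets = [self.postings[f"{k}={v}"] for k, v in labels.items()]
        if not sets:
            return {}
        matched = set.intersection(*sets)
        return {sid: self.series[sid] for sid in matched}

idx = SeriesIndex()
idx.add_series("s1", {"service": "api", "region": "us-east"}, [(0, 1.0), (10, 2.0)])
idx.add_series("s2", {"service": "api", "region": "eu-west"}, [(0, 3.0)])
idx.add_series("s3", {"service": "db", "region": "us-east"}, [(0, 4.0)])

print(sorted(idx.query(service="api", region="us-east")))  # ['s1']
```

The point of the index is that a query like `service=api AND region=us-east` never touches series that don't match; the expensive scan only runs over the intersected posting set.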
elasticsearch is the one thing i've worked with that i've had to learn to pretend i know nothing about. you do not want to get labeled the expert on that thing. it's nice and finicky.
Another viable player that took a path similar to M3DB is VictoriaMetrics [1]. This allowed implementing various features [2] and optimizations [3] without the need to negotiate their integration into upstream Prometheus. Such negotiations can get stuck forever. [4]
> Chronosphere, a startup from two ex-Uber engineers, who helped create the open source M3 monitoring project to handle Uber-level scale, officially launched today with the goal of building a commercial company on top of the open source project.
I recall a thread here from 2-3 weeks ago about how “Uber-scale” wasn’t really Uber scale, and that most of these publicized “Uber-scale” projects ended up getting canned internally. Any insider insight to this M3 project?
Rob, co-founder and M3DB creator here. Uber collected billions of metric samples and we had tens of billions of metrics in M3 at Uber. Netflix, for reference, has not published any numbers higher than single-digit billions of time series. The system has run in production at Uber for several years now. That's my thoughts on the matter, hah.
> Released in 2015, M3 now houses over 6.6 billion time series. M3 aggregates 500 million metrics per second and persists 20 million resulting metrics-per-second to storage globally (with M3DB), using a quorum write to persist each metric to three replicas in a region.
So, if that's accurate, they're collecting one trillion data points every two seconds.
So we collected and aggregated more than 1 billion samples of metrics per second, which resulted in writing more than 30-40 million unique metric datapoints per second to storage. This resulted in more than 10 billion unique time series being stored (each with a very large number of distinct datapoints).
This was 3.6 trillion metric samples per hour or 2.5 trillion metric datapoints stored a day (after aggregating samples).
No, they're collecting one BILLION (with a b) data points every two seconds. Gotta go to 2000 seconds (a little over half an hour) for the TRILLION.
With a 25:1 reduction/summarization before writing. If they're smart, they do that summarization on the way in, rather than at the back-end layer. That's 1.2 billion data points written per minute, or about 1.7 trillion written per day!
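Sanity-checking the arithmetic in this sub-thread against the numbers quoted from the M3 post (500 million metrics/s aggregated, 20 million datapoints/s persisted):

```python
# Rates quoted from the M3 blog post excerpt above.
ingested_per_sec = 500_000_000   # aggregated metrics per second
persisted_per_sec = 20_000_000   # datapoints persisted to storage per second

# Reduction ratio from aggregation before writing.
print(ingested_per_sec // persisted_per_sec)       # 25  (the 25:1 summarization)

# Ingest side: a billion samples every two seconds,
# and a trillion samples takes ~2000 seconds (about 33 minutes).
print(ingested_per_sec * 2)                        # 1000000000
print(1_000_000_000_000 // ingested_per_sec)       # 2000

# Write side: per minute and per day.
print(persisted_per_sec * 60)                      # 1200000000  (~1.2 billion/min)
print(persisted_per_sec * 86_400)                  # 1728000000000  (~1.7 trillion/day)
```

So the "billion every two seconds" correction above checks out on the ingest side, and the per-day write volume lands at roughly 1.7 trillion datapoints.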
Congrats Martin and Rob on the launch. M3 is one of the best tools I used at Uber. Something which just works. I'm sure you guys will be successful as I first hand witnessed the value it brings to an organization.
Hey, Rob here, co-founder and M3DB creator; more than happy to answer any queries anyone might have. We're committed to keeping M3 100% Apache 2 licensed, clustering and all other M3 features included. We're focused on providing reliable metrics hosting at scale.
Haha, TY for the kind words. I had to stop playing years ago now, unfortunately, with family commitments. Also, I mainly enjoyed playing on a team more than actually developing real soccer skills (I relied on others on the team to pull me upwards, heh).
That’s my personal reference to the word, but searching around a bit, it seems that it was registered as a trademark by a medical company already in 1991, 5 years before Red Alert.
Metrics monitoring is hugely useful for figuring out what's going wrong (or right...) and where, especially when you can slice and dice by dimensions/tags. Microsoft (where I work) uses lots of metrics internally, for every service. It's nice to see M3/Chronosphere making this kind of thing more affordable and widely accessible.
One thing that I often miss when reading about this stuff is benchmarks. So it's faster than Prometheus? Prove it. So it's faster than Postgres, or TimescaleDB? Prove it.
It should be trivial, and the fact that it's not there, and that what you find instead are terms like "Uber-scale", is slightly worrying.
I'm not trying to take anything away from the achievements made here by the guys at Uber, but anyone seriously considering using this in production would probably need a proper comparison against the alternatives.
It's not raw speed or raw performance on a single node that M3DB is optimized for; it's a reliable scale-out story for when you need a considerable number of instances to collect the raw data you operate on. Organizations of a certain size/complexity run into this, not just a handful of organizations/companies.
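Part of that reliability story is the quorum write mentioned in the quoted post (persist each metric to three replicas in a region, succeed on a majority of acks). A toy sketch of that behavior, with hypothetical names rather than M3DB's actual replication code:

```python
# Toy quorum write: succeed once a majority of replicas ack
# (hypothetical sketch, not M3DB's real client or replication logic).

def quorum_write(replicas, datapoint, quorum=2):
    """Send the write to every replica; return True if at least
    `quorum` of them acknowledge it."""
    acks = 0
    for replica in replicas:
        try:
            replica(datapoint)   # replica is a callable that may raise on failure
            acks += 1
        except IOError:
            continue             # one failed replica doesn't fail the write
    return acks >= quorum

store = []
def healthy(dp): store.append(dp)
def down(dp): raise IOError("replica unreachable")

# 2 of 3 replicas healthy: the write still succeeds.
print(quorum_write([healthy, down, healthy], ("cpu.usage", 1600000000, 0.73)))  # True
# Only 1 of 3 healthy: the write fails.
print(quorum_write([down, down, healthy], ("cpu.usage", 1600000010, 0.75)))     # False
```

With three replicas and a quorum of two, any single node can be lost without dropping writes, which is the property that matters more at this scale than single-node throughput.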
Benchmarks tend to favor the authors and are frequently game-ified, look at GPU benchmarks like 3DMark that frequently had manufacturers release optimizations that were really only utilized in specific benchmarks.
> There weren’t any tools available on the market that could handle Uber’s scaling requirements
This isn't a problem that you can build a business around.
Edit: Ah, I get it. This is like a Mesosphere play--they're shepherding the M3 technology in the open source ecosystem and offering a commercial version. That makes more sense.
Splunk kool-aid drinker here; pardon my ignorant question, but why not just use Splunk?
Actually, I think my real question is: why is there such a proliferation of these monitoring/logging/visualization-aaS startups? Who are the target customers, in terms of spend?
Congrats on the launch Rob and Martin. M3 is an amazing product which I had the privilege to use at Uber. Wish you two the best for your journey ahead!