- Some arbitrary type of thing. (Lists, sets, strings, numbers, whatever.) Let's call it 'A'.
- A value of type A. Let's call it the 'empty' value.
- A function that takes two A values as arguments, and returns a single one. Let's call it 'combine' function.
Here are some examples of monoids:
- String concatenation: 'combine' is string concatenation, and 'empty' is the empty string.
- Integer plus: 'combine' is the usual + operation, and 'empty' is the value 0.
- Set union: 'combine' is the union of two sets, and 'empty' is the empty set.
You can come up with monoids for all sorts of things: maps, lists, functions, bloom filters, futures / promises, ...
There are, however, a couple rules:
- It doesn't matter which order you combine things in: `("hello" + "world") + "!"` has the same result as `"hello " + ("world" + "!")`; and `1 + (2 + 3)` has the same result as `(1 + 2) + 3`.
- You can add the 'empty' value to either end without changing anything. `"" + "hello"` is the same as `"hello" + ""` and plain old `"hello"`. Likewise, `3 + 0 == 0 + 3 == 3`.
That's it! To get an intuition, you might want to pick a couple more types from the list above and pick a good 'empty' / 'combine' function that satisfies the rules.
Of course, why you'd want to do this is another question. (And one I'd be happy to answer, if you're curious.)
This is probably the clearest explanation of what a monoid is that I've seen. Basically, a monoid is a reduction function that takes multiple inputs of type A and returns a single output of type A. The only caveats on top are that type A must support the concept of an empty value, combining the empty value with x results in x, and that combinations are associative.
- Any monoid operation is trivially parallelizable: take the dataset, split it into chunks, combine all the elements in each chunk together, then combine those results together as a final step.
- If I'm updating a row in my database with a monoid operation, I can always 'pre-aggregate' a batch of values together on the client side before taking that result and combining it with the stored value.
- If I store some statistics for every day of the year, I can calculate the monthly statistics very cheaply -- as long as those stats are a monoid.
The monoid abstraction seems weird at first, because it's so dang general, but it ends up hitting a bit of a sweet spot: the rules are just strong enough to be useful for a bunch of things, but simple enough that be applied to all kinds of different situations. You can think of it kind of like a hub-and-spokes thing -- this interface connects all kinds of different data types to all kinds of different cool situations, so you get a lot of functionality with a lot less typing and thinking.
The introduction of stream processing in Java 8 is a concrete example - it can auto-parallelize computations. It was not immediately clear to me why an identity value was needed in this function:
A recent post, was about "How to Build Your Distributed Database" (1).
The word 'monoid' is not used (and the word 'commutative' is even misused for 'associative')
but this post shows well why monoids matter and how they can be used.
Thank you for the great explenation; I read the article,a nd if it only had taken it out of the pure abstract and given the int example; it would have been 100% more relatable. I always get the feeling that they fear that giving an actual example would hurt the article and that the pure abstract is to be preferred.