A single machine is never going to be completely reliable. At any time it can halt for a variety of reasons: power loss, hardware failure, a data-center disaster such as flooding, etc.
Thus, a configuration that relies on the availability of a single machine is already risking a serious outage or data loss by not being machine-redundant. Reliable systems require the coordination of multiple machines (at least two), and the replication of data across them when data is involved.
It is useful to have component-level redundancy (e.g., RAID or ECC memory), but in some environments it may be cheaper overall to get machine-level redundancy from inexpensive machines. It also takes only the failure of a single critical subsystem for a machine to suffer an outage: you might have ECC memory and RAID, but do you have more than one Ethernet card and power supply? The availability of a single machine is a "weakest link" phenomenon across its components.
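To make the "weakest link" point concrete, here's a back-of-the-envelope sketch in Python (the component availabilities are made up for illustration): a machine needs every critical component up, so availabilities multiply in series, while a redundant pair is down only when both machines are down at once.

    # Hypothetical per-component availabilities (made-up numbers).
    components = {
        "power_supply": 0.999,
        "ethernet_card": 0.9995,
        "disk_raid": 0.9999,
        "ecc_memory": 0.9999,
    }

    # Series system: the machine is up only if every component is up.
    machine = 1.0
    for a in components.values():
        machine *= a

    # Parallel system: a pair of independent machines is down only
    # when both are down simultaneously.
    pair = 1 - (1 - machine) ** 2

    print(f"single machine: {machine:.5f}")  # ~0.99830
    print(f"redundant pair: {pair:.7f}")     # ~0.9999971

Note how the single machine ends up less available than its worst component, while even a crude redundant pair beats them all.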
I acknowledge that building software to run across a fleet of machines is harder than building software that runs on a single machine, but (1) the software development cost is largely a fixed cost, not a variable cost in the number of machines, and (2) building a distributed system is sometimes needed for scaling reasons anyway.
If you scale a single machine vertically (i.e., get a bigger box), its cost rises faster than its capabilities, so an efficient high-scale system typically means running a fleet of cheap machines instead (scaling horizontally). I think these effects contribute to the rise of commodity-server computing, and the cost argument is a reason not to find that trend disturbing.
In other words, crunch the numbers and see when it makes sense :-)
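Here's a toy Python model of that comparison, with entirely made-up numbers (the price function and its superlinear exponent are just stand-ins for "cost rises faster than capability"); the shape of the result is the point, not the figures:

    # Assume a box's price grows superlinearly with its capacity.
    # The exponent 1.5 is an arbitrary illustrative choice.
    def price(capacity, exponent=1.5, unit_cost=100):
        return unit_cost * capacity ** exponent

    target = 16  # total capacity needed, in units of one cheap box

    big_box = price(target)    # one machine with 16x the capacity
    fleet = target * price(1)  # sixteen 1x machines

    print(f"one big box: ${big_box:,.0f}")  # $6,400
    print(f"fleet of 16: ${fleet:,.0f}")    # $1,600

With any exponent above 1.0 the fleet wins at a high enough scale; the real question is where your workload sits on that curve.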
You are factually correct. However, availability isn't the problem ECC memory is meant to solve.
The problem with memory errors is that they are silent. You won't notice them until something goes mysteriously wrong. And that can be anything, from a harmless invalid memory access to outright data corruption. This just can't be tolerated anywhere data is being processed (data you don't want to lose, that is...).
RAID does nothing if the OS thinks its in-memory filesystem data structures are correct and just goes ahead and updates the superblock with bad data, or writes over other files' pages. You just get a nice, redundant, corrupted filesystem. The same goes for multiple machines sharing data anywhere, filesystems and databases alike.
It's the error detection part that's important, not the correction part. ECC main memory is just one part of the picture; you want to be notified of errors as soon as possible, and that's the important bit: "be notified". So you want parity checks and CRCs on disk caches, data buses, and everywhere else it's feasible. It's not an accident that server-class hardware costs more than your average PC.
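As a minimal illustration of detect-versus-correct, here's a Python sketch using the standard library's zlib.crc32: the checksum can't repair a flipped bit, but it turns silent corruption into a loud failure you can act on.

    import zlib

    data = bytearray(b"superblock: 4096 blocks, root inode 2")
    checksum = zlib.crc32(data)

    data[5] ^= 0x01  # simulate a single-bit flip in memory or on a bus

    if zlib.crc32(data) != checksum:
        print("corruption detected -- refuse to write it to disk")

Without the check, that flipped bit would have been written out as happily as good data.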
The "correction" part is just a welcome by-product. I, for one, replace memory modules as soon as they trigger more than one ECC event. And this happens occasionally even with an universe of machines in the low dozens, with supposedly high-quality components. Now think what may be happening silently with all those borderline memory modules from anonymous manufacturers in China...
Besides, as I mentioned before, it isn't easy to find servers with non-ECC memory from the usual vendors; only their very low-end machines come without it. Those machines aren't meant to do anything more than shoving packets around, or other usage patterns where either silent data corruption can be tolerated (easy-to-replace appliances that don't process or store important data) or checksums are already part of the job (network gear like firewalls or routers).
ECC events are triggered by any memory error, be it the occasional cosmic ray or a not-so-good memory module.
It isn't difficult to tell these two possibilities apart. Sometimes I get an ECC event on some server and then it never happens again (or it happens in a different module), which doesn't warrant a replacement. But if the same module triggers another event, what's the chance of two independent cosmic rays hitting that same module and flipping a bit? It's better to just replace it (which is covered by warranty or maintenance contracts, so it costs us nothing extra).
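For a rough sense of the odds, here's a small Poisson calculation in Python with an assumed (made-up) cosmic-ray upset rate of one hit per module per decade; the exact rate doesn't matter much, only that two hits on one module in a short window is far rarer than one:

    import math

    rate = 1 / 10  # assumed upsets per module per year
    t = 1.0        # observation window in years
    lam = rate * t

    p0 = math.exp(-lam)        # Poisson P(0 hits)
    p1 = lam * math.exp(-lam)  # Poisson P(1 hit)
    p_two_plus = 1 - p0 - p1

    print(f"P(>=2 hits on one module): {p_two_plus:.4%}")  # ~0.47%

So when a module logs a second event, the smart money is on the module, not the universe.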
Manufacturing memory from silicon wafers is similar to baking cookies. Some cookies are great, some turn out OK, and some are burnt depending on the characteristics of the ingredients, the oven, and the chaotic thermodynamic properties of the system.