A single machine is never going to be completely reliable. At any time it can halt for a variety of reasons: power loss, hardware failure, a data-center disaster such as flooding, etc.
Thus, a configuration that relies on the availability of a single machine is already risking a serious outage or data loss by not being machine-redundant. Reliable systems require the coordination of multiple machines (at least two), and the replication of data across them when data is involved.
It is useful to have component-level redundancy (e.g., RAID or ECC memory), but in some environments it may be cheaper overall to get machine-level redundancy from inexpensive machines. It also takes only the failure of a single critical subsystem for a machine to suffer an outage: you might have ECC memory and RAID, but do you have more than one Ethernet card and power supply? The availability of a single machine is a "weakest link" phenomenon across its components.
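To make the "weakest link" point concrete, here's a back-of-the-envelope sketch in Python (the component availabilities are made up for illustration): a machine needs every critical component up, so availabilities multiply in series, while a redundant pair is down only when both machines are down at once.

    # Hypothetical per-component availabilities (made-up numbers).
    components = {
        "power_supply": 0.999,
        "ethernet_card": 0.9995,
        "disk_raid": 0.9999,
        "ecc_memory": 0.9999,
    }

    # Series system: the machine is up only if every component is up.
    machine = 1.0
    for a in components.values():
        machine *= a

    # Parallel system: a pair of independent machines is down only
    # when both are down simultaneously.
    pair = 1 - (1 - machine) ** 2

    print(f"single machine: {machine:.5f}")  # ~0.99830
    print(f"redundant pair: {pair:.7f}")     # ~0.9999971

Note how the single machine ends up less available than its worst component, while even a crude redundant pair beats them all.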
I acknowledge that building software to run across a fleet of machines is harder than building software that runs on a single machine, but (1) the software development cost is largely a fixed cost, not a variable cost in the number of machines, and (2) building a distributed system is sometimes needed for scaling reasons anyway.
If you scale a single machine vertically (i.e., get a bigger box), its cost rises faster than its capabilities, so an efficient high-scale system typically means running a fleet of cheap machines instead (scaling horizontally). I think these effects contribute to the rise of commodity-server computing, and the cost argument is a reason not to find that trend disturbing.
In other words, crunch the numbers and see when it makes sense :-)
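Here's a toy Python model of that comparison, with entirely made-up numbers (the price function and its superlinear exponent are just stand-ins for "cost rises faster than capability"); the shape of the result is the point, not the figures:

    # Assume a box's price grows superlinearly with its capacity.
    # The exponent 1.5 is an arbitrary illustrative choice.
    def price(capacity, exponent=1.5, unit_cost=100):
        return unit_cost * capacity ** exponent

    target = 16  # total capacity needed, in units of one cheap box

    big_box = price(target)    # one machine with 16x the capacity
    fleet = target * price(1)  # sixteen 1x machines

    print(f"one big box: ${big_box:,.0f}")  # $6,400
    print(f"fleet of 16: ${fleet:,.0f}")    # $1,600

With any exponent above 1.0 the fleet wins at a high enough scale; the real question is where your workload sits on that curve.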
You are factually correct. However, availability isn't the problem ECC memory is meant to solve.
The problem with memory errors is that they are silent. You won't notice them until something goes mysteriously wrong. And that can be anything, from a harmless invalid memory access to outright data corruption. This just can't be tolerated anywhere data is being processed (data you don't want to lose, that is...).
RAID does nothing if the OS thinks its in-memory filesystem data structures are correct and just goes ahead and updates the superblock with bad data, or writes over other files' pages. You just get a nice, redundant, corrupted filesystem. The same goes for multiple machines sharing data anywhere, filesystems and databases alike.
It's the error detection part that's important, not the correction part. ECC main memory is just one part of the picture; you want to be notified of errors as soon as possible, and that's the important bit: "be notified". So you want parity checks and CRCs on disk caches, data buses, and everywhere else it's feasible. It's not an accident that server-class hardware costs more than your average PC.
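As a minimal illustration of detect-versus-correct, here's a Python sketch using the standard library's zlib.crc32: the checksum can't repair a flipped bit, but it turns silent corruption into a loud failure you can act on.

    import zlib

    data = bytearray(b"superblock: 4096 blocks, root inode 2")
    checksum = zlib.crc32(data)

    data[5] ^= 0x01  # simulate a single-bit flip in memory or on a bus

    if zlib.crc32(data) != checksum:
        print("corruption detected -- refuse to write it to disk")

Without the check, that flipped bit would have been written out as happily as good data.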
The "correction" part is just a welcome by-product. I, for one, replace memory modules as soon as they trigger more than one ECC event. And this happens occasionally even with an universe of machines in the low dozens, with supposedly high-quality components. Now think what may be happening silently with all those borderline memory modules from anonymous manufacturers in China...
Besides, as I mentioned before, it isn't easy to find servers with non-ECC memory from the usual vendors; only their very low-end machines come without it. Those machines aren't meant to do anything more than shoving packets around, or other usage patterns where either silent data corruption can be tolerated (easy-to-replace appliances that don't process or store important data) or checksums are already part of the job (network gear like firewalls or routers).
ECC events are triggered by any memory error, be it the occasional cosmic ray or a not-so-good memory module.
It isn't difficult to tell these two possibilities apart. Sometimes I get an ECC event on some server and then it never happens again (or it happens in a different module), which doesn't warrant a replacement. But if the same module triggers another event, what's the chance of two independent cosmic rays hitting that same module and flipping a bit? It's better to just replace it (which is covered by warranty or maintenance contracts, so it costs us nothing extra).
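For a rough sense of the odds, here's a small Poisson calculation in Python with an assumed (made-up) cosmic-ray upset rate of one hit per module per decade; the exact rate doesn't matter much, only that two hits on one module in a short window is far rarer than one:

    import math

    rate = 1 / 10  # assumed upsets per module per year
    t = 1.0        # observation window in years
    lam = rate * t

    p0 = math.exp(-lam)        # Poisson P(0 hits)
    p1 = lam * math.exp(-lam)  # Poisson P(1 hit)
    p_two_plus = 1 - p0 - p1

    print(f"P(>=2 hits on one module): {p_two_plus:.4%}")  # ~0.47%

So when a module logs a second event, the smart money is on the module, not the universe.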
Manufacturing memory from silicon wafers is similar to baking cookies. Some cookies are great, some turn out OK, and some are burnt depending on the characteristics of the ingredients, the oven, and the chaotic thermodynamic properties of the system.