ECC mattered here because the RAM was bad. But modern Macs don't come with user-serviceable RAM at all, so a problem like this is a support ticket anyway, and I'm not even sure there's a true equivalent to Memtest86 for modern Macs in the first place. So if it is a RAM problem, there's little point in diagnosing it yourself even if you could; just send your Mac in when you start seeing issues that look like bad RAM.
Even with ECC, it's incredibly hard to know that a given one-off issue isn't a memory error, because even ECC can't detect 100% of memory issues. But without ECC, it's also nearly impossible to know if something is a memory error. If it's bad RAM, the same address will likely continue to exhibit bad behavior, but if it's a solar flare, you're never going to know the difference; you will just get incorrect behavior that may or may not crash, and it will be completely impossible to reproduce.
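To make the "same address keeps failing" point concrete, here's a rough simulation. Everything in it is made up for illustration (`SimulatedRam`, the stuck-at-1 bit, the transient flip rate); it is not how Memtest86 works internally and it does not touch real physical memory. It just shows why a hard fault shows up again and again at one address while one-off flips scatter and never repeat.

```python
import random
from collections import Counter


class SimulatedRam:
    """Toy memory: one stuck-at-1 bit plus rare random one-off flips on read."""

    def __init__(self, size, stuck_addr, stuck_bit, transient_rate, seed=0):
        self.cells = bytearray(size)
        self.stuck_addr = stuck_addr
        self.stuck_mask = 1 << stuck_bit
        self.transient_rate = transient_rate
        self.rng = random.Random(seed)

    def write(self, addr, value):
        self.cells[addr] = value

    def read(self, addr):
        value = self.cells[addr]
        if addr == self.stuck_addr:
            value |= self.stuck_mask  # hard fault: this bit always reads as 1
        if self.rng.random() < self.transient_rate:
            value ^= 1 << self.rng.randrange(8)  # soft error: one-off flip
        return value


def failing_addresses(ram, size, passes):
    """Write test patterns, read them back, count failures per address."""
    failures = Counter()
    for _ in range(passes):
        for pattern in (0x00, 0xFF, 0x55, 0xAA):
            for addr in range(size):
                ram.write(addr, pattern)
            for addr in range(size):
                if ram.read(addr) != pattern:
                    failures[addr] += 1
    return failures


if __name__ == "__main__":
    ram = SimulatedRam(size=1024, stuck_addr=300, stuck_bit=3,
                       transient_rate=1e-4, seed=1)
    counts = failing_addresses(ram, size=1024, passes=10)
    for addr, n in counts.most_common(5):
        print(f"address {addr}: {n} failures")
```

The stuck address should dominate the failure counts, while the transient flips land on scattered addresses a handful of times at most. On a real machine you only get the first kind of evidence; the second kind just looks like a random glitch.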
One big reason you don't hear about it as much is that there aren't nearly as many data centers filled with Macs. There are definitely a few, and I bet an experience report from them would give some idea of how visible memory errors are on Macs (although it's hard, because again, without ECC there's not really a good way to know whether something is a memory error; you can only speculate).
ECC is error correcting. A bit gets flipped and it not only detects it but fixes it. Two bits get flipped and it can at least detect it and panic the machine immediately instead of corrupting your data.
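For anyone who wants to see the mechanism, here's a minimal sketch of single-error-correct, double-error-detect using an extended Hamming(8,4) code. This is a toy: I'm assuming nothing about what any particular DIMM or memory controller actually implements (real ECC memory uses much wider codewords), and `encode` / `decode` are names I made up for the example.

```python
def encode(nibble):
    """Encode 4 data bits as an 8-bit extended Hamming codeword (list of bits)."""
    code = [0] * 8  # index 0: overall parity; 1, 2, 4: Hamming parity bits
    code[3], code[5], code[6], code[7] = ((nibble >> i) & 1 for i in range(4))
    for p in (1, 2, 4):
        # Parity bit p covers every position whose index has bit p set.
        code[p] = sum(code[i] for i in range(1, 8) if i & p) % 2
    code[0] = sum(code[1:]) % 2  # makes the total number of 1 bits even
    return code


def decode(code):
    """Return (status, nibble); status is 'ok', 'corrected' or 'uncorrectable'."""
    code = list(code)
    syndrome = 0
    for i in range(1, 8):
        if code[i]:
            syndrome ^= i  # XOR of set positions is 0 for a valid codeword
    odd_parity = sum(code) % 2 == 1

    if syndrome == 0 and not odd_parity:
        status = "ok"
    elif odd_parity:
        # Odd number of flips: assume exactly one and flip it back.
        # Syndrome 0 here means the overall parity bit itself (index 0).
        code[syndrome] ^= 1
        status = "corrected"
    else:
        # Nonzero syndrome with even parity: two bits flipped, cannot fix.
        return "uncorrectable", None

    nibble = code[3] | code[5] << 1 | code[6] << 2 | code[7] << 3
    return status, nibble


if __name__ == "__main__":
    word = encode(0b1011)
    print(decode(word))      # no flips

    one = list(word)
    one[6] ^= 1
    print(decode(one))       # one flipped bit

    two = list(word)
    two[2] ^= 1
    two[6] ^= 1
    print(decode(two))       # two flipped bits
```

The "uncorrectable" branch is the case where a real machine would raise a machine check and panic rather than hand you bad data.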
Without it the corruption is silent. Then this kind of thing happens:
Which is another reason not to solder the storage either.
Suppose you have a system board with bad soldered memory and you want to copy your data off of it onto the new one. Well, the memory is flipping random bits as it's copying, but the flash chips are permanently attached to the same board as the bad memory.
Otherwise it would have been just a support ticket; now it's something worse.
> ECC is error correcting. A bit gets flipped and it not only detects it but fixes it. Two bits get flipped and it can at least detect it and panic the machine immediately instead of corrupting your data.
I did neglect to mention that ECC by definition can correct errors, but I wonder if what's upsetting people about my comment is the implication that ECC can't detect all errors.
But it's true: ECC can't detect all bitflips, and in fact there's at least one study[1] that suggests quite a lot of memory errors go entirely undetected even with ECC.
Silent corruption does in fact occur even with ECC and it may not even be particularly rare, even though it is rarer than typical single/double-bit flips. Of course, the majority of desktops use non-ECC RAM and it's mostly fine, so I assume this is only ever going to matter in production workloads, and exactly what impact it has is hard to gauge.
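To illustrate how an error can slip past a correct-one/detect-two scheme, here's a small sketch with a toy extended Hamming(8,4) code. The assumption doing all the work is that the decoder "fixes" any word that is one bit away from a valid codeword; `codeword` and `triple_flip_outcomes` are hypothetical names, and the numbers say nothing about the rates on real DIMMs, which use wider codes.

```python
from itertools import combinations


def codeword(nibble):
    """8-bit extended Hamming codeword for a 4-bit value, packed into an int."""
    bits = [0] * 8  # index 0: overall parity; 1, 2, 4: Hamming parity bits
    bits[3], bits[5], bits[6], bits[7] = ((nibble >> i) & 1 for i in range(4))
    for p in (1, 2, 4):
        bits[p] = sum(bits[i] for i in range(1, 8) if i & p) % 2
    bits[0] = sum(bits[1:]) % 2
    return sum(b << i for i, b in enumerate(bits))


CODEWORDS = [codeword(n) for n in range(16)]


def distance(a, b):
    """Number of bit positions in which a and b differ."""
    return bin(a ^ b).count("1")


def triple_flip_outcomes():
    """Flip every 3-bit combination in every codeword and classify the result."""
    silent = detected = 0
    for original in CODEWORDS:
        for positions in combinations(range(8), 3):
            received = original
            for p in positions:
                received ^= 1 << p
            # Anything one flip away from a valid codeword looks like an
            # ordinary single-bit error, so the decoder would "correct" it.
            near = [c for c in CODEWORDS if distance(received, c) == 1]
            if near and near[0] != original:
                silent += 1    # "corrected" into the wrong data, no alarm
            else:
                detected += 1
    return silent, detected


if __name__ == "__main__":
    silent, detected = triple_flip_outcomes()
    print(f"silently miscorrected: {silent}, detected: {detected}")
```

In this toy code every three-bit flip ends up one bit away from some other valid codeword, so the decoder reports a successful correction and returns wrong data. Multi-bit events are much rarer than single flips, which is consistent with this being the less common case, but it is not a case ECC can rule out.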
Maybe the issue is that undetectable errors are possible, but if the system is in such a bad way that they're happening at any rate, you'll also be getting quite a lot of the detectable ones and then get prompt notice that something is wrong.
Whereas without ECC you could have silent data corruption for years and only discover it after it gets severe enough to warrant a manual investigation, after the damage has already propagated to your backups.
> Of course, the majority of desktops use non-ECC RAM and it's mostly fine, so I assume this is only ever going to matter in production workloads, and exactly what impact it has is hard to gauge.
There are two reasons it's useful. One is the cosmic ray random bit flip that happens even to hardware in good condition, and then ECC can usually detect and correct it, but that's less common and more important for production workloads.
The other is, your hardware is experiencing a higher than average number of random bit flips, and then ECC gives you immediate notice when this starts happening instead of letting it sow chaos until something crashes so hard you take notice.