I wasn't trying to imply it should be on by default. Often one does not care about the lower bits of the floats, but do want the speed. For some tasks it's very much the opposite. Being able to specify a global option with local override is a great combo.
FMA is not always faster, it has a high latency: 5-6 cycles depending on the CPU while Add and MUL have very low-latency.
This means that to fully utilizes FMA you need to unroll a loop more. Sometimes yyou just can't, and the other time you use more instructions, use more cache.
In short it's not always better.
Also as other said, FMA has better accuracy than separate Add + Mul