+1, rounding f32 to bf16 is helpful. For the other direction, the approach we take in Highway/gemma.cpp is to load a full vector of bf16, then shift/AND to isolate the even/odd elements, which yields their float values directly. The shift and AND can execute two per cycle, whereas promoting 16->32 bit is often just one per cycle (though on a different port than FMA).
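To make the shift/AND trick concrete, here is a scalar sketch of what happens in each 32-bit lane. This is not the actual Highway code, and the helper names are made up for illustration. It assumes little-endian lane order, so the even-indexed bf16 sits in the low 16 bits of each 32-bit lane. Since bf16 is just the upper half of an f32, moving those 16 bits into the upper half is the whole conversion. The rounding helper is one common round-to-nearest-even formulation, included for the f32 to bf16 direction; its NaN handling is an assumption of this sketch.

```cpp
#include <cstdint>
#include <cstring>

// Reinterpret a 32-bit pattern as a float (and back) without UB.
static inline float BitsToFloat(uint32_t bits) {
  float f;
  std::memcpy(&f, &bits, sizeof(f));
  return f;
}
static inline uint32_t FloatToBits(float f) {
  uint32_t bits;
  std::memcpy(&bits, &f, sizeof(bits));
  return bits;
}

// Hypothetical helpers: `packed` is one 32-bit lane holding two bf16.
// Even element (low 16 bits): shift it up into the f32's upper half.
static inline float PromoteEvenBF16(uint32_t packed) {
  return BitsToFloat(packed << 16);
}
// Odd element (high 16 bits): already in place, just clear the low half.
static inline float PromoteOddBF16(uint32_t packed) {
  return BitsToFloat(packed & 0xFFFF0000u);
}

// f32 -> bf16 with round-to-nearest-even, instead of plain truncation.
static inline uint16_t RoundF32ToBF16(float f) {
  const uint32_t bits = FloatToBits(f);
  // Assumption of this sketch: keep NaNs as NaNs by setting a quiet bit,
  // because the rounding add below could otherwise carry into the exponent.
  if ((bits & 0x7FFFFFFFu) > 0x7F800000u) {
    return static_cast<uint16_t>((bits >> 16) | 0x0040u);
  }
  // Add 0x7FFF plus the lowest kept bit so that ties round to even.
  const uint32_t bias = 0x7FFFu + ((bits >> 16) & 1u);
  return static_cast<uint16_t>((bits + bias) >> 16);
}
```

In the vector version the same shift and AND are applied to every 32-bit lane at once, so one bf16 load produces two float vectors (evens and odds) rather than elements in their original order. That is fine for things like dot products as long as the other operand is deinterleaved the same way.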