
Exotic is the most interesting one. For example, I recently needed a fast way to flip the order of bits in each byte of a long buffer. Did that in 8 AVX2 instructions per 32 bytes (including the load and store), by using 16-byte lookup tables for VPSHUFB.


8 instructions seems very solid - guessing AND/PSHUFB for low nibble, SHIFT/AND/PSHUFB for high nibble, OR to combine plus load/store?

If you have AVX-512, GFNI is faster for this task, but obviously there are many situations where you can't use it.


> guessing AND/PSHUFB for low nibble, SHIFT/AND/PSHUFB for high nibble, OR to combine

Yeah, that’s exactly what I did in my C++ code with intrinsics.

About ISA extensions, I’m lucky to work on professional CAM/CAE software. We have specified AVX2 in the system requirements, so I’m guaranteed to have the support on our customers’ computers. However, very few of them have AVX512 CPUs, so we are ignoring that thing so far.
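
For readers who want to see the sequence discussed above (AND/PSHUFB for the low nibble, SHIFT/AND/PSHUFB for the high nibble, OR to combine), here is a minimal sketch in C++ with intrinsics. It is not the code from the thread: the function names, the table names, the scalar tail handling, and the runtime CPU check are all assumptions made for illustration, and it assumes GCC or Clang on x86-64 (the `target` attribute and `__builtin_cpu_supports` are compiler extensions).

```cpp
#include <immintrin.h>
#include <cstddef>
#include <cstdint>

// Hypothetical names throughout; this is an illustrative sketch only.

// Indexed by the LOW nibble: the reversed nibble, already moved to the high 4 bits.
alignas(16) static const uint8_t kLowNibbleTable[16] = {
    0x00, 0x80, 0x40, 0xC0, 0x20, 0xA0, 0x60, 0xE0,
    0x10, 0x90, 0x50, 0xD0, 0x30, 0xB0, 0x70, 0xF0};

// Indexed by the HIGH nibble: the reversed nibble, placed in the low 4 bits.
alignas(16) static const uint8_t kHighNibbleTable[16] = {
    0x00, 0x08, 0x04, 0x0C, 0x02, 0x0A, 0x06, 0x0E,
    0x01, 0x09, 0x05, 0x0D, 0x03, 0x0B, 0x07, 0x0F};

// Scalar reference: swap nibbles, then bit pairs, then adjacent bits.
static inline uint8_t reverseBitsScalar(uint8_t b) {
    b = static_cast<uint8_t>((b >> 4) | (b << 4));
    b = static_cast<uint8_t>(((b & 0xCC) >> 2) | ((b & 0x33) << 2));
    b = static_cast<uint8_t>(((b & 0xAA) >> 1) | ((b & 0x55) << 1));
    return b;
}

// AVX2 path: 32 bytes per iteration; leftover bytes fall back to the scalar code.
__attribute__((target("avx2")))
static void reverseBitsAvx2(uint8_t* data, size_t length) {
    // VPSHUFB looks up within each 128-bit lane, so each 16-byte table
    // is broadcast to both lanes.
    const __m256i lowTable = _mm256_broadcastsi128_si256(
        _mm_load_si128(reinterpret_cast<const __m128i*>(kLowNibbleTable)));
    const __m256i highTable = _mm256_broadcastsi128_si256(
        _mm_load_si128(reinterpret_cast<const __m128i*>(kHighNibbleTable)));
    const __m256i nibbleMask = _mm256_set1_epi8(0x0F);

    size_t i = 0;
    for (; i + 32 <= length; i += 32) {
        __m256i v = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(data + i));
        __m256i lo = _mm256_and_si256(v, nibbleMask);
        // There is no 8-bit shift, so shift 16-bit lanes and mask off
        // the bits that leaked in from the neighbouring byte.
        __m256i hi = _mm256_and_si256(_mm256_srli_epi16(v, 4), nibbleMask);
        lo = _mm256_shuffle_epi8(lowTable, lo);
        hi = _mm256_shuffle_epi8(highTable, hi);
        _mm256_storeu_si256(reinterpret_cast<__m256i*>(data + i),
                            _mm256_or_si256(lo, hi));
    }
    for (; i < length; ++i) data[i] = reverseBitsScalar(data[i]);
}

// Reverses the bit order inside every byte of the buffer, in place.
void reverseBitsInEachByte(uint8_t* data, size_t length) {
    if (__builtin_cpu_supports("avx2")) {
        reverseBitsAvx2(data, length);
    } else {
        for (size_t i = 0; i < length; ++i) data[i] = reverseBitsScalar(data[i]);
    }
}
```

Calling `reverseBitsInEachByte(buffer, size)` rewrites the buffer in place. The loop body corresponds to the eight instructions counted above: load, AND, shift, AND, two VPSHUFB lookups, OR, and store. If AVX2 is guaranteed by the system requirements, as described in the comment, the runtime check and scalar fallback could be dropped.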



