Practice is a key ingredient. Build and debug increasingly complex systems, and be prepared to go below the assembler level while doing so (e.g. profile with Intel VTune, learn about weak memory ordering, ...).
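To make the "weak memory ordering" point a bit more concrete, here is a minimal sketch of the classic message-passing pattern in C++. The function name `publish_and_consume` and the value 42 are made up for illustration. The idea is that a release store paired with an acquire load ensures the consumer sees the payload written before the flag; with relaxed ordering on both sides that guarantee would be gone, and on weakly ordered hardware the consumer could in principle observe the flag without the data.

```cpp
#include <atomic>
#include <thread>

// Hypothetical example: hand a plain int from one thread to another.
// Returns the value the consumer thread observed.
int publish_and_consume() {
    int payload = 0;                 // ordinary, non-atomic data
    std::atomic<bool> ready{false};  // flag that guards the payload
    int seen = -1;

    std::thread producer([&] {
        payload = 42;                                  // 1. write the data
        ready.store(true, std::memory_order_release);  // 2. publish it
    });

    std::thread consumer([&] {
        // Spin until the flag is set. The acquire load synchronizes with
        // the release store above, so the write to payload is visible.
        while (!ready.load(std::memory_order_acquire)) {
            std::this_thread::yield();
        }
        seen = payload;
    });

    producer.join();
    consumer.join();
    return seen;
}
```

A useful exercise is to compile this for different targets and compare the generated instructions for the store and the load; that is where the ordering requirements become visible.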
Learn how high-level language code is transformed into "what the hardware shall do", so that you can predict, to some degree, whether the particular code you are reviewing will be optimized well or not (then measure and confirm/extend/correct your prediction).
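As a small sketch of this predict-then-measure loop: the two functions below compute the same sum over a matrix stored row-major in a flat vector, one walking memory contiguously and one striding by a full row per access. All names here (`sum_row_major`, `sum_col_major`, `time_ms`) are made up for this example. My prediction would be that the strided version is slower for large `n` on a typical cached CPU, but the compiler may reorder the loops or vectorize differently than expected, which is exactly why you measure.

```cpp
#include <chrono>
#include <cstddef>
#include <vector>

// Sum an n x n matrix stored row-major, touching memory contiguously.
long long sum_row_major(const std::vector<int>& m, std::size_t n) {
    long long s = 0;
    for (std::size_t r = 0; r < n; ++r)
        for (std::size_t c = 0; c < n; ++c)
            s += m[r * n + c];
    return s;
}

// Same sum, but each access jumps n elements ahead (strided access).
long long sum_col_major(const std::vector<int>& m, std::size_t n) {
    long long s = 0;
    for (std::size_t c = 0; c < n; ++c)
        for (std::size_t r = 0; r < n; ++r)
            s += m[r * n + c];
    return s;
}

// Run f once, store its result in out (so the work cannot be discarded
// as unused), and return the elapsed wall-clock time in milliseconds.
template <class F>
double time_ms(F&& f, long long& out) {
    auto t0 = std::chrono::steady_clock::now();
    out = f();
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}
```

To use it, fill a `std::vector<int>` of size `n * n`, time both variants with something like `time_ms([&] { return sum_row_major(m, n); }, result)`, and repeat for several values of `n` and optimization levels. A single run is noisy, so treat one measurement as a hint and look at the generated assembly when the numbers contradict your prediction.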
Working with FPGAs helped me learn a lot about how to get the right data to the right execution unit at the right point in time. Solving this "space-time problem" is exactly what optimizing is about on CPUs as well.
For the engineering part, I'd say: being strict about separation of concerns and single-source principles, while not being too religious about abstractions, certainly helps.
And finally: read, read, read. Books, papers, other people's code...