Hacker News

For someone who already knows a thing or two about CUDA and parallel programming, the best reference is Paulius Micikevicius’ presentations. If the words in them mean something to you, these 100+ slides explain more about the hardware and programming model than any other documentation you’ll find elsewhere.

http://on-demand.gputechconf.com/gtc/2013/presentations/S346...

If you want to really master CUDA, Nvidia GPUs, and the various programming model tradeoffs, the best exercise is to write a GEMM kernel and a sort kernel from scratch. To take it further, write two of each: one that optimizes for large GEMMs/sorts, and one that optimizes for batches of small GEMMs (or large GEMMs with a tiny (<16 or <32) `k` or other dim) / batches of small sorts. Specialization for different problem configurations is often the name of the game.

For GEMM, you can work through the simple GEMM example in the CUDA documentation, then take a look at the Volkov GEMM from 2008, then the MAGMA GEMM, then the Junjie Lai / INRIA GEMM, then eventually the Scott Gray / Nervana SASS implementation, in increasing order of complexity and state-of-the-art-ness.
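To see where that progression starts, here is a sketch of the kind of naive kernel the CUDA documentation's simple example presents: one thread per output element of C, row-major matrices, no tiling (names and launch parameters here are illustrative, not from any particular source):

```cuda
// Naive SGEMM sketch: C = alpha*A*B + beta*C, row-major, one thread per
// element of C. Every global memory access to A and B is uncoalesced or
// redundant; the later implementations in the list above (Volkov, MAGMA,
// Lai, Gray) progressively fix this with shared-memory tiling, register
// blocking, and ultimately hand-scheduled SASS.
__global__ void sgemm_naive(int M, int N, int K, float alpha,
                            const float *A, const float *B,
                            float beta, float *C) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < M && col < N) {
        float acc = 0.0f;
        for (int k = 0; k < K; ++k)
            acc += A[row * K + k] * B[k * N + col];
        C[row * N + col] = alpha * acc + beta * C[row * N + col];
    }
}

// Example launch for M x N output, 16x16 thread blocks:
//   dim3 block(16, 16);
//   dim3 grid((N + block.x - 1) / block.x, (M + block.y - 1) / block.y);
//   sgemm_naive<<<grid, block>>>(M, N, K, 1.0f, dA, dB, 0.0f, dC);
```

Profiling this version against cuBLAS makes the gap, and the motivation for each optimization step in the papers above, very concrete.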



