You're going to want to optimize the hell out of it and ensure it's correct, then never revisit it.
Balderdash. Claptrap. Codswallop. Poppycock. What if you want to revisit it to make it run on parallel processors? Optimize it for a new CUDA architecture or caching scheme? A new instruction set that handles sparse matrices better? Code lives forever, at every level.
For this kind of linear algebra, you're going to have a different routine for a sparse matrix (dcoomm), a symmetric one (dsymm), a triangular one (dtrmm), a symmetric banded matrix-vector product (dsbmv), or whatever. Hopefully those are separate functions from your dense one (dgemm), which is what I'm talking about. You'll also have a different function for CUDA (which exists, in cuBLAS) than for multi-core (in ScaLAPACK? PLASMA?) than for single-CPU (in BLAS), which is what I'm talking about. General matrix multiply in this sense isn't "all the different matrix multiplies." It's a function with one very specific purpose, and yeah, you don't really need to revisit it every decade.
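For what it's worth, here's roughly what calling that dense routine looks like through the CBLAS interface; a minimal sketch assuming some installed BLAS such as OpenBLAS (compile with something like cc gemm.c -lopenblas):

    #include <stdio.h>
    #include <cblas.h>   /* C interface to whatever BLAS you link against */

    int main(void) {
        /* dgemm computes C = alpha*A*B + beta*C and nothing else.
           Here A is 2x3, B is 3x2, C is 2x2, all row-major. */
        double A[6] = { 1,  2,  3,
                        4,  5,  6 };
        double B[6] = { 7,  8,
                        9, 10,
                       11, 12 };
        double C[4] = { 0 };

        cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    2, 2, 3,      /* M, N, K */
                    1.0, A, 3,    /* alpha, A, leading dimension of A */
                    B, 2,         /* B, leading dimension of B */
                    0.0, C, 2);   /* beta, C, leading dimension of C */

        printf("%g %g\n%g %g\n", C[0], C[1], C[2], C[3]);  /* 58 64 / 139 154 */
        return 0;
    }

The interface is the point: dsymm, dtrmm, dsbmv, and the cuBLAS version are each their own entry point with their own contract, and the dgemm contract itself hasn't needed to change. Only the implementation behind it gets re-tuned per platform.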