
Yeah, classic GPU use cases like deep learning have you transfer the weights for the entire model to your GPU(s) once, at the start of inference, and after that you only transfer your inputs over.
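A minimal sketch of that pattern, assuming PyTorch with a CUDA device (model and batch sizes are just placeholders): the weights cross the bus once, then each inference only moves a small input batch.

    import torch
    import torchvision.models as models

    device = torch.device("cuda")

    # One-time transfer: copy all model weights into GPU memory.
    model = models.resnet50(weights=None).to(device).eval()

    with torch.no_grad():
        for _ in range(100):
            # Per-inference transfer: only the small input batch moves over.
            x = torch.randn(8, 3, 224, 224).to(device)
            y = model(x)      # compute happens entirely on-device
            out = y.cpu()     # only the small output moves back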

A workload that transfers ALL of its data over on every call is obviously misusing the GPU.

If you've ever tried running a model that's too large for your GPU, you'll have experienced how slow this is: you have to pull the model in piece by piece for every single inference run.
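A toy sketch of why that's slow (hand-rolled offloading, not any real library's API): when the model doesn't fit, every layer's weights must cross the bus on every forward pass, so weight transfer dominates the runtime.

    import torch

    def forward_with_offload(layers, x, device):
        """Layers live in CPU memory; stream each one to the GPU
        just long enough to use it, then evict it again."""
        x = x.to(device)
        for layer in layers:
            layer.to(device)   # expensive: full weight transfer, every call
            x = layer(x)
            layer.to("cpu")    # evict to make room for the next layer
        return x.cpu()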


