For those skimming and to add to the above the article is using the gpu to work ...

For those skimming and to add to the above the article is using the gpu to work with system memory since that’s where they have the initial data and where they want the result in this case and comparing it to a cpu doing the same. The entire bottleneck is GPU to system memory.

If you’re willing to work entirely with the gpu memory the gpu will of course be faster even in this scenario.