It is a little disappointing that they're setting the bar against vanilla Python in their comparisons. While I'm sure they have put massive engineering effort into their ML compiler, the matmul demos they showed are not that impressive in an absolute sense. Compare the analogous Julia code, which uses [LoopVectorization.jl](https://github.com/JuliaSIMD/LoopVectorization.jl) to automatically choose good defaults for vectorization and the like:
```julia
julia> using LoopVectorization, BenchmarkTools, Test

julia> function AmulB!(C, A, B)
           @turbo for n = indices((C, B), 2), m = indices((C, A), 1)
               Cmn = zero(eltype(C))
               for k = indices((A, B), (2, 1))
                   Cmn += A[m, k] * B[k, n]
               end
               C[m, n] = Cmn
           end
       end

julia> M = K = N = 144; A = rand(Float32, M, K); B = rand(Float32, K, N); C0 = A * B; C1 = similar(C0);

julia> AmulB!(C1, A, B)

julia> @test C1 ≈ C0

julia> 2e-9 * M * K * N / @belapsed(AmulB!($C1, $A, $B))
96.12825754527164
```
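For anyone puzzling over the last line: a dense M×K×N matmul performs 2·M·K·N floating-point operations (one multiply and one add per inner-product term), so dividing that count by the elapsed time gives throughput, and the `2e-9` factor scales it to GFLOPS. The same bookkeeping in Python, using NumPy's matmul as the kernel (function name and trial count here are just illustrative):

```python
import time
import numpy as np

def matmul_gflops(M, K, N, trials=10):
    """Time C = A @ B and report throughput in GFLOPS.

    A dense matmul does 2*M*K*N flops: one multiply and one add
    for each of the K inner-product terms, per M*N output element.
    """
    A = np.random.rand(M, K).astype(np.float32)
    B = np.random.rand(K, N).astype(np.float32)
    best = float("inf")
    for _ in range(trials):
        t0 = time.perf_counter()
        A @ B
        best = min(best, time.perf_counter() - t0)  # keep fastest run
    return 2e-9 * M * K * N / best

print(matmul_gflops(144, 144, 144))
```

Taking the minimum over several trials, as `@belapsed` does, filters out timing noise from the first (warm-up) runs.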
I'm able to achieve 96 GFLOPS on a single core (Apple M1) or 103 GFLOPS on a single core (AMD EPYC 7502). And that's not even as good as what you can achieve using e.g. TVM to do the kind of scheduling exploration that Mojo purports to do.
Perhaps they have more extensive examples coming that showcase the capabilities further. I understand it's difficult to show all strengths of the entire system in a short demonstration video. :)
EDIT: As expected, there are significantly better benchmarks shown at https://www.modular.com/blog/the-worlds-fastest-unified-matr... so perhaps this whole discussion truly is just a matter of the demo not showcasing the true power of the system. Hopefully achieving those high performance numbers for sgemm is doable without too much ugly code.
Yeah, I think no one will have much of an edge on something as simple as matrix multiplication, since both languages support the right abstractions and both end up in LLVM code generation. Having Python 3 backwards compatibility, and being able to easily deploy your code to, say, phones via a C++ API, is quite big though.
It seems you are using N=144, whereas the Modular example uses N=1024, which makes the calculation in this Julia example significantly less computationally expensive.
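The size difference is easy to quantify: the flop count of a square matmul scales cubically, so going from 144 to 1024 per dimension multiplies the work by (1024/144)³, roughly 360×. A quick back-of-the-envelope check:

```python
def matmul_flops(M, K, N):
    # One multiply + one add per inner-product term.
    return 2 * M * K * N

small = matmul_flops(144, 144, 144)      # ~6.0e6 flops
large = matmul_flops(1024, 1024, 1024)   # ~2.1e9 flops
print(large / small)                     # ≈ 359.6
```

Note this only compares raw work; per-flop efficiency can also differ between the two sizes, since the 144³ problem fits in cache far more comfortably.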
I'm sure there are reasons for it, but Chris Lattner has been jumping around a bit. Remember Swift for TensorFlow?
But hopefully this one is seen through with lots of open source too.
but Mojo seems to keep all of the numpy cruft that's there because of Python (numeric computing was bolted onto Python via numpy, whereas it's built into Julia)
I'd love to see a comparison vs Julia though, which I think tried to tackle some of the same problems.