It is a little disappointing that they're setting the bar against vanilla Python in their comparisons. While I'm sure they have put massive engineering effort into their ML compiler, the matmul demos they showed are not that impressive in an absolute sense. Compare the analogous Julia code, which uses [LoopVectorization.jl](https://github.com/JuliaSIMD/LoopVectorization.jl) to automatically choose good defaults for vectorization and the like:
```julia
julia> using LoopVectorization, BenchmarkTools, Test

julia> function AmulB!(C, A, B)
           @turbo for n = indices((C, B), 2), m = indices((C, A), 1)
               Cmn = zero(eltype(C))
               for k = indices((A, B), (2, 1))
                   Cmn += A[m, k] * B[k, n]
               end
               C[m, n] = Cmn
           end
       end

julia> M = K = N = 144; A = rand(Float32, M, K); B = rand(Float32, K, N); C0 = A * B; C1 = similar(C0);

julia> AmulB!(C1, A, B)

julia> @test C1 ≈ C0

julia> 2e-9 * M * K * N / @belapsed(AmulB!($C1, $A, $B))
96.12825754527164
```
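For anyone puzzling over the last line: a dense M×K×N matmul performs 2·M·K·N floating-point operations (one multiply and one add per inner-product term), so dividing that count by the elapsed time gives throughput, and the `2e-9` factor scales it to GFLOPS. The same bookkeeping in Python, using NumPy's matmul as the kernel (function name and trial count here are just illustrative):

```python
import time
import numpy as np

def matmul_gflops(M, K, N, trials=10):
    """Time C = A @ B and report throughput in GFLOPS.

    A dense matmul does 2*M*K*N flops: one multiply and one add
    for each of the K inner-product terms, per M*N output element.
    """
    A = np.random.rand(M, K).astype(np.float32)
    B = np.random.rand(K, N).astype(np.float32)
    best = float("inf")
    for _ in range(trials):
        t0 = time.perf_counter()
        A @ B
        best = min(best, time.perf_counter() - t0)  # keep fastest run
    return 2e-9 * M * K * N / best

print(matmul_gflops(144, 144, 144))
```

Taking the minimum over several trials, as `@belapsed` does, filters out timing noise from the first (warm-up) runs.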
I'm able to achieve 96 GFLOPS on a single core (Apple M1) or 103 GFLOPS on a single core (AMD EPYC 7502). And that's not even as good as what you can achieve using e.g. TVM to do the kind of scheduling exploration that Mojo purports to do.
Perhaps they have more extensive examples coming that showcase the capabilities further. I understand it's difficult to show all strengths of the entire system in a short demonstration video. :)
EDIT: As expected, there are significantly better benchmarks shown at https://www.modular.com/blog/the-worlds-fastest-unified-matr... so perhaps this whole discussion truly is just a matter of the demo not showcasing the true power of the system. Hopefully achieving those high performance numbers for sgemm is doable without too much ugly code.
Yeah, I think no one will have much of an edge on something as simple as matrix multiplication, since both languages support the right abstractions and both end up in LLVM code generation. Having Python 3 backwards compatibility, and being able to easily deploy your code to, say, phones via a C++ API, is quite big though.
It seems you are using N=144, whereas the Modular example uses N=1024, which makes the calculation in this Julia example significantly less computationally expensive.
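The size difference is easy to quantify: the flop count of a square matmul scales cubically, so going from 144 to 1024 per dimension multiplies the work by (1024/144)³, roughly 360×. A quick back-of-the-envelope check:

```python
def matmul_flops(M, K, N):
    # One multiply + one add per inner-product term.
    return 2 * M * K * N

small = matmul_flops(144, 144, 144)      # ~6.0e6 flops
large = matmul_flops(1024, 1024, 1024)   # ~2.1e9 flops
print(large / small)                     # ≈ 359.6
```

Note this only compares raw work; per-flop efficiency can also differ between the two sizes, since the 144³ problem fits in cache far more comfortably.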
I'm sure there are reasons for it, but Chris Lattner has been jumping around a bit. Remember Swift for TensorFlow?
But hopefully this one is seen through with lots of open source too.
but Mojo seems to keep all of the numpy cruft that's there because of Python (numeric computing was bolted onto Python via numpy, whereas it's built into Julia)
I'd love to see a comparison vs Julia though, which I think tried to tackle some of the same problems.