Since I’m on my phone, I can only try the 2D js versions, but the quadtree version does seem to provide some really obvious and nice frame rate improvements. It seems like good work, although I don’t know much about this field.
Sidenote:
I’m often too self-conscious to put anything technical online with my real name on it, probably a bit of imposter syndrome or something. I think based on
> For one of my programming classes this semester, my task was writing a program in Python. It was only an introductory course, so it didn’t have to be anything fancy.
the author is sort of early on in their career. I often suspect I’d be better off if I’d gotten over myself and thrown more early experiments online. I dunno, maybe this stuff just isn’t a big deal for some people, but I have to applaud the bravery.
This reminds me of a paper[0] I co-wrote during a college internship that involved simulating the evolution of a Brownian system of particles with production and annihilation rules as the evolution of probability density distributions for each particle type within a partitioned cubic periodic box. Because we were interested in how these particles segregated spatially, visualization was a matter of finding the boundaries separating each domain of particles. I made some fun little videos of the evolution of the system over time (which tracks very conspicuously the entropy production, plotted in the paper), but unfortunately, they've since been taken down from the institute's website where they were hosted. (The paper does contain some snapshots that went into these visualizations, however.)
Having an interest in the space for many years (on-and-off; life gets busy sometimes) - I've often lamented the lack of good pixel-level performance optimizations in graphics cards. Everything seems hell-bent on polygons.
Some years ago, DirectDraw on Windows was an excellent way to optimize the graphics portion of these types of things, but that all went away as the "3D Is The Way" mentality took over.
My explorations of Processing.* were neat, but lacked the IDE and library support I like as a developer.
The thing you describe as "per-pixel optimizations" is exactly what a fragment shader is? You don't need "a polygon" per se; you can just run said code over every pixel in the frame, for each frame. This is how simple screenspace effects are implemented: compositing, HDR, etc.
Interesting. While I've played with ShaderToy a bit, I've not gotten such a thing to run locally. Maybe I should revisit this..
I guess all I'd need would be an in-memory buffer of my field, then a single function to copy the buffer (or a section of the buffer, if I want larger-than-screen fields) to the video buffer.
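For reference, here's a hedged sketch of that buffer-then-blit pattern in Python with numpy and pygame (the window size, field contents, and names are all made up for illustration):

import numpy as np
import pygame

W, H = 640, 480
pygame.init()
screen = pygame.display.set_mode((W, H))

field = np.zeros((W, H), dtype=np.float32)  # the in-memory field, one value per pixel

running = True
while running:
    for event in pygame.event.get():
        if event.type == pygame.QUIT:
            running = False

    # ... update `field` here with whatever per-pixel calculation you like ...

    # map the field to 8-bit grayscale and copy the whole buffer to the window
    gray = (np.clip(field, 0.0, 1.0) * 255).astype(np.uint8)
    rgb = np.repeat(gray[:, :, None], 3, axis=2)
    pygame.surfarray.blit_array(screen, rgb)
    pygame.display.flip()

pygame.quit()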
There's no lack of "pixel-level performance optimisations" on GPUs, and just because basically the whole world seems to think that graphics programming == using some rasterisation library like OpenGL or DX or Vulkan or whatever for realtime applications, doesn't mean you're forced to use it or that that's all there is.
Just fire up OpenCL (or CUDA, if you must) and start executing some data-parallel kernels.
That's all a little beyond me at this point. But very compelling!
As the original simulation shows, the idea is to calculate the difference and proximity of the pixels. So the drawing part is just a projection...
Getting a memory buffer pointer and setting pixels within my calculation loop with simple offsets was very simple and compelling. I think I should look deeper into OpenCL/CUDA for this use-case, as I mentioned in the other response to my comment.
No problem, and I highly recommend using OpenCL to get started, it's actually pretty short and straightforward (lookin' at you, Vulkan compute!). So basically, the stuff in the for-loops you had before, that's the kernel you'll be writing.
I daresay it's actually easier to render things this way than getting lost in the weeds with API docs etc. A simple Mandelbrot hello-world example should be easy to understand and get compiling, and you can go from there.
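As a concrete, hedged sketch of the "your for-loop body becomes the kernel" idea, here's roughly what a minimal pyopencl program can look like (the kernel, the toy field computation, and all names are invented for illustration, not taken from the article):

import numpy as np
import pyopencl as cl

WIDTH, HEIGHT = 512, 512

KERNEL_SRC = """
__kernel void update_field(__global float *field, const float t)
{
    int x = get_global_id(0);
    int y = get_global_id(1);
    int w = get_global_size(0);
    /* one work-item per pixel: this is the body of the old nested loop */
    field[y * w + x] = sin(x * 0.02f + t) * cos(y * 0.02f + t);
}
"""

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)
prog = cl.Program(ctx, KERNEL_SRC).build()

field = np.empty((HEIGHT, WIDTH), dtype=np.float32)
field_buf = cl.Buffer(ctx, cl.mem_flags.WRITE_ONLY, field.nbytes)

prog.update_field(queue, (WIDTH, HEIGHT), None, field_buf, np.float32(0.0))
cl.enqueue_copy(queue, field, field_buf)  # copy the result back to host memory
print(field[:2, :4])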
Yep you can, but there's still an astronomical amount of crap you have to do that has nothing to do with pure compute. OpenCL has none of these problems, and is IMO just the perfect compute API.
The thing that strikes me as particularly weird about this is the use of numpy there. In other languages, they're using native code. In Python they're reaching out to numpy, which is a great library, but not awesome inside a hot loop unless you're keeping the operation you're carrying out within numpy itself. This means that, right in that hot loop, they're doing a lot of translating of numbers between the Python representation and the native C representation.
Cutting numpy out by making the line "t = [0.0]*l" gets it down to 17059 ms, without attempting any other optimisations. Using that plus PyPy (so you get a JIT, like you have with JavaScript/V8) gets it down to 958 ms.
To show the cost of those translations while sticking with numpy, we can switch that inner loop to:
It speeds us up from 47552 ms to 28251 ms. Almost half the execution time. That's still doing two hops back and forth, though. If you cut it down to a single line it's even faster at 18458 ms, cutting execution time down to about a third of the original example. PyPy isn't able to help here at all; this is sort of a pathological case for it.
edit: I'll add, I'm not that good with numpy, rarely use it myself. Not sure if it's possible to do that inner loop all within numpy somehow. I imagine that'd be a lot faster still.
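As a generic illustration of the boundary-crossing cost (a hypothetical loop, not the article's actual code): indexing a numpy array one scalar at a time crosses the Python/C boundary on every access, while a single whole-array expression stays inside numpy's compiled code.

import time
import numpy as np

n = 1_000_000
t = np.zeros(n)

start = time.perf_counter()
for i in range(n):            # one Python<->C round trip per element
    t[i] += 0.02 * i
print("python-level loop:", time.perf_counter() - start)

t2 = np.zeros(n)
start = time.perf_counter()
t2 += 0.02 * np.arange(n)     # one vectorized pass, entirely inside numpy
print("vectorized       :", time.perf_counter() - start)

assert np.allclose(t, t2)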
So rather than spend 10-20 minutes reading about numpy, the author wrote 3 other implementations...?
The fact that they ran the C code without the optimisation flags and compared it that way makes me think Javascript was what they actually wanted to write this one in anyway.
The author, by the looks of it from the article, is a uni student. Rather than straw-manning this into a language war, we should laud the fact that they managed to write the same thing in a number of different languages to begin with.
Negative again, a series of array operations which are individually idiomatic numpy like this will run very, very fast in numba, as it can coalesce the ops into a single pass through memory. Numpy can't do this and has to pass through the array for each individual array operation. There's nothing wrong with straight numpy, but if you want it compiled-C fast for the whole ensemble of array ops, you need a JIT.
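A minimal sketch of that fusion, assuming numba is installed (the function and the particular chain of array ops here are hypothetical):

import numpy as np
from numba import njit

@njit
def step(x):
    # written as three idiomatic array expressions; numba can compile them
    # into a single fused loop over x instead of one temporary per operation
    return np.sin(x) * 0.5 + np.cos(x) * 0.5 + x * 0.02

x = np.linspace(0.0, 10.0, 1_000_000)
step(x)            # first call pays the JIT compilation cost
result = step(x)   # later calls run the compiled, fused version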
Author mentions they didn't use optimization flags but doesn't include the compilation details. You can sort of guess that (relatively) unoptimized C might perform worse than V8's JIT on short, straight computational code - you're more or less testing two native code generators doing a simple thing except one has more optimizations enabled and wins.
It just says 'no optimization flags' and that's that, I don't think even the compiler is mentioned so the author is not giving you a lot to go on here - you don't even know what the default is.
A modern C optimizing compiler is going to go absolutely HAM on this sort of code snippet if you tell it to and do next to nothing if you don't - that's, roughly, the explanation for this little discrepancy.
Simply compiling it with -O3 produces something which completes in half the time of the JavaScript version (350ms for C, 750ms for JS), so perhaps that.
Edit for Twirrim: on this system (Ryzen 7, gcc 11): "-O3": 350ms; "-O3 -march=native": 208ms; "-O2": 998ms; "-O2 -march=native": 1040ms.
Edit 2: Interestingly, changing the C from float to double produces a 3.5x speedup, taking the time elapsed (with "-O3 -march=native") to 58ms, or about 12x faster than JS. This also makes what it's computing closer to the JavaScript version.
It is faster in just about every way: less memory, and even the CPU instructions (which are usually not the problem) are faster. There's something fucky going on with code gen here, or it could also simply be the measurement procedure doing something weird, like working with data or instruction caches that aren't properly cold or equally warmed up.
IMO, not using any optimization flags with C is somewhat arbitrary, since the compiler writers could have just decided that by default we'll do thing X, Y, and Z, and then you'd need to turn them off explicitly.
FWIW, without -O, with -O, and with -O4, I get 2500ms, 1500ms, and 550ms respectively. I didn't bother to look at the .S to see the code improvements. (Of course, I edited the code to output the results, otherwise, it just optimized out everything.)
One optimization for the C code is to put "f" suffixes on the floating point constants. For example convert this line:
t[i] += 0.02 * (float)j;
to:
t[i] += 0.02f * (float)j;
I believe this helps because 0.02 is a double and doing double * float and then converting the result to float can produce a different answer to just doing float * float. The compiler has to do the slow version because that's what you asked for.
Adding the -ffast-math switch appears to make no difference. I'm never sure what -ffast-math does exactly.
> I believe this helps because 0.02 is a double and [...] can produce a different answer
In principle, not quite. The real/unavoidable(-by-the-compiler) problem is that 0.02 is not a dyadic rational (not representable exactly as some integer over a power of two). So its representation (rounded to 52 bits) as a double is a different real number than its representation (rounded to 23 bits) as a float. (This is the same problem as rounding pi or e to a double/float, but people tend to forget that it applies to all non-dyadic rationals, not just irrationals.)
If, instead of `0.02f`, you replaced `0.02` with `(double)0.02f` or `0.015625`, the optimization should in theory still apply (although missed-optimization compiler bugs are of course possible).
I think this is because the optimization isn't safe. I wrote a program to find a counter example to your claim that "the optimization should in theory still apply". It found one. Here's the code:
#include <stdio.h>
#include <stdlib.h>

float mul_as_float(float t) {
    t += 0.02f * (float)17;
    return t;
}

float mul_as_double(float t) {
    t += (double)0.02f * (float)17;
    return t;
}

int main() {
    while (1) {
        unsigned r = rand();
        float t = *((float*)&r);
        float result1 = mul_as_float(t);
        float result2 = mul_as_double(t);
        if (result1 != result2) {
            printf("Counter example when t is %f (0x%x)\n", t, *((unsigned*)&t));
            printf("result1 is %f (0x%x)\n", result1, *((unsigned*)&result1));
            printf("result2 is %f (0x%x)\n", result2, *((unsigned*)&result2));
            return 0;
        }
    }
}
It outputs:
Counter example when t is 0.000000 (0x3477d43f)
result1 is 0.340000 (0x3eae1483)
result2 is 0.340000 (0x3eae1482)
On my machine, the compiler constant-folds the multiplication, producing a single-precision add for `mul_as_float` and a convert-t-to-double, double-precision-add, convert-sum-to-single sequence for `mul_as_double`. I missed the `+=` in your original comment, but adding a float to a double does implicitly promote it like that, so you'd actually need:
t += (float)((double)0.02f * (float)17);
to achieve the "and then converting the result [of the multiplication] to float" (rather than keeping it a double for the addition) from your original comment. (With the above line in mul_as_double, your test code no longer finds a counterexample, at least when I ran it.)
If you ask for higher-precision intermediates, even implicitly, floating-point compilers will typically give them to you, hoped-for efficiency of single-precision be damned.
Me too when I am away from C for a while.
The topic has been discussed on HN [3]. Among other things, -ffast-math will:
* Enable the use of SIMD instructions
* Alter the behavior regarding NaN (you can't even check for NaN afterwards with isnan(f))
* Alter the associativity of expressions: a+(b+c) might become (a+b)+c, which seems innocuous at first, but there are exceptions (for an example, see [1] under -fassociative-math)
* Flush subnormals to zero (this can happen even if your program isn't compiled with this option, as long as a library linked into your program is)
A nice overview, from which I summarize the above, is in [1], which contains a link to [2] with this nice text:
"If a sufficiently advanced compiler is indistinguishable from an adversary, then giving the compiler access to -ffast-math is gifting that enemy nukes. That doesn’t mean you can’t use it! You just have to test enough to gain confidence that no bombs go off with your compiler on your system"
Should also test -Os when doing this sort of thing. Sometimes the reduced size greatly improves cache behavior, and even when not it's often outright competitive with at least -O2 anyway (usually compiles faster too!)
This is fairly trivial stuff with underwhelming results. 4 million particles isn't much on modern hardware, even with spatial partitioning. The hardest part of spatial partitioning is probably the one-dimensional partition algorithm in the first place, which can be found here:
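The linked routine isn't reproduced here, but a rough, hypothetical sketch of such a one-dimensional partition (bin positions into uniform cells, then group particle indices by cell) might look like:

import numpy as np

def partition_1d(positions, cell_size, n_cells):
    # Map each position to a cell index, then sort particle indices by cell so
    # every cell's members end up contiguous; cell_start gives the slice bounds.
    cells = np.clip((positions / cell_size).astype(np.int64), 0, n_cells - 1)
    order = np.argsort(cells, kind="stable")
    counts = np.bincount(cells, minlength=n_cells)
    cell_start = np.concatenate(([0], np.cumsum(counts)))
    return order, cell_start

positions = np.random.rand(4_000_000).astype(np.float32)   # 4M particles in [0, 1)
order, start = partition_1d(positions, cell_size=1 / 64, n_cells=64)
# particles in cell c: positions[order[start[c]:start[c + 1]]]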
It's not rocket science, but "trivial" is very harsh. This stuff is fiddly to get right. The author (who I assume is a student) seems to have done a good job and it's a nice writeup.
I guess you could argue that, but it's strange that people want to give a student's first project (with a lot of huge mistakes like thinking node is going to outperform C because they didn't turn optimizations on) so much interest. I think people are taken in by big numbers and assume there is something cutting edge because they don't know any better.
The results are just some particles in an extremely basic pattern, there isn't a lot of payoff.
One point needs correcting though - nowhere in the article do I say that node is outperforming C; on the contrary, I stressed that I didn't use any flags so that people wouldn't come to the conclusion that C is always slower, and I explicitly mentioned not to get fixated on such benchmarks.
What it was meant to show is that V8 is *good enough* to even consider it for the job :)
You benchmarked three different programs and one was compiled to be slow. Why give times for a debug build when the other two aren't? Showing off performance metrics that are both apples to oranges while also giving times for something not made to run fast is total nonsense.
Do you really not think this is a mistake? Most people would see C getting outperformed and realize there is something wrong. I'm shocked anyone would both do this and then try to rationalize it after.
If you read carefully you might see that I'm criticizing rationalizing a bizarre mistake as something that is completely fine.
Also if someone wants to put their first CS project on the internet, then label it that way and not as some sort of blog post about them doing something unique.