Since I’m on my phone, I can only try the 2D js versions, but the quadtree version does seem to provide some really obvious and nice frame rate improvements. It seems like good work, although I don’t know much about this field.
Sidenote:
I’m often too self-conscious to put anything technical online with my real name on it, probably a bit of imposter syndrome or something. I think based on
> For one of my programming classes this semester, my task was writing a program in Python. It was only an introductory course, so it didn’t have to be anything fancy.
the author is sort of early on in their career. I often suspect I’d be better off if I’d gotten over myself and thrown more early experiments online. I dunno, maybe this stuff just isn’t a big deal for some people, but I have to applaud the bravery.
This reminds me of a paper[0] I co-wrote during a college internship that involved simulating the evolution of a Brownian system of particles with production and annihilation rules as the evolution of probability density distributions for each particle type within a partitioned cubic periodic box. Because we were interested in how these particles segregated spatially, visualization was a matter of finding the boundaries separating each domain of particles. I made some fun little videos of the evolution of the system over time (which tracks very conspicuously the entropy production, plotted in the paper), but unfortunately, they've since been taken down from the institute's website where they were hosted. (The paper does contain some snapshots that went into these visualizations, however.)
Having an interest in the space for many years (on-and-off; life gets busy sometimes) - I've often lamented the lack of good pixel-level performance optimizations in graphics cards. Everything seems hell-bent on polygons.
Some years ago, DirectDraw on Windows was an excellent way to optimize the graphics portion of these types of things, but that all went away as the "3D Is The Way" mentality took over.
My explorations of Processing.* were neat, but lacked the IDE and library support I like as a developer.
The thing you describe as "per-pixel optimizations" is exactly what a fragment shader is? You don't need "a polygon" per se; you can just run said code over every pixel in the frame, for each frame. This is how simple screenspace effects are implemented: compositing, HDR, etc.
Interesting. While I've played with ShaderToy a bit, I've not gotten such a thing to run locally. Maybe I should revisit this..
I guess all I'd need would be an in-memory buffer of my field, then a single function to copy the buffer (or a section of the buffer, if I want larger-than-screen fields) to the video buffer.
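For reference, here's a hedged sketch of that buffer-then-blit pattern in Python with numpy and pygame (the window size, field contents, and names are all made up for illustration):

import numpy as np
import pygame

W, H = 640, 480
pygame.init()
screen = pygame.display.set_mode((W, H))

field = np.zeros((W, H), dtype=np.float32)  # the in-memory field, one value per pixel

running = True
while running:
    for event in pygame.event.get():
        if event.type == pygame.QUIT:
            running = False

    # ... update `field` here with whatever per-pixel calculation you like ...

    # map the field to 8-bit grayscale and copy the whole buffer to the window
    gray = (np.clip(field, 0.0, 1.0) * 255).astype(np.uint8)
    rgb = np.repeat(gray[:, :, None], 3, axis=2)
    pygame.surfarray.blit_array(screen, rgb)
    pygame.display.flip()

pygame.quit()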
There's no lack of "pixel-level performance optimisations" on GPUs, and just because basically the whole world seems to think that graphics programming == using some rasterisation library like OpenGL or DX or Vulkan or whatever for realtime applications, doesn't mean you're forced to use it or that that's all there is.
Just fire up OpenCL (or CUDA, if you must) and start executing some data-parallel kernels.
That's all a little beyond me at this point. But very compelling!
As the original simulation shows, the idea is to calculate the difference and proximity of the pixels. So the drawing part is just a projection...
Getting a memory buffer pointer and setting pixels within my calculation loop with simple offsets was very simple and compelling. I think I should look deeper into OpenCL/CUDA for this use-case, as I mentioned in the other response to my comment.
No problem, and I highly recommend using OpenCL to get started, it's actually pretty short and straightforward (lookin' at you, Vulkan compute!). So basically, the stuff in the for-loops you had before, that's the kernel you'll be writing.
I daresay it's actually easier to render things this way than getting lost in the weeds with API docs etc. A simple Mandelbrot hello-world example should be easy to understand and get compiling, and you can go from there.
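As a concrete, hedged sketch of the "your for-loop body becomes the kernel" idea, here's roughly what a minimal pyopencl program can look like (the kernel, the toy field computation, and all names are invented for illustration, not taken from the article):

import numpy as np
import pyopencl as cl

WIDTH, HEIGHT = 512, 512

KERNEL_SRC = """
__kernel void update_field(__global float *field, const float t)
{
    int x = get_global_id(0);
    int y = get_global_id(1);
    int w = get_global_size(0);
    /* one work-item per pixel: this is the body of the old nested loop */
    field[y * w + x] = sin(x * 0.02f + t) * cos(y * 0.02f + t);
}
"""

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)
prog = cl.Program(ctx, KERNEL_SRC).build()

field = np.empty((HEIGHT, WIDTH), dtype=np.float32)
field_buf = cl.Buffer(ctx, cl.mem_flags.WRITE_ONLY, field.nbytes)

prog.update_field(queue, (WIDTH, HEIGHT), None, field_buf, np.float32(0.0))
cl.enqueue_copy(queue, field, field_buf)  # copy the result back to host memory
print(field[:2, :4])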
Yep you can, but there's still an astronomical amount of crap you have to do that has nothing to do with pure compute. OpenCL has none of these problems, and is IMO just the perfect compute API.
The thing that strikes me as particularly weird about this is the use of numpy there. In other languages, they're using native code. In Python they're reaching out to numpy, which is a great library, but not awesome inside a hot loop unless you're keeping the operation you're carrying out within numpy itself. This means that, right in that hot loop, they're doing a lot of translating of numbers between the Python representation and the native C representation.
Cutting numpy out by making the line "t = [0.0]*l" gets it down to 17059 ms, without attempting any other optimisations. Using that plus PyPy (so you get a JIT, like you have with JavaScript/V8) gets it down to 958 ms.
To show the cost of those translations while sticking with numpy, we can switch that inner loop to:
It speeds us up from 47552 ms to 28251 ms. Almost half the execution time. That's still doing two hops back and forth, though. If you cut it down to a single line it's even faster at 18458 ms, cutting execution time down to about a third of the original example. PyPy isn't able to help here at all; this is sort of a pathological case for it.
edit: I'll add, I'm not that good with numpy, rarely use it myself. Not sure if it's possible to do that inner loop all within numpy somehow. I imagine that'd be a lot faster still.
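As a generic illustration of the boundary-crossing cost (a hypothetical loop, not the article's actual code): indexing a numpy array one scalar at a time crosses the Python/C boundary on every access, while a single whole-array expression stays inside numpy's compiled code.

import time
import numpy as np

n = 1_000_000
t = np.zeros(n)

start = time.perf_counter()
for i in range(n):            # one Python<->C round trip per element
    t[i] += 0.02 * i
print("python-level loop:", time.perf_counter() - start)

t2 = np.zeros(n)
start = time.perf_counter()
t2 += 0.02 * np.arange(n)     # one vectorized pass, entirely inside numpy
print("vectorized       :", time.perf_counter() - start)

assert np.allclose(t, t2)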
So rather than spend 10-20 minutes reading about numpy, the author wrote 3 other implementations...?
The fact that they ran the C code without the optimisation flags and compared it that way makes me think Javascript was what they actually wanted to write this one in anyway.
The author, by the looks of it from the article, is a uni student. Rather than straw-manning this into a language war, we should laud the fact that they managed to write the same thing in a number of different languages to begin with.
Negative again, a series of array operations which are individually idiomatic numpy like this will run very, very fast in numba, as it can coalesce the ops into a single pass through memory. Numpy can't do this and has to pass through the array for each individual array operation. There's nothing wrong with straight numpy, but if you want it compiled-C fast for the whole ensemble of array ops, you need a JIT.
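A minimal sketch of that fusion, assuming numba is installed (the function and the particular chain of array ops here are hypothetical):

import numpy as np
from numba import njit

@njit
def step(x):
    # written as three idiomatic array expressions; numba can compile them
    # into a single fused loop over x instead of one temporary per operation
    return np.sin(x) * 0.5 + np.cos(x) * 0.5 + x * 0.02

x = np.linspace(0.0, 10.0, 1_000_000)
step(x)            # first call pays the JIT compilation cost
result = step(x)   # later calls run the compiled, fused version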
Author mentions they didn't use optimization flags but doesn't include the compilation details. You can sort of guess that (relatively) unoptimized C might perform worse than V8's JIT on short, straight computational code - you're more or less testing two native code generators doing a simple thing except one has more optimizations enabled and wins.
It just says 'no optimization flags' and that's that, I don't think even the compiler is mentioned so the author is not giving you a lot to go on here - you don't even know what the default is.
A modern C optimizing compiler is going to go absolutely HAM on this sort of code snippet if you tell it to and do next to nothing if you don't - that's, roughly, the explanation for this little discrepancy.
Simply compiling it with -O3 produces something which completes in half the time of the JavaScript version (350ms for C, 750ms for JS), so perhaps that.
Edit for Twirrim: on this system (Ryzen 7, gcc 11): "-O3": 350ms; "-O3 -march=native": 208ms; "-O2": 998ms; "-O2 -march=native": 1040ms.
Edit 2: Interestingly, changing the C from float to double produces a 3.5x speedup, taking the time elapsed (with "-O3 -march=native") to 58ms, or about 12x faster than JS. This also makes what it's computing closer to the JavaScript version.
It is faster in just about every way: less memory, and even the CPU instructions (which are usually not the problem) are faster. There's something fucky going on with code gen here, or it could also simply be the measurement procedure doing something weird, like working with data or instruction caches that aren't properly cold or equally warmed up.
IMO, not using any optimization flags with C is somewhat arbitrary, since the compiler writers could have just decided that by default we'll do thing X, Y, and Z, and then you'd need to turn them off explicitly.
FWIW, without -O, with -O, and with -O4, I get 2500ms, 1500ms, and 550ms respectively. I didn't bother to look at the .S to see the code improvements. (Of course, I edited the code to output the results, otherwise, it just optimized out everything.)
One optimization for the C code is to put "f" suffixes on the floating point constants. For example convert this line:
t[i] += 0.02 * (float)j;
to:
t[i] += 0.02f * (float)j;
I believe this helps because 0.02 is a double and doing double * float and then converting the result to float can produce a different answer to just doing float * float. The compiler has to do the slow version because that's what you asked for.
Adding the -ffast-math switch appears to make no difference. I'm never sure what -ffast-math does exactly.
> I believe this helps because 0.02 is a double and [...] can produce a different answer
In principle, not quite. The real/unavoidable(-by-the-compiler) problem is that 0.02 is not a dyadic rational (not representable exactly as some integer over a power of two). So its representation (rounded to 52 bits) as a double is a different real number than its representation (rounded to 23 bits) as a float. (This is the same problem as rounding pi or e to a double/float, but people tend to forget that it applies to all non-dyadic rationals, not just irrationals.)
If, instead of `0.02f`, you replaced `0.02` with `(double)0.02f` or `0.015625`, the optimization should in theory still apply (although missed-optimization compiler bugs are of course possible).
I think this is because the optimization isn't safe. I wrote a program to find a counter example to your claim that "the optimization should in theory still apply". It found one. Here's the code:
#include <stdio.h>
#include <stdlib.h>

float mul_as_float(float t) {
    t += 0.02f * (float)17;
    return t;
}

float mul_as_double(float t) {
    t += (double)0.02f * (float)17;
    return t;
}

int main() {
    while (1) {
        unsigned r = rand();
        float t = *((float*)&r);
        float result1 = mul_as_float(t);
        float result2 = mul_as_double(t);
        if (result1 != result2) {
            printf("Counter example when t is %f (0x%x)\n", t, *((unsigned*)&t));
            printf("result1 is %f (0x%x)\n", result1, *((unsigned*)&result1));
            printf("result2 is %f (0x%x)\n", result2, *((unsigned*)&result2));
            return 0;
        }
    }
}
It outputs:
Counter example when t is 0.000000 (0x3477d43f)
result1 is 0.340000 (0x3eae1483)
result2 is 0.340000 (0x3eae1482)
On my machine, the compiler constant-folds the multiplication, producing a single-precision add for `mul_as_float` and a convert-t-to-double, double-precision-add, convert-sum-to-single sequence for `mul_as_double`. I missed the `+=` in your original comment, but adding a float to a double does implicitly promote it like that, so you'd actually need:
t += (float)((double)0.02f * (float)17);
to achieve the "and then converting the result [of the multiplication] to float" (rather than keeping it a double for the addition) from your original comment. (With the above line in mul_as_double, your test code no longer finds a counterexample, at least when I ran it.)
If you ask for higher-precision intermediates, even implicitly, floating-point compilers will typically give them to you, hoped-for efficiency of single-precision be damned.
Me too when I am away from C for a while.
The topic has been discussed on HN [3]. Among other things, -ffast-math will:
* Enable the use of SIMD instructions
* Alter the behavior regarding NaN (you can't even check for NaN afterwards with isnan(f))
* Alter the associativity of expressions: a+(b+c) might become (a+b)+c, which seems innocuous at first, but there are exceptions (for an example, see [1] under -fassociative-math)
* Flush subnormals to zero (this can happen even if your program isn't compiled with this option, as long as a library linked into your program is)
A nice overview, from which I summarize the above, is in [1], which contains a link to [2] with this nice text:
"If a sufficiently advanced compiler is indistinguishable from an adversary, then giving the compiler access to -ffast-math is gifting that enemy nukes. That doesn’t mean you can’t use it! You just have to test enough to gain confidence that no bombs go off with your compiler on your system"
Should also test -Os when doing this sort of thing. Sometimes the reduced size greatly improves cache behavior, and even when not it's often outright competitive with at least -O2 anyway (usually compiles faster too!)
This is fairly trivial stuff with underwhelming results. 4 million particles isn't much on modern hardware, even with spatial partitioning. The hardest part of spatial partitioning is probably the one-dimensional partition algorithm in the first place, which can be found here:
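The linked routine isn't reproduced here, but a rough, hypothetical sketch of such a one-dimensional partition (bin positions into uniform cells, then group particle indices by cell) might look like:

import numpy as np

def partition_1d(positions, cell_size, n_cells):
    # Map each position to a cell index, then sort particle indices by cell so
    # every cell's members end up contiguous; cell_start gives the slice bounds.
    cells = np.clip((positions / cell_size).astype(np.int64), 0, n_cells - 1)
    order = np.argsort(cells, kind="stable")
    counts = np.bincount(cells, minlength=n_cells)
    cell_start = np.concatenate(([0], np.cumsum(counts)))
    return order, cell_start

positions = np.random.rand(4_000_000).astype(np.float32)   # 4M particles in [0, 1)
order, start = partition_1d(positions, cell_size=1 / 64, n_cells=64)
# particles in cell c: positions[order[start[c]:start[c + 1]]]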
It's not rocket science, but "trivial" is very harsh. This stuff is fiddly to get right. The author (who I assume is a student) seems to have done a good job and it's a nice writeup.
I guess you could argue that, but it's strange that people want to give a student's first project (with a lot of huge mistakes like thinking node is going to outperform C because they didn't turn optimizations on) so much interest. I think people are taken in by big numbers and assume there is something cutting edge because they don't know any better.
The results are just some particles in an extremely basic pattern, there isn't a lot of payoff.
One point needs correcting though - nowhere in the article do I say that node is outperforming C; on the contrary, I stressed that I didn't use any flags so that people wouldn't come to the conclusion that C is always slower, and I explicitly mentioned not to get fixated on such benchmarks.
What it was meant to show is that V8 is *good enough* to even consider it for the job :)
You benchmarked three different programs and one was compiled to be slow. Why give times for a debug build when the other two aren't? Showing off performance metrics that are both apples to oranges while also giving times for something not made to run fast is total nonsense.
Do you really not think this is a mistake? Most people would see C getting outperformed and realize there is something wrong. I'm shocked anyone would both do this and then try to rationalize it after.
If you read carefully you might see that I'm criticizing rationalizing a bizarre mistake as something that is completely fine.
Also if someone wants to put their first CS project on the internet, then label it that way and not as some sort of blog post about them doing something unique.