There is no implementation of it, but this is how I see it, at least comparing with how Vulkan behaves with all the modern extensions enabled, which uses a few similar mechanics.
Per-drawcall cost goes to nanosecond scale (assuming you do drawcalls at all, of course). It also makes bindless and indirect rendering a bit easier, so you could drop CPU cost to near zero in a renderer.
It would also greatly mitigate shader compiler hitches thanks to having a split pipeline instead of a monolithic one.
The simplification of barriers could improve performance a significant amount: currently, most engines that deal with Vulkan and DX12 need to keep track of individual texture layouts and transitions, and this removes that bookkeeping entirely.
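To make the barrier point concrete, here is a rough sketch (in Vulkan 1.3 synchronization2 terms, with hypothetical engine-side names like TrackedImage) of the per-resource layout bookkeeping that engines carry around today and that this proposal would delete:

    #include <vulkan/vulkan.h>

    // Hypothetical engine-side wrapper: every image drags its last-known layout around.
    struct TrackedImage {
        VkImage       image;
        VkImageLayout current_layout = VK_IMAGE_LAYOUT_UNDEFINED;
    };

    // Emit a layout transition only if the cached layout doesn't match the next use.
    void transition_if_needed(VkCommandBuffer cmd, TrackedImage& img, VkImageLayout wanted,
                              VkPipelineStageFlags2 dst_stage, VkAccessFlags2 dst_access) {
        if (img.current_layout == wanted) return;

        VkImageMemoryBarrier2 barrier{};
        barrier.sType         = VK_STRUCTURE_TYPE_IMAGE_MEMORY_BARRIER_2;
        barrier.srcStageMask  = VK_PIPELINE_STAGE_2_ALL_COMMANDS_BIT; // conservative source scope
        barrier.srcAccessMask = VK_ACCESS_2_MEMORY_WRITE_BIT;
        barrier.dstStageMask  = dst_stage;
        barrier.dstAccessMask = dst_access;
        barrier.oldLayout     = img.current_layout;
        barrier.newLayout     = wanted;
        barrier.image         = img.image;
        barrier.subresourceRange = { VK_IMAGE_ASPECT_COLOR_BIT, 0, 1, 0, 1 };

        VkDependencyInfo dep{};
        dep.sType                   = VK_STRUCTURE_TYPE_DEPENDENCY_INFO;
        dep.imageMemoryBarrierCount = 1;
        dep.pImageMemoryBarriers    = &barrier;
        vkCmdPipelineBarrier2(cmd, &dep);

        img.current_layout = wanted;
    }

Multiply that by every render target, every queue and every alias, and you get the state-tracking layer most engines maintain today.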
This is a fantastic article that demonstrates how many parts of Vulkan and DX12 are no longer needed.
I hope the IHVs have a look at it. Current DX12 seems semi-abandoned, with no support for buffer pointers even though every GPU made in the last 10 (or more!) years can handle pointers just fine. Meanwhile Vulkan won't do a 2.0 release that cleans things up, so it carries a lot of baggage and, especially, tons of drivers that don't implement the extensions that really improve things.
If this API existed, you could emulate OpenGL on top of it faster than the current OpenGL-to-Vulkan layers, and something like SDL3 GPU would get a 3x/4x boost too.
DirectX documentation is in a bad state currently: you have Frank Luna's books, which don't cover the latest improvements, and beyond that it's hunting through Learn, GitHub samples and the reference docs.
Vulkan is another mess. Even if there were a 2.0, how are devs supposed to actually use it, especially on Android, the biggest consumer Vulkan platform?
Isn't this all because PCI resizable BAR is not required to run any GPU besides Intel Arc? As in, maybe it's mostly down to Microsoft/Intel mandating reBAR in UEFI so we can start using stuff like bindless textures without thousands of support tickets and negative reviews.
I think this puts a floor on supported hardware though, like Nvidia 30xx and Radeon 5xxx. And of course motherboard support is a crapshoot until 2020 or so.
This is not really about resizable BAR directly; you could build mostly the same API without it. Resizable BAR simplifies things a little because you can skip manual transfer operations, but it's not strictly required: you can write into a CPU-writable buffer and then begin your frame with a transfer command.
Bindless textures never needed any kind of resizable BAR; you have been able to use them since the early 2010s in OpenGL through an extension (GL_ARB_bindless_texture). Buffer pointers have never needed it either.
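A minimal sketch of that "CPU-writable buffer + transfer at frame start" path, assuming the buffers and the persistently mapped pointer come from your allocator (the names here are illustrative, not any particular engine's API):

    #include <vulkan/vulkan.h>
    #include <cstring>

    // Write into a host-visible staging buffer, then record a copy into the
    // device-local buffer at the start of the frame. No resizable BAR required.
    void upload_without_rebar(VkCommandBuffer frame_cmd,
                              void* staging_mapped,   // persistently mapped host-visible memory
                              VkBuffer staging, VkBuffer gpu_local,
                              const void* data, VkDeviceSize size) {
        std::memcpy(staging_mapped, data, size);   // CPU write

        VkBufferCopy region{};
        region.srcOffset = 0;
        region.dstOffset = 0;
        region.size      = size;
        vkCmdCopyBuffer(frame_cmd, staging, gpu_local, 1, &region); // GPU-side transfer
    }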
> tons of drivers that don't implement the extensions that really improve things.
This isn't really the case, at least on the desktop side.
All three desktop GPU vendors support Vulkan 1.4 (or most of the features via extensions) on all major platforms even on really old hardware (e.g. Intel Skylake is 10+ years old and has all the latest Vulkan features). Even Apple + MoltenVK is pretty good.
Even mobile GPU vendors have pretty good support in their latest drivers.
The biggest issue is that Android consumer devices don't get GPU driver updates so they're not available to the general public.
Neither do laptops, where not using the driver from the OEM, with whatever custom code they added, can lead to interesting experiences: power configuration going bad, not being able to handle mixed GPU setups, and so on.
Those requirements more or less line up with the introduction of hardware raytracing, and some major titles are already treating that as a hard requirement, like the recent Doom and Indiana Jones games.
On the contrary, I would say this is the main thing Vulkan got wrong and the main reason why the API is so bad. Desktop and mobile are way too different for a uniform rendering API. They should be two different flavours with a common denominator. OpenGL and OpenGL ES were much better in that regard.
It is unreasonable to expect to run the same graphics code on desktop GPUs and mobile ones: mobile applications have to render something less expensive that doesn't exceed the limited capabilities of a low-power device with slow memory.
The different, separate engine variants for mobile and desktop users, on the other hand, can be based on the same graphics API; they'll just use different features from it in addition to having different algorithms and architecture.
> they'll just use different features from it in addition to having different algorithms and architecture.
...so you'll have different code paths for desktop and mobile anyway. The same can be achieved with a Vulkan vs VulkanES split which would overlap for maybe 50..70% of the core API, but significantly differ in the rest (like resource binding).
But they don't actually differ, see the "no graphics API" blog post we're all commenting on :) The primary difference between mobile & desktop is performance, not feature set (ignoring for a minute the problem of outdated drivers).
And beyond that if you look at historical trends, mobile is and always has been just "desktop from 5-7 years ago". An API split that makes sense now will stop making sense rather quickly.
Different features/architecture is precisely the issue with mobile, be it due to hardware constraints or due to lack of driver support. Render passes were only bolted onto Vulkan because of mobile tiler GPUs; they never made any sense for desktop GPUs and only made Vulkan worse for desktop graphics development.
And this is the reason why mobile and desktop should be separate graphics APIs. Mobile is holding desktop back not just feature-wise; it also fucks up the API.
Mobile is getting RT, fyi. Apple already has it (for a few generations, at least), I think Qualcomm does as well (I'm less familiar with their stuff, because they've been behind the game forever, however the last I've read, their latest stuff has it), and things are rapidly improving.
Vulkan is the actual barrier. On Windows, DirectX does an average job at supporting it. Microsoft doesn't really innovate these days, so NVIDIA largely drives the market, and sometimes AMD pitches in.
Eh, I think the jury is still out on whether unifying desktop and mobile graphics APIs is really worth it. In practice Vulkan written to take full advantage of desktop GPUs is wildly incompatible with most mobile GPUs, so there's fragmentation between them regardless.
It's quite useful for things like skia or piet-gpu/vello or the general category of "things that use the GPU that aren't games" (image/video editors, effects pipelines, compute, etc etc etc)
would it also apply to stuff like the Switch, and relatively high-end "mobile" gaming in general? (I'm not sure what those chips actually look like tho)
there are also some arm laptops that just run Qualcomm chips, the same as some phones (tablets with a keyboard, basically, but a bit more "PC"-like due to running Windows).
AFAICT the fusion seems likely to be an accurate prediction.
Switch has its own API. The GPU also doesn't have limitations you'd associate with "mobile". In terms of architecture, it's a full desktop GPU with desktop-class features.
Well, it's a desktop GPU with desktop-class features from 2014, which makes it quite outdated relative to current mobile GPUs. The just-released Switch 2 uses an Ampere-based GPU, which means it's desktop-class for 2020 (RTX 3xxx series). That's nothing to scoff at, but "desktop-class features" is a rapidly moving target, and the Switch ends up a lot closer to mobile than to desktop since it always launches with GPUs that are ~2 generations old.
Only if you're ignoring mobile entirely. One of the things Vulkan did which would be a shame to lose is it unified desktop and mobile GPU APIs.
In this context, both the old Switch and the Switch 2 have full desktop-class GPUs. They don't need to care about the API problems that mobile vendors imposed on Vulkan.
I feel like it's a win by default.
I do like to write my own programs every now and then and recently there's been more and more graphics sprinkled into them.
Being able to reuse those components and just render onto a target without changing anything else seems to be very useful here.
This kind of seamless interoperability between platforms is very desirable in my book.
I can't think of a better approach to achieve this than the graphics API itself.
Also, there is nothing inherent that blocks extensions by default.
I feel like a reasonable core that can optionally do more things similar to CPU extensions (i.e. vector extensions) could be the way to go here.
I definitely disagree here. What matters for mobile is power consumption. Capabilities can be pretty easily implemented...if you disagree, ask Apple. They have seemingly nailed it (with a few unrelated limitations).
Mobile vendors insisting on using closed, proprietary drivers that they refuse to constantly update/stay on top of is the actual issue. If you have a GPU capable of cutting edge graphics, you have to have a top notch driver stack. Nobody gets this right except AMD and NVIDIA (and both have their flaws). Apple doesn't even come close, and they are ahead of everyone else except AMD/NVIDIA. AMD seems to do it the best, NVIDIA, a distant second, Apple 3rd, and everyone else 10th.
> If you have a GPU capable of cutting edge graphics, you have to have a top notch driver stack. Nobody gets this right except AMD and NVIDIA (and both have their flaws). Apple doesn't even come close, and they are ahead of everyone else except AMD/NVIDIA. AMD seems to do it the best, NVIDIA, a distant second, Apple 3rd, and everyone else 10th.
It is quite telling how good their iGPUs are at 3D that no one counts them in.
I remember there was a time, about 15 years ago, when they were famous for reporting OpenGL capabilities as supported when they were actually only available via software rendering, which defeated the purpose of using such features in the first place.
APUs/iGPUs should be compared with each other, and here Intel's integrated GPUs seem to be very competitive with AMD's APUs.
---
You of course have to compare dedicated graphics cards with each other, and similarly for integrated GPUs, so let's compare (Intel's) dedicated GPUs (Intel Arc), too:
The current Intel Arc generation (Intel-Arc-B, "Battlemage") seems to be competitive with entry-level GPUs from NVIDIA and AMD, i.e. you can get much more powerful GPUs from NVIDIA and AMD, but at a much higher price. I thus clearly would not call Intel's dedicated GPUs so bad "at 3D that no one counts them in".
It's weird how the 'next-gen' APIs have turned out to be failures in many ways, imo. A sizeable amount of graphics devs are still stuck in the old way of doing things. I know a couple of graphics wizards (who work on major AAA titles) who never liked Vulkan/DX12, and many engines haven't really been rebuilt to accommodate the 'new' way of doing graphics.
Ironically, a lot of the time these new APIs end up being slower in practice (something confirmed by gaming benchmarks), probably exactly because of the issues outlined in the article: having precompiled 'pipeline states' instead of the good ol' state machine has forced devs to precompile a truly staggering number of states, and even then compilation can still occur at runtime, leading to the well-known stutters.
The other issue is synchronization. As the article mentions, Vulkan synchronization is unnecessarily heavy, and devs aren't really experts (or don't have the time) to figure out when to use which kind of barrier, so they adopt a 'better safe than sorry' approach, leading to unnecessary flushes and pipeline stalls that can tank performance in real-life workloads.
This is a huge issue combined with the API complexity, leading many devs to use wrappers like the aforementioned SDL3, which is definitely very conservative when it comes to synchronization.
With the old APIs, smart drivers could either figure this out better, or GPU driver devs would look at the workloads and manually patch up rendering for popular titles.
Additionally, by the early-to-mid 2010s, when these new APIs started getting released, a lot of crafty devs, together with new shader models and OpenGL extensions, made it possible to render tens of thousands of varied and interesting objects, essentially a whole scene's worth, in a single draw call. The most sophisticated and complex of these approaches was AZDO (Approaching Zero Driver Overhead), which I'm not sure actually made it into released games, but even with much less sophisticated approaches (combined with ideas like PBR materials and deferred rendering) you could pretty much draw anything.
This meant much of the perf bottleneck of the old APIs disappeared.
I think the big issue is that there is no 'next-gen API'. Microsoft has largely abandoned DirectX, Vulkan is restrictive as anything, Metal isn't changing much beyond matching DX/Vk, and NVIDIA/AMD/Apple/Qualcomm aren't interested in (re)-inventing the wheel.
There are some interesting GPU improvements coming down the pipeline, like a possible OoO part from AMD (if certain credible leaks are valid), however, crickets from Microsoft, and NVIDIA just wants vendor lock-in.
Yes, we need a vastly simpler API. I'd argue even simpler than the one proposed.
One of my biggest hopes for RT is that it will standardize like 80% of stuff to the point where it can be abstracted to libraries. It probably won't happen, but one can wish...
The modern CryEngine compiles very fast. Their trick is that they have architected everything to go through interfaces living in very thin headers, so their headers end up very light and they don't compile the class properties over and over. But it's a shame we need tricks like this for compile speed, as they harm runtime performance.
You now need to go through virtual calls on functions that don't really need to be virtual, which means a possible cache miss from loading the function pointer out of the vtable, plus the impossibility of inlining. For example, they have an ITexture interface with a function like virtual GetSize(). If it weren't all behind virtuals, that size would just be a vec2 in the class, and reading it would be a simple load that gets inlined.
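A rough illustration of the trade-off (ITexture/GetSize are the example from above; the rest is a made-up sketch): the interface version pays an indirect call and usually can't be inlined, while the concrete version boils down to a member load.

    struct Vec2 { float x, y; };

    // "Thin header" style: callers only ever see an abstract interface.
    struct ITexture {
        virtual ~ITexture() = default;
        virtual Vec2 GetSize() const = 0;   // indirect call through the vtable
    };

    // Concrete style: the size is plain data, so GetSize() inlines to a load.
    struct Texture {
        Vec2 size;
        Vec2 GetSize() const { return size; }
    };

    float AreaViaInterface(const ITexture& t) { return t.GetSize().x * t.GetSize().y; }
    float AreaDirect(const Texture& t)        { return t.GetSize().x * t.GetSize().y; }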
At least on Clang with (bitcode) LTO, it should be possible to devirtualize that, assuming most of those interfaces only have a single implementation.
In my experience, as long as there's only a single implementation, devirtualization works well and can even inline the functions. But you need to pass something along the lines of "-fwhole-program-vtables -fstrict-vtable-pointers" + LTO. Of course the vtable pointer is still present in the object. So I personally only use the aforementioned "thin headers" at a system level (IRenderer), rather than for each individual object (ITexture).
In addition to what everyone else has said it also makes it difficult to allocate the type on the stack. Even if you do allow it you'll at least need a probe.
Game consoles generally only offer Clang as the compiler. If you can compile Rust to C, then you can finally use Rust for videogames that need to run everywhere.
Interesting library, but I see it falls into the same trap as almost all SIMD libraries: it hardcodes the vector target completely, and you can't mix and match feature levels within a build. The documentation recommends writing your kernels into DLLs and dynamically loading them, which is a huge mess: https://jfalcou.github.io/eve/multiarch.html
Meanwhile xsimd (https://github.com/xtensor-stack/xsimd) has the feature level as a template parameter on its vector objects, which lets you branch at runtime between SIMD levels as you wish. I find it's a far better way of doing things if you actually want to ship the SIMD code to users.
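The pattern looks roughly like the sketch below. I'm writing it with plain GCC/Clang builtins and made-up tag types rather than xsimd's real spellings (xsimd expresses it as something like xsimd::batch<float, Arch>), so treat the names as illustrative:

    #include <cstddef>

    // Made-up architecture tags; a SIMD library would attach real vector types to these.
    struct sse2_tag { static constexpr std::size_t width = 4; };
    struct avx2_tag { static constexpr std::size_t width = 8; };

    // One kernel, stamped out once per architecture tag.
    template <typename Arch>
    void scale(float* data, std::size_t n, float k) {
        // A real SIMD library would use Arch-wide loads/stores here; the point is
        // that the feature level is an ordinary template parameter.
        for (std::size_t i = 0; i < n; ++i) data[i] *= k;
    }

    // Runtime dispatch: branch on CPU features, then call the matching instantiation.
    void scale_dispatch(float* data, std::size_t n, float k) {
        if (__builtin_cpu_supports("avx2")) scale<avx2_tag>(data, n, k);
        else                                scale<sse2_tag>(data, n, k);
    }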
You can do many things with macros and inline namespaces, but I believe they run into problems when modules come into play. Can you compile the same code twice, with different flags, with modules?
I think I do understand, this is exactly what we do. (MEGA_MACRO == HWY_NAMESPACE)
Then we have a table of function pointers to &AVX2::foo, &AVX3::foo etc. As long as the module exports one single thing, which either calls into or exports this table, I do not see how it is incompatible with building your project using modules enabled?
(The way we compile the code twice is to re-include our source file, taking care that only the SIMD parts are actually seen by the compiler, and stuff like the module exports would only be compiled once.)
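A self-contained sketch of that scheme, with made-up names (Highway does the re-inclusion via its own per-target machinery; this just shows the shape of "stamp the kernel out per target namespace, export one table"):

    #include <cstddef>

    // Stamp the same kernel body out once per target namespace. In a real build,
    // each copy would be compiled with that target's codegen flags.
    #define DEFINE_KERNELS(NS)                                        \
        namespace NS {                                                \
            inline void foo(float* data, std::size_t n) {             \
                for (std::size_t i = 0; i < n; ++i) data[i] += 1.0f;  \
            }                                                         \
        }

    DEFINE_KERNELS(SSE4)
    DEFINE_KERNELS(AVX2)
    DEFINE_KERNELS(AVX3)

    using KernelFn = void (*)(float*, std::size_t);

    // The single thing the enclosing module/TU needs to export: a dispatch table
    // of per-target entry points, indexed by whatever CPU detection picks at runtime.
    inline const KernelFn foo_table[] = { &SSE4::foo, &AVX2::foo, &AVX3::foo };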
What leads you to that conclusion?
It is still possible to use #include in module implementations.
We can use that to make the module implementation look like your example.
Thus it ought to be possible, though I have not yet tried it.
Since you seem knowledgeable about this, what does this do differently from other SIMD libraries like xsimd / highway? Is it the addition of algorithms similar to the STD library that are explicitly SIMD optimized?
The algorithms I tried to make as good as I knew how. Maybe 95% there.
Nice tail handling. A lot of things supported.
I like our interface over other alternatives, but I'm biased here.
Really massive math library.
Game developers have been doing this since forever; it's one of their main reasons to avoid the STL.
EASTL has this as a feature by default, and the Unreal Engine container library has the bounds checks enabled in most shipped games. The performance cost of those bounds checks in practice is well worth the reduction in bugs, even in performance-sensitive code.
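For readers who haven't seen it, the mechanism is nothing exotic; it's roughly the sketch below (EASTL and Unreal use their own assert macros and container types, so this is just the shape of it, not their code):

    #include <cassert>
    #include <cstddef>

    // Minimal bounds-checked fixed-size array in the style game containers use.
    template <typename T, std::size_t N>
    struct CheckedArray {
        T data[N];

        T& operator[](std::size_t i) {
            assert(i < N && "index out of bounds");  // kept or compiled out per build config
            return data[i];
        }
    };

The branch is trivially predictable, which is a big part of why the cost is so low in practice.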
Which is yet another reason to assert (pun intended) how far from reality the anti-bounds-check folks are, when even the game industry takes bounds checks seriously.
A truly incredible profiler for the great price of free. Nothing comes close to this level of features and performance, even in paid software. Tracy could cost thousands of dollars a year and it would still be the best profiler.
Tracy requires you to add macros to your codebase to log functions/scopes, so it's not an automatic sampling profiler like Superluminal, Very Sleepy, the VS profiler, and others. Each of those macros has around 50 nanoseconds of overhead, so you can use them liberally, in the millions. In the UI, it has a stats window that records the average, deviation, and min/max of those profiler zones, which can be used to profile functions at the level of single nanoseconds.
It's the main thing I use for all my profiling and optimization work. I combine it with Superluminal (a sampling profiler) to get a high-level overview of the program, then I put Tracy zones on the important places to get the detailed information.
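For reference, the instrumentation looks like this (Tracy's ZoneScoped/ZoneScopedN/FrameMark macros; you need to build with TRACY_ENABLE defined and link the Tracy client, and the header path can differ between versions):

    #include <tracy/Tracy.hpp>

    void UpdatePhysics() {
        ZoneScoped;                // zone named after the enclosing function
        // ... work ...
    }

    void Frame() {
        {
            ZoneScopedN("Update"); // explicitly named zone, roughly tens of ns of overhead
            UpdatePhysics();
        }
        FrameMark;                 // marks the frame boundary in the timeline view
    }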
It does, but I don't use it much because it's too slow and heavy on memory on my Ryzen 5950X (32 threads) on Windows: a couple of seconds of tracing runs into tens of gigabytes of RAM.
Yeah I had issues with the Tracy sampler. It didn’t “just work” the way Superluminal did.
My only issue with Superluminal is I can’t get proper callstacks for interpreted languages like Python. It treats all the CPP callstacks as the same. Not sure if Tracy can handle that nicely or not…
They are not slower than headers. I've been looking into it because the modular STL is such a big win. In my little toy project I have .cpp files compiling in 0.05 seconds while doing import std.
The downside is that at the moment you can't mix the normal header STL with the module STL in the same project (MSVC), so it's for cleanroom small projects only. I expect that the second you can reliably use it, almost everyone will switch overnight just from how big a speed boost it gives on the STL versus even precompiled headers.
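For anyone who hasn't tried it, the consuming side is as small as it sounds; a minimal sketch (on MSVC this roughly means /std:c++latest plus building the std module once for the project; the exact setup varies by toolchain version):

    // main.cpp
    import std;   // the whole standard library as a module, no header parsing per TU

    int main() {
        std::vector<int> v{1, 2, 3};
        std::cout << "sum = " << (v[0] + v[1] + v[2]) << '\n';
    }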
The one way in which they are slower than headers is that they create longer dependency chains of translation units, whereas with headers you unlock more parallelism at the beginning of the build process, but much of it is duplicated work.
Every post or exploration of modules (including this one) has found that modules are slower to compile.
> I expect the second you can reliably use that almost everyone will switch overnight just from how fast of a speed boost it gives on the STL vs even precompiled headers.
I look forward to that day, but it feels like we're a while off it yet
Depends on the compiler: Clang is able to pre-instantiate templates and generate debug info as part of its PCH system (for instance, most likely you have some std::vector<int> which can be instantiated somewhere in a transitively included header).
In my projects enabling the relevant flags gave pretty nice speedups.
Yes, but template code lives entirely in headers, so it gets parsed every single time it's included in some translation unit. With modules this only happens once, so it's a huge speed upgrade in pretty much all cases.
Whenever I’ve profiled compile times, parsing accounts for relatively little of the time, while the vast majority of the time is spent in the optimizer.
So at least for my projects it’s a modest (maybe 10-20%) speed up, not the order of magnitude speed up I was hoping for.
For some template heavy code bases I’ve been in, going to PCH has cut my compile times to less than half. I assume modules will have a similar benefit in those particular repositories, but obviously YMMV
Jai has a few unique features that are quite interesting. You can think of it as C with extremely powerful metaprogramming, compile-time codegen and reflection, and a strong template system, plus a few extra niceties like native bump allocator support or easily creating custom iterators. There is nothing quite like it.
Thanks. I guess the "extremely powerful metaprogramming", "compile-time codegen" and "strong template system" are essentially the same thing. So the unique selling point of Jai from your perspective would be "C with generic metaprogramming, iterators and reflection"? Besides reflection, which is quite limited in C++, this would essentially be a subset of C++, wouldn't it?
C++ doesn't go anywhere near this level.
Jai templates execute normal code (any code) at compile time, so there is no need for SFINAE tricks or the like; you just write code that directly deals with types.
You can do things like having structs remove or add members just by putting the member behind an if branch. You can also fully inspect any type and iterate over its members through reflection. There is a macro system that lets you manipulate and insert "code" objects; the iterator system works this way: you create a for-loop operator overload where the for-loop body is sent to the function as a parameter.
It also lets you convert a string (calculated at compile time) into code, either by adding it into the macro system or just inserting it somewhere.
Some of the experiments I've done: implementing a full mark-sweep GC by generating reflection code that implements the GC sweep logic; a database system that directly converts structs into table rows and codegens all the query-related code; automatic inspection of game objects via a templated "inspect<T>" kind of thing that exposes the object to imgui or debug log output; fully automatic JSON serialization for everything; and an ECS engine that directly codegens the optimal gather code and prepares the object-model storage from whatever loops you use in the code.
None of those were real projects, just experiments to play around with it, but each was done in just a few dozen or hundred lines of code. While in theory you could do those in C++, it would take serious levels of template abominations to write. In Jai you write your compile-time templates as normal Jai. And it all compiles quickly with optimal codegen (it doesn't bloat your PDB the way a million type definitions would in C++ template metaprogramming).
The big downside of the language is that it's still in a closed private beta, so the tooling leaves much to be desired. There is a syntax coloring system, but things like proper IDE autocomplete or a native debugger aren't there. I use RemedyBG to debug and VS Code as a text editor for it. It's also at a relatively early stage even if it's been around for years, so some features are still in flux between updates.
That sounds interesting; seems to be even more flexible than comptime in Zig, almost as powerful as Lisp and the MOP. How about compiler and runtime performance of these features?
EDIT: can you point to source locations in the referenced projects which demonstrate such features?
https://pastebin.com/VPypiitk This is a very small experiment I did to learn the metaprogramming features. It's an ECS library using the same model as entt (https://github.com/skypjack/entt). In 200 lines or so it does the equivalent of a few thousand lines of template-heavy C++, while compiling instantly and generating good debug code.
Some walkthrough:
Line 8 declares a SparseSet type as a fairly typical template; it's just a struct with arrays of type T inside. The next lines implement getters/setters for this data structure.
Note how a std::vector-style dynamic array is just part of the language, with the [..] syntax to declare a dynamically sized array.
Line 46, Base_Registry, is where things get interesting. This is a struct that holds a bunch of SparseSets of different types and provides getters/setters for them by type. It uses code generation to do this: the #insert at the start of the struct injects codegen that creates structure members from the type list the struct gets in its declaration. Note also how type lists are a native structure in the language, no need for variadics.
At line 99 I decide to do variadic-style tail templates anyway, for fun. I implement a function that takes a typelist and returns the tail, and the struct is created through recursion as one would do in C++. Getters and setters for the View struct are also implemented through recursion.
Line 143 has the for expansion. This is how you overload the for loop functionality to create custom iterators.
The rest of the code is just some basic test code that runs the thing.
The last line does #import basic, essentially the equivalent of pulling in the STL. Jai doesn't care about the order of declarations, so having the imports at the bottom is fairly common.
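For comparison, the data structure at the heart of the walkthrough (and of entt's model) looks roughly like this in C++ (a simplified sketch, not entt's or the pastebin's actual code):

    #include <cstdint>
    #include <utility>
    #include <vector>

    // Sparse set: a sparse index array maps entity ids to slots in densely packed storage.
    template <typename T>
    struct SparseSet {
        std::vector<std::uint32_t> sparse;  // entity id -> index into dense/data
        std::vector<std::uint32_t> dense;   // packed entity ids
        std::vector<T>             data;    // packed components, parallel to dense

        void insert(std::uint32_t entity, T value) {
            if (entity >= sparse.size()) sparse.resize(entity + 1, UINT32_MAX);
            sparse[entity] = static_cast<std::uint32_t>(dense.size());
            dense.push_back(entity);
            data.push_back(std::move(value));
        }

        bool contains(std::uint32_t entity) const {
            return entity < sparse.size() && sparse[entity] != UINT32_MAX &&
                   dense[sparse[entity]] == entity;
        }

        T& get(std::uint32_t entity) { return data[sparse[entity]]; }
    };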
Dunno about runtime performance, but compilation speed is regularly mentioned as an important aspect of Jai, and I think it's not an exaggeration to call it one of the fastest languages when it comes to compile speed.
So, it is like Nim?
Templates, metaprogramming, custom iterators, Nim's comp-time is also powerful, but I heard that 100% of Jai is available at compile time?
How good is interop with other languages?
Does Jai's syntax have some nice features?
Syntax sugar like Nim's UFCS or list comprehension (collect macro in nim)? effects system?
Functional paradigm support?
Yup, if I had to pick the language closest to Jai I would choose Nim. Both have a bytecode-VM interpreter to run compile-time code (unlike Zig, which still tree-walks the AST), both have similar template codegen systems (though Nim also has AST macros), and both have good interop with existing C/C++ code.
The one issue I had with Nim while trying it out is that there is an incredibly rich set of features, but many of them seem half-baked and it's constantly changing, so you're always a bit unsure which features are stable and safe to use and which are "experimental". But at least it's been public for quite some time and you can actually use it right away...
I once had a possible series of PRs that would have increased the performance of the Godot renderer, fixing considerable CPU bottlenecks in scene rendering. Reduz didn't like the changes and it went into the "will be fixed in 4.0" deflection. To this day most of those performance fixes aren't there, and 4.0 is slower than the prototype I had. My interaction was very much not positive.
Even then, I believe the Godot leadership is doing a great job. It's almost comparable to the amazing process Blender has. This post looks to me like a ridiculous statement. Reduz and other Godot leads get pestered constantly by hundreds of people daily, and Godot has thousands of open issues on the GitHub repo for bug reports.
The Godot project and W4 spend their money wisely. I know some freelance developers who got hired to do specific project features. Someone like reduz just does not need to scam anyone; if he wanted money he could likely work for another company as a low-level C++ engineer and earn more than what he pays himself from the W4 funding. That W4 funding is being used to make Godot into a "real" game engine, with the console support that is needed for the engine to be taken seriously by commercial projects and not just small indies or restricted projects. Setting up a physical office and a place to keep all those console development kits costs money, and hiring developers experienced on those platforms is not cheap.
The way I see it, the Godot project often develops features in a "marketing" fashion. This clashes quite directly with people using it for serious projects; the Unity engine has a very similar issue. We get development effort spent on fancy dynamic global illumination while the classic static lightmaps (needed for lower-end hardware) are completely rejected, and even basic features like level-of-detail or occlusion culling, which are considered must-haves that every engine ships, are missing. I think this is what makes the poster cyberreality in the linked forum so angry. It's one of the big faults of the engine, but it's not that bad as a development strategy. Those fancy big features attract a lot of users, who can then provide feedback and bug reports, PR their fixes to the engine, and of course bring funding and hype. The main team makes sure the architecture of the engine is good, with some fancy big features, and the bugfixing and more niche/professional features are left to the community to PR. GitHub is filled with "better" engines from a technical standpoint that have a total of one user.