Unfortunately, Memory64 comes with a significant performance penalty because the wasm runtime has to check bounds (which wasn't necessary on 32-bit as the runtime would simply allocate the full 4GB of address space every time).
But if you really need more than 4GB of memory, then sure, go ahead and use it.
Actually, runtimes often allocate 8GB of address space because WASM has a [base32 + index32] address mode where the effective address could overflow into the 33rd bit.
On x86-64, the start of the linear memory is typically put into one of the two remaining segment registers: GS or FS. Then the code can simply use an address mode such as "GS:[RAX + RCX]" without any additional instructions for addition or bounds-checking.
Seriously though, I’ve been wondering for a while whether I could build a GCC for x86-64 that would have 32-bit (low 4G) pointers (and no REX prefixes) by default and full 64-bit ones with __far or something. (In this episode of Everything Old Is New Again: the Very Large Memory API[1] from Windows NT for Alpha.)
Unfortunately the obvious `__attribute__((mode(...)))` errors out if anything but the standard pointer-size mode (usually SI or DI) is passed.
Or you may be able to do it based on x32, since your far pointers are likely rare enough that you can do them manually. Especially in C++. I'm pretty sure you can just call "foreign" syscalls if you do it carefully.
Especially how you could increase the segment value by one or the offset by 16 and you would address the same memory location. Think of the possibilities!
And if you wanted more than 1MB you could just switch memory banks[1] to get access to a different part of memory. Later there was a newfangled alternative[2] where you called some interrupt to swap things around but it wasn't as cool. Though it did allow access to more memory so there was that.
Then virtual mode came along and it's all been downhill from there.
Schulman’s Unauthorized Windows 95 describes a particularly unhinged one: in the hypervisor of Windows/386 (and subsequently 386 Enhanced Mode in Windows 3.0 and 3.1, as well as the only available mode in 3.11, 95, 98, and Me), a driver could dynamically register upcalls for real-mode guests (within reason), all without either exerting control over the guest’s memory map or forcing the guest to do anything except a simple CALL to access it. The secret was that all the far addresses returned by the registration API referred to the exact same byte in memory, a protected-mode-only instruction whose attempted execution would trap into the hypervisor, and the trap handler would determine which upcall was meant by which of the redundant encodings was used.
And if that’s not unhinged enough for you: the boot code tried to locate the chosen instruction inside the firmware ROM, because that will have to be mapped into the guest memory map anyway. It did have a fallback if that did not work out, but it usually succeeded. This time, the secret (the knowledge of which will not make you happier, this is your final warning) is that the instruction chosen was ARPL, and the encoding of ARPL r/m16, AX starts with 63 hex, also known as the ASCII code of the lowercase letter C. The absolute madmen put the upcall entry point inside the BIOS copyright string.
(Incidentally, the ARPL instruction, “adjust requested privilege level”, is very specific to the 286’s weird don’t-call-it-capability-based segmented architecture... But it’s has a certain cunning to it, like CPU-enforced __user tagging of unprivileged addresses at runtime.)
To clarify: when I said that “the boot code tried to locate the chosen instruction inside the firmware ROM”, I literally meant that it looked through the entirety of the ROM BIOS memory range for a byte, any byte, with value 63 hex. There’s even a separate (I’d say prematurely factored out) routine for that, Locate_Byte_In_ROM. It just so happens that the byte in question is usually found inside the copyright string (what with the instruction being invalid and most of the rest of the exposed ROM presumably being valid code), but the code does not assume that.
If the search doesn’t succeed or if you’ve set SystemROMBreakPoint=off in the [386Enh] section of SYSTEM.INI[1] or run WIN /D:S, then the trap instruction will instead be placed in a hypervisor-provided area of RAM that’s shared among all guests, accepting the risk that a misbehaving guest will stomp over it and break everything (don’t know where it fits in the memory map).
As to the chances of failing, well, I suspect the original target was the c in “(c)”, but for example Schulman shows his system having the trap address point at “chnologies Ltd.”, presumably preceded by “Phoenix Te”. AMI and Award were both “Inc.”, so that would also work. Insyde wasn’t a thing yet; don’t know what happened on Compaq or IBM machines. One way or another, looks like a c could be found somewhere often enough that the Microsoft programmers were satisfied with the approach.
Somewhat related. At some point around 15 years ago I needed to work with large images in Java, and at least at the time the language used 32-bit integers for array sizes and indices. My image data was about 30 gigs in size, and despite having enough RAM and running a 64-bit OS and JVM I couldn't fit image data into s ingle array.
This multi-memory setup reminds me of my array juggling I had to do back then. While intellectually challenging it was not fun at all.
The problem with multi-memory (and why it hasn't seen much usage, despite having been supported in many runtimes for years) is that basically no language supports distinct memory spaces. You have to rewrite everything to use WASM intrinsics to work on a specific memory.
It looks like memories have to be declared up front, and the memcpy instruction takes the memories to copy between as numeric literals. So I guess you can't use it to allocate dynamic buffers. But maybe you could decide memory 0 = heap and memory 1 = pixel data or something like that?
They might have meant lack of true 64bit pointers ..? IIRC the chrome wasm runtime used tagged pointers. That comes with an access cost of having to mask off the top bits. I always assumed that was the reason for the 32bit specification in v1
I still don't understand why it's slower to mask to 33 or 34 bit rather than 32. It's all running on 64-bit in the end isn't it? What's so special about 32?
That's because with 32-bit addresses the runtime did not need to do any masking at all. It could allocate a 4GiB area of virtual memory, set up page permissions as appropriate and all memory accesses would be hardware checked without any additional work. Well that, and a special SIGSEGV/SIGBUS handler to generate a trap to the embedder.
With 64-bit addresses, and the requirements for how invalid memory accesses should work, this is no longer possible. AND-masking does not really allow for producing the necessary traps for invalid accesses. So every one now needs some conditional before to validate that this access is in-bounds. The addresses cannot be trivially offset either as they can wrap-around (and/or accidentally hit some other mapping.)
I don't feel this is going to be as big of a problem as one might think in practice.
The biggest contributor to pointer arithmetic is offset reads into pointers: what gets generated for struct field accesses.
The other class of cases are when you're actually doing more general pointer arithmetic - usually scanning across a buffer. These are cases that typically get loop unrolled to some degree by the compiler to improve pipeline efficiency on the CPU.
In the first case, you can avoid the masking entirely by using an unmapped barrier region after the mapped region. So you can guarantee that if pointer `P` is valid, then `P + d` for small d is either valid, or falls into the barrier region.
In the second case, the barrier region approach lets you lift the mask check to the top of the unrolled segment. There's still a cost, but it's spread out over multiple iterations of a loop.
As a last step: if you can prove that you're stepping monotonically through some address space using small increments, then you can guarantee that even if theoretically the "end" of the iteration might step into invalid space, that the incremental stepping is guaranteed to hit the unmapped barrier region before that occurs.
It's a bit more engineering effort on the compiler side.. and you will see some small delta of perf loss, but it would really be only in the extreme cases of hot paths where it should come into play in a meaningful way.
> AND-masking does not really allow for producing the necessary traps for invalid accesses.
Why does it need to trap? Can't they just make it UB?
Specifying that invalid accesses always trap is going to degrade performance, that's not a 64-bit problem, that's a spec problem. Even if you define it in WASM, it's still UB in the compiler so you aren't saving anyone from UB they didn't already have. Just make the trapping guarantee a debug option only.
It's WASM. WASM runs in a sandbox and you can't have UB on the hardware level. Imagine someone exploiting the behavior of some browser when UB is triggered. Except that the programmer is not having nasal demons [1] but some poor user, like a mom of four children in Abraska running a website on her cell phone.
The UB in this case is "you may get another value in the sandboxed memory region if you dereference an invalid pointer, rather than a guaranteed trap". You can still have UB even in a sandbox.
Seems like they got overly attached to the guaranteed trapping they got on 32-bit and wanted to keep it even though it's totally not worth the cost of bounds checking every pointer access. Save the trapping for debug mode only.
Ah, so you meant UB = unspecified behavior, not UB = undefined behavior.
Maybe. Bugs that come from spooky behavior at a distance are notoriously hard to debug, especially in production, and it's worthwile to pay for it to avoid that.
The special part is the "signal handler trick" that is easy to use for 32-bit pointers. You reserve 4GB of memory - all that 32 bits can address - and mark everything above used memory as trapping. Then you can just do normal reads and writes, and the CPU hardware checks out of bounds.
With 64-bit pointers, you can't really reserve all the possible space a pointer might refer to. So you end up doing manual bounds checks.
Can't bounds checks be avoided in the vast majority of cases?
See my reply to nagisa above (https://news.ycombinator.com/item?id=45283102). It feels like by using trailing unmapped barrier/guard regions, one should be able to elide almost all bounds checks that occur in the program with a bit of compiler cleverness, and convert them into trap handlers instead.
Yeah, certainly compiler smarts can remove many bounds checks (in particular for small deltas, as you mention), hoist them, and so forth. Maybe even most of them in theory?
Still, there are common patterns like pointer-chasing in linked list traversal where you just keep getting an unknown i64 pointer, that you just need to bounds check...
Because CPUs still have instructions that automatically truncate the result of all math operations to 32 bits (and sometimes 8-bit and 16-bit too, though not universally).
To operate on any other size, you need to insert extra instructions to mask addresses to the desired size before they are used.
But if you really need more than 4GB of memory, then sure, go ahead and use it.