I feel like I still didn't fully understand what's going on here. Is the following correct? "Threads have a 'canonical' stack that the OS auto-grows for you as you use more of it. But you can also create your own stack by putting any value you want in RSP. This is what the Go program did, and the vDSO, assuming it ran on an auto-growing stack, tried to probe it, which led to corruption."
I believe that Golang, as a green-threaded runtime, is allocating a separate carrier thread for running syscalls on, so that a blocking syscall won’t block the green threads. These syscall carrier threads are allocated with a different initial stack size and stack size limit than the green-thread-scheduler carrier threads, because it’s expected that syscalls will always just enter kernel code (which has its own stack).
But vDSOs don’t enter the kernel. They just run as userland code; and so they depend on the userland allocated stack to be arbitrarily deep in a way that kernel calls don’t.
As shown in the article, Golang seems to have code specifically for dealing with vDSO-type pseudo-syscalls — but this is likely a specialization of the pre-existing syscall-carrier-thread allocation code, and so started off with a bad assumption about how much stack should be allocated for the created threads.
(I should also point out that the OS stack size specified in the ELF executable headers only guarantees the stack size of the initial thread of a process created by execve(2). All further threads get their stacks allocated explicitly in userland code, by libpthread or the like, typically via mmap(2). Normally these abstractions just reuse the same config params from the executable (unless you override them, using e.g. pthread_attr_setstacksize). But, as the article says, Golang implements its own support for things like this, and so can implement special thread-allocation strategies per carrier thread type.)
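To give a feel for that per-thread knob, here is a minimal sketch in Python, assuming CPython on a pthreads platform, where `threading.stack_size` sets the stack size used for threads created afterwards. The helper name `run_with_stack` and the 256 KiB figure are made up for illustration; this is not how Go configures its threads.

```python
import threading

def run_with_stack(size_bytes, fn):
    """Run fn() on a new thread created with the requested stack size.

    Hypothetical helper: size_bytes must be 0 (platform default) or at
    least 32 KiB, otherwise threading.stack_size raises ValueError.
    """
    previous = threading.stack_size(size_bytes)  # returns the old setting
    result = []
    try:
        t = threading.Thread(target=lambda: result.append(fn()))
        t.start()
        t.join()
    finally:
        threading.stack_size(previous)  # restore for later threads
    return result[0]

if __name__ == "__main__":
    # The thread body runs on a 256 KiB stack instead of the default.
    print(run_with_stack(256 * 1024, lambda: sum(range(10))))
```

The point is only that the size is a userland decision made at thread creation, which is why a runtime that creates its own threads is free to pick something much smaller.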
The problem is that the vDSO (which is compiled as part of the kernel but runs in userspace) does a stack probe for security reasons, trying to see if it will overrun the stack. It does this by checking if at least a page’s worth of data is accessible. If not, it will (typically) fault. However, Go programs use a stack size so small that they may have other data a page away, which means the probe may mess with that data and cause bad things to happen.
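To make the arithmetic concrete, here is a toy model in Python. It is not the vDSO's or Go's actual code: it just treats a stack as an address range that grows down and asks whether a probe one page below the stack pointer still lands inside it. The addresses, the 4096-byte probe distance, and both stack sizes are illustrative assumptions.

```python
PAGE = 4096  # assumed page size, also used as the probe distance

def probe_address(sp, distance=PAGE):
    """Address touched by a probe `distance` bytes below the stack pointer."""
    return sp - distance

def probe_stays_in_stack(stack_low, stack_high, sp, distance=PAGE):
    """True if the probed address falls inside [stack_low, stack_high)."""
    assert stack_low <= sp <= stack_high
    return stack_low <= probe_address(sp, distance) < stack_high

# Hypothetical 8 MiB stack and hypothetical 2 KiB stack, each with
# 512 bytes already in use (the stack grows down from stack_high).
big_low, big_high = 0x7000_0000, 0x7000_0000 + 8 * 1024 * 1024
small_low, small_high = 0x1000_0000, 0x1000_0000 + 2048

if __name__ == "__main__":
    print(probe_stays_in_stack(big_low, big_high, big_high - 512))        # True
    print(probe_stays_in_stack(small_low, small_high, small_high - 512))  # False
```

In the second case the probed address is 2560 bytes below the bottom of the stack, so the access hits whatever happens to live next to it: other data if that memory is mapped, or a fault if it is not.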
In that case it couldn’t corrupt data, but the orq instruction itself could crash if it pointed to an unmapped address. (Which is kind of the point of stack probes.)
Thread stacks in Linux are demand-paged: if you touch the next page then it magically exists, up to a limit. But the machine is not concerned with the convenient properties of this virtual memory area. To the CPU the register RSP is just an operand, expressed or implied, to some instructions.
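For the "up to a limit" part: for the initial thread on Linux that limit is the RLIMIT_STACK resource limit. A small sketch of reading it from Python, assuming a Unix system where the `resource` module is available; `stack_limit` and `fmt` are hypothetical helper names.

```python
import resource

def stack_limit():
    """Return the (soft, hard) RLIMIT_STACK values in bytes."""
    return resource.getrlimit(resource.RLIMIT_STACK)

def fmt(value):
    """Render a limit as KiB, or 'unlimited' for RLIM_INFINITY."""
    if value == resource.RLIM_INFINITY:
        return "unlimited"
    return f"{value // 1024} KiB"

if __name__ == "__main__":
    soft, hard = stack_limit()
    print("soft:", fmt(soft), "hard:", fmt(hard))
```

None of this is visible to the CPU, which is the point above: the limit is kernel bookkeeping about one particular mapping, and RSP can point anywhere.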