Apple M1 Assembly Language Hello World

marcan_42 · on Jan 15, 2021

I see your userland M1 assembly language hello world and raise you a bare-metal M1 kernel assembly language hello world :-)

https://github.com/AsahiLinux/m1n1/blob/main/src/start.S

The first few lines of that will print 'm1n1' to the serial port as it initializes other things and eventually jumps to C code.

cshift · on Jan 15, 2021

> I see your userland M1 assembly language hello world and raise you a bare-metal M1 kernel assembly language hello world :-)

The most on-brand HN thing I've read so far today.

nneonneo · on Jan 15, 2021

This looks basically like assembly code for practically any embedded system - reading/writing hardware registers to interact with UARTs. Nice job (& best of luck with the Asahi Linux project - very exciting!)

marcan_42 · on Jan 15, 2021

Well, the Apple M1 is an embedded SoC after all :)

Thank you!

yitchelle · on Jan 15, 2021

That is the most exciting part of the M1 for me. What do you think will be the first embedded application for the M1, assuming Apple allows it to be purchased by 3rd parties? Autonomous driving, or robotics, or other..??

saagarjha · on Jan 15, 2021

Apple won't allow it to be purchased by third parties.

magnusmundus · on Jan 15, 2021

Not quite as exciting as your examples, but I guess an evolution of Apple CarPlay might be likely.

WanderPanda · on Jan 15, 2021

Yes they need to capture the car UI part as legacy automakers will not be able to build an ecosystem for that.

SulfurHexaFluri · on Jan 15, 2021

How do you actually load / run this code without an OS? I'm guessing its the same way something like grub gets run?

garaetjjte · on Jan 15, 2021

It's linked into Mach-O binary that is loaded by iBoot bootloader. It is explained in detail there: https://www.youtube.com/watch?v=d5s9fYfvzmY&t=12260

moonchild · on Jan 15, 2021

I don't know how it works on apple hardware—though I heard they used a UEFI variant with HFS+ at one point—but in general, on PCs, there are two ways to get bootcode to run.

The first is legacy BIOS: the BIOS will load the first 512 bytes of the boot sector, called the Master Boot Record (MBR) into memory location 0x7C00; they are then responsible for bootstrapping the rest of the system. Usually by doing the bare minimum to load a second-stage bootloader off the disc, and letting the second-stage bootloader do the rest of the work, since 512 bytes (actually only ~440) is really not very much.

The second is UEFI. The firmware will load a PE-formatted executable from a .efi-suffixed file on a FAT32 partition which has the EFI System Partition (ESP) flag set. Since this PE file can be any size, a second-stage bootloader is not generally necessary (but may be interesting for aesthetic reasons e.g. if you don't like PE).

alexjm · on Jan 15, 2021

The process for Apple hardware to find and start EFI is described in the "Booting from an Apple File System Partition" section on page 22 of this document. Basically, the EFI image is read from a special block on disk.

https://developer.apple.com/support/downloads/Apple-File-Sys...

The boot process for HFS+ is more involved, because you have to actually read the HFS+ file system to locate the EFI image, since it's stored in a regular file.

I'm not sure what happens after jumping into the EFI code.

(Disclaimer: I worked on the linked document.)

NewJazz · on Jan 15, 2021

On Rockchip boards, there is a bootrom that looks for data at a specific offset on a few storage devices, then calls the first one it finds. From there open source / flashable firmware initializes RAM, scales up CPU frequencies, and starts looking for a kernel or whatever.

TL;DR likely read only firmware, possibly more low level than a BIOS, UEFI, or Coreboot implementation would be.

Edit: then again, they could be shipping a full EFI implementation like they do on their x86 offerings.

user-the-name · on Jan 15, 2021

Where is the UART on that thing?

terramex · on Jan 15, 2021

UART can be enabled on M1 Mini on "the leftmost port (closest to the power input)". It first needs to be enabled by sending special USB-PD message.

More information: https://github.com/AsahiLinux/docs/wiki/HW%3AUSB-PD

user-the-name · on Jan 15, 2021

Interesting!

deaddodo · on Jan 15, 2021

He talks about it here:

https://www.youtube.com/watch?v=d5s9fYfvzmY

(In addition to the docs, as outlined by terramex)

_dh54 · on Jan 15, 2021

What baud rate does it run at?

marcan_42 · on Jan 15, 2021

115200 by default (it's what iBoot initializes it at), but I spent a chunk of yesterday's live development stream yak-shaving a way to smoothly switch to 1.5Mbaud in a way that lets my Glasgow-based auto-baud-sensing serial interface resync cleanly (turns out: it seems to be impossible using any single byte repeating with 1 stopbit and no pauses due to ambiguous framing; I had to use a multiple-byte pattern).

reaperducer · on Jan 15, 2021

115200 by default

This made me laugh because I spent a good chunk of last weekend trying to debug a problem with some of my old computer hardware and the problem turned out to be a link that couldn't handle anything over 150 baud.

Now we're squirting 115,200 straight out of a chip. It's like the future!

_dh54 · on Jan 16, 2021

The crazy part is that 115200 is actually slow by today’s standards!

RantyDave · on Jan 16, 2021

LOL. Zephyr?

sneak · on Jan 15, 2021

String bet! :)

(Raises are not accepted after you've already called.)

nneonneo · on Jan 15, 2021

The author appears to have conflated "Linux" with the "kernel" - for example, "When calling a Linux service the function number goes in X16 rather than X8.", "X16 - linux function number", "Call linux to output the string", etc. in the macOS assembly code.

Linux refers specifically to the operating system kernel that's used on Linux systems; the macOS kernel is called Mach. Technically, the non-negative-numbered system calls on macOS are those derived from BSD; macOS additionally has another (negative-numbered) set of system calls for its Mach microkernel.

coldtea · on Jan 15, 2021

>The author appears to have conflated "Linux" with the "kernel"

I don't think he conflated it (in his mind). In the post he's quite clear which is which.

He probably just reused an ARM Linux example code (probably from his book), made some changes required for macOS ARM, but forgot to change the name of the kernel.

bananabreakfast · on Jan 15, 2021

Small nitpick, but the macOS kernel is actually called XNU which is a hybrid of Mach and FreeBSD.

reaperducer · on Jan 15, 2021

the macOS kernel is actually called XNU

Since you seem knowledgeable about these things, how does one pronounce "XNU?"

Is it "ex-en-you" or "zznoo" or "she-no" or something else?

glhaynes · on Jan 15, 2021

Fairly certain I've heard it as "ex-noo". But it could just be that I've said it in my head that way forever!

joshspankit · on Jan 16, 2021

ZA-na-doo

athrun · on Jan 15, 2021

Yes I think he's getting some of terminology wrong.

Saying that "Linux [is] based on Unix" is not really accurate either.

But ultimately this is just nitpicking. It's great that he's sharing what he learned.

derefr · on Jan 15, 2021

> Saying that "Linux [is] based on Unix" is not really accurate either.

The context of the author's statement was syscall ABIs. And Linux's original (x86) syscall ABI is based on [a snapshot of] the syscall ABI from a Unix (4.2BSD, I think.)

Loosely based, mind you — 4.2BSD wasn't targeting a 32-bit architecture, let alone x86, so the registers et al weren't the same. But the syscall numbers match up, and the number and order of registers used have direct parallels.

Compare and contrast:

• FreeBSD ABI (direct descendant of 4.2BSD): https://github.com/freebsd/freebsd-src/blob/master/sys/kern/...

• Linux x86 ABI: https://chromium.googlesource.com/chromiumos/docs/+/master/c...

Everything lines up until you get to the OS-proprietary stuff.

spijdar · on Jan 15, 2021

So, a few things. 4.2BSD was targeting VAX, which is most assuredly a 32 bit machine.

Further, the common denominator you're seeing goes back much, much further than BSD itself. Behold! "Version 1" of UNIX, written in just 14 files of PDP-11 assembly. And if you look at u1.s, you'll see the sysent routine which handles syscalls. The numbering should be familiar :-)

https://github.com/dspinellis/unix-history-repo/blob/Researc...

If you go to 4.2BSD, you'll see many of these syscalls labeled as "old", e.g. "old pause", "old wait", "old break".

https://github.com/dspinellis/unix-history-repo/blob/BSD-4_2...

You can go a little further back to the original PDP7 assembly, but I don't believe the "kernel" actually ran on a separate "process" at all, and the userland simply linked into kernel symbols and called "read", "readdir" etc directly, hence why even later unix documentation tends to call these "routines" or "library calls".

derefr · on Jan 15, 2021

> Further, the common denominator you're seeing goes back much, much further than BSD itself.

Interesting, I didn’t know that! There’s definitely an essay to be written here about the evolution of this “Unix system-call convention” over the decades, going into how this table of calls survived each transition and port mostly-unscathed. (Given that we’re throwing actual source syscall tables back and forth as proofs, there’s definitely some narrativizing to be done here, The Old New Thing-style.)

I assume the syscalls weren’t retained in ports with the goal of “binary compatibility”, given that these descendant Unix ports were on different architectures that couldn’t literally exec(2) binaries from their ancestor. Guesses:

• Toolchain compatibility?

• Some shared cross-compiling assembler/linker that nevertheless had hard-coded syscalls?

• (The perhaps never-achieved-in-practice goal of) emulation-assisted descendant cross-compatibility, ala z/OS?

• The existence of hybrid/transitional minicomputer generations, that had application processors for both the old and new architectures (or application processors that could execute on both ISAs!), such that at least some of the systems being ported to could exec(2) the ancestor’s binaries straight from tape?

• Or just the expectation that, despite Unix and C being so intertwined, there were still enough people writing ASM for these machines — using a non-macro assembler — who had developed reflex-memory for the existing syscall numbers, that it would be a bad idea to change the table out from under them?

> the userland simply linked into kernel symbols and called "read", "readdir" etc directly

So basically, the original Unix was a DOS, rather than a kernel/supervisor. Was that just because the PDP7 didn’t have virtual memory management, or was it a conscious design decision that was later reversed?

spijdar · on Jan 15, 2021

Well first of all I was wrong -- the PDP7 did have syscalls, I'm just bad at reading PDP7 assembly and missed the dispatcher. Curiously, it looks like the sequence is entirely different, although there could be some magic that makes the order different than it appears at first glance.

https://github.com/DoctorWkt/pdp7-unix/blob/master/src/sys/s...

It's all just guessing, but I figure the explanation is much simpler -- for PDP11 UNIX, they just kept using the same syscalls up till V7 / 2BSD, and there should have been a sort of "rolling release" binary compatibility. For the VAX, the first port (32v) probably just retained the original numbering since there was no reason to deviate from it, which colored 3BSD and 4BSD, hence {Net,Free,Open}BSD and Darwin and friends.

Worth pointing out that several versions of Linux have rather different syscall tables. 32 bit ARM and x86 are more-or-less matches, with ARM differing on a few early syscalls, while 64 bit ARM and amd64 differing quite dramatically. The old ABI for 32bit MIPS also matches, but both the n32 and n64 ABIs use slightly variant syscall tables. PowerPC 32/64 bit is also a close match, although it has some impedance (I think it matches closer to AIX by design)

At the end of the day, I think the similarity is mostly a mixture of coincidence, system developers being influenced by their bootstrap system's syscall tables, and no real reason to change them up. No reason to not change them, either, since it's pretty trivial to use different dispatch tables for different types of processes, like how the BSD's handle other-OS compat.

Blikkentrekker · on Jan 15, 2021

The term is so often conflated because it seems as though what is wanted be a technical definition of a concept that is not technical, but social.

“Linux” the kernel is technical, but that other thing for which “Linux” is often used has no technical barriers to it, and is entirely a social thing — something is “Linux” when it declares itself part of the tribe, and is so accepted.

SteamOS as a marketing strategy always used the phrase “Linux and SteamOS” and was thus successful in moving itself outside of the tribe, for example.

There is no technical definition of “an operating system” and there are no technical reasons for what “operating systems" are and aren't “Linux”; it is purely based on social cohesion.

userbinator · on Jan 15, 2021

Nothing a little s/Linux/macOS/ won't fix... but I sure hope the book he is promoting is less confusing than that. To his credit, he does say "in MacOS it's X, in Linux it's Y".

(I'm a long-time x86 programmer, who has done a very little of ARM but finds it boring since you can't beat a compiler as easily as you can on x86... and yet instinctively felt unease at moving 0 into a register. ;-)

nneonneo · on Jan 15, 2021

Yeah, there's a bunch of stuff in AArch64 that takes getting used to - for example, fixed-length instructions (everything is a 32-bit instruction!), tons of sanely-named registers (X0~X30, W0~W30), shorter literals (all literals need to fit in a 32-bit instruction), and so much more. Plus, it's a total break from AArch32 - I'm pretty comfortable with 32-bit ARM code, and yet 64-bit ARM code still looks pretty unusual to me.

lovelyviking · on Jan 15, 2021

> I'm pretty comfortable with 32-bit ARM

Could you give a direction to a good place for learning it ? Where is better to start? I wish to write hello world without linker or assembler for linux.

saagarjha · on Jan 15, 2021

Try https://azeria-labs.com/writing-arm-assembly-part-1/.

madog · on Jan 15, 2021

I found Azeria Labs [0] to be a pretty good introduction for learning ARM assembly basics. Although the main focus is ARM exploitation, it's still a good primer

[0]: https://azeria-labs.com/writing-arm-assembly-part-1/

lovelyviking · on Jan 15, 2021

What do you mean by “beat the compiler” ? like running code without it nor linker ? Looks like I have a mood to try that now

userbinator · on Jan 15, 2021

Smaller and/or faster than an equivalent HLL function.

raxxorrax · on Jan 15, 2021

Not yet given the M1 or MacOS a spin, but the syscall adresses should be available under /sys/syscall.h according to an answer on stack exchange.

Cannot find a table on the net and syscall adresses could be subject to changes.

saagarjha · on Jan 15, 2021

https://opensource.apple.com/source/xnu/xnu-7195.50.7.100.1/...

rckoepke · on Jan 15, 2021

I believe the first attempts at doing something like this were back in July of 2020. https://github.com/below/HelloSilicon : Someone with an early Developer Transition Kit (pre-M1 release) worked to convert the code from "Programming with 64-Bit ARM Assembly Language" to the M1's syntax.

Porting additional textbooks to M1's (ARMv8) syntax could help a lot in terms of making assembly accessible to more people. I believe there's a lot of value in learning it on a particularly popular real-world platform like x86 or M1 - where it may directly translate to reverse engineering userspace applications without having to then learn another assembler language as you might if you started with, say, RISC-V.

Truthfully I think someone should really add the M1 syntax to the new version of VisUAL: https://github.com/scc416/Visual2 . I've been intending to work on that to go along with porting Bob Plantz' "Introduction to Computer Organization: ARM Assembly Language Using the Raspberry Pi" to ARMv8 for the same educational purpose but haven't quite found the time to really dive in. A tool like VisUAL2 could help a lot of people learn this even if they don't own an M1 themselves.

Very tangentially to all of this, I'd like to showcase https://github.com/cornell-brg/pydgin , which is a flexible toolkit for simulating ISA's in Python, and was used to help validate the first version of VisUAL during its own development.

KMag · on Jan 15, 2021

I'm currently auditing the Stanford compilers course online. I understand the inertia in introductory courses, and the simplicity of MIPS, but hopefully the course is eventually ported to Aarch64 or RISC-V. Presumably a spim-like simulator for Aarch64 or RISC-V is the biggest missing component.

6.004 was one of my favorite classes at MIT, and the DEC Alpha AXP was a fine architecture to simplify for the pedagogical Beta architecture. However, I'm glad to hear they've moved to RISC-V. I presume they'd been avoiding porting the course to Aarch64 (or a simplified version thereof) due to intellectual property issues.

remexre · on Jan 15, 2021

For AArch32, not AArch64, but there's https://salmanarif.bitbucket.io/visual/index.html as a SPIM replacement.

rckoepke · on Jan 16, 2021

Worth mentioning that there's a new version of that here: https://github.com/scc416/Visual2 I don't believe the source code for the first version was ever released.

ffhhj · on Jan 15, 2021

Nice tutorial! In the 90's got really excited with x86, trying all the DOS/BIOS interrupts and CGA/EGA/VGA/VESA programming. Got a computer lab evacuated with a fake "computer virus" Mcafee AV couldn't detect, a resident program that would display a snake moving on the screen. So much fun.

It would be great to get a book on the new M1 and MacOS services, specially if it includes some guide to their Neural Engine and GPU programming.

andy_threos_io · on Jan 15, 2021

Prime example of how you NOT write assembly code, no matter what cpu are you using.

First thing first, macro assemblers have been around for at least 40+ years. So don't use numbers in your code. use macros or defines.

Ex. you can compile with gcc flags -x assembler-with-cpp

and you can have nice defines in your code.

Second on a decent OS even in assembly link against operating system call library, no matter what. The system call numbers can change. so use symbols.

Also don't write the string length in the code, it's total lame.

and use .asciz not .ascii

and define your function symbols to function gcc:

    .global my_func_name
    .type my_func_name, function
    my_func_name:

if you use C preprocessor for assembly compile just make an include like this:

  #define _FUNCTION(A) .global A ;\
   .type A, function

EDIT: For the not well informed HN readers:

https://developer.apple.com/library/archive/qa/qa1118/_index...

"Apple does not support statically linked binaries on Mac OS X. A statically linked binary assumes binary compatibility at the kernel system call interface, and we do not make any guarantees on that front. Rather, we strive to ensure binary compatibility in each dynamically linked system library and framework."

So DON'T write direct system call numbers in your code!

messe · on Jan 15, 2021

> Ex. you can compile with gcc flags -x assembler-with-cpp

You don't even need the flag. Just use the .S file extension rather than .s

saagarjha · on Jan 15, 2021

Using .ascii is fine; the string is going to a syscall–not libc. And using the standard library in a program like this is kind of beside the point…

andy_threos_io · on Jan 15, 2021

nope, every decent OS has system call library you can write like:

  bl _Open ; whatever syscall name you have

and the linker or the OS dynamic linker will get you the proper system call code, with the numbers.

BTW even in our small threos.io os we use dynamic linked system calls.

saagarjha · on Jan 15, 2021

I know how dynamic linkers work and am not arguing against using the libc interface in general. I'm saying that this program intentionally doesn't use those interfaces.

vetinari · on Jan 15, 2021

This can be a problem on macOS; Apple doesn't keep compatibility among releases at sysenter level.

Golang used to run into this, they switched to using libSystem in 1.11: https://golang.org/doc/go1.11#runtime

andy_threos_io · on Jan 15, 2021

The system call library is NOT libc.

saagarjha · on Jan 15, 2021

Then Apple doesn't have it.

andy_threos_io · on Jan 15, 2021

/usr/lib/system/libsystem_kernel.dylib

andy_threos_io · on Jan 15, 2021

you don't know about it

/usr/lib/system/libsystem_kernel.dylib

ex. https://stackoverflow.com/questions/37656016/osx-setgid-syst...

saagarjha · on Jan 15, 2021

That's part of libSystem, macOS's libc implementation…

_ph_ · on Jan 15, 2021

This is very nice to see. I recently got this book to get into ARM assembly language, with the long-term goal of using it on an M1 Mac, I barely could resist on getting a MacBook Air immediately :). Right now, I am setting up a Raspberry Pi 400 for first experiments. Unfortunately, it still ships with a 32 bit install.

While the differences between a M1 Mac and Linux-ARM machine are small, it is very nice to have the examples ported and especially tested on the Mac, as especially during learning a small mistake can be difficult to be noticed.

wk_end · on Jan 15, 2021

I low-key think ARM64 might be the nicest ISA around. It reminds me of a modern, RISCier 68k.

Intermernet · on Jan 15, 2021

I quite like RISC-V, but have yet to buy any actual hardware to play with. I don't think the worthwhile dev hardware exists yet, but I don't think it's far away!

bni · on Jan 15, 2021

How is it possible that a RISC ISA can be nice to program for in Assembly? Wasn't PPC horrible for this for example?

Wasn't the point of CISC to make it easier for Assembly programmers (made sense when x86 and 68k was defined)

adrian_b · on Jan 15, 2021

Most RISC ISA's, including PPC and ARM are much nicer to program in assembly than the Intel/AMD x86 ISA.

The only exception is with RISC ISA's that are far too simple, e.g. the base RISC-V, so they lack many common instructions or addressing modes, forcing the programmer to use sequences of instructions instead of the single instructions of other, more complete, ISA's.

Most RISC ISA's are relatively orthogonal, in the sense that you have to memorize only a small list of things, e.g. of kinds of instructions, kinds of registers, kinds of addressing modes, and then you can express any program as a combination of those few elements.

On the other hand, for the Intel/AMD x86 ISA, to be able to write an optimized program, you need to know a huge list of things.

The original x86 registers are all different having different features, there are not even 2 identical registers, so to find the optimal register allocation is difficult. Even the extra 8 registers added by the AMD 64-bit extension are not identical, but they are divided in 3 or 4 groups with different properties regarding the encoding of the instructions, which result in different program izes.

Among the registers added by more recent ISA extensions, especially the SIMD extensions, SSE/AVX, there are finally groups of identical registers that are easier to allocate, but these must be used together with the integer registers, which are needed at least for addressing.

It is difficult to determine the total number of different instructions in the x86 ISA, but they are much more than a thousand, many of which are obsolete and which should be avoided.

Even when compared to older normal ISA's, which are not so complex like x86, e.g. Motorola 68k or IBM mainframes, many RISC ISA's, including POWER & ARMv8, are still an easier target for assembly programming, when you attempt to write an optimized program.

When the performance of the program was not important, assembly programs for some traditional CISC ISA's e.g. Motorola 68k or DEC VAX could be simpler than for RISC ISA's, due to using fewer more complex instructions, but those instructions typically had slower implementations, so an optimized program might have been forced to avoid them.

Knowing the very variable timing of the CISC instructions and accounting for that in order to write an optimized program, made optimized assembly programming more difficult than for RISC CPUs.

userbinator · on Jan 15, 2021

My experience is pretty much the exact opposite, and I've been writing x86 Asm since the DOS days, only looking at ARM and other RISC stuff later.

Most RISC ISA's are relatively orthogonal, in the sense that you have to memorize only a small list of things, e.g. of kinds of instructions, kinds of registers, kinds of addressing modes, and then you can express any program as a combination of those few elements.

x86 is very orthogonal and has been since the 386. There's something called a ModRM, and the majority of the instructions use that form. The encoding is based on octal: https://gist.github.com/seanjensengrey/f971c20d05d4d0efc0781...

The original x86 registers are all different having different features, there are not even 2 identical registers, so to find the optimal register allocation is difficult.

On the contrary, that guides register allocation. Too bad most compilers don't seem to know, which is why they are often very easy to beat (and why compiler output looks so distinctively different from good handwritten Asm.)

Even the extra 8 registers added by the AMD 64-bit extension are not identical, but they are divided in 3 or 4 groups with different properties regarding the encoding of the instructions, which result in different program izes.

I have no idea what you mean by "divided in 3 or 4 groups". The "high registers" are all accessed with an extra prefix byte.

It is difficult to determine the total number of different instructions in the x86 ISA, but they are much more than a thousand, many of which are obsolete and which should be avoided.

FUD, no one really cares about "total number of different instructions". If you think of the typical RISC with fixed 32-bit instructions, that's over 4 billion values...

RISC is so bland and boring that it's hard to do a better job than a stupid compiler. An "op mem, reg" which is usually 2 or 3 bytes on x86 turns into at least 3 instructions (=12 bytes) and an explicit register allocation on a RISC. It's like writing microcode. No wonder they have so many registers --- they're really what a more intelligent CISC would use dynamically allocate internally in its uop execution engine/register renamer, saving cache occupancy and fetch bandwidth.

adrian_b · on Jan 15, 2021

"On the contrary, that guides register allocation"

I agree that you are right that for simple programs you can use the x86 registers for the purpose chosen by Intel, e.g. CX for storing a loop counter and so on.

Nevertheless, in my experience that are a lot of programs where you also want to use the registers for other things and if you would want to ensure that the program has the minimal size possible with the x86 encoding, you would have to move that variable from one register to another many times, to be able to use the shorter instructions possible with certain registers.

Of course, you can just ignore a large number of the 8086 special instruction encodings that were designed to enable shorter programs, in which case you can use most registers in an almost orthogonal way.

However, that is not very satisfying, to waste the extra existing hardware, just because it is hard to use. I prefer a CPU which does not implement features with little benefits, but if the features have already been implemented, I will rather use them.

"The "high registers" are all accessed with an extra prefix byte."

This is how it should have been, except that AMD was super lazy for unknown reasons and they saved a few decoding gates in a 64-bit CPU with millions of gates by not fully-decoding that extra prefix byte.

Because of that, there are addressing modes where you cannot encode R12 and other addressing modes where you cannot encode R13. When using R12 or R13, certain instructions are longer than when using other registers and those instructions are not the same for R12 and for R13.

So, to write the minimum-length program you must allocate differently R12, R13 and the other high registers. Moreover, there are other differences in the high registers that are not enforced by hardware, but they are enforced by the AMD64 ABI conventions about register usage, further dividing the high registers into groups with different properties.

Also the high number of registers in RISC CPUs vs. CISC CPUs has almost no relationship with the RISC concept.

One or two extra registers in a RISC are enough to cover the fact that a CISC instruction that combines a load with an operation uses an implicit register, besides the architectural registers.

The fact that most RISC ISAs have more registers than older ISAs, which happened to be CISC, is due to the need for higher performance in modern CPUs.

When you have more registers, you can interleave more chains of dependent operations, being thus able to hide the latencies of various execution units or to fully use all the execution units that would have stayed idle otherwise.

Too many registers can lower the performance, due to the need for saving and restoring them, but 32 registers are definitely better than 16, because they frequently enable faster programs.

Register renaming does not help with this. Register renaming eliminates the problem of shared resources (i.e. shared registers) in concurrent instructions, but it does not enable you to write a program that will hide the latencies caused by dependencies between instructions or that will be able to fully use all the execution units.

userbinator · on Jan 15, 2021

The high registers directly map onto the existing ones. R12 and R13 would better be called XSP and XBP, because that's what their corresponding "unescaped" registers are, and if you know the 32 bit encoding, you'd know that EBP without a displacement is the absolute mode, and ESP the SIB mode.

It's easy to complain about something you don't understand.

socialdemocrat · on Jan 15, 2021

I started with 68k on Amiga which made me love assembly. Then got exposed to x86 which I hated with a passion.

Keep in mind I was quite young then and x86 had so many oddities and inconsistencies that it made it really hard for a beginner.

In Uni when making a compiler I choose to target PPC as I figured anything was better than x86.

I kind of regretted it. I was new to RISC and the heavy use of registers threw me off. E.g. using registers for all function arguments except when there where too many. Using a register for return address but then needing to use stack anyway of call stack got too deep. How every address had to be loaded in two steps.

On top of that PPC did not offer a smaller and simpler instruction set. It was quite large from what I remember.

So getting something akin to feeling like 68k was a dream that got popped.

But I would say RISC-V gives me a bit of the 68k feeling as things are kept really small and beginner friendly.

mhh__ · on Jan 16, 2021

Ignoring that ARM isn't all that RISC-y anymore, CISC vs. RISC is basically a trade-off between density and the ease of widening execution.

wtsnz · on Jan 15, 2021

Sidenote: What a cool personal blog. Started in 2009 and still going strong. Inspirational!

bobrippling · on Jan 15, 2021

A few extra things to note for enthusiasts:

> The MacOS linker/loader doesn’t like doing relocations, so you need to use the ADR rather than LDR instruction to load addresses. You could use ADR in Linux and if you do this it will work in both.

I think this is more the assembler - the linker's perfectly happy performing relocations

> In MacOS you need to link in the System library even if you don’t make a system call from it or you get a linker error. This sample Hello World program uses software interrupts to make the system calls rather than the API in the System library and so shouldn’t need to link to it.

You can get around this by creating a statically linked executable, which requires a bit of wrangling, but is supported (and perhaps handy if you're going on to write a kernel).

> In MacOS the default entry point is _main whereas in Linux it is _start. This is changed via a command line argument to the linker.

In macOS the default entry point is start (linux is still _start), the C runtime still needs to be setup - the kernel can't jump a program straight to main.

95014_refugee · on Jan 16, 2021

> In macOS the default entry point is start (linux is still _start), the C runtime still needs to be setup - the kernel can't jump a program straight to main.

macOS has no interest in what the symbol is called; it pulls the initial PC from the LC_MAIN command in the Mach-o header.

ld64 (the linker) will by default populate that load command by looking up “_start”, but that’s a separate thing...

95014_refugee · on Jan 16, 2021

> I think this is more the assembler - the linker's perfectly happy performing relocations

No, this is “all executables are PIE because applying relocations to shared code is stupid and inefficient in the presence of ASLR”.

mark_l_watson · on Jan 15, 2021

That is so cool. I think that I am going to but his book.

It has been years since I wrote assembly code. I will time box an hour or two this weekend to play with this, but just for fun.

mark_l_watson · on Jan 15, 2021

To late to edit, I meant “buy his book”. Typo.

userbinator · on Jan 15, 2021

Class AMSupportURLConnectionDelegate is implemented in both ?? (0x1edb5b8f0) and ?? (0x122dd02b8). One of the two will be used. Which one is undefined.

It's both amusing and sad to see even simple command-line tools spewing those warnings which are most commonly found in the system log. I don't have a system to check at the moment, but I seem to recall using xcrun a long time ago and it didn't do that. I guess it is as they say, "beauty is only skin deep"...

nsonha · on Jan 15, 2021

isn't this just ARM assembly, is there such a thing called M1 assembly (genuine question)?

_ph_ · on Jan 15, 2021

Yes, it is just ARM assembly. But the tooling (Clang vs GCC) and the operation system differ, so the code from the book won't work without changes on a Mac. The book already contains a chapter for using the ARM assembly on iOS, because you have to do the same changes vs. running the sample code on Linux. The linked blog posts lists these changes. They are not large, but significant enough, that someone learning how to code in ARM assembly will be thankful for adjusted example code.

fulafel · on Jan 15, 2021

Yes. I guess the author is (successfully) going along with the general popular hype that paints it as a new architecture.

_ph_ · on Jan 15, 2021

I am certain, he isn't. The book also has a special chapter for iOS, explaining the changes needed to what is shown in the book to run the code with the iOS operation system and tooling.

fellellor · on Jan 15, 2021

Sorry for the noob question, but here it is anyway. What kind of projects and career paths does this kind of programming enable?

KMag · on Jan 15, 2021

The direct uses would be either very specialized portions of the kernel or C/C++ runtime (mutexes, etc.) or performance-critical code, where you'd typically look at C/C++ compiler output, tweak it by hand, and then insert your modifications into the C/C++ code as inline assembly.

Though, a small amount of assembly and a small understanding of processor design and implementation (I highly recommend MIT 6.004) go a long way into having a better mental model of what Java/JavaScript/C/C++ are actually doing under the covers and what's going on.

I was working on a project where I looked at the Philox4x32 random number generator specification and implemented it in Java. A colleague was trying to speed things up by porting it to C++ and so he just used the liberally licensed reference C++ implementation. He couldn't figure out why my Java implementation was significantly faster than the C++. I had a look, at the C++ implementation, and they missed an optimization to save one register in the inner loop. I didn't disassemble the native binary or have the JVM dissassemble JIT'd code, but I guessed that a register spill (32-bit x86 is pretty register starved) in the inner loop was the problem. Just by applying my optimization, he got the C++ implementation faster than the Java implementation.

fellellor · on Jan 15, 2021

Sounds cool. I’m interested in this now.

KMag · on Jan 15, 2021

In my experience, reading assembly, and being able to anticipate roughly what the compiler is going to produce is much more important than writing assembly. Learning to write a bit makes you better at reading it.

Next time you have some tricky inner loop showing up in profiles, write a few alternative implementations, benchmark each, and use the -S flag to get gcc/clang to dump out assembly.

There are some people who write a lot of assembly language, but I think orders of magnitude more people read assembly regularly while writing it rarely.

mhh__ · on Jan 15, 2021

My snarky answer to that is that I genuinely think any software engineer should be able to read and write assembly, even if purely to remind them just how far up the stack we usually operate at.

If we take "this" to mean low-level programming, it can open doors into anywhere from reverse engineering, to OS development etc (or more boringly, writing toolchains). Some parts of compilers rely heavily on having real problem solving skills at this level - for example most books won't teach you how to integrate ELF into your compiler, let alone DWARF.

I am hesitant to say performance optimization because many people have an idea of a CPU just executing one instruction at a time rather than the heavily pipelined and out of order monsters we have today.

To become Anger Fog, you must first invent the CPU-niverse.

(Agner is a almost mythical figure for many compiler and microarchitecture folks)

2sk21 · on Jan 15, 2021

When I was a CS PhD student, I spent the summer of 1988 writing assembly code on a parallel machine (Encore Multimax) that was running a variant of Unix. I wrote the runtime system (a virtual machine actually) for a parallel programming language. Haven't used assembly language since then!

fellellor · on Jan 15, 2021

I still don’t get it. This sounds cool, but it seems like work other people have already done and done well. My impression of the computing world is that once someone does something very well, it shuts the door for new entrants. There’s no more low hanging fruit left. So unless you think this is something cool, there’s not much left here.

Tell me how stupidly wrong I am (gently).

tiborsaas · on Jan 15, 2021

Besides understanding better what your language of choice is doing under the hood (vaguely) you can write cool demos like this: https://www.youtube.com/watch?v=sWblpsLZ-O8

fellellor · on Jan 15, 2021

That is insane. 256 bytes!!

saagarjha · on Jan 15, 2021

Someone doing something very well doesn't mean you can't take the time to understand what they did or replicate their work.

Intermernet · on Jan 15, 2021

I'm somewhat delighted that your first mention of Agner got auto-corrected to "Anger Fog" :-)

I like to think that Agner would appreciate that!

mhh__ · on Jan 15, 2021

Agner Fog becomes Anger Fog when measures Intel Compilers on AMD...

megablast · on Jan 15, 2021

Looking forward to the updates benchmarks on hello world programs.

musicale · on Jan 15, 2021

The macOS kernel is Xnu or Darwin, not Linux. Try this on macOS vs. Linux:

    uname -a

macOS system call numbers and interfaces are derived from (Free)BSD.

95014_refugee · on Jan 16, 2021

Some, a long time ago. Others come from Mach, and there’s ~20 years worth of independent work and divergence between then and now...

cylinder714 · on Jan 15, 2021

Many thanks for any assembly-language posts, but can we agree that a FizzBuzz implementation would be more interesting? ;-)

saagarjha · on Jan 15, 2021

  #include <sys/syscall.h>
  
  modulo:
      udiv x9, x0, x1
      msub x0, x9, x1, x0
      ret
  
  print_fizz:
      mov x0, #0
      adr x1, fizz
      mov x2, #4
      mov x16, SYS_write
      svc #0x80
      ret
  
  print_buzz:
      mov x0, #0
      adr x1, buzz
      mov x2, #4
      mov x16, SYS_write
      svc #0x80
      ret
  
  print_number:
      mov x9, x0
      mov x10, #10
      udiv x11, x0, x10
      mov x0, #0
      adr x1, number_table
      add x1, x1, x11
      mov x2, #1
      svc #0x80
      msub x11, x11, x10, x9
      mov x0, #0
      adr x1, number_table
      add x1, x1, x11
      mov x2, #1
      svc #80
      ret
  
  print_newline:
      mov x0, #0
      adr x1, newline
      mov x2, #1
      mov x16, SYS_write
      svc #0x80
      ret
  
  .globl _main
  .align 2
  _main:
      mov x19, #0
  loop:
      mov x20, #0
      mov x0, x19
      mov x1, #3
      bl modulo
      cmp x0, 0
      cset x20, eq
      b.ne not_3
      bl print_fizz
  not_3:
      mov x0, x19
      mov x1, #5
      bl modulo
      cmp x0, 0
      cset x21, eq
      b.ne not_5
      bl print_buzz
  not_5:
      orr x20, x20, x21
      cmp x20, #0
      b.ne divisible
      mov x0, x19
      bl print_number
  divisible:
      bl print_newline
      add x19, x19, #1
      cmp x19, #100
      b.le loop
      mov x16, #0
      svc #0x80
  
  fizz:
      .ascii "Fizz"
  buzz:
      .ascii "Buzz"
  number_table:
      .ascii "0123456789"
  newline:
      .ascii "\n"

tcmb · on Jan 15, 2021

I'm confused, why is this showing a Linux assembly file and then the Makefile and build instructions for macOS?

high_priest · on Jan 15, 2021

Does this blog have RSS feed?

banana_giraffe · on Jan 15, 2021

https://smist08.wordpress.com/feed/

pedro1976 · on Jan 15, 2021

Just in case you are looking for an rss feed, you can always give rss-proxy [0] a try.

[0] https://github.com/damoeb/rss-proxy

SecurityLagoon · on Jan 15, 2021

Thanks for this. Seems like another great service to run on my pi.

Getting a little tangential but how well do you find it works in general? I imagine it is only truly reliable with SSG type websites.

pedro1976 · on Jan 15, 2021

I think it works fine, it makes a website of choice accessible for your Reader. Since it just maps HTML -> RSS the entry description can be sparse, but then your reader should grab the fulltext of the referenced site.

There is basic JavaScript support [0], but this is rather new and not sure how well it works. I tested it with a couple of webapps and craigslist and it worked ok. If you host it on your pi, you can just run the big js-image.

[0] https://github.com/damoeb/rss-proxy#javascript-support

khanan · on Jan 15, 2021

Funny how you can replace MacOS with FreeBSD in this article and it's the same thing. OH WAIT! eyeroll

astrange · on Jan 15, 2021

It isn't the same thing since macOS doesn't use ELF.

mhh__ · on Jan 15, 2021

Is mentioning M1 just guaranteed karma now?

sph · on Jan 15, 2021

Apparently yes, I doubt a generic ARM assembly tutorial that writes "Hello, world" to stdout would get almost 400 points. It's not like M1 is a totally new architecture.

Conversely, saying anything bad about M1 or anything related to it, like this article, is a good way to piss people off on this forum.

andy_threos_io · on Jan 15, 2021

And it's a total lame, bad assembly code. Lacks any kind of material knowledge or practice.

Hackbraten · on Jan 15, 2021

Your comments in this thread have been belittling and unhelpful. We don’t like that here at all. Please stop.

andy_threos_io · on Jan 15, 2021

The example program presented is not a good solution, it goes against all the rules and recommendations.

Your program (made this way) will NOT WORK at all in the near future, ex when the system call number is changed (it's happened).