8086 Microcode Disassembled (reenigne.org)
233 points by matt_d on Sept 5, 2020 | 51 comments


Chris Gerlinsky did a talk last year on the process he uses to decap chips and extract their ROM bits with a microscope: https://youtu.be/4YpSevQWCX8

My favorite line was when he described how one hint you've got the decoding right might be stumbling upon a recognizable ASCII string, and said "sometimes the only ASCII text you find is a copyright notice... keep putting those in, that's great!"


One thing I've always wondered: in what ways is the design of microcode instruction-sets for CISC-ISA CPUs different from the design of outward-presenting RISC ISAs?

For example, does microcode tend to have instructions that "half complete" a transfer-level operation, leaving some registers in an indeterminate state, under the assumption (which is, in practice, a guarantee) that they'll always have another ucode op executed after them that does "the rest of" the operation and so puts things right?

Or, for another example, on CISC CPUs that have a small set of system-visible registers, and use register renaming to map them to a larger register file (e.g. x86_64), do the user-visible register names make it into the microcode; or do the microcode ops function directly on register-file offsets?

To answer these questions, though, we'd probably need a survey of microcode for various CPUs, including modern ones. So I'm not holding my breath. Unless an engineer from Intel or the like wants to jump in!

------

I've also been curious whether there are any lessons in the design of microcode ISAs, that can be applied in the design of abstract-machine bytecode ISAs.

Right now, most bytecode ISAs are semi-idealized RISC ISAs, with some load-time specialization of bytecode into VM-specific ops; but rarely is there recompilation of bytecode into VM-specific microcode. I'm curious why that is.


> in what ways is the design of microcode instruction-sets for CISC-ISA CPUs different from the design of outward-presenting RISC ISAs?

There's a lot of variability in microcode designs, but based on the microcodes I've examined closely (various IBM 360 mainframes, Xerox Alto, 8086), there are several characteristics.

Microcode is usually much wider than instructions (21 bits for the 8086, over 100 bits for some IBM machines). Microcode is usually doing several things in parallel. An instruction set is designed to be general-purpose and "make sense", while microcode is nearly incomprehensible and does whatever bizarre tricks are necessary to implement just what is needed for the instruction set. (One important factor is that microcode doesn't need to be backwards or forwards compatible, so designers can do whatever they want.) Microcode's relationship with memory is different since you're dealing with address and data registers, not abstract reading and writing of memory. Microcode needs to worry a lot more about timing. For instance, in the 8086, an ALU operation is set up a cycle before it happens. In the Xerox Alto, conditional branches happen a cycle after you issue them.

For your specific question about registers, much of the 8086's microcode ignores the specific register names, saying things like move the generic source register to the ALU. The hardware selects the appropriate register based on the instruction, direction bit, etc. (I'm in the middle of writing a blog post about this.)

For a more modern look at microcode, the book "The Anatomy of a High-Performance Microprocessor" describes the AMD K6 processor in way more detail than you'd want.


There are two rough categories of microcode, with different answers for each.

This article is an example of vertical microcode: 15- to 64-bit-wide instructions that look a lot like regular assembly, just what you'd expect after the first stage of decoding, with some of the encoding wrinkles ironed out. Source register always in the same place in the bit pattern, that kind of stuff. This will normally look a lot like the host ISA; two-address versus three-address kind of stuff. Might still have memory RMW ops if it's a CISC ISA.

Then you have horizontal microcode, which is a wider word; I've seen 64 bits to 256 bits. It's simply most of the control lines for the processor uarch state concatenated together. You'll sometimes have dozens of fields that always mean the same thing, and when you look at the schematic you clearly say to yourself "oh, these three bits are this mux, the four bits next to them are this mux, this next line is an enable", etc.
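To make the "fields of control lines" idea concrete, here's a minimal sketch in Python. The field layout (names, offsets, widths) is entirely made up for illustration; real horizontal microcode layouts are specific to each microarchitecture.

```python
# Hypothetical horizontal-microcode word. Each field drives one piece of
# uarch state (a mux select, an enable line, etc.). Layout is invented.
FIELDS = [
    # (name, bit offset, width)
    ("src_mux", 0, 3),   # which register drives the source bus
    ("dst_mux", 3, 3),   # which register latches the result
    ("alu_op",  6, 4),   # ALU operation select
    ("mem_en", 10, 1),   # memory cycle enable line
]

def decode(word: int) -> dict:
    """Slice a microcode word into its named control fields."""
    return {name: (word >> off) & ((1 << width) - 1)
            for name, off, width in FIELDS}

# A word encoding src_mux=5, dst_mux=2, alu_op=9, mem_en=1:
word = 5 | (2 << 3) | (9 << 6) | (1 << 10)
print(decode(word))  # {'src_mux': 5, 'dst_mux': 2, 'alu_op': 9, 'mem_en': 1}
```

In hardware there's no "decode" step at all, of course: each bit range is wired straight to its mux or enable line, which is exactly what makes the fields so recognizable on a schematic.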

It's not uncommon to have both. The 68k's main microcode looks vertical, then has a horizontal microcode for its "nanocode" deeper in the core.


Could you explain the flow, or how the vertical and horizontal microcode interact? If a bit pattern is read into the instruction register, is that pattern used as an index into a translation ROM? Is that translation the horizontal microcode that drives everything, including further lookups and translations to micro-ops, i.e. vertical microcode?


Take the 6502 as an example: it has a decode ROM which is similar in concept, but far more primitive, and fixed rather than programmable.

Instructions take a length of time from 2 to 7 cycles, with an additional cycle under certain conditions.

The decode ROM determines what is done in each of those cycles and allows the modularization of circuitry for common purposes among instructions.
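A toy model of that idea: the decode ROM is essentially a table keyed by (opcode, cycle) whose output is the set of control signals asserted that cycle. The opcodes below are real 6502 opcodes, but the signal names and cycle breakdown are simplified for illustration.

```python
# Toy decode-ROM model: each (opcode, cycle) pair selects control signals.
# Signal names are illustrative, not the 6502's actual control lines.
DECODE_ROM = {
    # LDA #imm (0xA9), 2 cycles
    (0xA9, 0): {"fetch_opcode"},
    (0xA9, 1): {"fetch_operand", "load_a", "set_nz"},
    # INX (0xE8), 2 cycles
    (0xE8, 0): {"fetch_opcode"},
    (0xE8, 1): {"inc_x", "set_nz"},
}

def run_instruction(opcode: int):
    """Step through an instruction's cycles, yielding each control set."""
    cycle = 0
    while (opcode, cycle) in DECODE_ROM:
        yield DECODE_ROM[(opcode, cycle)]
        cycle += 1

print(list(run_instruction(0xE8)))
# [{'fetch_opcode'}, {'inc_x', 'set_nz'}]
```

The modularization the comment mentions falls out naturally: different instructions share rows (here, the fetch cycle), so the same circuitry serves many opcodes.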

I think CPUs such as the Z80 and definitely the 68000 had more sophisticated mechanisms where the microcode was really a sub-CPU executing actual micro-instructions.

> I've also been curious whether there are any lessons in the design of microcode ISAs, that can be applied in the design of abstract-machine bytecode ISAs.

I think something like Itanium's VLIW breaks this barrier between microcode and ISA more than other ISAs. The lesson was that it's too difficult to port legacy software to it, so we keep going with CPUs that continue to support the appearance of an in-order ISA initially developed in the 1970s, with more and more Frankensteined extensions as time goes on.


Microcode instruction sets have different engineering trade-offs from the user-visible ISA. In microcode, memory bandwidth isn't such an issue, so microcode instructions can be relatively wide (21 bits compared to 8 for the 8086).

The microcode can also be relatively difficult to write. For example, in the 8086 microcode I saw one place where there is a "DEC2 tmpc" microinstruction (subtract 2 from tmpc), and tmpc is only loaded after that, after which the correct result is available. (This makes sense when you think about how the ALU works on the chip, but in any normal ISA you have to load the values into the operands before you perform operations on them.)

There's nothing in the 8086 microcode which creates any temporary undetermined states as far as I can tell, but there may be combinations of microinstructions which could create a race condition.


In my experience, CISC instruction sets are primarily a way to create compact representations of fairly long instruction words.

That arises from the way in which the register file, the ALU, the flags, and various counters (instruction, stack, Etc) are laid out in logic and the buses between them.

Something that I find really fun to do is to experiment with compute architectures. I first started playing around with this with an Altera DE2 board, then the Spartan III dev board from Xilinx, and these days with a Lattice iCE40 board (iCEBreaker, and soon a ULX3S board). There are a number of "soft" CPUs where you can play around with this to your heart's content (given you have enough gates in your FPGA, which is getting easier and easier these days).

CISC instructions are, as the name suggests, simply "subroutine calls" or "macro calls" (depending on what era of computing you were introduced to) into an underlying machine that can move bits around each "clock". RISC is essentially making that level of instruction available directly to the compiler.

The most infamous version of exposing what is essentially microcode to the compiler was, in my opinion, the Itanium, which had a really flexible native instruction set that the compiler mixed and matched into pseudo-instructions to compile code against. A more elegant version of this was the Xerox PARC "D" machines, which allowed you to load an instruction set prior to booting into your actual applications. This made Mesa[1] development interesting, because you needed the appropriate instruction set to go with the Mesa compiler you were using.

[1] Mesa was Xerox's modular development language that inspired Wirth's Modula 2 (I believe that was the ordering, Wirth might claim it went the other way)


>" There are a number of "soft" CPUs where you can play around with this to your hearts content (given you have enough gates in your FPGA, which is getting easier and easier these days)."

Could you make some recommendations for a newcomer in terms of ease of use and good tooling? Thanks.


Perhaps the easiest way to get set up is to get a ULX3S from Crowd Supply[1], and then download the "official" tool chain from Github[2]. This project[3] is a good hackable target. There are interesting examples for this board here[4].

A better collection of links are here: https://ulx3s.github.io/

You'll need to develop some HDL skills (which is a deep dive) but it is well worth it to make experimental computer architectures!

[1] https://www.crowdsupply.com/radiona/ulx3s

[2] https://github.com/ulx3s/ulx3s-toolchain

[3] https://github.com/SpinalHDL/SaxonSoc/tree/dev-0.1/bsp/radio...

[4] https://github.com/emard/ulx3s-examples


Thanks for the recommendation; this is a great list of links. You stated: >"You'll need to develop some HDL skills ..."

Do you have some recommendations on where to begin with that?

Thanks


Well, there are two popular HDLs: one is Verilog and the other VHDL. In my experience Verilog is more popular, both in GitHub repositories and around the industry, so you could get something like "Verilog by Example"[1], which is a pretty concise jump into building things with Verilog.

There are a number of universities that put their Verilog course notes on-line and that is a good source of reference material as well.

[1] https://www.amazon.com/Verilog-Example-Concise-Introduction-...


Outstanding work - never fails to amaze me when people unearth little secrets like that 4 decades after the fact. That MUL/IMUL/IDIV status bit hack is one for the ages.


And he found it by “reading the code”!


Author here if anyone has any questions.


Does the microcode give any hints on why the general PUSH and POP are in completely different places in the opcode map (push is FF/6, pop is in its own group in 8F/0 with 8F/1-7 invalid, while FF/7 is unused)? It almost looks like FF/7 was supposed to be the pop. I've always wondered what 8F/1-7 and FF/7 do on an 8086/8 too, but it's very hard to find that information.


What's "random logic"? From context, it sounds like circuitry that explicitly implements the functionality of an opcode, as opposed to circuitry that can be used by the microcode, or something?


Yes, exactly - the logic that implements the simpler instructions directly as special-purpose gates rather than microcode.


To expand on that, "random logic" means that it looks random; it's not actually random. This is in contrast to circuits that have an underlying structure to them, like a PLA or ROM.


> While most of the unused parts of the ROM (64 instructions) are filled with zeroes, there are a few parts which aren't. The following instructions appear right at the end of the ROM [...]

Given that they're right at the end — and seemingly intentionally written there after the rest of the unused space before them was zeroed — might those bytes be a checksum of the ROM?


I don't think there's anything on the chip that could compute a checksum of the microcode ROM contents. It could be some kind of copyright message perhaps, though I don't know how it's encoded and it's only 42 bits long so there isn't much space for anything meaningful.


I would guess that it’s not a runtime-verified checksum, but rather a simple embedded “sum complement” value, used for ROM-mastering-time integrity verification.

A sum-complement value is a value computed from some data, such that, when the data is checksummed with the sum-complement value now embedded into it, the data will sum to zero. This approach to checksumming is useful, as any potential verifier just has to throw the image-as-a-whole through the checksumming algorithm, and ensure that the output is zero. It doesn’t need one iota of knowledge about what it’s verifying. It doesn’t even need an extra machine-register to hold the expected checksum.
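The arithmetic is simple enough to sketch in a few lines of Python. This assumes the simplest scheme the comment describes, a one-byte sum modulo 256; real ROMs used various widths and algorithms.

```python
def sum_complement(data: bytes) -> int:
    """The byte that makes the whole image sum to zero (mod 256)."""
    return (-sum(data)) % 256

def verify(image: bytes) -> bool:
    """A 'blind' verifier: needs no knowledge of what it's checking."""
    return sum(image) % 256 == 0

rom = bytes([0x12, 0x34, 0x56])           # toy ROM contents
image = rom + bytes([sum_complement(rom)])  # embed the complement

print(verify(image))                        # True
print(verify(image[:-1] + b"\x00"))         # False: corrupted copy fails
```

Note that the verifier has no idea where the complement byte lives, or even that there is one; it just sums the whole blob and checks for zero, which is exactly the property that makes the scheme useful for dumb programmers and copiers.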

These “blind” checksums allow ROM production hardware (programmers, copiers) to both pre-verify the integrity of the input image, and to post-verify that it has programmed the image onto a chip successfully. No special container format for the ROM image is required, nor is the ROM image required to be structured in any particular way (which is good, because ROMs are used for all sorts of things, not just code.) The ROM image can be any opaque blob, just as long as it sums to zero.

In fact, you don’t even need a ROM “image” at all. It’s possible to integrity-verify a programmed ROM “against itself”; and thus, a hand-programmed ROM (e.g. an EEPROM you programmed in your office) can be sent to the duplication facility to serve as the reference from which mask-ROM masks will be generated. The data on the EEPROM can be trusted, because it sums to zero. And the mask ROMs themselves can be checked for flaws by seeing whether they sum to zero.

For smaller-scale ROM distribution, ROM-to-PROM bulk copiers are used. These copiers can be made to both pre-verify the source, and to post-verify the programmed copies. Using this approach to checksumming, the copier can avoid having to verify the source “against” the destination, instead only needing to verify the source once, and then verify the destinations against themselves. This both speeds up verification; and allows for the use of simpler microcontrollers in these copiers, which reduces their design cost. (By quite a lot, back in the 1970s, when all this was most relevant.)

You can see this approach to checksumming in practice in early-generation game cartridge ROMs, which almost always have these embedded sum-complement values (and so presumably were integrity-verified during mastering/duplication.) These sum-complement value fields get referred to by emulators as “the checksum” of the ROM image—but technically, they’re not; if you’re following along, you’ll realize that “the checksum” of such ROM images is zero! :)


"In fact, you don’t even need a ROM “image” at all."

What exactly is a ROM image? Is it just the ROM contents encoded in some defined file format? If so what would a common format be.


I was being kind of loose with terminology; technically, a “ROM image” is an image (i.e. a replica, like a disk image) of a ROM chip.

ROM is random-access for reads—it’s “memory” in the same sense that RAM is memory, wiring onto a device’s address bus and so becoming part of that device’s physical memory layout.

So when people say that a game-cartridge backup device or the like captures a “ROM image”, what they really mean is that it captures “a snapshot of what the mapped region of the address space that the ROM chip claims to map for — or seems to be wired to — looks like.” Sometimes there’s metadata in the ROM itself saying what region the ROM maps for. But since the ROM is just a physical chip sitting on the bus, it can map or not map for any address arbitrarily (as long as it has the correct address lines wired to discriminate that address from other addresses.)

This is what results in so-called “overdumps” — this is where a ROM chip doesn’t actually respond to all the read requests that its mapping claims it does, and thus, for some reads (usually the ones at the top end of the ROM’s address space) you don’t get a response from the ROM, leaving the data bus floating (“open bus”), giving you undefined data for those reads.

This is why I say that a ROM image is technically an image of the address space a ROM occupies as discovered by requesting those addresses, and not an image of the ROM's contents per se: most ROM images are, in fact, overdumps. It's just that more modern systems have pull-up resistors on the data bus to ensure that reads the ROM doesn't deign to respond to read off as 0xFF rather than indeterminate garbage.

ROM copiers are really “ROM image” copiers — they work by programming the destination ROM(s) with the data discovered by probing the source ROM’s address space, as above. If the destination ROM is larger than the source ROM, the destination ROM will record an overdump of the source ROM.

All that being said, when originally programming an EEPROM, the ROM-programming device doesn’t actually interface to your computer as writable random-access memory. It interfaces as, essentially, a hybrid serial/block device — i.e. a device where you can either write (program) one byte to an arbitrary address, or write (program) a whole ROM-block (usually 64 bytes) at a time. You can also erase an entire block.

In other words, functionally, an EEPROM accessed through a programming device acts very similarly to flash memory accessed through a flash controller. (Flash memory is, in essence, an EEPROM technology that trades byte-at-a-time erasability for much faster block-at-a-time writes and erases.)

What that means, in practice, is that there’s no particular constraint on how you first program the data into the EEPROM you’re going to be mastering PROMs with. There’s no “ROM programmer file format”, any more than there’s a common file format used to descriptively represent the instructions the various mkfs(8) utils use to initialize filesystems onto a block device. Programming EEPROMs is a procedure, not data per se.

That being said, if we wanted to represent the process of programming an EEPROM using modern file formats, a CUE sheet (or equivalent) would probably be the best approach. A CUE sheet isn’t a description of the intended result, but rather a sequence of instructions for an abstract “burner” to go through to produce a result. Unlike a ROM image, which just tells you what you got when you tried to read from the addresses in an assumed-mapped memory region, a CUE sheet tells you what some other device originally tried to put at those addresses, and so lets you figure out which reads are “true” answers from the ROM, vs “open bus” answers, vs. de-facto responses from a pull-up resistor. (It also lets you emulate the process of cell wear, and so figure out which cells were intentionally “programmed to death”, allowing a faithful representation of “indeterminate state” addresses, much like the Applesauce image format[1] does for magnetic-flux media.)

[1] https://wiki.reactivemicro.com/Applesauce#Applesauce_Image_F...

So, to be clear, there's no defined file format for ROMs generally. You know the size of the EEPROM chip sitting in the programmer; you have some data you'd like to write (maybe in a file; maybe as a stream); as long as the size of the data is less than the size of the chip, you can just dd(1) the data, blockwise, onto the programmer block-device, and you'll get a programmed EEPROM.

But if you want to make this friendly to consumers — say, if the EEPROM is your computer's BIOS ROM — then you take a ROM image you've constructed some other way; wrap it in your own format with checksums et al; create a "flasher" program that first verifies the integrity of the ROM image against the checksum, and then dd(1)s it to the EEPROM programmer block-device. Usually the file extension OEMs decided on for these ROM-in-container files was ".bin". Doesn't mean anything; they were arbitrary formats, or sometimes not formats at all, just raw ROM images.


Thanks for the wonderfully detailed reply. I had a follow-up question: does the ROM designer, or any part of the ROM itself, ever have to know where in memory it is mapped?


They almost certainly do. The set of system architectures that relied heavily on memory-mapped ROM, are almost exactly the same as the set of systems that don't have any concept of virtual memory, and where achieving position-independence (i.e. indirecting through some kind of symbol table) would be a huge waste of CPU cycle budget.

An interesting "exception that proves the rule" is the "option ROMs" (https://en.wikipedia.org/wiki/Option_ROM) on modern PCI-e cards, e.g. GPUs, NVME controllers, etc. which provide capabilities to the BIOS, like writing to the GPU's framebuffer.

These ROMs aren't position-independent (i.e. they always get mapped to the same physical memory region during BIOS bring-up) but their contents are position-independent code. This is because they're not actually ROM that lives on the CPU's address bus where the CPU could ever execute from it; but rather these ROMs live on the MMIO bus, which in x86 at least, can only be interacted with via specific IN/OUT instructions.

As such, even though BIOS option ROMs all wire to the same physical address† on the MMIO bus, they get copied into RAM in order for the CPU to execute on them, and so the code in those ROM chips has to be position-independent code.

† You might wonder, then, how the BIOS manages to read off a particular option ROM, when multiple ROMs could be wired to the same MMIO address, and thereby all respond to the same latched MMIO in request, making a mess of the MMIO data lines. My understanding of the spec, is that the BIOS just powers PCI/PCI-e devices on and off one by one during early boot, such that only one option ROM can be wired at a time; and does all its interaction with said ROM while it's isolated like this. The ability to do this "early power-on" — that maybe only powers on the wired ROMs and nothing else — is an important part of what it means for a PCI device to be "Plug-and-Play"!


>"My understanding of the spec, is that the BIOS just powers PCI/PCI-e devices on and off one by one during early boot, such that only one option ROM can be wired at a time; and does all its interaction with said ROM while it's isolated like this. The ability to do this "early power-on" — that maybe only powers on the wired ROMs and nothing else — is an important part of what it means for a PCI device to be "Plug-and-Play"!"

Interesting. Is this what actually makes the boot times for servers with a handful of option ROMs so painfully slow, then?


Does the MUL/IMUL/IDIV result-negation trick (via REP prefix) work on later 8086-compatible Intel CPUs (e.g. 80286, 80386, etc.)?


I have just learned from dreNorteR on VCF that it has no effect on a 286, but has a different, unexpected, and useful effect on a 186! http://www.vcfed.org/forum/showthread.php?76657-8088-8086-mi...


I wonder if you could do a version of this article for a lay person like me? I really enjoyed Ken's articles because it assumed very little knowledge.


If ROM space is so valuable, why is ASCII text (which I imagine is relatively large) stored on it?


Yup. Who did write the microcode at the time? And for how long?


According to https://en.wikipedia.org/wiki/Intel_8086: "The architecture was defined by Stephen P. Morse with some help and assistance by Bruce Ravenel (the architect of the 8087) in refining the final revisions. Logic designer Jim McKevitt and John Bayliss were the lead engineers of the hardware-level development team and Bill Pohlman the manager for the project." I expect the microcode was developed in tandem with the rest of the chip, so probably took about 2 years.


It seems like there have been a few disassembly write ups on the 8086 lately. Are the tools getting to the point where this is possible, or just enough people with enough serious interest in this? Coincidence? Am I seeing a pattern that isn't really there?


Probably not entirely a coincidence - Ken Shirriff is doing a series on the 8086 which may account for at least one of the other articles you've noticed. My disassembly was only possible because of Ken's high-resolution photos of the die with the metal layer removed - that's why it took me until now to do it.


So it's turtles all the way down? Someone makes a breakthrough that gets used by someone else to make a different breakthrough, kind of a thing. This is why science needs to be open: no one person/group can do it all. I just wish that research didn't have to be done in secret to protect potential patentability. Let the work be published, and let the people responsible receive whatever credit/recognition/awards they deserve.

kudos for your efforts!


I apologize if this is a naive question, but how do I make sense of microcode_8086.txt in the zip file? Just using line 1, or 000: I understand from the final column(s) that this concerns a MOV instruction and a ModR/M byte, is that correct? How do I understand what everything to the left of that means?

000 A CD F H J L OPQR U R -> tmpb 4 none WB,NX 0100010??.00 MOV rm<->r

Similarly how do I understand what each of the different files in the zip file represents?

The author states: >"I used bitract to extract the bits from the two main microcode ROMs, and also from the translation ROM which maps opcode bit patterns onto positions within the main microcode ROM."

Is the translation ROM the translation.txt file then? Is that the key to understanding these files? If so, why wouldn't there be more than the 38 or so opcodes listed?


The translation.txt file is the contents of the translation ROM which tells the CPU where in the microcode to go for long jumps, calls and EA decoding. The key.txt file has the details of all the mnemonics.

"000" - this is just a line number.

"A CD F H J L OPQR U" - these are the actual bits from the ROM.

"R -> tmpb" - this is a move operation (each microcode instruction can do a move as well as something else) copying the value from "R" (a register described by the word length bit and either the R field or the RM field of the modrm byte, depending on the direction bit) to "tmpb" (an internal register not accessible from the user-level ISA).

"4 none WB,NX" - a type 4 instruction (bookkeeping) that tells the CPU that the next instruction is the last one in the microcode burst (NX) unless a write back (WB) to memory is needed.

"0100010??.00" - this is the bit pattern by which this line of microcode is addressed. This one means opcodes 0x88-0x8b.

"MOV rm<->r" - a comment added to say what this set of opcodes corresponds to in x86 assembler, or what it does if it's a subroutine.
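The "0100010??" addressing pattern is easy to play with in Python. The sketch below assumes the last 8 bits of that 9-bit field are the opcode bits and treats '?' as a don't-care, which is a simplification of the real 8086 addressing scheme (it ignores the leading bit and the ".00" suffix).

```python
# Expand a wildcard bit pattern into the set of opcodes it matches.
# '?' is a don't-care bit. Hypothetical simplification: we match only
# the 8 opcode bits, ignoring the extra addressing bits.
def matches(pattern: str, opcode: int) -> bool:
    bits = format(opcode, "08b")  # opcode as an 8-char binary string
    return all(p in ("?", b) for p, b in zip(pattern, bits))

def expand(pattern: str) -> list[int]:
    return [op for op in range(256) if matches(pattern, op)]

print([hex(op) for op in expand("100010??")])
# ['0x88', '0x89', '0x8a', '0x8b'] -- the four "MOV rm<->r" opcodes
```

The two wildcard bits correspond to the MOV instruction's direction and word-length bits, which is why one microcode line can serve all four opcodes: the hardware, not the microcode, resolves which registers are involved.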


Thank you this really helps.

One last question about the individual files. There seem to be 3 distinct file groups:

0b.txt - 8t.txt

l0.txt - l3.txt

r0.txt - r3.txt

Do each of these represent different ROMs or different logical parts of the two ROMs? Or am I reading too much into the naming convention?

By the way - brilliant work. This is really a fascinating read.


The b and t files are the bottom and top halves of the 9 "chunks" of the decoder above the main microcode ROM. The l* and r* files are the left and right halves of the four horizontal slices of the main ROM. I split them up that way because bitract needs the bits to be regularly spaced in both the horizontal and vertical directions.

Thanks - glad you enjoyed it!



Awesome stuff. Really nostalgic. An 8086 with yellow monochrome screen was my first computer. It ran Police Quest I, I think.


Have you used a green monochrome screen? I still remember the first time I got one, because it was cheaper than those newfangled amber screens.

At first I thought it was a little stupid because of how slow the fade was when the cursor blinked, and it wasn't nearly as sharp or vivid. But within the first few hours of hacking around, I recognized how much easier on the eyes it was without the flickery amber that wobbled when you clacked your teeth together, and the weird random "snow" when refreshing the screen in a text "animation."

If only fractals didn't take an hour or so to render back then, an animated one at modern speeds would have been quite soothing to watch that way.

Fractint - I'm shocked I actually remember the name. Downloading it from a BBS is how I got my second computer virus! Exciting times. Nostalgic is right.


It looks like the Linux port of Fractint was last updated in August 2020!

https://fractint.org/ftp/current/linux/


Have you come across cool-retro-term? It simulates the snow and wobble of those screens and, I think, does a reasonable job. Not sure if it would work with a graphical program, though.


I remember seeing it, I think from HN when I was still a lurker. I bookmarked it, but never got around to trying it.

Some things are best left to the fond memories. I got an sdf.org account a few months ago, and that quickly demonstrated to me that nostalgia occurs when things happened so long ago that you forgot about all the not-great things like lag, terms that don't agree on what a backspace is, newsgroup spam, every program having completely different idioms, etc. My newsfeed config randomly disappeared after the third day, which was a very accurate representation of "back in the day" that I had forgotten about until that moment.



First I've seen this. I'm surprised, because alife was my obsession for pretty much the entirety of the 90's, and I recently have been taking courses through Santa Fe Institute that touch on the subject.

Thanks for the link!


I share that obsession (at the time, anyway). Nowadays I wish I had the space for a room with at least a half dome, to project something like this on the walls like in a planetarium, to be immersed in. If I were to build a house, it would have this.

My very own eternal lava lamp! :-)


Amazing work! Can't say I understood half of what you have written, but it sure is some top-quality work!


Amazing!



