
I believe that even if there was a reason back then, it's probably no longer valid, and it's now just a performance limitation.


Other than backwards compatibility, I can imagine simplicity being a reason. For instance, sequential pushing makes it easier to calculate the sha256 hash of the layer as it's being uploaded, without having to do it after the fact once the uploaded chunks are assembled.
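
To make that concrete, here's a rough sketch in Go of how a sequential push lets the digest be computed in the same pass as the upload. The single-PUT endpoint is a simplification I made up for illustration; real registries use the two-step OCI blob upload flow.

    package push

    import (
        "crypto/sha256"
        "encoding/hex"
        "io"
        "net/http"
        "os"
    )

    // pushLayer streams a layer blob to the registry and computes its sha256
    // digest in the same pass. Sequential upload is what makes this
    // single-pass approach work: every byte goes through the hash in order.
    // (Hypothetical single-PUT endpoint, not the real OCI upload protocol.)
    func pushLayer(uploadURL, layerPath string) (string, error) {
        f, err := os.Open(layerPath)
        if err != nil {
            return "", err
        }
        defer f.Close()

        h := sha256.New()
        body := io.TeeReader(f, h) // hash the bytes as they are read for upload

        req, err := http.NewRequest(http.MethodPut, uploadURL, body)
        if err != nil {
            return "", err
        }
        resp, err := http.DefaultClient.Do(req)
        if err != nil {
            return "", err
        }
        resp.Body.Close()

        return "sha256:" + hex.EncodeToString(h.Sum(nil)), nil
    }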


The fact that layers are hashed with SHA256 is IMO a mistake. Layers are large, and using SHA256 means that you can’t incrementally verify the layer as you download it, which means that extreme care would be needed to start unpacking a layer while downloading it. And SHA256 is fast but not that fast, whereas if you really feel like downloading in parallel, a hash tree can be verified in parallel.

A hash tree would have been nicer, and parallel uploads would have been an extra bonus.
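
Rough illustration of the parallel-verification point, assuming you already have per-chunk digests from some tree-hash manifest (the chunking and manifest shape here are invented, not anything in the OCI spec):

    package treeverify

    import (
        "bytes"
        "crypto/sha256"
        "fmt"
        "sync"
    )

    // verifyChunks checks each downloaded chunk against its expected leaf
    // hash in parallel. With a single flat SHA-256 over the whole layer,
    // nothing can be trusted until the entire blob has been downloaded.
    func verifyChunks(chunks [][]byte, want [][32]byte) error {
        if len(chunks) != len(want) {
            return fmt.Errorf("expected %d chunks, got %d", len(want), len(chunks))
        }
        errs := make([]error, len(chunks))
        var wg sync.WaitGroup
        for i := range chunks {
            wg.Add(1)
            go func(i int) {
                defer wg.Done()
                got := sha256.Sum256(chunks[i])
                if !bytes.Equal(got[:], want[i][:]) {
                    errs[i] = fmt.Errorf("chunk %d: digest mismatch", i)
                }
            }(i)
        }
        wg.Wait()
        for _, err := range errs {
            if err != nil {
                return err
            }
        }
        return nil
    }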


sha256 has been around a long time and is highly compatible.

blake3 support has been proposed both in the OCI spec and in the runtimes; at least on the runtime side, I expect that to land soon.

I tend to think gzip is the bigger problem, though.


> sha256 has been around a long time and is highly compatible.

Sure, and one can construct a perfectly nice tree hash from SHA256. (AWS Glacier did this, but their construction should not be emulated.)


But then every single client needs to support this. sha256 support is already ubiquitous.


Every single client already had to implement enough of the OCI distribution spec to be able to parse and download OCI images. Implementing a more appropriate hash, which could be done using SHA-256 as a primitive, would have been a rather small complication. A better compression algorithm (zstd?) is far more complex.


I don't think we can compare reading json to writing a bespoke, secure hashing algorithm across a broad set of languages.


Reading JSON that already contains a sort of hash tree. It's a simple format: a mess of hashes that need to be verified over certain files.

Adding a rule that you hash the files in question in, say, 1 MiB chunks and hash the resulting hashes (and maybe that’s it, or maybe you add another level) is maybe 10 lines of code in any high level language.
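
Something like this, sketched in Go with an arbitrary 1 MiB chunk size (and, as the reply below points out, with no leaf/parent domain separation yet):

    package treehash

    import "crypto/sha256"

    const chunkSize = 1 << 20 // 1 MiB, an arbitrary choice for illustration

    // rootHash hashes the blob in fixed-size chunks, then hashes the
    // concatenation of the chunk digests. Two levels is enough for the
    // "hash of hashes" idea described above.
    func rootHash(data []byte) [32]byte {
        var leaves []byte
        for off := 0; off < len(data); off += chunkSize {
            end := off + chunkSize
            if end > len(data) {
                end = len(data)
            }
            leaf := sha256.Sum256(data[off:end])
            leaves = append(leaves, leaf[:]...)
        }
        return sha256.Sum256(leaves)
    }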


Note that secure tree hashing requires a distinguisher between the leaves and the parents (to avoid collisions) and ideally another distinguisher between the root and everything else (to avoid extensions). Surprisingly few bespoke tree hashes in the wild get this right.
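
For example, one common fix is to prefix a distinct byte to leaf and parent inputs; the 0x00/0x01 values follow the certificate-transparency-style Merkle tree convention, and the extra root prefix here is just for illustration:

    package treehash

    import "crypto/sha256"

    // Domain-separation prefixes keep a leaf hash from ever colliding with a
    // parent hash, and mark the root so it can't be extended into a larger tree.
    const (
        leafPrefix   = 0x00
        parentPrefix = 0x01
        rootPrefix   = 0x02 // extra distinguisher for the final root
    )

    func hashLeaf(chunk []byte) [32]byte {
        return sha256.Sum256(append([]byte{leafPrefix}, chunk...))
    }

    func hashParent(left, right [32]byte) [32]byte {
        buf := append([]byte{parentPrefix}, left[:]...)
        return sha256.Sum256(append(buf, right[:]...))
    }

    func hashRoot(topLevel []byte) [32]byte {
        return sha256.Sum256(append([]byte{rootPrefix}, topLevel...))
    }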


This is why I said that Glacier’s hash should not be emulated.

FWIW, using a (root hash, data length) pair hides many sins, although I haven’t formally proven this. And I don’t think that extension attacks are very relevant to the OCI use case.


That doesn't make much sense: the network is usually a much bigger bottleneck than compute, even accounting for disk reads. You'd be paying quite a lot for "simplicity" if that were the case.


I'm no expert on Docker, but I thought the hashes for each layer would already be computed when your image is built.


It's complicated. If you are using the containerd-backed image store (still opt-in), or if you push with "build --push", then yes.

The default storage backend does not keep compressed layers, so those need to be recreated and digested on push.

With the new store all that stuff is kept and reused.


That's true, but I'd assume the server would like to double-check that the hashes are valid (for robustness / consistency)... That's something my little experiment doesn't do, obviously.



