Hacker News

Thank you, formatting fixed.

My TLDR is that I would regard all commits by JiaT75 as potentially compromised.

Given the ability to manipulate git history, I am not sure if a simple time-based revert is enough.

It would be great to compare old copies of the repo with the current state. There is no guarantee that the history wasn't tampered with.

Overall, the only safe action would IMHO be to establish a new upstream from an assumed-good state, then fully audit it. At that point we should probably just abandon it and use zstd instead.



Zstd belongs to the class of speed-optimized compressors providing “tolerable” compression ratios. Their intended use case is wrapping some easily compressible data with negligible (in the grand scale) performance impact. So when you have a server which sends gigabits of text per second, or caches gigabytes of text, or processes a queue with millions of text protocol messages, you can add compression on one side and decompression on the other to shrink them without worrying too much about CPU usage.
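The speed/ratio dial described above can be sketched with Python's stdlib. Note that zstd itself has no stdlib binding (the `zstandard` package is third-party), so this uses zlib purely to illustrate the same trade-off of compression level versus time on easily compressible protocol text:

```python
import time
import zlib

# Stand-in for easily compressible text protocol messages.
payload = b"GET /api/v1/items?page=7 HTTP/1.1\r\nHost: example.com\r\n\r\n" * 2000

results = {}
for level in (1, 6, 9):
    start = time.perf_counter()
    results[level] = zlib.compress(payload, level)
    elapsed = time.perf_counter() - start
    print(f"level {level}: {len(payload)} -> {len(results[level])} bytes, "
          f"{elapsed * 1000:.2f} ms")
```

On data like this, the low levels already shrink the payload dramatically for a fraction of a millisecond of CPU, which is the "wrap it and forget it" use case described above.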

Xz is an implant of 7zip's LZMA(2) compression into a traditional Unix archiver skeleton. It trades long compression times and giant dictionaries (that need lots of memory) for better (“much-better-than-deflate”) compression ratios. Therefore zstd, no matter how fashionable that name might be in some circles, is not a replacement for xz.
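Python ships liblzma bindings in the stdlib `lzma` module, so the ratio difference against deflate is easy to see on synthetic text (the corpus below is made up for illustration):

```python
import lzma
import random
import zlib

# Text-like corpus: 100k tokens drawn from a small vocabulary.
rng = random.Random(42)
words = [b"w%04d" % i for i in range(400)]
data = b" ".join(rng.choice(words) for _ in range(100000))

gz       = zlib.compress(data, 9)          # deflate, for comparison
xz_small = lzma.compress(data, preset=1)   # fast, small dictionary
xz_best  = lzma.compress(data, preset=9)   # slow, 64 MiB dictionary

print(f"{len(data)} raw, gzip-9 {len(gz)}, xz-1 {len(xz_small)}, xz-9 {len(xz_best)}")
```

The high presets cost noticeably more CPU and memory, which is exactly the trade described above.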

It should also be noted that those LZMA-based archive formats might not be considered state-of-the-art today. If you worry about data density, there are options for both faster compression at the same size, and better compression in the same amount of time (provided that data is generally compressible). 7zip and xz are widespread and well tested, though, and allow decompression to be fast, which might be important in some cases. Alternatives often decompress much more slowly. This is also a trade-off between total time spent on X nodes compressing data, and Y nodes decompressing data. When X is 1, and Y is in the millions (say, software distribution), you can spend A LOT of time compressing even for relatively minuscule gains without affecting the scales.
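The X-compressors-versus-Y-decompressors point is just arithmetic; here is a back-of-envelope model with hypothetical timings (none of the numbers come from real measurements):

```python
# Hypothetical: one build server compresses, a million clients decompress.
compress_nodes, decompress_nodes = 1, 1_000_000

# Asymmetric codec (xz-style): very slow to compress, fast to decompress.
slow_compress_s, fast_decompress_s = 3600.0, 1.0
# Symmetric codec: quick to compress, but each client pays more.
fast_compress_s, slow_decompress_s = 60.0, 5.0

asymmetric_total = compress_nodes * slow_compress_s + decompress_nodes * fast_decompress_s
symmetric_total  = compress_nodes * fast_compress_s + decompress_nodes * slow_decompress_s

print(f"asymmetric: {asymmetric_total:,.0f} s total, symmetric: {symmetric_total:,.0f} s total")
```

Even an hour of compressor time vanishes next to the aggregate decompression cost, which is why software distribution tolerates expensive settings.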

It should also be noted that many (or most) decoders of top compressing archivers are implemented as virtual machines executing chains of transform and unpack operations defined in the archive file over pieces of data also saved there. Or, looking from a different angle, complex state machines initializing their state using complex data in the archive. The compressor tries to find the most suitable combination of basic steps based on the input data, and stores the result in the archive. (This is logically completed in neural network compression tools which learn what to do with data from the data itself.) As some people may know, implementing all that byte juggling safely and efficiently is a herculean task, and compression tools have had exploits in the past because of that. Switching to a better solution might introduce a lot more potentially exploitable bugs.
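The .xz container is a concrete example: it stores a filter chain that the decoder reads back from the file and replays. Python's stdlib `lzma` exposes this; the sketch below chains a delta transform into LZMA2 (the data and parameters are illustrative):

```python
import lzma

# A filter chain stored inside the .xz container: delta transform feeding
# LZMA2. The decoder reconstructs this chain from the archive itself.
filters = [
    {"id": lzma.FILTER_DELTA, "dist": 4},   # byte-wise delta, distance 4
    {"id": lzma.FILTER_LZMA2, "preset": 6},
]

# Data where delta helps: 32-bit little-endian counter samples.
data = b"".join((1000 + i).to_bytes(4, "little") for i in range(50000))

packed = lzma.compress(data, format=lzma.FORMAT_XZ, filters=filters)
plain  = lzma.compress(data, format=lzma.FORMAT_XZ)

print(f"{len(data)} raw, {len(plain)} default chain, {len(packed)} delta chain")
```

Decompression needs no filter argument at all: the chain travels inside the archive, which is precisely the "state machine initialized from data in the file" property described above.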


Arch Linux switched from xz to zstd, with a negligible increase in size (<1%) but a massive speedup on decompression. This is exactly the use case of many people downloading ($$$) and decompressing; it is the software distribution case. Other distributions are following that lead.

You should use --ultra and a compression level >= 19. E.g. Arch used 20; higher compression levels do exist, but they were already at a <1% size increase.

It does beat xz for these tasks. It's just not the default settings as those are indeed optimized for the lzo to gzip/bzip2 range.


My bad, I was too focused on that class in general, imagining “lz4 and friends”.

Zstd does reach LZMA compression ratios at high levels, but compression speed also drops to LZMA levels. Which, obviously, was clearly planned in advance to cover both high-speed online applications and slow offline compression (unlike, say, brotli). The official cap on levels can also be explained by the absence of gains on most inputs in development tests.

Distribution packages contain binary and mixed data, which might be less compressible. For text and mostly-text data, I suppose that some old-style LZ-based tools can still produce an archive roughly 5% smaller (and still unpack fast); other compression algorithms can certainly squeeze it much better, but have symmetric time requirements. I was worried about the latter kind being introduced as a replacement solution.


> the lzo to gzip/bzip2 range

bzip2 is a pig that has no place being in the same sentence as lzo and gzip. Its niche was maximum compression no matter the speed, but it hasn't been relevant even there for a long time.

Yet tools still need to support bzip2 because bzip2 archives are still out there and are still being produced. So we can't get rid of libbz2 anytime soon - same for liblzma.


Note that the xz CLI does not expose all available compression options of the library. E.g. rust release tarballs are xz'd with custom compression settings. But yeah, zstd is good enough for many uses.
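The library-level knobs the CLI presets hide can be seen through Python's stdlib `lzma` bindings. The option values below are purely illustrative, not the actual settings used for Rust release tarballs:

```python
import lzma

# Custom LZMA2 options beyond what the CLI presets expose
# (illustrative values only).
filters = [{
    "id": lzma.FILTER_LZMA2,
    "dict_size": 64 * 1024 * 1024,  # 64 MiB dictionary
    "lc": 4,                        # literal context bits (lc + lp <= 4)
    "pb": 0,                        # position bits
}]

data = b"example payload " * 4096
packed = lzma.compress(data, format=lzma.FORMAT_XZ, filters=filters)
print(f"{len(data)} -> {len(packed)} bytes with custom LZMA2 options")
```

Tuning lc/lp/pb and the dictionary size to the payload is how producers squeeze out extra percent over the stock presets.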


Looking forward to the time when Meta makes https://github.com/facebookincubator/zstrong.git public

Found it mentioned in https://github.com/facebook/proxygen/blob/main/build/fbcode_...; looks like it's going to be a cousin of zstd, but maybe for the stronger-compression use cases.


Not just Jia. There are some other accounts of concern with associated activity or short-term/bot-ish names.



Interesting, this would suggest exploits other than the known sshd one.


Note that zstd (the utility) currently links to liblzma since it can compress and decompress other formats.


Lol as if there weren't enough general archivers already.


> Given the ability to manipulate gitnhistory I am not sure if a simple time based revert is enough.

Rewritten history is not a real concern because it would have been immediately noticed by anyone updating an existing checkout.
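The reason rewrites are noticed is that each commit id hashes the commit's content together with its parent's id, so tampering with any ancestor changes every descendant id, and `git fetch` on an existing checkout reports a forced (non-fast-forward) update. A toy model of that chaining (real git hashes full commit objects with trees, authors, and timestamps):

```python
import hashlib

def commit_id(message: bytes, parent: str) -> str:
    # Toy commit id: hash of the parent id plus the commit content.
    return hashlib.sha1(parent.encode() + message).hexdigest()

def build_chain(messages):
    head = ""
    for msg in messages:
        head = commit_id(msg, head)
    return head

history  = [b"init", b"add feature", b"fix bug"]
tampered = [b"init (backdoored)", b"add feature", b"fix bug"]

original_head = build_chain(history)
tampered_head = build_chain(tampered)

print(original_head, tampered_head)
```

Even though only the first commit changed, the tip id differs, which is what an existing clone would immediately see.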

> Overall the only safe action would IMHO to establish a new upstream from an assumed good state, then fully audit it. At that point we should probably just abandon it and use zstd instead.

This is absurd and also impossible without breaking backwards compatibility all over the place.




