Isn't the whole point of compression algorithms to find strings of bytes that are the same? Why does it struggle if they're out of order or not next to each other? Seems simple to detect and handle?
> Ancient junk like gzip/deflate only has a 32kb window
Yes, a reflection of the time that they were first created, but hardly junk. gzip/DEFLATE have saved millions if not billions of hours of human time over the past few decades. Better things exist now, but it's unnecessarily derogatory to call something that significant "junk." And by being industry standards, they have made compression available in many more places than it used to be, even if they are not state of the art. Better to have some reasonable compression than none.
Picture the effect of compression ratios or decompression performance improving 1% globally, or conversely, if planes were replaced with a Wright Brothers model or cars with the T1. With Deflate it's not just 1% but regularly 200% or 2,000%, and it's not something people use once a month but practically every time they interact with digital electronics.
Regressing to the T1 is unthinkable because it's long out of fashion and widely perceived as obsolete. Every change starts with a shift in perception, a process accelerated by a shift in language. For that I'm content with "junk"; it describes exactly how we must feel about this dead technology before we'll ever be rid of it.
I'll gladly reminisce once we've escaped its legacy, be it the time wasted installing an Android app and the effect it has on the battery (multiplied by billions of users), the time to open an Excel document, power up some smart TVs, fetch an itemized AWS bill over a typical S.E. Asian Internet connection, or grep that bill at speeds appropriate for a state-of-the-art laptop manufactured this side of the millennium. You expect me to pay respect to software that wastes the finite breaths of billions of real people every single day?
So, even at maximum level, zstd is only about 6% better, or about 37% if you consider .gz as baseline, while being about 10x slower than gzip level 9. There are places where having a smaller compressed file is worth the cost of 10x execution time (and much higher memory usage), but serving web pages is almost certainly not one of them.
In other words, the reason gzip is still so widely used is simply that it's good enough for its use cases. It's less like the original T1, and more like complaining that people are still driving a 2011 Toyota Camry when the 2021 Camry is a much better car.
For the pile of XML in front of me just now (it originally arrived in a ZIP container, a.k.a. deflate, part of a much larger set): 1.4 GB decompressed, 44 MB gzipped/deflated (default settings), 31 MB zstd (default), 5.379 s to decompress with gzip, 0.560 s to decompress with zstd. Just under an order of magnitude difference in the most frequent operation (decompression).
Admittedly compressors generally love XML, this is just one example -- 28% less time on download and 89% less wasted on file open. Multiply by a few tens to a few thousand occurrences per week for 7.6 billion people, and I really struggle to call that a Camry.
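If you want to reproduce this kind of comparison, a rough sketch with Python's gzip module and the third-party zstandard package (the file name is just a placeholder for whatever large XML you have at hand):

```python
import gzip, time
import zstandard  # third-party: pip install zstandard

def time_decompress(label, decompress, blob):
    t0 = time.perf_counter()
    out = decompress(blob)
    print(f"{label}: {len(blob)} -> {len(out)} bytes in {time.perf_counter() - t0:.3f}s")

data = open("bill.xml", "rb").read()   # placeholder input file

gz = gzip.compress(data)                          # Python's gzip module defaults to level 9
zst = zstandard.ZstdCompressor().compress(data)   # zstd default level (3)

time_decompress("gzip", gzip.decompress, gz)
time_decompress("zstd", zstandard.ZstdDecompressor().decompress, zst)
```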
You're admittedly talking about a difference in compression rate of 13MB in a 1400MB package, and a 5s difference in writing a 1400MB file to disk.
Come on, let's be honest here: that's nitpicking.
Unless you are running batch processes that store terabytes of compressed data, no one would even bother to switch apps for those residual gains.
Let's put it this way: would you get any VC funding if your sales pitch was "I can improve compression in a 1400MB package by 13MB and shorten file write times by 5 seconds"? Odds are, you'd be asked why not just gzip it and get it over with.
> You're admittedly talking about a difference in compression rate of 13MB in a 1400MB package, and a 5s difference in writing a 1400MB file to disk.
> Come on, let's be honest here: that's nitpicking.
I agree on the compression size (13 megabytes is really not very much difference), but the decompression speed improvement really is remarkable. It's an order of magnitude of difference! Amortize that over every time you make a web request that has to decompress data and it makes a huge difference.
I'm mostly in the "gzip is good enough" camp, but a speed improvement of 10x is not nitpicking.
> Let's put it this way: would you get any VC funding if your sales pitch was "I can improve compression in a 1400MB package by 13MB and shorten file write times by 5 seconds"? Odds are, you'd be asked why not just gzip it and get it over with.
I agree but have a different takeaway: it’s why VCs aren’t the be all and end all. Such an improvement is worth the time invested to create it. It won’t change the world but spanned across the entire globe it’s a very notable step forward. That a VC can’t get hockey stick growth and a lucrative return out of it doesn’t invalidate the idea.
5.3 s at 1.4 GB on a laptop is 53 ms at 14 MB, which is >70 ms on mobile before adding any increased download time. 14 MB is one retina image; even Reddit already weighs 8 MB. It's silly to pretend this stuff doesn't matter; you can't even get search ranking without addressing it, for reasons that don't require concocted scenarios to explain.
Your difference might be mainly single-threaded vs. multi-threaded decompression. That doesn't typically translate into a real-world speedup when browsing the web, as one will typically run multiple decompression streams in parallel anyway. Zstd is definitely an improvement on gzip, but not by that much.
A quick note on compressing XML. XML files typically have a high compression ratio because they consist mostly of redundancy, but compressing redundancy isn't free: you still end up with some overhead in the compressed file. So if you pack the same data in a denser format and then compress that, you typically save a good chunk of space relative to the compressed XML file.
The tradeoff you show between compression ratio and compression speed is a misleading one.
The slow compression only happens once (at compression time), while the bandwidth saved by a better compression ratio is saved on every download.
True. But there you can reach for optimizations like parallelizing the compression and building an optimized dictionary once, up front, to make the process faster.
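If I remember the python-zstandard API right, a rough sketch of both ideas would look like this (xml_paths is assumed to be your list of similar small files; the dictionary size is arbitrary):

```python
import zstandard  # third-party: pip install zstandard

# Many small, similar documents benefit most from a shared dictionary.
samples = [open(p, "rb").read() for p in xml_paths]  # xml_paths assumed to exist

# Train the dictionary once, up front...
dictionary = zstandard.train_dictionary(110 * 1024, samples)

# ...then reuse it, with all available cores, for every compression job.
cctx = zstandard.ZstdCompressor(level=19, dict_data=dictionary, threads=-1)
compressed = [cctx.compress(s) for s in samples]
```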
These phrases, as well as the whole last paragraph, contribute no substantive value to the comment, but are aggressively vocal about assuming bad faith and as such violate HN guidelines. Please don't do that.
IMO, attack the message, not the messenger. Pointing out flaws in an argument is good, but assuming malice hurts the discussion. Now we are suddenly talking about the morality of the poster instead of the issue at hand.
Well, I did say "about 37% if you consider .gz as baseline", but I agree that it could have been worded better.
The problem is what you are measuring. If you have to compress a 100MB file, and algorithm A/B/C/D compress it to 10/9/1/0.9MB, respectively, then we can argue that "B is 10% better than A" (because the file is 10% smaller) and also "D is 10% better than C" (same).
However, moving from A to B saves us 1MB for this file. Moving from C to D only saves us 0.1MB. So depending on how you look at it, you could also argue that the difference between C and D is really only 0.1% (of the original file size). It may not be what you want (for a particular situation), but it's not wrong.
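A quick toy calculation to make the two viewpoints concrete (using the hypothetical A/B/C/D sizes from above):

```python
a, b, c, d = 10.0, 9.0, 1.0, 0.9   # compressed sizes in MB; original file is 100 MB

print((a - b) / a)      # 0.10  -> "B is 10% better than A"
print((c - d) / c)      # 0.10  -> "D is 10% better than C"

print((a - b) / 100.0)  # 0.010 -> but A->B saves 1% of the original size
print((c - d) / 100.0)  # 0.001 -> while C->D saves only 0.1% of it
```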
> at its default level, compresses not only more than gzip level 9, but also at more than 10x the speed
I could also call that a wilful misrepresentation, since gzip level 9 is very slow compared to the default setting and IMO isn't worth the extra execution time.
Also, nobody mentions that gzip is already installed everywhere. zstd is fairly new, and it's not installed by default on any of the machines I use.
I mean, I'm sure there are some exotic screw drivers that work better than Torx. But I can get Torx screws and drivers in every hardware store and everyone already has compatible bits so why not use whatever works?
The resulting compressed archives of these operations are 10% and 1% of the original file sizes, respectively (10x difference). The percent reduction can’t really be compared linearly.
I share your frustration with misinformation, but this is just human error, which you could be much kinder about correcting. I detected no gzip-pushing "agenda" in that comment; are you sure you're not just projecting?
>but it's unnecessarily derogatory to call something that significant "junk."
By now, it certainly is. Back then it was probably state of the art. That's the cycle of computing. If I called an original Atari junk, some would take it as "this is useless" but most would take it as "this is useless now".
> If I called an original Atari junk, some would take it as "this is useless" but most would take it as "this is useless now".
I think you have that backwards. "Junk" is an insulting word. If you called an Atari junk, most would understand you to be disparaging it in a roundabout way, not making a temporal point. To be fair to OP, he did at least call it "ancient junk." But there's no need to add "junk"; "old" or "ancient" alone would have made the point.
I suppose that's down to cultural differences. I'd consider junk to be something that's not useful anymore (I call my Core 2 Duo junk, but it was a powerhouse when I got it), but maybe a better word is defunct? According to the dictionary, junk = cheap, shoddy, or worthless.
Your grandpa isn't particularly functionally different in design from new humans being produced today, just at a different stage of his life. Also, he's a human.
Let's take some non-human examples for comparison. Just about every good military strategy produced when your grandpa was young is, objectively, junk when compared to the state of war today. But just about every good piece of music still holds up - it might be different from good music produced today, but it's not worse. Your grandpa probably did mathematics with the help of books of log tables and books of mathematical instruction. The former are thoroughly junk, being at best as useful as a calculator (or calculator app) but much heavier, and at worst bearing typos; the latter are every bit as useful now as they were back then.
Describing something from a previous generation as "junk" does not convey the belief it is old - there's a perfectly good term for that, "old." It conveys the belief that it is junk.
I recall reading something in the 90s saying that in the 70s, Huffman coding was considered mathematically optimal compression and the whole field stagnated for a while because people thought they'd reached a fundamental limit.
“Huffman's original algorithm is optimal for a symbol-by-symbol coding with a known input probability distribution, i.e., separately encoding unrelated symbols in such a data stream” -wikipedia
So, being pedantic, it is optimal in this specific sense, but I don't know about this stagnation.
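To make "symbol-by-symbol" concrete, here's a toy Huffman code builder (plain Python sketch, nothing from the thread; it just assigns shorter bit strings to more frequent symbols):

```python
import heapq
from collections import Counter

def huffman_code(text: str) -> dict:
    """Build a symbol-by-symbol Huffman code from observed frequencies."""
    freq = Counter(text)
    # Heap entries: (total weight, unique tiebreaker, {symbol: code-so-far})
    heap = [(w, i, {sym: ""}) for i, (sym, w) in enumerate(freq.items())]
    heapq.heapify(heap)
    while len(heap) > 1:
        w1, _, c1 = heapq.heappop(heap)
        w2, i, c2 = heapq.heappop(heap)
        merged = {s: "0" + code for s, code in c1.items()}
        merged.update({s: "1" + code for s, code in c2.items()})
        heapq.heappush(heap, (w1 + w2, i, merged))
    return heap[0][2]

print(huffman_code("abracadabra"))  # e.g. {'a': '0', 'b': '111', ...}
```

Optimal per symbol, but it can't exploit the fact that, say, "q" is almost always followed by "u" -- that's where the later dictionary and context-modelling work went.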
The article calls xz out for its inadequacy as a (long-term) container format, which is nothing to do with its compression method (shared by both xz and lzip). I guess in this thread we are specifically talking about the efficiency of compression algorithms...
> I guess in this thread we are specifically talking about the efficiency of compression algorithms
Yep, and on top of that xz is more efficient at compression than zstd despite being "older", which is the relevant consideration here. I'm not expecting zstd to compete with xz for the StackOverflow case in terms of efficiency, what I want to know is whether file ordering makes the same efficiency difference for zstd as it does for xz.
> what I want to know is whether file ordering makes the same efficiency difference for zstd as it does for xz
It should, because, much like other LZ77-based formats, zstd uses a distance symbol plus additional uncompressed bits for longer distances, and longer distances should be frequent with an unorganized file order. Those formats assume that the lower bits of longer distances are essentially random, so if that assumption doesn't hold, those bits will hurt the efficiency. Zstd also recommends a minimum window size of 8 MB, which is an improvement but still not very large.
I've done a quick experiment with a directory hanging around my temporary folder that weighs about 14 MB (Android CellBroadcastReceiver source code, if you're wondering). The random file order resulted in a file about 1.4% larger than the one with the file-extension order. So the effect definitely still exists, though I can't quantify it beyond this one sample.
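If anyone wants to repeat it, something along these lines should do (Python plus the third-party zstandard package; the directory name is just my example tree, any source tree works):

```python
import io, os, random, tarfile
import zstandard  # third-party: pip install zstandard

def packed_size(paths):
    """Tar the files in the given order, then zstd-compress the archive."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for p in paths:
            tar.add(p)
    return len(zstandard.ZstdCompressor(level=19).compress(buf.getvalue()))

paths = [os.path.join(d, f)
         for d, _, files in os.walk("CellBroadcastReceiver")
         for f in files]

shuffled = paths[:]
random.shuffle(shuffled)
by_extension = sorted(paths, key=lambda p: os.path.splitext(p)[1])

print("random order:   ", packed_size(shuffled))
print("extension order:", packed_size(by_extension))
```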
The 'window size' dictates how far back the algorithm can look for redundant data. If you set it too high, people without enough core memory might not be able to decompress your archive.
And the window size for gzip is only 32 kB -- it "forgets" context pretty quickly. gzip was written in 1992, so it had to run under much tighter memory constraints than necessarily make sense today.
Modern compressors like xz (LZMA) use much more memory for context. Under extreme settings and when working with large files, they can use gigabytes of memory.
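You can see the 32 kB window bite using nothing but the Python standard library, with lzma standing in as the large-window compressor:

```python
import os, zlib, lzma

block = os.urandom(64 * 1024)   # 64 KiB of incompressible noise
data = block + block            # an exact repeat, 64 KiB apart

# DEFLATE's window tops out at 32 KiB, so by the time the second copy
# arrives the first one has already been "forgotten": almost no savings.
print(len(zlib.compress(data, 9)))   # roughly 128 KiB

# LZMA's default dictionary is far larger, so it still sees the first copy.
print(len(lzma.compress(data)))      # roughly 64 KiB
```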
If you know your compressor has enough memory to hold the whole plaintext at once, but your decompressor doesn't, why not do extra work on the compression end with the whole plaintext in memory to emit an optimal "dynamic" Huffman alphabet, i.e. a code that gets patched by each new compressed block, rather than resetting to a completely new dictionary for each new block?
I presume the tightest packing possible would be achieved by converting the entire document into a Huffman-encoded sequence of references into a single unified strings table; breaking that strings table down recursively using largest-common-substring until satisfied; and then packing the resulting strings tree into something more like an intrusive prefix trie where offsets are implicit IDs. The problem would, of course, be the size of the resulting trie.
But you could always make the document into bytecode for a compressor VM (RAR-alike) where one opcode is "insert nodes {KV1, KV2, ...} into the prefix trie", and another opcode is "drop nodes {K1, K2, ...} from the prefix trie", such that the trie "window" used over the document would grow and shrink over time. Then you could optimize for redundant re-definitions of identical trie nodes in the compressed data, vs. total in-memory trie "window" size at any given time.
If the decompression wasn't required to be streaming, but rather could advance-allocate an mmap(2)ed file to write to, then you could also have a jump opcode, so that you could organize the Huffman-code trie-prefix definitions in an optimal order to minimize definition redundancy, trading off for making the compressed representation out-of-order relative to the plaintext it decompresses to. The compressor would put the "spine" of the document all together at the beginning, together with the frontloaded rootmost trie-node definitions; then gradually undefine them, replacing them with less-frequently-used nodes needed to decode context-free "islands" of content.
What you described is pretty much LZ78. It has been proven that if you can have an infinite dictionary, then you can achieve optimal efficiency. But infinite dictionaries are impossible.
Much better algorithms exist nowadays. Also, no modern algorithm uses Huffman coding anymore; it was overtaken by arithmetic and range coders long ago, and more recently by ANS coders.
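For reference, LZ78 in its textbook form really is tiny; a toy encoder looks roughly like this:

```python
def lz78_encode(data: bytes):
    """Toy LZ78 encoder: emits (dictionary index, next byte) pairs."""
    dictionary = {b"": 0}          # index 0 is the empty phrase
    output = []
    w = b""
    for byte in data:
        c = bytes([byte])
        if w + c in dictionary:
            w += c                 # keep extending the current phrase
        else:
            output.append((dictionary[w], c))
            dictionary[w + c] = len(dictionary)   # the dictionary only ever grows
            w = b""
    if w:                          # flush any trailing phrase
        output.append((dictionary[w], b""))
    return output

print(lz78_encode(b"abababababab"))
```

The ever-growing dictionary is exactly the "infinite dictionary" problem: real implementations have to cap or reset it.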
LZ78 isn't adaptive; entries are forced out of the dictionary as soon as the window moves beyond where the dictionary-entry was originally defined, and if those entries are still relevant, they'll get a redundant definition in the next block.
Also, LZ78 is not recursively compressed (i.e. it doesn't compress the dictionary itself—it won't recognize when the subsequences it's working with are themselves internally redundant and would be better represented as a function with an input.)
Huffman/arithmetic/range/ANS are all "the same" insofar as they're all local-context streaming coders, i.e. things that only have the power of an FSM rather than the power you can get from requiring a pushdown automata/Turing machine.
The kind of coder I was talking about above, definitely would require a Turing machine. You couldn't implement it as a DSP — you need that unbounded-length state tape, the ability to write unbounded amounts of code to it, and the ability to jump to that code. That doesn't guarantee that it would be more powerful/better compressing; just that it could be. It wouldn't have the same inherent limits to its power that FSM-equivalent codes do.
-----
Let me go on a tangent for a bit, about what such a code probably would achieve if it could work (which should convince you, like me, that this is probably impossible somehow):
One thing I'd expect the code above to accomplish, would be to take a set of files representing images produced by capturing the sequential output frames generated by running an input movie on a NES emulator; and to emit something that looks very analogous to a cut-down version of the original ROM + input movie. (That is, the compressed data would contain 1. a set of partial spritesheets — the nodes of the prefix trie; 2. a set of partial tilemaps, referencing those spritesheets — the dictionary patch definition blocks; and 3. a sequence of bytecode instructions that use arbitrary geometric "draw" operations, jumps, loop registers, etc., to generate the screen-capture images by referencing the tilemaps that in turn reference the spritesheets — i.e. by calling ops to load/unload dictionary patches by reference. That's a "demo", in demoscene terms.)
This kind of result, if achieved, would (AFAIK) beat all currently-known compression methods. The best ANS code can't compress a sequence-of-images of SMB1 as well as just storing the ROM of SMB1 together with the input movie does. This wouldn't get quite to that degree of compression, but it would get close, and it would do it without the "implicit" Kolmogorov complexity of a NES emulator†.
† Though you could think of the output as something like the Futamura projection of a NES emulator with a specific ROM + input movie loaded into it. (It'd be a very weird NES emulator; one that runs as a batch process and emits image files of video frames. But if you had such an emulator, the output of Futamura-ing it would be very close to — or possibly better than — the output of this kind of approach. But, unlike this approach, incredibly non-portable — it would likely only run on the host architecture + OS version it was Futamura-ed on.)
In the (1D) audio domain, what I'm describing would basically involve turning PCM audio that encodes a rendering of a MIDI sequence through a set of patches, back into the inputs (a MIDI sequence and a set of patches), plus any noise/mastering effects as a residue. We can already do that in domain-specific systems, of course — that's how Melodyne's Direct Note Access works.
I'm just talking about doing it instead in a general compressor, lifting data with opaque redundancies out into graphical data-structures heuristically in order to then compress those graphical data-structures into programs that build-and-then-walk those graphs to produce the original output. A code that synthesizes something like MIDI to represent music; something like NES nametables to represent collaged images; etc. Not as well as those domain-tuned codes do, but better than local-context non-structural codes do. Basically the promise of Hierarchical Temporal Memory, but without the ML part. Purely-algorithmic HTM.
And another fun consequence of such a graphical-structuring code existing, would be that you could feed it non-trivial program binaries, and its output would be an equivalent program retargeted into an hypothetical ISA optimized for just that program. Feed it a corpus of enough such programs, all distinct, and you'll get a generally-space-optimized version of the ISA. An automatic derivation of THUMB from ARM, sorta thing.
------
And again, as I said in my sibling comment: this seems so obvious a win to me that it must be broken or impossible somehow. Tries / DAFSAs are easy. Longest-common-substring iterated to a satisfaction threshold is easy (using temporary post-constructed suffix automata). Hypothetical iterated restructuring is easy — see OptiPNG, HypoPG, etc. So, where's the impossible part where the wheels fall off? I'm pretty sure it's there.
Interesting idea! I'm not an expert, but if I'm reading you right and the decompressor is Turing-complete, you're searching for the smallest Turing machine capable of generating the input sequence?
That would indeed be an optimal compressor and would start to enable things like Solomonoff induction and AIXI.
Seems the challenge is getting feedback to guide the search for the right Turing machine. With infinite compute you can just run an ever-expanding set of Turing machines until you get one that generates the input, but without some breakthrough enabling something like stochastic gradient descent to work efficiently on non-differentiable programs, I don't see how it can be made practical.
Perhaps there is some sweet spot that's more general than current compressors but looks at a restricted class of Turing machines, which could be feasible.
I'm not an information theorist / cryptographer / etc., and so I have a strong feeling that there must be something wrong with my assumptions, given that "compressing big files using lots of memory for people to later decompress back to files using average amounts of memory" is an extremely-common need (for e.g. packaged software/OS updates.)
When a physicist sees a proof for a perpetual-motion machine, they don't ask "can this work"; they ask "where is this going wrong, that it came to this absurd conclusion?"
My post wasn't suggesting a real technique — it was me sharing my own perpetual-motion machine proof, so that someone can hopefully point out the step that's impossible, and I can then learn something from that :)
It sounds plausible to me. Real-world constraints aren't usually "you can use any amount of resources on the compressor, and any amount of compute on the decompressor for free, but the decompressor is hardcapped at exactly X MB of memory". So you might only have a theoretical improvement on existing methods, but that would still be very cool.
You can't beat existing methods by much - we're pretty close to the Shannon limit - but it's not a closed subject.
Is this really true? Surely they should be able to decompress it with virtual memory - you tell them exactly where to look for each operation. On the other hand, it will take you much longer to compress the file if your lookback window is larger than your available memory.
They're stream compressors, so they don't want to rely on being able to seek in the file. Even with swap, you'll have to cut it off somewhere; swap space isn't infinite.
Due to limited memory, I think most compression algos have a window of "stuff they've seen recently" which they use to detect duplicates. If the duplicate is of something too far back, it's fallen out of the window and been forgotten.
Compression formats have to be small themselves, so they make tradeoffs on what data they can point at to make copies of. Most formats prioritize recent data, since that tends to be the most similar.
Sliding window effects. Most compression algos don't keep the entire set of files to be compressed in memory; they have a sliding window where they only look back and ahead X bytes.
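In toy form, the match finder at the heart of an LZ77-style compressor looks something like this; note the hard cutoff: it never looks more than the window size back (32 kB here, to mirror gzip):

```python
def longest_match(data: bytes, pos: int, window: int = 32 * 1024):
    """Toy LZ77 match finder: it can only reference the last `window` bytes.
    A duplicate that sits further back has already slid out of reach."""
    best_len, best_dist = 0, 0
    for cand in range(max(0, pos - window), pos):
        length = 0
        while (pos + length < len(data)
               and data[cand + length] == data[pos + length]):
            length += 1
        if length > best_len:
            best_len, best_dist = length, pos - cand
    return best_dist, best_len
```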
I know there are fundamental limits on compression and I know I don’t know the first thing about implementing a modern compression algorithm. With that said:
> Most compression algos don't keep the entire set of files to be compressed in memory
This makes sense, but why keep the uncompressed data in memory? If the memory constraint is your biggest concern and you’re not CPU conscious, compare the outputs rather than the inputs. If you’re concerned about collisions, do a second pass to validate.
I’m certain the best minds on the topic are already aware of these and are either using them or have ruled them out for reasons I can’t anticipate. I really hope, that likely being the case, I’ll get a chance to learn in the responses and not just unexplained downvotes.
> Isn't the whole point of compression algorithms to find string of bytes that are the same?
No. The goal of compression algorithms is to describe an object approximately (lossy) or accurately (lossless) using the smallest number of [insert unit] possible. In some cases looking for strings of bytes that are the same serves well enough.
Examples of lossy algorithms that don't just look for same strings of bytes are audio and video codecs.
An example of lossless compression that works differently would be compressing "123456789" as "1 to 9".
> Seems simple to detect and handle?
Sure. If you've got infinite memory and time. Depending on your format you may also need to read the whole input before you can start writing the first byte.