Isn't the whole point of compression algorithms to find strings of bytes that are the same? Why does it struggle if they're out of order or not next to each other? Seems simple to detect and handle?
> Ancient junk like gzip/deflate only has a 32kb window
Yes, a reflection of the time that they were first created, but hardly junk. gzip/DEFLATE have saved millions if not billions of hours of human time over the past few decades. Better things exist now, but it's unnecessarily derogatory to call something that significant "junk." And by being industry standards, they have made compression available in many more places than it used to be, even if they are not state of the art. Better to have some reasonable compression than none.
Picture the effect of compression ratios or decompression performance improving 1% globally, or conversely, if planes were replaced with a Wright Brothers model or cars with the T1. With Deflate it's not just 1% but regularly 200% or 2,000%, and it's not something people use once a month but practically every time they interact with digital electronics.
Regressing to the T1 is unthinkable because it's long out of fashion and widely perceived as obsolete. Every change starts with a shift in perception, a process accelerated by a shift in language. For that I'm content with "junk"; it describes exactly how we must feel about this dead technology before we'll ever be rid of it.
I'll gladly reminisce once we've escaped its legacy, be it the time wasted installing an Android app and the effect it has on the battery (multiplied by billions of users), the time to open an Excel document, power up some smart TVs, fetch an itemized AWS bill over a typical S.E. Asian Internet connection, or grep that bill at speeds appropriate for a state-of-the-art laptop manufactured this side of the millennium. You expect me to pay respect to software that wastes the finite breaths of billions of real people every single day?
So, even at maximum level, zstd is only about 6% better, or about 37% if you consider .gz as baseline, while being about 10x slower than gzip level 9. There are places where having a smaller compressed file is worth the cost of 10x execution time (and much higher memory usage), but serving web pages is almost certainly not one of them.
In other words, the reason gzip is still so widely used is simply that it's good enough for its use cases. It's less like the original T1, and more like complaining that people are still driving a 2011 Toyota Camry when the 2021 Camry is a much better car.
For the pile of XML in front of me just now (it originally arrived in a ZIP container, a.k.a. deflate, part of a much larger set): 1.4 GB decompressed, 44 MB gzipped/deflated (default settings), 31 MB zstd (default), 5.379 s to decompress with gzip, 0.560 s to decompress with zstd. Just under an order of magnitude difference in the most frequent operation (decompression).
Admittedly compressors generally love XML, this is just one example -- 28% less time on download and 89% less wasted on file open. Multiply by a few tens to a few thousand occurrences per week for 7.6 billion people, and I really struggle to call that a Camry.
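If you want to reproduce this kind of comparison, a rough sketch with Python's gzip module and the third-party zstandard package (the file name is just a placeholder for whatever large XML you have at hand):

```python
import gzip, time
import zstandard  # third-party: pip install zstandard

def time_decompress(label, decompress, blob):
    t0 = time.perf_counter()
    out = decompress(blob)
    print(f"{label}: {len(blob)} -> {len(out)} bytes in {time.perf_counter() - t0:.3f}s")

data = open("bill.xml", "rb").read()   # placeholder input file

gz = gzip.compress(data)                          # Python's gzip module defaults to level 9
zst = zstandard.ZstdCompressor().compress(data)   # zstd default level (3)

time_decompress("gzip", gzip.decompress, gz)
time_decompress("zstd", zstandard.ZstdDecompressor().decompress, zst)
```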
You're admittedly talking about a difference in compression rate of 13MB in a 1400MB package, and a 5s difference in writing a 1400MB file to disk.
Come on, let's be honest here: that's nitpicking.
Unless you are running batch processes that store terabytes of compressed data, no one would even bother to switch apps for those residual gains.
Let's put it this way: would you get any VC funding if your sales pitch was "I can improve compression in a 1400MB package by 13MB and shorten file write times by 5 seconds"? Odds are, you'd be asked why not just gzip it and get it over with.
> You're admittedly talking about a difference in compression rate of 13MB in a 1400MB package, and a 5s difference in writing a 1400MB file to disk.
> Come on, let's be honest here: that's nitpicking.
I agree on the compression size (13 megabytes is really not very much difference), but the decompression speed improvement really is remarkable. It's an order of magnitude of difference! Amortize that over every time you make a web request that has to decompress data and it makes a huge difference.
I'm mostly in the "gzip is good enough" camp, but a speed improvement of 10x is not nitpicking.
> Let's put it this way: would you get any VC funding if your sales pitch was "I can improve compression in a 1400MB package by 13MB and shorten file write times by 5 seconds"? Odds are, you'd be asked why not just gzip it and get it over with.
I agree but have a different takeaway: it’s why VCs aren’t the be all and end all. Such an improvement is worth the time invested to create it. It won’t change the world but spanned across the entire globe it’s a very notable step forward. That a VC can’t get hockey stick growth and a lucrative return out of it doesn’t invalidate the idea.
5.3 s at 1.4 GB on a laptop is 53 ms at 14 MB, which is >70 ms on mobile before adding any increased download time. 14 MB is one retina image; even Reddit already weighs 8 MB. It's silly to pretend this stuff doesn't matter; you can't even get search ranking without addressing it, for reasons that don't require concocted scenarios to explain.
Your difference might be mainly single-threaded vs. multi-threaded decompression. That doesn't typically translate into a real-world speedup when browsing the web, as one will typically run multiple decompression streams in parallel anyway. Zstd is definitely an improvement on gzip, but not by that much.
A quick note on compressing XML. XML files typically have a high compression ratio because they consist mostly of redundancy, but compressing redundancy isn't free: you still end up with some overhead in the compressed file. So if you pack the same data in a denser format and then compress that, you typically save a good chunk of space relative to the compressed XML file.
The tradeoff you show between compression ratio and compression speed is a misleading one.
The slow compression only happens once (at compression time), while the bandwidth saved by a better compression ratio is saved on every download.
True. But there you can reach for optimizations like parallelizing the compression and building an optimized dictionary once, up front, to make the process faster.
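If I remember the python-zstandard API right, a rough sketch of both ideas would look like this (xml_paths is assumed to be your list of similar small files; the dictionary size is arbitrary):

```python
import zstandard  # third-party: pip install zstandard

# Many small, similar documents benefit most from a shared dictionary.
samples = [open(p, "rb").read() for p in xml_paths]  # xml_paths assumed to exist

# Train the dictionary once, up front...
dictionary = zstandard.train_dictionary(110 * 1024, samples)

# ...then reuse it, with all available cores, for every compression job.
cctx = zstandard.ZstdCompressor(level=19, dict_data=dictionary, threads=-1)
compressed = [cctx.compress(s) for s in samples]
```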
These phrases, as well as the whole last paragraph, contribute no substantive value to the comment, but are aggressively vocal about assuming bad faith and as such violate HN guidelines. Please don't do that.
IMO, attack the message, not the messenger. Pointing out flaws in an argument is good, but assuming malice hurts the discussion. Now we are suddenly talking about the morality of the poster instead of the issue at hand.
Well, I did say "about 37% if you consider .gz as baseline", but I agree that it could have been worded better.
The problem is what you are measuring. If you have to compress a 100MB file, and algorithm A/B/C/D compress it to 10/9/1/0.9MB, respectively, then we can argue that "B is 10% better than A" (because the file is 10% smaller) and also "D is 10% better than C" (same).
However, moving from A to B saves us 1MB for this file. Moving from C to D only saves us 0.1MB. So depending on how you look at it, you could also argue that the difference between C and D is really only 0.1% (of the original file size). It may not be what you want (for a particular situation), but it's not wrong.
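A quick toy calculation to make the two viewpoints concrete (using the hypothetical A/B/C/D sizes from above):

```python
a, b, c, d = 10.0, 9.0, 1.0, 0.9   # compressed sizes in MB; original file is 100 MB

print((a - b) / a)      # 0.10  -> "B is 10% better than A"
print((c - d) / c)      # 0.10  -> "D is 10% better than C"

print((a - b) / 100.0)  # 0.010 -> but A->B saves 1% of the original size
print((c - d) / 100.0)  # 0.001 -> while C->D saves only 0.1% of it
```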
> at its default level, compresses not only more than gzip level 9, but also at more than 10x the speed
I could also call that a wilful misrepresentation, since gzip level 9 is very slow compared to the default setting and IMO isn't worth the extra execution time.
Also, nobody mentions that gzip is already installed everywhere. zstd is fairly new, and it's not installed by default on any of the machines I use.
I mean, I'm sure there are some exotic screw drivers that work better than Torx. But I can get Torx screws and drivers in every hardware store and everyone already has compatible bits so why not use whatever works?
The resulting compressed archives of these operations are 10% and 1% of the original file sizes, respectively (10x difference). The percent reduction can’t really be compared linearly.
I share your frustration with misinformation, but this is just human error, which you could be much kinder about correcting. I detected no gzip-pushing "agenda" in that comment; are you sure you're not just projecting?
>but it's unnecessarily derogatory to call something that significant "junk."
By now, it certainly is. Back then it was probably state of the art. That's the cycle of computing. If I called an original Atari junk, some would take it as "this is useless" but most would take it as "this is useless now".
> If I called an original Atari junk, some would take it as "this is useless" but most would take it as "this is useless now".
I think you have that backwards. "Junk" is an insulting word. If you called an Atari junk, most would understand you to be disparaging it in a roundabout way, not making a temporal point. To be fair to OP, he did at least call it "ancient junk." But there's no need to add "junk"; "old" or "ancient" alone would have made the point.
I suppose that's down to cultural differences. I'd consider junk to be something that's not useful anymore (I call my Core 2 Duo junk, but it was a powerhouse when I got it), but maybe a better word is defunct? According to the dictionary, junk = cheap, shoddy, or worthless.
Your grandpa isn't particularly functionally different in design from new humans being produced today, just at a different stage of his life. Also, he's a human.
Let's take some non-human examples for comparison. Just about every good military strategy produced when your grandpa was young is, objectively, junk when compared to the state of war today. But just about every good piece of music still holds up - it might be different from good music produced today, but it's not worse. Your grandpa probably did mathematics with the help of books of log tables and books of mathematical instruction. The former are thoroughly junk, being at best as useful as a calculator (or calculator app) but much heavier, and at worst bearing typos; the latter are every bit as useful now as they were back then.
Describing something from a previous generation as "junk" does not convey the belief it is old - there's a perfectly good term for that, "old." It conveys the belief that it is junk.
I recall reading something in the 90s saying that in the 70s, Huffman coding was considered mathematically optimal compression and the whole field stagnated for a while because people thought they'd reached a fundamental limit.
“Huffman's original algorithm is optimal for a symbol-by-symbol coding with a known input probability distribution, i.e., separately encoding unrelated symbols in such a data stream” -wikipedia
So, being pedantic, it is optimal in this specific sense, but I don't know about this stagnation.
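To make "symbol-by-symbol" concrete, here's a toy Huffman code builder (plain Python sketch, nothing from the thread; it just assigns shorter bit strings to more frequent symbols):

```python
import heapq
from collections import Counter

def huffman_code(text: str) -> dict:
    """Build a symbol-by-symbol Huffman code from observed frequencies."""
    freq = Counter(text)
    # Heap entries: (total weight, unique tiebreaker, {symbol: code-so-far})
    heap = [(w, i, {sym: ""}) for i, (sym, w) in enumerate(freq.items())]
    heapq.heapify(heap)
    while len(heap) > 1:
        w1, _, c1 = heapq.heappop(heap)
        w2, i, c2 = heapq.heappop(heap)
        merged = {s: "0" + code for s, code in c1.items()}
        merged.update({s: "1" + code for s, code in c2.items()})
        heapq.heappush(heap, (w1 + w2, i, merged))
    return heap[0][2]

print(huffman_code("abracadabra"))  # e.g. {'a': '0', 'b': '111', ...}
```

Optimal per symbol, but it can't exploit the fact that, say, "q" is almost always followed by "u" -- that's where the later dictionary and context-modelling work went.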
The article calls xz out for its inadequacy as a (long-term) container format, which is nothing to do with its compression method (shared by both xz and lzip). I guess in this thread we are specifically talking about the efficiency of compression algorithms...
> I guess in this thread we are specifically talking about the efficiency of compression algorithms
Yep, and on top of that xz is more efficient at compression than zstd despite being "older", which is the relevant consideration here. I'm not expecting zstd to compete with xz for the StackOverflow case in terms of efficiency, what I want to know is whether file ordering makes the same efficiency difference for zstd as it does for xz.
> what I want to know is whether file ordering makes the same efficiency difference for zstd as it does for xz
It should, because, much like other LZ77-based formats, zstd uses a distance symbol plus additional uncompressed bits for longer distances, and longer distances should be frequent with an unorganized file order. Those formats assume that the lower bits of longer distances are essentially random, so if that assumption doesn't hold, those bits will hurt the efficiency. Zstd also recommends a minimum window size of 8 MB, which is an improvement but still not very large.
I've done a quick experiment with a directory hanging around my temporary folder that weighs about 14 MB (Android CellBroadcastReceiver source code, if you're wondering). The random file order resulted in a file about 1.4% larger than the one with the file-extension order. So the effect definitely still exists, though I can't quantify it beyond this one sample.
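If anyone wants to repeat it, something along these lines should do (Python plus the third-party zstandard package; the directory name is just my example tree, any source tree works):

```python
import io, os, random, tarfile
import zstandard  # third-party: pip install zstandard

def packed_size(paths):
    """Tar the files in the given order, then zstd-compress the archive."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for p in paths:
            tar.add(p)
    return len(zstandard.ZstdCompressor(level=19).compress(buf.getvalue()))

paths = [os.path.join(d, f)
         for d, _, files in os.walk("CellBroadcastReceiver")
         for f in files]

shuffled = paths[:]
random.shuffle(shuffled)
by_extension = sorted(paths, key=lambda p: os.path.splitext(p)[1])

print("random order:   ", packed_size(shuffled))
print("extension order:", packed_size(by_extension))
```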
The 'window size' dictates how far back the algorithm can look for redundant data. If you set it too high, people without enough core memory might not be able to decompress your archive.
And the window size for gzip is only 32 kB -- it "forgets" context pretty quickly. gzip was written in 1992, so it had to run under much tighter memory constraints than necessarily make sense today.
Modern compressors like xz (LZMA) use much more memory for context. Under extreme settings and when working with large files, they can use gigabytes of memory.
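You can see the 32 kB window bite using nothing but the Python standard library, with lzma standing in as the large-window compressor:

```python
import os, zlib, lzma

block = os.urandom(64 * 1024)   # 64 KiB of incompressible noise
data = block + block            # an exact repeat, 64 KiB apart

# DEFLATE's window tops out at 32 KiB, so by the time the second copy
# arrives the first one has already been "forgotten": almost no savings.
print(len(zlib.compress(data, 9)))   # roughly 128 KiB

# LZMA's default dictionary is far larger, so it still sees the first copy.
print(len(lzma.compress(data)))      # roughly 64 KiB
```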
If you know your compressor has enough memory to hold the whole plaintext at once, but your decompressor doesn't, why not do extra work on the compression end with the whole plaintext in memory to emit an optimal "dynamic" Huffman alphabet, i.e. a code that gets patched by each new compressed block, rather than resetting to a completely new dictionary for each new block?
I presume the tightest packing possible would be achieved by converting the entire document into a Huffman-encoded sequence of references into a single unified strings table; breaking that strings table down recursively using largest-common-substring until satisfied; and then packing the resulting strings tree into something more like an intrusive prefix trie where offsets are implicit IDs. The problem would, of course, be the size of the resulting trie.
But you could always make the document into bytecode for a compressor VM (RAR-alike) where one opcode is "insert nodes {KV1, KV2, ...} into the prefix trie", and another opcode is "drop nodes {K1, K2, ...} from the prefix trie", such that the trie "window" used over the document would grow and shrink over time. Then you could optimize for redundant re-definitions of identical trie nodes in the compressed data, vs. total in-memory trie "window" size at any given time.
If the decompression wasn't required to be streaming, but rather could advance-allocate an mmap(2)ed file to write to, then you could also have a jump opcode, so that you could organize the Huffman-code trie-prefix definitions in an optimal order to minimize definition redundancy, trading off for making the compressed representation out-of-order relative to the plaintext it decompresses to. The compressor would put the "spine" of the document all together at the beginning, together with the frontloaded rootmost trie-node definitions; then gradually undefine them, replacing them with less-frequently-used nodes needed to decode context-free "islands" of content.
What you described is pretty much LZ78. It has been proven that if you can have an infinite dictionary, then you can achieve optimal efficiency. But infinite dictionaries are impossible.
Much better algorithms exist nowadays. Also, no modern algorithm uses Huffman coding anymore; it was overtaken by arithmetic and range coders long ago, and more recently by ANS coders.
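For reference, LZ78 in its textbook form really is tiny; a toy encoder looks roughly like this:

```python
def lz78_encode(data: bytes):
    """Toy LZ78 encoder: emits (dictionary index, next byte) pairs."""
    dictionary = {b"": 0}          # index 0 is the empty phrase
    output = []
    w = b""
    for byte in data:
        c = bytes([byte])
        if w + c in dictionary:
            w += c                 # keep extending the current phrase
        else:
            output.append((dictionary[w], c))
            dictionary[w + c] = len(dictionary)   # the dictionary only ever grows
            w = b""
    if w:                          # flush any trailing phrase
        output.append((dictionary[w], b""))
    return output

print(lz78_encode(b"abababababab"))
```

The ever-growing dictionary is exactly the "infinite dictionary" problem: real implementations have to cap or reset it.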
LZ78 isn't adaptive; entries are forced out of the dictionary as soon as the window moves beyond where the dictionary-entry was originally defined, and if those entries are still relevant, they'll get a redundant definition in the next block.
Also, LZ78 is not recursively compressed (i.e. it doesn't compress the dictionary itself—it won't recognize when the subsequences it's working with are themselves internally redundant and would be better represented as a function with an input.)
Huffman/arithmetic/range/ANS are all "the same" insofar as they're all local-context streaming coders, i.e. things that only have the power of an FSM rather than the power you can get from requiring a pushdown automata/Turing machine.
The kind of coder I was talking about above, definitely would require a Turing machine. You couldn't implement it as a DSP — you need that unbounded-length state tape, the ability to write unbounded amounts of code to it, and the ability to jump to that code. That doesn't guarantee that it would be more powerful/better compressing; just that it could be. It wouldn't have the same inherent limits to its power that FSM-equivalent codes do.
-----
Let me go on a tangent for a bit, about what such a code probably would achieve if it could work (which should convince you, like me, that this is probably impossible somehow):
One thing I'd expect the code above to accomplish, would be to take a set of files representing images produced by capturing the sequential output frames generated by running an input movie on a NES emulator; and to emit something that looks very analogous to a cut-down version of the original ROM + input movie. (That is, the compressed data would contain 1. a set of partial spritesheets — the nodes of the prefix trie; 2. a set of partial tilemaps, referencing those spritesheets — the dictionary patch definition blocks; and 3. a sequence of bytecode instructions that use arbitrary geometric "draw" operations, jumps, loop registers, etc., to generate the screen-capture images by referencing the tilemaps that in turn reference the spritesheets — i.e. by calling ops to load/unload dictionary patches by reference. That's a "demo", in demoscene terms.)
This kind of result, if achieved, would (AFAIK) beat all currently-known compression methods. The best ANS code can't compress a sequence-of-images of SMB1 as well as just storing the ROM of SMB1 together with the input movie does. This wouldn't get quite to that degree of compression, but it would get close, and it would do it without the "implicit" Kolmogorov complexity of a NES emulator†.
† Though you could think of the output as something like the Futamura projection of a NES emulator with a specific ROM + input movie loaded into it. (It'd be a very weird NES emulator; one that runs as a batch process and emits image files of video frames. But if you had such an emulator, the output of Futamura-ing it would be very close to — or possibly better than — the output of this kind of approach. But, unlike this approach, incredibly non-portable — it would likely only run on the host architecture + OS version it was Futamura-ed on.)
In the (1D) audio domain, what I'm describing would basically involve turning PCM audio that encodes a rendering of a MIDI sequence through a set of patches, back into the inputs (a MIDI sequence and a set of patches), plus any noise/mastering effects as a residue. We can already do that in domain-specific systems, of course — that's how Melodyne's Direct Note Access works.
I'm just talking about doing it instead in a general compressor, lifting data with opaque redundancies out into graphical data-structures heuristically in order to then compress those graphical data-structures into programs that build-and-then-walk those graphs to produce the original output. A code that synthesizes something like MIDI to represent music; something like NES nametables to represent collaged images; etc. Not as well as those domain-tuned codes do, but better than local-context non-structural codes do. Basically the promise of Hierarchical Temporal Memory, but without the ML part. Purely-algorithmic HTM.
And another fun consequence of such a graphical-structuring code existing, would be that you could feed it non-trivial program binaries, and its output would be an equivalent program retargeted into an hypothetical ISA optimized for just that program. Feed it a corpus of enough such programs, all distinct, and you'll get a generally-space-optimized version of the ISA. An automatic derivation of THUMB from ARM, sorta thing.
------
And again, as I said in my sibling comment: this seems so obvious a win to me that it must be broken or impossible somehow. Tries / DAFSAs are easy. Longest-common-substring iterated to a satisfaction threshold is easy (using temporary post-constructed suffix automata). Hypothetical iterated restructuring is easy — see OptiPNG, HypoPG, etc. So, where's the impossible part where the wheels fall off? I'm pretty sure it's there.
Interesting idea! I'm not an expert, but if I'm reading you right and the decompressor is Turing-complete, you're searching for the smallest Turing machine capable of generating the input sequence?
That would indeed be an optimal compressor and would start to enable things like Solomonoff induction and AIXI.
Seems the challenge is getting feedback to guide the search for the right Turing machine. With infinite compute you can just run an ever-expanding set of Turing machines until you get one that generates the input, but without some breakthrough enabling something like stochastic gradient descent to work efficiently on non-differentiable programs, I don't see how it can be made practical.
Perhaps there is some sweet spot that's more general than current compressors but looks at a restricted class of Turing machines, which could be feasible.
I'm not an information theorist / cryptographer / etc., and so I have a strong feeling that there must be something wrong with my assumptions, given that "compressing big files using lots of memory for people to later decompress back to files using average amounts of memory" is an extremely-common need (for e.g. packaged software/OS updates.)
When a physicist sees a proof for a perpetual-motion machine, they don't ask "can this work"; they ask "where is this going wrong, that it came to this absurd conclusion?"
My post wasn't suggesting a real technique — it was me sharing my own perpetual-motion machine proof, so that someone can hopefully point out the step that's impossible, and I can then learn something from that :)
It sounds plausible to me. Real-world constraints aren't usually "you can use any amount of resources on the compressor, and any amount of compute on the decompressor for free, but the decompressor is hardcapped at exactly X MB of memory". So you might only have a theoretical improvement on existing methods, but that would still be very cool.
You can't beat existing methods by much - we're pretty close to the Shannon limit - but it's not a closed subject.
Is this really true? Surely they should be able to decompress it with virtual memory - you tell them exactly where to look for each operation. On the other hand, it will take you much longer to compress the file if your lookback window is larger than your available memory.
They're stream compressors, so they don't want to rely on being able to seek in the file. Even with swap, you'll have to cut it off somewhere; swap space isn't infinite.
Due to limited memory, I think most compression algos have a window of "stuff they've seen recently" which they use to detect duplicates. If the duplicate is of something too far back, it's fallen out of the window and been forgotten.
Compression formats have to be small themselves, so they make tradeoffs on what data they can point at to make copies of. Most formats prioritize recent data, since that tends to be the most similar.
Sliding window effects. Most compression algos don't keep the entire set of files to be compressed in memory; they have a sliding window where they only look back and ahead X bytes.
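In toy form, the match finder at the heart of an LZ77-style compressor looks something like this; note the hard cutoff: it never looks more than the window size back (32 kB here, to mirror gzip):

```python
def longest_match(data: bytes, pos: int, window: int = 32 * 1024):
    """Toy LZ77 match finder: it can only reference the last `window` bytes.
    A duplicate that sits further back has already slid out of reach."""
    best_len, best_dist = 0, 0
    for cand in range(max(0, pos - window), pos):
        length = 0
        while (pos + length < len(data)
               and data[cand + length] == data[pos + length]):
            length += 1
        if length > best_len:
            best_len, best_dist = length, pos - cand
    return best_dist, best_len
```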
I know there are fundamental limits on compression and I know I don’t know the first thing about implementing a modern compression algorithm. With that said:
> Most compression algos don't keep the entire set of files to be compressed in memory
This makes sense, but why keep the uncompressed data in memory? If the memory constraint is your biggest concern and you’re not CPU conscious, compare the outputs rather than the inputs. If you’re concerned about collisions, do a second pass to validate.
I’m certain the best minds on the topic are already aware of these and are either using them or have ruled them out for reasons I can’t anticipate. I really hope, that likely being the case, I’ll get a chance to learn in the responses and not just unexplained downvotes.
> Isn't the whole point of compression algorithms to find string of bytes that are the same?
No. The goal of compression algorithms is to describe an object approximately (lossy) or accurately (lossless) using the smallest number of [insert unit] possible. In some cases looking for strings of bytes that are the same serves well enough.
Examples of lossy algorithms that don't just look for same strings of bytes are audio and video codecs.
An example of lossless compression that works differently would be compressing "123456789" as "1 to 9".
> Seems simple to detect and handle?
Sure. If you've got infinite memory and time. Depending on your format you may also need to read the whole input before you can start writing the first byte.