Do any compression algorithms use an approximate matching technique like ssdeep to sort the files by similarity before compression starts? It might be slower, but if the compression were 15x better, it might be worth it in some scenarios.
Sorting chunks by similarity: commonly used tools don't do that. Most archive tools only sort by file type.
I wrote a tool that chunks the data (into variable-sized blocks, so it can re-sync when multiple files have prefixes of different lengths, but that's another story) and then sorts the chunks by LSH (locality-sensitive hash). LSH is used by search engines to detect similar text. It can compress directories that contain multiple versions of e.g. source code (e.g. trunk, branches) very well. https://github.com/h2database/h2database/blob/master/h2/src/...
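In case it helps to see the general idea in code, here is a minimal sketch (not my actual tool, which handles all of this more carefully): cut the data into content-defined chunks with a toy rolling hash, sort the chunks by a small MinHash signature so that similar chunks land next to each other, and hand the result to an ordinary compressor. The function names, the chunk-size mask, and the 4-hash signature are illustrative assumptions, not the real implementation.

    import hashlib
    import zlib

    MASK = (1 << 13) - 1  # ~8 KiB average chunk size (illustrative assumption)

    def chunks_of(data: bytes):
        """Content-defined chunking with a toy rolling hash (real tools use Rabin/Buzhash)."""
        out, start, h = [], 0, 0
        for i, b in enumerate(data):
            h = ((h << 1) ^ b) & 0xFFFFFFFF
            if (h & MASK) == MASK or i - start >= 64 * 1024:  # cut point, or max chunk size
                out.append(data[start:i + 1])
                start, h = i + 1, 0
        if start < len(data):
            out.append(data[start:])
        return out

    def lsh_key(c: bytes, num_hashes: int = 4):
        """Tiny MinHash signature over 8-byte shingles, used only as a sort key."""
        shingles = {c[i:i + 8] for i in range(max(len(c) - 7, 1))}
        return tuple(
            min(int.from_bytes(hashlib.blake2b(s, digest_size=8, salt=bytes([seed])).digest(), "big")
                for s in shingles)
            for seed in range(num_hashes)
        )

    def compress_with_sorted_chunks(data: bytes) -> bytes:
        cs = chunks_of(data)
        order = sorted(range(len(cs)), key=lambda i: lsh_key(cs[i]))
        # A real format would also have to store the permutation to restore the original order.
        return zlib.compress(b"".join(cs[i] for i in order), 9)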
I discussed this approach with a researcher in this area in January 2020. AFAIK there is active research here, especially for compressing DNA sequences. But he also wasn't aware of papers or research on this for general-purpose data compression.
So, I think this area is largely uncharted. I would be happy to help (as a hobby side project) if somebody wants to pursue it.
Summary:
* Regular compression algorithms use a certain "window size": this can be 32 KB, or 2 GB, or so. But sorting chunks by similarity first can be faster than using a huge window and can improve the compression ratio.
* Regular compression tools can sort by file name and file type.
* Data de-duplication tools (e.g. enterprise backup tools) chunk the data in a smart way, but then only detect exact duplicates, not near duplicates (see the sketch after this list).
* Locality sensitive hashing: it is not used in common compression tools so far AFAIK.
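To make the "exact duplicates only" point concrete, here is a minimal sketch of how such de-duplication typically works (names are my own, not any particular product): each chunk is stored once under its cryptographic hash, so a chunk that differs by even a single byte misses the lookup and is stored in full again.

    import hashlib

    def dedup(chunks):
        """chunks: any iterable of bytes objects, e.g. from content-defined chunking."""
        store = {}    # hash -> chunk, each distinct chunk stored once
        recipe = []   # ordered hashes needed to rebuild the original stream
        for c in chunks:
            h = hashlib.sha256(c).digest()
            store.setdefault(h, c)   # only byte-identical chunks match; near-duplicates don't
            recipe.append(h)
        return store, recipe

    def restore(store, recipe):
        return b"".join(store[h] for h in recipe)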
You can use binsort to sort the file list upfront by similarity of contents; the resulting order works with tar and zip, so it can be piped into any other compression you want (every once in a great while, when I can't come up with a good filename-based sort, I try out binsort). lrzip attempts something similar by doing long-range similarity detection without changing the order.
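A rough sketch of that file-level idea (not binsort itself, whose ordering is more sophisticated and whose command-line options I won't reproduce here): fingerprint each file's contents, sort the file list by the fingerprint so similar files sit next to each other in the archive, and feed the order to tar. The 1 MiB sampling and the single-MinHash fingerprint are assumptions for brevity.

    import hashlib
    import os
    import sys

    def fingerprint(path):
        """Crude content fingerprint: the minimum hash over sampled 8-byte shingles."""
        with open(path, "rb") as f:
            data = f.read(1 << 20)   # sample only the first 1 MiB (assumption)
        shingles = {data[i:i + 8] for i in range(0, max(len(data) - 7, 1), 8)}
        return min(int.from_bytes(hashlib.blake2b(s, digest_size=8).digest(), "big")
                   for s in shingles)

    if __name__ == "__main__":
        root = sys.argv[1] if len(sys.argv) > 1 else "."
        files = [os.path.join(d, name) for d, _, names in os.walk(root) for name in names]
        for path in sorted(files, key=fingerprint):
            print(path)
        # Example pipeline (GNU tar):
        #   python order_files.py src | tar -cf - --no-recursion -T - | gzip > out.tar.gz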
"Binsort - sort files by binary similarity": Thanks for the info! Yes, this seems to do something similar to what I implemented, on a file level (not chunk level). It seems to be O(n^2), while my algorithm is O(sort(n)). But having a standalone too is great!
lrzip / rzip use a 900 MB window, which was huge at the time rzip was written (1998?). But nowadays it is considered medium-sized.