
How feasible is it to store raw content in the Git content-addressable-store (CAS)? Git blobs are Zlib compressed.

I'd like to be able to store audio files uncompressed, so that they could be read directly from the CAS, rather than having to be expanded out into a checkout directory.



IIRC, a git blob is a small header recording the object type and the data size, with the data itself appended to it. It could in principle be stored uncompressed, but I don't think there's anything in the git plumbing layer that could deal with it directly.

That said, even if it is compressed, a command like git cat-file could be used to pipe the contents of the file to stdout or any other program that could use them as input without having to create a file on disk.
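For example, here's a minimal sketch in a throwaway repo (the filename and the downstream command are made up; substitute your audio decoder for `tr`):

```shell
# Stream a blob straight from the object store into another program's
# stdin, with no file materialized in the working tree.
tmp=$(mktemp -d); cd "$tmp"
git init -q .
echo "hello world" > HELLO.txt
sha=$(git hash-object -w HELLO.txt)   # write the blob into the CAS
out=$(git cat-file blob "$sha" | tr 'a-z' 'A-Z')
echo "$out"
```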


The header for a blob file is "blob", a space, the length of the content as ASCII integer representation, then a null byte.

    $ echo "hello world" > HELLO.txt
    $ git add HELLO.txt 
    $ cat .git/objects/3b/18e512dba79e4c8300dd08aeb37f8e728b8dad | \
    > zpipe -d | \
    > hexdump -e '"|"24/1 "%_p" "|\n"'
    |blob 12.hello world.|
    $
The header and the content get concatenated together, and the whole thing gets Zlib compressed. The SHA1 is calculated from the header-plus-content before it gets Zlib compressed.

    $ cat .git/objects/3b/18e512dba79e4c8300dd08aeb37f8e728b8dad | \                  
    > zpipe -d | \
    > shasum
    3b18e512dba79e4c8300dd08aeb37f8e728b8dad  -
    $
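The same digest falls out of `git hash-object`, which prepends that header for you. A quick sanity check in a throwaway repo (`sha1sum` here; `shasum` behaves the same):

```shell
# Recompute the blob id by hand: sha1("blob <len>\0<content>").
tmp=$(mktemp -d); cd "$tmp"
git init -q .
echo "hello world" > HELLO.txt
size=$(wc -c < HELLO.txt | tr -d ' ')
by_hand=$({ printf 'blob %s\0' "$size"; cat HELLO.txt; } | sha1sum | cut -d' ' -f1)
by_git=$(git hash-object HELLO.txt)
echo "$by_hand"
echo "$by_git"
```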
What I would like to do is record an audio file (e.g. LPCM BWF), take its SHA1 and store it in the CAS as raw content, then reference it somehow from a Git commit. That way it will be part of the history and will travel with `push` and `clone`, won't get gc'd, etc.

> That said, even if it is compressed, a command like git cat-file could be used to pipe the contents of the file to stdout or any other program that could use them as input without having to create a file on disk.

That's a neat suggestion! However, I don't see how it would be compatible with random access, which is important for my application.


Basically that's what Git-LFS does: it takes the SHA of the file, stores that in the Git-tracked stand-in for the file, and keeps the actual contents alongside. It's all transparent and works pretty well.
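Concretely, the Git-tracked stand-in is a tiny pointer file per the LFS spec; the values below are placeholders, not real hashes:

```
version https://git-lfs.github.com/spec/v1
oid sha256:<sha-256 of the actual file contents>
size <byte length of the actual file>
```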


Hmm, but the point of Git-LFS is to store large files outside the CAS so that they don't burden operations like `clone`. And Git-LFS does lots of magic.

Maybe to achieve what I've laid out, I really would need to write a Git extension a la Git-LFS. But then vanilla Git wouldn't be able to make full use of it, which undermines the purpose of using Git in the first place.

As an alternative, maybe I just commit the darn audio files to the repo.

• In relative terms, audio files grow smaller every year.

• Large repository size isn't as critical for a music composition tool as it is for perpetually maintained software source code.

• I'm imagining a tool to prune edit history which would consolidate commits and potentially garbage collect audio files that become unreferenced.
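The garbage-collection half of that already exists in vanilla Git: once nothing references a blob, expiring reflogs and running gc drops it. A sketch in a throwaway repo (the filename is made up):

```shell
# An unreferenced loose blob survives until gc; --prune=now drops it immediately.
tmp=$(mktemp -d); cd "$tmp"
git init -q .
printf 'scratch take' > scratch.wav
sha=$(git hash-object -w scratch.wav)   # loose blob, reachable from nothing
git cat-file -e "$sha" && echo "present"
git reflog expire --expire=now --all
git gc --prune=now --quiet
if git cat-file -e "$sha" 2>/dev/null; then kept=1; else kept=0; fi
echo "$kept"
```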

I wish there was a way in vanilla Git to just associate a CAS object containing arbitrary bytes with a commit object, though.
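For what it's worth, the plumbing layer can already do the association, with the caveat from upthread that the blob is still zlib-compressed on disk. A sketch in a throwaway repo (filenames and contents are made up):

```shell
# Write an arbitrary blob into the CAS and hang it off a commit via a tree.
tmp=$(mktemp -d); cd "$tmp"
git init -q .
git config user.email you@example.com
git config user.name  you
printf 'RIFF....WAVEfake-audio-bytes' > take1.wav   # stand-in for real audio
sha=$(git hash-object -w take1.wav)                 # blob now in the CAS
git update-index --add --cacheinfo 100644,"$sha",audio/take1.wav
tree=$(git write-tree)
commit=$(git commit-tree -m 'add take1' "$tree")
git update-ref refs/heads/main "$commit"   # reachable: survives gc, travels on push
git ls-tree -r main
```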


You can set git-lfs to automatically check out your LFS objects on clone; that's a setting. Yes, they're outside the repo, but not far off.

It's the same magic you want to do; really and truly, there's magic there, but it's a pretty thin and well-defined layer of magic.


I agree that decoupling from Git has its benefits, and I've built a tool[1] that seems to meet some of your needs above. The idea is to save binary data in a separate content-addressed store and have Git track references to specific files in said store. If you check it out, I'd be happy to hear what you think!

[1]: https://github.com/kevin-hanselman/dud


What exactly is the issue? Why don't you just use submodules? Or do you want to associate the SHA-1 of commits with files outside git? Ugh, I need sleep.


The core of cat-file.c is quite short; I think you could get the random access you want with minimal effort. Ideally, upstream support for --offset and --count (or some such) to git; a lot of people would benefit.

https://github.com/git/git/blob/master/builtin/cat-file.c
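Until such flags exist, a rough stand-in is to pipe cat-file through dd. A sketch (note this still decompresses and streams from the start of the blob, so it's not true seeking):

```shell
# Emulate --offset/--count: read 4 bytes starting at byte offset 3.
tmp=$(mktemp -d); cd "$tmp"
git init -q .
printf 'ABCDEFGHIJ' > data.bin
sha=$(git hash-object -w data.bin)
chunk=$(git cat-file blob "$sha" | dd bs=1 skip=3 count=4 2>/dev/null)
echo "$chunk"
```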

You can absolutely make tools to expand out and load git repos into content stores; how you do that will depend on the content store.


I don't know the answer to your question, but take a look at git-annex [0] if you haven't already.

One of the examples it gives is storing a music collection. If I understand correctly, it doesn't automatically compress every file, or at least it gives you the ability not to compress them.

[0]: https://git-annex.branchable.com/


You can't do that: every object is zlib-compressed (and may later be delta-encoded into pack files). Even if you manually wrote a non-zlib object, things would break the first time Git tried to access it, for example to show it, diff it against the working copy, gc, or repack.
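This is easy to confirm in a scratch repo: hand-write an object file without the zlib step and Git refuses to read it. A sketch (the content "abc" is arbitrary):

```shell
# Place a raw (uncompressed) "blob 3\0abc" where its object file would live.
tmp=$(mktemp -d); cd "$tmp"
git init -q .
id=$(printf 'blob 3\0abc' | sha1sum | cut -d' ' -f1)   # would-be object id
dir=$(printf '%s' "$id" | cut -c1-2)
file=$(printf '%s' "$id" | cut -c3-)
mkdir -p ".git/objects/$dir"
printf 'blob 3\0abc' > ".git/objects/$dir/$file"       # no zlib compression
if git cat-file -p "$id" >/dev/null 2>&1; then ok=1; else ok=0; fi
echo "$ok"
```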


Thank you, that's very helpful to know! It makes perfect sense, too.

I wonder if I can abuse the pack file format. Mua ha ha. Probably not but learning about Git innards pays dividends even if the experiments don't work out.



