
How feasible is it to store raw content in the Git content-addressable-store (CAS)? Git blobs are Zlib compressed.

I'd like to be able to store audio files uncompressed, so that they could be read directly from the CAS, rather than having to be expanded out into a checkout directory.



IIRC, a git blob is a small header recording the object type and the data size, with the data itself appended to it. It could in principle be stored uncompressed, but I don't think there's anything in the git plumbing layer that could deal with it directly.

That said, even if it is compressed, a command like git cat-file could be used to pipe the contents of the file to stdout or any other program that could use them as input without having to create a file on disk.
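For example, here's a minimal sketch in a throwaway repo (the filename and the downstream command are made up; substitute your audio decoder for `tr`):

```shell
# Stream a blob straight from the object store into another program's
# stdin, with no file materialized in the working tree.
tmp=$(mktemp -d); cd "$tmp"
git init -q .
echo "hello world" > HELLO.txt
sha=$(git hash-object -w HELLO.txt)   # write the blob into the CAS
out=$(git cat-file blob "$sha" | tr 'a-z' 'A-Z')
echo "$out"
```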


The header for a blob file is "blob", a space, the length of the content as ASCII integer representation, then a null byte.

    $ echo "hello world" > HELLO.txt
    $ git add HELLO.txt 
    $ cat .git/objects/3b/18e512dba79e4c8300dd08aeb37f8e728b8dad | \
    > zpipe -d | \
    > hexdump -e '"|"24/1 "%_p" "|\n"'
    |blob 12.hello world.|
    $
The header and the content get concatenated together, and the whole thing gets Zlib compressed. The SHA1 is calculated from the header-plus-content before it gets Zlib compressed.

    $ cat .git/objects/3b/18e512dba79e4c8300dd08aeb37f8e728b8dad | \                  
    > zpipe -d | \
    > shasum
    3b18e512dba79e4c8300dd08aeb37f8e728b8dad  -
    $
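The same digest falls out of `git hash-object`, which prepends that header for you. A quick sanity check in a throwaway repo (`sha1sum` here; `shasum` behaves the same):

```shell
# Recompute the blob id by hand: sha1("blob <len>\0<content>").
tmp=$(mktemp -d); cd "$tmp"
git init -q .
echo "hello world" > HELLO.txt
size=$(wc -c < HELLO.txt | tr -d ' ')
by_hand=$({ printf 'blob %s\0' "$size"; cat HELLO.txt; } | sha1sum | cut -d' ' -f1)
by_git=$(git hash-object HELLO.txt)
echo "$by_hand"
echo "$by_git"
```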
What I would like to do is record an audio file (e.g. LPCM BWF), take its SHA1 and store it in the CAS as raw content, then reference it somehow from a Git commit. That way it will be part of the history and will travel with `push` and `clone`, won't get gc'd, etc.

> That said, even if it is compressed, a command like git cat-file could be used to pipe the contents of the file to stdout or any other program that could use them as input without having to create a file on disk.

That's a neat suggestion! However, I don't see how it would be compatible with random access, which is important for my application.


Basically that's what Git-LFS does: it takes the SHA of the file, stores that in the Git-tracked stand-in for the file, and keeps the actual contents alongside. It's all transparent and works pretty well.
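Concretely, the Git-tracked stand-in is a tiny pointer file per the LFS spec; the values below are placeholders, not real hashes:

```
version https://git-lfs.github.com/spec/v1
oid sha256:<sha-256 of the actual file contents>
size <byte length of the actual file>
```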


Hmm, but the point of Git-LFS is to store large files outside the CAS so that they don't burden operations like `clone`. And Git-LFS does lots of magic.

Maybe to achieve what I've laid out, I really would need to write a Git extension a la Git-LFS. But then vanilla Git wouldn't be able to make full use of it, which undermines the purpose of using Git in the first place.

As an alternative, maybe I just commit the darn audio files to the repo.

• In relative terms, audio files grow smaller every year.

• Large repository size isn't as critical for a music composition tool as it is for perpetually maintained software source code.

• I'm imagining a tool to prune edit history which would consolidate commits and potentially garbage collect audio files that become unreferenced.
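The garbage-collection half of that already exists in vanilla Git: once nothing references a blob, expiring reflogs and running gc drops it. A sketch in a throwaway repo (the filename is made up):

```shell
# An unreferenced loose blob survives until gc; --prune=now drops it immediately.
tmp=$(mktemp -d); cd "$tmp"
git init -q .
printf 'scratch take' > scratch.wav
sha=$(git hash-object -w scratch.wav)   # loose blob, reachable from nothing
git cat-file -e "$sha" && echo "present"
git reflog expire --expire=now --all
git gc --prune=now --quiet
if git cat-file -e "$sha" 2>/dev/null; then kept=1; else kept=0; fi
echo "$kept"
```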

I wish there was a way in vanilla Git to just associate a CAS object containing arbitrary bytes with a commit object, though.
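For what it's worth, the plumbing layer can already do the association, with the caveat from upthread that the blob is still zlib-compressed on disk. A sketch in a throwaway repo (filenames and contents are made up):

```shell
# Write an arbitrary blob into the CAS and hang it off a commit via a tree.
tmp=$(mktemp -d); cd "$tmp"
git init -q .
git config user.email you@example.com
git config user.name  you
printf 'RIFF....WAVEfake-audio-bytes' > take1.wav   # stand-in for real audio
sha=$(git hash-object -w take1.wav)                 # blob now in the CAS
git update-index --add --cacheinfo 100644,"$sha",audio/take1.wav
tree=$(git write-tree)
commit=$(git commit-tree -m 'add take1' "$tree")
git update-ref refs/heads/main "$commit"   # reachable: survives gc, travels on push
git ls-tree -r main
```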


You can set git-lfs to automatically check out your LFS objects on clone; that's a setting. Yes, they're outside the repo, but not far off.

It's the same magic you want to do; really and truly, there's magic there, but it's a pretty thin and well-defined layer of magic.


I agree that decoupling from Git has its benefits, and I've built a tool[1] that seems to meet some of your needs above. The idea is to save binary data in a separate content-addressed store and have Git track references to specific files in said store. If you check it out, I'd be happy to hear what you think!

[1]: https://github.com/kevin-hanselman/dud


What exactly is the issue? Why don't you just use submodules? Or do you want to associate the SHA-1 of commits with files outside git? Ugh, I need sleep.


The core of cat-file.c is quite short; I think you could get the random access you want with minimal effort. Ideally, upstream support for --offset and --count (or some such) to git; a lot of people would benefit.

https://github.com/git/git/blob/master/builtin/cat-file.c
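Until such flags exist, a rough stand-in is to pipe cat-file through dd. A sketch (note this still decompresses and streams from the start of the blob, so it's not true seeking):

```shell
# Emulate --offset/--count: read 4 bytes starting at byte offset 3.
tmp=$(mktemp -d); cd "$tmp"
git init -q .
printf 'ABCDEFGHIJ' > data.bin
sha=$(git hash-object -w data.bin)
chunk=$(git cat-file blob "$sha" | dd bs=1 skip=3 count=4 2>/dev/null)
echo "$chunk"
```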

You can absolutely make tools to expand out and load git repos into content stores; how you do that will depend on the content store.


I don't know the answer to your question, but take a look at git-annex [0] if you haven't already.

One of the examples it gives is storing a music collection. If I understand correctly, it doesn't automatically compress every file, or at least it gives you the ability not to compress them.

[0]: https://git-annex.branchable.com/


You can't do that: every object is zlib-compressed (and may later be delta-encoded into pack files). Even if you manually wrote a non-zlib object, things would break the first time Git tried to access it, for example to show it, diff it against the working copy, gc, or repack.
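This is easy to confirm in a scratch repo: hand-write an object file without the zlib step and Git refuses to read it. A sketch (the content "abc" is arbitrary):

```shell
# Place a raw (uncompressed) "blob 3\0abc" where its object file would live.
tmp=$(mktemp -d); cd "$tmp"
git init -q .
id=$(printf 'blob 3\0abc' | sha1sum | cut -d' ' -f1)   # would-be object id
dir=$(printf '%s' "$id" | cut -c1-2)
file=$(printf '%s' "$id" | cut -c3-)
mkdir -p ".git/objects/$dir"
printf 'blob 3\0abc' > ".git/objects/$dir/$file"       # no zlib compression
if git cat-file -p "$id" >/dev/null 2>&1; then ok=1; else ok=0; fi
echo "$ok"
```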


Thank you, that's very helpful to know! It makes perfect sense, too.

I wonder if I can abuse the pack file format. Mua ha ha. Probably not but learning about Git innards pays dividends even if the experiments don't work out.



