
I used to be very gung ho on IPFS, until I learned that the content ID does not depend solely on the content of the file. When one puts a file into the system one can choose among different hashing algorithms, which will obviously cause the content ID to differ. However, even when using the same algorithm, the content ID will change depending on how the file is chunked. I would expect any sane system to consistently produce the same hash/content ID for a file. I can see that if the system were moving from SHA2 to SHA3, content might end up stored twice. I don't know whether they have changed things so that a consistent content ID is produced or not.
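To make the chunking point concrete, here is a toy sketch in Python (deliberately NOT the real UnixFS/dag-pb encoding, just an illustration of the mechanism): hash each chunk, then hash the concatenation of the chunk hashes to get a root. The same bytes produce a different root as soon as the chunk size changes.

    import hashlib

    def toy_root_hash(data: bytes, chunk_size: int) -> str:
        # Toy Merkle root: hash each chunk, then hash the chunk hashes.
        # Illustration only, not IPFS's actual DAG format.
        chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
        leaf_hashes = [hashlib.sha256(c).digest() for c in chunks]
        return hashlib.sha256(b"".join(leaf_hashes)).hexdigest()

    data = b"x" * 1_000_000
    print(toy_root_hash(data, 256 * 1024))  # one root
    print(toy_root_hash(data, 512 * 1024))  # different root, identical bytes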


The content ID is not the hash of the content, it is the hash of the root of the Merkle DAG that carries the content.

Doing it like that has many advantages: being able to verify hashes as small blocks are downloaded rather than only after downloading a huge file, being able to de-duplicate data, and being able to represent files, folders, and any type of linked content-addressed data structure.

As long as your content is under 4MiB you can opt out of all this and have a content ID that is exactly the hash of the content.
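A minimal sketch of that opt-out path, assuming a CIDv1 with the raw codec and sha2-256 (the 0x01/0x55/0x12 byte values come from the public multicodec tables; I believe this matches what `ipfs add --cid-version=1 --raw-leaves` produces for small files, but treat it as a sketch):

    import base64
    import hashlib

    def cidv1_raw_sha256(data: bytes) -> str:
        digest = hashlib.sha256(data).digest()
        multihash = bytes([0x12, 0x20]) + digest  # sha2-256 code, 32-byte length
        cid = bytes([0x01, 0x55]) + multihash     # CIDv1, raw codec
        # multibase base32: lowercase, unpadded, 'b' prefix
        return "b" + base64.b32encode(cid).decode().lower().rstrip("=")

    print(cidv1_raw_sha256(b"hello"))  # raw sha2-256 CIDs start with "bafkrei"

Here the content ID really is a thin, deterministic wrapper around the hash of the bytes; the unpredictability only appears once the file is big enough to be chunked into a DAG.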


As I just replied to "cle", there are some disadvantages to doing it the way that it is, because one can't predict what content ID will be produced. Perhaps the hash of the entire contents of the file could point to the hash that currently serves as the content ID; that would solve this issue. To me, IPFS does not seem useful unless this issue is solved. Also, multiple hashes (from different algorithms) of the file could point to the content ID/Merkle DAG, so if both SHA2 and SHA3 were used and one of them had a security issue, just use the one that is still OK.
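What you're describing could live as a secondary index alongside IPFS. A hypothetical sketch (my own names, not any existing IPFS API): keep a table from the plain whole-file hash, under several algorithms, to whatever CID(s) the file was published as.

    import hashlib

    # Hypothetical index, not part of IPFS: plain file hash -> known CIDs.
    plain_hash_to_cids: dict[str, set[str]] = {}

    def register(data: bytes, cid: str) -> None:
        # Index under several algorithms, as proposed above, so a weakness
        # in one algorithm doesn't take the whole mapping down.
        for algo in ("sha256", "sha3_256"):
            digest = hashlib.new(algo, data).hexdigest()
            plain_hash_to_cids.setdefault(f"{algo}:{digest}", set()).add(cid)

    def lookup(algo: str, digest: str) -> set[str]:
        return plain_hash_to_cids.get(f"{algo}:{digest}", set())

Of course the hard part is where such an index lives and who maintains it; the mapping itself is not verifiable the way a CID is.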


How would you produce the same hash for different encodings of data?


I'm not sure that I follow what you are asking. I would expect that if sha2-256 is used then the content ID would be the same. However, depending on how the content is chunked, the content ID will change. Two disadvantages that I see:

1. If new packages are produced for an open source release, could I check whether a copy is available via IPFS? No, because one can't predict how it would be chunked. So one would have to download the file and then derive a content ID, and one can only tell whether it is available if one knows the chunking algorithm that was used.

2. If I want to push a package or other binary, can I figure out whether it is already available via IPFS? No, one can't (see the sketch below).
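To illustrate disadvantage 2: checking availability is easy once you have a CID (any public gateway will tell you), but deriving that CID from the file alone is the part you can't do without knowing the original add parameters. A sketch, assuming the public ipfs.io gateway:

    import urllib.error
    import urllib.request

    def available_on_gateway(cid: str, gateway: str = "https://ipfs.io") -> bool:
        # HEAD the gateway path for this CID. A 200 means some provider
        # has it; a failure proves little (gateways time out on cold CIDs).
        req = urllib.request.Request(f"{gateway}/ipfs/{cid}", method="HEAD")
        try:
            with urllib.request.urlopen(req, timeout=10) as resp:
                return resp.status == 200
        except (urllib.error.URLError, TimeoutError):
            return False

The missing piece is the cid argument: unless you know the chunker, hash function, and CID version used by whoever added the package, you can't compute it from the package bytes.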


Wow good to know!



