
There needs to be a global effort to back up the Internet Archive at this point.


Just need to find someone with ~220 PB of storage and the ability to increase that by approximately 50% annually, forevermore.
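
For a sense of what that compounding means (a sketch, assuming the 50%/year rate holds):

  # rough projection, assuming ~220 PB today and 50% growth per year
  size_pb = 220
  for year in range(1, 11):
      size_pb *= 1.5
      print(f"year {year}: ~{size_pb:,.0f} PB")
  # after 10 years that's ~12,700 PB, so the storage budget compounds fast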


That's only about 38 racks of storage, at a cost of ~$3.5M for the hard drives (redundancy included). Not that big, in the grand scheme of things.
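
Back-of-envelope for those figures, with guessed parameters (20 TB drives at ~$230 each in bulk, ~1.4x redundancy overhead, ~400 drives per rack; none of this is IA's real setup):

  # rough drive count / cost; every parameter here is an assumption
  data_pb   = 220
  overhead  = 1.4      # assumed redundancy (erasure coding / replication)
  drive_tb  = 20       # assumed drive size
  drive_usd = 230      # assumed bulk price per drive

  drives = round(data_pb * 1000 * overhead / drive_tb)     # -> 15,400 drives
  print(f"disk cost: ~${drives * drive_usd / 1e6:.1f}M")   # -> ~$3.5M
  print(f"racks: ~{drives / 400:.0f} at ~400 drives/rack") # -> ~38, close to the figure above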


That actually sounds remarkably accessible. Considering how much of a donation you need to make for naming rights to a rural university professorship/library building, surely this would appeal to some freshly minted startup decamillionaire with a slight peterthielite anti-establishment bent?


Actually 15 racks if you're using Backblaze storage pods. Which, now that I think about it, is about how many racks I saw in the various rooms of the church. [I just happened to be at IA headquarters last weekend.] The storage pod hardware itself would be another $1M, and then let's assume another $0.5M for various things I'm not considering (network equipment, power transformers, etc.). Still just $5M for the base hardware to store that info.
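
Summing the estimates in this thread (a sketch; none of these are official IA numbers):

  # adding up the guesses from this thread
  drives_usd = 3.5e6   # hard drives (from the comment upthread)
  pods_usd   = 1.0e6   # storage pod chassis
  misc_usd   = 0.5e6   # switches, power, spares, etc.

  total = drives_usd + pods_usd + misc_usd
  print(f"~${total / 1e6:.1f}M up front, ~${total / (220 * 1000):.0f} per TB archived")
  # -> ~$5.0M, or roughly $23 per TB of data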

Yeah, pretty affordable.


Well, buying 220 PB of storage space is really not the problem nowadays, at least from a cost perspective. But you need to maintain all that stuff: what happens when a disk fails, what happens when a network switch fails, how do you update your software at scale, and so on.

I think it would be best to put it on AWS S3 Glacier Deep Archive for about $2.5 million per year.
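
For reference, at Glacier Deep Archive's list price of roughly $0.00099 per GB-month (US regions; retrieval and request fees not included), the storage-only math looks like this:

  # storage-only cost for 220 PB in S3 Glacier Deep Archive (assumes ~$0.00099/GB-month)
  data_gb = 220 * 1000 * 1000                # 220 PB in GB, decimal units
  yearly  = data_gb * 0.00099 * 12
  print(f"~${yearly / 1e6:.1f}M per year")   # -> ~$2.6M/year, before any retrievals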


$2.5 million per year is about 10x what the worst-case ongoing costs would be.


I doubt that you can do it cheaper. Permanently archiving the whole internet is an ongoing task that basically requires a small company; that's why the Internet Archive (169 employees) exists (and it costs more than $2.5 million per year to run). It is not done just by buying a huge bunch of disks. Setting up a permanent stream to S3 is the only solution I can think of that a single human could handle.


When you have that much data stored, how do you actually know the data is still there and can be retrieved? Even if you have absolutely insane connectivity to it, at some point don't you run out of time to check it? Apparently 200 PiB at 1 GiB per second would take about 58,254 hours to retrieve.
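
That number checks out:

  # time to read 200 PiB through a single sustained 1 GiB/s pipe
  data_gib = 200 * 1024**2                 # 200 PiB in GiB
  hours = data_gib / 1 / 3600              # 1 GiB per second
  print(f"{hours:,.0f} hours, ~{hours / 24 / 365:.1f} years")
  # -> ~58,254 hours, i.e. roughly 6.7 years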


It's not like it's all coming from one disk, or going to one single CPU.

20 TB drives with 500 MB/s sequential read are available today. Reading the whole disk takes about half a day.

If your storage pod has 12 of those, even a $50 N100 CPU can run xxHash at 6 GB/s (it could probably even manage MurmurHash).
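
A sketch of that per-pod scrub, using the numbers above (12 x 20 TB drives at ~500 MB/s each) and the xxhash Python bindings; the chunked-read loop is illustrative, not anyone's actual scrubber:

  # sketch: checksum-scrubbing drives in a pod and checking the aggregate timing
  import xxhash

  def scrub(path, chunk_mb=64):
      """Stream one drive (or file) through xxHash and return its digest."""
      h = xxhash.xxh3_64()
      with open(path, "rb") as f:
          while chunk := f.read(chunk_mb * 1024 * 1024):
              h.update(chunk)
      return h.hexdigest()

  # 12 drives * 500 MB/s = 6 GB/s aggregate, which the CPU can keep up with,
  # so scrubbing a full 240 TB pod takes about 240e12 / 6e9 seconds:
  print(240e12 / 6e9 / 3600, "hours")   # -> ~11.1 hours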


I won't even pretend to know how to begin with this type of project.


Crawlers with jobs, building searchable indexes? Similar to YouTube: down at the source it's blobs, but above it all floats a layer of tags, metrics, and searchable text. That is what the searches run against and what the recommendation algorithm builds its lineup from?
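
A toy sketch of that split, with made-up names (content-addressed blobs underneath, a small searchable tag index on top):

  # hypothetical blob store + metadata layer; searches only ever touch the index
  import hashlib
  from collections import defaultdict

  blobs = {}                    # sha256 digest -> raw bytes (the "source" layer)
  index = defaultdict(set)      # tag/term -> digests (the searchable layer on top)

  def store(data: bytes, tags):
      digest = hashlib.sha256(data).hexdigest()
      blobs[digest] = data
      for tag in tags:
          index[tag.lower()].add(digest)
      return digest

  def search(term):
      return index[term.lower()]

  d = store(b"<html>...</html>", ["example.com", "2024", "html"])
  print(search("example.com") == {d})   # -> True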


There is, at least with books etc.:

https://annas-archive.org/torrents




