That actually sounds remarkably accessible. Considering how much of a donation you need to make for naming rights to a rural university professorship/library building, surely this would appeal to some freshly minted startup decamillionaire with a slight peterthielite anti-establishment bent?
Actually 15 racks if you’re using Backblaze Storage Pods, which, now that I think about it, is about how many racks I saw in the various rooms of the church. [I just happened to be at IA headquarters last weekend.] The Storage Pod hardware itself would be another $1M, and then let’s assume another $0.5M for various things I’m not considering (network equipment, power transformers, etc.). Still just $5M for the base hardware to store that info.
Well, buying 220 PB of storage space is really not the problem nowadays, at least from a cost perspective. But you need to maintain all that stuff. What happens when a disk fails? What if a network switch fails? How do you update your software at scale? And so on.
I think it would be best to put it on AWS S3 Glacier Deep Archive for about 2.5 million dollars per year.
I doubt that you can do it cheaper. Permanently archiving the whole internet is an ongoing task that basically requires a small company; that's why the Internet Archive (169 employees) exists (and it costs more than 2.5 million dollars per year). It is not done by buying a huge bunch of disks. Setting up a permanent stream to S3 is the only solution I can think of that a single person could handle.
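For anyone who wants to sanity-check the "about 2.5 million dollars per year" figure, here is a rough sketch of the arithmetic. It assumes a Deep Archive storage price of $0.00099 per GB-month (my assumption, not from the thread; check current AWS pricing for your region) and decimal petabytes. The function name is just illustrative.

```python
# Assumed S3 Glacier Deep Archive storage price in USD per GB-month.
# This value is an assumption for illustration; verify against current AWS pricing.
PRICE_PER_GB_MONTH = 0.00099


def yearly_storage_cost(petabytes: float,
                        price_per_gb_month: float = PRICE_PER_GB_MONTH) -> float:
    """Return the yearly storage-only cost in USD for the given capacity."""
    gigabytes = petabytes * 1_000_000  # decimal PB -> GB
    return gigabytes * price_per_gb_month * 12  # 12 billing months


cost = yearly_storage_cost(220)
print(f"${cost:,.0f} per year")
```

Under those assumptions it lands in the same ballpark as the figure quoted above. Note this covers storage only; request, retrieval, and data transfer charges would come on top.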
Whenever you have that much data stored, how do you actually know the data is still there and can be retrieved? Even if you have absolutely insane connectivity to it, at some point don't you run out of time to check it? Apparently 200 PiB at 1 GiB per second would take about 58,254 hours to retrieve, which is more than six and a half years.
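The retrieval-time estimate is easy to reproduce. A minimal sketch, assuming binary units throughout (1 PiB = 1024 × 1024 GiB) and a perfectly sustained read rate, which is optimistic:

```python
def retrieval_hours(pebibytes: float, gib_per_second: float) -> float:
    """Hours needed to read the given volume at a constant throughput."""
    gibibytes = pebibytes * 1024 * 1024  # PiB -> TiB -> GiB
    seconds = gibibytes / gib_per_second
    return seconds / 3600


hours = retrieval_hours(200, 1)
years = hours / (24 * 365)  # ignoring leap years
print(f"{hours:,.0f} hours, about {years:.1f} years")
```

Reading in parallel changes the picture: at the same per-stream rate, 100 independent streams would cut the wall-clock time by a factor of 100, provided the storage backend can actually serve them.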
Crawlers with jobs, building searchable indexes? Similar to YouTube. Down at the source it's blobs, but above it all floats a layer of tags, metrics and searchable text. That is what the searches run against and what the preferences algo builds its lineup against?