
Gwern has a good summary of the research in this: https://www.gwern.net/Archiving%20URLs

> In a 2003 experiment, Fetterly et al. discovered that about one link out of every 200 disappeared each week from the Internet. McCown et al 2005 discovered that half of the URLs cited in D-Lib Magazine articles were no longer accessible 10 years after publication [the irony!], and other studies have shown link rot in academic literature to be even worse (Spinellis, 2003, Lawrence et al., 2001). Nelson and Allen (2002) examined link rot in digital libraries and found that about 3% of the objects were no longer accessible after one year. Bruce Schneier remarks that one friend experienced 50% linkrot in one of his pages over less than 9 years (not that the situation was any better in 1998), and that his own blog posts link to news articles that go dead in days; Vitorio checks bookmarks from 1997, finding that hand-checking indicates a total link rot of 91% with only half of the dead available in sources like the Internet Archive; the Internet Archive itself has estimated the average lifespan of a Web page at 100 days. A Science study looked at articles in prestigious journals; they didn’t use many Internet links, but when they did, 2 years later ~13% were dead. The French company Linterweb studied external links on the French Wikipedia before setting up their cache of French external links, and found - back in 2008 - already 5% were dead. (The English Wikipedia has seen a 2010-2011 spike from a few thousand dead links to ~110,000 out of ~17.5m live links.) The dismal studies just go on and on and on (and on). Even in a highly stable, funded, curated environment, link rot happens anyway. For example, about 11% of Arab Spring-related tweets were gone within a year (even though Twitter is - currently - still around).



There's an automatic system on Wikipedia now which attempts to rescue dead links by finding the page in the Internet Archive and updating the Wikipedia page accordingly.


It should also do the reverse -- find links in wikipedia that aren't in archive.org and initiate an archival task.
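That direction is easy to sketch against the Wayback Machine's public endpoints: the availability API tells you whether a snapshot exists, and Save Page Now lets you request one. A rough sketch (the link list, timeouts, and delay are placeholders; a real bot would need proper rate limiting and error handling):

    import time
    import requests

    # Hypothetical list of external links scraped from an article.
    links = ["https://example.com/some/cited/page"]

    for url in links:
        resp = requests.get("https://archive.org/wayback/available",
                            params={"url": url}, timeout=30)
        closest = resp.json().get("archived_snapshots", {}).get("closest")
        if closest and closest.get("available"):
            print("already archived:", closest["url"])
        else:
            # Not in the Wayback Machine yet: ask Save Page Now to grab it.
            requests.get("https://web.archive.org/save/" + url, timeout=120)
            print("submitted for archiving:", url)
        time.sleep(5)  # be polite; both endpoints are rate limited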


A few years ago, there was a bot automatically submitting all links to archive.is and adding the archive links to Wikipedia. It got blocked and the site banned for spam. There was another discussion about it last year, and the consensus was to remove the site from the spam list so that links would be allowed again. (Not sure if that actually happened or not though.)

If you're curious, take a look at the discussions at the following links:

- https://en.wikipedia.org/wiki/Wikipedia:Archive.is_RFC

- https://en.wikipedia.org/wiki/Wikipedia:Archive.is_RFC_2

- https://en.wikipedia.org/wiki/Wikipedia:Archive.is_RFC_3

- https://en.wikipedia.org/wiki/Wikipedia:Archive.is_RFC_4

I'm not sure what the current status of automatically archiving links is, but as you can see, the idea has been attempted before.


Those RFCs seem to have nothing to do with my suggestion.

In Wikipedia's usual frustrating manner, it's unclear to me what was even going on to trigger those RFCs or why people thought it was a problem. For some reason they were upset with links to archive.is. But why? Was archive.is replacing working links with archive links, or something?

Edit: From what I can tell, the archive.is bot was doing the same thing the archive.org bot Animats mentioned was doing. It's just archive.is didn't follow Wikipedia's policies and procedures.


I'm pretty sure that's what's happening now too.


What about personal caches?

I know that few people likely keep locally saved Web pages in near-line accessibility. But if you build the tool, I would happily mount the nearly 5 terabytes of UltraSCSI drives I've got in storage, which are largely Squid cache snapshots from my office WAN proxy box.
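Pulling the stored objects back off an old UFS-style Squid cache can be roughed out along these lines; the on-disk layout varies by Squid version, so this sketch just scans for the HTTP reply marker instead of parsing the swap metadata properly, and the paths are assumptions:

    import os

    CACHE_DIR = "/mnt/old-squid/cache"   # hypothetical mount point for the old drives
    OUT_DIR = "extracted"

    os.makedirs(OUT_DIR, exist_ok=True)
    count = 0
    for root, _, files in os.walk(CACHE_DIR):
        for name in files:
            with open(os.path.join(root, name), "rb") as f:
                blob = f.read()
            start = blob.find(b"HTTP/1.")
            if start == -1:
                continue  # not a stored HTTP object, or a layout this sketch doesn't handle
            # Try to recover the original URL from the swap metadata preceding the reply.
            url_pos = blob.find(b"http://")
            url = None
            if 0 <= url_pos < start:
                url = blob[url_pos:start].split(b"\x00")[0].decode("latin-1", "replace")
            count += 1
            with open(os.path.join(OUT_DIR, "%08d.http" % count), "wb") as out:
                if url:
                    out.write(b"X-Recovered-URL: " + url.encode("latin-1", "replace") + b"\r\n")
                out.write(blob[start:])
    print("extracted %d objects" % count)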


You should add them to IPFS; it sounds ideal for your use case.
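If you go that route, a minimal sketch (assuming the go-ipfs/kubo CLI is installed and the extracted pages sit in a local directory) is just a wrapper around ipfs add:

    import subprocess

    # Add the directory of extracted pages recursively; -Q prints only the final CID.
    result = subprocess.run(
        ["ipfs", "add", "-r", "-Q", "extracted"],  # "extracted" is the hypothetical output dir
        capture_output=True, text=True, check=True,
    )
    cid = result.stdout.strip()
    print("pinned as /ipfs/" + cid)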


Thanks for all your help with suggestions everyone!

I'm more inclined to use Apache Lucene.

Both because of the abundance of people doing similar extraction from comparable data stores, and because Lucene is part of the Hitachi data suites which are my business choice... assuming I ever return to business.

I mention Hitachi only because, with Lucene anointed in a Tier 1 product, if I ever need to provide an audit that is relied upon to attest that any sensitive data has been purged from the cache, I'm sure it will be acceptable in the format of such a high-end system's report.

Of course that's window dressing, but my aim is to get big businesses to contribute more old data... exactly the sort who might aggressively retain data for their own reasons, compliance and security being the usual ones.

Well, those donors are the ones who need the reassurance of a more established vendor.

I wrote above that the delightful ideas this prompted made me forget about dinner last night, so I will stop here.

But I cannot help but feel in my gut this is a real and viable project that can snowball and be a huge thing.

Oh, sorry, I misled you; I have to add what just now came to my mind: anything that encourages big companies to hand over really old data archives catches them just when they are ripe for being sold new bulk capacity. The need to refresh the media is so overlooked, even by businesses that otherwise have great practices. So you can nudge even the least hungry prospective customers towards your sales team.

Can't a vendor jump on this one?

Edit: formatting


Can someone enlighten me so that I can better understand the moderation of my comment above?

I don't understand, unless maybe I was gushing somewhat. I do that. I'm female. And despite decades of writing programs, mainly for my personal research, my experience of computing is a non-technical one, from a traditional media business in a corporate setting.

So when I look around here, I am very happy not to see references to big-ticket kit, and the feeling of independence is unquestionably refreshing.

But I cannot recall any acknowledgement of the tooling that corporations habitually have to hand. That GitLab controversy, I thought, hardly touched on the issue of buying the level of service that major enterprises require and do obtain from the market, versus using open source software for singularly critical infrastructure without any analysis of the alternatives. It would be very boring to discuss storage products; it's hard to get real information about them, even despite the arrival of a host of new companies. But the debate didn't touch a single advantage or aspect of the options that exist.

I'm in total darkness trying to guess where I spoke improperly. I thought I might prompt someone who sells hardware support for, say, Ceph clusters to comment, using the idea to raise awareness of their own business.

But a silent, cold zero feels badly in need of some sort of company to commiserate with the comment I wrote. I can totally grok it if the Shut Up Woman reflex is one I have prompted you to feel. My bad, I usually have a lid on the gush. Is the idea so awful?

I only know that the commission on storage systems is still enormous, and I think younger HN types in particular could make sheer bank in an EDS-style sales setting, where real understanding of technology is, well, imperfect. It seems like the HN world doesn't acknowledge the corporate world much. That's a pity, if true. Below the household-name corporations are so many companies in real need of serious minds, and I know personally that the salaries are not as bad as the statistics make them look when debated here. I mean, do you think a typical character here is going to be a bottom-decile performer, or a top-centile producer, compared against the big wobbly world of corporate computer departments?

Edit: horrid autocorrect


I can't see anything wrong in what you typed, so you're not alone.


Yes, does anything like this exist? I would love to remember all of my surfing. I currently bookmark nearly 100% of my visited links.


I use wget (as the other reply said) and SiteSucker. I find SiteSucker seems to do a slightly better job at times and requires less finessing, which is useful when I'm about to hop on a plane for a long flight but want to grab documentation or a website to browse through.

http://ricks-apps.com/osx/sitesucker/index.html


wget is probably the tool you are looking for. You need to do a bit of work to get the options right. Ones to consider are --input-file=file, --level=1, --convert-links, --page-requisites, --follow-ftp, --span-hosts, and --adjust-extension. You can just export your bookmarks or history or whatever as a file full of links and use that as input to wget, and it will retrieve all of them.
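Putting those together might look something like this (a sketch wrapped in Python's subprocess; links.txt is assumed to be your exported list, and --directory-prefix is only there to keep the output in one place):

    import subprocess

    # links.txt: your exported bookmarks/history, one URL per line.
    subprocess.run([
        "wget",
        "--input-file=links.txt",
        "--level=1",              # don't recurse beyond each listed page
        "--page-requisites",      # fetch the images/CSS/JS each page needs
        "--convert-links",        # rewrite links so pages work offline
        "--adjust-extension",     # add .html etc. so files open locally
        "--span-hosts",           # requisites often live on other hosts
        "--follow-ftp",
        "--directory-prefix=archive",  # keep everything under ./archive
    ], check=False)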

There are other handy tools like youtube-dl, which lets you archive not just YouTube videos but many other types of media content, including SoundCloud and Bandcamp.
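A minimal sketch of driving youtube-dl from Python; the URLs and options here are placeholders, and the same call works for SoundCloud and Bandcamp pages:

    import youtube_dl  # pip install youtube_dl (yt-dlp keeps the same API)

    opts = {
        "outtmpl": "archive/%(extractor)s/%(title)s-%(id)s.%(ext)s",  # where files land
        "download_archive": "downloaded.txt",  # skip anything already fetched
    }
    with youtube_dl.YoutubeDL(opts) as ydl:
        ydl.download([
            "https://www.youtube.com/watch?v=EXAMPLE",        # placeholder URLs
            "https://soundcloud.com/some-artist/some-track",
        ])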


I don't create web pages often, but whenever I link to something, I always generate a backup PDF.
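One way to automate that, assuming a headless Chromium is available (wkhtmltopdf or the browser's own print-to-PDF would work just as well; the binary name below is an assumption):

    import os
    import subprocess

    def backup_pdf(url, out_path):
        # Render the linked page to a PDF with headless Chromium.
        os.makedirs(os.path.dirname(out_path) or ".", exist_ok=True)
        subprocess.run([
            "chromium",               # or "google-chrome", depending on the install
            "--headless",
            "--disable-gpu",
            "--print-to-pdf=" + out_path,
            url,
        ], check=True)

    backup_pdf("https://example.com/cited-article", "backups/cited-article.pdf")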




Very interesting, I was thinking just an hour ago that IPFS would be perfect for references in academic literature, as it's immutable and content-addressed (though something still has to pin the content to keep it available).



