
Gwern has a good summary of the research in this: https://www.gwern.net/Archiving%20URLs

> In a 2003 experiment, Fetterly et al. discovered that about one link out of every 200 disappeared each week from the Internet. McCown et al 2005 discovered that half of the URLs cited in D-Lib Magazine articles were no longer accessible 10 years after publication [the irony!], and other studies have shown link rot in academic literature to be even worse (Spinellis, 2003, Lawrence et al., 2001). Nelson and Allen (2002) examined link rot in digital libraries and found that about 3% of the objects were no longer accessible after one year. Bruce Schneier remarks that one friend experienced 50% linkrot in one of his pages over less than 9 years (not that the situation was any better in 1998), and that his own blog posts link to news articles that go dead in days; Vitorio checks bookmarks from 1997, finding that hand-checking indicates a total link rot of 91% with only half of the dead available in sources like the Internet Archive; the Internet Archive itself has estimated the average lifespan of a Web page at 100 days. A Science study looked at articles in prestigious journals; they didn’t use many Internet links, but when they did, 2 years later ~13% were dead. The French company Linterweb studied external links on the French Wikipedia before setting up their cache of French external links, and found - back in 2008 - already 5% were dead. (The English Wikipedia has seen a 2010-2011 spike from a few thousand dead links to ~110,000 out of ~17.5m live links.) The dismal studies just go on and on and on (and on). Even in a highly stable, funded, curated environment, link rot happens anyway. For example, about 11% of Arab Spring-related tweets were gone within a year (even though Twitter is - currently - still around).



There's an automatic system on Wikipedia now which attempts to rescue dead links by finding the page in the Internet Archive and updating the Wikipedia page accordingly.


It should also do the reverse -- find links in wikipedia that aren't in archive.org and initiate an archival task.
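That direction is easy to sketch against the Wayback Machine's public endpoints: the availability API tells you whether a snapshot exists, and Save Page Now lets you request one. A rough sketch (the link list, timeouts, and delay are placeholders; a real bot would need proper rate limiting and error handling):

    import time
    import requests

    # Hypothetical list of external links scraped from an article.
    links = ["https://example.com/some/cited/page"]

    for url in links:
        resp = requests.get("https://archive.org/wayback/available",
                            params={"url": url}, timeout=30)
        closest = resp.json().get("archived_snapshots", {}).get("closest")
        if closest and closest.get("available"):
            print("already archived:", closest["url"])
        else:
            # Not in the Wayback Machine yet: ask Save Page Now to grab it.
            requests.get("https://web.archive.org/save/" + url, timeout=120)
            print("submitted for archiving:", url)
        time.sleep(5)  # be polite; both endpoints are rate limited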


A few years ago, there was a bot automatically submitting all links to archive.is and adding the archive links to Wikipedia. It got blocked and the site banned for spam. There was another discussion about it last year, and the consensus was to remove the site from the spam list so that links would be allowed again. (Not sure if that actually happened or not though.)

If you're curious, take a look at the discussions at the following links:

- https://en.wikipedia.org/wiki/Wikipedia:Archive.is_RFC

- https://en.wikipedia.org/wiki/Wikipedia:Archive.is_RFC_2

- https://en.wikipedia.org/wiki/Wikipedia:Archive.is_RFC_3

- https://en.wikipedia.org/wiki/Wikipedia:Archive.is_RFC_4

I'm not sure what the current status of automatically archiving links is, but as you can see, the idea has been attempted before.


Those RFCs seem to have nothing to do with my suggestion.

In Wikipedia's usual frustrating manner, it's unclear to me what was even going on to trigger those RFCs or why people thought it was a problem. For some reason they were upset with links to archive.is. But why? Was archive.is replacing working links with archive links, or something?

Edit: From what I can tell, the archive.is bot was doing the same thing the archive.org bot Animats mentioned was doing. It's just archive.is didn't follow Wikipedia's policies and procedures.


I'm pretty sure that's what's happening now too.


What about personal caches?

I know that few people likely keep locally saved Web pages in near-line accessibility. But if you build the tool, I would happily mount the nearly 5 terabytes of UltraSCSI drives I've got in storage, which are largely Squid cache snapshots from my office WAN proxy box.
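Pulling the stored objects back off an old UFS-style Squid cache can be roughed out along these lines; the on-disk layout varies by Squid version, so this sketch just scans for the HTTP reply marker instead of parsing the swap metadata properly, and the paths are assumptions:

    import os

    CACHE_DIR = "/mnt/old-squid/cache"   # hypothetical mount point for the old drives
    OUT_DIR = "extracted"

    os.makedirs(OUT_DIR, exist_ok=True)
    count = 0
    for root, _, files in os.walk(CACHE_DIR):
        for name in files:
            with open(os.path.join(root, name), "rb") as f:
                blob = f.read()
            start = blob.find(b"HTTP/1.")
            if start == -1:
                continue  # not a stored HTTP object, or a layout this sketch doesn't handle
            # Try to recover the original URL from the swap metadata preceding the reply.
            url_pos = blob.find(b"http://")
            url = None
            if 0 <= url_pos < start:
                url = blob[url_pos:start].split(b"\x00")[0].decode("latin-1", "replace")
            count += 1
            with open(os.path.join(OUT_DIR, "%08d.http" % count), "wb") as out:
                if url:
                    out.write(b"X-Recovered-URL: " + url.encode("latin-1", "replace") + b"\r\n")
                out.write(blob[start:])
    print("extracted %d objects" % count)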


You should add them to IPFS; it sounds ideal for your use case.
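If you go that route, a minimal sketch (assuming the go-ipfs/kubo CLI is installed and the extracted pages sit in a local directory) is just a wrapper around ipfs add:

    import subprocess

    # Add the directory of extracted pages recursively; -Q prints only the final CID.
    result = subprocess.run(
        ["ipfs", "add", "-r", "-Q", "extracted"],  # "extracted" is the hypothetical output dir
        capture_output=True, text=True, check=True,
    )
    cid = result.stdout.strip()
    print("pinned as /ipfs/" + cid)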


Thanks for all your help with suggestions everyone!

I'm more inclined to use Apache Lucene.

Both because of the abundance of people doing similar extraction from comparable data stores, and because Lucene is part of the Hitachi data suites which are my business choice... assuming I ever return to business.

I mention Hitachi only because, with Lucene anointed in a Tier 1 product, if I ever need to provide an audit that is relied upon to attest that any sensitive data has been purged from the cache, I'm sure it will be acceptable in the format of such a high-end system's report.

Of course that's window dressing, but my aim is to get big businesses to contribute more old data... exactly the sort who might aggressively retain data for their own reasons, compliance and security being the usual ones.

Well, those donors are the ones who need the reassurance of a more established vendor.

I wrote above that the delightful ideas this prompted made me forget about dinner last night, so I will stop here.

But I cannot help but feel in my gut this is a real and viable project that can snowball and be a huge thing.

Oh, sorry, I misled you; I have to add what just now came to my mind: anything that encourages big companies to hand over really old data archives catches them just when they are ripe for being sold new bulk capacity. The need to refresh the media is so overlooked, even by businesses that otherwise have great practices. So you can nudge even the least hungry prospective customers towards your sales team.

Can't a vendor jump on this one?

Edit: formatting


Can someone enlighten me so that I can better understand the moderation of my comment above?

I don't understand, unless maybe I was gushing somewhat. I do that. I'm female. And despite decades of writing programs, mainly for my personal research, my experience of computing is a non-technical one, from a traditional media business in a corporate setting.

So when I look around here, I am very happy not to see references to big-ticket kit, and the feeling of independence is unquestionably refreshing.

But I cannot recall any acknowledgement of the tooling that corporations habitually have to hand. That GitLab controversy, I thought, hardly touched on the issue of buying the level of service that major enterprises require and do obtain from the market, versus using open source software for singularly critical infrastructure without any analysis of the alternatives. It would be very boring to discuss storage products; it's hard to get real information about them, even despite the arrival of a host of new companies. But the debate didn't touch a single advantage or aspect of the options that exist.

I'm in total darkness trying to guess where I spoke improperly. I thought I might prompt someone who sells hardware support for, say, Ceph clusters to comment, using the idea to raise awareness of their own business.

But a silent, cold zero feels badly in need of some sort of company to commiserate with the comment I wrote. I can totally grok it if the Shut Up Woman reflex is one I have prompted you to feel. My bad, I usually have a lid on the gush. Is the idea so awful?

I only know that the commission on storage systems is still enormous, and I think younger HN types in particular could make sheer bank in an EDS-style sales setting, where real understanding of technology is, well, imperfect. It seems like the HN world doesn't acknowledge the corporate world much. That's a pity, if true. Below the household-name corporations are so many companies in real need of serious minds, and I know personally that the salaries are not as bad as the statistics make them look when debated here. I mean, do you think a typical character here is going to be a bottom-decile performer, or a top-centile producer, compared against the big wobbly world of corporate computer departments?

Edit: horrid autocorrect


I can't see anything wrong in what you typed, so you're not alone.


Yes, does anything like this exist? I would love to remember all of my surfing. I currently bookmark nearly 100% of my visited links.


I use wget (as the other reply said) and SiteSucker. I find SiteSucker seems to do a slightly better job at times and requires less finessing, which is useful when I'm about to hop on a plane for a long flight but want to grab documentation or a website to browse through.

http://ricks-apps.com/osx/sitesucker/index.html


wget is probably the tool you are looking for. You need to do a bit of work to get the options right. Ones to consider are --input-file=file, --level=1, --convert-links, --page-requisites, --follow-ftp, --span-hosts, and --adjust-extension. You can just export your bookmarks or history or whatever as a file full of links and use that as input to wget, and it will retrieve all of them.
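Putting those together might look something like this (a sketch wrapped in Python's subprocess; links.txt is assumed to be your exported list, and --directory-prefix is only there to keep the output in one place):

    import subprocess

    # links.txt: your exported bookmarks/history, one URL per line.
    subprocess.run([
        "wget",
        "--input-file=links.txt",
        "--level=1",              # don't recurse beyond each listed page
        "--page-requisites",      # fetch the images/CSS/JS each page needs
        "--convert-links",        # rewrite links so pages work offline
        "--adjust-extension",     # add .html etc. so files open locally
        "--span-hosts",           # requisites often live on other hosts
        "--follow-ftp",
        "--directory-prefix=archive",  # keep everything under ./archive
    ], check=False)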

There are other handy tools like youtube-dl, which lets you archive not just YouTube videos but many other types of media content, including SoundCloud and Bandcamp.
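A minimal sketch of driving youtube-dl from Python; the URLs and options here are placeholders, and the same call works for SoundCloud and Bandcamp pages:

    import youtube_dl  # pip install youtube_dl (yt-dlp keeps the same API)

    opts = {
        "outtmpl": "archive/%(extractor)s/%(title)s-%(id)s.%(ext)s",  # where files land
        "download_archive": "downloaded.txt",  # skip anything already fetched
    }
    with youtube_dl.YoutubeDL(opts) as ydl:
        ydl.download([
            "https://www.youtube.com/watch?v=EXAMPLE",        # placeholder URLs
            "https://soundcloud.com/some-artist/some-track",
        ])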


I don't create web pages often, but whenever I link to something, I always generate a backup PDF.
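One way to automate that, assuming a headless Chromium is available (wkhtmltopdf or the browser's own print-to-PDF would work just as well; the binary name below is an assumption):

    import os
    import subprocess

    def backup_pdf(url, out_path):
        # Render the linked page to a PDF with headless Chromium.
        os.makedirs(os.path.dirname(out_path) or ".", exist_ok=True)
        subprocess.run([
            "chromium",               # or "google-chrome", depending on the install
            "--headless",
            "--disable-gpu",
            "--print-to-pdf=" + out_path,
            url,
        ], check=True)

    backup_pdf("https://example.com/cited-article", "backups/cited-article.pdf")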




Very interesting, I was thinking just an hour ago that IPFS would be perfect for references in academic literature, as it's immutable and content-addressed (though something still has to pin the content to keep it available).



