Surprisingly, the default for the Internet Archive is Don’t Archive

NEWS FLASH! My wayback is back!

Let me first say that the Internet Archive is one of the most important and impressive crowning achievements we have for preserving and understanding what is truly a transient information space. And I rely regularly on the Wayback Machine to research, discover, and fall into interesting rabbit holes.

And let me add that I am most likely not understanding enough here, but I am rather worried about what I have discovered about what you can find in the archive for sites you contributed to in the past but do not manage in the present.

Apparently it hinges totally on access to a text file sitting on the current domain.

As alluded to recently, it makes for a fragile archive.

Allow me to play this back.

I was responsible for maybe 10 GB of early web stuff in my position from 1992 to 2006 as an instructional technologist for the Maricopa Center for Learning & Instruction (MCLI) at the Maricopa Community Colleges. I am not 100% sure, but the MCLI web site might have been up before Maricopa had one. I wrote a very early HTML tutorial that was my first “break” in online space (archived myself at http://mcli.cogdogblog.com/tut). Active in the mid 1990s in multimedia, I built Director Web, the leading resource for Macromedia Director, before Macromedia as a company had a web site.
This all existed at the long, unwieldy URL http://www.mcli.dist.maricopa.edu/
The site continued after I left in 2006; in early January 2009 the center changed and moved its site to http://mcli.maricopa.edu (how do I know this? Why, because of the Wayback Machine).
All my old stuff was gone; I understand, because a lot of it was built on perl scripts doing things like writing information to open text files. It was a mess. But the reassuring thing was that all my old stuff was findable with the Wayback Machine. The folks at Maricopa did set up a forwarding DNS entry so all traffic from the old domain http://www.mcli.dist.maricopa.edu/ at least ended up at a 404 message on the new one.
That was until last week, when I found that links to archives of the old site, which had previously worked, now ended up with this:

“Page Cannot be displayed due to robots.txt”

If you do not know of robots.txt, learn about it now. It’s a plain text file where web site owners can tell web crawling robots (the things that make search engines work) either to come inside or to stay away. It’s a way for owners to keep their site from being indexed, whether (a) because they prefer it not be or (b) to reduce repeated traffic on their servers.
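For the curious, here is a minimal sketch of how a well-behaved crawler might read those rules, using Python's standard `urllib.robotparser`. The `allowed` helper, the two sample files, and the example.com URL are made up for illustration; `ia_archiver` is the user agent the Internet Archive names in its exclusion instructions. Real crawlers may interpret edge cases differently than this parser does.

```python
from urllib.robotparser import RobotFileParser


def allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    # parse() takes the file as a list of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical file that tells only the Internet Archive's crawler to stay away.
BLOCK_ARCHIVER = """\
User-agent: ia_archiver
Disallow: /
"""

# Hypothetical file that welcomes everyone (an empty Disallow blocks nothing).
OPEN_TO_ALL = """\
User-agent: *
Disallow:
"""

if __name__ == "__main__":
    page = "http://example.com/tut/"
    for name, text in (("BLOCK_ARCHIVER", BLOCK_ARCHIVER), ("OPEN_TO_ALL", OPEN_TO_ALL)):
        print(name, "lets ia_archiver in:", allowed(text, "ia_archiver", page))
```

Note that a crawler not named anywhere in the file is let in by this parser; the file only keeps robots out when it says so.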

But this is not an issue with the robots.txt file at Maricopa. The problem is that robots.txt cannot be found, because the DNS entry for the old URL is broken. http://www.mcli.dist.maricopa.edu/ will not connect to anything, and so http://www.mcli.dist.maricopa.edu/robots.txt is never found.
I emailed the Internet Archive and did get a confirmation that if the Wayback Machine cannot find a robots.txt file, it removes the archive.
Put another way, unless the current domain for an archive says “Internet Archive, please archive me,” it won’t. The default is not to archive.
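If you want to check whether a host name still resolves at all, something like the sketch below would do it. This is only an illustration: `robots_url` and `resolves` are names I made up, and the `resolver` parameter exists just so the check can be exercised without touching the network. By default it calls Python's real `socket.getaddrinfo`.

```python
import socket
from typing import Callable


def robots_url(host: str) -> str:
    """Build the address where a crawler would look for robots.txt."""
    return f"http://{host}/robots.txt"


def resolves(host: str, resolver: Callable = socket.getaddrinfo) -> bool:
    """Return True if the host name can be looked up in DNS."""
    try:
        resolver(host, 80)
    except socket.gaierror:
        # Name lookup failed: nothing on this host can be reached,
        # robots.txt included.
        return False
    return True


if __name__ == "__main__":
    for host in ("www.mcli.dist.maricopa.edu", "mcli.maricopa.edu"):
        status = "resolves" if resolves(host) else "does not resolve"
        print(robots_url(host), "->", host, status)
```

A host that fails this check never gets as far as serving a robots.txt file, which is a different situation from a file that says "stay away".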

If this is true, then many old sites must be disappearing from or never added to the Internet Archive because there is no active robots.txt file at the current domain (cue domain squatters, or in my case, an IT department’s goof in a DNS entry).

Should not an archive reflect the robots.txt policy at the time of archiving, not retroactively from the present?

I am hardly the first to trip on this; there are many threads like this in the Internet Archive’s forums.

The “policy” for the Internet Archive is based on the “Oakland Policy” or Recommendations for Managing Removal Requests And Preserving Archival Integrity:

Online archives and digital libraries collect and preserve publicly available Internet documents for the future use of historians, researchers, scholars, and the general public. These archives and digital libraries strive to operate as trusted repositories for these materials, and work to make their collections as comprehensive as possible.

At times, however, authors and publishers may request that their documents not be included in publicly available archives or web collections. To comply with such requests, archivists may restrict access to or remove that portion of their collections with or without notice as outlined below.

Because issues of integrity and removal are complex, and archivists generally wish to respond in a transparent manner, these policy recommendations have been developed with help and advice of representatives of the Electronic Frontier Foundation, Chilling Effects, The Council on Library and Information Resources, the Berkeley Boalt School of Law, and various other commercial and non-commercial organizations through a meeting held by the Archive Policy Special Interest Group (SIG), an ad hoc, informal group of persons interested the practice of digital archiving.

In addition, these guidelines have been informed by the American Library Association’s Library Bill of Rights http://www.ala.org/work/freedom/lbr.html, the Society of American Archivists Code of Ethics http://www.archivists.org/governance/handbook/app_ethics.asp, the International Federation of Library Association’s Internet Manifesto http://www.unesco.org/webworld/news/2002/ifla_manifesto.rtf, as well as applicable law.

But this policy, if I read it correctly, is meant as a tool for collections to say to archives, “Please exclude me,” done through a robots.txt exclusion for the Internet Archive’s crawler:

You can exclude your site from display in the Wayback Machine by placing a robots.txt file on your web server that is set to disallow User-Agent: ia_archiver. You can also send an email request for us to review to info@archive.org with the URL (web address) in the text of your message.

An explicit exclusion is one thing. But, in my case, what is happening is a lack of an inclusion, not an explicit exclusion. This is quite different. The Wayback Machine’s policy is saying, “If we cannot locate a robots.txt file, we remove an archive”.
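To make the difference concrete, here is a toy model of the two possible defaults. This is emphatically not the Wayback Machine's actual code, just my reading of the behavior as I understood it at the time; `Decision`, `playback_decision`, and the `missing_means_hide` flag are invented names. Passing `None` for the file stands in for "robots.txt could not be fetched at all".

```python
from enum import Enum
from typing import Optional
from urllib.robotparser import RobotFileParser


class Decision(Enum):
    SHOW = "show"
    HIDE = "hide"


def playback_decision(
    robots_txt: Optional[str],
    url: str,
    *,
    missing_means_hide: bool,
    user_agent: str = "ia_archiver",
) -> Decision:
    """Decide whether an archived page is shown, under a given default."""
    if robots_txt is None:
        # No file could be found. This is where the two policies part ways:
        # "lack of inclusion hides" versus "only an explicit exclusion hides".
        return Decision.HIDE if missing_means_hide else Decision.SHOW

    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    if parser.can_fetch(user_agent, url):
        return Decision.SHOW
    # The site owner explicitly asked this crawler to stay out.
    return Decision.HIDE


if __name__ == "__main__":
    page = "http://www.mcli.dist.maricopa.edu/tut/"
    print(playback_decision(None, page, missing_means_hide=True))
    print(playback_decision(None, page, missing_means_hide=False))
```

Under the first default, a dead domain silently takes its archive with it; under the second, only an owner who asks for exclusion gets it.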

According to their answer to the FAQ “Why isn’t the site I’m looking for in the archive?”:

Some sites may not be included because the automated crawlers were unaware of their existence at the time of the crawl. It’s also possible that some sites were not archived because they were password protected, blocked by robots.txt, or otherwise inaccessible to our automated systems. Site owners might have also requested that their sites be excluded from the Wayback Machine.

None of these apply to http://www.mcli.dist.maricopa.edu/.

I fully expect there is something at play I do not understand. I don’t know who to contact at Maricopa; I sent a message with the subject “Technical Question for ITS” to the general email for questions, but heard nothing.

Hi Maricopa,

Please direct this request to the ITS department.

I worked at District Office 1992-2006 for MCLI; all of the web sites I worked on have been retired, but because they were some of the earliest educational content on the web, thousands of external sites still link to them.

In the past, these old sites were reachable via the Internet Archive, but now the archive for the original MCLI web site has vanished.

This is because the old URL http://www.mcli.dist.maricopa.edu no longer forwards correctly to the new site http://mcli.maricopa.edu (it never connects) — If the Internet Archive cannot locate the robots.txt file it removes archived web sites.

All it would take to fix is a correction to a DNS record. I am sure this is of lowest possible importance, but I’d appreciate consideration.

More details are written at

Archive Fragility

Thanks

I doubt that a bunch of old web sites from the 1990s are on anybody’s list of important stuff to do. I’d hate to have to directly contact the Maricopa CIO, whom I worked with when I was there, to do something as mundane as fix a broken DNS entry.

But archives and history are important! Especially when they are something you had a part in.

And if the default is Don’t Archive, then the archive is in trouble, especially for old web sites.

Tell me where I have missed something.

UPDATE JULY 5, 2016… Back in the Wayback!

Apparently the loss of my archive had nothing to do with stray or absent robots.txt files. According to Wayback Machine Director Mark Graham, it was a glitch:

We found, and addressed, a condition that was causing playback of some sites to be unintentially blocked under certain conditions.

Thanks Mark for the followup.

If you believe in the open internet and its sustainability, then line up with me to donate to the Internet Archive. If they are not archiving it, nobody will.

Top / Featured Image: Who needs Google when I have my own supply of openly licensed images tagged “closed” and “open”? I have lots to choose from, and settled on this graffiti-covered one from a place I did not recognize: a Flickr photo https://flickr.com/photos/cogdog/3665986532 shared under a Creative Commons (BY) license.

Ah, but because these are in Flickr, I can locate adjacent photos that tell me where it was: this was late June 2009, on a hike in Oahu with my friend Bert Kimura.


Comments

mhawksey says:

June 29, 2016 at 11:01 am

Hit this very problem in the past. Interestingly, the British Library UK Web Archive ignores the lack of robots.txt … the downside is that archived pages can only be viewed on terminals in selected libraries http://www.bl.uk/aboutus/legaldeposit/websites/websites/faqswebmaster/

This all came up recently when the UK Conservative party started deleting pages off their site:

“The Tory plan to conceal the shifting strands of policy by previous leaders may not work. The British Library points out it has been archiving the party’s website since 2004. Under a change in the copyright law, the library also downloaded 4.8m domains earlier this year – in effect, anything on the web with a .co.uk address – and says although the Conservative pages use a .com suffix they will be added to the store “as it is firmly within scope of the material we have a duty to archive”. But the British Library archive will only be accessible from terminals in its building, raising questions over the Tory commitment to transparency”

http://www.theguardian.com/politics/2013/nov/13/conservative-party-archive-speeches-internet

Colin Madland says:

June 29, 2016 at 11:41 am

So, you’ve inspired me to start looking at what of my own stuff is accessible through Wayback.

Turns out that searching for http://merelearning.ca nets an identical error to yours above. ‘No access because robots.’ hmmm…as far as I can see, my site shouldn’t be in the same category as your old stuff at Maricopa. Am I missing something?

Do I need to do something to submit my stuff to Wayback?

Thanks for the inspiration!

1. Alan Levine aka CogDog says:
  
  June 29, 2016 at 3:01 pm
  I cannot answer definitively because I’ve not heard back from the person at the Internet Archive as to the reason for my old site’s disappearance. My hunch, from looking over the FAQs, is that the best bet is to make sure the robots.txt file on your server explicitly allows crawlers, or at least their crawler. While there is a place on the site to submit a single URL, the site-wide crawl is driven by the Alexa services.
  
  Okay, so your site’s robots.txt looks a bit odd, like something to allow sitemaps to work (?). To me, it keeps crawlers out of the wp-admin directory, except for one ajax file:
```
User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php
```
  It might work if you change it to first allow access to all:
```
User-agent: *
Allow: /
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php
```
  or you could add a statement just for the Internet Archive Crawler
```
User-agent: ia_archiver
Allow: /
```
  No promises!
2. Alan Levine aka CogDog says:
  
  June 29, 2016 at 3:12 pm
  
  Also, maybe look at https://wordpress.org/plugins/archiver/
  
Brian W says:

June 29, 2016 at 3:17 pm

This makes me very angry. robots.txt was never a required file; it was only used by people who misunderstood the utility of archivers, spiders, and search engines, people with very fragile servers, people who wrote very fragile code, and people with something to hide. Most people either didn’t know about robots.txt or deliberately chose not to use it. This is retroactive oversensitivity, and amounts to a deletion of the past. Sure, it’s not “deleted”, but if I can’t access it, it might as well be.

This is shameful.

1. Alan Levine aka CogDog says:
  
  June 30, 2016 at 1:40 pm
  
  I’m still trying to sort out what is really happening (I can only guess until I hear back from the person at the archive who assured me they were fixing it). This seems like a new approach by the IA, and if so, is backwards from the intent of the Oakland Policy, which is about having control over what is excluded from the archive, not a requirement for inclusion.
  
Jim says:

July 5, 2016 at 10:48 am

Maricopa had a massive security breach in 2013. Might have something to do with your issue. Ed Kelty is the head of IT for the District. Call Ed.
