Let me first say that the Internet Archive is one of the most important, impressive crowning achievements we have for preserving and understanding what is truly a transient information space. And I rely regularly on the Wayback Machine to research, discover, and fall into interesting rabbit holes.
And let me add that I am most likely not understanding enough here, but I am rather worried by what I have discovered about what you can find in the archive for sites you contributed to in the past but do not manage in the present.
Apparently it hinges totally on access to a text file sitting on the current domain.
As alluded to recently, it makes for a fragile archive.
Allow me to play this back.
- I was responsible for maybe 10 GB of early web stuff in my position from 1992-2006 as an instructional technologist for the Maricopa Center for Learning & Instruction (MCLI) at the Maricopa Community Colleges. I am not 100% sure, but the MCLI web site might have been up before Maricopa had one. I wrote a very early HTML tutorial that was my first “break” in online space (archived myself at http://mcli.cogdogblog.com/tut). Active in the mid 1990s in multimedia, I built Director Web, the leading resource for Macromedia Director, before Macromedia as a company had a web site.
This all existed at the long, unwieldy URL
- The site continued after I left in 2006; in early January 2009 the center changed and moved its site to
http://mcli.maricopa.edu (how do I know this? Why, because of the Wayback Machine)
- All my old stuff was gone; I understand, because a lot of it was built on perl scripts doing things like writing information to open text files. It was a mess. But the reassuring thing was that all my old stuff was findable with the Wayback Machine. The folks at Maricopa did set up a forwarding DNS entry so all traffic from the old domain
http://www.mcli.dist.maricopa.edu/ at least ended up at a 404 message on the new one.
- That was until last week, when I found links that previously worked to archives of the old site ended up with this:
“Page Cannot be displayed due to robots.txt” If you do not know of robots.txt, learn about it now. It’s a plain text file where web site owners can tell web crawling robots (the things that make search engines work) to either come inside, or to stay away. It’s a way for owners to not have their site indexed if (a) they prefer not to be or (b) to reduce repeated traffic on their servers.
But this is not an issue with the robots.txt file at Maricopa. The problem is robots.txt cannot be found, because the DNS entry for the old URL is broken,
http://www.mcli.dist.maricopa.edu/ will not connect to anything, and so
http://www.mcli.dist.maricopa.edu/robots.txt is never found.
- I emailed the Internet Archive and did get a confirmation that if the Wayback Machine cannot find a robots.txt file, it removes the archive.
Or, unless the current domain for an archive says “Internet Archive, please archive me” it won’t. The default is not to archive.
If this is true, then many old sites must be disappearing from, or never being added to, the Internet Archive because there is no active robots.txt file at the current domain (cue domain squatters, or in my case, an IT department’s goof in a DNS entry).
Should not an archive reflect the robots.txt policy at the time of archiving, not retroactively from the present?
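The failure described above (the old hostname no longer resolves, so its robots.txt can never be fetched) can be checked with a short Python sketch. The function name `hostname_resolves` and the injectable `resolver` parameter are my own illustration, not anything the Internet Archive or Maricopa uses; it only asks DNS whether a name has an address.

```python
import socket
from typing import Callable


def hostname_resolves(hostname: str, resolver: Callable = socket.getaddrinfo) -> bool:
    """Return True if DNS returns at least one address for the hostname.

    `resolver` defaults to the real socket.getaddrinfo; it is a parameter
    only so the check can be exercised without touching the network.
    """
    try:
        return len(resolver(hostname, 80)) > 0
    except socket.gaierror:
        # Name lookup failed: nothing at this hostname can be fetched,
        # and that includes /robots.txt.
        return False
```

Calling `hostname_resolves("www.mcli.dist.maricopa.edu")` performs a live lookup, so the answer depends on the state of the DNS record on the day you run it.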
I am hardly the first to trip on this; there are many threads like this in the Internet Archive’s forums.
The “policy” for the Internet Archive is based on the “Oakland Policy” or Recommendations for Managing Removal Requests And Preserving Archival Integrity:
Online archives and digital libraries collect and preserve publicly available Internet documents for the future use of historians, researchers, scholars, and the general public. These archives and digital libraries strive to operate as trusted repositories for these materials, and work to make their collections as comprehensive as possible.
At times, however, authors and publishers may request that their documents not be included in publicly available archives or web collections. To comply with such requests, archivists may restrict access to or remove that portion of their collections with or without notice as outlined below.
Because issues of integrity and removal are complex, and archivists generally wish to respond in a transparent manner, these policy recommendations have been developed with help and advice of representatives of the Electronic Frontier Foundation, Chilling Effects, The Council on Library and Information Resources, the Berkeley Boalt School of Law, and various other commercial and non-commercial organizations through a meeting held by the Archive Policy Special Interest Group (SIG), an ad hoc, informal group of persons interested in the practice of digital archiving.
In addition, these guidelines have been informed by the American Library Association’s Library Bill of Rights http://www.ala.org/work/freedom/lbr.html, the Society of American Archivists Code of Ethics http://www.archivists.org/governance/handbook/app_ethics.asp, the International Federation of Library Association’s Internet Manifesto http://www.unesco.org/webworld/news/2002/ifla_manifesto.rtf, as well as applicable law.
But this policy, if I read it correctly, is meant as a tool for collections to say to archives, “Please exclude me”, done through a robots.txt exclusion for the Internet Archive’s crawler:
You can exclude your site from display in the Wayback Machine by placing a robots.txt file on your web server that is set to disallow User-Agent: ia_archiver. You can also send an email request for us to review to email@example.com with the URL (web address) in the text of your message.
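As a rough illustration of what such an explicit exclusion looks like, here is a minimal Python sketch using the standard library's `urllib.robotparser`. The robots.txt content and the `example.org` URL are assumptions made up for the example, and the helper `allowed` is my own name; this shows how a generic robots.txt parser reads the rule, not how the Wayback Machine itself evaluates it.

```python
from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt that shuts out only the Internet Archive's crawler.
ROBOTS_TXT = """\
User-agent: ia_archiver
Disallow: /
"""


def allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(allowed(ROBOTS_TXT, "ia_archiver", "http://example.org/page.html"))   # False
print(allowed(ROBOTS_TXT, "SomeOtherBot", "http://example.org/page.html"))  # True
```

The rule names one crawler and leaves every other robot free to visit, which is what makes it an explicit, deliberate request.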
An explicit exclusion is one thing. But, in my case, what is happening is a lack of an inclusion, not an explicit exclusion. This is quite different. The Wayback Machine’s policy is saying, “If we cannot locate a robots.txt file, we remove an archive”.
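The difference between the two defaults can be put in a few lines of Python. This is only a sketch of the distinction I am drawing, not the Wayback Machine's actual logic: the function `may_display` and its `missing_means_blocked` flag are hypothetical, and `robots_txt=None` stands in for "the file could not be fetched at all".

```python
from typing import Optional
from urllib.robotparser import RobotFileParser


def may_display(robots_txt: Optional[str],
                user_agent: str = "ia_archiver",
                missing_means_blocked: bool = False) -> bool:
    """Decide whether an archived site may be shown.

    robots_txt is the text of the file, or None if it could not be fetched.
    missing_means_blocked is a hypothetical switch between the two defaults.
    """
    if robots_txt is None:
        # No file found: the outcome is purely a policy choice.
        return not missing_means_blocked
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, "/")


# Default "archive unless told otherwise": a missing file blocks nothing.
print(may_display(None))                              # True
# Default "do not archive": a missing file hides the archive.
print(may_display(None, missing_means_blocked=True))  # False
```

With the first default, only a site owner's explicit `Disallow` removes anything; with the second, a broken DNS record has the same effect as a removal request.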
According to their answer to the FAQ “Why isn’t the site I’m looking for in the archive?”:
Some sites may not be included because the automated crawlers were unaware of their existence at the time of the crawl. It’s also possible that some sites were not archived because they were password protected, blocked by robots.txt, or otherwise inaccessible to our automated systems. Site owners might have also requested that their sites be excluded from the Wayback Machine.
None of these apply to
I fully expect there is something at play I do not understand. I don’t know who to contact at Maricopa; I sent a message with subject “Technical Question for ITS” to the general email for questions, but heard nothing.
Please direct this request to the ITS department.
I worked at District Office 1992-2006 for MCLI; all of the web sites I worked on have been retired, but because they were some of the earliest educational content on the web, thousands of external sites still link to them.
In the past, these old sites were reachable via the Internet Archive, but now the archive for the original MCLI web site has vanished.
This is because the old URL http://www.mcli.dist.maricopa.edu no longer forwards correctly to the new site http://mcli.maricopa.edu (it never connects) — If the Internet Archive cannot locate the robots.txt file it removes archived web sites.
All it would take to fix is a correction to a DNS record. I am sure this is of lowest possible importance, but I’d appreciate consideration.
More details are written at
I doubt that a bunch of old web sites from the 1990s are on anybody’s list of important stuff to do. I’d hate to have to directly contact the Maricopa CIO, whom I worked with when I was there, to do something as mundane as fix a broken DNS entry.
But archives and history are important! Especially when they are something you had a part in.
And if the default is Not Archive, then the archive is in trouble, especially for old web sites.
Tell me where I have missed something.
UPDATE JULY 5, 2016… Back in the Wayback!
Apparently the loss of my archive had nothing to do with stray or absent robots.txt files; according to Wayback Machine Director Mark Graham, it was a glitch:
We found, and addressed, a condition that was causing playback of some sites to be unintentionally blocked under certain conditions.
Thanks Mark for the followup.
If you believe in the open internet and its sustainability, then line up with me to donate to the Internet Archive; if they are not archiving it, nobody will.
Top / Featured Image: Who needs Google when I have my own supply of openly licensed images tagged “closed” and “open”? I have lots to choose from, and settled on this graffiti-covered one from a place I am not recognizing, a flickr photo https://flickr.com/photos/cogdog/3665986532 shared under a Creative Commons (BY) license
Ah, but because these are in flickr, I can locate adjacent photos that tell me where it was: this was late June 2009, on a hike in Oahu with my friend Bert Kimura.
The post "Surprisingly, the default for the Internet Archive is Don’t Archive" was originally pulled charred and crispy from a smoky charred oven at CogDogBlog (http://cogdogblog.com/2016/06/dont-archive/) on June 28, 2016.