Pretend you have fond memories, as a kid, of gawking at the dinosaur fossils on exhibit in the great hall of the Field Museum in Chicago. Your family is planning a vacation trip to Chicago, and you want your kids to share the same experience. You know that the Gorgeous George you saw has been moved elsewhere, but even better, your kids will get to see Sue the Tyrannosaurus.
Imagine the kids' tears of shattered disappointment when they see the empty platform and this sign in the hallway:
Angered, you head to the information booth for an explanation, and are told:
The modern descendants of dinosaurs, birds, have not published a robots.txt file allowing our museum curators.
That’s right, because of some pigeons who care not for their history, the museums have cleaned out their exhibits.
Excuse some stretching of the metaphor, but in my small-minded interpretation, this is exactly what the Internet Archive is doing to their collection. I have outlined my own disappearing dinosaur of a web archive, and while I wait on a follow-up offered by a representative, I still struggle to make sense of this.
Furthermore, I have been unable to locate any discussion of, or rationale for, what is to me a significant change in archival approach.
At the core is a policy the Archive has adopted, the Oakland Policy. As I interpret the policy, it makes sense. If a web site owner indicates certain information should not be included in the archive, they provide instructions via the robots.txt protocol.
This makes sense, and from my memory, this has always been followed by the Internet Archive’s Wayback Machine. Robots.txt is primarily a statement of what is to be excluded. According to the Robots Control RFC:
To evaluate if access to a URL is allowed, a robot must attempt to match the paths in Allow and Disallow lines against the URL, in the order they occur in the record. The first match found is used. If no match is found, the default assumption is that the URL is allowed.
This says rather clearly that if no order to exclude is provided, it is assumed content is allowed to be crawled (and therefore included in the Internet Archive). The default assumption is open.
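To make that rule concrete, here is a minimal sketch in Python of how I read the quoted passage. It is not the Internet Archive's code or anyone's official parser. The function name `is_allowed` and the list-of-pairs rule format are my own inventions for illustration, paths are matched as plain prefixes, and skipping an empty path is a simplification I am assuming.

```python
from typing import List, Tuple

# Hypothetical representation: each rule is (directive, path_prefix),
# kept in the same order the lines appear in the robots.txt record.
Rule = Tuple[str, str]


def is_allowed(url_path: str, rules: List[Rule]) -> bool:
    """Apply the first-match rule from the quoted passage.

    Rules are checked in order and the first one whose path matches
    decides the outcome. If nothing matches, the URL is allowed.
    """
    for directive, prefix in rules:
        # Assumed simplification: an empty path matches nothing.
        if not prefix:
            continue
        if url_path.startswith(prefix):
            # First match wins: Allow permits, anything else excludes.
            return directive.lower() == "allow"
    # No matching line at all: the default assumption is "allowed".
    return True


if __name__ == "__main__":
    # A site with no rules at all is open by default.
    print(is_allowed("/blog/dinosaurs", []))

    # A site that excludes only one directory.
    rules = [("Disallow", "/private/")]
    print(is_allowed("/private/notes", rules))
    print(is_allowed("/blog/dinosaurs", rules))
```

The last line of the function is the whole point: when nothing in the file speaks to a URL, the answer is yes, not no.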
Yet what appears to have been happening at the Internet Archive is that they have completely turned this inside out. It looks like, and I can only go from what I can see, they have made this their policy: “If we cannot find a robots.txt directive allowing us, then our assumption is to exclude”.
And this seems a poor way to run a public museum. And while they have been promoting ways to preserve a future decentralized web, why are they cleaning out the great halls of the web’s past?
I am fairly sure I am making a major logical mistake or misinterpretation, but no one can tell me where that is.
Please, put the dinos back where they belong.
Top / Featured Image: A remix I made of a photo of Daspletosaurus that previously was on exhibit in the Field Museum’s Stanley Field Hall, found at The Glorious Journey of Gorgeous George.
So I am taking some liberty here. It was a bit of fun to brush the dinosaur completely out of the image.
UPDATE JULY 5, 2016… Back in the Wayback!
Apparently the loss of my archive had nothing to do with stray or absent robots.txt files. According to Wayback Machine Director Mark Graham, it was a glitch:
We found, and addressed, a condition that was causing playback of some sites to be unintentionally blocked under certain conditions.
Thanks, Mark, for the follow-up.
If you believe in the open internet and its sustainability, then line up with me to donate to the Internet Archive. If they are not archiving it, nobody will.