No More Dinosaurs in the Museum of the Internet Archive Due to Lack of Permission from Pigeons

NEWS FLASH! My wayback is back!

Pretend you have fond memories of being a kid gawking at the dinosaur fossils on exhibit in the great hall of the Field Museum in Chicago. You have planned a family vacation to Chicago, and you want your kids to share the same experience. You know that the Gorgeous George you saw has been moved elsewhere, but even better, your kids will get to see Sue the Tyrannosaurus.

Imagine the kids’ tears of shattered disappointment when they see the empty platform and this sign in the hallway:

Angered, you head to the information booth for an explanation, and are told:

The modern descendants of dinosaurs, birds, have not published a robots.txt file allowing our museum curators.

That’s right, because of some pigeons who care not for their history, the museums have cleaned out their exhibits.

Excuse some stretching of the metaphor, but in my small-minded interpretation, this is exactly what the Internet Archive is doing to their collection. I have outlined my own disappearing dinosaur of a web archive, and while I wait on a follow-up offered by a representative, I still struggle to make sense of this.

Furthermore, I have been unable to locate any discussion of, or rationale for, what is to me a significant change in archival approach.

At the core is a policy the Archive has adopted, the Oakland Policy. As I interpret the policy, it makes sense. If a web site owner indicates certain information should not be included in the archive, they provide instructions via the robots.txt protocol.

This makes sense, and from my memory, this has always been followed by the Internet Archive’s Wayback Machine. Robots.txt is primarily a statement of what is to be excluded. According to the Robots Control RFC:

To evaluate if access to a URL is allowed, a robot must attempt to match the paths in Allow and Disallow lines against the URL, in the order they occur in the record. The first match found is used. If no match is found, the default assumption is that the URL is allowed.

This says rather clearly that if no order to exclude is provided, it is assumed content is allowed to be crawled (and therefore included in the Internet Archive). The default assumption is open.
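To make that concrete, here is a minimal Python sketch of the first-match rule as I read it from the quoted passage. It is not the Internet Archive's code, nor a complete robots.txt parser. The names `Rule` and `is_allowed` are made up for illustration, and I am assuming that a path "matches" when it starts with the rule's path prefix.

```python
from typing import List, Tuple

# One record line: ("allow" or "disallow", path prefix), kept in file order.
# Hypothetical representation, chosen only for this illustration.
Rule = Tuple[str, str]


def is_allowed(url_path: str, rules: List[Rule]) -> bool:
    """Apply the quoted first-match rule to a single URL path."""
    for directive, prefix in rules:
        # Assumption: plain prefix comparison; empty paths match nothing here.
        if prefix and url_path.startswith(prefix):
            return directive.lower() == "allow"
    # No line matched, so the default assumption is that the URL is allowed.
    return True


# No rules at all: the default is open.
print(is_allowed("/2016/dinosaurs.html", []))

# First match wins: the earlier Allow line beats the later, broader Disallow.
rules = [("allow", "/blog/public"), ("disallow", "/blog")]
print(is_allowed("/blog/public/post.html", rules))
print(is_allowed("/blog/private.html", rules))
```

The three calls print True, True, and False. The first one is the case I care about: a site that says nothing gets archived.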

Yet what appears to have been happening at the Internet Archive is that they have completely turned this inside out. It looks like, and I can only go from what I can see, they have made this their policy: “If we cannot find a robots.txt directive allowing us, then our assumption is to exclude”.

And this seems a poor way to run a public museum. And while they have been promoting ways to preserve a future decentralized web, why are they cleaning out the great halls of the web’s past?

I am fairly sure I am making a major logical mistake or misinterpretation, but no one can tell me where that is.

Please, put the dinos back where they belong.

Top / Featured Image: A remix I made of a photo of Daspletosaurus that previously was on exhibit in the Field Museum’s Stanley Field Hall, found at The Glorious Journey of Gorgeous George.

The image is credited to the Field Museum Photo Archives but I had no luck even finding a search tool there. I hoped it was in the Field’s Flickr Commons collection, but alas, no.

So I am taking some liberty here. It was a bit of fun to brush the dinosaur completely out of the image.

UPDATE JULY 5, 2016… Back in the Wayback!

Apparently the loss of my archive had nothing to do with stray or absent robots.txt files. According to Wayback Machine Director Mark Graham, it was a glitch:

We found, and addressed, a condition that was causing playback of some sites to be unintentionally blocked under certain conditions.

Thanks, Mark, for the follow-up.

If you believe in the open internet and its sustainability, then line up with me to donate to the Internet Archive. If they are not archiving it, nobody will.

Comments

I also remain baffled by this. It’s as if what was an archive has become…exactly the opposite. Further, this is like a library that replaces all old editions with only the current one (and only if the publisher thinks to say “please keep me” when it comes to the most recent acquisition). Not only should the default be the opposite, but the point of the WayBack Machine should be to have those copies even if later versions of the site change, including changing their own desire to be archived. So weird and pointless.

Alan Levine aka CogDog says:

July 5, 2016 at 9:22 pm

It took a while to get an answer, but apparently it was “something else” though I was not given anything specific despite my request for info I could share.

Still, I cannot find a concrete answer to how they implement this Oakland policy, if it is pure exclusion or ???
