Archive Fragility

My web warehouse has been emptied.

In my first job in ed-tech, a 14-year stint as an instructional technologist at the Maricopa Community Colleges, I spawned a vast, messy array of web sites from 1993 until I left in 2006, hosted at http://www.mcli.dist.maricopa.edu. That URL is retired, and should redirect to the center’s current site at https://mcli.maricopa.edu/… but the old link is stuck in the spin cycle.

My stuff was a hodgepodge of HTML, PHP, old Perl scripts, creaky wikis, and some really insecure ones that wrote to open text files… and it was completely decommissioned from their web site sometime after I left. I cannot fault anyone, especially since I could count on the Internet Archive’s Wayback Machine. It always worked to find my old web sites. I counted on it.

Until today. I was writing a comment about archiving, and meant to show how well the Wayback Machine had preserved my digital past, and what I got was… bubkahs.

Got no wayback, peabody

It seems odd, since my other old servers, where I ran really old wikis and blog platforms, still show up in the Wayback Machine.

I can only guess that some DNS configuration at Maricopa has rendered the old URL that pointed to my stuff as unresolved, and thus the Wayback Machine gets hung testing the URL? I am wildly guessing, as I do not know how this works. If it really irked me, I could try and find someone I know who works at Maricopa to see if they can fix the redirect.
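One piece of that guess is easy to test: whether the old hostname resolves at all. Here is a minimal Python sketch of such a check. It says nothing about how the Wayback Machine itself tests a URL, and `hostname_resolves` and its `resolver` parameter are names made up for illustration.

```python
import socket


def hostname_resolves(hostname, resolver=socket.getaddrinfo):
    """Return True if `hostname` resolves to at least one address.

    `resolver` is a hypothetical hook (defaulting to the standard
    socket.getaddrinfo) so the check can be exercised without a network.
    """
    try:
        # getaddrinfo raises socket.gaierror when the name cannot be resolved.
        return len(resolver(hostname, None)) > 0
    except socket.gaierror:
        return False


if __name__ == "__main__":
    # The retired hostname from this post; the result depends on the
    # DNS records at the moment you run it.
    print(hostname_resolves("www.mcli.dist.maricopa.edu"))
```

If this prints False, no request to that host, including one for its robots.txt, can ever get an answer.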

But if the Web Archive hinges on a remote DNS setting, what does that say?

The lesson is, again, how weak the links are in the web fabric chain when you rely on other entities to manage your stuff. Or, as I barked recently: Digital Durability? My Money is on the Individual.

I have all my web content from those years on a hard drive, and began re-archiving it myself, on my own domain at http://mcli.cogdogblog.com… but that does not solve the problem of old links that no longer work, especially now that the next recourse, the Wayback Machine, is an empty warehouse.

Because some IT person changed a setting on a server.

It’s castles of internet sand we think we live in.

UPDATE Jun 23, 2016

I contacted the Internet Archive via email about this issue, and appreciate their quick response:

Hi Alan,

Thank you for contacting the Internet Archive.

You are right, according to the current Wayback Machine policy, if robots.txt cannot be reached, we no longer allow access to this domain through our website.

And thus, that confirms my assertion about the fragility. Now I have to see if I can contact anyone at the Maricopa IT department…
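To make the rule in that email concrete, here is a rough Python sketch of the decision as I read it. It is not the Internet Archive's code. The function name `archive_access_allowed`, the convention of passing `None` for an unreachable robots.txt, and the default agent name `ia_archiver` are all assumptions for illustration.

```python
from urllib.robotparser import RobotFileParser


def archive_access_allowed(robots_txt, path="/", agent="ia_archiver"):
    """Decide whether archived pages for a site would be shown.

    robots_txt: the text of the site's robots.txt, "" if the site answers
    but has no rules, or None if robots.txt could not be reached at all
    (a convention assumed for this sketch).
    """
    if robots_txt is None:
        # Per the email: an unreachable robots.txt means no access,
        # the same outcome as an explicit exclusion.
        return False
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, path)


if __name__ == "__main__":
    print(archive_access_allowed(None))                           # host unreachable
    print(archive_access_allowed(""))                             # reachable, no rules
    print(archive_access_allowed("User-agent: *\nDisallow: /"))   # explicit block
```

The first case is the one that bit me: nobody wrote an exclusion, but a hostname that never answers gets the same treatment as one that says "keep out".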

Top / Featured Image: I searched Google Images (set for results licensed for reuse) for “empty warehouse” and found more options than I could dream of. I like to use Flickr ones, as I have this nifty tool for generating attributions, so I used the Flickr photo by nickton https://flickr.com/photos/18203311@N08/4589222821 shared under a Creative Commons (BY) license.


Comments

The Wayback Machine will withhold archives of URLs for which crawlers are specifically asked to keep away. The robots.txt file is one way for a content owner to intentionally suppress archives, but it undoubtedly is usually inadvertent.

http://archive.org/about/faqs.php#14

Alan Levine aka CogDog says:

June 22, 2016 at 3:19 pm

Yes, I understand that.

But there is no explicit robots.txt file at that URL (there was one there for the last 8 years). The domain http://www.mcli.dist.maricopa.edu/ never resolves, so http://www.mcli.dist.maricopa.edu/robots.txt is never found. Does this mean that the Wayback Machine assumes exclusion?

My point is more that IT staff do not know anything about an 8-year-old DNS entry, decide it’s not needed, and the entire archive of 14 years of web work is gone. That is fragile (and also easily fixed).
