cc licensed ( BY ND ) flickr photo shared by ind{yeah}

A rather annoying grain of sand in the ds106 machine of syndication is the loss of web sites when students give up their domains, or just obliterate their blogs after class. It’s a common theme of the distributed web, the stench of link rot – in our own acts of reorganizing the content we manage, we lose sight of the fact that there may be external links pointing into us.

Such links are invisible, silent, we do not even know they are there.

This has been cropping up more frequently in ds106; for our video assignments, we asked students to review the works of previous students that we (sort of) elegantly syndicate into the Assignment Collection, the ones that appear as Submissions So Far – try the random spinner to take a peek.

And that is a thing with a Domain of One’s Own: “One” can delete or migrate their own stuff or just let go of their domain after ds106 is done. We have to try and be okay with that.

Though my own thought, now nearing 29 years of web content created, is that once you launch a URL into the web, you should do everything you can to make sure it stays there. We are the web, we make it, and when we destroy a published URL, well, I like to think every time that happens we make Tim Berners-Lee’s cat cry in pain.

cc licensed ( BY NC ) flickr photo shared by Hell-G

But back to the ds106 assignments site: all of the examples are syndicated into this site as a secondary syndication from the main site. But once it is in the Assignments site, it’s really a snapshot. The main site, via FeedWordPress, can rejig its fed content into a local copy, but it’s a more tortuous math to propagate that change outward.

So I coded the reaper. The cleanser of dead links.

The Dark One is a function I harvested, one for testing whether a URL is alive – this one depends on cURL to test a link. I modified it since the original code assumed an HTTP status code of 200 was good enough; to me a redirect code (301 or 307) should get a green light as well. This went inside the functions.php file of my theme (I also added a “curl_close” since I would be hitting the function repeatedly in a loop; not sure if it was needed).

function url_exists($url) {
// from http://www.barattalo.it/2010/01/06/test-if-a-remote-url-exists-with-php-and-curl/
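    $ch = @curl_init($url);
    @curl_setopt($ch, CURLOPT_HEADER, TRUE);          // include the headers in the output
    @curl_setopt($ch, CURLOPT_NOBODY, TRUE);          // ask for headers only, skip the body
    @curl_setopt($ch, CURLOPT_FOLLOWLOCATION, FALSE); // do not chase redirects
    @curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);  // hand the response back as a string
    $status = array();
    preg_match('/HTTP\/.* ([0-9]+) .*/', @curl_exec($ch) , $status);
    
    @curl_close($ch);
    
    // no status line at all (the request itself failed): count it as dead
    if ( empty($status[1]) ) return false;
    
    return ( $status[1] == 200 OR $status[1] == 301 OR $status[1] == 307 );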
}
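
To poke at the status test without hitting any live sites, here is a minimal sketch that pulls the parsing and the "is this code okay" decision out into two small helpers. The names reaper_parse_status() and reaper_status_is_alive() are made up for illustration and are not part of the harvested function; the list of accepted codes is just the three named above.

```php
<?php
// Hypothetical helpers, not part of the original url_exists().

// Pull the numeric status code out of a raw header string.
// Returns 0 when no HTTP status line is found (or the input is not a string).
function reaper_parse_status( $headers ) {
    $m = array();
    if ( is_string( $headers ) && preg_match( '/^HTTP\/\S+ ([0-9]{3})/m', $headers, $m ) ) {
        return (int) $m[1];
    }
    return 0;
}

// Same rule as described above: 200, 301 and 307 get a green light.
function reaper_status_is_alive( $code ) {
    return in_array( (int) $code, array( 200, 301, 307 ), true );
}
```

With these in place, the end of url_exists() could hand the result of curl_exec() to reaper_parse_status() and return whatever reaper_status_is_alive() says about it. Adding another acceptable code (say 302) would then be a one-word change to the array.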

So how to put the reaper into action? I could have created some sort of admin dashboard thingie. But just to try it out, the easy way is to set up the code in a page template, and then just call the page as a URL. In WordPress, I create a new page called anything, like The Happy Reaper. The text in the body is not needed, it will not be used. The main part is setting up the slug for the page (or just using what WordPress gives you).

In this case my page, hosted on a blah at www.blahblah.org, will be available at http://www.blahblah.org/happy-reaper/. I then make a copy of the page.php template and name it page-happy-reaper.php. Because the file name ends with the page’s slug, my special template will be used only for this page.

Got it?

The setup for this is a little particular to the ds106 site, since the posts I am dealing with are custom post types – in the assignments site, the things that people do in response to an assignment are created as post types called “stuffthatisdone”.

In the part of the template where the content goes (in place of the normal WordPress loop), we first set up a custom query, and also initialize some counters.

<?php
// set arguments for WP_Query()
$args = array(
	'post_type' => 'stuffthatisdone',
	'post_status' => 'publish',
	'posts_per_page' => -1,
	'year' => 2012,
	'monthnum' => 10,
);

// The Query
$the_query = new WP_Query( $args );

// initialize counters		
$link_count = 0;
$link_fail = 0;

// text describing what was done to a bad link (nothing yet)
$update_str = '';

The stuff in $args controls what we get back from WordPress: content that is of the stuffthatisdone type, and only ones that are published (no drafts, revisions, or attachments); we want all of them (posts_per_page = -1). There are so many items in our site that I had to limit it by time as well, adding year and month. I was able to do all of 2011 in one call, but 2012 crashed in a memory overload, so I added the month, and did 2012 month by month.

The two counters keep track of how many links we process and how many failed. This is solely for writing some output when we are done.

Now we have the guts of the action. We will run a loop for the custom query we created. We have to find the URL we are looking for, $link_to_check (in our site they are stored in a post meta data field). The status is returned from the function we created above. If the link is good we echo something like “GOOD”; if not we say “FAIL”, and we bump counters as needed.

Note that we are not actually doing anything to the posts at this point; this merely gives a sense of whether the code is working. Note at the bottom we echo some feedback for how many links we checked and how many were bad.

echo '<ol>';

while ( $the_query->have_posts() ) : $the_query->the_post();
  
  // current post id      
  $theid = get_the_ID();
  
  // retrieve url from meta data field
  $link_to_check = ( get_post_meta($theid, 'syndication_permalink', true ) ) ? get_post_meta($theid, 'syndication_permalink', true) : 0; 
 
  if ( $link_to_check ) {
  
   // feedback for the bored person watching the page
   echo '<li>verifying link for "' . get_the_title() . '"  <a href="' . $link_to_check . '">' . $link_to_check , '</a>:  ';
   
   // function to verify link
   $link_status = url_exists($link_to_check);
   
   // bump counter
   $link_count++;
   
   if ( $link_status ) {
    // she's good the link
    echo '<strong>GOOD</strong>';
   } else {
   
    //BAD LINK BAD LINK!
    
    // here will go the stuff we will do... TBA
    
    echo '<strong><span style="color:#900">FAIL</span> ' . $update_str  . '</strong>';
    
    $link_fail++; // bump counter
   }       
   echo '</li>';
   
  }
endwhile;
?>
</ol>

<p><strong><?php echo $link_count?></strong> links checked; out of these 
<strong><?php echo $link_fail?></strong> failed and <?php echo $update_str?>.</p>

Okay, if this is working, I just have to set up the actions it should take when a bad link is found. The way I did it was to change the status from published to draft, essentially removing it from the site, but still having it available in case I wanted to revert. In this case my “BAD” code would be:


	// set up for changing the post status to draft
	$dead_post = array();
	$dead_post['ID'] = $theid;
	$dead_post['post_status'] = 'draft';
	
	// make it happen
	wp_update_post( $dead_post );
	
	// output text
	$update_str = 'status set to draft';

This worked fine, and in fact, I found one student’s posts that failed because they were listed in the site as https (long story, but my URL check above fails for https). But then I ended up with a whole raft of drafts I have to delete one screen at a time.

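If I do this again, I would change the action to trash the post, which works like your computer: when you put things in the trash, they are marked for deletion, and you can still revive them if needed. And it’s shorter code. One catch: wp_delete_post() only sends regular posts and pages to the trash, and custom post types like ours are deleted outright, so wp_trash_post() is the one to use here.

	// move the post to the trash
	wp_trash_post($theid);
	
	$update_str = 'post moved to trash';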

The Sloppy Alan Code part is that for each iteration where I changed the parameters, I had to edit the $args value, upload the script, and reload the URL.
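
One way to dodge the edit-upload-reload cycle, sketched here and not something that was run on the ds106 site, is to read the year and month from the page URL and build $args from them. The helper name reaper_query_args() and the parameter names reap_year and reap_month are assumptions made up for this example; plain year and monthnum are avoided because WordPress already uses those as its own query variables.

```php
<?php
// Hypothetical helper: build the query arguments from URL parameters,
// e.g. /happy-reaper/?reap_year=2012&reap_month=10
function reaper_query_args( array $params ) {
    $args = array(
        'post_type'      => 'stuffthatisdone',
        'post_status'    => 'publish',
        'posts_per_page' => -1,
    );

    // only accept a plausible four digit year
    $year = isset( $params['reap_year'] ) ? (int) $params['reap_year'] : 0;
    if ( $year >= 1990 && $year <= 2100 ) {
        $args['year'] = $year;
    }

    // month is optional; ignore anything outside 1-12
    $month = isset( $params['reap_month'] ) ? (int) $params['reap_month'] : 0;
    if ( $month >= 1 && $month <= 12 ) {
        $args['monthnum'] = $month;
    }

    return $args;
}
```

In the template, the query line would then become something like new WP_Query( reaper_query_args( $_GET ) ), and each month is just a different URL. Casting to integers and range checking means junk in the URL is simply ignored.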

It’s pretty intensive to be cURL checking the sites, even though it merely taps the site for HTTP headers (it does not do a full retrieval request, if I understand correctly). Still, some of the 2012 months had 1000 posts and took maybe 10 minutes to process.

The results for dead links found:

  • 2011: 245
  • 2012:
    • January: 3
    • February: 494
    • March: 220
    • April: 186
    • May: 47
    • June: 54
    • July: 46
    • August: 0
    • September: 3

All in all the Happy Reaper removed 1078 dead links, still leaving over 3600 active links of ds106 assignment examples.

Some interpretation of the numbers (?): for Spring of 2012, students were buying their own domains and web hosting. From my class, at least 3/4 of these students let their sites lapse, and in the rush of the end of the semester, I did not do a great job of getting them to archive to UMW Blogs (we had a ds106 server issue where exports kept crapping out). The summer 2012 student numbers are lower, and the first batch of them were ones hosted under Domain of One’s Own, so their sites should stay around longer.

Except I saw at least 3 of my summer students who deleted or wiped out their ds106 blogs. I am not sure why.

It is after all, a Domain of Their Own.

Bottom line, I was able to wipe out a ton of bad links. We still have another lump of them that are duplicated; I am not sure what happened, but sometimes FeedWordPress seems to step on itself (or I did something funky with some manual updates).

Still, this should clear a lot of gunk out of the assignments, leaving only working examples. You can help out even more by doing an assignment today! Or tomorrow! Or a week from Thursday. Try what the random spinner gives you. I just spun and got the 5 Second Video one.

And if you have not seen the way I modified the site, you can add examples to an assignment without needing a blog that we subscribe to: there is a form where you can add an example from any URL, so you can post a ds106 example on any web site, or in flickr, youtube, etc.

For the 5 Second Video assignment, you can see under the instructions for tagging, a link to the form where you can add directly to this assignment:

The first example listed here came via the form, and goes to an example on YouTube, a five second summary of The Notebook. The second one came in the usual way, via a blog ds106 subscribes to, where the post is properly tagged – a 5 second version of “Drive” (and from a ds106 blog I’ve not seen before).

Yes, the Reaper is Happy, not for what he reaped, but for what he left behind.

cc licensed ( BY NC SD ) flickr photo shared by Johnson Cameraface

Creative Commons License
Reaping the Dead ds106 Links by CogDogBlog, unless otherwise expressly stated, is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

2 Comments

  • geoffhiggins

    Hey Alan, I love your work. I get lost when I read the code, but it is great to have someone looking out for us.
    …Geoff

    • Alan Levine aka CogDog

      Thanks Geoff, yeah code can be arcane, but this is more to let people know it is possible….
