For the obviously obvious statement, WordPress is built on a database. The question is, besides data like visitor counts, what can you infer from the data in the posts and metadata itself?
The question was swirling in preparation for a research interview I did today with David Porter and Valerie Lopes for the Ontario Extend project.
A cartoon lightbulb went on over my head when I remembered that we have a blog syndication hub set up, and because of the way Feed WordPress does its thing, a copy of every post is saved locally on the site.
For any list of blogs (all of them, for example), I already have it display a count of the blogs subscribed to as well as the total number of posts syndicated in.
This is done as well for each of the cohorts, since posts for each are assigned to a designated category, like the blog list for all in the West Cohort.
The lightbulb was that quite some time ago I built a plugin for exporting data from posts in a category; I have my own tool, wp-posts2csv. The plugin lets you choose the category to pull data from (or just pull from all posts)
and gives you a button to click. It returns a .csv file to download.
What I was never quite sure of (insert disclaimer here about not being a data scientist) is what is useful to have in there. Here is what you get, in spreadsheet format:
- post ID
- source indicated (either ‘local’ or ‘syndicated’)
- post title
- publication date and time
- author name (first and last name from the profile; these are added to user profiles via the Gravity Forms signup thing I built)
- author username on site
- blog name (host blog or remote blog if syndicated post)
- post character count (string character count after HTML stripped out)
- post word count (after HTML stripped out)
- number of links in post (count of ‘<a>’ tags)
- list of hyperlink URLs (from all href= attributes, hoping my regex is on target)
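The last four columns boil down to a bit of text munging. Here is a minimal sketch in Python of how counts like these could be computed from a post's HTML. It is not the plugin's actual code: it swaps the regex approach for the standard library's `HTMLParser`, and the function name `post_metrics` and the dictionary keys are made up for illustration. The plugin's numbers may differ depending on how it strips HTML and handles whitespace.

```python
import re
from html.parser import HTMLParser


class PostScanner(HTMLParser):
    """Collects visible text and link info from a post's HTML."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.text_parts = []   # text found between tags
        self.hrefs = []        # href values from <a> tags
        self.link_count = 0    # number of <a> start tags seen

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.link_count += 1
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)

    def handle_data(self, data):
        self.text_parts.append(data)


def post_metrics(html):
    """Hypothetical helper: returns counts for one post's HTML content."""
    scanner = PostScanner()
    scanner.feed(html)
    scanner.close()
    # Collapse runs of whitespace so line breaks in the markup do not inflate counts
    text = re.sub(r"\s+", " ", "".join(scanner.text_parts)).strip()
    return {
        "char_count": len(text),
        "word_count": len(text.split()),
        "link_count": scanner.link_count,
        "link_urls": scanner.hrefs,
    }


sample = '<p>Hello <a href="https://example.com">world</a> and <a href="/about">more</a></p>'
metrics = post_metrics(sample)
print(metrics)
```

Using a parser rather than a regex is a design choice for the sketch: it avoids worrying about quoting styles and attribute order inside the tags.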
Here is a peek at the data (showing for two of my posts, I give myself permission to use my data about my blog in a blog post on my blog).
I designed this first for syndication hubs, and used it on one yesterday, but it would work fine on any WordPress site. By “work fine” I mean it will spit out some spreadsheet stuff.
But really, what can one infer from this? Is there meaning in looking at word/character count? Use of tags? Use of links?
I dunno (remember the disclaimer)?
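One modest starting point, offered as a sketch and not as an answer: group the exported rows by blog and look at posts per blog, typical post length, and average links per post. The Python below assumes column headers named `blog_name`, `word_count`, and `link_count`, which are guesses and may not match the real export, and the three sample rows are invented purely so the code runs.

```python
import csv
import io
from collections import defaultdict
from statistics import mean, median

# Invented rows standing in for the export; the column names are assumptions
SAMPLE = """blog_name,word_count,link_count
Blog A,100,2
Blog A,300,4
Blog B,50,0
"""


def summarize_by_blog(csv_file):
    """Return per-blog post count, median word count, and mean link count."""
    groups = defaultdict(list)
    for row in csv.DictReader(csv_file):
        groups[row["blog_name"]].append(
            (int(row["word_count"]), int(row["link_count"]))
        )

    summary = {}
    for blog, rows in groups.items():
        words = [w for w, _ in rows]
        links = [n for _, n in rows]
        summary[blog] = {
            "posts": len(rows),
            "median_words": median(words),
            "mean_links": mean(links),
        }
    return summary


summary = summarize_by_blog(io.StringIO(SAMPLE))
for blog, stats in summary.items():
    print(blog, stats)
```

To run it against a real download, open the file with `open("export.csv", newline="", encoding="utf-8")` and pass that in place of the `StringIO` object, after adjusting the column names to whatever the export actually uses.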
I did the due diligence of some googling, where first you have to figure out how to filter out all the SEO-seeking and marketing stuff. The best searches I found were for
content analysis of blog posts, but the results seemed dusty too: studies done in the mid to late 2000s that referred to old horses like “Technorati”.
A few focussed on comment data, which is something we do not get when syndicating posts (long story, it’s really messy).
I’d like to think my search skills were weak here, so I go Lazy Web and ask for help. What can you do with this kind of data? What else is worth collecting for activity/content analysis? Does anyone really know what time it is?
A long-standing curiosity is that DS106 has been syndicating in content from thousands of blogs for tens of different classes since 2011. The tanks deep inside the database hold copies of 79,000+ syndicated blog posts.
How many research projects have taken on looking at that data?
Near as I know… zip.
I guess there’s no interesting data there.