Converting a WordPress blog into a static site

My personal blog at el73.be hasn't seen an update in years. Upgrading WordPress every time a new security hole is discovered, on a website I no longer publish to, feels a lot like wasting time.

That's why I decided to finally take WordPress out of the equation and turn the blog into a static, archived site. Two tools made this conversion easier than I had imagined at the start.

Wget

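The first step was to archive the entire published site. With a little help from wget, I had all of the roughly 500 pages downloaded in a little less than an hour, thanks to Peter Upfold and this command:

    $ wget -mk -w 20 http://el73.be/

Here -m mirrors the site recursively, -k rewrites the links so they work in the local copy, and -w sets a pause between requests to go easy on the server.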

Cleaning up the HTML

So now I've got a local mirror of the site, but I need to change some things around. Since I'm not planning on keeping WordPress around, the form to add a comment will have to be removed. Plus there's a link to search the website's contents in the navigation. Since that's also powered by WordPress, it'll have to be removed as well.

My first attempt to clean up the HTML consisted of processing all the HTML files and removing those two parts using regular expressions. Now, I'm no command line guru who chains 5 separate commands into a single, perfectly working combination of *nix glory. And as it turns out, the grep that ships with OS X doesn't support the full range of arguments offered by grep on other systems such as Linux anyway. Sure, I could have delved in and spent a lot of time getting this to work, but it would have been just that: wasting time. I'm not planning on doing this again anytime soon, so whatever works best right now is going to get my vote.

As it turns out, PhantomJS is what worked best for me. PhantomJS allowed me to use a simple script that would walk the entire mirrored site and collect the HTML files. After that, it was a matter of loading each HTML file, removing the elements I wanted removed using plain DOM operations and saving the content back to file.
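The walking step could look roughly like the sketch below. It is not the script I actually ran, just a minimal illustration: the name collectHtmlFiles and its parameters are made up for this example, and it assumes the object passed as fsApi offers list(), isDirectory() and isFile(), the way PhantomJS's built-in fs module does. It also assumes the mirrored pages end in .html or .htm.

```javascript
// Recursively collect the paths of all .html/.htm files below `dir`.
// `fsApi` is assumed to provide list(), isDirectory() and isFile(), as the
// PhantomJS fs module does. It is passed in (rather than required here) so the
// function can also be tried out with a stand-in object.
function collectHtmlFiles(dir, fsApi, found) {
    found = found || [];
    var entries = fsApi.list(dir);
    for (var i = 0; i < entries.length; i++) {
        var name = entries[i];
        // Directory listings may include '.' and '..'; skip them to avoid loops
        if (name === '.' || name === '..') {
            continue;
        }
        var path = dir + '/' + name;
        if (fsApi.isDirectory(path)) {
            // Descend into subdirectories, appending to the same result array
            collectHtmlFiles(path, fsApi, found);
        } else if (fsApi.isFile(path) && /\.html?$/i.test(name)) {
            found.push(path);
        }
    }
    return found;
}
```

Inside a PhantomJS script this would be called as collectHtmlFiles(mirrorDir, require('fs')), where mirrorDir is a placeholder for wherever wget put the mirror.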

// PhantomJS's file system module, needed for fs.write below
var fs = require('fs');

var cleanUpHtml = function(fp) {
    // Open the file at the given location
    console.log('Cleaning up HTML in ' + fp);
    var address = 'file://' + fp;
    console.log('Opening file at ' + address);
    var page = require('webpage').create();
    page.open(address, function(status) {
        if (status !== 'success') {
            console.log('Unable to load ' + fp);
        } else {
            // This function runs inside the loaded page, so `document` is the page's DOM
            var innerHtml = page.evaluate(function() {
                // Remove the comment form
                var el = document.getElementById('commentform');
                if (el) {
                    el.parentNode.removeChild(el);
                }
                // Remove the search link (the fourth item in the navigation list)
                el = document.getElementById('nav');
                if (el && el.tagName === 'UL') {
                    var lis = el.getElementsByTagName('LI');
                    if (lis && lis.length === 4) {
                        el.removeChild(lis[3]);
                    }
                }
                // And return the inner HTML of the root
                return document.documentElement.innerHTML;
            });
            // Wrap the inner HTML and overwrite the original file
            fs.write(fp, '<!DOCTYPE HTML><html lang="nl">' + innerHtml + '</html>', 'w');
            console.log(fp + ' processed');
        }
        // Release the page's resources once the file has been handled
        page.close();
    });
};
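One thing to keep in mind is that page.open works asynchronously, so calling the clean-up function in a plain loop would try to open every file at once. A small helper that handles one file at a time avoids that. The sketch below is only an illustration: processSequentially is a made-up name, and it assumes the worker function accepts a second argument, a callback that it calls when it has finished with its file.

```javascript
// Run `worker(item, next)` for each item in turn. The next item is started
// only after the worker has called `next`; `done` is called once at the end.
function processSequentially(items, worker, done) {
    var index = 0;
    function next() {
        if (index >= items.length) {
            done();
            return;
        }
        var item = items[index];
        index += 1;
        worker(item, next);
    }
    next();
}
```

With cleanUpHtml adapted to take such a callback and call it after writing the file (an assumption, since the version shown here takes only the file path), the whole run would come down to processSequentially(files, cleanUpHtml, function() { phantom.exit(); }).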

Upload... and profit!

So that's how I could turn an entire website backed by WordPress into a site consisting of nothing but static assets in about 2 hours of work, not counting the time needed to download and process the pages. Upgrading to the next two patched versions of WordPress would probably cost me the same or more.
