Tuesday, October 25, 2011

Sysadmin culture: where's the innovation?




I came across an interesting problem the other day.


To give some background: at my company we have a pretty complex publishing ecosystem, where newly authored or altered assets (articles, video clips and the like) cause change notifications to be cascaded down to dependent systems, which in turn pull information from applications higher up the chain - or each other - to complete actions based on the earlier update.


Problems start, however, when one of these individual systems fails, particularly outside of normal working hours. We have good documentation of individual applications, decent monitoring and well-tuned alerting. But crucially, none of these helps our 24 x 7 staff - strongly skilled, but necessarily non-specialist - to reverse the ripple effect of an incident and reach the root cause: effect can often manifest itself some distance from cause. Without this visibility, and not being well-versed in the tell-tale signs of a particular intricate issue, operators resort to "restarting stuff" in the hope of stumbling into a fix. Sometimes this approach is successful; all too often, though, it just makes matters worse.
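To make "reversing the ripple effect" a little more concrete, here is a minimal Python sketch of the idea, not something we actually run. It assumes you can write down which system depends on which, and that monitoring can tell you which systems are currently unhealthy; the service names and the function root_cause_candidates are invented for illustration.

```python
from typing import Dict, List, Set


def root_cause_candidates(depends_on: Dict[str, List[str]],
                          unhealthy: Set[str]) -> List[str]:
    """Return the unhealthy services whose own dependencies all look healthy.

    These are the places where the ripple most plausibly started; every other
    unhealthy service has at least one unhealthy dependency to blame first.
    """
    candidates = []
    for service in sorted(unhealthy):
        upstream = depends_on.get(service, [])
        if not any(dep in unhealthy for dep in upstream):
            candidates.append(service)
    return candidates


# Hypothetical topology: each key pulls information from the services listed.
DEPENDS_ON = {
    "cms": [],
    "asset-store": ["cms"],
    "search-index": ["asset-store"],
    "website": ["asset-store", "search-index"],
}

# Three alerts are firing, but only one of them has healthy dependencies.
print(root_cause_candidates(DEPENDS_ON, {"website", "search-index", "asset-store"}))
```

With the made-up topology above, only asset-store is reported, even though alerts are firing further down the chain. One limitation worth noting: where systems pull from each other in a loop and every member of the loop is unhealthy, this simple check returns no candidate for that loop, so a real tool would need to handle cycles explicitly.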


The first step to solving an issue like this is invariably to take a look at what others are doing in a similar situation: an hour or so spent reading around the subject pays dividends. Except, as so often in the sysadmin world, it doesn't. An equivalent problem in the software development arena would be addressed by competing philosophies, spawning numerous blog posts authored by enthusiastic practitioners, through which approaches are further refined. Instead, Google's results page greets a sysadmin with a number of options for ITIL consultancies – and not a lot else. And, whatever the question, ITIL isn't the answer.


Sure, there are a lot of concepts that can be 'borrowed' from our developer cousins: as an example, the sort of traffic light based, real-time visibility of state that continuous delivery mandates - but applied to visualisation of IT services - could help a lot in this instance. But it's disappointing to have to fall back on applying innovations from other IT areas to system administration, rather than those tailored to our particular discipline.


Which raises the question of sysadmin culture: where's the innovation?



Monday, May 2, 2011

Ubuntu Unity Launcher: show all devices

By default, the Unity Launcher in Ubuntu Natty shows only devices that have been previously mounted.  Here's how to see them all:

In a shell, type
sudo apt-get install dconf-tools
dconf-editor

Once the editor has launched, navigate to Desktop -> Unity -> Devices, where you can switch devices-option to one of Never (don't show any), Always (show all) or OnlyMounted (the default).  Changes are applied instantaneously.


Wednesday, April 27, 2011

Amazon EC2 downtime a manufactured storm?

So Amazon EC2 on the US East Coast went down for a day.  Big deal.

Many IT folks are smugly extolling the virtues of their own co-located setups, and berating those naive, risk-happy organisations for ever attempting to use a public cloud in the first place.  The reality, though, is that similar events will have happened in their own DC, but when you're busy fixing the issue, you're not so exposed to the pain of the service being down.  For the user, the situation's identical - the site's still down - even if there's more of an illusion of control for the techies involved.  And, to those individuals: can you absolutely guarantee that you have the resources and expertise to be able to fix any serious breakage that may arise?  Do you have spare servers sat on the shelf, ready to swap in and rapidly provision?  Know every element of your configuration intimately?  And, even if you can answer "yes" to all these questions, will you really be able to respond as rapidly and comprehensively as the mini-army that cloud economies of scale allow a provider to employ?

For ultimate 'head in sand' thinking, you could fork out for a Tier 1, enterprise-level hosting provider on the basis that by doing so, this sort of thing could never happen.  But, from personal experience, it does.  Of course, using said provider's network of global datacenters, together with good architectural practices, will provide a very resilient infrastructure; equally, if you're working for an organisation of this size, then you wouldn't have all your systems concentrated in a single region if you were using EC2 anyhow, never mind a single availability zone.

This isn't to say that Amazon have covered themselves in glory during this particular episode: communications have been wholly inadequate throughout, and there are questions to be answered in terms of how the outage was able to spread to supposedly separate availability zones.  Those using the cloud have lessons to learn too.  Deploying to a single availability zone is just dumb; so too is the unnecessary use of EBS-based instances, which are fundamentally less scalable and more performance-sensitive than their instance-store cousins.

But in many ways the speed and vigour with which supporters of traditional hosting models have flocked to deride Amazon for the outage is the interesting story here.  You'd doubt that a similar outage for even a cluster of enterprise-level hosting providers would cause such a stir.  Cloud is looking very enticing to many businesses fed up with the inflexibility and expense of their IT infrastructure, AWS is being taken extremely seriously, and those invested in traditional infrastructure are feeling the heat.



Monday, April 25, 2011

Automate EBS snapshots with Ebs2s3 on Amazon EC2

I've spent a while looking for a simple way to schedule regular snapshots of EBS volumes for Amazon EC2-based production environments. The snapshot feature is really useful: it allows you to take regular delta backups of your EBS volumes and save them to S3, which can then be used to rebuild the volume in the event of a system failure.  The solutions I've come across, though, seem to mostly consist of running scripts via cron, which isn't intuitive or easy to manage.  So I've put together a little Rails app to do the job: Ebs2s3.

Ebs2s3 allows you to flexibly schedule backups of your volumes using standard cron syntax, and to specify the number of backups to retain.  Some screenshots:

[Screenshot: the login page]

[Screenshot: displaying current jobs]

[Screenshot: showing an individual job]



More details (including installation instructions) can be found on the project page, here
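
For anyone curious what "number of backups to retain" amounts to in practice, here is a minimal Python sketch of the pruning step. To be clear, this is not Ebs2s3's code (which is a Rails app): the Snapshot type, the snapshots_to_delete function and the snapshot IDs are all made up for illustration, and the actual deletion against EC2 is left out.

```python
from datetime import datetime
from typing import List, NamedTuple


class Snapshot(NamedTuple):
    """Hypothetical record of one completed snapshot of a volume."""
    snapshot_id: str
    started_at: datetime


def snapshots_to_delete(snapshots: List[Snapshot], keep: int) -> List[Snapshot]:
    """Return the snapshots that fall outside the newest `keep` for a volume."""
    if keep < 0:
        raise ValueError("keep must be zero or greater")
    # Newest first, so everything past index `keep` is surplus.
    newest_first = sorted(snapshots, key=lambda s: s.started_at, reverse=True)
    return newest_first[keep:]


# Placeholder data: three nightly snapshots of the same volume.
history = [
    Snapshot("snap-a", datetime(2011, 4, 22, 2, 0)),
    Snapshot("snap-b", datetime(2011, 4, 23, 2, 0)),
    Snapshot("snap-c", datetime(2011, 4, 24, 2, 0)),
]

for surplus in snapshots_to_delete(history, keep=2):
    print("would delete", surplus.snapshot_id)
```

Sorting by start time rather than trusting list order keeps the rule safe if the snapshot listing comes back unordered.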