I came across an interesting problem the other day.
To give some background: at my company we have a pretty complex publishing ecosystem, where newly authored or altered assets (articles, video clips and the like) cause change notifications to be cascaded down to dependent systems, which in turn pull information from applications higher up the chain - or each other - to complete actions based on the earlier update.
Problems start, however, when one of these individual systems fails, particularly outside of normal working hours. We have good documentation of individual applications, decent monitoring, well-tuned alerting. But crucially, none of these factors helps our 24 x 7 staff - highly skilled, but necessarily non-specialist - to trace the ripple effect of an incident back to its root cause: effect can often manifest itself some distance from cause. Without this visibility, and not being well-versed in the tell-tale signs of a particular intricate issue, operators resort to "restarting stuff" in the hope of stumbling into a fix. Sometimes this approach is successful; all too often, though, it just makes matters worse.
The first step to solving an issue like this is invariably to take a look at what others are doing in a similar situation: an hour or so spent reading around the subject pays dividends. Except, as so often in the sysadmin world, it doesn't. An equivalent problem in the software development arena would be addressed by competing philosophies, spawning numerous blog posts authored by enthusiastic practitioners, through which approaches are further refined. Instead, Google's results page greets a sysadmin with a number of options for ITIL consultancies – and not a lot else. And, whatever the question, ITIL isn't the answer.
Sure, there are a lot of concepts that can be 'borrowed' from our developer cousins: as an example, the sort of traffic-light-based, real-time visibility of state that continuous delivery mandates - but applied to visualisation of IT services - could help a lot in this instance. But it's disappointing to have to fall back on applying innovations from other IT areas to system administration, rather than those tailored to our particular discipline.
Which raises the question of sysadmin culture: where's the innovation?