They say 'quality over quantity,' but quantifying IT performance is a good shout too
Show the board some graphs, boards love graphs
Every year in living memory I’ve sat in the obligatory “how to complete your annual goals in the HR system” meeting, and each time I’ve been told: make sure you make your objectives “SMART” – Specific, Measurable, Achievable and so on.
Our HR cousins have been telling us this for years, and yet we seem to continue to measure the performance of our IT department on a largely qualitative – that is, subjective – basis. This is something of a surprise, given how measurable many aspects of IT operations can be. There’s so much you can do – much of which is dead easy – to make the performance of your IT department quantifiable. And if you can quantify it, you can identify what’s good and what you have the chance to improve.
Measure the easy stuff
There’s no excuse for not doing basic monitoring on your systems. There are loads of competitively priced commercial monitoring packages from the likes of SolarWinds and ManageEngine, and if you really don’t want to spend money then there are plenty of free tools as well – Spiceworks is the one I come across most. Tracking system uptime, disk and RAM usage, network traffic, switch-port up/down activity – all of which are fundamental and straightforward – gives real-time visibility of what’s going on, and helps you to react to issues and predict capacity growth.
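To make "predict capacity growth" a little more concrete, here is a minimal sketch in Python. It assumes you can export one disk-usage percentage per day from whatever monitoring tool you use; the function name `days_until_full` and the sample figures are invented for illustration and do not come from any particular product. It fits a straight line through the samples and estimates when usage would hit the limit, which is a crude model but a reasonable first pass.

```python
from typing import Optional, Sequence


def days_until_full(used_pct: Sequence[float], limit: float = 100.0) -> Optional[float]:
    """Estimate days from the last sample until usage reaches `limit`.

    Fits a least-squares straight line to evenly spaced (daily) samples.
    Returns None when usage is flat or falling, since no date can be projected.
    """
    n = len(used_pct)
    if n < 2:
        raise ValueError("need at least two samples")
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(used_pct) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, used_pct))
    slope = sxy / sxx  # percentage points per day
    if slope <= 0:
        return None
    intercept = mean_y - slope * mean_x
    day_full = (limit - intercept) / slope  # day index at which the line hits the limit
    return max(0.0, day_full - (n - 1))     # measured from the most recent sample


# Hypothetical samples: a volume growing by one percentage point a day.
samples = [60.0, 61.0, 62.0, 63.0, 64.0]
print(days_until_full(samples))  # 36.0
```

Real growth is rarely this linear, so treat the figure as a prompt to go and look rather than a date to put in the diary.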
Measure the less obvious stuff
The other element of measurement is perhaps less obvious: the behind-the-scenes stuff. Your basic monitoring will allow you to collect and report on the uptime and performance statistics of your system, but what lies behind that uptime? Producing a graph that has a straight line at 100 per cent from the beginning of time is great when you’re showing the company board how stable your services are.
But what about the facts behind that 100 per cent figure? Why not keep track of, say, instances where your firewall cluster automatically failed over because of a cable failure? Or where the UPS kicked in to cater for a mains power fluctuation? This is just as interesting as the fact that the end service has remained up – and is hugely beneficial as empirical justification for implementing resilience when you come to request funds from the bean counters for new equipment.
Particularly useful is that it also lets you identify ongoing deterioration of underlying services. Hardware doesn’t always go from perfect to dead in a millisecond: it can decay over time, with dropouts beginning in dribs and drabs and then becoming more common. Power supplies are especially prone to this: the inaccessible (hidden or buried) elements often remain untouched for decades, and scheduled spot tests every six or twelve months won’t reveal long-term degradation. Monitoring all your services at all layers will let you report on performance but will also give you clues as to what might fail in the future.
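Spotting dropouts that are "becoming more common" can be as simple as counting them. The Python sketch below assumes you already log resilience events somewhere with a date and a type; the event names, the sample data and the function names are all hypothetical. It counts events per month and flags any type whose latest month is busier than its earlier average.

```python
from collections import Counter, defaultdict
from datetime import date

# Hypothetical event log: (date, kind). In practice this would come from
# syslog, SNMP traps or your monitoring tool's export.
events = [
    (date(2024, 1, 14), "ups_on_battery"),
    (date(2024, 2, 3), "ups_on_battery"),
    (date(2024, 2, 20), "firewall_failover"),
    (date(2024, 3, 2), "ups_on_battery"),
    (date(2024, 3, 9), "ups_on_battery"),
    (date(2024, 3, 21), "ups_on_battery"),
]


def monthly_counts(events):
    """Map each event kind to a Counter keyed by (year, month)."""
    counts = defaultdict(Counter)
    for day, kind in events:
        counts[kind][(day.year, day.month)] += 1
    return counts


def rising_kinds(events, months):
    """Return kinds whose count in the last listed month exceeds
    their mean count over the earlier listed months."""
    if len(months) < 2:
        raise ValueError("need at least two months to compare")
    counts = monthly_counts(events)
    *earlier, latest = months
    flagged = []
    for kind, per_month in sorted(counts.items()):
        baseline = sum(per_month[m] for m in earlier) / len(earlier)
        if per_month[latest] > baseline:
            flagged.append(kind)
    return flagged


months = [(2024, 1), (2024, 2), (2024, 3)]
print(rising_kinds(events, months))  # ['ups_on_battery']
```

A month-on-month comparison is deliberately blunt; with a year or two of data you would probably want a longer baseline, but even this is enough to back up a funding request with numbers.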
How many times a day do users cry “The system’s really slow” to the IT service desk? But what does that actually mean? Generally not that it’s slow, but that it’s slower than usual. Humans detect change much better than they measure normality. Fly straight and level in an aircraft and you won’t notice, but bank ever so slightly into a turn and everyone feels it; the same applies to an IT system’s performance changing from the norm.
The first thing you can do is monitor normal transactions at whatever layers you have the tools for. Can you configure a span port on the core switch and use it to monitor, say, the time between a lookup request arriving at the DNS server and the response being delivered back? Can you identify back-end transactions between your main CRM application and the back-end database and measure the time each one takes?
Sometimes you can, but often you can’t. When customers call to pay their bills the shape of a lookup will be pretty uniform from event to event – a database search against the account ID field, perhaps – and so it’s valid to compare the durations of the various events and look for degradation over time. That works for basic lookups precisely because they’re consistent.
It’s a lot harder to compare apples with apples for more complex queries that exercise the back-end systems more rigorously, because the search criteria may differ widely and hence make the database work more or less hard. In that case you can use synthetic transactions – you manufacture the queries yourself in order that you can control the consistency of them.
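As a rough illustration of a synthetic transaction, the Python sketch below runs exactly the same query repeatedly and records the median duration. It uses an in-memory SQLite database purely so the example is self-contained; the `accounts` table, the account ID and the function names are made up, and in real life you would point the same idea at your own back end with a query you have chosen to be representative.

```python
import sqlite3
import statistics
import time


def build_demo_db():
    """Create a throwaway in-memory database standing in for the real back end."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE accounts (account_id INTEGER PRIMARY KEY, name TEXT)")
    conn.executemany(
        "INSERT INTO accounts VALUES (?, ?)",
        ((i, f"customer-{i}") for i in range(10_000)),
    )
    conn.commit()
    return conn


def time_synthetic_lookup(conn, account_id=4242, runs=25):
    """Run the same fixed lookup `runs` times and return the median duration in seconds."""
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        row = conn.execute(
            "SELECT name FROM accounts WHERE account_id = ?", (account_id,)
        ).fetchone()
        durations.append(time.perf_counter() - start)
        if row is None:
            raise RuntimeError("synthetic lookup returned no row")
    return statistics.median(durations)


conn = build_demo_db()
median_s = time_synthetic_lookup(conn)
print(f"median lookup time: {median_s * 1000:.3f} ms")
```

Taking the median rather than the mean keeps one unlucky run from skewing the figure. Because the query never changes, any movement in that figure over the weeks reflects the system rather than the workload.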
One thing before we finish this section. I mentioned that people will notice the performance of a system change from the norm, but that’s not the whole story: they generally won’t notice if it’s changing infinitesimally each day, just as you don’t really notice your own children growing up day by day. Measuring transaction speed – both normal and synthetic – will, over time, give you the visibility you lack, because in real life users only notice when something has actually gone seriously wrong, not when there’s a slight variation.
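Catching the drift nobody notices mostly comes down to comparing now with a while ago instead of now with yesterday. Here is a minimal Python sketch, assuming you keep one median response time per day; the function name `drift_ratio`, the seven-day window and the made-up series are illustrative choices and not a recommendation.

```python
import statistics


def drift_ratio(daily_medians, window=7):
    """Compare the median of the most recent `window` days with the
    median of the first `window` days. A ratio above 1.0 means slower."""
    if len(daily_medians) < 2 * window:
        raise ValueError("need at least two full windows of data")
    baseline = statistics.median(daily_medians[:window])
    recent = statistics.median(daily_medians[-window:])
    return recent / baseline


# Hypothetical series: a 100 ms transaction getting 0.5 per cent slower each day.
series = [100.0 * 1.005 ** day for day in range(60)]

ratio = drift_ratio(series)
worst_daily_step = max(b / a for a, b in zip(series, series[1:]))
print(f"day-to-day change never exceeds {(worst_daily_step - 1) * 100:.1f} per cent")
print(f"recent week vs first week: {(ratio - 1) * 100:.0f} per cent slower")
```

In this invented series no single day is more than half a per cent worse than the one before, yet the latest week comes out around 30 per cent slower than the first: exactly the kind of change users won't report until it hurts.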
So we’ve quantified the performance of our world, we’ve introduced synthetic transactions to give us data we wouldn’t otherwise have, we’ve measured it and we’ve reported to management.
All of which is very nice, but is of no use whatsoever if you don’t actually react to what you find. Following up what you find is absolutely essential if your organisation is to continue running well – particularly if what you find is that although you thought it was running well, it actually isn’t.
Your quantitative measurements will tell you one of three things.
First, they could show that system performance is improving over time, and that there’s nothing to worry about. And if they do, you need to identify why that’s the case. Stuff doesn’t just magically get faster – there’s always a reason, whether it’s because you upgraded a virtualisation host or because someone discovered the daily database optimisation script was failing and fixed it without telling anyone.
Second, they could show that system performance is degrading over time, in which case you need to do something about it. And this shouldn’t be a problem because you’re monitoring everything and so you can be proactive and address the issue before it becomes noticeable.
Finally, they could show that performance is humming along nicely and that nothing’s broken and there’s no immediate need to fix anything. But there’s something to mention here: you’re showing your funky graphs to the board, and demonstrating objectively why the performance of the systems is rock solid and that the speed of everything is consistent over time.
And if I’m on your board I’ll be asking you why, given all this money we’re giving you for these systems, the graph isn’t showing it getting consistently better.