Tuesday 17 July 2012

Planning for better IT operating margin

You'll hear the phrase "predictive analytics" from most of the major players in capacity management these days.  Looking into the future to plan a major infrastructure or software initiative, or simply to account for variations in workload growth, requires predictive analytics to some extent.  Whether you're an infrastructure provider or consumer, better planning drives more efficient operations and hence improved margins.  Let's explore further:

[Graph: capacity used over time, showing linear and binomial trend fits]

In its most basic form, predictive analytics is about extrapolation.  By gathering a set of historical data, we can begin to spot patterns and assess the future trajectory.  The type of extrapolation that can be made depends on the power of the analytics.  At its simplest, linear regression looks at the long-term trend and plots a single straight line out into the future.  This works fine for persistent metrics like disk space.  In fact, it works reasonably well for less persistent metrics too, provided you bolster the analysis with some assessment of variability.  However, better curve-fitting algorithms (lognormal, exponential, binomial and so on) can provide more accurate predictions if the data is well behaved.  Take a look at the graph above.  The binomial fit is closer to the capacity-used metric, which combines steady organic growth with a seasonal variation.  In this case, a linear trend on the peaks (or 98th percentiles) can give the same net result, but it is a little more cumbersome.
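
To make the contrast concrete, here is a minimal sketch in Python of both approaches - a straight-line fit on the raw metric and a straight-line fit on its rolling 98th percentile - using synthetic data (all numbers and window sizes are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
days = np.arange(365)
# Synthetic "capacity used": steady organic growth plus a weekly cycle plus noise.
capacity_used = (100 + 0.5 * days
                 + 20 * np.sin(2 * np.pi * days / 7)
                 + rng.normal(0, 3, days.size))

# 1. Straight-line regression on the raw metric (the long-term trend).
slope, intercept = np.polyfit(days, capacity_used, deg=1)

# 2. Straight-line regression on the rolling 98th percentile (the peaks).
window = 28
p98 = np.array([np.percentile(capacity_used[max(0, i - window):i + 1], 98)
                for i in range(days.size)])
slope_p98, intercept_p98 = np.polyfit(days, p98, deg=1)

# Extrapolate both trends 90 days into the future.
print(f"Projected mean usage at day 455: {slope * 455 + intercept:.0f}")
print(f"Projected peak usage at day 455: {slope_p98 * 455 + intercept_p98:.0f}")
```

The second trend sits above the first, which is exactly the point: planning against the mean of a seasonal workload under-provisions for its peaks.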


There are, however, two problems with extrapolation.  The first is one of scale: with no roll-up mechanism, you quickly drown in data, so a simplification process is needed to support the trends.  The second, and more fundamental, is that extrapolation assumes all other variables remain constant - that is, only the workload changes and the environment itself is static.
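
As a sketch of what that simplification might look like in practice (the sample resolution, metric and retention here are assumptions for illustration), raw samples can be rolled up to one row per day before any trend is fitted:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# 90 days of raw CPU samples at 5-minute resolution (~26,000 points).
index = pd.date_range("2012-04-01", periods=12 * 24 * 90, freq="5min")
cpu_util = pd.Series(rng.uniform(20, 90, index.size), index=index)

# Roll the raw samples up to daily peaks and 98th percentiles.
daily = pd.DataFrame({
    "peak": cpu_util.resample("D").max(),
    "p98": cpu_util.resample("D").quantile(0.98),
})
print(daily.head())
```

Trending 90 daily rows per metric is tractable; trending 26,000 raw samples per metric across an estate is not.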


Is this a good assumption?  Well, for some platforms it is.  For disk capacity it is a pretty good rule: only when disks are running out of space will some change be made, and those changes can easily be reflected in the extrapolation.  For physical infrastructure, or statically allocated partitions, it can be a decent assumption too - provided the software itself isn't changing.
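
For example, once a linear growth trend has been fitted, the remaining runway on a volume is simple arithmetic, and a planned change such as an expansion folds straight into the same calculation (the figures below are purely illustrative):

```python
# Back-of-the-envelope runway calculation for a disk volume, assuming a
# linear growth trend has already been fitted to the historical data.
capacity_gb = 2000.0        # provisioned capacity
used_gb = 1450.0            # current usage
growth_gb_per_week = 18.0   # slope from the fitted trend

weeks_to_full = (capacity_gb - used_gb) / growth_gb_per_week
print(f"At the current trend, the volume fills in ~{weeks_to_full:.0f} weeks")

# A planned change (say, adding 500 GB) is easy to reflect in the
# extrapolation: just adjust the provisioned capacity.
weeks_with_expansion = (capacity_gb + 500 - used_gb) / growth_gb_per_week
print(f"With the expansion, ~{weeks_with_expansion:.0f} weeks of runway")
```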


But where extrapolation and curve-fitting algorithms really fail is where either the software or the operating environment is changing.  Assessing the impact of these step-changes in capacity is too complex a task for curve-fitting alone - some configuration information must be reflected in the predictions.  At this stage, a modelling approach must be used.  There are in fact many different modelling algorithms and approaches, but the most popular provide both an infrastructure and a service perspective on capacity.  The service-centric capacity plan takes a cross-section of data centre capacity allocated to, or used by, a hierarchy of service definitions, which can be taken from service definitions or a CMDB.  The benefit of this view is that it enables dialogue with business owners about plans for their relevant domain.  If you're capacity planning in the cloud, the relevant conversation should involve budgeting, quality and optimization opportunities.  And once you have a model, the relevant KPI for trending and extrapolation becomes workload volumetrics - which means you can manipulate forecast data based on changing business requirements in the future.
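
As a rough sketch of the service-centric view and the volumetrics KPI (the hierarchy, names and numbers here are assumed for illustration, not taken from any particular CMDB):

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Service:
    """A node in the service hierarchy, rolling capacity up to its parent."""
    name: str
    cpu_cores: float = 0.0      # capacity allocated directly to this node
    memory_gb: float = 0.0
    children: List["Service"] = field(default_factory=list)

    def total_cpu(self) -> float:
        return self.cpu_cores + sum(c.total_cpu() for c in self.children)

    def total_memory(self) -> float:
        return self.memory_gb + sum(c.total_memory() for c in self.children)


online_banking = Service("Online Banking", children=[
    Service("Web tier", cpu_cores=16, memory_gb=64),
    Service("App tier", cpu_cores=32, memory_gb=128),
    Service("Database", cpu_cores=24, memory_gb=256),
])

print(f"{online_banking.name}: {online_banking.total_cpu()} cores, "
      f"{online_banking.total_memory()} GB allocated")

# Workload volumetrics as the KPI: if the business forecasts 30% more
# transactions and the service scales roughly linearly with volume,
# the forecast capacity requirement follows directly.
forecast_growth = 1.30
print(f"Forecast requirement at +30% volume: "
      f"{online_banking.total_cpu() * forecast_growth:.0f} cores")
```

The useful part is the conversation it enables: the business owner sees capacity and cost per service, and the forecast is driven by their volumes rather than by raw infrastructure counters.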


The modelling approach really comes into its own in managing shared virtual infrastructures like the cloud: where the bottleneck may appear at the physical or the virtual layer, where the virtual configuration may be changing rapidly, and where DRS may be shifting workloads around within a cluster.  It is also beneficial in planning for new software releases, upgrades or major reconfigurations - thereby incorporating a life-cycle approach to capacity management.  Surely this is where predictive analytics is at its most powerful?  In helping architects to size new cloud environments, testers to validate the scalability of their new release, and capacity managers to measure the impact of that release on a congested production environment.
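
As an illustration of folding such a step-change into the model, the sketch below compares the cores needed to serve a forecast peak workload under the current release and under a new release with a higher measured cost per transaction (all figures are assumptions for illustration; in practice the new unit cost would come from performance testing):

```python
cluster_capacity_cores = 64
headroom_target = 0.80            # keep peak utilisation below 80%

peak_tx_per_sec = 520             # forecast peak workload volume
cpu_ms_per_tx_current = 85        # measured on the current release
cpu_ms_per_tx_new = 110           # measured in test on the new release


def cores_required(tx_per_sec: float, cpu_ms_per_tx: float) -> float:
    """Cores needed to serve the given transaction rate at the given unit cost."""
    return tx_per_sec * cpu_ms_per_tx / 1000.0


usable = cluster_capacity_cores * headroom_target
for label, cost in [("current release", cpu_ms_per_tx_current),
                    ("new release", cpu_ms_per_tx_new)]:
    need = cores_required(peak_tx_per_sec, cost)
    status = "fits" if need <= usable else "breaches headroom"
    print(f"{label}: {need:.0f} cores needed vs {usable:.0f} usable -> {status}")
```

No amount of curve-fitting on last year's utilisation would have flagged the second result; it only appears once the release's new cost per transaction is in the model.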


In Summary
Across the technology life-cycle, predictive analytics for capacity management should support sizing, provisioning, managing and decommissioning.  Whether you choose to use a tool for that or to operate a consultative approach, leaving holes in your planning process adds risk and cost to your IT operations.