Friday, 14 February 2014

Capacity Management - 5 top tips for #DevOps success

An esteemed consultant friend of mine once commented - "in capacity management, it is the step changes in capacity that are the most difficult to plan for". In agile release practice, such step changes are increasing in frequency. As each new release hits, the historical quality-of-service metrics lose relevance, making capacity planning harder.

To respond to this change, an agile capacity management practice is called for, which must be lightweight, largely automated, and relevant to both deployed software and software not yet released. Indeed, the process must be able to support all aspects of the DevOps performance cycle - from infrastructure sizing, through unit and load testing, to operational capacity management. In shared environments, such as cloud infrastructures, it is easy to become lost in the "big data" of application or infrastructure performance.

When executing a DevOps strategy, however, it is critical to embed performance and capacity management as a core principle - structuring the big data so that it becomes relevant and actionable. Here are 5 top tips for success:

1. A well-defined capacity management information system (CMIS) is fundamental

The foundation of your capacity management capability is data, so building a strong foundation with a capacity management information system is crucial. The purpose of this foundation is to capture all the metrics relevant to a predictive process - a process that provides insight into the current environment to help drive future decision-making. Context is crucial, and configuration information must be captured too, covering virtual and physical machine specifications along with service configuration data. It is advisable to design this system to accommodate business contextual data as well, such as costs, workloads or revenues. Automation of the data collection is critical when designing an agile process, and the system should be scalable - able to deliver quick wins, then grow to cover all the platforms in your application infrastructure. This system should not replace or duplicate any existing monitoring, since it will not be used for real-time purposes. Also note that it is easy to over-engineer this system for its purpose - another reason to adopt a scalable design that grows to accommodate carefully selected metrics.

[Image: CMIS takes data from real-time monitors]
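
As a sketch of what such a foundation might hold, the snippet below shows one possible record structure in Python, combining utilization metrics, configuration context and business context in a single CMIS entry. The field names and the collect_sample() helper are illustrative assumptions, not a product schema.

    # Hypothetical CMIS record structure (illustrative only - not a product schema).
    from dataclasses import dataclass, field
    from datetime import datetime

    @dataclass
    class CapacitySample:
        timestamp: datetime
        service: str                      # business service the metric belongs to
        host: str
        cpu_util_pct: float               # utilization metrics from existing monitors
        mem_used_gb: float
        config: dict = field(default_factory=dict)   # e.g. vCPUs, RAM, hypervisor version
        business: dict = field(default_factory=dict) # e.g. cost per hour, transactions processed

    def collect_sample(monitor_row: dict) -> CapacitySample:
        """Map a row pulled from an existing monitoring feed into the CMIS."""
        return CapacitySample(
            timestamp=datetime.fromisoformat(monitor_row["ts"]),
            service=monitor_row["service"],
            host=monitor_row["host"],
            cpu_util_pct=float(monitor_row["cpu_pct"]),
            mem_used_gb=float(monitor_row["mem_gb"]),
            config={"vcpus": monitor_row.get("vcpus"), "ram_gb": monitor_row.get("ram_gb")},
            business={"cost_per_hour": monitor_row.get("cost_per_hour")},
        )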

2. Acquire a knowledge base around platform capacity

A knowledge base is crucial when comparing platform capabilities. Whether you are looking at a legacy AIX server or a modern HP blade, you must know how those platforms compare in both performance and capacity. The knowledge base must be well maintained and reliable, so that you have accurate insight into the latest models on the market as well as the older models that may still be deployed in your data centres. For smaller organisations, building your own knowledge base may be a viable option, but beware of architectural nuances that affect platform scalability (such as logical threading or hypervisor overheads). For this reason, it is practical to acquire a commercially maintained knowledge base - and to avoid benchmarks provided by the platform vendors. Avoid using MHz as a benchmark; it is highly inaccurate. Early in the design stage for new applications, this knowledge base becomes a powerful ally - especially when correlated against current environmental usage patterns.

[Image: Quantify capacity of different platforms]
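
To illustrate the idea, here is a minimal Python sketch that normalises observed utilization onto platform-neutral capacity units. The PLATFORM_RATING figures are invented placeholders - in practice they would come from a properly maintained, independent benchmark source, not from MHz or vendor figures.

    # Illustrative normalisation of utilization onto a common capacity rating.
    PLATFORM_RATING = {
        "legacy-aix-p5": 40.0,      # hypothetical relative capacity units
        "hp-bl460c-gen8": 140.0,
    }

    def used_capacity_units(platform: str, cpu_util_pct: float) -> float:
        """Convert a platform-specific CPU utilization into platform-neutral units."""
        return PLATFORM_RATING[platform] * cpu_util_pct / 100.0

    def headroom_after_move(workloads, target_platform: str) -> float:
        """Remaining capacity units if the listed (platform, util%) workloads
        were consolidated onto one target platform."""
        demand = sum(used_capacity_units(p, u) for p, u in workloads)
        return PLATFORM_RATING[target_platform] - demand

    # e.g. two partially-used legacy servers consolidated onto a modern blade
    print(headroom_after_move([("legacy-aix-p5", 60), ("legacy-aix-p5", 45)], "hp-bl460c-gen8"))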

3. Load testing is for validation only

For agile releases, incremental change makes end-to-end test environments expensive to provision and assemble, and time-consuming to execute against. However, load testing remains a critical part of the performance and capacity DevOps cycle. Modern testing practice has "shifted left" the testing phase, using service virtualization and release automation, resulting in component-level performance profiling that provides a powerful datapoint in our DevOps process. By assimilating these early-stage, performance-tested datapoints into our DevOps thinking, we can provide early insight into the effect of change. For this to be effective, a predictive modelling function of some sort is required, where the performance profile can be scaled to production volumes and "swapped in" to the production model. Such a capability has been described in the past as a "virtual test lab". For smaller organisations, this could be possible with an Excel spreadsheet, although factoring in scalability and the infrastructure knowledge base will be a challenge.

[Image: DevOps and performance testing]
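
As a rough illustration of "swapping in" a component-level profile, the sketch below scales a measured per-transaction CPU cost up to forecast production volumes and reports the resulting tier utilization. All the numbers are assumptions chosen for the example.

    # Scale a component-test CPU profile to production volumes (illustrative figures).
    def projected_utilization(cpu_sec_per_txn: float,
                              txn_per_sec_prod: float,
                              total_cores: int) -> float:
        """Fraction of the production CPU pool consumed at the projected volume."""
        demand_cpu_sec_per_sec = cpu_sec_per_txn * txn_per_sec_prod
        return demand_cpu_sec_per_sec / total_cores

    # Component test measured 0.012 CPU-seconds per transaction; production peak is
    # forecast at 900 txn/s on a 16-core tier.
    util = projected_utilization(0.012, 900, 16)
    print(f"Projected tier utilization: {util:.0%}")   # flag if above a safe threshold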

4. Prudently apply predictive analytics

[Image: Predictive Analytics at work]

To be relevant, predictive analytics need to account for change in your environment - predictive analytics applied only to operational environments are no longer enough. In a DevOps process, change is driven by release, so investing in a modelling capability that allows you to simulate application scalability and the impact of each new release is crucial. Ask yourself "how detailed do I need to be?" to help drive a top-down, incremental path to delivering the results you need. Although it is easy and tempting to profile performance in detail, it can be very time-consuming to do. Predictive analytics are fundamentally there to support decision-making on provisioning the right amount of capacity to meet demand - it can be time-consuming and problematic to use them to predict code- or application-level bottlenecks. Investment in a well-rounded application and infrastructure monitoring capability for alerting and diagnostics remains as important as it ever was.
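
A minimal sketch of this kind of top-down sizing question is shown below: given a forecast arrival rate and per-transaction service time (both assumed figures), how many servers keep a tier below a target utilization for the next release?

    # Illustrative top-down sizing sketch - the inputs are assumptions, not measured values.
    def servers_needed(arrival_rate_tps: float,
                       service_sec_per_txn: float,
                       target_utilization: float = 0.6) -> int:
        """Smallest server count whose pooled utilization stays under the target."""
        offered_load = arrival_rate_tps * service_sec_per_txn   # in "busy servers" (Erlangs)
        servers = 1
        while offered_load / servers > target_utilization:
            servers += 1
        return servers

    # Current release: 300 txn/s at 8 ms per transaction.
    # New release (forecast): +40% traffic and a 25% heavier transaction profile.
    print(servers_needed(300, 0.008))            # baseline
    print(servers_needed(300 * 1.4, 0.010))      # post-release scenario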

5. Pause, and measure the value

As a supporting DevOps process, it can be easy to overlook the importance of planning ahead for performance and capacity. Combining the outputs with business context, such as costs, throughputs or revenues, will highlight the value of what you are doing. One example is to add your infrastructure cost model to your capacity analytics - adding transparency into the cost of capacity. By combining these costs with utilization patterns, you can easily derive a cost-efficiency metric that can drive further optimization. The capacity management DevOps process is there to increase your agility by reducing the time spent in redundant testing, to provide greater predictability into the outcomes of new releases, to improve cost-efficiency in expensive production environments, and to give executives the planning support they need in aligning with other IT or business change projects.

[Image: Showing cost-efficiency of infrastructure used]
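
As one possible illustration, the sketch below joins a placeholder infrastructure cost model with average utilization to produce a simple cost-efficiency figure per cluster; your own cost model and metric definition may well differ.

    # Illustrative cost-efficiency metric: cost per unit of capacity actually used.
    def cost_efficiency(monthly_cost: float, avg_utilization_pct: float) -> float:
        """Cost per utilized percentage-point of capacity - lower is better."""
        return monthly_cost / max(avg_utilization_pct, 1e-6)

    clusters = {
        "prod-web":   {"monthly_cost": 12000, "avg_util_pct": 55},
        "prod-batch": {"monthly_cost": 9000,  "avg_util_pct": 15},
    }
    for name, c in clusters.items():
        print(name, round(cost_efficiency(c["monthly_cost"], c["avg_util_pct"]), 1))
    # The cluster with the higher cost per unit of used capacity is the optimization candidate.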


Thursday, 6 February 2014

Is performance important?

Over the last decade, seismic progress has been made in the realms of application performance management - development in diagnostics, predictive analytics and DevOps enable application performance to be driven harder and measured in more ways than ever before.

But is application performance important?  At face value it seems like a rhetorical question: performance relating to the user experience is paramount, driving customer satisfaction, repeat business, competitive selection and brand reputation - yes, performance is important. However, it is more often the change in performance that directly influences these behaviours. A response time of 2 seconds may be acceptable if it meets user expectations - but could be awful if users were expecting half-second latency. User experience is also more than just performance: its quality is related to performance, availability, design, navigability, ease-of-use, accessibility and more.  Performance is important, yes - to a point.

The flip-side of performance is throughput, the rate at which business is processed.  Without contention, throughput rises in direct proportion to workload volume, without compromising performance. However, when contention starts, performance suffers and, crucially, throughput no longer rises in proportion to the arrival rate. In other words, in a contention state the rate at which business is transacted becomes impacted.
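
The shape of that relationship can be sketched with a simple single-queue approximation (all parameters assumed for illustration): throughput tracks the arrival rate until capacity is reached, while response time degrades sharply as contention sets in.

    # Single-queue approximation of the performance/throughput relationship (assumed parameters).
    SERVICE_TIME = 0.05          # seconds of work per transaction (assumption)
    CAPACITY = 1 / SERVICE_TIME  # maximum sustainable throughput, 20 txn/s

    def throughput(arrival_rate: float) -> float:
        return min(arrival_rate, CAPACITY)

    def response_time(arrival_rate: float) -> float:
        utilization = min(arrival_rate / CAPACITY, 0.99)   # cap to avoid divide-by-zero
        return SERVICE_TIME / (1 - utilization)            # M/M/1-style approximation

    for rate in (5, 10, 15, 18, 19, 25):
        print(f"arrival {rate:>2}/s  throughput {throughput(rate):>5.1f}/s  "
              f"response {response_time(rate)*1000:>6.0f} ms")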

So - is performance important?  Yes, clearly it is important, but only in the context of user-experience. However, a far more important measure of business success is throughput, as it is directly related to business velocity - how fast can a business generate revenue?

Consider the graph below, showing the relationship between performance and throughput for a business service.  The point at which throughput is compromised corresponds to a 20% degradation in response time.  Yet user experience is largely maintained at this level of performance; customers do not complain en masse until performance has degraded by double that amount.  By that point, the damage is already done.


SUMMARY
When seeking to understand the risk margin in service delivery, the more pertinent metric for business performance is throughput.  By building out a scalability assessment of your business services, the relationship between performance and throughput can be derived - and the right amount of capacity allocated to avoid a potential throughput issue.  Such an assessment can be empirical, but for the highest fidelity a simulation approach should be adopted.

The chart above was created using CA Performance Optimizer - a simulation technology that predicts application scalability under a range of different scenarios.

Monday, 27 January 2014

Finding the spare capacity in your VMware clusters

I recently oversaw a project for a large petrochemicals company, where we identified a potential five-fold saving in allocated capacity in a heavily-used VMware cluster.  I was gobsmacked at the over-allocation of capacity for this production environment, and decided to share some advice on capacity analysis in VMware.

How to find the savings in your VMware environment

Hook into your VMware environment and extract CPU utilization numbers for host, guest and cluster.  Use the logical definitions of VMware folders, or the cluster/host/guest relationships, to carve out meaningful correlations.  Be careful with heterogeneous environments: not all hosts or clusters will have the same configuration, and configuration is important.  Use a tool like CA Capacity Management to provide a knowledge base of the different configurations so you can compare apples and oranges.  Overlay all the utilization numbers and carry out a percentile analysis on the results - the results here represent a 90th-percentile analysis, and arguably a higher percentile should be used for critical systems.  Use the percentile as a "high watermark" figure and compare it against average utilization to show the "tidal range" of utilization.
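
A minimal Python sketch of that percentile analysis is shown below; the cpu_samples list is a stand-in for the utilization data you would extract from your own environment.

    # "High watermark" percentile analysis on extracted CPU utilization samples.
    import statistics

    cpu_samples = [12, 18, 25, 31, 22, 64, 58, 71, 30, 27, 45, 83, 40, 36, 29, 55]  # % per interval

    def percentile(data, pct):
        """Simple nearest-rank percentile."""
        ordered = sorted(data)
        rank = max(0, int(round(pct / 100 * len(ordered))) - 1)
        return ordered[rank]

    high_watermark = percentile(cpu_samples, 90)   # 90th percentile; use higher for critical systems
    average = statistics.mean(cpu_samples)
    print(f"90th percentile: {high_watermark}%  average: {average:.1f}%  "
          f"tidal range: {high_watermark - average:.1f}%")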


Memory utilization is a little more challenging, given the diversity of the VMware metrics.  Memory consumed is the host physical memory granted to the guest, but if the data collection period includes an OS boot it will be distorted by the memory allocation routines.  Memory active is based on "recently touched" pages, and so, depending on the load type, may not capture 100% of the actual requirement.  Additionally, there is a host overhead, which becomes significant once the number of virtual machines reaches a certain level.  Memory figures are further distorted by platforms like Java, SQL Server or Oracle, which hoard memory, hamster-fashion, in case it may be useful later.  For these reasons, it may also be relevant to consider OS-specific metrics (such as those from Performance Monitor).  The capacity manager should therefore use a combination of these metrics for different purposes, and refine their practice to avoid paging (the symptom of insufficient memory).
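
As a rough sketch of how these metrics might be combined, the snippet below treats active memory as a floor, consumed memory as a ceiling, and observed guest paging as the trigger for sizing towards the higher figure. The thresholds and margin are assumptions, not recommendations.

    # Illustrative memory right-sizing combining active, consumed and guest paging metrics.
    def suggest_memory_gb(mem_active_gb: float,
                          mem_consumed_gb: float,
                          guest_paging_rate: float,
                          safety_margin: float = 1.25) -> float:
        """Pick a working size between active (floor) and consumed (ceiling)."""
        if guest_paging_rate > 0:
            # Paging observed inside the guest: the VM is short of memory,
            # so size from the consumed figure rather than the active one.
            return mem_consumed_gb * safety_margin
        # No paging observed: actively-used memory plus a margin is the working estimate.
        return mem_active_gb * safety_margin

    print(suggest_memory_gb(mem_active_gb=6.0, mem_consumed_gb=14.0, guest_paging_rate=0.0))
    print(suggest_memory_gb(mem_active_gb=6.0, mem_consumed_gb=14.0, guest_paging_rate=12.5))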


It is also worth reviewing the IO figures from a capacity point of view, although a little more work is required to determine the capacity of the cluster, due to protocol overheads and behaviours.  Response time metrics are a consequence of capacity issues, not a cause, and although important they are a red herring when it comes to capacity profiling and right-sizing (you can't right-size based on a response time, but you can right-size based on a throughput).  I've disregarded disk-free statistics in this analysis - they would form part of a storage plan - but check the configuration of your SAN or DAS to determine which IO loads represent a risk of capacity bottlenecks.
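
A small sketch of right-sizing on throughput rather than response time: compare observed peak IOPS against the capability of the datastore, which you would take from your own SAN or DAS configuration rather than the placeholder used here.

    # Illustrative IO headroom check against an assumed datastore capability figure.
    ASSUMED_DATASTORE_IOPS_CAPABILITY = 8000   # placeholder - check your storage configuration

    def io_headroom_pct(peak_observed_iops: float,
                        capability_iops: float = ASSUMED_DATASTORE_IOPS_CAPABILITY) -> float:
        """Percentage of IO capability still unused at the observed peak."""
        return 100.0 * (1 - peak_observed_iops / capability_iops)

    print(f"IO headroom: {io_headroom_pct(5200):.0f}%")   # flag datastores with little headroom left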


The Actionable Plan

Any analysis is worthless without an actionable plan, and this is where some analytics are useful in right-sizing every element within that VMware estate.  CA Virtual Placement Manager gives this ability, correlating the observed usage against the [changing] configuration of each asset to determine the right size.  It seems to work effectively across cluster, host and guest level, and also incorporates several 'what if' parameters such as hypervisor version, hardware platform (from its impressive model library) and reserve settings.  It's pretty quick at determining the right amount of capacity to allocate to each VM - and how many hosts should fit in a cluster - even factoring in forecast data.  Using this approach, a whole series of actionable plans were generated very quickly for a number of clusters, showing that allocated capacity could be reduced five-fold or more.
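
For a feel of the host-count part of the question, here is a deliberately simplified packing sketch based on CPU demand only. It is not the vendor's algorithm - a real placement plan must also weigh memory, reserves, HA policy and forecast growth.

    # Simplified first-fit-decreasing estimate of how many hosts a set of right-sized VMs needs.
    def hosts_required(vm_cpu_demands, host_cpu_capacity, target_utilization=0.7):
        usable = host_cpu_capacity * target_utilization
        hosts = []                                   # remaining usable capacity per host
        for demand in sorted(vm_cpu_demands, reverse=True):
            for i, free in enumerate(hosts):
                if demand <= free:
                    hosts[i] = free - demand
                    break
            else:
                hosts.append(usable - demand)        # open a new host
        return len(hosts)

    # e.g. right-sized VM demands in GHz against 32 GHz hosts kept below 70% busy
    print(hosts_required([4.0, 3.5, 6.0, 2.0, 1.5, 5.5, 2.5, 3.0], 32.0))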

Thursday, 12 December 2013

IT's Day of Reckoning Draws Near

Bob Colwell, Intel's former chief architect, recently delivered a keynote speech proclaiming that Moore's Law will be dead within a decade.  Of course, there has to come an end to every technological revolution - and we've certainly noted the stabilization of processor clock speeds over recent years, in conjunction with an increasing density of cores per chip.

Moore's Law has been so dominant over the years that it has influenced every major hardware investment and every strategic data center decision.  Over the last 40 years, we have seen a consistent increase in processing capacity - reflected in both the increase in processor speeds and the increased density of transistors per chip.  In recent years, whilst processor clock speed has reached a plateau, the density of cores per chip has increased capacity (though not performance) markedly.

The ramifications of Moore's Law were felt acutely by IT operations, in two ways.

  1. It was often better for CIOs to defer a sizable procurement by six or twelve months, to get more processing power for their money.
  2. Conversely, the argument had a second edge - that it was not worthwhile carrying out any Capacity Management, because hardware was cheap - and getting cheaper all the time.

So, let us speculate what happens to IT operations when Moore's Law no longer holds:

  1. IT hardware does not get cheaper over time.  Indeed, we can speculate that costs may increase due to the costs of energy, logistics and so on.  Advancements will continue to be made in capability and performance, though not at the same marked rate we have seen historically.
  2. The rate of hardware refresh slows, since the energy and space savings offered by next-generation kit no longer justify frequent replacement.  Hardware will stay in support longer, and the costs of support will increase.
  3. Converged architectures will gain more traction as the flexibility and increased intra-unit communication rates drive performance and efficiency.
  4. You can't buy your way out of poor Capacity Management in the future.  Therefore the function of sizing, managing and forecasting capacity becomes more strategic.


Since capacity management equates very closely to cost management, we can also speculate that these two functions will continue to evolve closely.  This ties in neatly, though perhaps coincidentally, with the maturing of the cloud model into a truly dichotomous entity - one in which the consumer and the provider hold two differing views of the same infrastructure.  As cloud models mature in this way, it becomes easier to compare the market for alternative providers on the basis of cost and quality.

Those organisations with a well-established Capacity Management function are well placed to navigate effectively as these twin forces play out over the next few years, provided they:

  1. Understand that their primary function is to manage the risk margin in business services, ensuring sufficient headroom is aligned to current and future demands
  2. Provide true insight into the marketplace in terms of the alternative cost / quality options (whether hardware or cloudsourced)
  3. Develop effective interfaces within the enterprise to allow them to proactively address the impacts of forthcoming IT projects and business initiatives.

So - the day of reckoning draws near - and IT operations will adapt, as it always does.  Life will go on - but perhaps with a little bit more careful capacity planning....

Tuesday, 3 December 2013

The dichotomy of Capacity Management in a private cloud

[Image: The Pushmi-Pullyou - an analogy for the dichotomy of private cloud]

The Fable

The two great heads of IT sat and stared at each other across a meeting room table.  It was late in the day, and thankfully their services had all been restored.  Now was the time for recriminations.  The CIO had been called into firefighting meetings with the board all day.  They knew he was going to be royally pissed off, but who was going to get the blame?

The beginning

The story began when service performance nose-dived.  It was always a busy period, the lead-up to Christmas, but this season had been marked by some hugely successful promotional campaigns, and their online services had been humming with traffic.  Nobody quite knew what caused it, but suddenly alarms started sounding.  Throughput slowed to a trickle - and response times rocketed through the roof.  Downtime.  At first the infrastructure team, plunged into a triage and diagnostics scenario, did what they always did.  Whilst some were busy pointing fingers, they formed a fire-fighting team, and quickly diagnosed the issue - they'd hit a capacity limit at a critical tier.  As quickly as they could, they provisioned some additional infrastructure and slowly brought the systems back online.

The background

But why had all this happened?  Some months ago, and on the advice of some highly-paid consultants, the CIO had restructured the business into a private cloud model.  The infrastructure team provided a service to the applications team, using state-of-the-art automation systems.  Each and every employee was soon enamoured with this new way of working, using ultra-modern interfaces to request and provision new capacity whenever they needed it.  Crucially, the capacity management function was disbanded - it just seemed irrelevant when you could provision capacity in just a few moments.

The inquisition

But as the heads of IT discussed the situation, it seemed there were some crucial gaps they had overlooked.  The VP of Applications confessed that there was very little work being done in profiling service demand, or in collaborating with application owners to forecast future demands.  He lacked the basic information needed to determine service headroom - and crucially was unable to act proactively to provision the right amount of capacity.  In an honest exchange, the VP of Infrastructure also admitted to failings in managing commitment levels of the virtual estate, and in sizing the physical infrastructure needed to keep on top of demand.  In disbanding the Capacity Management function, they realized they had fumbled - and in fact needed those skills in both of their teams.

The Conclusion

The ability to act proactively on infrastructure requirements distinguishes successful IT organisations from the crowd.  What these heads of IT had realised is that the private cloud model enhances the need for Capacity Management, instead of diminishing it.  The dichotomy of Capacity Management in the private cloud is that the function belongs to both sides of the table - to the provider, and to the consumer.  Working independently, each side can improve demand forecasts and diminish the risk of performance issues.  Working collaboratively, the two combine in a partnership that provides the most effective way of addressing and sizing capacity requirements, aligning and optimizing cost and service headroom.


Take-aways


  1. As a consumer, ensure you are continually well-informed on service demand and capacity profiles.  Use these profiles to work with your application owners in forecasting different 'what if' scenarios.  Use the results to identify the most important metrics, and prepare a plan of action for when certain thresholds are reached (a simple example follows this list).
  2. As a provider, ensure you are continually tracking your infrastructure commitment levels and capacity levels.  Use the best sizing tools you can find to identify the right-size infrastructure to provision for future scalability.
  3. Have your capacity management teams work collaboratively to form an effective partnership that ensures cost-efficient infrastructure delivery and most effective headroom management.
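
A simple sketch of the threshold-driven plan of action mentioned in take-away 1, with illustrative thresholds and service figures:

    # Track service headroom from demand and capacity profiles and flag the agreed action.
    def headroom_pct(forecast_peak_demand: float, provisioned_capacity: float) -> float:
        return 100.0 * (1 - forecast_peak_demand / provisioned_capacity)

    def check_service(name: str, forecast_peak: float, capacity: float) -> None:
        h = headroom_pct(forecast_peak, capacity)
        if h < 10:
            print(f"{name}: {h:.0f}% headroom - trigger provisioning request now")
        elif h < 25:
            print(f"{name}: {h:.0f}% headroom - review with application owner")
        else:
            print(f"{name}: {h:.0f}% headroom - no action")

    check_service("online-checkout", forecast_peak=1800, capacity=2000)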

Will you wait for your own downtime before acting?

Tuesday, 24 September 2013

Dear CFO - 4 factors to evaluate before you invest in new IT

written in response to this great article on WSJ Blogs: http://t.co/lbyNeeMlxm

Dear CFO,

  before you decide on your strategy for investing in IT assets, whether on-premise or public cloud, there are a few important factors I think you should consider.

  Clearly, your decision ultimately will be based on:
  • The fit with your longer-term business plans
  • The measurable benefit to your business
  • The investment needed
  Investments in IT capacity enable your business to transact a certain amount of business.  As in other areas of your business, investments in capacity should be made to alleviate bottlenecks and increase the ability to transact business.  However, in IT there are certain complexities to factor in:
  • comparing and contrasting capacity options has descended into a "dark art", with many stakeholders and an overriding aversion to risk
  • measuring capacity usage has become a specialized platform function, leading to difficulties in getting an end-to-end perspective of how much business can be transacted
  • an increasingly agile enterprise is causing rapid fluctuations in capacity requirements, again with an aversion to risk

  Long ago, a management function was created to address these problems for the mainframe platform - Capacity Management.  That function can be leveraged again now, to allow you to plan ahead effectively for your long-term IT future.  When evaluating that function within your IT department, you should ensure it:
  1. endeavours to provide a complete picture of capacity usage across all silos
  2. provides visibility of service headroom, potential bottlenecks and abundances
  3. couples together with your financial management controls, providing governance over capacity allocation
  4. gives insight into future business scenarios, allowing investments to be rebalanced against the needs to transact business
In summary - if capacity management can be harnessed to better manage the costs of IT capacity, greater focus can be placed on transformational activities that add value in other ways.