How good monitoring can help the business
I often see IT managers who either don't monitor their users' services at all, or set up a generic tool and consider the job done. Most of the time, they realise how much is missing when they have to report on recurrent issues affecting business-critical services. Monitoring should not be considered a cure but a forecasting tool. When planned and configured as such, it helps prevent predictable failures and drives capacity planning.
What is monitoring?
Quoting Oxford's online dictionary, a monitor is "a device used for observing, checking, or keeping a continuous record of something". I like this definition very much because it highlights why monitoring systems are built and used the way they are nowadays: for historical and cultural reasons, they are not used as proactive systems.
Two main monitoring approaches can be identified: polling for UP/DOWN service status and collecting statistics. Polling is what tools like Nagios or HPOV do: they regularly check whether a piece of hardware or a service is up and, if not, send an alert to the person in charge. Collecting statistics is what tools like MRTG and RRDtool do: for each configured metric, they connect to equipment or applications and gather a set of values and states.
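To make the polling idea concrete, here is a minimal Python sketch of what such a check boils down to: open a TCP port at regular intervals and report UP or DOWN. The host, port and interval are arbitrary examples, and real pollers like Nagios obviously do much more (scheduling, escalation, notifications).

```python
# Minimal sketch of the polling idea behind tools such as Nagios:
# periodically try to open a TCP port and report UP/DOWN.
# Host, port and interval are arbitrary example values.
import socket
import time

def check_tcp(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

if __name__ == "__main__":
    while True:
        status = "UP" if check_tcp("www.example.com", 443) else "DOWN"
        print(f"{time.strftime('%H:%M:%S')} www.example.com:443 is {status}")
        time.sleep(60)  # a real poller would also send alerts on DOWN
```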
Why is monitoring important?
Being aware of the global and detailed health of infrastructure services is one of the keys to success when dealing with user satisfaction and business support. Monitoring is often set up to answer straightforward questions such as "Is my application up and running?", "Is the network down?", "Does my server have a hardware issue?", "Am I providing poor quality of service, or is my application slow?". Those questions can be answered immediately for the present and the past, but not for the future, unless you have configured your monitoring system with proactivity in mind.
There are times when sh!t just happens. This can't be avoided and has to be fixed fast when it does. There is no magic wand for such events. All you can do is make sure you get the information quickly and can correct the issue within the Recovery Time Objective (RTO). How you recover services is not the point here, so I won't detail it. But basic monitoring helps you see what's wrong and identify the point of failure, so that you can correct it quickly and/or communicate about the incident and its recovery time.
But some issues can be predicted. The most common complaint you'll hear about is slowness: "Access to this service takes ages", "That application is damn slow", "Emails arrive tens of minutes late", etc. Monitoring won't solve those issues by itself, but it will help you deal with capacity planning, capacity management and quality of service. By regularly and proactively monitoring key metrics of your hardware and software, you'll be able to predict load issues and prevent slowness, or at least know when unacceptable slowness is bound to be reached if nothing changes.
How to monitor efficiently?
Minimum monitoring consists of polling various hardware and software components. When implemented, basic monitoring covers most of the required metrics (a minimal collection sketch follows the list below).
- CPU: the usage percentage is the minimum to monitor. It tells you whether there is enough processing power for the tasks at hand. What counts as healthy depends on the workload, but most of the time an average load under 30% means you have more than enough processing power. Above that limit, it is not so much the number itself that matters as the way the load varies over time.
- RAM: memory usage is a rather counter-intuitive metric to monitor. You might expect free memory to be the key to system health, but what is unused memory good for? Most of the time, the system or application will use as much RAM as possible, and you'll end up with systems that have little or no free memory. This is only a problem if the system manages memory poorly or if the application does not have enough RAM to run properly. Configuring the application to limit its RAM usage helps preserve memory for the rest of the system; it is then the application's internal memory usage that has to be monitored. In the end, the most important thing to watch is swapping: 80% RAM usage is not a problem if, and only if, data stays in memory rather than being swapped back and forth between memory and disk.
- DISK: it is no news that data storage needs grow nearly every day. Monitoring the usage percentage lets you know whether you have enough space left to store data. The point where action should be taken is around 70% of disk capacity, although there may be no big issue until 95% of usage, and trouble can also appear well before 50%. There are very few cases where basic monitoring is enough to assess storage health; I'll detail this later on.
- NETWORK: bandwidth monitoring is probably the oldest monitoring metric there is, and it is generally well handled. Nowadays, with virtualization, bandwidth monitoring also has to deal with link aggregation, a notion that has long existed in the network world but that system administrators now have to consider as well. With cloud computing, network monitoring adds another requirement: response time. Again, network administrators have dealt with response time for ages, but services in the cloud extend the requirements on network access, and response time is what helps you understand why an application seems slow.
- Others: hardware problems (temperature, fan speed, disk errors…) and open network ports are often considered the Holy Grail of monitoring. Those checks are indeed very important, but they are still basic monitoring in the sense that they only answer the "what's up right now?" question.
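As a rough illustration of these basic checks, here is a minimal Python sketch built on the third-party psutil library. The thresholds are the indicative values discussed above, not universal rules, and a real deployment would rely on a proper monitoring agent rather than a standalone script.

```python
# A minimal sketch of the basic checks described above, using the
# third-party psutil library (pip install psutil). Thresholds are the
# indicative values from the text, not universal rules.
import psutil

def basic_health_report() -> list[str]:
    warnings = []

    cpu = psutil.cpu_percent(interval=1)          # CPU usage over 1 second
    if cpu > 30:
        warnings.append(f"CPU usage {cpu:.0f}% - watch how it evolves over time")

    swap = psutil.swap_memory()                   # swapping matters more than free RAM
    if swap.percent > 0:
        warnings.append(f"Swap in use ({swap.percent:.0f}%) - RAM may be insufficient")

    disk = psutil.disk_usage("/")                 # per-filesystem usage
    if disk.percent > 70:
        warnings.append(f"Disk usage {disk.percent:.0f}% - plan for extra capacity")

    net = psutil.net_io_counters()                # cumulative interface counters
    if net.errin or net.errout:
        warnings.append(f"Network errors: in={net.errin}, out={net.errout}")

    return warnings

if __name__ == "__main__":
    for line in basic_health_report() or ["All basic metrics within limits"]:
        print(line)
```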
Extending basic monitoring isn't that complex: keep an eye on response time and latency. Because IT systems are increasingly shared and/or rely on cloud computing, the overall load of a service is not enough to assess quality of service (a simple latency probe is sketched after the list below).
- CPU: virtualization adds a layer of abstraction to CPU monitoring. CPU usage inside a virtual machine can be quite unrelated to CPU usage on the hypervisor. A hypervisor hosting tens of virtual machines can be heavily loaded without it being a problem, as long as the hosted VMs get the power they need. In that case, you'll want to extend monitoring to keep an eye on the time it takes for a virtual CPU to gain access to a physical processing unit. This has various names depending on the technology (CPU ready in VMware, for example), but the question is always the same: how long does a virtual CPU wait before the physical CPU gets its job done?
- RAM: memory is also affected by virtualization. From the hypervisor's point of view, overall memory usage is monitored as explained previously. There are also virtualization-specific memory management mechanisms that should be monitored, such as memory compression and memory sharing; these give you a better idea of the capacity of the infrastructure. A hypervisor, like any "standard" server, can swap to disk when not enough RAM is available, lowering the overall quality of service.
- DISK: in the "old days", data was stored locally on a single disk or a small group of disks. In those configurations, filling the disks was an issue for available space but also for speed, because access speed was not equal across the whole storage. Nowadays, most storage is provided by dedicated equipment such as SANs or NAS appliances. This way of storing data, shared among front-end services, leads to a specific concern: access time. Two metrics usefully extend monitoring here: IOPS and response time. The first tells you how many disk operations are available per second, which gives you an idea of how many applications can access the data in parallel, how fast an application can get its data, and how busy the storage system is relative to what it can deliver. Response time tells you how long an application waits before getting its data. Acceptable numbers depend heavily on the application, but values above 50 ms are probably bad for the user experience.
- NETWORK: most of the time, network monitoring consists in measuring how much data comes in and out. Whatever the technology, collecting network latency can greatly improve your understanding of the overall quality of service of the infrastructure. On physical links, whether wired or wireless, monitoring errors such as retries and collisions will help you identify a hardware failure or a software misconfiguration.
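As an illustration of response-time monitoring, here is a minimal Python sketch that measures how long a TCP connection to a service takes and flags averages above a threshold. The 50 ms limit and the target host are assumptions for the example; in practice you would probe your own services and tune the threshold to your service agreements.

```python
# A minimal sketch of a latency probe: time how long a TCP connection
# to a service takes and flag averages above an assumed threshold.
# The 50 ms threshold and the target host are illustrative only.
import socket
import statistics
import time

def connect_latency_ms(host: str, port: int, timeout: float = 3.0) -> float:
    """Return the time in milliseconds needed to open a TCP connection."""
    start = time.perf_counter()
    with socket.create_connection((host, port), timeout=timeout):
        return (time.perf_counter() - start) * 1000

if __name__ == "__main__":
    samples = [connect_latency_ms("www.example.com", 443) for _ in range(5)]
    avg = statistics.mean(samples)
    print(f"average connect latency: {avg:.1f} ms "
          f"(samples: {[round(s, 1) for s in samples]})")
    if avg > 50:
        print("latency above 50 ms - investigate network or service load")
```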
Both previous sections addressed monitoring from the IT point of view, that is, the technical point of view. It allows you to prove that services are up and that hardware is properly configured and sized, but it misses the user-experience point of view. Monitoring application quality of service requires the following (a combined sketch follows the list below):
- Application chain: an n-tier application, possibly sharing back-end services with others, must be monitored as a whole. Knowing each component's access time is good, but what counts is the total time from the user's request to the application's answer: the application response time is the sum of the access times of every component it is made of. If possible, monitor the delay between the user's request and the application's reply. Thresholds depend on what is acceptable to users and on what the service agreements promise.
- Returned value: an application can be up and running from the technical point of view (network port, system process, metric values) and still have to be considered out of order from the user's point of view, most often when it replies with error messages or status codes rather than meaningful data. Monitoring should be configured to expect "normal" replies and to identify error messages; both determine whether the application is in an acceptable state. It is not enough to check that the network port is open and that the application process replies: you need a way to parse replies and distinguish errors from valid data.
- Parsing the logs: log files are mostly used after the fact to understand what happened. A better approach is to proactively identify error messages that may pop up in the logs and send the proper alert when they do. This requires help from the application vendor or developer to list the relevant messages and error codes. Recurrent errors are often the sign of a configuration issue or of an incoming (critical) problem.
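To illustrate this user-centric approach, here is a minimal Python sketch that combines the three ideas: it measures the end-to-end response time of an HTTP request, checks that the reply contains expected content rather than an error page, and scans a log file for known error patterns. The URL, the expected marker, the log path and the error patterns are illustrative assumptions.

```python
# A minimal sketch of a user-centric application check: end-to-end
# response time, content validation, and log scanning. The URL,
# expected marker, log path and patterns are illustrative assumptions.
import re
import time
import urllib.request

def check_application(url: str, expected_marker: str, timeout: float = 10.0):
    """Fetch the URL, return (is_ok, elapsed_ms) from the user's point of view."""
    start = time.perf_counter()
    with urllib.request.urlopen(url, timeout=timeout) as response:
        body = response.read().decode("utf-8", errors="replace")
    elapsed_ms = (time.perf_counter() - start) * 1000
    ok = response.status == 200 and expected_marker in body
    return ok, elapsed_ms

def scan_log(path: str, patterns=("ERROR", "FATAL", "Traceback")):
    """Return log lines matching any of the known error patterns."""
    regex = re.compile("|".join(patterns))
    with open(path, encoding="utf-8", errors="replace") as log:
        return [line.rstrip() for line in log if regex.search(line)]

if __name__ == "__main__":
    ok, elapsed = check_application("https://www.example.com/", "Example Domain")
    print(f"application {'OK' if ok else 'KO'} in {elapsed:.0f} ms")
    try:
        alerts = scan_log("/var/log/app/app.log")   # hypothetical log location
    except OSError:
        alerts = []
    for line in alerts[-5:]:                        # last few matches only
        print("log alert:", line)
```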
Anticipate rather than endure
The most important thing about monitoring is probably to accept that numbers by themselves have limited value. Numbers can be interpreted in multiple ways depending on the context in which they appear. Put simply, 99% CPU usage is an issue if it leads to poor service performance; but it can be perfectly fine if, for example, it only happens occasionally because of a scheduled job (such as a backup or statistics generation) and the result is still delivered in an acceptable amount of time. What matters is: will such numbers affect the quality of service, and if so, when will they become critical to the user experience?
Once you have collected the relevant metrics, you have to analyze them in order to produce forecasts on the various aspects of the infrastructure or the monitored applications. One-shot audits provide the state of the monitored environment at a particular moment; statistical projections give a longer-term view of its overall health, as in the sketch below.
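As a simple example of such a projection, the sketch below fits a straight line through past disk-usage samples and estimates when an assumed 95% limit would be reached. The sample data and the limit are invented for illustration; real capacity planning would use longer histories and more robust models.

```python
# A minimal sketch of a statistical projection: fit a straight line
# through past disk-usage samples and estimate when an assumed 95%
# limit would be reached. The sample data is invented for illustration.
from datetime import date, timedelta

# (day offset, disk usage %) - e.g. one measurement per week
samples = [(0, 52.0), (7, 54.5), (14, 56.0), (21, 58.5), (28, 60.0)]

n = len(samples)
sum_x = sum(x for x, _ in samples)
sum_y = sum(y for _, y in samples)
sum_xy = sum(x * y for x, y in samples)
sum_x2 = sum(x * x for x, _ in samples)

# Ordinary least-squares slope and intercept
slope = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
intercept = (sum_y - slope * sum_x) / n

limit = 95.0
days_until_limit = (limit - intercept) / slope
print(f"growth: {slope:.2f} %/day, projected to reach {limit}% "
      f"around {date.today() + timedelta(days=days_until_limit)}")
```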
"When will CPU power or RAM become insufficient to an unacceptable degree?", "When will storage stop serving data fast enough for acceptable performance?", "When should I renew or buy extra hardware to ensure proper quality of service?". These are questions a complete monitoring solution can address, provided you use it as an input for supervision and capacity planning rather than as a tool for browsing performance history. This is called metrology.