But is high resolution monitoring really necessary?
For one, there are some very real side-effects. We call it 'stock exchange fever': level 1 staff monitoring the platforms tend to panic at the slightest spike or dip in data sampled at per-second intervals. Because the sampling interval is so short, every small peak and trough is displayed in its full glory. This typically leads to false positives and a general over-amplification of concern (even panic) at the slightest bump in a monitoring graph, regardless of the specific resource involved.
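To make the 'stock exchange fever' concrete, here is a minimal sketch (with made-up sample values, not real monitoring data) contrasting a naive alert that fires on any single high sample against one that requires the condition to hold for a sustained window:

```python
# Hypothetical per-second CPU samples: a healthy node with two momentary blips.
noisy = [20.0, 22.0, 95.0, 21.0, 19.0, 23.0, 90.0, 20.0, 22.0, 21.0]

THRESHOLD = 80.0

# Naive rule: alert whenever any single sample crosses the threshold.
naive_alerts = sum(1 for s in noisy if s > THRESHOLD)

# Sustained rule: alert only if the threshold is exceeded for 5 consecutive seconds.
sustained_alert = any(
    all(s > THRESHOLD for s in noisy[i:i + 5])
    for i in range(len(noisy) - 4)
)

print(naive_alerts)     # 2 -- two false alarms from momentary spikes
print(sustained_alert)  # False -- nothing was actually wrong
```

The high resolution data is not the problem; alerting on individual samples is. Smoothing the alert rule, not the data, keeps the detail without the panic.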
The case for high resolution monitoring
On the other hand, a high resolution sampling interval is sometimes worth its weight in gold. This became apparent during a recent benchmarking exercise, performed as part of our performance engineering services for customers, in which an application deployed on a public cloud platform was benchmarked to understand the performance characteristics of the codebase under future load and growth.
Since this application was under the control of an automatic scaling configuration, the initial expectation was that, given unlimited hardware resources from the public cloud provider, the application would just scale without any problem as load increased. This was not the case.
During the load ramp-up we observed that, even at low node utilisation, the platform struggled to serve requests. Only after a few minutes of auto-provisioning new nodes did the service recover, only to struggle again shortly thereafter. This was very odd.
Up to this point, the public cloud metric platform had been used to visualise the node metrics. However, it records at a resolution of 1 minute and failed to highlight any problems. We decided to deploy Photon agents on the nodes, with a 10-second sampling interval, to colour in the picture and troubleshoot the problem.
Within a few minutes of starting a new load benchmark, we noticed that all the nodes exhibited high processor spikes lasting ~ 1 second, at regular intervals. As load increased, these spikes soon maxed out processor capacity… albeit only for ~ 1 second.
The time spent at peak processor utilisation grew as load increased. Soon, each node was essentially pausing for 2-3 seconds every 10 seconds. At this stage, the node's average processor utilisation (over a 60-second interval) was still quite low. However, the pauses in processing caused a backlog of transactions and even some failures.
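This averaging effect is easy to demonstrate. The sketch below uses hypothetical numbers (not actual Photon output): a node that idles at 25% but stalls at 100% CPU for 2 seconds out of every 10. The 60-second average sits at a calm 40%, while the per-second samples show the node fully stalled every cycle:

```python
def simulate_cpu(seconds):
    """Per-second CPU utilisation: a 2 s stall at 100%, then 25%, each 10 s cycle."""
    return [100.0 if t % 10 < 2 else 25.0 for t in range(seconds)]

def window_average(samples, window):
    """Mean utilisation over consecutive windows of `window` seconds."""
    return [sum(samples[i:i + window]) / window
            for i in range(0, len(samples), window)]

samples = simulate_cpu(120)

minute_view = window_average(samples, 60)  # what a 1-minute metric platform shows
print(minute_view)   # [40.0, 40.0] -- looks perfectly healthy
print(max(samples))  # 100.0 -- the node is fully stalled every 10 s
```

The coarse view and the fine view describe the same node, yet only the fine one reveals why transactions are queueing up.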
At an average utilisation of ~ 40%, the extreme jitter and processing pauses led the load balancer to mark the node as unhealthy. Since the auto-scaling threshold was set at 40%, each newly provisioned node arrived late to an already unhealthy workload and was soon, justifiably, struggling to keep up.
The 60-second systems monitoring graph showed nodes with ~ 45% load being marked unhealthy and auto-scaling being engaged… only for the new nodes themselves to be marked unhealthy. Why?
Using Photon’s high resolution monitoring, we identified the extreme jitter on the nodes and deduced the cause. Without going into too much detail, a component on the host responsible for shipping logs acquired a lock that paused processing and created the unhealthy jitter.
Low resolution monitoring is great for gathering statistics to project future load, estimate costs, and spot problems on the horizon, but it is very much a high-level management overview of your platform. Sometimes you need a high resolution picture: perhaps not in a typical IT infrastructure management and planning role, but in a troubleshooting and fault-finding role.
The beauty of Photon is that it provides both, and can be configured to assist in both scenarios.