What is monitoring?
Monitoring, in its simplest form, has been around since the beginning of time. It is not a new phenomenon developed by engineers to ensure that your application is still up and running. The Oxford dictionary has multiple definitions of the word “monitor”, the most pertinent being “Observe and check the progress or quality of (something) over a period of time; keep under systematic review.” Whether it’s application monitoring, operating system monitoring or hardware monitoring, the principle is the same: make sure it’s effective. In this article I’m focusing on system monitoring, which we call Photon.
Is your monitoring effective?
A monitoring system is only as good as its implementation. No product can be deployed and then left unattended, yet that is what happens far too often. It’s tempting to buy the fanciest tool that’s the flavour of the week, but it’s not going to deploy and manage itself. Environments change constantly, which means your monitoring solution needs to adapt accordingly. Monitoring is often an afterthought, something required to tick a box as a business requirement, and the effectiveness of the system isn’t questioned until something goes wrong.
Is it possible to alert on too much?
With the sheer amount of data being generated, it is tempting to alert on every single metric you collect. Just because it’s possible doesn’t mean it’s going to be effective. Most of the data collected will help with troubleshooting a specific issue. Alerting on all of it, however, creates so much noise that it becomes almost impossible to distinguish something critical from something that should be classed as informational. If you receive 100 alerts notifying you that your number of transactions has increased by one per second, and you miss the alert notifying you that your primary database has run out of disk space, you know you’ve gone too far.
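One way to picture the difference between collecting data and alerting on it is to give every event a severity and only notify a person above a chosen threshold. The snippet below is a minimal sketch, not a description of any particular product: the `Severity` levels, the `Event` type and the `split_events` helper are all hypothetical names invented for illustration.

```python
from dataclasses import dataclass
from enum import IntEnum


class Severity(IntEnum):
    """Hypothetical severity levels; real tools define their own."""
    INFO = 1
    WARNING = 2
    CRITICAL = 3


@dataclass
class Event:
    source: str
    message: str
    severity: Severity


def split_events(events, page_threshold=Severity.CRITICAL):
    """Return (to_page, to_record).

    Events at or above the threshold notify a person; everything else
    is kept for troubleshooting but does not raise an alert.
    """
    to_page = [e for e in events if e.severity >= page_threshold]
    to_record = [e for e in events if e.severity < page_threshold]
    return to_page, to_record


# 100 informational events and one critical event, as in the scenario above.
events = [
    Event("api", "transactions per second increased by 1", Severity.INFO)
    for _ in range(100)
]
events.append(Event("primary-db", "disk space exhausted", Severity.CRITICAL))

to_page, to_record = split_events(events)
print(len(to_page), len(to_record))
```

With this split, the single database alert is the only thing that reaches a person, while the informational events remain available when you need to troubleshoot.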
Are you using the right tool for the right job?
Over the years monitoring has been split into multiple categories, and sometimes the lines become blurred. For instance, using your monitoring tool as your Business Intelligence (BI) tool is not a good idea. Just because a monitoring tool scrapes logs, pulls statistics from an API, queries a database or collects in-flight data doesn’t mean it should be used for reporting on business data. There is a reason your application stores this information, and using a monitoring tool to do your financial calculations doesn’t work. Alerting on trends and anomalies is one thing, but don’t confuse a monitoring tool with a BI tool.
How long should you store this data for?
If you are collecting CPU metrics at a 10 second interval, the chances are you won’t need such fine-grained metrics 12 months down the line. Although the data points are generally small, they still take up space. At Breakpoint we use time series databases to store metrics.
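To get a feel for the numbers, the short calculation below counts how many data points a single metric produces in a year at a 10 second interval compared with an hourly interval. It is only back-of-the-envelope arithmetic; `points_per_year` is a helper made up for this example, and a 365 day year is assumed.

```python
SECONDS_PER_DAY = 24 * 60 * 60
DAYS_PER_YEAR = 365  # assumption: leap years ignored


def points_per_year(interval_seconds):
    """Number of samples one metric produces in a year at the given interval."""
    return SECONDS_PER_DAY * DAYS_PER_YEAR // interval_seconds


raw = points_per_year(10)       # one sample every 10 seconds
hourly = points_per_year(3600)  # one sample every hour

print(raw, hourly, raw // hourly)
```

That is per metric, per server, so the total grows quickly as the number of hosts and collected metrics increases.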
Although these databases can ingest large amounts of data using minimal resources, querying that data can be a challenge. For example, say you want to overlay the CPU utilisation of all servers for the past year. You are unlikely to need this data at a sample rate of 10 seconds, and running the query at that resolution requires a huge amount of resources unnecessarily. Downsampling the data into hourly averages should be enough for a meaningful answer, and it will require far fewer resources.
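Most time series databases offer their own downsampling or rollup features, and those are what you would normally use. Purely to illustrate the idea, here is a minimal sketch in plain Python that averages 10 second samples into hourly points. The `downsample_hourly` function and the synthetic data are assumptions made for this example, with samples represented as `(unix_timestamp, value)` pairs.

```python
from collections import defaultdict
from statistics import mean


def downsample_hourly(samples):
    """Average (unix_timestamp, value) samples into one point per hour."""
    buckets = defaultdict(list)
    for timestamp, value in samples:
        # Round the timestamp down to the start of its hour.
        hour_start = timestamp - (timestamp % 3600)
        buckets[hour_start].append(value)
    return [(hour, mean(values)) for hour, values in sorted(buckets.items())]


# Two hours of synthetic CPU readings at a 10 second interval:
# 20% utilisation in the first hour, 60% in the second.
samples = [(t, 20.0 if t < 3600 else 60.0) for t in range(0, 7200, 10)]

hourly = downsample_hourly(samples)
print(len(samples), "raw points ->", len(hourly), "hourly points")
```

An average hides short spikes, so it may be worth keeping the hourly maximum alongside it if peaks matter for your use case.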
Monitoring is not a very glamorous topic. In most cases it’s simply an insurance policy to notify you when something’s not working as expected. It won’t help you land a deal (unless of course you develop monitoring products), but if your system is unstable you’re likely to lose customers. A customer is unlikely to contact you when your system is unstable; they’ll simply move their business elsewhere. It takes time, commitment and effort to maintain a monitoring platform. It’s essential that new services are monitored when they are deployed.
Most of the work can be automated, but a monitoring solution is not something that can be installed and forgotten about. There is nothing worse than a client notifying you of a problem that could’ve been avoided if your monitoring system had been functioning as expected.