
Processing log events at scale with Alluvia

by | Mar 31, 2019 | Monitoring Solutions

Let's face it: processing logs takes up loads of cycles, your own as well as heaps of CPU cycles, especially when you need to trawl through complex patterns buried in the logs, and most log entries don't carry enough context to bind them to a transaction, process, function or whatever parameter is of interest. Various logging standards and layouts are in active use, and it takes time to understand what the application developer logged and what value it might hold for the business. Hidden in a high-velocity log there might be one very specific entry, occurring every 5 minutes, that is crucial for a given role or line function, while all the other log lines need to be interpreted collectively to determine whether the application is stable and performing correctly.

Writing the rules that extract information from logs is where the fun starts. Once we understand what we want captured, we can configure the log processor. In this post we'll be looking at Logstash, a hugely popular open-source ETL engine from elastic.co.

logstash
Logstash has various input and output plugins. For now, we're only going to focus on the Beats input. Getting the events from the logs to Logstash presents our first challenge. We very often encounter managed vendor solutions at the telcos where we are engaged, and understandably it is often hard or impossible to persuade the vendors to let a well-meaning third party (us!) install any agents on their servers because, it is argued, this may compromise accountability and therefore support. We are therefore often challenged to find innovative ways of getting to the precious logs.
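
To ground the rest of the discussion, here is a minimal sketch of a Logstash pipeline with a Beats input. The port, grok pattern and Elasticsearch output are illustrative and will differ per deployment.

input {
  beats {
    port => 5044
  }
}

filter {
  grok {
    # Illustrative pattern; the real rules depend on the application's log layout
    match => { "message" => "%{TIMESTAMP_ISO8601:ts} %{LOGLEVEL:level} %{GREEDYDATA:msg}" }
  }
}

output {
  elasticsearch {
    hosts => ["localhost:9200"]
  }
}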

Let’s look at a few of these options.

rsyslog

Not much can be said about rsyslog that is not already known. If set up properly, all logs can easily be aggregated to various collection servers for pickup. It is fairly robust and Just Works™, but changing the configuration system-wide will impact system logging, and configurations will be overwritten when vendors re-image hosts with the latest software releases. This approach often also requires access to dev/test environments so that the configurations can be formally tested and approved through change control processes.
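
For reference, forwarding everything to a central collector is a one-liner in classic rsyslog syntax; the collector host and port below are illustrative.

# e.g. /etc/rsyslog.d/50-forward.conf (illustrative file name)
# @@ forwards over TCP; a single @ would forward over UDP
*.*  @@logcollector.example.net:514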

NFS shares

In many cases, and more so in Unix environments, good old NFS might be the simplest way of solving log collection, but it presents interesting problems, especially at scale. Firstly, the number of mount points required can grow quickly: log files are not always under the same parent folder, so the administrator has to export multiple mount points. Coupled with the sheer number of servers, each with multiple mount points, a logical structure becomes tricky to implement and manage. Lastly, maintaining the mounts needs near-constant attention, since a few of them will inevitably fail to connect due to server reboots, network fluctuations and the like, which causes issues for the agent tailing the files to send events to Logstash. NFS is almost certainly the best option for file access over network shares, but even so, it is a problematic option for the reasons above.
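
To illustrate the sprawl, the mounts for just one application server might look like the entries below. The host names and paths are purely illustrative, and the soft/timeout options are one way of limiting the damage when an export disappears.

# /etc/fstab entries for a single application host (illustrative)
app01:/var/log/appserver     /mnt/logs/app01/appserver     nfs  ro,soft,timeo=50,retrans=2  0 0
app01:/opt/vendor/abc/logs   /mnt/logs/app01/vendor-abc    nfs  ro,soft,timeo=50,retrans=2  0 0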

rsync

Rsync is one of those indispensable Unix utilities with a massive and growing fanbase. Rsync is awesome. One thing it does by default, however, is create a new inode for every synced file, and inodes are what the log tailer relies on to track changes and detect log rollover. Using rsync for our purposes with the default settings will therefore most likely cause duplicate events and seriously distort the metrics gathered from the log files. Rsync does have an `--inplace` flag that makes it write the differences into the existing file (and thus the same inode) instead of creating a new one on each run. Using this method, we can have files updated every few minutes, depending on log velocity and network throughput. When scripting rsync, never allow two rsync tasks to operate on the same path at the same time; use a lock file to guard against it, or you'll invariably run into race conditions. See the examples below.
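
The rsync invocation itself might look something like this; the source host and paths are purely illustrative.

# Pull the vendor's logs onto the collection server, reusing inodes on the destination
rsync -a --inplace vendoruser@app01:/opt/vendor/abc/logs/ /opt/vantage/alluvia/logs/app01/abc/

And a minimal wrapper to prevent two copies of the script running at once: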

#!/bin/bash

# Use the script name (not its full path) to build the lock file name
LOCK=/opt/vantage/alluvia/tmp/$(basename "$0")-running

if [ -f "$LOCK" ]; then
  echo "exiting as another $0 is running"
  exit 1
fi

touch "$LOCK"

# Remove the lock even if the sync fails part-way
trap 'rm -f "$LOCK"' EXIT

echo "script logic goes here"
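
Scheduled from cron, a five-minute cadence matches the "updated every few minutes" goal above; the script path and log location are illustrative.

# m h dom mon dow  command
*/5 * * * * /opt/vantage/alluvia/bin/sync-vendor-logs.sh >> /opt/vantage/alluvia/log/sync.log 2>&1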