Lessons learned while deploying DEM

Nov 14, 2018 | Monitoring Solutions

Do you really know what issues your customers are experiencing when using your digital services? How long does it take for a transaction to go through? How many transactions fail on the first attempt? How often are sessions interrupted, or never make it to your service API in the first place? And more importantly, how many potential customers fail to return and how much business are you losing in the process?

When a service is delivered over digital channels and your only feedback is the analytics, stats and logs provided by the application, then your only measure of success is confined to the data for those transactions that actually made it all the way to completion. And because analytics often begin at the point where transactions execute against the stack (ignoring the carriers, gateways and middleware in between), you would never know that there is often a morass of timed-out logins, failed transactions, lost states and interrupted sessions out there in the real world. Unless you actively monitor that part of the service landscape, that is. And that’s where Apogee comes in.

When our customers are first presented with the real-time Digital Experience Monitoring (DEM) data that Apogee provides, they are almost always surprised, then a little shocked, and then driven to bring more visibility to their digital channels by expanding their DEM initiatives. That is, of course, after the immediate initiatives are put in play to isolate and correct the newly uncovered service issues. After all, the whole point of DEM is to improve service quality.

Let us illustrate with a case study of an application issue that Apogee discovered.

Apogee was deployed as part of a migration and expansion project for a client – a mobile network operator. We implemented the first synthetic transaction test, and within hours noticed that the success rate was hovering between 80 and 90%. A second synthetic test, implemented on another channel, showed the same success rates, which confirmed the findings. The operator had until then been unaware of the relatively high transaction failure rates, and was at first sceptical of the findings. Honestly speaking, this scenario is familiar territory for us by now, so we patiently and politely stuck to our guns…

One of Apogee’s newly implemented features stores and visualises the raw USSD output, as received from the network, for each synthetic test. Via the dashboard we observed that the returned output for each failed transaction was an empty string – and, even more frustratingly, these failures appeared to be completely random.
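The detection logic here is conceptually simple: a synthetic transaction that comes back with an empty payload counts as a failure, and the success rate is just the fraction of runs that returned real output. The sketch below is a hypothetical illustration of that idea, not Apogee’s actual implementation – the record shape and function names are assumptions.

```python
from dataclasses import dataclass

@dataclass
class SyntheticResult:
    """Outcome of one synthetic USSD transaction (hypothetical record shape)."""
    session_id: str
    raw_output: str  # raw USSD string as received from the network

def classify(result: SyntheticResult) -> str:
    """Treat an empty (or whitespace-only) return string as a failure."""
    return "fail" if not result.raw_output.strip() else "success"

def success_rate(results: list[SyntheticResult]) -> float:
    """Fraction of synthetic transactions that returned real output."""
    if not results:
        return 0.0
    ok = sum(1 for r in results if classify(r) == "success")
    return ok / len(results)

# Example: 8 of 10 synthetic runs returned a USSD menu, 2 came back empty.
runs = [SyntheticResult(f"s{i}", "1. Balance\n2. Data") for i in range(8)]
runs += [SyntheticResult(f"s{i}", "") for i in range(8, 10)]
print(f"success rate: {success_rate(runs):.0%}")  # -> success rate: 80%
```

The value of keeping the raw output per test, rather than just a pass/fail bit, is exactly what the case study shows: the empty strings themselves were the clue that pointed the vendor at the gateway component.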

Although there were still no user complaints at this point, Apogee continued to indicate a failure rate between 10 and 20%, so the investigation shifted towards the backend application. Further traces were run, and network captures were taken and sent to the application vendor for analysis.

After reviewing the traffic captures, the software vendor immediately recognised the problem: a gateway component needed some small changes. The vendor formally acknowledged the issue and provided a patch. After a few days of testing in the staging environment, the customer was satisfied that the fix was stable and deployed the patch to production. As expected, the measured success rate improved immediately.

So, what are the common lessons to be learnt here?

  1. Your service doesn’t end at the point of handoff.

    Even though your core service at the backend and the interface to the delivery layer might be working perfectly, there is still a way to go before your end user interacts with it. This takes the form of carrier networks – fixed or mobile – that bridge the gap between your service and the customer endpoint. There is also the actual endpoint or device to contend with, but that falls in the realm of UX.

  2. Don’t rely on your application’s analytics alone.

    Remember that when you look at logs and events from the application stack, you are only looking at transactions that made it to your platform in the first place. This is not good enough. It needs to be correlated with the outside-in perspective that DEM provides, in order to account for the often staggering number of requests that never make it out the gate.

  3. There’s more to a successful transaction than meets the eye.

    A surprising finding when one begins to measure the performance of transactions from the end-user perspective is often that performance problems are masked by user “think time” or wait time, to borrow some nomenclature from the performance testing arena. Because Apogee’s DEM synthetics execute transactions as fast as the system will allow, they often pick up small variations in service performance which would ordinarily be masked by the time that users take to make selections, input menu options, type passwords, etc. These small variations often indicate trends which are predictive of service problems down the road. Early warning is great.

  4. Widespread, severe service impact is needed before customer care is alerted to service problems.

    You’d be surprised how high the tolerable failure rate of a service is before users complain. In fact, Facebook ran a study a few years ago in which they intentionally crashed the Facebook app on Android with increasing frequency, in order to determine the threshold at which users would give up and not come back. They kept upping the rate at which the app crashed, but eventually abandoned the study because they could not manage to reach that threshold. Users put up with a lot – but that doesn’t mean they’re not irritated by service issues. Even if your service desk is not reporting any complaints, it doesn’t mean that all is well out in the real world. You have to monitor services from the end-user perspective yourself in order to get a true reflection of your service quality.
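The point in lesson 3 can be made concrete with a small sketch. Because synthetic transactions run back-to-back with no user think time, even a modest drift in response time stands out in the raw measurements. This is a hypothetical illustration, not Apogee’s actual trend analysis – the function name, window size and timings are all assumptions:

```python
import statistics

def latency_trend(samples_ms: list[float], window: int = 5) -> float:
    """Compare the mean of the most recent `window` latencies against the
    baseline mean of everything before it. A positive value means the
    service is slowing down, even if no user would notice yet."""
    if len(samples_ms) <= window:
        return 0.0
    baseline = statistics.mean(samples_ms[:-window])
    recent = statistics.mean(samples_ms[-window:])
    return recent - baseline

# Hypothetical back-to-back synthetic timings (ms): a drift from ~220 ms
# to ~260 ms that would vanish inside multi-second user think time.
timings = [218, 224, 221, 219, 223, 252, 258, 261, 255, 263]
drift = latency_trend(timings)
print(f"latency drift: +{drift:.0f} ms")  # -> latency drift: +37 ms
```

A 37 ms drift is invisible to a user who spends five seconds reading each menu, but to a synthetic test executing as fast as the channel allows, it is a clear early-warning signal worth investigating before it becomes a user-visible problem.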