I recently came across an interesting white paper from the Netrounds team titled, “Reimagining Service Assurance in the Digital Service Provider Era.” You can find a copy here. It’s well worth a read, so much so that I’ll unpack a few of the concepts it contains in a series of articles.
It rightly points out that, “Alarms and fault management are what most people think of when hearing the term service assurance. Classical service assurance systems do fall into this category, as they collect indicators from network devices (such as traps, syslog messages and telemetry data) and try to pinpoint faulty devices and interfaces that need fixing.”
This takes us into the rabbit-hole of what exactly is a service (a rabbit-hole that this article partly covers). But let’s put that aside for a moment and consider a service as being an end-to-end “thing” that a customer uses (and pays for, and therefore assumes will behave as “they” expect).
To borrow again from Netrounds, “… we must be able to measure and report on service KPIs in order to accurately measure network service quality from the end user, or customer, perspective. The KPIs should correspond to the service that the customer is paying for. For example, internet access services should measure network KPIs like loss, latency, jitter, and DNS and HTTP response times; a storage backup service should measure data throughput rate; IPTV should measure video frame loss, video buffer underrun events and channel zapping time; and VoIP should measure Mean Opinion Score (MOS).”
There’s just one problem with traditional assurance measuring techniques (eg traps, syslog messages). They are only an indirect proxy for the customer’s experience (and expectations) with the service they’re paying for. Traditional techniques just report on the links in the chain rather than the integrity of the entire length of chain. We have to look at each broken link and attempt to determine whether the chain’s integrity is actually impaired (considering the “meshing” that protects modern service chains). And if there is impairment, to then determine whose chain is impacted, in what way, and what priority needs to be given to its repair.
If we’re being completely honest, the customer doesn’t care about the chain links, or even their MOS score, only that they couldn’t understand what the person at the other end of the VoIP line was trying to communicate with them.
Exacerbating this further, with increasing dependency on cloud and virtualised resources means that there are more chain links that fall outside our domain of visibility.
So, this thing that we’ve called service assurance for the last few decades might actually be a misnomer. We’ve definitely been monitoring the health of network devices and infrastructure (the links), but we tend to only be able to manage services (the chain) through reverse-engineering – by inference, brute force and wizardry.
Is there another way? I dig further in other posts here.