80% faster root cause analysis. 65% fewer tickets. 50% faster mean time to recover. 40% increase in first time right. For most video or TV service providers, these results sound like a dream. In this blog, you will read how you can turn them into reality by taking three data-driven steps.
Step 1
Identify outages through 360-degree monitoring
A TV platform is a complex ecosystem of many interdependent building blocks. For example, to play a recorded program, several platform components need to work in harmony to serve customers the content they requested with the highest possible Quality of Experience (QoE). To measure QoE, you need to monitor the TV platform against multiple service Key Performance Indicators (KPIs). Each KPI fits within one of three categories: availability, capacity and reliability. There are different tools that you can use to monitor service availability and capacity. To monitor service reliability, you will need to ingest as much data (e.g. log files) as possible from all platform components. Then you will be able to monitor end-to-end service flows, for instance the playout of Live TV or Cloud PVR recordings.
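To make the idea of monitoring a service flow concrete, here is a minimal sketch in Python. It assumes you have already parsed component logs into simple events; the event layout, the flow names and the component names are all hypothetical and only serve as illustration. A session counts as successful only if every component involved in it reported success.

```python
from collections import defaultdict

# Hypothetical, hand-written events parsed from component logs:
# (session_id, service_flow, component, step_succeeded)
EVENTS = [
    ("s1", "cloud_pvr_playout", "auth", True),
    ("s1", "cloud_pvr_playout", "recording_store", True),
    ("s1", "cloud_pvr_playout", "cdn", True),
    ("s2", "cloud_pvr_playout", "auth", True),
    ("s2", "cloud_pvr_playout", "recording_store", False),
    ("s3", "live_tv_playout", "auth", True),
    ("s3", "live_tv_playout", "cdn", True),
]


def flow_reliability(events):
    """Return {flow: fraction of sessions in which every step succeeded}."""
    # A session stays "ok" only while all of its steps succeed.
    session_ok = {}
    for session_id, flow, _component, ok in events:
        key = (flow, session_id)
        session_ok[key] = session_ok.get(key, True) and ok

    totals = defaultdict(int)
    successes = defaultdict(int)
    for (flow, _session_id), ok in session_ok.items():
        totals[flow] += 1
        successes[flow] += int(ok)

    return {flow: successes[flow] / totals[flow] for flow in totals}


if __name__ == "__main__":
    for flow, rate in flow_reliability(EVENTS).items():
        print(f"{flow}: {rate:.0%} of sessions completed successfully")
```

In a real platform the events would come from your log pipeline rather than a hard-coded list, but the principle is the same: reliability is measured per flow across components, not per component in isolation.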
Step 2
Solve outages and do it fast through data analysis and insights
Once that data is flowing in, you are ready to discover correlations across all local and central systems. These correlations support the problem management process by letting you detect anomalies, identify problems and perform fast root cause analysis. The data you collect and the insights you gain give your technology suppliers the information they need to find the problem in their code and work on a software fix. You will also have the data to prove which ecosystem component is causing the failure, which reduces noise and unnecessary back-and-forth discussions.
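As an illustration of how collected data can point to the failing component, the sketch below flags components whose latest hourly error count deviates strongly from their own recent baseline. This is a simple z-score check, chosen here for brevity and not necessarily what a production analytics platform would use. The component names, the error counts and the threshold of 3 standard deviations are assumptions.

```python
from statistics import mean, stdev

# Hypothetical hourly error counts per component; the last value is the
# most recent hour, the earlier values form the baseline.
HOURLY_ERRORS = {
    "auth": [2, 3, 2, 4, 3, 2, 3],
    "recording_store": [1, 2, 1, 2, 1, 2, 19],
    "cdn": [5, 6, 5, 4, 6, 5, 6],
}


def anomalous_components(series_by_component, threshold=3.0):
    """Return {component: z_score} for components whose latest error count
    lies more than `threshold` standard deviations above their baseline."""
    flagged = {}
    for component, series in series_by_component.items():
        baseline, latest = series[:-1], series[-1]
        mu, sigma = mean(baseline), stdev(baseline)
        if sigma == 0:
            # A perfectly flat baseline gives no scale to compare against;
            # this sketch skips such components instead of dividing by zero.
            continue
        z_score = (latest - mu) / sigma
        if z_score > threshold:
            flagged[component] = z_score
    return flagged


if __name__ == "__main__":
    for component, z_score in anomalous_components(HOURLY_ERRORS).items():
        print(f"{component}: latest hour is {z_score:.1f} sigma above baseline")
```

With the sample data, only the hypothetical `recording_store` component is flagged. That is the kind of evidence you can hand to the responsible supplier instead of debating where the fault lies.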
Step 3
Optimize continuously and deliver first time right into the cloud
You need a process that allows you to manage steps 1 and 2 efficiently. Before deploying a vendor's software fix, you need to test its performance within the ecosystem to ensure it is first time right. What you don't want is to deploy a change that has a negative impact or causes a regression. Within this process, you will also want to experiment with machine learning algorithms that help you predict performance, so that you know what to expect and can plan accordingly. AI-enabled algorithms can learn from previous outages and look for patterns in the current data that might suggest the beginning of an incident, or correlate it across different parts of the ecosystem.
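A pre-deployment check of this kind can be as simple as comparing KPIs measured with and without the fix in a test environment. The sketch below is one possible shape for such a gate. The KPI names, the measured values and the 5% tolerance are hypothetical.

```python
# Hypothetical KPI measurements from a test environment, before and after
# applying a vendor's software fix.
BASELINE = {"playout_success_rate": 0.995, "startup_time_ms": 1200}
CANDIDATE = {"playout_success_rate": 0.997, "startup_time_ms": 1450}

# For each KPI, whether a higher value means better service.
HIGHER_IS_BETTER = {"playout_success_rate": True, "startup_time_ms": False}


def regressions(baseline, candidate, higher_is_better, tolerance=0.05):
    """Return the KPIs that got worse by more than `tolerance`
    (relative change) compared with the baseline."""
    failed = []
    for kpi, base_value in baseline.items():
        change = (candidate[kpi] - base_value) / base_value
        # Express the change so that a positive number always means "worse".
        worsened = -change if higher_is_better[kpi] else change
        if worsened > tolerance:
            failed.append(kpi)
    return failed


if __name__ == "__main__":
    failed = regressions(BASELINE, CANDIDATE, HIGHER_IS_BETTER)
    if failed:
        print("Do not deploy, regression in:", ", ".join(failed))
    else:
        print("No regression detected, fix can proceed to deployment")
```

In this example the fix slightly improves the playout success rate but slows startup time by roughly 20%, so the gate blocks the deployment. Predictive models can build on the same KPI history, but a plain comparison like this is often a sensible first step.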