Detecting fibre cuts before the first customer calls
AI auto-crawling and root-cause analysis gave the NOC a 30-minute head start on outages — turning reactive firefighting into proactive assurance.
- −47% MTTR
- −38% inbound complaints
- −61% SLA breaches
A metropolitan broadband provider with 140,000 FTTH subscribers across one of the Gulf's densest urban regions had a problem that got worse as the network grew: it learned about outages from angry customers. By the time an engineer started investigating, the impact was already being aired on social media.
Background
The operator runs a mixed Huawei and ZTE OLT estate, with a strong reliance on aerial fibre in older districts and underground fibre in newer ones. Weather events — particularly the rare but intense sandstorms — produced clusters of fault events that the existing NMS surfaced as thousands of independent alarms. The NOC was overwhelmed not by faults but by the noise.
The pre-existing assurance flow was: customer calls, ticket opens, NOC engineer investigates across five tools, root cause identified after 12–25 minutes, action taken, ticket closed. Mean time to repair sat at just over 38 minutes. SLA breaches on enterprise circuits ran 8–12 per month. The board wanted improvements that did not require doubling the headcount.
What we set out to do
- Reduce visible alarm volume to actionable signal — without losing real faults.
- Backtrace any new fault to its real cause within seconds.
- Detect outages before subscribers call — i.e., proactively, from telemetry alone.
- Auto-remediate the categories of fault that are safe to auto-fix (CPE high temperature, RADIUS auth flap, common firmware-induced flaps).
Approach
Six-month rollout, gated by health checks at each phase
Phase 1
- Onboard all OLTs to NetXol NMM (SNMP, SSH, vendor APIs).
- Stream all alarms into NetXol's event store with topology context attached at ingest.
- Establish baseline volumes: alarms / week, complaints / week, MTTR distribution.
Phase 2
- Enable topology-aware grouping: sibling ONUs flapping together collapse to a single OLT-port event.
- Apply time-window correlation to suppress aftershock alarms within 90 seconds of a parent event.
- Visible alarm volume drops 86% in the first measurement window; no real faults missed in the same period.
Phase 3
- Bring RCA online in shadow mode for 3 weeks. The NOC sees AI conclusions side by side with its own.
- Calibration sample: 200 outputs at 90%+ confidence, of which 184 were correct (92% precision; calibration acceptable).
- Promote RCA to active mode: the AI conclusion attaches to every ticket as it opens.
Phase 4
- Build a pattern library of early-warning signatures (Rx-power drift, repeating ONU hardware re-init).
- AI Auto-Crawler runs continuously over the live network and opens its own tickets ahead of customer calls.
- First measured month: 42% of outage tickets opened before any customer call.
Phase 5
- Auto-Fix policy library: CPE high temp → QoS limit + notify; ONU stuck in stage 4 → controlled reboot.
- Blast-radius limits and verification gates on every action.
- 57% of fault tickets resolved without human action in the first month of Auto-Fix.
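The grouping and time-window steps in the rollout can be pictured with a minimal Python sketch. This is not NetXol's implementation: the `Alarm` and `Event` types, the `correlate` function and the `min_siblings` threshold are hypothetical names and values chosen for illustration. Only the 90-second window and the collapse of sibling ONUs into a single OLT-port event come from the rollout description. The sketch also assumes that the first alarm in a window plays the role of the parent event.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Alarm:
    ts: float              # seconds since some epoch
    olt: str
    port: str              # PON port on the OLT
    onu: Optional[str]     # None for port-level alarms
    kind: str


@dataclass
class Event:
    ts: float
    olt: str
    port: str
    scope: str             # "olt-port" (grouped) or "onu" (passed through)
    onus: List[str]
    suppressed: int = 0    # alarms folded into this event


def correlate(alarms, window_s=90.0, min_siblings=3):
    """Group sibling ONU alarms per OLT port and fold in aftershocks.

    Assumption: `min_siblings` distinct ONUs alarming on one port inside
    `window_s` seconds indicates a shared upstream fault.
    """
    by_port = defaultdict(list)
    for alarm in alarms:
        by_port[(alarm.olt, alarm.port)].append(alarm)

    events = []
    for (olt, port), port_alarms in by_port.items():
        port_alarms.sort(key=lambda a: a.ts)
        i = 0
        while i < len(port_alarms):
            start = port_alarms[i].ts
            j = i
            # Collect every alarm within window_s of the window's first alarm.
            while j < len(port_alarms) and port_alarms[j].ts - start <= window_s:
                j += 1
            window = port_alarms[i:j]
            onus = sorted({a.onu for a in window if a.onu is not None})
            if len(onus) >= min_siblings:
                # One visible event; the remaining alarms count as suppressed.
                events.append(Event(start, olt, port, "olt-port", onus,
                                    suppressed=len(window) - 1))
            else:
                # Too few siblings to infer a port fault: pass alarms through.
                for a in window:
                    events.append(Event(a.ts, olt, port, "onu",
                                        [a.onu] if a.onu else []))
            i = j
    events.sort(key=lambda e: e.ts)
    return events
```

In this sketch, five alarms from four ONUs on one port within a minute would surface as one `olt-port` event with four suppressed alarms, while an isolated ONU alarm on another port passes through unchanged. A production system would also need topology above the port level (splitters, feeder cables), which is left out here.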
The unexpected win — alarm-storm composure
The most useful effect of the deployment was not anticipated up front. Sandstorms had historically produced "alarm storms" of 4,000+ events in 30 minutes, swamping the NOC entirely. With topology-aware suppression and proactive AI handling, a comparable weather event in month 5 produced 71 events for the NOC to action; the rest were either suppressed (related to known parents) or auto-fixed (reboots, profile re-pushes). The NOC reported a "manageable bad day" instead of a "lost shift."
A new SLA category became possible
After Phase 4, the operator launched a "Proactive Care" tier for SME customers, backed by a contractual commitment to a 30-minute proactive notification window. It can sell that tier only because the platform now consistently delivers on that commitment.
Outcomes
| Metric | Before | After | Δ |
|---|---|---|---|
| Visible alarms / day | 2,400 (avg) | 230 | −90% |
| Mean time to repair | 38 min | 20 min | −47% |
| Inbound fault calls / month | 11,400 | 7,050 | −38% |
| SLA breaches / month | 10 (avg) | 4 | −61% |
| Proactive detection of outages | 0% | 42% | new capability |
Tech stack used
- NetXol NMM — Huawei + ZTE adapters, SNMP, syslog, vendor APIs.
- NetXol AI RCA Engine — Bayesian + topology graph + history backtrace.
- NetXol AI Auto-Crawler — continuous anomaly detection.
- NetXol Auto-Fix policy engine with blast-radius limits.
- Event store with topology-aware suppression and grouping.
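As an illustration of what a blast-radius limit on the Auto-Fix policy engine could look like, here is a minimal Python sketch of a sliding-window cap on actions per scope. The `BlastRadiusLimiter` class, the choice of the OLT as the scope, and the example limit of two actions per ten minutes are all assumptions made for this sketch; the case study does not state the actual limits or how the engine enforces them.

```python
from collections import defaultdict, deque


class BlastRadiusLimiter:
    """Hard cap on automated actions per scope within a sliding time window.

    Hypothetical helper: a scope might be an OLT, a PON port or a district.
    """

    def __init__(self, max_actions, window_s):
        self.max_actions = max_actions
        self.window_s = window_s
        self._history = defaultdict(deque)  # scope -> timestamps of allowed actions

    def allow(self, scope, now):
        """Return True and record the action if the scope is under its cap."""
        recent = self._history[scope]
        # Forget actions that have aged out of the window.
        while recent and now - recent[0] >= self.window_s:
            recent.popleft()
        if len(recent) >= self.max_actions:
            return False  # over the cap: leave the action to a human
        recent.append(now)
        return True


# Example: at most 2 automated reboots per OLT in any 10-minute window.
limiter = BlastRadiusLimiter(max_actions=2, window_s=600)
for t in (0, 10, 20):
    decision = "reboot" if limiter.allow("olt1", now=t) else "escalate"
    print(t, decision)
```

The point of the design is that a refusal is cheap: when the cap is reached the action falls back to an engineer, so a misfiring policy cannot reboot a whole district during an alarm storm.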
Lessons learned
- Run RCA in shadow mode for at least 3 weeks before promoting to active. The NOC needs to see it think.
- Calibration is non-negotiable — measure precision at every confidence band and only act on bands above your threshold.
- Blast-radius limits are the difference between a confident system and a dangerous one. Hard-cap actions per unit time.
- Communicate aggressively to customer care — they need to know the platform is now opening tickets ahead of them.
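The calibration lesson can be made concrete with a short Python sketch that measures precision per confidence band and lists the bands that clear a threshold. The function names, the band edges and the 0.9 precision threshold are assumptions for illustration; only the shadow-mode sample (184 correct out of 200 outputs at 90%+ confidence) comes from the rollout.

```python
from bisect import bisect_right


def precision_by_band(outcomes, edges=(0.0, 0.5, 0.7, 0.9)):
    """Precision of RCA conclusions per confidence band.

    outcomes: iterable of (confidence, was_correct) pairs, confidence in [0, 1].
    edges: ascending lower bounds of the bands (hypothetical choice).
    Returns {lower_bound: precision, or None if the band is empty}.
    """
    counts = {lo: [0, 0] for lo in edges}  # lower bound -> [correct, total]
    for confidence, was_correct in outcomes:
        lo = edges[bisect_right(edges, confidence) - 1]
        counts[lo][1] += 1
        if was_correct:
            counts[lo][0] += 1
    return {lo: (c / t if t else None) for lo, (c, t) in counts.items()}


def actionable_bands(precisions, min_precision=0.9):
    """Bands whose measured precision meets the (assumed) threshold."""
    return [lo for lo, p in precisions.items()
            if p is not None and p >= min_precision]


# Shadow-mode sample from the rollout: 184 of 200 correct at 90%+ confidence.
sample = [(0.95, True)] * 184 + [(0.95, False)] * 16
precisions = precision_by_band(sample)
print(precisions[0.9])               # 184 / 200
print(actionable_bands(precisions))
```

For that sample the 90%+ band comes out at a precision of 0.92, so it clears the assumed 0.9 threshold. Bands with no samples return `None` and are never treated as actionable, which keeps the system from acting where it has not been measured.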
