FTTH ISP · 140k subscribers · Middle East

Detecting fibre cuts before the first customer calls

AI auto-crawling and root-cause analysis gave the NOC a 30-minute head start on outages — turning reactive firefighting into proactive assurance.

AI RCA · Auto-Fix · Assurance · Topology

−47% MTTR

−38% inbound complaints

−61% SLA breaches

A metropolitan broadband provider with 140,000 FTTH subscribers across one of the Gulf's densest urban regions had a problem that was getting worse, not better, as the network grew. They learned about outages from angry customers. By the time an engineer started looking, the customer impact had already been broadcast on social media.

Background

The operator runs a mixed Huawei and ZTE OLT estate, with a strong reliance on aerial fibre in older districts and underground fibre in newer ones. Weather events — particularly the rare but intense sandstorms — produced clusters of fault events that the existing NMS surfaced as thousands of independent alarms. The NOC was overwhelmed not by faults but by the noise.

The pre-existing assurance flow was: customer calls, ticket opens, NOC engineer investigates across five tools, root cause identified after 12–25 minutes, action taken, ticket closed. Mean time to repair sat at just over 38 minutes. SLA breaches on enterprise circuits ran 8–12 per month. The board wanted improvements that did not require doubling the headcount.

What we set out to do

Reduce visible alarm volume to actionable signal — without losing real faults.
Backtrace any new fault to its real cause within seconds.
Detect outages before subscribers call — i.e., proactively, from telemetry alone.
Auto-remediate the categories of fault that are safe to auto-fix (CPE high temperature, RADIUS auth flap, common firmware-induced flaps).

Approach

Six-month rollout, gated by health checks at each phase

Phase 1: Telemetry consolidation (Weeks 1–4)

Onboard all OLTs to NetXol NMM (SNMP, SSH, vendor APIs).
Stream all alarms into NetXol's event store with topology context attached at ingest.
Establish baseline volumes: alarms / week, complaints / week, MTTR distribution.

Phase 2: Alarm suppression (Weeks 4–8)

Enable topology-aware grouping — sibling ONUs flapping together collapse to a single OLT-port event.
Time-window correlation suppresses aftershock alarms within 90 seconds of a parent event.
Visible alarm volume drops 86% in the first measurement window; no real faults are missed in the same period.
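The two suppression rules above can be sketched roughly as follows. This is a minimal illustration, not NetXol's implementation: the `Alarm` record, the `correlate` function and the `min_siblings` threshold are hypothetical names and values introduced here, and only the 90-second window comes from the rollout description.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Alarm:
    ts: float   # event time in seconds
    olt: str    # OLT identifier
    port: str   # PON port on that OLT
    onu: str    # ONU that raised the alarm


def correlate(alarms, window_s=90.0, min_siblings=3):
    """Collapse sibling-ONU alarms on one OLT port into a single event.

    Alarms on the same port that arrive within `window_s` seconds of the
    first (parent) alarm are batched. If the batch covers at least
    `min_siblings` distinct ONUs (an assumed threshold), it is reported as
    one OLT-port event; otherwise each alarm passes through unchanged.
    """
    by_port = defaultdict(list)
    for alarm in sorted(alarms, key=lambda a: a.ts):
        by_port[(alarm.olt, alarm.port)].append(alarm)

    events = []
    for (olt, port), items in by_port.items():
        i = 0
        while i < len(items):
            start = items[i].ts
            j = i
            # Extend the batch while alarms fall inside the parent's window.
            while j < len(items) and items[j].ts - start <= window_s:
                j += 1
            batch = items[i:j]
            onus = sorted({a.onu for a in batch})
            if len(onus) >= min_siblings:
                events.append({
                    "level": "olt-port", "olt": olt, "port": port,
                    "ts": start, "onus": onus,
                    "suppressed": len(batch) - 1,  # alarms folded into the parent
                })
            else:
                events.extend(
                    {"level": "onu", "olt": olt, "port": port,
                     "ts": a.ts, "onus": [a.onu], "suppressed": 0}
                    for a in batch
                )
            i = j
    return sorted(events, key=lambda e: e["ts"])
```

In this sketch the window is anchored at the first alarm of each batch, so a late repeat from an already-affected ONU is folded into the parent event instead of reaching the NOC as a new alarm.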

Phase 3: AI RCA engine (Weeks 8–14)

Bring RCA online in shadow mode for 3 weeks. NOC sees AI conclusions side-by-side with their own.
Calibration sample: 200 outputs at 90%+ confidence, of which 184 were correct (92% precision, so calibration was judged acceptable).
Promote RCA to active mode: AI conclusion attaches to every ticket as it opens.
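The calibration check can be illustrated with a short sketch that measures precision per confidence band and lists the bands that clear a threshold. The band edges, the 0.90 threshold and the function names are assumptions made for illustration; the case study only states the result for the 90%+ band.

```python
from bisect import bisect_right

# Hypothetical band layout: lower edges and matching labels.
BAND_EDGES = (0.5, 0.7, 0.9)
BAND_LABELS = ("<0.50", "0.50-0.69", "0.70-0.89", ">=0.90")


def precision_by_band(samples):
    """samples: iterable of (confidence in [0, 1], was_correct) pairs.

    Returns {band label: precision}, or None for a band with no samples.
    """
    counts = {label: [0, 0] for label in BAND_LABELS}  # [hits, total]
    for confidence, correct in samples:
        label = BAND_LABELS[bisect_right(BAND_EDGES, confidence)]
        counts[label][0] += int(correct)
        counts[label][1] += 1
    return {
        label: (hits / total if total else None)
        for label, (hits, total) in counts.items()
    }


def actionable_bands(precisions, threshold=0.9):
    """Bands whose measured precision meets the (assumed) threshold."""
    return [
        label for label, p in precisions.items()
        if p is not None and p >= threshold
    ]
```

Fed the shadow-mode sample described above (184 correct and 16 incorrect conclusions, all in the top band), `precision_by_band` reports 0.92 for `">=0.90"`, and that is the only band `actionable_bands` returns.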

Phase 4: Proactive detection (Weeks 14–20)

Pattern library for early-warning signatures (Rx-power drift, repeating ONU hardware re-init).
AI Auto-Crawler runs continuously over the live network — opens its own tickets ahead of customer calls.
First measured month: 42% of outage tickets opened before any customer call.
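An Rx-power drift signature of the kind mentioned above could, in its simplest form, compare a recent window of readings against an earlier baseline. The sketch below is one such heuristic and not the Auto-Crawler's actual logic: the window sizes, the 1.5 dB threshold and the function names are all assumed values chosen for illustration.

```python
from statistics import fmean


def rx_power_drop_db(samples_dbm, baseline_n=24, recent_n=6):
    """Drop in mean Rx power (dB) between a baseline and a recent window.

    samples_dbm: readings in dBm, oldest first. The last `recent_n`
    readings are compared with the `baseline_n` readings before them.
    """
    needed = baseline_n + recent_n
    if len(samples_dbm) < needed:
        raise ValueError(f"need at least {needed} samples")
    baseline = fmean(samples_dbm[-needed:-recent_n])
    recent = fmean(samples_dbm[-recent_n:])
    return baseline - recent  # positive means the signal is getting weaker


def is_drifting(samples_dbm, max_drop_db=1.5, **windows):
    """True when the drop meets the (assumed) early-warning threshold."""
    return rx_power_drop_db(samples_dbm, **windows) >= max_drop_db
```

A check like this would run per ONU on polled optical readings. A positive result is an early-warning ticket candidate, not a confirmed fault.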

Phase 5: Bounded auto-remediation (Weeks 20–24)

Auto-Fix policy library: CPE high temp → QoS limit + notify; ONU stuck in stage 4 → controlled reboot.
Blast radius limits and verification gates on every action.
57% of fault tickets resolved without human action in the first month of Auto-Fix.
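A blast-radius limit with a verification gate can be sketched as a sliding-window cap on actions, followed by a post-action check. The class and function names, and the escalation strings, are hypothetical; the real Auto-Fix policy engine is not described at this level of detail.

```python
import time
from collections import deque


class BlastRadiusLimiter:
    """Hard-cap the number of automated actions in a sliding time window."""

    def __init__(self, max_actions, per_seconds, clock=time.monotonic):
        self.max_actions = max_actions
        self.per_seconds = per_seconds
        self._clock = clock      # injectable so the limiter can be tested
        self._stamps = deque()   # timestamps of recently allowed actions

    def allow(self):
        now = self._clock()
        # Forget actions that have aged out of the window.
        while self._stamps and now - self._stamps[0] >= self.per_seconds:
            self._stamps.popleft()
        if len(self._stamps) >= self.max_actions:
            return False
        self._stamps.append(now)
        return True


def run_auto_fix(limiter, action, verify):
    """Run `action` only within the cap, then gate the result on `verify`."""
    if not limiter.allow():
        return "escalated: blast-radius limit reached"
    action()
    return "fixed" if verify() else "escalated: verification failed"
```

For example, `BlastRadiusLimiter(max_actions=20, per_seconds=600)` would refuse the 21st controlled reboot inside ten minutes and hand it to a human. Those numbers are illustrative only.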

The unexpected win — alarm-storm composure

The most useful effect of the deployment was not anticipated up front. Sandstorms had historically produced "alarm storms" of 4,000+ events in 30 minutes, swamping the NOC entirely. With topology-aware suppression and proactive AI handling, the same kind of weather event in month 5 produced 71 events for the NOC to action; the rest were either suppressed (related to known parents) or auto-fixed (reboots, profile re-pushes). The NOC reported a "manageable bad day" instead of a "lost shift."

A new SLA category was possible

After Phase 4, the operator launched a "Proactive Care" tier for SME customers — backed by a contractual commitment to a 30-minute proactive notification window. They could only sell that tier because the platform now consistently delivers on it.

Outcomes

Metric	Before	After	Δ
Visible alarms / day	2,400 (avg)	230	−90%
Mean time to repair	38 min	20 min	−47%
Inbound fault calls / month	11,400	7,050	−38%
SLA breaches / month	10 (avg)	4	−61%
Proactive detection of outages	0%	42%	new capability

Tech stack used

NetXol NMM — Huawei + ZTE adapters, SNMP, syslog, vendor APIs.
NetXol AI RCA Engine — Bayesian + topology graph + history backtrace.
NetXol AI Auto-Crawler — continuous anomaly detection.
NetXol Auto-Fix policy engine with blast-radius limits.
Event store with topology-aware suppression and grouping.

Lessons learned

Run RCA in shadow mode for at least 3 weeks before promoting to active. The NOC needs to see it think.
Calibration is non-negotiable — measure precision at every confidence band and only act on bands above your threshold.
Blast-radius limits are the difference between a confident system and a dangerous one. Hard-cap actions per unit time.
Communicate aggressively to customer care — they need to know the platform is now opening tickets ahead of them.
