[lookups] Service issue: get_country and hashed_imsi
Incident Report for Ziron
Postmortem

(Note: At the time of initial publication we had no further details from our operator partner as to the nature of the issues they experienced during this incident. We have now received further information and have updated this postmortem accordingly.)

Incident Timeline

All times are in UTC/GMT unless otherwise stated.

At approximately 13:21, Ziron monitoring detected an increasing error rate on lookup requests made to an operator that provides lookup services to Ziron. The on-call engineer was paged.

A loss of connectivity to the operator’s IP network was also detected. Ziron’s network team was engaged to investigate further, and an urgent ticket was raised with the operator’s service management centre. Initial investigation showed that loss of VPN connectivity was the cause of the issue, and no fault could be found on the Ziron side.

At 13:41, a major incident was declared and the senior management team were paged. Status notifications were posted to all customers via the Ziron status page.

By 13:49, the operator had acknowledged they were experiencing a major network outage. An update was received at 14:25 advising that the problem was affecting all clients connecting to their network via VPN. A further update at 14:53 suggested that they had narrowed their focus to a firewall.

Further updates followed from the operator at 15:38 and 16:06 advising that engineers were still investigating.

Ziron monitoring showed service was restored at 16:07. At 16:25 we advised customers that service had been restored and we would continue to monitor. Notification from the operator of service restoration followed at 16:44, and we closed this incident at 16:47.

In a root cause analysis provided on 20th November 2018, our operator partner advised that the outage was initially caused by a software crash on a primary VPN firewall device. Whilst the failover to the secondary VPN firewall device at a second site went to plan, a missing VLAN configuration on a layer-2 switch at the second site resulted in an extended outage. The operator has advised that they plan to complete the introduction of a second set of VPN firewall devices by the end of this month; this project was already underway before the outage.
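
The kind of gap described above (a VLAN present at one site but missing at the other) can in principle be caught by comparing the configured VLAN sets of the two sites before a failover is needed. The following is a minimal sketch only; the function name, the VLAN IDs and the idea that the VLAN lists are available as plain sets are all assumptions for illustration, and do not describe the operator's actual tooling or configuration.

```python
from __future__ import annotations


def missing_vlans(primary: set[int], secondary: set[int]) -> set[int]:
    """Return VLAN IDs configured at the primary site but absent at the secondary."""
    return primary - secondary


# Hypothetical VLAN IDs, for illustration only.
primary_site_vlans = {110, 120, 130}
secondary_site_vlans = {110, 130}

gaps = missing_vlans(primary_site_vlans, secondary_site_vlans)
if gaps:
    # A non-empty result means a failover to the secondary site
    # could leave traffic on these VLANs unreachable.
    print(f"Secondary site is missing VLANs: {sorted(gaps)}")
```

Run periodically, a check of this kind would flag a configuration mismatch between sites ahead of an actual failover.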

Posted Nov 15, 2018 - 16:57 UTC

Resolved
This incident has now been resolved. A postmortem will follow as soon as we have received one from our operator partner.
Posted Nov 13, 2018 - 16:47 UTC
Update
We are continuing to monitor for any further issues.
Posted Nov 13, 2018 - 16:26 UTC
Monitoring
Our monitoring shows the issue cleared at 16:07 UTC, and services have been stable since then. However, we have not yet received an all clear from the operator partner. We will continue to monitor and will post again when we have a further update from the operator partner.
Posted Nov 13, 2018 - 16:25 UTC
Update
Operator partner engineers continue to work on this issue. Further update by 16:15 UTC.
Posted Nov 13, 2018 - 15:45 UTC
Update
Operator partner has advised that engineers are on site and that there is a firewall issue. Next update due from operator partner by 15:30 UTC.
Posted Nov 13, 2018 - 14:56 UTC
Update
Our operator partner is still working to resolve this issue, and engineers are currently focussing on an issue within their network around VPN termination. They have advised that they will be providing another update within the next 30 minutes.
Posted Nov 13, 2018 - 14:39 UTC
Identified
Our operator partner has advised that they are experiencing an outage and are working to resolve it. More details to follow.
Posted Nov 13, 2018 - 13:53 UTC
Investigating
We are investigating an issue with an operator partner that is currently affecting the following lookup services:

get_country
hashed_imsi

An update will follow shortly.
Posted Nov 13, 2018 - 13:41 UTC
This incident affected: Number Lookup.