Posted on January 29, 2023 at 6:58 AM
Microsoft blames a five-hour-long Microsoft 365 outage on a router IP address change
Microsoft recently suffered a five-hour-long outage of its Microsoft 365 services. The outage was felt around the world and caused considerable concern, but the company has now come out with an explanation. According to the firm, the outage was caused by a change to a router's IP address, which resulted in packet forwarding issues between the other routers connected to its WAN (Wide Area Network).
At the time of the outage, Redmond’s official announcement was that the incident was caused by configuration issues involving the WAN and DNS, stemming from a WAN update. Meanwhile, users in all the regions serviced by the impacted infrastructure ran into the same problems when trying to access Microsoft 365 services.
The incident affected a large number of services
The issue impacted Microsoft 365 services in waves, peaking roughly every 30 minutes, according to the Microsoft Azure service status page, which was itself affected by the issue. The list of impacted services is lengthy: it includes Microsoft Teams, Outlook, Exchange Online, SharePoint Online, Power BI, OneDrive for Business, Microsoft Graph, Microsoft 365 Admin Center, Microsoft Defender for Cloud Apps, Microsoft Intune, and Microsoft Defender for Identity.
Redmond needed more than five hours of constant work, focused solely on this single issue, to fix it. The outage started at around 7:05 AM UTC and lasted until 12:43 PM UTC, which is when the Microsoft team finally managed to restore the service fully.
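As a quick sanity check on the "more than five hours" figure, the duration can be worked out from the two timestamps. This is a minimal sketch: the function name `outage_duration` is made up for illustration, and the date (January 25, 2023) is taken from the post-incident report mentioned in this article.

```python
from datetime import datetime, timezone


def outage_duration(start, end):
    """Return the gap between two datetimes as (hours, minutes)."""
    total_seconds = int((end - start).total_seconds())
    hours, remainder = divmod(total_seconds, 3600)
    return hours, remainder // 60


# Approximate start and end times reported for the outage, in UTC.
start = datetime(2023, 1, 25, 7, 5, tzinfo=timezone.utc)
end = datetime(2023, 1, 25, 12, 43, tzinfo=timezone.utc)

hours, minutes = outage_duration(start, end)
print(f"{hours}h {minutes}m")  # 5h 38m
```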
The company then published a preliminary post-incident report, explaining what happened on January 25th. It pointed out that most regions and services managed to recover by 9:00 AM UTC, but the intermittent packet loss issues were not fully mitigated until 12:43 UTC. The incident even impacted Azure Government cloud services, which depended on Azure public cloud, according to the announcement.
Since then, Microsoft reported that it discovered more information about the incident, including the fact that it was triggered by a change in the IP address of a WAN router. This happened due to a command that was not vetted thoroughly enough. According to the company, the command behaves differently on different network devices, which is what caused the issue.
The router received the command as part of a planned change to update its IP address, but it then sent messages to all other routers in the WAN. As a result, some of them started recomputing their adjacency and forwarding tables. While this recomputation was under way, the routers were unable to forward packets correctly, which led to the disruption in all the other services.
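To make the recomputation step more concrete, here is a toy model of what happens when one router announces a change and the others rebuild their forwarding tables. It is only a sketch under loud assumptions: the four-router topology, the router names, and the functions `forwarding_table` and `flood_update` are all hypothetical, the shortest paths are found with a plain breadth-first search, and every router recomputes (the article says only some did). It does not describe Microsoft's actual routing protocol, which the article does not name.

```python
from collections import deque

# Hypothetical four-router WAN; names and links are illustrative only.
TOPOLOGY = {
    "r1": ["r2", "r3"],
    "r2": ["r1", "r4"],
    "r3": ["r1", "r4"],
    "r4": ["r2", "r3"],
}


def forwarding_table(topology, source):
    """Map each destination to the next hop from `source`, using BFS."""
    table = {}
    visited = {source}
    # Each queue entry is (node, first hop used to reach that node).
    queue = deque((neighbor, neighbor) for neighbor in topology[source])
    while queue:
        node, first_hop = queue.popleft()
        if node in visited:
            continue
        visited.add(node)
        table[node] = first_hop
        for neighbor in topology[node]:
            if neighbor not in visited:
                queue.append((neighbor, first_hop))
    return table


def flood_update(topology, origin):
    """Return fresh tables for every router that hears `origin`'s update."""
    return {
        router: forwarding_table(topology, router)
        for router in topology
        if router != origin
    }


tables = flood_update(TOPOLOGY, "r1")
print(sorted(tables))      # ['r2', 'r3', 'r4']
print(tables["r4"]["r1"])  # r2
```

In this toy version the recomputation is instantaneous. On a real WAN it takes time, and packets that arrive while tables are being rebuilt can be dropped or misrouted, which is the failure mode described above.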
The network health system was paused when the incident happened
The network began recovering on its own at around 8:10 UTC. However, the automated systems put in place to maintain the health of the WAN were paused while the network was impacted, which prolonged the recovery.
These systems include features for identifying and removing unhealthy devices, as well as a traffic engineering system that optimizes data flow across the network. Because they were paused, certain network paths experienced greater packet loss until the systems were manually restarted in full.
In the end, the WAN returned to its optimal operating conditions at 12:43 PM UTC, which is when the recovery process ended. With the incident behind it, Microsoft said that it had learned its lesson and would start blocking highly impactful commands from being executed. In addition, it will require all command execution to follow guidelines for safe configuration changes.
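A guard of the kind Microsoft describes could look roughly like the sketch below. Everything in it is an assumption made for illustration: the article does not say which commands will be blocked or how, so the command prefixes, the device type labels, and the function `is_allowed` are hypothetical. The second check reflects the article's point that the command had not been vetted on every device type it could run on.

```python
# Hypothetical prefixes; the article does not list the blocked commands.
HIGH_IMPACT_PREFIXES = ("set router ip", "clear routing")


def is_allowed(command, vetted_device_types, target_device_type):
    """Allow a command only if it is low impact and vetted for the target."""
    # Normalize case and whitespace before matching.
    normalized = " ".join(command.lower().split())
    if normalized.startswith(HIGH_IMPACT_PREFIXES):
        return False
    # A command vetted on one device type may behave differently on another.
    return target_device_type in vetted_device_types


print(is_allowed("show version", {"vendor-a"}, "vendor-a"))             # True
print(is_allowed("show version", {"vendor-a"}, "vendor-b"))             # False
print(is_allowed("Set  Router IP 10.0.0.1", {"vendor-a"}, "vendor-a"))  # False
```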
Despite the outage affecting so many services and lasting for such a long time, users around the world were still relieved to hear that it was all just a result of a system error rather than a security incident.