pm about 241220 network outage
hitchhooker committed Dec 20, 2024
1 parent 7bcc32c commit fd3e828
Showing 4 changed files with 7,861 additions and 0 deletions.
2 changes: 2 additions & 0 deletions docs/src/SUMMARY.md
@@ -22,3 +22,5 @@
- [Carbon offset](./carbon.md)
- [Team](./team.md)
- [Resources](./resources.md)
- [Post Mortems](./post_mortems.md)
- [241219 - Network Outage](./network_outage_pm_241219.md)
69 changes: 69 additions & 0 deletions docs/src/network_outage_pm_241219.md
@@ -0,0 +1,69 @@
# Network Outage Postmortem (2024-12-19/20)

## Summary
A planned intervention to standardize router-id configurations across our edge
routing infrastructure resulted in an unexpected connectivity loss affecting our
AMSIX Amsterdam, BKNIX, and HGC Hong Kong IPTx peering sessions. The incident
lasted approximately 95 minutes (23:55 UTC to 01:30 UTC) and impacted our validator
performance on both Kusama and Polkadot networks. Specifically, this resulted in
missed votes during Kusama Session 44,359 at Era 7,496 and Polkadot Session
10,010 at Era 1,662, the latter with a 0.624 MVR (missed vote ratio).
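
For clarity, MVR is simply the share of expected para-validation votes that went
uncast, as it is commonly computed for para-validators. A minimal sketch of the
ratio, using hypothetical vote counts rather than the actual session data:

```python
def missed_vote_ratio(missed: int, explicit: int, implicit: int) -> float:
    """Missed vote ratio: missed votes over all expected votes
    (explicit + implicit + missed)."""
    total = explicit + implicit + missed
    return missed / total if total else 0.0

# Hypothetical counts chosen only to illustrate how a 0.624 ratio arises:
# 156 missed out of 250 expected votes.
print(missed_vote_ratio(missed=156, explicit=80, implicit=14))  # 0.624
```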

## Technical Details
The root cause was traced to an attempt to resolve a pre-existing routing anomaly
where our edge routers were operating with multiple router-ids across different
uplink connections and iBGP sessions. The heterogeneous router-id configuration
had been causing next-hop resolution failures and preventing transit traffic from
being routed correctly through our BGP infrastructure.

The original misconfiguration stemmed from an incorrect assumption that router-ids
needed to be globally unique at Internet exchange points. This is not the case:
router-ids only need to be unique within our Interior Gateway Protocol (IGP)
domain. This misunderstanding led to multiple router-ids being configured on
loopback interfaces, creating unnecessary complexity in our routing
infrastructure.
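
A check along these lines is what the automated configuration validation listed
under Future Work is meant to catch early. A rough sketch, assuming configs have
already been parsed into a simple mapping (the sample data and tooling are
hypothetical, not our actual setup):

```python
from collections import defaultdict

# router name -> router-ids found in its exported config
# (loopbacks, OSPF instance, BGP instances). Sample data is hypothetical.
router_ids = {
    "edge1": {"10.0.0.1"},
    "edge2": {"10.0.0.2", "192.0.2.7"},  # multiple ids on one box -> flag it
}

def validate(ids_by_router: dict[str, set[str]]) -> list[str]:
    problems = []
    owners = defaultdict(set)
    for router, ids in ids_by_router.items():
        if len(ids) != 1:
            problems.append(f"{router}: expected one router-id, found {sorted(ids)}")
        for rid in ids:
            owners[rid].add(router)
    # router-ids only need to be unique within the IGP domain, not globally
    for rid, routers in owners.items():
        if len(routers) > 1:
            problems.append(f"router-id {rid} reused by {sorted(routers)}")
    return problems

for issue in validate(router_ids):
    print(issue)
```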

During the remediation attempt to standardize OSPF router-ids to a uniform
value across the infrastructure, we encountered an unexpected failure mode
that propagated to our second edge router, resulting in a total loss of
connectivity despite router and uplink redundancy. The exact mechanism of
the secondary failure remains under investigation: the cascade effect that
caused our redundant edge router to lose connectivity suggests an underlying
architectural vulnerability in our BGP session management.

## Response Timeline
- 23:55 UTC: Initiated planned router-id standardization
- ~23:56 UTC: Primary connectivity loss detected
- ~23:57 UTC: Secondary edge router unexpectedly lost connectivity
- 01:30 UTC: Full service restored via configuration rollback

## Mitigation
Recovery was achieved through an onsite restoration of backed-up router
configurations. While this approach was successful, the 95-minute resolution
time indicates a need for more robust rollback procedures and potentially an
automated configuration management system.
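
As a first step toward that, even a small script that diffs the live exports
against the last known-good backups would speed up rollback decisions. A rough
sketch under assumed file paths and naming conventions (not our actual tooling):

```python
import difflib
from pathlib import Path

# Hypothetical layout: one RouterOS-style export per router, with the
# last known-good copy kept alongside the current export.
BACKUP_DIR = Path("/var/backups/routers")   # assumed path
CURRENT_DIR = Path("/var/exports/routers")  # assumed path

def config_drift(router: str) -> str:
    """Unified diff between the known-good backup and the current export
    for one router; an empty string means no drift."""
    good = (BACKUP_DIR / f"{router}.rsc").read_text().splitlines()
    now = (CURRENT_DIR / f"{router}.rsc").read_text().splitlines()
    return "\n".join(difflib.unified_diff(good, now, "known-good", "current", lineterm=""))

for router in ("edge1", "edge2"):           # hypothetical router names
    drift = config_drift(router)
    if drift:
        print(f"=== {router} has drifted from known-good ===\n{drift}")
```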

## Impact
- Kusama validator session 44,359 experienced degraded performance with an MVR of 1.0 in Era 7,496 and missed votes in Era 7,495
- Polkadot validator session 10,010 experienced degraded performance with 0.624 MVR in Era 1,662
- Temporary loss of peering sessions with AMSIX, BKNIX, and HGC Hong Kong IPTx

## Current Status and Future Plans
The underlying routing issue (multiple router-ids on loopback interfaces) remains
unresolved. Due to the maintenance freeze at internet exchanges during the holiday
period, the resolution has been postponed until next year. To ensure higher redundancy
during the next maintenance window, we plan to install a third edge router
before attempting the configuration standardization again.

## Future Work
1. Implementation of automated configuration validation testing
2. Enforcement of Safe Mode usage during remote maintenance to prevent cascading failures
3. Investigation into BGP session interdependencies between edge routers
4. Review of [RFC 2328](https://www.ietf.org/rfc/rfc2328.txt) to understand the actual
   protocol and how vendor implementations differ
5. Installation and configuration of third edge router to provide N+2 redundancy
during upcoming maintenance
6. Study of route reflector architecture to move route management from the edge
   routers to a centralized route server such as BIRD, which is known for correctly
   implementing the RFC specs
7. Implementation of [RFC 8195](https://www.rfc-editor.org/rfc/rfc8195.html) for improved traffic steering via large BGP communities (see the sketch below)
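
For context on item 7: RFC 8092 defines Large Communities as three 4-octet values,
and RFC 8195 describes conventions for using them as informational and action
communities. A hedged sketch of how routes might be tagged per ingress exchange
for later traffic steering (the ASN and code values are placeholders, not a
production scheme):

```python
from typing import NamedTuple

class LargeCommunity(NamedTuple):
    """RFC 8092 Large Community: three 4-octet unsigned integers,
    conventionally GlobalAdmin:Function:Parameter per RFC 8195 usage."""
    global_admin: int   # our ASN (placeholder value below)
    function: int       # e.g. 1 = "learned from IXP"
    parameter: int      # e.g. internal IXP identifier

    def __str__(self) -> str:
        return f"{self.global_admin}:{self.function}:{self.parameter}"

# Hypothetical mapping: tag routes by the exchange they were learned on,
# so egress policy can prepend or deprefer per IXP.
ASN = 65000             # placeholder ASN, not our real one
IXP_CODES = {"AMSIX": 1, "BKNIX": 2, "HGC-IPTX": 3}

def ingress_tag(ixp: str) -> LargeCommunity:
    return LargeCommunity(ASN, 1, IXP_CODES[ixp])

print(ingress_tag("AMSIX"))  # 65000:1:1
```
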
35 changes: 35 additions & 0 deletions docs/src/post_mortems.md
@@ -0,0 +1,35 @@
# Post Mortems

## Why We Write Postmortems

At Rotko Network, we believe in radical transparency. While it's common in our
industry to see providers minimize their technical issues or deflect blame onto
others, we choose a different path. Every failure is an opportunity to learn
and improve - not just for us, but for the broader network engineering community.

We've observed a concerning trend where major providers often:
- Minimize the scope of incidents
- Provide vague technical details
- Deflect responsibility to third parties
- Hide valuable learning opportunities

A prime example of this behavior can be seen in the [October 2024 OVHcloud incident](https://blog.cloudflare.com/ovhcloud-outage-route-leak-october-2024),
where their initial response blamed a "peering partner" without acknowledging
the underlying architectural vulnerabilities (lack of basic route filtering) that allowed
the route leak to cause such significant impact.

In contrast, our postmortems:
- Provide detailed technical analysis
- Acknowledge our mistakes openly
- Share our learnings
- Document both immediate fixes and longer-term improvements
- Include specific timeline data for accountability
- Reference relevant RFCs and technical standards

## Directory

### 2024
- [2024-12-19: Edge Router Configuration Incident](network_outage_pm_241219.md)
- Impact: 95-minute connectivity loss affecting AMSIX, BKNIX, and HGC Hong Kong IPTx
- Root Cause: Misconceptions about router-id uniqueness requirements and OSPF behavior
- Status: Partial resolution, follow-up planned for 2025
