pm about 241220 network outage
hitchhooker committed Dec 20, 2024
1 parent 7bcc32c commit fd3e828
Showing 4 changed files with 7,861 additions and 0 deletions.
2 changes: 2 additions & 0 deletions docs/src/SUMMARY.md
@@ -22,3 +22,5 @@
- [Carbon offset](./carbon.md)
- [Team](./team.md)
- [Resources](./resources.md)
- [Post Mortems](./post_mortems.md)
- [241219 - Network Outage](./network_outage_pm_241219.md)
69 changes: 69 additions & 0 deletions docs/src/network_outage_pm_241219.md
@@ -0,0 +1,69 @@
# Network Outage Postmortem (2024-12-19/20)

## Summary
A planned intervention to standardize router-id configurations across our edge
routing infrastructure resulted in an unexpected connectivity loss affecting our
AMSIX Amsterdam, BKNIX, and HGC Hong Kong IPTx peering sessions. The incident
lasted approximately 95 minutes (23:55 UTC to 01:30 UTC) and impacted our validator
performance on both Kusama and Polkadot networks. Specifically, this resulted in
missed votes during Kusama Session 44,359 at Era 7,496 and Polkadot Session
10,010 at Era 1,662, the latter with a 0.624 MVR (missed vote ratio).
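
For clarity, MVR is simply the share of expected para-validation votes that went
uncast, as it is commonly computed for para-validators. A minimal sketch of the
ratio, using hypothetical vote counts rather than the actual session data:

```python
def missed_vote_ratio(missed: int, explicit: int, implicit: int) -> float:
    """Missed vote ratio: missed votes over all expected votes
    (explicit + implicit + missed)."""
    total = explicit + implicit + missed
    return missed / total if total else 0.0

# Hypothetical counts chosen only to illustrate how a 0.624 ratio arises:
# 156 missed out of 250 expected votes.
print(missed_vote_ratio(missed=156, explicit=80, implicit=14))  # 0.624
```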

## Technical Details
The root cause was traced to an attempt to resolve a pre-existing routing anomaly
where our edge routers were operating with multiple router-ids across different
uplink connections and iBGP sessions. The heterogeneous router-id configuration
had been causing next-hop resolution failures and preventing transit traffic from
being routed correctly through our BGP infrastructure.

The original misconfiguration stemmed from an incorrect assumption that router-ids
needed to be globally unique at Internet exchange points. This is not the case:
router-ids only need to be unique within our Interior Gateway Protocol (IGP)
domain. This misunderstanding led to multiple router-ids being configured on
loopback interfaces, creating unnecessary complexity in our routing
infrastructure.
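
A check along these lines is what the automated configuration validation listed
under Future Work is meant to catch early. A rough sketch, assuming configs have
already been parsed into a simple mapping (the sample data and tooling are
hypothetical, not our actual setup):

```python
from collections import defaultdict

# router name -> router-ids found in its exported config
# (loopbacks, OSPF instance, BGP instances). Sample data is hypothetical.
router_ids = {
    "edge1": {"10.0.0.1"},
    "edge2": {"10.0.0.2", "192.0.2.7"},  # multiple ids on one box -> flag it
}

def validate(ids_by_router: dict[str, set[str]]) -> list[str]:
    problems = []
    owners = defaultdict(set)
    for router, ids in ids_by_router.items():
        if len(ids) != 1:
            problems.append(f"{router}: expected one router-id, found {sorted(ids)}")
        for rid in ids:
            owners[rid].add(router)
    # router-ids only need to be unique within the IGP domain, not globally
    for rid, routers in owners.items():
        if len(routers) > 1:
            problems.append(f"router-id {rid} reused by {sorted(routers)}")
    return problems

for issue in validate(router_ids):
    print(issue)
```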

During the remediation attempt to standardize OSPF router-ids to a uniform
value across the infrastructure, we encountered an unexpected failure mode
that propagated to our second edge router, resulting in a total loss of
connectivity despite router and uplink redundancy. The exact mechanism of
the secondary failure remains under investigation: the cascade effect that
caused our redundant edge router to lose connectivity suggests an underlying
architectural vulnerability in our BGP session management.

## Response Timeline
- 23:55 UTC: Initiated planned router-id standardization
- ~23:56 UTC: Primary connectivity loss detected
- ~23:57 UTC: Secondary edge router unexpectedly lost connectivity
- 01:30 UTC: Full service restored via configuration rollback

## Mitigation
Recovery was achieved through an onsite restoration of backed-up router
configurations. While this approach was successful, the 95-minute resolution
time indicates a need for more robust rollback procedures and potentially an
automated configuration management system.
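
As a first step toward that, even a small script that diffs the live exports
against the last known-good backups would speed up rollback decisions. A rough
sketch under assumed file paths and naming conventions (not our actual tooling):

```python
import difflib
from pathlib import Path

# Hypothetical layout: one RouterOS-style export per router, with the
# last known-good copy kept alongside the current export.
BACKUP_DIR = Path("/var/backups/routers")   # assumed path
CURRENT_DIR = Path("/var/exports/routers")  # assumed path

def config_drift(router: str) -> str:
    """Unified diff between the known-good backup and the current export
    for one router; an empty string means no drift."""
    good = (BACKUP_DIR / f"{router}.rsc").read_text().splitlines()
    now = (CURRENT_DIR / f"{router}.rsc").read_text().splitlines()
    return "\n".join(difflib.unified_diff(good, now, "known-good", "current", lineterm=""))

for router in ("edge1", "edge2"):           # hypothetical router names
    drift = config_drift(router)
    if drift:
        print(f"=== {router} has drifted from known-good ===\n{drift}")
```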

## Impact
- Kusama validator session 44,359 experienced degraded performance with an MVR of 1.0 in Era 7,496 and missed votes in Era 7,495
- Polkadot validator session 10,010 experienced degraded performance with 0.624 MVR in Era 1,662
- Temporary loss of peering sessions with AMSIX, BKNIX, and HGC Hong Kong IPTx

## Current Status and Future Plans
The underlying routing issue (multiple router-ids on loopback interfaces) remains
unresolved. Due to the maintenance freeze at internet exchanges during the holiday
period, the resolution has been postponed until next year. To ensure higher redundancy
during the next maintenance window, we plan to install a third edge router
before attempting the configuration standardization again.

## Future Work
1. Implementation of automated configuration validation testing
2. Enforcement of Safe Mode usage during remote maintenance to prevent cascading failures
3. Investigation into BGP session interdependencies between edge routers
4. Review of [RFC 2328](https://www.ietf.org/rfc/rfc2328.txt) to understand the actual
   protocol and how vendor implementations differ
5. Installation and configuration of third edge router to provide N+2 redundancy
during upcoming maintenance
6. Study of route reflector architecture to move route management from the edge
   routers to a centralized route server such as BIRD, which is known for correctly
   implementing the RFC specs
7. Implementation of [RFC 8195](https://www.rfc-editor.org/rfc/rfc8195.html) for improved traffic steering via large BGP communities (see the sketch below)
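
For context on item 7: RFC 8092 defines Large Communities as three 4-octet values,
and RFC 8195 describes conventions for using them as informational and action
communities. A hedged sketch of how routes might be tagged per ingress exchange
for later traffic steering (the ASN and code values are placeholders, not a
production scheme):

```python
from typing import NamedTuple

class LargeCommunity(NamedTuple):
    """RFC 8092 Large Community: three 4-octet unsigned integers,
    conventionally GlobalAdmin:Function:Parameter per RFC 8195 usage."""
    global_admin: int   # our ASN (placeholder value below)
    function: int       # e.g. 1 = "learned from IXP"
    parameter: int      # e.g. internal IXP identifier

    def __str__(self) -> str:
        return f"{self.global_admin}:{self.function}:{self.parameter}"

# Hypothetical mapping: tag routes by the exchange they were learned on,
# so egress policy can prepend or deprefer per IXP.
ASN = 65000             # placeholder ASN, not our real one
IXP_CODES = {"AMSIX": 1, "BKNIX": 2, "HGC-IPTX": 3}

def ingress_tag(ixp: str) -> LargeCommunity:
    return LargeCommunity(ASN, 1, IXP_CODES[ixp])

print(ingress_tag("AMSIX"))  # 65000:1:1
```
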
35 changes: 35 additions & 0 deletions docs/src/post_mortems.md
@@ -0,0 +1,35 @@
# Post Mortems

## Why We Write Postmortems

At Rotko Network, we believe in radical transparency. While it's common in our
industry to see providers minimize their technical issues or deflect blame onto
others, we choose a different path. Every failure is an opportunity to learn
and improve - not just for us, but for the broader network engineering community.

We've observed a concerning trend where major providers often:
- Minimize the scope of incidents
- Provide vague technical details
- Deflect responsibility to third parties
- Hide valuable learning opportunities

A prime example of this behavior can be seen in the [October 2024 OVHcloud incident](https://blog.cloudflare.com/ovhcloud-outage-route-leak-october-2024),
where their initial response blamed a "peering partner" without acknowledging
the underlying architectural vulnerabilities (lack of basic route filtering) that allowed
the route leak to cause such significant impact.

In contrast, our postmortems:
- Provide detailed technical analysis
- Acknowledge our mistakes openly
- Share our learnings
- Document both immediate fixes and longer-term improvements
- Include specific timeline data for accountability
- Reference relevant RFCs and technical standards

## Directory

### 2024
- [2024-12-19: Edge Router Configuration Incident](network_outage_pm_241219.md)
- Impact: 95-minute connectivity loss affecting AMSIX, BKNIX, and HGC Hong Kong IPTx
- Root Cause: Misconceptions about router-id uniqueness requirements and OSPF behavior
- Status: Partial resolution, follow-up planned for 2025
