
Collating information about outages for Incident Reports #2791

Open
balajialg opened this issue Sep 23, 2021 · 6 comments
Labels
documentation Issues around adding and modifying docs

Comments

@balajialg
Contributor

balajialg commented Sep 23, 2021

All,

I wanted to collate all the information about outages across the various hubs in a single place. I see this as useful from multiple perspectives:

  1. Helping us write consolidated incident reports in the future
  2. Verifying @felder's GCP queries
  3. Evaluating whether an outage was caused by an issue we have already fixed
| Date | Hubs/Services Affected by the Outage | User Impact | Reasons |
| --- | --- | --- | --- |
| August 20, 2021 | Datahub RStudio, dlab.datahub RStudio | 300+ students as part of the R workshop | The Jupyter client went through a major upgrade that broke the system; a follow-up PR was created for issue #2585. |
| August 26, 2021 (first day of class) | R hub, Datahub | Stats and Econ students were not able to log into their hubs | Related to a blocking request for course scope through Canvas; PR #2629 was created for issue #2628. |
| September 2, 2021 | Data 100 | Around 10+ students | Caused by the addition of the voila package; see issue #2688. |
| September 13, 2021 | Prob 140 | No data on the impact of the outage | The database was full due to logs; fixed by PR #2749. |
| September 16, 2021 | Data 100 | 50+ students reported issues with their hub instance | The hub restarted with a delay after PR #2768 was merged, resulting in an interim outage for users. |
| September 29, 2021 | EECS Hub | All students in the EECS 16A lab reported memory-related issues with their hub instance | The NFS disk was full; issue description and solution are in issue #2808. |
| October 19, 2021 | R Hub | All users in the R Hub | A storage problem with the hub; issue description and solution are in issue #2902. |
| January 20, 2022 | Many hubs | Most GSIs across multiple hubs | For more information, refer here. |
| February 2, 2022 | Data 100 hub | Minor outage for a few students | A PR merge to prod triggered the pods to be knocked out of the hub. |
| August 8, 2022 | All hubs | Outage that affected all hubs, including Data 8 students | @yuvipanda fixed the core node issue by killing the core node, which resulted in the outage. |
| August 23, 2022 | Data 100 hub | Outage that affected some Data 100 instructors and students | |
| September 5, 2022 | Data 100 hub | Outage that affected a few Data 100 students | |
| September 11, 2022 | Data 100 hub, Biology Hub | Outage that affected all hubs | |
| September 12, 2022 | Stat 20 hub, R Hub | Outage that affected students using the R Hub | Issue details are in #3740. |
| September 14, 2022 | Data 102 Hub | Outage that affected a few students | |
| September 15, 2022 | Stat 20 hub, Data 100 Hub, R Hub | Outage that affected all the hubs | |
| September 18, 2022 | Data 100 Hub | Outage that affected all the hubs due to an NFS server issue | An NFS restart brought the hubs back. |
| October 7, 2022 | All hubs | Hubs down due to an NFS server issue, which affected all users for a short period of time | Yuvi restarted the NFS server, which brought the hubs back. |
| October 9, 2022 | All hubs | Hubs down due to an NFS server issue, which affected all users for a short period of time | Yuvi restarted the NFS server, which brought the hubs back. |
| October 10, 2022 | All hubs | A PR moving all hubs to NFS v3.0 from v4.0 resulted in a crash that affected all users for a short period of time | Reverted back to the original state. |
| October 12, 2022 | Stat 20 hub | Start-up times became very high for the Stat 20 hub; #3836 is tracking this | Yuvi moved Stat 20 to a different node pool altogether. |
| October 15, 2022 | Data 101 hub | Users reported 403 errors | A process to delete inactive users resulted in a race condition; Yuvi deleted the process, which brought the hub back. |
| October 27, 2022 | Data 8 Hub | Data 8 Hub users were not able to access their pods | Not able to recollect the reasons for this outage. |
| October 30, 2022 | All hubs | Hubs were unusable for a short duration of time | Yuvi drained the nodes which had all the affected pods. |
| November 11, 2022 | Data 8, Data 100, and Data 101 Hubs | Hubs were unusable for most users for a short duration of time | Outage due to a node autoscaler issue, which is highlighted in issue #3935. |
| December 2, 2022 | All hubs | Hubs were unavailable for all users for a period of 2 hours | Outage due to an nginx-related issue. |
| February 24, 2023 | All hubs | Hubs were unavailable for all users for a period of 30 minutes, particularly disruptive for the Data 8 hub | Outage due to an nginx-related issue. |
| September 30, 2023 | All hubs | Hubs were unavailable for all users for a period of 40 minutes | Outage due to a TCP OOM / nginx-related issue. |
| December 4, 2023 | All hubs (down for 10-12 minutes) | Hubs were unavailable for all users for 15 minutes | Outage due to a TCP memory-related issue. |
| December 5, 2023 | All hubs (down for 35 minutes) | Hubs were unavailable for all users from 11:10 - 11:40 PM | Outage due to a TCP memory-related issue. |
| February 7 and 21, 2024 | Datahub, Data 100, Data 8, Prob 140 | Users got a "white screen" error when they logged into Datahub; clearing the cache, restarting the server, using an incognito window, or switching browsers were the available workarounds | There is no clarity around the reason for the issue. Piloting a fork of CHP is considered a possibility, but there is no definitive evidence around the root cause. |
| February 23, 2024 | All hubs | Core node restarts caused intermittent outages 5 times between 8:30 - 9 PM | The core node was being autoscaled down from 1 to 0, which had the effect of killing and restarting ALL hub pods. @shaneknapp disabled autoscaling in the core node pool and pinned the node pool size to 1; since then, we haven't had this issue again. |
| April 5, 2024 | Multiple hubs, such as Data 8, 100, and 101 | Users were affected while pulling notebooks from GitHub repositories | Upgrading JupyterHub to 4.1.4 and nbgitpuller to the latest version (1.2.1) broke nbgitpuller functionality; debumping JupyterHub 4.1.4 and nbgitpuller to 1.1.0 fixed the issue for users with nbgitpuller links. |
| September 10, 11, and 12, 2024 | All hubs | A certain percentage of users across all hubs saw a brief downtime (~5 to 10 minutes) after a CHP CPU spike caused a hub restart | CHP memory was increased to 3 GB, and an upstream issue was filed with the configurable-http-proxy maintainers to track the issue. |
| October 23, 2024 | Nature Hub | An ephemeral port issue broke the Nature Hub for a period of 30 minutes | The hub recovered on its own. |
| October 24, 2024 | Bio Hub | An ephemeral port issue broke the Bio Hub for a period of 10-15 minutes | Shane made a fix to recover the hub. |
| October 30, 2024 | All hubs | CHP-related changes broke all hubs | The changes were reverted through a PR. |
@balajialg balajialg self-assigned this Sep 30, 2021
@balajialg balajialg added the documentation Issues around adding and modifying docs label Sep 30, 2021
@balajialg balajialg changed the title [For Incident Reports] Collating information about outages which happened during the past two months [For Incident Reports] Collating information about outages Jul 25, 2022
@balajialg balajialg changed the title [For Incident Reports] Collating information about outages Collating information about outages for After Action Reports Nov 28, 2022
@balajialg
Contributor Author

@ryanlovett You had ideas about combining this issue with pre-existing after-action reports. Do you think #3539 seems like a viable next step, or did you have something else in mind?

@ryanlovett
Collaborator

@balajialg I think every incident should be followed up by a blameless incident report, https://github.com/berkeley-dsep-infra/datahub/blob/staging/docs/admins/incidents/index.rst. Perhaps when there are outages, you can create a GitHub issue that tracks the creation of the incident report and assign it to the admin with the most insight into it.

The reports should follow a template with a summary, timeline, and action items to prevent the issue from recurring. They should be published in the docs.
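
As a rough illustration (the section names and headings below are my assumptions, not an agreed format), a minimal skeleton for such a report in the docs might look like this:

```rst
Incident report: YYYY-MM-DD <short title>
==========================================

Summary
-------
One or two sentences: what broke, which hubs/users were affected, and for how long.

Timeline (all times US/Pacific)
-------------------------------
- HH:MM -- first report or alert
- HH:MM -- mitigation applied
- HH:MM -- service confirmed restored

Action items
------------
- Follow-up tasks, each linked to a tracking issue, to prevent recurrence.
```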

@balajialg
Contributor Author

@ryanlovett For the future, I will create an incident report template that any of the admins with insight can fill in. That would make it easy to start filling in an AAR when an outage happens.

However, what about the outages reported during Fall '22? Do we want to create one incident report that collectively summarizes the learnings and scopes the next steps? I am not sure whether doing individual AARs is feasible given the scope of work required.

Possibly this can be a discussion item for the Monthly Sprint Planning meeting.

@ryanlovett
Collaborator

Ideally each incident would have a separate report since there are often different factors. This semester there were outages due to core nodes, image pulling delays, and the file server. The problem with creating reports too far after the fact is that our memories are hazy.

Are AARs and incident reports the same thing? Our previous incident reports contained an "action items" section, which sounds similar to "After Action" reports. Is an AAR part of an RTL protocol? Wherever the action items are placed, it'd be good if they can all be found in a single place.

@balajialg
Contributor Author

balajialg commented Dec 1, 2022

@ryanlovett Apologies for using AAR and incident report interchangeably to mean the same thing. There is an RTL protocol for sharing a detailed outage template with relevant information with leadership. However, that is more about the logistics of resolving the outage; it doesn't focus on the technical specifics of the incident report.

If @shaneknapp has the bandwidth and is interested, then we can publish an incident report for Fall '22 that outlines the issues due to a) core nodes, b) the file server, and c) image pulling delays, the steps we took in the near term to resolve the outages, and the plans we have for the long term to eliminate the causes of such outages.

@balajialg balajialg changed the title Collating information about outages for After Action Reports Collating information about outages for Incident Reports Dec 3, 2022
@balajialg
Contributor Author

Review the data once again!
