docs: Add SOP for remote write incident management #659
* Ensure Thanos Receive is running and healthy.
* Ensure the hashring is not being OOM killed.
* If there are OOM events, check if traffic into the hashring has increased.
* If there is an unexpected increase in traffic, rate limit accordingly on observatorium-api for the tenant.
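One way to check the OOM step above is to look at container statuses in the output of `kubectl get pods -o json`, where Kubernetes reports `OOMKilled` as the last termination reason. The following is only a minimal sketch: the helper name `find_oom_killed` and the pod and container names in the sample are hypothetical, and the sample dict stands in for real `kubectl` output.

```python
def find_oom_killed(pod_list):
    """Return (pod, container) pairs whose last termination reason was OOMKilled.

    `pod_list` is the parsed JSON of `kubectl get pods -o json`.
    """
    hits = []
    for pod in pod_list.get("items", []):
        pod_name = pod.get("metadata", {}).get("name", "<unknown>")
        statuses = pod.get("status", {}).get("containerStatuses", [])
        for status in statuses:
            # lastState.terminated is absent if the container never restarted.
            terminated = status.get("lastState", {}).get("terminated") or {}
            if terminated.get("reason") == "OOMKilled":
                hits.append((pod_name, status.get("name")))
    return hits


# Hypothetical sample data; real pod and container names will differ.
sample = {
    "items": [
        {
            "metadata": {"name": "thanos-receive-0"},
            "status": {
                "containerStatuses": [
                    {
                        "name": "thanos-receive",
                        "lastState": {"terminated": {"reason": "OOMKilled"}},
                    }
                ]
            },
        },
        {
            "metadata": {"name": "thanos-receive-1"},
            "status": {
                "containerStatuses": [
                    {"name": "thanos-receive", "lastState": {}}
                ]
            },
        },
    ]
}

oom_hits = find_oom_killed(sample)
```

If `oom_hits` is non-empty, the next step in the list applies: check whether traffic into the hashring has increased.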
Secondary point, but I think we should probably document the rate-limiting process as well and link it here. (Not an ask for right now, just for the future.)
Added
* If there are CPU throttling events, check if traffic into the hashring has increased.
* If there is an unexpected increase in traffic, rate limit accordingly on observatorium-api for the tenant.
* If there is an expected increase in traffic, adjust the CPU limits accordingly in the StatefulSet spec.
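The CPU throttling branch above can be sketched as a small decision helper. This assumes throttling is measured as the ratio of the cAdvisor counters `container_cpu_cfs_throttled_periods_total` to `container_cpu_cfs_periods_total` over some window. The function names and the 25% threshold are illustrative assumptions, not values taken from the SOP.

```python
def throttle_ratio(throttled_periods_delta, total_periods_delta):
    """Fraction of CFS periods in which the container was throttled."""
    if total_periods_delta <= 0:
        return 0.0
    return throttled_periods_delta / total_periods_delta


def next_step(ratio, traffic_increased, increase_expected, threshold=0.25):
    """Map the SOP bullets to a suggested action.

    `threshold` is an assumed cut-off for "there are CPU throttling events".
    """
    if ratio < threshold:
        return "no action: CPU throttling below threshold"
    if not traffic_increased:
        # Not covered by the SOP bullets; left for manual investigation.
        return "investigate: throttling without a traffic increase"
    if increase_expected:
        return "adjust CPU limits in the StatefulSet spec"
    return "rate limit the tenant on observatorium-api"


ratio = throttle_ratio(throttled_periods_delta=30, total_periods_delta=100)
action = next_step(ratio, traffic_increased=True, increase_expected=False)
```

With the sample numbers, 30 of 100 periods were throttled and the traffic increase was unexpected, so the helper points to rate limiting the tenant.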
If there is no increase in traffic but we still run into these issues, we should have a documented "check whether there have been changes and roll them back" SOP (also for the future, not right now).
Might be worth adding that as a thing to consider.
I feel like that is getting into the obvious, and at that point, where do you stop documenting things?
Awesome work! Added some nits.
b56e527 to 8542223
https://issues.redhat.com/browse/RHOBS-962