O+M 05-19-2023 #4318

Closed
10 tasks
hkdctol opened this issue May 12, 2023 · 4 comments


hkdctol commented May 12, 2023

Day-to-day operation of Data.gov involves many Operations and Maintenance (O&M) responsibilities. Rather than have the entire team watch notifications and risk some slipping through the cracks, we have created an O&M Triage role. One person on the team holds the Triage role, which rotates each sprint. This is not meant to be a 24/7 responsibility; it covers East Coast business hours only. If you will be unavailable, please note when in Slack and ask someone to take on the role for that time.

Miscs

Acceptance criteria

You are responsible for all O&M responsibilities this week. We've highlighted a few so they're not forgotten. You can copy each checklist into your daily report.

Daily Checklist

Check Production State/Actions

Note: Catalog Auto Tasks
You will need to update the chart values manually. Click the Action link in each issue, take the values from the monitor task output, and check the runtime.

Weekly Checklist

@hkdctol hkdctol added the O&M Operations and maintenance tasks for the Data.gov platform label May 12, 2023
@nickumia-reisys nickumia-reisys mentioned this issue May 15, 2023
10 tasks

nickumia-reisys commented May 15, 2023

Day 1


nickumia-reisys commented May 16, 2023

Day 2

  • Inventory was down from 3:37a to 4:36a. Not sure of the cause, but it seems to be back up and healthy.
  • Did running the dedupe script help reduce the db-solr-sync number of operations? (the world may never know 🤯)
    image


nickumia-reisys commented May 18, 2023

Day 3 + 4

  • Catalog Solr Follower 1 restarted @ 2:41a 5/18
  • Catalog Solr Follower 2 restarted @ 4:23a 5/18
  • Catalog Solr Follower 0 restarted @ 4:26a 5/18
  • Random flakiness in our GitHub Actions pipelines, but nothing that interfered long-term with restarts/actions.
  • 5/18: db-solr-sync and tracking-update took twice the time to execute on the same number of record changes...
  • Incident with federation.data.gov (DNS record removed in error)
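
For the db-solr-sync / tracking-update slowdown noted above, here is a minimal sketch of how a "took twice the time on the same number of record changes" check could be expressed. The `JobRun` shape, the function names, and the 1.5x threshold are all hypothetical and not part of any existing Data.gov tooling, and the numbers in the usage lines are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class JobRun:
    """One run of a scheduled job (hypothetical record shape)."""
    job: str
    seconds: float  # wall-clock runtime of the run
    records: int    # number of record changes processed


def slowdown_ratio(baseline: JobRun, current: JobRun) -> float:
    """Return how many times slower `current` is than `baseline`, per record."""
    if baseline.seconds <= 0:
        raise ValueError("baseline runtime must be positive")
    # Normalize by record count so runs of different sizes are comparable.
    base_rate = baseline.seconds / max(baseline.records, 1)
    cur_rate = current.seconds / max(current.records, 1)
    return cur_rate / base_rate


def is_regression(baseline: JobRun, current: JobRun, threshold: float = 1.5) -> bool:
    """Flag a run whose per-record time is at least `threshold` times the baseline."""
    return slowdown_ratio(baseline, current) >= threshold


# Illustrative values only: same record count, double the runtime.
before = JobRun("db-solr-sync", seconds=600.0, records=1000)
after = JobRun("db-solr-sync", seconds=1200.0, records=1000)
print(slowdown_ratio(before, after), is_regression(before, after))
```

Normalizing by record count is a design choice here: it keeps a larger batch from being mistaken for a slowdown.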


nickumia-reisys commented May 19, 2023

Day 5

  • Catalog Solr Follower 0 restarted @ 7:42a
  • Catalog Solr Follower 2 restarted @ 9:39a
  • Both db-solr-sync and tracking-update are showing persistent slowdowns...
    image
    image
  • No meaningful harvesting issues this week.
  • Assessed upcoming Elastic Load Balancing API changes: https://gsa-tts.slack.com/archives/C2N85536E/p1684526961142519
