
O+M 16-3-2022 #4224

Closed
8 tasks done
hkdctol opened this issue Mar 2, 2023 · 10 comments
Assignees
Labels
Explore O&M Operations and maintenance tasks for the Data.gov platform


hkdctol commented Mar 2, 2023

As part of the day-to-day operation of Data.gov, there are many Operations and Maintenance (O&M) responsibilities. Instead of having the entire team watch notifications and risking some slipping through the cracks, we have created an O&M Triage role. One person on the team is assigned the Triage role, which rotates each sprint. This is not meant to be a 24/7 responsibility, only East Coast business hours. If you are unavailable, note in Slack when you will be unavailable and ask someone to take on the role for that time.

Routine Tasks

These repositories automatically create failure tickets, so there is no need to check their Actions tabs.

Snyk Scans

For Catalog and Inventory, Snyk will create PRs if a dependency needs to be updated.

If either of these actions fails and a PR was created, review and approve or triage it as needed.

If either of these actions fails and a PR was not created, an unfixable vulnerability was found; check the Snyk UI Console to triage the vulnerability.

Daily Routine

GH Actions

Check the Actions tab for each active repository, as these will not create issues automatically on failure.

  • Catalog DB-Solr-Sync Action: the action should finish in minutes. Examine the number of datasets affected if it takes unusually long to finish.
  • Tracking Update Action: the action should take 1-2 hours to finish on prod. Examine the number of datasets affected, or the Solr index speed, if the runtime is way off.
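The two duration checks above lend themselves to a small script. A sketch, assuming workflow-run data fetched from the GitHub REST API (`name`, `run_started_at`, and `updated_at` are that API's field names; the sample runs and the 180-minute threshold are illustrative, not real data):

```python
from datetime import datetime, timedelta

def flag_slow_runs(runs, max_minutes):
    """Return the names of workflow runs whose wall-clock duration exceeded
    max_minutes. Each run dict uses GitHub Actions API field names:
    'name', 'run_started_at', and 'updated_at' (ISO-8601 timestamps)."""
    slow = []
    for run in runs:
        started = datetime.fromisoformat(run["run_started_at"].replace("Z", "+00:00"))
        updated = datetime.fromisoformat(run["updated_at"].replace("Z", "+00:00"))
        if updated - started > timedelta(minutes=max_minutes):
            slow.append(run["name"])
    return slow

# Illustrative runs: Tracking Update should finish in 1-2 hours, so flag
# anything over 3 hours; DB-Solr-Sync should finish in minutes.
runs = [
    {"name": "DB-Solr-Sync", "run_started_at": "2023-03-02T06:00:00Z",
     "updated_at": "2023-03-02T06:07:00Z"},
    {"name": "Tracking Update", "run_started_at": "2023-03-02T06:00:00Z",
     "updated_at": "2023-03-02T10:30:00Z"},
]
print(flag_slow_runs(runs, max_minutes=180))  # ['Tracking Update']
```

The same check could be pointed at live data by listing recent runs per repository and feeding the JSON straight into `flag_slow_runs`.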

Miscellaneous

Weekly Routine

Solr

  • Verify the Solr leader and each follower are functional

Use this command to find Solr URLs and credentials in the prod space.

$ cf t -s prod
$ cf env catalog-web | grep solr -C 2 | grep "uri\|solr_follower_individual_urls\|password\|username"
  • Verify each instance's start time is consistent with the Solr Memory Alert history at path /solr/#/

  • Verify each follower stays in sync with the Solr leader at path /solr/#/ckan/core-overview

  • Verify each Solr is responsive by running a few queries at /solr/#/ckan/query

  • Inspect each Solr's logging for abnormal errors at /solr/#/~logging

  • Examine the Solr Memory Utilization graph to catch any abnormal incidents.

  • Log in to the tts-jump AWS account with role SSBDev@ssb-production and go to the custom SolrAlarm dashboard to see the graph for the past 72 hours. No Solr instance should have MemoryUtilization go above the 90% threshold without getting restarted, and no Solr instance should restart too often (more than a few times a week).
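The leader/follower sync check above can be partly automated by comparing index versions from Solr's replication handler (`/solr/<core>/replication?command=details`). A sketch, assuming the URLs and credentials come from the `cf env` output shown earlier; the exact JSON path to `indexVersion` can differ by Solr version, so treat `fetch_index_version` as an assumption to verify against your instances:

```python
import base64
import json
import urllib.request

def out_of_sync_followers(leader_version, follower_versions):
    """Given the leader's indexVersion and a {follower_name: indexVersion}
    mapping, return the followers whose index version differs."""
    return [name for name, version in follower_versions.items()
            if version != leader_version]

def fetch_index_version(base_url, user, password, core="ckan"):
    """Query a Solr instance's replication handler for its indexVersion.
    Assumes basic auth and the 'details' -> 'indexVersion' JSON path,
    which may vary by Solr version."""
    url = f"{base_url}/solr/{core}/replication?command=details&wt=json"
    req = urllib.request.Request(url)
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    req.add_header("Authorization", f"Basic {token}")
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["details"]["indexVersion"]

# Offline example with canned index versions (no network call):
print(out_of_sync_followers(
    1678000000000,
    {"follower-0": 1678000000000, "follower-1": 1677000000000},
))  # ['follower-1']
```

A follower whose version lags briefly after replication is normal; a follower that stays behind across checks is the signal worth investigating.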

Acceptance criteria

You are responsible for all O&M responsibilities this week. We've highlighted a few so they're not forgotten.

hkdctol added the O&M (Operations and maintenance tasks for the Data.gov platform) label Mar 2, 2023

nickumia-reisys commented Mar 3, 2023

Day 1 Summary

Inventory:

  • No events

Dashboard:

  • Did not check (and won't check for the rest of the week unless something prompts me to)
  • It loaded in the browser 🤷

Catalog:

  • All three followers restarted this morning (Follower 1 --> 4:45a, Follower 0 --> 6:19a, Follower 2 --> 7:02a EST)
    • In each case Solr recovered well and did not show any issues.
    • Notable anomalies that are accepted as normal:
      • Follower 0 is using index.20221007160458545
      • Follower 1 is using index.20230130083830278
      • Follower 2 is using index.20230130083955899
  • catalog-web had multiple high error alerts (7 counts). @FuhuXia said these were all white noise, probably due to cyber traffic, and that error rates > 5% for a few hours are possible; since we can't do anything about it, it's a no-op.
  • Harvesting audit
    • 107 error emails
      • 6 Attention items (will update later)
    • 9 success emails

Other news:


nickumia-reisys commented Mar 6, 2023

Day 2 Summary

Work was delayed because of large overhead:

  • Debugging JSON loading issues from harvesting on 3/3


nickumia-reisys commented Mar 8, 2023

Day 3 Summary


nickumia-reisys commented Mar 8, 2023

Day 4 Summary

  • No major events on prod
  • Harvesting audit
    • 27 Successful harvesting emails
    • 11 Error harvesting emails (all validation errors)
  • Investigated and possibly solved
  • Cleaned up resources related to the following IDs (prod)
    • High Cost IDs
      - 8796cd0b-0066-4675-97e6-94ce402a9fef
      - b3d6f4b4-da69-4232-8571-eba8d8621c52
      - a11a422d-5650-4259-8ea7-785bd930ebd8
      - 3b2ec575-1e58-49c5-b3c0-d546eba12632
      - 69ec800a-4cdb-46a2-a90d-cc15c732a817
      - 858a5eca-7f6d-4ca1-875e-3813c8ab6af7
    • Lower Cost IDs
      - c4c0319a-f8b9-4fd2-a83b-6818ec0139c1
      - 034a2f4d-2698-4657-b17b-3d0f64767454
      - 03ddcebb-09fb-4048-9131-3e8e137fd3bb
      - 1d78d98c-db0f-436b-8a67-815609309ffc
      - 3c2071d9-812c-40f8-97f3-1b1e10b6e786
      - 49337bda-ad9b-467a-a826-21bad09b1f15
      - 62c54669-a850-4495-b539-4633ba959d7d
      - 85274566-b7d4-4748-a68b-96ad156c6c1c
      - 8fe5a471-e566-4be6-8efc-c4611415608c
      - a4721a0d-3c58-4107-aeab-af7c3fe31aa1
      - a921e44b-4ba1-4a46-899a-e4484d87a536
      - aa259c0c-4f8e-47a5-98fc-651d87af9eee
      - d15f531b-8181-4382-933c-2364a545e289
      - dadcfa14-ea32-4ecb-9bd1-270a2c797b71
      - dbcbb946-037f-4331-923e-c4bdae9cff18
      - e22da168-d181-4db6-9da3-4ce936c3c753
      - e67cfaee-fafb-434a-abac-5783f84ca9a3
      - e96ff9c0-58d4-4e3e-bea2-6a6c53a4f2bf
      - ecc5f824-2d9d-4269-82c8-054d335b5198

This was referenced Mar 9, 2023
@nickumia-reisys

Day 5 Summary


nickumia-reisys commented Mar 10, 2023

Day 6 Summary

Other news


nickumia-reisys commented Mar 13, 2023

Day 7 Summary

  • (catalog) solr follower 1 restarted 3/11@4:05a, no core issues
  • (catalog) solr follower 1 restarted today@3:38p, no core issues
  • (catalog) solr follower 2 restarted today@7:21p, no core issues, but the index is at ~7.84 GB on leader + followers (maybe db-solr-sync will be a lot tomorrow?)
  • I noticed that the storage size of the followers dropped on Thursday to ~5.6G; today it's back up to ~7.84G. Given that DB-Solr-Sync had ~15k records to reindex on Thursday and ~13k on Friday, the two appear directly correlated, though I'm not sure how it happened.
  • Updated DMARC Inspections:
    • @FuhuXia was not able to replicate the issue seen on Friday. However, any recipient that forwards emails will show up in the report.
    • Generated the following results with this gist:
      • March 10: Success Rate 1270/1271 = 1.00
      • March 11: Success Rate 1668/1756 = 0.95
      • March 12: Success Rate 546/593 = 0.92
      • March 13: Success Rate 447/485 = 0.92
  • Harvesting Audit (including weekend)
    • 85 Successful harvesting emails
    • 56 Error harvesting emails
      • 55 validation errors
      • 1 failed harvest with a good link, NSF (will investigate). According to @FuhuXia, this server has nightly maintenance that sometimes gets in the way; it has since run successfully, so no real issue.
  • Talked through log review with @FuhuXia; there isn't a good way to find "discrepancies" right now, so we decided to create a ticket to direct the logs from each application through a specialized parsing rule, making the right type of information more searchable and recognizable.
  • Fixed a typo in ckan_auto CI
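For reference, the DMARC success rates reported in these summaries are just pass/total ratios rounded to two places; a trivial sketch of the computation (counts taken from the summary above):

```python
def success_rate(passed, total):
    """DMARC pass rate as reported in the daily summaries, rounded to 2 places."""
    return round(passed / total, 2)

# Counts from the Day 7 summary above.
for day, (passed, total) in {
    "March 10": (1270, 1271),
    "March 11": (1668, 1756),
    "March 12": (546, 593),
    "March 13": (447, 485),
}.items():
    print(day, f"{passed}/{total} =", success_rate(passed, total))
```

Note the March 10 rate rounds up to 1.00 even though one message failed, so the raw counts are worth keeping alongside the ratio.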

@nickumia-reisys

Day 8 Summary


nickumia-reisys commented Mar 15, 2023

Day 9 Summary

  • (catalog) solr follower 2 restarted today@5:57a, no core issues
  • DMARC Audit
    • Success Rate 447/447 = 1.00
  • Harvesting Audit
    • 26 Successful harvesting emails
    • 26 Error harvesting emails (all validation errors)
  • I'm surprised there was nothing else (just paired with the team on other stuff).

hkdctol closed this as completed Mar 16, 2023
@nickumia-reisys

Day 10 Summary

  • DMARC Audit
    • Success Rate 460/460 = 1.00
  • Harvesting Audit
    • 25 Successful harvesting emails
    • 10 Error harvesting emails (all validation errors)

Final thoughts on this O&M shift

  • We need to assign priorities to certain things. I found myself trying to evaluate how important things were, which slowed my response time.
  • We need to continue to find good baselines. There are a decent number of holes in what "normal" looks like for our system.
  • I started with pretty good momentum, but lost it after the first week.
  • The minor improvements I was able to make should continue to be expanded upon. Maybe a few months from now, O&M won't be so painful.
