O+M 2024-06-24 #4799

Closed
12 of 14 tasks
FuhuXia opened this issue Jun 24, 2024 · 7 comments
@FuhuXia
Member

FuhuXia commented Jun 24, 2024

As part of day-to-day operation of Data.gov, there are many Operation and Maintenance (O&M) responsibilities. Instead of having the entire team watching notifications and risking some notifications slipping through the cracks, we have created an O&M Triage role. One person on the team is assigned the Triage role which rotates each sprint. This is not meant to be a 24/7 responsibility, only East Coast business hours. If you are unavailable, please note when you will be unavailable in Slack and ask for someone to take on the role for that time.

Check the O&M Rotation Schedule for future planning.

Acceptance criteria

You are responsible for all O&M responsibilities this week. We've highlighted a few so they're not forgotten. You can copy each checklist into your daily report.

Daily Checklist

Note: Catalog Auto Tasks
You will need to update the chart values manually. Click the Action link in each issue, grab the values from the monitor task output, and check the runtime.

Weekly Checklist

Monthly Checklist

Ad-hoc Checklist

  • Audit/review applications on Cloud Foundry and determine what can be stopped and/or deleted.

Reference

@FuhuXia FuhuXia self-assigned this Jun 24, 2024
@FuhuXia
Member Author

FuhuXia commented Jun 24, 2024

New Relic shows that catalog-web has performed better since #4708.

For example:

Transaction time dropped from ~1100 ms to ~400 ms:

Image

Apdex score:

image

Web request timeout percentage dropped from ~1% to ~0.1%:

image

Postgres DB tracking_summary SELECT query throughput:

image

The tracking_summary SELECT query dropped from first to sixth place in the ranking of most time-consuming queries:

image
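
For a rough sense of scale, the approximate figures quoted in this comment can be turned into relative reductions. This is only a back-of-the-envelope sketch: the inputs are the rounded "~" values from the text, not exact New Relic readings, and the helper name `percent_reduction` is made up for illustration.

```python
def percent_reduction(before: float, after: float) -> float:
    """Return the relative drop from `before` to `after`, as a percentage."""
    return (before - after) / before * 100


# Approximate values quoted in this comment (both are "~" figures).
transaction_ms = percent_reduction(1100, 400)  # transaction time, in ms
timeout_pct = percent_reduction(1.0, 0.1)      # web request timeout rate, in %

print(f"transaction time: ~{transaction_ms:.0f}% lower")
print(f"request timeouts: ~{timeout_pct:.0f}% lower")
```

With those rounded inputs, transaction time comes out roughly 64% lower and the timeout rate roughly 90% lower.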

@FuhuXia
Member Author

FuhuXia commented Jun 24, 2024

Set the following harvest sources to a manual schedule until their source URLs are fixed.

  • /harvest/city-and-county-of-durham-north-carolina-data-json
  • /harvest/santa-rosa-ca-data-json
  • /harvest/new-mexico-resource-geographic-information-system-nm-rgis/

@FuhuXia
Member Author

FuhuXia commented Jun 24, 2024

Since June 5, 2024, our harvest agent has been blocked by the Institute of Museum and Library Services' web server, so the harvest source /harvest/imls-json cannot be harvested.

@hkdctol
Contributor

hkdctol commented Jun 24, 2024

I have reached out to the contact addresses I have for IMLS.

@FuhuXia
Member Author

FuhuXia commented Jun 26, 2024

Since 2024-06-25 03:50 EDT, Googlebot has been sending nonsense traffic to catalog, doubling the total requests catalog receives and doubling catalog-web CPU usage. If this trend continues, we might have to block certain traffic based on the request pattern. Details are in the Slack discussion.

@FuhuXia
Member Author

FuhuXia commented Jun 26, 2024

Reduced prod catalog-web instances from 5 to 3, and memory per instance from 850M to 800M. The following two PRs save us 2050M of memory.

@FuhuXia
Member Author

FuhuXia commented Jun 27, 2024

Changed the following harvest sources back to their original schedules. They had been paused due to repeated ParentNotHarvestedException errors.

  • /harvest/dot-socrata-data-json
  • /harvest/healthdata-gov

@FuhuXia FuhuXia closed this as completed Jul 8, 2024
@FuhuXia FuhuXia mentioned this issue Aug 28, 2024
14 tasks