O+M 30-03-2023 #4238

Closed
15 tasks
hkdctol opened this issue Mar 16, 2023 · 12 comments

@hkdctol
Contributor

hkdctol commented Mar 16, 2023

As part of the day-to-day operation of Data.gov, there are many Operation and Maintenance (O&M) responsibilities. Instead of having the entire team watch notifications and risk some slipping through the cracks, we have created an O&M Triage role. One person on the team is assigned the Triage role, which rotates each sprint. This is not meant to be a 24/7 responsibility; it covers East Coast business hours only. If you are unavailable, please note in Slack when you will be out and ask someone to take on the role for that time.

Miscs

Acceptance criteria

You are responsible for all O&M responsibilities this week. We've highlighted a few so they're not forgotten. You can copy each checklist into your daily report.

Daily Checklist

Check Deployments

  • Catalog
  • Inventory

Check Restarts

  • Catalog
  • Inventory

Check Snyk Scans

  • Catalog
  • Inventory

Check Catalog Auto Tasks

Note
You will need to update the chart values manually. Click the Action link in each issue, grab the values from the monitor task output, and check the runtime.

Check Harvesting Emails

  • Catalog

Other

Weekly Checklist

@hkdctol
Contributor Author

hkdctol commented Mar 16, 2023

For the following sprint: Robbie/Jin/Fuhu

@btylerburton
Contributor

Day 1 Summary

SES report

  • 1 failure attributed to email being forwarded to an alternate domain

Dashboard

  • No alerts. Application is up

Inventory

  • No alerts. Application is up.

Catalog:

  • catalog-web-prod: 9 errors related to Apdex score, web throughput, and error percentage > 15%. No action taken, as the application returned to its normal rate after some time.
  • catalog-admin-prod: 1 error related to web throughput

Harvesting

  • USDOT Geospatial Metadata had harvesting errors
    • 9 records in error / 78 document errors.
    • All errors attributable to failed validation / missing fields.

Solr

@btylerburton
Contributor

Day 2 Summary

  • Deployments failed intermittently but nothing beyond white noise
  • Scheduled restarts succeeding and well within expected times
  • Snyk scans find no vulnerabilities
  • Harvesting errors:
    • 22 errors on 3/19
      • Many sources missing required properties: bureauCode, programCode
      • Or validation requirements not met: required keyword missing, description too short, (parent) identifier not found
    • 19 errors on 3/20
  • New Relic Alerts
    • "Error percentage > 1&" alarm fired twice for catalog-web-prod, lasting 34 min and 41 min, respectively
    • Some samples:
      • Catalog:
        • 2023-03-20 19:36:45,782 ERROR [ckan.views.dataset] Dataset search error: ('Wrong bounding box provided',)
        • 2023-03-20 19:52:08,200 ERROR [ckan.config.middleware.flask_app] No plugin registered for (zip, <class 'rdflib.serializer.Serializer'>)
      • Inventory:
        • 2023-03-20 19:53:05,214 ERROR [ckan.lib.webassets_tools] Trying to include unknown asset: <base/vendor/jquery.js>
  • DMARC found no failures
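
The log samples above share one shape: timestamp, level, logger name in brackets, then the message. As a rough sketch (not something the team is known to use), lines like these could be tallied per logger when reviewing alerts. The regular expression and the function names `parse_log_line` and `count_errors_by_logger` are assumptions made for illustration, based only on the three samples quoted in this comment.

```python
import re
from collections import Counter

# Pattern for lines shaped like the samples above:
# "<date> <time>,<ms> <LEVEL> [<logger>] <message>"
LOG_LINE = re.compile(
    r"^(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) "
    r"(?P<level>[A-Z]+) "
    r"\[(?P<logger>[^\]]+)\] "
    r"(?P<message>.*)$"
)


def parse_log_line(line):
    """Return a dict of the line's fields, or None if it does not match."""
    match = LOG_LINE.match(line.strip())
    return match.groupdict() if match else None


def count_errors_by_logger(lines):
    """Count ERROR-level entries per logger name."""
    counts = Counter()
    for line in lines:
        entry = parse_log_line(line)
        if entry and entry["level"] == "ERROR":
            counts[entry["logger"]] += 1
    return counts


# The three samples quoted in this comment.
samples = [
    "2023-03-20 19:36:45,782 ERROR [ckan.views.dataset] Dataset search error: ('Wrong bounding box provided',)",
    "2023-03-20 19:52:08,200 ERROR [ckan.config.middleware.flask_app] No plugin registered for (zip, <class 'rdflib.serializer.Serializer'>)",
    "2023-03-20 19:53:05,214 ERROR [ckan.lib.webassets_tools] Trying to include unknown asset: <base/vendor/jquery.js>",
]

for logger, count in count_errors_by_logger(samples).most_common():
    print(f"{logger}: {count}")
```

Lines that do not match the pattern (stack traces, for example) are skipped rather than raising, so a raw log dump can be fed in as-is.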

@btylerburton
Contributor

Day 3 Summary

  • Catalog deployment failed due to app start timeout.
  • The DB Solr Sync and Tracking Update issues haven't had new entries added since 3/15 and 3/16, respectively
    • Need to look into that.
  • Harvest jobs
    • 34 failures
    • 42 successes
  • Not sure what I should be looking for in New Relic Error rate or Performance logs
  • DMARC
    • 1 DKIM failure > Marked as a "softfail"
      • google.com!ses-513ec6065d216283.ssb.data.gov!1679270400!1679356799
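
The report name above appears to follow the `receiver!policy-domain!begin-timestamp!end-timestamp` naming convention for DMARC aggregate reports, with the last two fields as Unix timestamps. A small sketch, assuming that layout and a name with no file extension (as quoted here), shows which reporting window it covers. The helper name `parse_dmarc_report_name` is made up for illustration.

```python
from datetime import datetime, timezone


def parse_dmarc_report_name(name):
    """Split a '!'-separated DMARC aggregate report name into its parts.

    Assumes the name carries no file extension, as in the one quoted above.
    """
    receiver, policy_domain, begin, end = name.split("!")[:4]
    return {
        "receiver": receiver,
        "policy_domain": policy_domain,
        "begin": datetime.fromtimestamp(int(begin), tz=timezone.utc),
        "end": datetime.fromtimestamp(int(end), tz=timezone.utc),
    }


report = parse_dmarc_report_name(
    "google.com!ses-513ec6065d216283.ssb.data.gov!1679270400!1679356799"
)
print(report["receiver"])           # google.com
print(report["begin"].isoformat())  # 2023-03-20T00:00:00+00:00
print(report["end"].isoformat())    # 2023-03-20T23:59:59+00:00
```

Under that reading, the softfail falls in the report covering all of 2023-03-20 (UTC).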

@btylerburton
Contributor

Day 4 Summary

  • No fails in DMARC report
  • Deployments to Catalog and Inventory are successful and within expected times
  • Restarts to Catalog and Inventory are successful and within expected times
  • All CKAN Automated Tasks successful
  • Harvesting report: 27 successful, 11 failed
  • New Relic Alerts
    • catalog-web-prod: 10 alerts
    • Inventory: 2 alerts
    • All within normal tolerances

@btylerburton
Contributor

Day 5 Summary

  • Deployments to Catalog and Inventory are successful and within expected times
  • Restarts to Catalog and Inventory are successful and within expected times
  • DMARC report is clean
  • All CKAN Automated Tasks successful
    • Note: I now know how to get valid info for DB Solr Sync and Tracking Updates into the mermaid chart; apologies for the gaps
  • Data.gov static site triggered NR alert.
    • RES: Home page validation text updates
  • Harvesting report:
    • 28 successful
    • 12 failed

@btylerburton
Contributor

btylerburton commented Mar 24, 2023

Day 6 Summary

  • Deployments to Catalog are successful, but Inventory failed even after re-run
    • Inventory error:
      ```
      Preparing metadata (pyproject.toml): started
      Preparing metadata (pyproject.toml): finished with status 'error'
        error: subprocess-exited-with-error

        × Preparing metadata (pyproject.toml) did not run successfully.
        │ exit code: 1
      ```
    
  • Restarts to Catalog and Inventory are successful and within expected times
  • DMARC report finds same single failure from forwarding
  • Harvesting report: 115 successful, 61 failed
    • Failures are due to validation errors
  • Started working on New Relic dashboard for O&M monitoring

@btylerburton
Contributor

Weekly Review Summary

  • Log Review
    • catalog-gather
      • 21 errors / 19 from Friday harvest
      • Mostly errors loading JSON content.
    • catalog-proxy
      • 46k logs from 3/21 4am - 8am
      • Ex. >> 2023/03/21 12:34:31 [error] 115#0: *1357 catalog-prod-datagov.apps.internal could not be resolved (3: Host not found), client: 10.255.38.240, server: catalog-prod-datagov.app.cloud.gov, request: "GET /webassets/vendor/0b01aef1_font-awesome.css HTTP/1.1", host: "catalog-prod-datagov.app.cloud.gov"
    • catalog-web
      • 58k logs from 3/21 4am - 8am
      • Majority are Solr errors
      • Solr core not available due to init failure
    • catalog-admin
      • 24 logs from 3/23 4am - 4:15am
      • Solr returned an error running query ...
    • catalog-fetch
      • 22k logs from 3/17 11pm - 3/18 8am
      • CSDGM import errors
        • Input contained an invalid ISO 19115 Topic Category [biodiversity and ecosystems] generating a regular keyword
      • Spatial transforms
        • Error decoding JSON object
  • All Solr leaders and followers are running and performant
    • All report being restarted 1 day ago, except follower-1, which restarted 15 hrs ago
  • Solr duplicate check
    • 982 duplicate IDENTIFIERs found
    • 43 duplicate GUIDs found
    • Will create a ticket for IDENTIFIERs on Monday
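
The thread does not show how the duplicate check is run, so the following is only a generic sketch of the idea: count how often each value of a field occurs and keep the ones seen more than once. The record layout, the field names `identifier` and `guid`, and the sample values are all made up for illustration; they are not real catalog data or a real Solr query.

```python
from collections import Counter


def find_duplicates(records, field):
    """Return {value: count} for values of `field` that occur more than once."""
    counts = Counter(r[field] for r in records if r.get(field) is not None)
    return {value: n for value, n in counts.items() if n > 1}


# Made-up records for illustration only; not real catalog data.
records = [
    {"guid": "a1", "identifier": "dataset-001"},
    {"guid": "b2", "identifier": "dataset-002"},
    {"guid": "c3", "identifier": "dataset-001"},
    {"guid": "c3", "identifier": "dataset-003"},
]

print(find_duplicates(records, "identifier"))  # {'dataset-001': 2}
print(find_duplicates(records, "guid"))        # {'c3': 2}
```

Records missing the field are ignored rather than counted as duplicates of each other.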

@btylerburton
Contributor

Day 7 Summary

  • Deployments to Catalog and Inventory are successful and within expected times
  • Restarts are failing in Catalog and Inventory due to a missing template error: https://github.com/GSA/catalog.data.gov/actions/runs/4536769881
  • DMARC report clean
  • Automated Snyk/Dependabot scans clean
  • Catalog Auto Tasks updated
  • Harvesting reports 54 successful harvests, 20 failures, all due to validation errors
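
Raw counts are hard to compare because the number of harvest jobs varies a lot from day to day. A quick sketch that turns the success and failure counts from the Day 3 through Day 7 summaries into failure rates is shown below. The numbers are copied from those summaries; the helper name `failure_rate` is just for illustration.

```python
# Harvest job counts as reported in the daily summaries above:
# (day label, successful, failed)
harvest_reports = [
    ("Day 3", 42, 34),
    ("Day 4", 27, 11),
    ("Day 5", 28, 12),
    ("Day 6", 115, 61),
    ("Day 7", 54, 20),
]


def failure_rate(successful, failed):
    """Fraction of harvest jobs that failed; 0.0 when no jobs ran."""
    total = successful + failed
    return failed / total if total else 0.0


for day, ok, bad in harvest_reports:
    print(f"{day}: {bad}/{ok + bad} failed ({failure_rate(ok, bad):.1%})")
```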

@btylerburton
Contributor

Day 8 Summary

  • Deployments to Catalog and Inventory are green again after temp change to get CF public key
  • Restarts in Catalog and Inventory are green again after temp change to get CF public key
  • Scheduled jobs did not run due to CF bug above. Will post new totals on Day 9
  • Catalog Harvest errors are all due to client data validation errors
  • DMARC report finds single softfail forward

@btylerburton
Contributor

Day 9 Summary

  • Deployments and Restarts remain green
  • A few open Snyk PRs in Catalog & Inventory that need to be checked out
  • 22 Harvest Jobs with failures, all attributable to validation errors
  • DMARC report finds single softfail forward

@btylerburton
Contributor

Day 10 Summary

  • Deployments and restarts are green
  • Snyk scans find 1 PR in Catalog with build issues, and 1 PR in Inventory that needed its requirements updated to succeed
  • No Harvest errors reported for the first time in my shift !!!
  • DMARC report finds no failures
  • And that's it for O&M


2 participants