IOOS catalog data services health monitor #60

rsignell-usgs · 2017-12-19T11:37:14Z

Do we have tools to monitor the health of the data services in the IOOS catalog?

@ocefpaf just discovered this morning that his demo wasn't working because
http://crow.marine.usf.edu:8080/thredds/dodsC/FVCOM-Nowcast-Agg.nc.html
is broken. The THREDDS endpoint is there, but OPeNDAP access to this dataset isn't working and returns a java.io.EOFException error (yes, it has been reported to the provider).

Unfortunately this is a common occurrence, and it would be great to know about problems in advance instead of finding out about them when we want to give demos!

daf · 2017-12-19T17:46:55Z

The previous IOOS Catalog was renamed to the Service Monitor and is still operational: https://monitor.ioos.us/

I wasn't able to find a service matching the URL posted, but there is an FVCOM Nowcast here: https://monitor.ioos.us/services/590765c1b46a3d0055e83e36 - not sure if this is the same one.

Note: The service monitor considers it down because it won't respond to the .das request in 60s - testing from here it returns in around 70s.
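For anyone who wants to reproduce that timing by hand, here is a minimal sketch of a `.das` check with a time limit. It is not the Service Monitor's actual code: the function name `check_das` and the injectable `opener` parameter are assumptions made for illustration, and the dataset URL is expected without the `.html` suffix.

```python
import time
import urllib.request


def check_das(dataset_url, timeout_s=60.0, opener=urllib.request.urlopen):
    """Request <dataset_url>.das and return (ok, elapsed_seconds, detail).

    `opener` is a hypothetical hook so the check can be exercised without
    network access; it defaults to urllib.request.urlopen.
    """
    das_url = dataset_url + ".das"
    start = time.monotonic()
    try:
        # Note: urlopen's timeout applies to each blocking socket operation,
        # not to the request as a whole, so total elapsed time is checked too.
        with opener(das_url, timeout=timeout_s) as resp:
            resp.read()
            status = getattr(resp, "status", 200)
    except OSError as exc:
        # URLError, HTTPError and socket timeouts are all OSError subclasses.
        return False, time.monotonic() - start, repr(exc)
    elapsed = time.monotonic() - start
    ok = status == 200 and elapsed <= timeout_s
    return ok, elapsed, f"HTTP {status}"
```

Calling `check_das("http://crow.marine.usf.edu:8080/thredds/dodsC/FVCOM-Nowcast-Agg.nc", timeout_s=80)` would show whether a longer limit is enough for that dataset.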

rsignell-usgs · 2017-12-20T16:03:54Z

@daf, excellent. How does one go about changing the characteristics of individual tests (e.g. changing the .das request timeout from 60s to 80s for that dataset)?

Is there a repo where we can submit a PR?

daf · 2017-12-20T16:48:06Z

I don't think it currently has a per-service timeout; it's just a global setting. See here:

models/harvests.py#L86-L87

Low-hanging fruit here would be to make the second timeout equal to the first (they should likely be equal).

For a per-service timeout, you'd have to add the field to the Service model, have some way of setting it (via Admin area?) and use it in the same harvest code linked above. A much more involved process.
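As a rough illustration of the per-service idea, the sketch below uses a plain dataclass rather than the real Service model. The field name `timeout_s`, the `GLOBAL_TIMEOUT_S` constant, and the `effective_timeout` helper are all hypothetical; the actual model and harvest code may look quite different.

```python
from dataclasses import dataclass
from typing import Optional

# Assumed global default, matching the 60 s limit mentioned in this thread.
GLOBAL_TIMEOUT_S = 60.0


@dataclass
class Service:
    """Stand-in for the Service model (hypothetical fields)."""
    name: str
    url: str
    timeout_s: Optional[float] = None  # per-service override; None means "use global"


def effective_timeout(service, default=GLOBAL_TIMEOUT_S):
    """Return the per-service timeout if one is set and positive, else the global default."""
    if service.timeout_s is not None and service.timeout_s > 0:
        return service.timeout_s
    return default
```

The harvest code would then pass `effective_timeout(service)` wherever it currently passes the global value, and the Admin area would only need to edit the one optional field.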

mwengren · 2018-01-10T14:19:09Z

@rsignell-usgs @daf Just FYA, we're considering retiring the Service Monitor site altogether at some point in the future. It's unclear what the need is for this capability, and some of the functionality is duplicated by the Axiom Sensor Dashboard (for SOS services anyway).

The Service Monitor's monitoring/harvesting routines have caused problems for RA data services in the past, and there is also maintenance overhead in keeping it online. We hadn't had many requests for this capability recently, until this one. The Service Monitor has been rather neglected the past few years, both in terms of development focus and, as far as we know, end user usage.

Basically, we have to figure out what the most logical path forward is, given limited resources. The Sensor Dashboard does quite well monitoring SOS services. Can that be extended to check OPeNDAP and/or ERDDAP service endpoints too? Is this in Axiom's plan for the Dashboard already? From what I've heard, this might be an easier solution than continuing to develop/extend and support the existing Service Monitor.

Open to suggestions, however. How necessary is the capability to monitor service health, and what information do we need when we talk about monitoring (somewhere to go to see past uptime for a service, daily uptime emails of all services to an email list, as has been done in the past, or instant alerting to a POC if a service goes down)?

cc: @kwilcox @shane-axiom @ericmbernier @benjwadams @kknee @dpsnowden @kbailey-noaa

kwilcox · 2018-01-10T18:45:28Z

I recall the service monitor had an ultimate goal of being able to automate the IOOS Asset Inventory process. Is that still the case?

mwengren · 2018-01-16T15:46:17Z

@kwilcox We still have that as a goal. It may just be that we make a new tool to do that, partially based on the Service Monitor code. It hasn't really been decided at this point. Ideas or thoughts welcome. Either way, it will use the metadata in the Catalog for content, so there's reason to make sure that is maintained as well as it can be.

rsignell-usgs · 2018-01-16T16:15:52Z

The basic use case here is: I'd like to know before I run a catalog-driven workflow which services are failing.
If it's too expensive or time-consuming to test all the specific dataset endpoints every hour or every day, here's an idea:

Every hour we could test the health of the main service endpoints (e.g. THREDDS, SOS, ncWMS server endpoints). We could write a script to extract those endpoints from a catalog search into a list to be crawled.

Then we could crawl all the datasets on a more leisurely basis (however long it takes), perhaps just doing them sequentially and then starting again. Kind of like painting the Golden Gate Bridge. Would this take a few days?
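A minimal sketch of the two pieces, under some assumptions: the list of dataset URLs has already been pulled from a catalog search, a "main service endpoint" can be approximated as the host plus the first path segment (e.g. `/thredds/`), and the helper names `service_roots` and `dataset_batches` are made up for illustration.

```python
import itertools
from urllib.parse import urlsplit


def service_roots(dataset_urls):
    """Collapse dataset URLs to unique server roots, preserving first-seen order.

    Assumption: the root is scheme://host[:port]/<first path segment>/.
    """
    roots = []
    for url in dataset_urls:
        parts = urlsplit(url)
        segments = [s for s in parts.path.split("/") if s]
        first = segments[0] + "/" if segments else ""
        root = f"{parts.scheme}://{parts.netloc}/{first}"
        if root not in roots:
            roots.append(root)
    return roots


def dataset_batches(dataset_urls, batch_size):
    """Yield endless batches of dataset URLs, wrapping around at the end of the list.

    This is the "painting the bridge" loop: each scheduler tick takes the next
    batch, and the crawl starts over once every dataset has been visited.
    """
    if not dataset_urls or batch_size < 1:
        return
    cycle = itertools.cycle(dataset_urls)
    while True:
        yield [next(cycle) for _ in range(batch_size)]
```

An hourly job would check everything in `service_roots(urls)`, while a slower background job would pull one batch at a time from `dataset_batches(urls, n)`. How long a full pass takes depends on the number of datasets and the per-request timeout, so the "few days" estimate would need to be measured.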

ocefpaf mentioned this issue Dec 19, 2017

Fix some broken notebooks ioos/notebooks_demos#281

Merged

3 tasks

mwengren mentioned this issue Jan 16, 2018

Service Monitor - Too Many Services Down ioos/ckanext-ioos-theme#153

Closed

mwengren added this to the Release 1.5: Integrated Dataset Preview/UI Improvements milestone Aug 1, 2018

mwengren mentioned this issue Apr 9, 2019

Create a service compliance report in Catalog ioos/ckanext-ioos-theme#204

Open

mwengren mentioned this issue Dec 5, 2019

Umbrella Issue: Service Monitoring, Compliance, and Reporting in CKAN ioos/ckanext-ioos-theme#210

Open

mwengren modified the milestones: Release 1.6: National Products and DMAC Integration, Release 1.7: Dataset Monitoring and Compliance Reporting Jan 5, 2021

mwengren added this to IOOS Catalog Sep 9, 2024

mwengren moved this to Backlog in IOOS Catalog Sep 9, 2024
