IOOS catalog data services health monitor #60

Open
rsignell-usgs opened this issue Dec 19, 2017 · 7 comments

Comments

@rsignell-usgs
Member

Do we have tools to monitor the health of the data services in the IOOS catalog?

@ocefpaf just discovered this morning that his demo wasn't working because
http://crow.marine.usf.edu:8080/thredds/dodsC/FVCOM-Nowcast-Agg.nc.html
is broken. The THREDDS endpoint is up, but OPeNDAP access to this dataset is failing with a java.io.EOFException error (yes, it has been reported to the provider).

Unfortunately this is a common occurrence, and it would be great to know about problems in advance instead of finding them out when we want to give demos!

@daf
Member

daf commented Dec 19, 2017

The previous IOOS Catalog was renamed to the Service Monitor and is still operational: https://monitor.ioos.us/

I wasn't able to find a service from that URL posted, but there is an FVCOM Nowcast here: https://monitor.ioos.us/services/590765c1b46a3d0055e83e36 - not sure if this is the same one.

Note: The service monitor considers it down because it doesn't respond to the .das request within 60 s. Testing from here, it returns in around 70 s.
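To reproduce that kind of check by hand, here is a minimal sketch of a timed .das probe. It is not the Service Monitor's actual code: `check_das`, `urllib_fetch`, and the `fetch` parameter are hypothetical names, and the 60 s default only mirrors the figure quoted above.

```python
import time
import urllib.request
from typing import Callable, Tuple


def urllib_fetch(url: str, timeout_s: float) -> int:
    """Fetch a URL with the standard library and return the HTTP status code.

    Note: `timeout` here is a socket-level timeout for blocking operations,
    not a strict cap on total wall-clock time.
    """
    with urllib.request.urlopen(url, timeout=timeout_s) as resp:
        return resp.status


def check_das(
    dataset_url: str,
    timeout_s: float = 60.0,  # assumed default, taken from the 60 s mentioned above
    fetch: Callable[[str, float], int] = urllib_fetch,
) -> Tuple[bool, float]:
    """Request `<dataset_url>.das` and return (is_up, elapsed_seconds)."""
    start = time.monotonic()
    try:
        is_up = fetch(dataset_url + ".das", timeout_s) == 200
    except OSError:
        # URLError, HTTPError and socket timeouts are all OSError subclasses.
        is_up = False
    return is_up, time.monotonic() - start
```

Calling `check_das(url, timeout_s=80.0)` against the OPeNDAP dataset URL (without the `.html` suffix) would show whether a slightly longer timeout is enough for a slow endpoint like this one.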

@rsignell-usgs
Member Author

@daf, excellent. How does one go about changing the characteristics of individual tests? (e.g. changing the .das request to 80 s instead of 60s for that dataset)

Is there a repo where we can submit a PR?

@daf
Member

daf commented Dec 20, 2017

I don't think it currently has a per-service specific timeout - it's just a global. See here:

models/harvests.py#L86-L87

Low-hanging fruit here would be to make the second timeout equal to the first (they should likely be equal).

For a per-service timeout, you'd have to add the field to the Service model, have some way of setting it (via Admin area?) and use it in the same harvest code linked above. A much more involved process.
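A rough sketch of the per-service idea, assuming a plain dataclass stands in for the real Service model (whose actual fields and storage layer are not shown in this thread). The `timeout_s` field, `GLOBAL_TIMEOUT_S`, and `effective_timeout` are hypothetical names used only for illustration.

```python
from dataclasses import dataclass
from typing import Optional

# Assumed global default, matching the 60 s figure discussed above.
GLOBAL_TIMEOUT_S = 60.0


@dataclass
class Service:
    """Stand-in for the real Service model; only the fields needed here."""
    name: str
    url: str
    timeout_s: Optional[float] = None  # hypothetical new per-service field


def effective_timeout(service: Service, default: float = GLOBAL_TIMEOUT_S) -> float:
    """Use the per-service timeout when one is set, else fall back to the global."""
    if service.timeout_s is not None and service.timeout_s > 0:
        return service.timeout_s
    return default
```

With this shape, the harvest code would pass `effective_timeout(service)` wherever it currently passes the global value, so services without an override behave exactly as before.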

@mwengren
Member

@rsignell-usgs @daf Just FYA, we're considering retiring the Service Monitor site altogether at some point in the future. It's unclear what the need is for this capability, and some of the functionality is duplicated by the Axiom Sensor Dashboard (for SOS services anyway).

The Service Monitor has caused issues with RA data services in performing its monitoring/harvesting routines in the past, which is an issue for the RAs, and also there is maintenance overhead with keeping it online. We hadn't had many requests for this capability recently, until this one. The Service Monitor has been rather neglected the past few years, in terms of development focus, and as far as we know, end user usage.

Basically, we have to figure out what the most logical path forward is, given limited resources. The Sensor Dashboard does quite well monitoring SOS services. Can that be extended to also check OPeNDAP and/or ERDDAP service endpoints too? Is this in Axiom's plan for the Dashboard already? From what I've heard, this might be an easier solution than continuing to develop/extend and support the existing Service Monitor.

Open to suggestions, however. How necessary is the capability to monitor service health, and what information do we need when we talk about monitoring (somewhere to go to see past uptime for a service, daily uptime emails of all services to an email list, as has been done in the past, or instant alerting to a POC if a service goes down)?

cc: @kwilcox @shane-axiom @ericmbernier @benjwadams @kknee @dpsnowden @kbailey-noaa

@kwilcox
Member

kwilcox commented Jan 10, 2018

I recall the service monitor had an ultimate goal of being able to automate the IOOS Asset Inventory process. Is that still the case?

@mwengren
Member

@kwilcox We still have that as a goal. It may just be that we make a new tool to do that, partially based on the Service Monitor code. It hasn't really been decided at this point. Ideas or thoughts welcome. Either way, it will use the metadata in the Catalog for content, so there's reason to make sure that is maintained as best it can be.

@rsignell-usgs
Member Author

rsignell-usgs commented Jan 16, 2018

The basic use case here is: I'd like to know before I run a catalog-driven workflow which services are failing.
If it's too expensive or time consuming to test all the specific dataset endpoints every hour or every day, here's an idea:

Every hour we could test the health of the main service endpoints (e.g. THREDDS, SOS, ncWMS server endpoints). We could write a script to extract those endpoints from a catalog search to a list which is crawled.
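As a sketch of the endpoint-extraction step, assuming the catalog search has already produced a plain list of dataset URLs: the path markers below (`/thredds/`, `/erddap/`, `/ncwms/`) are guesses at common deployment paths rather than anything the catalog guarantees, and `service_root` and `unique_roots` are hypothetical helper names.

```python
from typing import Iterable, List
from urllib.parse import urlsplit

# Assumed path prefixes for common server types; real deployments may differ.
SERVICE_MARKERS = ("/thredds/", "/erddap/", "/ncwms/")


def service_root(url: str) -> str:
    """Reduce a dataset URL to the root of the server that hosts it."""
    parts = urlsplit(url)
    lowered = parts.path.lower()
    for marker in SERVICE_MARKERS:
        idx = lowered.find(marker)
        if idx != -1:
            # Keep everything up to and including the marker, e.g. ".../thredds/".
            return f"{parts.scheme}://{parts.netloc}{parts.path[:idx + len(marker)]}"
    # No known marker (e.g. an SOS endpoint): fall back to the host root.
    return f"{parts.scheme}://{parts.netloc}/"


def unique_roots(urls: Iterable[str]) -> List[str]:
    """Deduplicate dataset URLs down to the list of servers to check hourly."""
    return sorted({service_root(u) for u in urls})
```

For the dataset mentioned at the top of this issue, `service_root` would yield `http://crow.marine.usf.edu:8080/thredds/`, so many datasets on one server collapse into a single hourly check.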

Then we could crawl all the datasets on a more leisurely basis (however long it takes), perhaps just doing them sequentially and then starting again. Kind of like painting the Golden Gate Bridge. Would this take a few days?
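The "start again when you reach the end" loop could look something like the sketch below. `crawl_forever` and `estimate_cycle_days` are hypothetical names, `check` is whatever per-dataset test gets used, and any numbers fed to the estimate are placeholders, since nobody in this thread has measured the dataset count or the average check time.

```python
from itertools import cycle
from typing import Callable, Iterable, Iterator, Tuple


def estimate_cycle_days(n_datasets: int, avg_seconds_per_check: float) -> float:
    """Rough time for one full sequential pass over all datasets, in days."""
    return n_datasets * avg_seconds_per_check / 86400.0


def crawl_forever(
    datasets: Iterable[str], check: Callable[[str], bool]
) -> Iterator[Tuple[str, bool]]:
    """Check datasets one at a time, restarting from the top after the last one."""
    for dataset in cycle(datasets):
        yield dataset, check(dataset)
```

Because the crawl is a generator, the caller decides the pace (sleeping between items, stopping after N checks, and so on). Plugging measured values into `estimate_cycle_days` would answer the "would this take a few days?" question directly.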

Projects
Status: Backlog