replace check-instance-ready endpoint to use status instead of stats from CanarieAPI #293

Draft
wants to merge 1 commit into
base: master
from

Conversation

fmigneault
Collaborator

Overview

Fix errors seen in #283 (comment) and #283 (comment).

More precisely, the /stats endpoint returns 503, even though all services under it seem OK
(https://host-140-154.rdext.crim.ca/canarie/node/service/stats)
image

Using /status is more reliable
(https://host-140-154.rdext.crim.ca/canarie/node/service/status)
image

Changes

Non-breaking changes

  • Scripts: fix check-instance-ready script.

    The previously employed /canarie/node/service/stats endpoint could be unreliable when some services under the node
    produced log collection errors while populating stats. Instead, use /canarie/node/service/status, which checks only
    whether the services are responsive according to the endpoints configured under CanarieAPI. This status endpoint is
    the same one employed by the CI test suite to check that the instance is ready before starting notebook tests.
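
As a rough illustration of the change described above, a readiness check against the status endpoint could be sketched as follows. This is not the actual check-instance-ready script: the function names, `MAX_ATTEMPTS`, `SLEEP_SECONDS` and `FETCH_CMD` are hypothetical, and only the endpoint path and `PAVICS_FQDN` come from this PR.

```shell
#!/bin/sh
# Hypothetical sketch of a readiness poll; all names below are assumptions.

MAX_ATTEMPTS="${MAX_ATTEMPTS:-30}"
SLEEP_SECONDS="${SLEEP_SECONDS:-10}"

# Print the HTTP status code for a URL (curl prints "000" if the request fails).
fetch_http_code() {
    curl --silent --max-time 10 --output /dev/null --write-out '%{http_code}' "$1" || true
}

# Succeed only when the code is exactly 200.
is_ready_code() {
    [ "$1" = "200" ]
}

# Poll the URL until it answers 200 or the attempts run out.
# FETCH_CMD can be overridden (for example in tests) to avoid real network calls.
wait_until_ready() {
    url="$1"
    fetch="${FETCH_CMD:-fetch_http_code}"
    attempt=1
    while [ "$attempt" -le "$MAX_ATTEMPTS" ]; do
        code="$("$fetch" "$url")"
        if is_ready_code "$code"; then
            echo "ready after $attempt attempt(s)"
            return 0
        fi
        echo "attempt $attempt: got HTTP $code, retrying" >&2
        sleep "$SLEEP_SECONDS"
        attempt=$((attempt + 1))
    done
    return 1
}

# Example call (PAVICS_FQDN is expected to come from env.local):
#   wait_until_ready "https://${PAVICS_FQDN}/canarie/node/service/status"
```

Keeping the status-code test in its own small function makes it easy to tighten later, for instance if the response body should also be inspected.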

Breaking changes

  • n/a

Related Issue / Discussion

@fmigneault
Collaborator Author

run tests

@fmigneault
Collaborator Author

@matprov
Jenkins produces another error:
http://daccs-jenkins.crim.ca/job/DACCS-iac-birdhouse/1237

hudson.AbortException: [INFO] Skipping post actions since pipeline was triggered from a PR which points to a deleted branch.
	at org.jenkinsci.plugins.workflow.steps.ErrorStep$Execution.run(ErrorStep.java:63)
	at org.jenkinsci.plugins.workflow.steps.ErrorStep$Execution.run(ErrorStep.java:50)
[...]

@matprov
Collaborator

matprov commented Feb 10, 2023

Jenkins produces another error

@fmigneault Yeah, that's totally fine. I'm currently changing things to make sure pipelines don't get run on non-existent branches, but the changes are still in progress. This issue was introduced a few minutes ago. I'll run the tests for this PR once it is stable.

@matprov
Collaborator

matprov commented Feb 10, 2023

run tests

@crim-jenkins-bot
Collaborator

E2E Test Results

DACCS-iac Pipeline Results

Build URL : http://daccs-jenkins.crim.ca:80/job/DACCS-iac-birdhouse/1238/
Result : failure

BIRDHOUSE_DEPLOY_BRANCH : check-instance-ready
DACCS_CONFIGS_BRANCH : master
PAVICS_E2E_WORKFLOW_TESTS_BRANCH : master
PAVICS_SDI_BRANCH : master

DESTROY_INFRA_ON_EXIT : true
PAVICS_HOST : https://host-140-91.rdext.crim.ca

Infrastructure deployment failed. Instance has not been destroyed. @matprov

@matprov
Collaborator

matprov commented Feb 10, 2023

@fmigneault
Looking at the initialization logs I see this (on the actual instance, not in jenkins logs):

[Weaver] Wait for response from Magpie to login [https://host-140-91.rdext.crim.ca/magpie]. OK!
Validate Magpie token... OK!
[Weaver] Wait for response from Weaver [https://host-140-91.rdext.crim.ca/twitcher/ows/proxy/weaver]. OK!
[Weaver] Using URL: [https://host-140-91.rdext.crim.ca/twitcher/ows/proxy/weaver]
[Weaver] Wait for response from remote WPS provider [catalog] on [https://host-140-91.rdext.crim.ca/twitcher/ows/proxy/catalog]....
[Weaver] Got valid response from remote WPS provider [catalog].
[Weaver] Unregistering any remote WPS provider matching [catalog].
[Weaver] Delete [catalog] response: 404
[Weaver] Registering remote WPS provider [catalog] on [https://host-140-91.rdext.crim.ca/twitcher/ows/proxy/catalog]... 
[Weaver] ERROR: Failed registration of remote WPS provider [catalog] on [https://host-140-91.rdext.crim.ca/twitcher/ows/proxy/catalog].
Error:
<?xml version="1.0" encoding="utf-8"?>
<ExceptionReport version="1.0.0"
    xmlns="http://www.opengis.net/ows/1.1"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://www.opengis.net/ows/1.1 http://schemas.opengis.net/ows/1.1.0/owsExceptionReport.xsd">
    <Exception exceptionCode="NoApplicableCode" locator="NoApplicableCode">
        <ExceptionText>Unhandled error: bad character range _-: at position 33</ExceptionText>
    </Exception>
</ExceptionReport>

@fmigneault
Collaborator Author

@matprov
The catalog seems to misbehave on the GetCapabilities request. I'm not sure what the issue could be, as it seems to have been the same pinned version for ~3 years (https://github.com/bird-house/birdhouse-deploy/blame/bcd1d883ec18c64334dba869007da65e0de8fcd8/birdhouse/docker-compose.yml#L109).
I guess just restart the stack and ignore this if the issue is sporadic. It's not worth investigating because of #287.

@crim-jenkins-bot
Collaborator

E2E Test Results

DACCS-iac Pipeline Results

Build URL : http://daccs-jenkins.crim.ca:80/job/DACCS-iac/60/
Result : failure

BIRDHOUSE_DEPLOY_BRANCH : check-instance-ready
DACCS_CONFIGS_BRANCH :
PAVICS_E2E_WORKFLOW_TESTS_BRANCH :
PAVICS_SDI_BRANCH :

DESTROY_INFRA_ON_EXIT :
PAVICS_HOST : https://

PAVICS-e2e-workflow-tests Pipeline Results

Tests URL :

NOTEBOOK TEST RESULTS
    

Collaborator

@tlvu tlvu left a comment

I think we have to keep /stats.

@@ -13,29 +13,31 @@ COMPOSE_DIR="`dirname "$THIS_DIR"`"

if [ -f "$COMPOSE_DIR/env.local" ]; then
# Get PAVICS_FQDN
-. $COMPOSE_DIR/env.local
+. "${COMPOSE_DIR}/env.local"
Collaborator

Watch out, this change will conflict with the DELAYED_EVAL PR.

fi

MONITOR_URL="https://${PAVICS_FQDN}/canarie/node/service/status"

Collaborator

If my memory is correct, /status always returns 200, so it's a bad test.

In your screenshot below, Solr and ncWMS2 show errors, so if /status still returns 200, that's wrong. I think we have to keep /stats.

image

Collaborator Author

One thing we could do is request /status with the Accept: application/json header to better validate the contents.
I think the 200 is returned only for the HTML representation (though the correct code could be returned in that case as well...)

Technically, /stats could be completely empty if the log parsing cron job has never (yet) run.
I've encountered this issue recently in the PR pipeline, where tests were started too early because /stats looked OK as everything was empty.
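
To make the two points above concrete, here is a minimal sketch that requests the JSON representation and refuses to treat an empty payload as "ready". The function names are hypothetical, and since the JSON structure returned by CanarieAPI is not shown in this thread, the sketch deliberately does not inspect per-service fields.

```shell
#!/bin/sh
# Hypothetical helpers; only the Accept header comes from the comment above.

# Fetch a URL, asking for the JSON representation instead of HTML.
fetch_json() {
    curl --silent --max-time 10 --header 'Accept: application/json' "$1"
}

# Treat a blank, "{}", "[]" or "null" body as "nothing collected yet".
# Whitespace is stripped first so that "{ }" also counts as empty.
payload_has_content() {
    compact="$(printf '%s' "$1" | tr -d ' \t\r\n')"
    case "$compact" in
        ''|'{}'|'[]'|'null') return 1 ;;
        *) return 0 ;;
    esac
}

# Ready only if the request succeeds and the body is non-empty.
json_endpoint_ready() {
    body="$(fetch_json "$1")" || return 1
    payload_has_content "$body"
}

# Example call (PAVICS_FQDN is expected to come from env.local):
#   json_endpoint_ready "https://${PAVICS_FQDN}/canarie/node/service/status"
```

A check like this would only guard against the "everything is empty" case; validating the actual service statuses would require knowing the JSON fields CanarieAPI returns.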

@tlvu
Collaborator

tlvu commented Feb 13, 2023

If Solr and ncWMS2 frequently break in your pipeline, we can remove them from the canarie-api monitoring list since they are scheduled to be removed anyway. This will probably make /stats return 200.

@crim-jenkins-bot
Collaborator

E2E Test Results

DACCS-iac Pipeline Results

Build URL : http://daccs-jenkins.crim.ca:80/job/DACCS-iac-birdhouse/1256/
Result : failure

BIRDHOUSE_DEPLOY_BRANCH : check-instance-ready
DACCS_CONFIGS_BRANCH : master
PAVICS_E2E_WORKFLOW_TESTS_BRANCH : master
PAVICS_SDI_BRANCH : master

DESTROY_INFRA_ON_EXIT : true
PAVICS_HOST : https://host-140-216.rdext.crim.ca

PAVICS-e2e-workflow-tests Pipeline Results

Tests URL : http://daccs-jenkins.crim.ca:80/job/PAVICS-e2e-workflow-tests/job/master/962/

NOTEBOOK TEST RESULTS
    
[2023-02-13T18:56:45.780Z] ============================= test session starts ==============================
[2023-02-13T18:56:45.780Z] platform linux -- Python 3.8.15, pytest-7.2.0, pluggy-1.0.0
[2023-02-13T18:56:45.780Z] rootdir: /home/jenkins/agent/workspace/PAVICS-e2e-workflow-tests_master
[2023-02-13T18:56:45.780Z] plugins: anyio-3.6.2, dash-2.7.0, nbval-0.9.6, tornasync-0.6.0.post2
[2023-02-13T18:56:45.780Z] collected 268 items
[2023-02-13T18:56:45.780Z] 
[2023-02-13T18:56:53.718Z] notebooks-auth/test_thredds.ipynb ...........                            [  4%]
[2023-02-13T18:57:01.780Z] pavics-sdi-master/docs/source/notebooks/WCS_example.ipynb .......        [  6%]
[2023-02-13T18:57:09.511Z] pavics-sdi-master/docs/source/notebooks/WFS_example.ipynb ......         [  8%]
[2023-02-13T18:57:18.766Z] pavics-sdi-master/docs/source/notebooks/WMS_example.ipynb ........       [ 11%]
[2023-02-13T18:57:21.622Z] pavics-sdi-master/docs/source/notebooks/WPS_example.ipynb ..........     [ 15%]
[2023-02-13T19:04:54.140Z] pavics-sdi-master/docs/source/notebooks/climex.ipynb ............        [ 20%]
[2023-02-13T19:04:54.140Z] pavics-sdi-master/docs/source/notebooks/eccc-geoapi-climate-stations.ipynb . [ 20%]
[2023-02-13T19:04:58.084Z] ...............                                                          [ 26%]
[2023-02-13T19:05:07.205Z] pavics-sdi-master/docs/source/notebooks/eccc-geoapi-xclim.ipynb .....    [ 27%]
[2023-02-13T19:05:10.517Z] pavics-sdi-master/docs/source/notebooks/esgf-dap.ipynb FFFFFF            [ 30%]
[2023-02-13T19:05:26.142Z] pavics-sdi-master/docs/source/notebooks/forecasts.ipynb ......           [ 32%]
[2023-02-13T19:05:27.523Z] pavics-sdi-master/docs/source/notebooks/jupyter_extensions.ipynb .       [ 32%]
[2023-02-13T19:05:33.070Z] pavics-sdi-master/docs/source/notebooks/opendap.ipynb .......            [ 35%]
[2023-02-13T19:05:38.545Z] pavics-sdi-master/docs/source/notebooks/pavics_thredds.ipynb .....       [ 37%]
[2023-02-13T19:08:49.731Z] pavics-sdi-master/docs/source/notebooks/regridding.ipynb ............... [ 42%]
[2023-02-13T19:09:54.907Z] .............                                                            [ 47%]
[2023-02-13T19:09:59.881Z] pavics-sdi-master/docs/source/notebooks/rendering.ipynb ....             [ 49%]
[2023-02-13T19:10:02.018Z] pavics-sdi-master/docs/source/notebooks/subset-user-input.ipynb ........ [ 52%]
[2023-02-13T19:10:21.127Z] .................                                                        [ 58%]
[2023-02-13T19:10:27.863Z] pavics-sdi-master/docs/source/notebooks/subsetting.ipynb .....           [ 60%]
[2023-02-13T19:10:29.823Z] pavics-sdi-master/docs/source/notebook-components/weaver_example.ipynb . [ 60%]
[2023-02-13T19:10:30.604Z] .FFFFFFFF                                                                [ 64%]
[2023-02-13T19:10:41.651Z] finch-master/docs/source/notebooks/dap_subset.ipynb ...........          [ 68%]
[2023-02-13T19:10:51.249Z] finch-master/docs/source/notebooks/finch-usage.ipynb ......              [ 70%]
[2023-02-13T19:11:18.645Z] finch-master/docs/source/notebooks/subset.ipynb ....................ss.  [ 79%]
[2023-02-13T19:11:20.024Z] PAVICS-landing-master/content/notebooks/climate_indicators/PAVICStutorial_ClimateDataAnalysis-1DataAccess.ipynb . [ 79%]
[2023-02-13T19:11:23.079Z] ......                                                                   [ 81%]
[2023-02-13T19:11:37.985Z] PAVICS-landing-master/content/notebooks/climate_indicators/PAVICStutorial_ClimateDataAnalysis-2Subsetting.ipynb . [ 82%]
[2023-02-13T19:11:53.672Z] .............                                                            [ 86%]
[2023-02-13T19:12:01.802Z] PAVICS-landing-master/content/notebooks/climate_indicators/PAVICStutorial_ClimateDataAnalysis-3Climate-Indicators.ipynb . [ 87%]
[2023-02-13T19:12:30.959Z] ....s.                                                                   [ 89%]
[2023-02-13T19:12:39.085Z] PAVICS-landing-master/content/notebooks/climate_indicators/PAVICStutorial_ClimateDataAnalysis-4Ensembles.ipynb . [ 89%]
[2023-02-13T19:12:43.155Z] ...                                                                      [ 91%]
[2023-02-13T19:12:58.077Z] PAVICS-landing-master/content/notebooks/climate_indicators/PAVICStutorial_ClimateDataAnalysis-5Visualization.ipynb . [ 91%]
[2023-02-13T19:13:20.151Z] ......                                                                   [ 93%]
[2023-02-13T19:13:22.633Z] notebooks/hummingbird.ipynb ............                                 [ 98%]
[2023-02-13T19:16:47.248Z] notebooks/stress-tests.ipynb .....                                       [100%]
[2023-02-13T19:16:47.248Z] 
[2023-02-13T19:16:47.248Z] =================================== FAILURES ===================================
    
  

@fmigneault
Collaborator Author

@tlvu
This is not critical, so I will wait for Solr and ncWMS2 to be resolved first, and then see whether this is still needed.
I've also been working on an improved CanarieAPI, which should resolve multiple issues I've encountered.

@fmigneault fmigneault marked this pull request as draft February 13, 2023 20:18
@tlvu
Collaborator

tlvu commented Feb 14, 2023

will wait for Solr and ncWMS2 to be resolved first

Misha will do the removal for part 1 of the components refactoring. It might take a while as he is ramping up. If this is frequently causing issues with the CRIM pipeline, you can remove them early in this canarie-api monitoring.

Talking of monitoring, you would want to add Weaver to this canarie-api monitoring at the same time.

@fmigneault
Collaborator Author

you can remove them early in this canarie-api monitoring.

Will do in #284

Talking of monitoring, you would want to add Weaver to this canarie-api monitoring at the same time.

It's already included in it. Same with Cowbird.

4 participants