Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improved health-check ? #1

Open
zbal opened this issue Jul 2, 2019 · 10 comments
Open

Improved health-check ? #1

zbal opened this issue Jul 2, 2019 · 10 comments

Comments

@zbal
Copy link
Member

zbal commented Jul 2, 2019

I’d like to extend loopback.status() to include more useful info - any suggestion ?

Effectively - I’d like to include:

Currently we use /health that return the content of app.loopback.status() — if we don’t want to change that, could we get a /status that returns similar info + extended details ? (feels like duplicate though)…

Ultimately - this will allow us to expose via API the status of a backend component - and render it in an Admin/Status page on admin-ui, offering visibility.

/cc @xavierchow - please help evaluate and re-assign to whomever could help.

@zbal
Copy link
Member Author

zbal commented Jul 8, 2019

Following discussion in Slack...

From @ChopperLee2011 :

add version number + commit hash is doable.
/health and /status seems duplicate, we should use one, since we already use /health, so I prefer to add all the info to /health.

From @zbal :

agreed - also food-for-thoughts - what about having an arg like /health?full=true or /health?extended=true so we can support extending the /health response with extra info that we may not want to normally fetch (because of misc reason; e.g. high IO, need to connect to the DB, need to …) /cc @xavierchow @ChopperLee2011
this way we keep /health (default) small, minimal and common across all components

@xavierchow
Copy link
Member

@woodpig07 pls weigh in.

@ChopperLee2011 ChopperLee2011 self-assigned this Jul 9, 2019
@woodpig07
Copy link
Contributor

I vote for one single route /health to display all the info.

And if we add more and more stuff to it later on, then we can break them down to sections like

  • /health to display basic info
  • /health?stats=true to display basic info and server statistics
  • /health?config=true to display basic info and service configuration
    ....

@woodpig07
Copy link
Contributor

I've created PR so that we can configure the env variables we want to see.

// middleware.json
{
  "routes:before": {
    "loopback-healthcheck-middleware": {
      "params": {
        "env": {
          "component": "OMNI_COMPONENT",
          "ver": "OMNI_COMPONENT_VERSION",
          "commit": "OMNI_COMPONENT_COMMIT"
        }
      }
    }
  }
}

GET /health will return as

{
  started: "2019-08-01T16:24:34.013Z",
  uptime: 63430.004,
  version: "0.0.0",
  env: {
     component: "{OMNI_COMPONENT}",
     ver: "{OMNI_COMPONENT_VERSION}",
     commit: "{OMNI_COMPONENT_COMMIT}"
  }
}

@xavierchow @zbal @ChopperLee2011 let me know if it's good to merge for release

@xavierchow
Copy link
Member

I'm confused, the DevOps is expecting an API to get the info about a service but the API is relying on the env var set by DevOps when deploying? If so why don't check the ansible script directly?

@zbal
Copy link
Member Author

zbal commented Sep 23, 2020

@xavierchow - the reason is simply so we can expose those variables for misc status pages - those status pages are loading the data from the /health check

e.g.

Screen Shot 2020-09-23 at 3 57 04 PM

@MiffyLiye
Copy link
Member

@zbal
Copy link
Member Author

zbal commented Sep 25, 2020

@MiffyLiye @xavierchow - I understand there are other projects planning for status page - what are the ETA for those. This PR and the #2 have been opened for more than a year with no progress. I'm not so keen on having yet another year of discussion - weeks and months are also not "acceptable".

From what I've seen - the health check library you've worked on is in a separate repo, there is no conflicts with this one.
As for the design & the dashboard - the implementation of the base status page was 2h of work to get something out - which we did achieve. Whenever we fill like removing / replacing it with something more advanced - there won't be any problem, the wastage is minimal.

Can we move forward - close this issue and move onto other issues after ?

Note; I'm not saying we should not use that other library @MiffyLiye worked on - only that I need closure and get those tasks completed. Especially when all that is needed is an hour of fix and a merge.

@xavierchow
Copy link
Member

  1. it's opened for more than a year because we are not aligned and no one answered my question here: Improved health-check ? #1 (comment), I'm not quite aware of the priority of this task as well, I don't think we have to build a random request from you just because of it's small size.

  2. this middleware is widely used by many components, merging the PR won't make you status page working, as it requires changes for many components, which context(project) would you like to integrate with the new /health API?

  3. besides the three fields, what else you need to expose to the health API? do you really need the alias for env var? see: expose env variables #2 (comment), to me it's overdesigned and confusing.

          "component": "OMNI_COMPONENT",
          "ver": "OMNI_COMPONENT_VERSION",
          "commit": "OMNI_COMPONENT_COMMIT"

@bsdelf
Copy link

bsdelf commented Sep 25, 2020

K8s has liveness and readiness probes for different purposes:
https://k8smeetup.github.io/docs/tasks/configure-pod-container/configure-liveness-readiness-probes/

However we only expose a single /health API, here is a typical usage of omni component:

        livenessProbe:
          failureThreshold: 3
          httpGet:
            path: /health
            port: 3000
            scheme: HTTP
          initialDelaySeconds: 50
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 1
        name: hype
        ports:
        - containerPort: 3000
          protocol: TCP
        readinessProbe:
          failureThreshold: 3
          httpGet:
            path: /health
            port: 3000
            scheme: HTTP
          initialDelaySeconds: 5
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 1

So I have a few confusions:

  • Do we guarantee that all async bootstrap operations are done when /health returns 200?
  • Is "healthy" actually means "liveness AND readiness" in omni?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

6 participants