Improved health-check ? #1

zbal · 2019-07-02T09:04:25Z

I’d like to extend loopback.status() to include more useful info. Any suggestions?

Effectively - I’d like to include:

the content of env vars (version number + commit hash); those are included at build time for all OMNI components:
https://github.com/Wiredcraft/omni-scrm/blob/master/Dockerfile#L35-L41

Currently we use /health, which returns the content of app.loopback.status(). If we don’t want to change that, could we get a /status that returns similar info plus extended details? (That feels like a duplicate, though.)

Ultimately - this will allow us to expose via API the status of a backend component - and render it in an Admin/Status page on admin-ui, offering visibility.

/cc @xavierchow - please help evaluate and re-assign to whomever could help.


zbal · 2019-07-08T02:50:27Z

Following discussion in Slack...

From @ChopperLee2011 :

Adding the version number + commit hash is doable.
/health and /status seem like duplicates; we should use only one. Since we already use /health, I prefer to add all the info to /health.

From @zbal :

Agreed. Also, food for thought: what about having an arg like /health?full=true or /health?extended=true, so we can extend the /health response with extra info that we may not want to fetch normally (for misc reasons; e.g. high IO, need to connect to the DB, need to …)? /cc @xavierchow @ChopperLee2011
This way we keep the default /health small, minimal and common across all components.
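To make the opt-in idea concrete, here is a minimal sketch, not the middleware's actual code. The helper name `buildHealth`, the `extended` response key and the shape of `extendedChecks` are assumptions made for illustration only. The expensive checks run only when the caller passes `extended=true`, and a failing check is reported inline instead of breaking the whole response.

```javascript
// Hypothetical helper (not part of loopback-healthcheck-middleware).
// `query`          : parsed query string, e.g. { extended: 'true' }
// `basic`          : cheap fields that are always returned
// `extendedChecks` : map of name -> async function, run only on request
async function buildHealth(query, basic, extendedChecks) {
  const body = Object.assign({}, basic);

  // Default path: stay small and avoid any IO.
  if (query.extended !== 'true') {
    return body;
  }

  body.extended = {};
  for (const [name, check] of Object.entries(extendedChecks)) {
    try {
      body.extended[name] = await check();
    } catch (err) {
      // One failing check should not hide the results of the others.
      body.extended[name] = { error: err.message };
    }
  }
  return body;
}
```

With this shape, a plain GET /health never touches the DB, while GET /health?extended=true pays the cost of every registered check.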

xavierchow · 2019-07-08T15:14:20Z

@woodpig07 pls weigh in.

woodpig07 · 2019-07-09T04:16:39Z

I vote for one single route /health to display all the info.

And if we add more and more stuff to it later on, then we can break them down to sections like

/health to display basic info
/health?stats=true to display basic info and server statistics
/health?config=true to display basic info and service configuration
....
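A rough sketch of how the per-section flags above could be handled. The names `SECTIONS`, `selectSections` and `renderHealth` are hypothetical, and only the two sections named in the list are assumed to exist. Unknown query parameters are ignored, so the basic response stays the same for every component.

```javascript
// Hypothetical whitelist; only the sections named in the list above.
const SECTIONS = ['stats', 'config'];

// Returns the section names the caller asked for, e.g. ?stats=true.
function selectSections(query) {
  return SECTIONS.filter((name) => query[name] === 'true');
}

// `providers` maps a section name to a function returning that section's data.
function renderHealth(query, basic, providers) {
  const body = Object.assign({}, basic);
  for (const name of selectSections(query)) {
    body[name] = providers[name]();
  }
  return body;
}
```

Using a fixed whitelist means a typo such as ?stat=true silently falls back to the basic response; whether that or a 400 is preferable is a design choice left open here.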

woodpig07 · 2019-08-02T10:06:22Z

I've created a PR so that we can configure which env variables we want to see.

// middleware.json
{
  "routes:before": {
    "loopback-healthcheck-middleware": {
      "params": {
        "env": {
          "component": "OMNI_COMPONENT",
          "ver": "OMNI_COMPONENT_VERSION",
          "commit": "OMNI_COMPONENT_COMMIT"
        }
      }
    }
  }
}

GET /health will then return:

{
  started: "2019-08-01T16:24:34.013Z",
  uptime: 63430.004,
  version: "0.0.0",
  env: {
    component: "{OMNI_COMPONENT}",
    ver: "{OMNI_COMPONENT_VERSION}",
    commit: "{OMNI_COMPONENT_COMMIT}"
  }
}
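For readers who want to see how such an alias-to-variable mapping might be resolved, here is a small sketch. It is not the PR's code: the function name `resolveEnv` is hypothetical, and dropping unset variables from the output is an assumption.

```javascript
// Hypothetical resolver for the "env" param shown in middleware.json above.
// `mapping` : { alias: 'ENV_VAR_NAME', ... }
// `env`     : usually process.env, passed in so the function is easy to test
function resolveEnv(mapping, env) {
  const out = {};
  for (const [alias, varName] of Object.entries(mapping || {})) {
    // Assumption: unset variables are left out rather than reported as null.
    if (env[varName] !== undefined) {
      out[alias] = env[varName];
    }
  }
  return out;
}

// Example usage with the mapping from the comment above:
// resolveEnv({ ver: 'OMNI_COMPONENT_VERSION' }, process.env)
```

Only variables named in the mapping are ever read, so nothing else from the environment can leak into the response.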

@xavierchow @zbal @ChopperLee2011 let me know if it's good to merge for release

xavierchow · 2019-08-05T06:13:58Z

I'm confused: DevOps is expecting an API to get the info about a service, but the API relies on env vars set by DevOps when deploying? If so, why not check the Ansible script directly?

zbal · 2020-09-23T08:02:24Z

@xavierchow - the reason is simply so we can expose those variables for misc status pages - those status pages are loading the data from the /health check

e.g.

MiffyLiye · 2020-09-23T10:13:39Z

We have another design for a system status page; do we need to combine the designs?

A public system status page about important SLI/SLO
Enhanced health check for OMNI microservices

System Status Page
Health Check Extension

cc @ilyamochalov @xuqingfeng

zbal · 2020-09-25T01:26:56Z

@MiffyLiye @xavierchow - I understand there are other projects planning a status page; what is the ETA for those? This PR and #2 have been open for more than a year with no progress. I'm not so keen on having yet another year of discussion; weeks and months are also not "acceptable".

From what I've seen, the health check library you've worked on is in a separate repo, so there are no conflicts with this one.
As for the design & the dashboard: implementing the base status page took 2h of work to get something out, which we did achieve. Whenever we feel like removing / replacing it with something more advanced, there won't be any problem; the wastage is minimal.

Can we move forward, close this issue and move on to other issues afterwards?

Note: I'm not saying we should not use the other library @MiffyLiye worked on, only that I need closure and to get those tasks completed. Especially when all that is needed is an hour of fixes and a merge.

xavierchow · 2020-09-25T02:35:16Z

It's been open for more than a year because we are not aligned and no one answered my question here: Improved health-check ? #1 (comment). I'm also not aware of the priority of this task, and I don't think we have to build a random request from you just because of its small size.
This middleware is widely used by many components. Merging the PR won't make your status page work, as it requires changes in many components. Which context (project) would you like to integrate with the new /health API?
Besides the three fields, what else do you need to expose in the health API? Do you really need the aliases for the env vars? See: expose env variables #2 (comment). To me it's overdesigned and confusing.

          "component": "OMNI_COMPONENT",
          "ver": "OMNI_COMPONENT_VERSION",
          "commit": "OMNI_COMPONENT_COMMIT"

bsdelf · 2020-09-25T09:35:30Z

K8s has liveness and readiness probes for different purposes:
https://k8smeetup.github.io/docs/tasks/configure-pod-container/configure-liveness-readiness-probes/

However, we only expose a single /health API. Here is a typical configuration for an OMNI component:

        livenessProbe:
          failureThreshold: 3
          httpGet:
            path: /health
            port: 3000
            scheme: HTTP
          initialDelaySeconds: 50
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 1
        name: hype
        ports:
        - containerPort: 3000
          protocol: TCP
        readinessProbe:
          failureThreshold: 3
          httpGet:
            path: /health
            port: 3000
            scheme: HTTP
          initialDelaySeconds: 5
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 1

So I have a few questions:

Do we guarantee that all async bootstrap operations are done when /health returns 200?
Does "healthy" actually mean "liveness AND readiness" in OMNI?
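To illustrate the distinction behind these questions, here is a minimal sketch of separate liveness and readiness answers. Everything in it is an assumption: the paths /health/live and /health/ready, the factory `createProbes` and its methods do not exist in the middleware today. The idea is that liveness answers 200 as soon as the process can serve requests, while readiness answers 503 until the async bootstrap has finished.

```javascript
// Hypothetical probe state; paths and names are illustrative only.
function createProbes() {
  let ready = false;

  return {
    // Call this once all async bootstrap work (DB connections, etc.) is done.
    markReady() {
      ready = true;
    },

    // Returns the HTTP status code a probe request to `path` would receive.
    statusFor(path) {
      if (path === '/health/live') return 200; // process is up
      if (path === '/health/ready') return ready ? 200 : 503; // can take traffic?
      return 404;
    },
  };
}
```

With a split like this, the livenessProbe and readinessProbe in the manifest above could point at different paths instead of both hitting /health.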

zbal added Status: Backlog Priority: Medium labels Jul 2, 2019

zbal assigned xavierchow Jul 2, 2019

xavierchow assigned woodpig07 Jul 8, 2019

ChopperLee2011 self-assigned this Jul 9, 2019

woodpig07 mentioned this issue Aug 2, 2019

expose env variables #2

Open

xavierchow added Status: In review and removed Status: Backlog labels Aug 5, 2019

xavierchow unassigned ChopperLee2011 and woodpig07 Aug 8, 2019
