MGMT-19840: Gather operational metrics from installercache #7281

Open

paul-maidment wants to merge 1 commit into master from add-metrics-to-installercache
Conversation

paul-maidment (Contributor)

This PR traces the following statistics, implemented as counters incremented from the applicable parts of the solution.

counterDescriptionInstallerCachePrunedHardlink           "Counts the number of times the installercache pruned a hardlink for being too old"
counterDescriptionInstallerCacheGetReleaseOK             "Counts the number of times that a release was fetched successfully"
counterDescriptionInstallerCacheGetReleaseTimeout        "Counts the number of times that a release timed out or had the context cancelled"
counterDescriptionInstallerCacheGetReleaseError          "Counts the number of times that a release fetch resulted in error"
counterDescriptionInstallerCacheReleaseCached            "Counts the number of times that a release was found in the cache"
counterDescriptionInstallerCacheReleaseExtracted         "Counts the number of times that a release was extracted"
counterDescriptionInstallerCacheTryEviction              "Counts the number of times that the eviction function was called"
counterDescriptionInstallerCacheReleaseEvicted           "Counts the number of times that a release was evicted"

This, combined with the event-based metrics gathered in #7156, should provide enough information to track the behaviour of the cache.
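
For illustration, a minimal sketch of how counters like these are typically declared with prometheus/client_golang; the subsystem value here is an assumption, and the label set is taken from the review discussion further down, so this is not necessarily the exact code in the PR:

	package metrics

	import "github.com/prometheus/client_golang/prometheus"

	// Hypothetical declaration; the real constants live in internal/metrics/metricsManager.go.
	var installerCacheGetRelease = prometheus.NewCounterVec(
		prometheus.CounterOpts{
			Subsystem: "assisted_installer", // assumed subsystem value
			Name:      "cache_get_release",
			Help:      "Counts the number of times that a release was fetched",
		},
		[]string{"status", "release_id", "cache_hit"},
	)

	func init() {
		prometheus.MustRegister(installerCacheGetRelease)
	}

	// Incremented from the applicable code path, for example:
	//   installerCacheGetRelease.WithLabelValues("ok", "4.18", "true").Inc()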

List all the issues related to this PR

  • New Feature
  • Enhancement
  • Bug fix
  • Tests
  • Documentation
  • CI/CD

What environments does this code impact?

  • Automation (CI, tools, etc)
  • Cloud
  • Operator Managed Deployments
  • None

How was this code tested?

  • assisted-test-infra environment
  • dev-scripts environment
  • Reviewer's test appreciated
  • Waiting for CI to do a full test run
  • Manual (Elaborate on how it was tested)
  • No tests needed

Checklist

  • Title and description added to both commit and PR.
  • Relevant issues have been associated (see CONTRIBUTING guide)
  • This change does not require a documentation update (docstring, docs, README, etc)
  • Does this change include unit-tests (note that code changes require unit-tests)

Reviewers Checklist

  • Are the title and description (in both PR and commit) meaningful and clear?
  • Is there a bug required (and linked) for this change?
  • Should this PR be backported?

openshift-ci-robot added the jira/valid-reference label Feb 6, 2025
openshift-ci-robot commented Feb 6, 2025

@paul-maidment: This pull request references MGMT-19840 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the task to target the "4.19.0" version, but no target version was set.

In response to this:

[the PR description, quoted verbatim above]

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

openshift-ci bot added the size/L label Feb 6, 2025
openshift-ci bot commented Feb 6, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: paul-maidment

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

openshift-ci bot added the approved label Feb 6, 2025
codecov bot commented Feb 6, 2025

Codecov Report

Attention: Patch coverage is 61.81818% with 21 lines in your changes missing coverage. Please review.

Project coverage is 67.96%. Comparing base (86a016c) to head (7b79d30).

Files with missing lines                    Patch %   Lines
internal/installercache/installercache.go   50.00%    9 Missing and 2 partials ⚠️
internal/metrics/metricsManager.go          69.69%    10 Missing ⚠️
Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #7281      +/-   ##
==========================================
- Coverage   67.96%   67.96%   -0.01%     
==========================================
  Files         300      300              
  Lines       41013    41060      +47     
==========================================
+ Hits        27876    27906      +30     
- Misses      10633    10650      +17     
  Partials     2504     2504              
Files with missing lines                    Coverage Δ
internal/metrics/metricsManager.go          70.60% <69.69%> (-0.12%) ⬇️
internal/installercache/installercache.go   67.44% <50.00%> (-2.82%) ⬇️

... and 1 file with indirect coverage changes

paul-maidment marked this pull request as draft February 6, 2025 14:05
openshift-ci bot added the do-not-merge/work-in-progress label Feb 6, 2025
paul-maidment marked this pull request as ready for review February 9, 2025 09:24
openshift-ci bot removed the do-not-merge/work-in-progress label Feb 9, 2025
openshift-ci bot requested a review from ori-amizur February 9, 2025 09:25
paul-maidment marked this pull request as draft February 9, 2025 09:26
openshift-ci bot added the do-not-merge/work-in-progress label Feb 9, 2025
paul-maidment force-pushed the add-metrics-to-installercache branch from f7b0e81 to 37ef3b0 February 16, 2025 08:35
paul-maidment marked this pull request as ready for review February 16, 2025 08:35
openshift-ci bot removed the do-not-merge/work-in-progress label Feb 16, 2025
paul-maidment (Contributor, Author)

/test edge-e2e-metal-assisted-4-18

paul-maidment (Contributor, Author)

/test e2e-agent-compact-ipv4

rccrdpccl (Contributor)

/cc @rivkyrizel

openshift-ci bot requested a review from rivkyrizel February 17, 2025 11:36
paul-maidment force-pushed the add-metrics-to-installercache branch from 37ef3b0 to b0d03a6 February 17, 2025 12:34
paul-maidment force-pushed the add-metrics-to-installercache branch from b0d03a6 to 9e0419f February 18, 2025 12:13
paul-maidment force-pushed the add-metrics-to-installercache branch 4 times, most recently from b66b722 to 9b1602b February 18, 2025 16:34
Subsystem: subsystem,
Name: counterInstallerCacheGetRelease,
Help: counterDescriptionInstallerCacheGetRelease,
}, []string{labelStatus, labelReleaseID, labelCacheHit}),
Contributor

I think the label shouldn't be status, but rather error: set in case of error, and empty otherwise.

Contributor Author

Why do you think that?

Contributor

Because the only purpose of this dimension is to detect an error; it's not about a real status.

Contributor Author (paul-maidment, Feb 19, 2025)

The real purpose of this dimension is to report the outcome of the request and add it to a count for that status. I think this is perfectly reasonable and, with only three possible states, doesn't create a cardinality issue.

There are three statuses we are tracking:

"ok" - indicating success in the request
"error" - indicating that an error (distinctly not a timeout) occurred
"timeout" - indicating that a timeout occurred and that perhaps there are issues with the Mutex

The sum of these counts should equal the total number of requests made. This is why we include them all, so that we have the full context, especially if there are outliers.

The intended statement is "80% of your requests are timing out... you should look at that".

I actually think we should keep it this way.

If you believe otherwise, please explain why. Are you suggesting that we ignore the "OK" and "error" statuses completely and only worry about timeouts?

Contributor

My suggestion is only about semantics; it would technically be the same (just replacing the "status" label with "error"; the values would be {"", "unknown", "timeout"} or something like that).

Contributor

Why is this not actionable?

If we trigger an alert, what action should we take? I can only think of one: fix the code and redeploy. This would be the case with a lot of other errors throughout the code base. What warrants a metric+alert for this specific case?

Regarding errors, IMO we should strive to have generic alerts and easy traceability (with logs, traces, metrics) to the root cause. This seems very specific; if we had to do this for every potential cause of error, we'd have tons of error-related metrics.

What other side effects would the app experience from these timeout errors?

Contributor Author (paul-maidment, Feb 19, 2025)

I can only think of one: fix the code and redeploy.

I am not clear on what is being suggested here: to not track these errors?
Is this not a site reliability issue at that point?

Sure, we can talk about what the potential solutions are, but I don't think that changes the facts on the ground.

If these errors occur, you would want to know about them ASAP, especially as you are delivering a SaaS service.

If we had to do this for every potential cause of error, we'd have tons of error-related metrics.

I am simply recording that a timeout occurred in a specific area; how you choose to interpret that later is up to you. The most likely interpretation (almost 100%) is that you have a Mutex lock-up.

We can ignore this at our peril.

I don't think recording this metric is "recording an error metric"; we are just recording a fact: I saw timeouts. I think this is highly relevant.

What other side effects would the app experience from these timeout errors?

We can assume the inability to extract any release due to a Mutex lockup: a condition severe enough to require immediate intervention.

As for action, sure, you could automate a redeployment and this may work, but if it doesn't, you probably still need a real human to take a look.

I think we can defer the alerting policy until later, as this can probably be done (or not done, if you prefer) in Prometheus, right? I don't think it's incorrect to make this possible by making sure the data is there to alert on.

Contributor

I am not clear on what is being suggested here: to not track these errors?
Is this not a site reliability issue at that point?

What I am suggesting is to make sure we have a path of action for each metric. In this case, SRE could not act on this metric, but it doesn't necessarily mean we do not want to track it.

I do not see why this error would need to be treated in a special way, but I might be wrong - that's what I am trying to find out.

We can assume the inability to extract any release due to a Mutex lockup.

What other symptoms would this cause? Deadlock in the main thread and app unresponsive?

I think we can defer the alerting policy until later, as this can probably be done (or not done, if you prefer) in Prometheus, right? I don't think it's incorrect to make this possible by making sure the data is there to alert on.

The alert policy can be defined at a later stage; however, we want to make sure we expose metrics that will be useful. I think exposing metrics "just in case" is not a good practice and will eventually just result in an overloaded Prometheus.

Contributor Author (paul-maidment, Feb 19, 2025)

What other symptoms would this cause? Deadlock in the main thread and app unresponsive?

As mentioned above

"The inability to extract any release due to Mutex lockup we can assume. A severe enough condition to require immediate intervention."

More specifically this would result in near immediate retries for any requests that needed to download a release as these would be waiting for storage to be free.

Releases that have already been cached would be allowed to proceed.

But the service degradation would be quite nasty.

The alert policy can be defined in a later stage, however we want to make sure we expose metrics that will be useful. I think exposing "just in case" it's not a good practice and will eventually just result in an overloaded prometheus.

This is not "just in case" - we are fairly certain we want to alert when this metric goes 'non zero' in a timeframe.
This metric can be used to infer with a high degree of reliability that there is an error with the Mutex.

My point here was less about registering something "just in case" and more a response to your comment about this metric being "for a specific error". I am making the point that this is not the case.

We are recording this metric to help us detect a specific problem, sure... But we do not decide here whether this is a problem or not (Spoiler -- it is and we would definitely write an alert in this case). That will be done later.

I therefore argue that we don't betray any 'generic' approach by reporting on a fact "There are timeouts"

I don't think this is the wrong approach, we are running a SAAS service and irrespective of whether SRE would be able to take action (we could argue that a service restart even without a code change would help in some cases and at 3AM this is the most likely thing they would try.) they need to know that there is a site reliability issue and the inability to serve a fair proportion of your customers, is exactly that.

Contributor

As discussed offline, we can reach a compromise solution and have a metric to expose error(s).

paul-maidment force-pushed the add-metrics-to-installercache branch from 9b1602b to 1cf589e February 19, 2025 14:47
openshift-merge-robot added the needs-rebase label Feb 27, 2025
paul-maidment force-pushed the add-metrics-to-installercache branch 3 times, most recently from 9b6070c to 6b99a6a February 27, 2025 12:24
openshift-merge-robot removed the needs-rebase label Feb 27, 2025
paul-maidment force-pushed the add-metrics-to-installercache branch from 6b99a6a to f6610df February 27, 2025 12:43
[commit message: identical to the PR description above]
paul-maidment force-pushed the add-metrics-to-installercache branch from f6610df to 7b79d30 February 27, 2025 13:08
paul-maidment (Contributor, Author)

/test edge-lint

openshift-ci bot commented Feb 27, 2025

@paul-maidment: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name                       Commit    Details   Required   Rerun command
ci/prow/e2e-agent-compact-ipv4  7b79d30   link      true       /test e2e-agent-compact-ipv4

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

i.log.WithError(err).Errorf("failed to prune hard link %s", finfo.path)
continue
Contributor

What is the purpose of this continue?

default:
release, err := i.get(releaseID, releaseIDMirror, pullSecret, ocRelease, ocpVersion, clusterID)
majorMinorVersion, err := ocRelease.GetMajorMinorVersion(i.log, releaseID, releaseIDMirror, pullSecret)
Contributor

I think this should be done outside the for-select.

majorMinorVersion, err := ocRelease.GetMajorMinorVersion(i.log, releaseID, releaseIDMirror, pullSecret)
if err != nil {
i.log.Warnf("unable to get majorMinorVersion to record metric for %s falling back to full URI", releaseID)
majorMinorVersion = releaseID
Contributor

As discussed, releaseID should not be a label value. We can set "unknown" or "undefined" or something similar if we cannot extract the value from the release (which should never happen anyway).
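
A sketch of what this suggestion and the previous one (hoisting the lookup out of the for-select) might look like together. The identifiers mirror the quoted diff; the surrounding control flow and return values are assumed:

	// Resolve the version label once, before the retry loop, and fall back
	// to a fixed label value rather than the full release ID.
	majorMinorVersion, err := ocRelease.GetMajorMinorVersion(i.log, releaseID, releaseIDMirror, pullSecret)
	if err != nil {
		i.log.Warnf("unable to get majorMinorVersion to record metric for %s, using \"unknown\"", releaseID)
		majorMinorVersion = "unknown"
	}
	for {
		select {
		case <-ctx.Done():
			i.metricsAPI.InstallerCacheGetReleaseTimeout(majorMinorVersion)
			return nil, ctx.Err()
		default:
			release, err := i.get(releaseID, releaseIDMirror, pullSecret, ocRelease, ocpVersion, clusterID)
			if err == nil {
				i.metricsAPI.InstallerCacheGetReleaseOK(majorMinorVersion)
				return release, nil
			}
			// retry/backoff handling for transient failures is assumed here
		}
	}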

if err == nil {
i.metricsAPI.InstallerCacheGetReleaseOK(majorMinorVersion)
Contributor

We agreed on having 2 metrics:

  • assisted_installer_cache_get_release{hit="(true|false)"}
  • assisted_installer_cache_get_release_error{error="(timeout|whatever|other/unknown/undefined)"}

I still think the error metric won't be of much (or any) use, but we definitely do not want to count all requests there. What I had in mind is that hits+misses+errors should add up to the total number of requests to this function.

Did I misunderstand anything?
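
For concreteness, a hedged sketch of that agreed split; the metric and label names come from the comment above, while the Go wiring is illustrative:

	// Hits and misses in one counter, partitioned by the "hit" label.
	var cacheGetRelease = prometheus.NewCounterVec(
		prometheus.CounterOpts{
			Name: "assisted_installer_cache_get_release",
			Help: "Counts installer cache release fetches",
		},
		[]string{"hit"}, // "true" or "false"
	)

	// Errors in a separate counter, partitioned by cause.
	var cacheGetReleaseError = prometheus.NewCounterVec(
		prometheus.CounterOpts{
			Name: "assisted_installer_cache_get_release_error",
			Help: "Counts installer cache release fetch errors",
		},
		[]string{"error"}, // e.g. "timeout" or "other"
	)

	// Under this scheme, sum(cacheGetRelease) + sum(cacheGetReleaseError)
	// should equal the total number of calls to the function.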

@@ -39,6 +40,10 @@ const (
counterFilesystemUsagePercentage = "assisted_installer_filesystem_usage_percentage"
histogramMonitoredHostsDurationMs = "assisted_installer_monitored_hosts_duration_ms"
histogramMonitoredClustersDurationMs = "assisted_installer_monitored_clusters_duration_ms"
counterInstallerCacheGetRelease = "assisted_installer_cache_get_release"
counterInstallerCacheReleaseCached = "assisted_installer_cache_get_release_cached"
Contributor

I think this is unused

@@ -310,8 +325,10 @@ func (i *Installers) evictFile(filePath string) error {
i.log.Infof("evicting binary file %s due to storage pressure", filePath)
err := os.Remove(filePath)
if err != nil {
i.metricsAPI.InstallerCacheReleaseEvicted(false)
Contributor

Shouldn't we measure this in the outer method evict()? https://github.com/openshift/assisted-service/pull/7281/files#diff-cf73de10b668273452bf6aeb594f0385d3f3ea5b1a851c0f577d2d8989e6cccdR320

If we really want to track the number of files evicted, we should change the metric type to a bucketed one.
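
Two hedged sketches of those alternatives; the method names follow the quoted diff, everything else is an assumption:

	// (a) Record the outcome once, in the outer evict(), rather than per call site:
	func (i *Installers) evict(filePath string) {
		err := i.evictFile(filePath)
		i.metricsAPI.InstallerCacheReleaseEvicted(err == nil)
	}

	// (b) If the goal is "files evicted per eviction pass", a bucketed
	// (histogram) metric fits better than a boolean-labelled counter:
	var evictedFilesPerPass = prometheus.NewHistogram(prometheus.HistogramOpts{
		Name:    "assisted_installer_cache_evicted_files",
		Help:    "Number of files evicted per eviction pass",
		Buckets: prometheus.LinearBuckets(1, 1, 10), // hypothetical buckets
	})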

Labels
approved: Indicates a PR has been approved by an approver from all required OWNERS files.
jira/valid-reference: Indicates that this PR references a valid Jira ticket of any type.
size/L: Denotes a PR that changes 100-499 lines, ignoring generated files.

5 participants