MGMT-19840: Gather operational metrics from installercache #7281
base: master
Conversation
@paul-maidment: This pull request references MGMT-19840 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the task to target the "4.19.0" version, but no target version was set. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: paul-maidment. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
Codecov Report. Attention: Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## master #7281 +/- ##
==========================================
- Coverage 67.96% 67.96% -0.01%
==========================================
Files 300 300
Lines 41013 41060 +47
==========================================
+ Hits 27876 27906 +30
- Misses 10633 10650 +17
Partials 2504 2504
Force-pushed from f7b0e81 to 37ef3b0.
/test edge-e2e-metal-assisted-4-18
/test e2e-agent-compact-ipv4
/cc @rivkyrizel
Force-pushed from 37ef3b0 to b0d03a6.
Force-pushed from b0d03a6 to 9e0419f.
Force-pushed from b66b722 to 9b1602b.
internal/metrics/metricsManager.go (Outdated)
Subsystem: subsystem,
Name: counterInstallerCacheGetRelease,
Help: counterDescriptionInstallerCacheGetRelease,
}, []string{labelStatus, labelReleaseID, labelCacheHit}),
I think the label shouldn't be status, but rather error in case of error, empty otherwise
Why do you think that?
Because the only purpose of this dimension is to detect an error; it's not about a real status.
The real purpose of this dimension is to report the outcome of the request and add it to a count for that status. I think this is perfectly reasonable and, with only three possible states, doesn't create a cardinality issue.
There are three statuses we are tracking:
"ok" - indicating success in the request
"error" - indicating that an error (distinctly not a timeout) occurred.
"timeout" - indicating that a timeout occurred and that perhaps there are issues with the Mutex.
The sum of these counts should equal the total number of requests made. This is why we include them all, so that we have the full context, especially if there are outliers.
The intended statement is "80% of your requests are timing out... you should look at that".
I actually think we should keep it this way.
If you believe otherwise, please explain why. Are you suggesting that we ignore the "ok" and "error" statuses completely and only worry about timeouts?
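For illustration, a minimal sketch of how the three outcomes described above could be recorded with a single status-labelled counter using prometheus/client_golang. This is not the PR's actual code; the metric name, label names, and the recording helper are assumptions:

```go
package metrics

import (
	"strconv"

	"github.com/prometheus/client_golang/prometheus"
)

// Illustrative sketch only: metric and label names are assumptions, not the PR's exact identifiers.
var installerCacheGetRelease = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "assisted_installer_cache_get_release",
		Help: "Installer cache release requests, partitioned by outcome",
	},
	[]string{"status", "release_id", "cache_hit"},
)

func init() {
	prometheus.MustRegister(installerCacheGetRelease)
}

// recordGetRelease increments exactly one status per request, so summing the
// counter over the status label yields the total number of requests.
func recordGetRelease(err error, timedOut, cacheHit bool, releaseVersion string) {
	status := "ok"
	switch {
	case timedOut:
		status = "timeout"
	case err != nil:
		status = "error"
	}
	installerCacheGetRelease.WithLabelValues(status, releaseVersion, strconv.FormatBool(cacheHit)).Inc()
}
```

With this shape, summing the counter over the status label gives the total request count, and the timeout series can be alerted on directly.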
My suggestion is only about semantics; it would technically be the same (just replacing the "status" label with "error", where the values would be {"", "unknown", "timeout"} or something like that).
Why is this not actionable?
If we trigger an alert, what action should we take? I can only think of one: fix the code and redeploy. This would be the case with a lot of other errors throughout the code base. What warrants a metric plus an alert for this specific case?
Regarding errors, IMO we should strive to have generic alerts and easy traceability (with logs, traces, metrics) to the root cause. This seems very specific; if we had to do this for every potential cause of error, we'd have tons of error-related metrics.
What other side effects would the app experience from these timeout errors?
I can only think of one: fix the code and redeploy.
I am not clear on what is being suggested here: to not track these errors?
Is this not a site reliability issue at that point?
Sure, we can talk about what the potential solutions are, but I don't think that changes the facts on the ground.
If these errors occur, you would want to know about them ASAP, especially as you are delivering a SaaS service.
If we had to do this for every potential cause of error, we'd have tons of error-related metrics.
I am simply recording that a timeout occurred in a specific area; how you choose to interpret that later is up to you. The most likely interpretation (almost 100%) is that you have a Mutex lockup.
We can ignore this at our peril.
I don't think recording this metric is "recording an error metric"; we are just recording a fact -- I saw timeouts. I think this is highly relevant.
What other side effects would the app experience from these timeout errors?
We can assume the inability to extract any release due to a Mutex lockup. A condition severe enough to require immediate intervention.
As for action, sure, you could automate a redeployment and this may work, but if it doesn't, you probably still need a real human to take a look.
As for the alerting policy, I think we can defer that until later, as it can probably be done (or not done, if you prefer) in Prometheus, right? I don't think it's incorrect to make this possible by making sure the data is there to alert on.
I am not clear on what is being suggested here: to not track these errors?
Is this not a site reliability issue at that point?
What I am suggesting is to make sure we have a path of action for each metric. In this case, SRE could not act on this metric, but it doesn't necessarily mean we do not want to track it.
I do not see why this error would need to be treated in a special way, but I might be wrong - that's what I am trying to find out.
We can assume the inability to extract any release due to a Mutex lockup.
What other symptoms would this cause? Deadlock in the main thread and app unresponsive?
As for the alerting policy, I think we can defer that until later, as it can probably be done (or not done, if you prefer) in Prometheus, right? I don't think it's incorrect to make this possible by making sure the data is there to alert on.
The alert policy can be defined at a later stage; however, we want to make sure we expose metrics that will be useful. I think exposing metrics "just in case" is not good practice and will eventually just result in an overloaded Prometheus.
What other symptoms would this cause? Deadlock in the main thread and app unresponsive?
As mentioned above:
"We can assume the inability to extract any release due to a Mutex lockup. A condition severe enough to require immediate intervention."
More specifically, this would result in near-immediate retries for any requests that needed to download a release, as these would be waiting for storage to become free.
Releases that have already been cached would be allowed to proceed.
But the service degradation would be quite nasty.
The alert policy can be defined at a later stage; however, we want to make sure we expose metrics that will be useful. I think exposing metrics "just in case" is not good practice and will eventually just result in an overloaded Prometheus.
This is not "just in case": we are fairly certain we want to alert when this metric goes non-zero within a given timeframe.
This metric can be used to infer, with a high degree of reliability, that there is an error with the Mutex.
My point here was less about registering something "just in case" and more a response to your comment about this metric being "for a specific error". I am making the point that this is not the case.
We are recording this metric to help us detect a specific problem, sure... but we do not decide here whether this is a problem or not (spoiler: it is, and we would definitely write an alert in this case). That will be done later.
I therefore argue that we don't betray any 'generic' approach by reporting on the fact "there are timeouts".
I don't think this is the wrong approach. We are running a SaaS service, and irrespective of whether SRE would be able to take action (we could argue that a service restart, even without a code change, would help in some cases, and at 3 AM this is the most likely thing they would try), they need to know that there is a site reliability issue; the inability to serve a fair proportion of your customers is exactly that.
As discussed offline, we can go with a compromise solution and have a metric to expose error(s).
Force-pushed from 9b1602b to 1cf589e.
Force-pushed from 9b6070c to 6b99a6a.
Force-pushed from 6b99a6a to f6610df.
The intent of this PR is to trace the following statistics, implemented as counts and incremented from applicable parts of the solution:
counterDescriptionInstallerCachePrunedHardlink: "Counts the number of times the installercache pruned a hardlink for being too old"
counterDescriptionInstallerCacheGetReleaseOK: "Counts the number of times that a release was fetched successfully"
counterDescriptionInstallerCacheGetReleaseTimeout: "Counts the number of times that a release timed out or had the context cancelled"
counterDescriptionInstallerCacheGetReleaseError: "Counts the number of times that a release fetch resulted in error"
counterDescriptionInstallerCacheReleaseCached: "Counts the number of times that a release was found in the cache"
counterDescriptionInstallerCacheReleaseExtracted: "Counts the number of times that a release was extracted"
counterDescriptionInstallerCacheTryEviction: "Counts the number of times that the eviction function was called"
counterDescriptionInstallerCacheReleaseEvicted: "Counts the number of times that a release was evicted"
This, combined with the event-based metrics gathered in openshift#7156, should provide enough information to track the behaviour of the cache.
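As a reference point, here is a hypothetical sketch of the metrics API surface these counters imply. Only InstallerCacheGetReleaseOK and InstallerCacheReleaseEvicted appear verbatim in the review excerpts below; the other method names and signatures are assumptions based on the counter descriptions:

```go
package metrics

// InstallerCacheMetricsAPI is an illustrative interface matching the counters
// described in the PR; method names other than InstallerCacheGetReleaseOK and
// InstallerCacheReleaseEvicted are guesses based on the counter descriptions.
type InstallerCacheMetricsAPI interface {
	InstallerCachePrunedHardlink()
	InstallerCacheGetReleaseOK(majorMinorVersion string)
	InstallerCacheGetReleaseTimeout()
	InstallerCacheGetReleaseError()
	InstallerCacheReleaseCached(majorMinorVersion string)
	InstallerCacheReleaseExtracted(majorMinorVersion string)
	InstallerCacheTryEviction()
	InstallerCacheReleaseEvicted(success bool)
}
```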
Force-pushed from f6610df to 7b79d30.
/test edge-lint
@paul-maidment: The following test failed, say
Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
i.log.WithError(err).Errorf("failed to prune hard link %s", finfo.path)
continue
What is the purpose of this continue?
default:
	release, err := i.get(releaseID, releaseIDMirror, pullSecret, ocRelease, ocpVersion, clusterID)
	majorMinorVersion, err := ocRelease.GetMajorMinorVersion(i.log, releaseID, releaseIDMirror, pullSecret)
I think this should be done outside the for-select loop.
majorMinorVersion, err := ocRelease.GetMajorMinorVersion(i.log, releaseID, releaseIDMirror, pullSecret)
if err != nil {
	i.log.Warnf("unable to get majorMinorVersion to record metric for %s falling back to full URI", releaseID)
	majorMinorVersion = releaseID
As discussed, releaseID should not be a value here. We can set unknown or undefined (or something similar) if we cannot extract the value from the release (which should never happen anyway).
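A tiny sketch of that suggestion (the helper itself is hypothetical and not part of the PR): fall back to a fixed "unknown" placeholder rather than the full release ID, so label cardinality stays bounded:

```go
package installercache // assumed package name, for illustration only

// versionLabel returns the label value used when recording the metric: the
// extracted major.minor version when available, otherwise a fixed "unknown"
// placeholder instead of the full release ID.
func versionLabel(majorMinor string, err error) string {
	if err != nil {
		return "unknown"
	}
	return majorMinor
}
```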
if err == nil {
	i.metricsAPI.InstallerCacheGetReleaseOK(majorMinorVersion)
We agreed on having 2 metrics:
assisted_installer_cache_get_release{hit="(true|false)"}
assisted_installer_cache_get_release_error{error="(timeout|whatever|other/unknown/undefined)"}
I still think the error metric won't be of much (or any) use, but we definitely do not want to count all requests there. What I had in mind is that hits + misses + errors should add up to the total number of requests to this function.
Did I misunderstand anything?
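For clarity, a rough sketch of what that agreed two-metric shape could look like with prometheus/client_golang. The metric names and labels come from the comment above; the help strings, registration, and recording helper are assumptions:

```go
package metrics

import (
	"strconv"

	"github.com/prometheus/client_golang/prometheus"
)

var (
	// Successful fetches, partitioned by cache hit/miss.
	cacheGetRelease = prometheus.NewCounterVec(
		prometheus.CounterOpts{
			Name: "assisted_installer_cache_get_release",
			Help: "Installer cache release fetches, partitioned by cache hit",
		},
		[]string{"hit"},
	)
	// Failed fetches, partitioned by error kind (e.g. "timeout" or "other").
	cacheGetReleaseError = prometheus.NewCounterVec(
		prometheus.CounterOpts{
			Name: "assisted_installer_cache_get_release_error",
			Help: "Installer cache release fetch errors, partitioned by error kind",
		},
		[]string{"error"},
	)
)

func init() {
	prometheus.MustRegister(cacheGetRelease, cacheGetReleaseError)
}

// recordGetRelease records each request exactly once, so hits + misses + errors
// add up to the total number of calls.
func recordGetRelease(hit bool, errKind string) {
	if errKind != "" {
		cacheGetReleaseError.WithLabelValues(errKind).Inc()
		return
	}
	cacheGetRelease.WithLabelValues(strconv.FormatBool(hit)).Inc()
}
```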
@@ -39,6 +40,10 @@ const (
	counterFilesystemUsagePercentage = "assisted_installer_filesystem_usage_percentage"
	histogramMonitoredHostsDurationMs = "assisted_installer_monitored_hosts_duration_ms"
	histogramMonitoredClustersDurationMs = "assisted_installer_monitored_clusters_duration_ms"
	counterInstallerCacheGetRelease = "assisted_installer_cache_get_release"
	counterInstallerCacheReleaseCached = "assisted_installer_cache_get_release_cached"
I think this is unused
@@ -310,8 +325,10 @@ func (i *Installers) evictFile(filePath string) error {
	i.log.Infof("evicting binary file %s due to storage pressure", filePath)
	err := os.Remove(filePath)
	if err != nil {
		i.metricsAPI.InstallerCacheReleaseEvicted(false)
Shouldn't we measure this in the outer method evict()? https://github.com/openshift/assisted-service/pull/7281/files#diff-cf73de10b668273452bf6aeb594f0385d3f3ea5b1a851c0f577d2d8989e6cccdR320
If we really want to track the number of files evicted, we should change the metric type to a bucketed one.
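If the count were moved to the outer evict() as suggested, a bucketed metric could look roughly like the sketch below; the metric name, buckets, and helper are assumptions rather than anything in the PR:

```go
package metrics

import "github.com/prometheus/client_golang/prometheus"

// Illustrative sketch of the "bucketed" suggestion: observe how many files each
// eviction pass removes, rather than incrementing a per-file success/failure counter.
var cacheFilesEvicted = prometheus.NewHistogram(prometheus.HistogramOpts{
	Name:    "assisted_installer_cache_files_evicted_per_pass",
	Help:    "Number of cached installer binaries removed per eviction pass",
	Buckets: []float64{0, 1, 2, 5, 10, 20},
})

func init() {
	prometheus.MustRegister(cacheFilesEvicted)
}

// recordEvictionPass would be called once from the outer evict() with the number
// of files actually removed in that pass.
func recordEvictionPass(filesRemoved int) {
	cacheFilesEvicted.Observe(float64(filesRemoved))
}
```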