MGMT-19840: Gather operational metrics from installercache #7281
base: master
Conversation
@paul-maidment: This pull request references MGMT-19840 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the task to target the "4.19.0" version, but no target version was set. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: paul-maidment. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
Codecov Report. Attention: Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## master #7281 +/- ##
==========================================
- Coverage 67.96% 67.96% -0.01%
==========================================
Files 300 300
Lines 41013 41060 +47
==========================================
+ Hits 27876 27906 +30
- Misses 10633 10650 +17
Partials 2504 2504
Force-pushed from f7b0e81 to 37ef3b0.
/test edge-e2e-metal-assisted-4-18
/test e2e-agent-compact-ipv4
/cc @rivkyrizel
Force-pushed from 37ef3b0 to b0d03a6.
Force-pushed from b0d03a6 to 9e0419f.
Force-pushed from b66b722 to 9b1602b.
internal/metrics/metricsManager.go (Outdated)
Subsystem: subsystem,
Name: counterInstallerCacheGetRelease,
Help: counterDescriptionInstallerCacheGetRelease,
}, []string{labelStatus, labelReleaseID, labelCacheHit}),
I think the label shouldn't be status, but rather error in case of error, empty otherwise
Why do you think that?
Because the only purpose of this dimension is to detect an error; it's not about a real status.
The real purpose of this dimension is to report the outcome of the request and add it to a count for that status. I think this is perfectly reasonable and, with only three possible states, doesn't create a cardinality issue.
There are three statuses we are tracking:
"ok" - indicating success in the request
"error" - indicating that an error (distinctly not a timeout) occurred.
"timeout" - indicating that a timeout occurred and that perhaps there are issues with the Mutex.
The sum of these counts should equal the total number of requests made. This is why we include them all, so that we have the full context, especially if there are outliers.
The intended statement is "80% of your requests are timing out... you should look at that".
I actually think we should keep it this way.
If you believe otherwise, please explain why. Are you suggesting that we ignore the "ok" and "error" statuses completely and only worry about timeouts?
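For illustration, a minimal sketch of how the three outcomes described above could be recorded with a single status-labelled counter using prometheus/client_golang. This is not the PR's actual code; the metric name, label names, and the recording helper are assumptions:

```go
package metrics

import (
	"strconv"

	"github.com/prometheus/client_golang/prometheus"
)

// Illustrative sketch only: metric and label names are assumptions, not the PR's exact identifiers.
var installerCacheGetRelease = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "assisted_installer_cache_get_release",
		Help: "Installer cache release requests, partitioned by outcome",
	},
	[]string{"status", "release_id", "cache_hit"},
)

func init() {
	prometheus.MustRegister(installerCacheGetRelease)
}

// recordGetRelease increments exactly one status per request, so summing the
// counter over the status label yields the total number of requests.
func recordGetRelease(err error, timedOut, cacheHit bool, releaseVersion string) {
	status := "ok"
	switch {
	case timedOut:
		status = "timeout"
	case err != nil:
		status = "error"
	}
	installerCacheGetRelease.WithLabelValues(status, releaseVersion, strconv.FormatBool(cacheHit)).Inc()
}
```

With this shape, summing the counter over the status label gives the total request count, and the timeout series can be alerted on directly.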
My suggestion is only about semantics; it would technically be the same (just replacing the "status" label with "error", where the values would be {"", "unknown", "timeout"} or something like that).
Why is this not actionable?
If we trigger an alert, what action should we take? I can only think of one: fix the code and redeploy. This would be the case with a lot of other errors throughout the code base. What warrants a metric plus an alert for this specific case?
Regarding errors, IMO we should strive to have generic alerts and easy traceability (with logs, traces, metrics) to the root cause. This seems very specific; if we had to do this for every potential cause of error, we'd have tons of error-related metrics.
What other side effects would the app experience from these timeout errors?
I can only think of one: fix the code and redeploy.
I am not clear on what is being suggested here: to not track these errors?
Is this not a site reliability issue at that point?
Sure, we can talk about what the potential solutions are, but I don't think that changes the facts on the ground.
If these errors occur, you would want to know about them ASAP, especially as you are delivering a SaaS service.
If we had to do this for every potential cause of error, we'd have tons of error-related metrics.
I am simply recording that a timeout occurred in a specific area; how you choose to interpret that later is up to you. The most likely interpretation (almost 100%) is that you have a Mutex lockup.
We can ignore this at our peril.
I don't think recording this metric is "recording an error metric"; we are just recording a fact -- I saw timeouts. I think this is highly relevant.
What other side effects would the app experience from these timeout errors?
We can assume the inability to extract any release due to a Mutex lockup. A condition severe enough to require immediate intervention.
As for action, sure, you could automate a redeployment and this may work, but if it doesn't, you probably still need a real human to take a look.
As for the alerting policy, I think we can defer that until later, as it can probably be done (or not done, if you prefer) in Prometheus, right? I don't think it's incorrect to make this possible by making sure the data is there to alert on.
I am not clear on what is being suggested here: to not track these errors?
Is this not a site reliability issue at that point?
What I am suggesting is to make sure we have a path of action for each metric. In this case, SRE could not act on this metric, but it doesn't necessarily mean we do not want to track it.
I do not see why this error would need to be treated in a special way, but I might be wrong - that's what I am trying to find out.
We can assume the inability to extract any release due to a Mutex lockup.
What other symptoms would this cause? Deadlock in the main thread and app unresponsive?
As for the alerting policy, I think we can defer that until later, as it can probably be done (or not done, if you prefer) in Prometheus, right? I don't think it's incorrect to make this possible by making sure the data is there to alert on.
The alert policy can be defined at a later stage; however, we want to make sure we expose metrics that will be useful. I think exposing metrics "just in case" is not good practice and will eventually just result in an overloaded Prometheus.
What other symptoms would this cause? Deadlock in the main thread and app unresponsive?
As mentioned above:
"We can assume the inability to extract any release due to a Mutex lockup. A condition severe enough to require immediate intervention."
More specifically, this would result in near-immediate retries for any requests that needed to download a release, as these would be waiting for storage to become free.
Releases that have already been cached would be allowed to proceed.
But the service degradation would be quite nasty.
The alert policy can be defined at a later stage; however, we want to make sure we expose metrics that will be useful. I think exposing metrics "just in case" is not good practice and will eventually just result in an overloaded Prometheus.
This is not "just in case": we are fairly certain we want to alert when this metric goes non-zero within a given timeframe.
This metric can be used to infer, with a high degree of reliability, that there is an error with the Mutex.
My point here was less about registering something "just in case" and more a response to your comment about this metric being "for a specific error". I am making the point that this is not the case.
We are recording this metric to help us detect a specific problem, sure... but we do not decide here whether this is a problem or not (spoiler: it is, and we would definitely write an alert in this case). That will be done later.
I therefore argue that we don't betray any 'generic' approach by reporting on the fact "there are timeouts".
I don't think this is the wrong approach. We are running a SaaS service, and irrespective of whether SRE would be able to take action (we could argue that a service restart, even without a code change, would help in some cases, and at 3 AM this is the most likely thing they would try), they need to know that there is a site reliability issue; the inability to serve a fair proportion of your customers is exactly that.
As discussed offline, we can go with a compromise solution and have a metric to expose error(s).
Force-pushed from 9b1602b to 1cf589e.
Force-pushed from 9b6070c to 6b99a6a.
Force-pushed from 6b99a6a to f6610df.
The intent of this PR is to trace the following statistics, implemented as counts and incremented from applicable parts of the solution:
counterDescriptionInstallerCachePrunedHardlink: "Counts the number of times the installercache pruned a hardlink for being too old"
counterDescriptionInstallerCacheGetReleaseOK: "Counts the number of times that a release was fetched successfully"
counterDescriptionInstallerCacheGetReleaseTimeout: "Counts the number of times that a release timed out or had the context cancelled"
counterDescriptionInstallerCacheGetReleaseError: "Counts the number of times that a release fetch resulted in error"
counterDescriptionInstallerCacheReleaseCached: "Counts the number of times that a release was found in the cache"
counterDescriptionInstallerCacheReleaseExtracted: "Counts the number of times that a release was extracted"
counterDescriptionInstallerCacheTryEviction: "Counts the number of times that the eviction function was called"
counterDescriptionInstallerCacheReleaseEvicted: "Counts the number of times that a release was evicted"
This, combined with the event-based metrics gathered in openshift#7156, should provide enough information to track the behaviour of the cache.
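As a reference point, here is a hypothetical sketch of the metrics API surface these counters imply. Only InstallerCacheGetReleaseOK and InstallerCacheReleaseEvicted appear verbatim in the review excerpts below; the other method names and signatures are assumptions based on the counter descriptions:

```go
package metrics

// InstallerCacheMetricsAPI is an illustrative interface matching the counters
// described in the PR; method names other than InstallerCacheGetReleaseOK and
// InstallerCacheReleaseEvicted are guesses based on the counter descriptions.
type InstallerCacheMetricsAPI interface {
	InstallerCachePrunedHardlink()
	InstallerCacheGetReleaseOK(majorMinorVersion string)
	InstallerCacheGetReleaseTimeout()
	InstallerCacheGetReleaseError()
	InstallerCacheReleaseCached(majorMinorVersion string)
	InstallerCacheReleaseExtracted(majorMinorVersion string)
	InstallerCacheTryEviction()
	InstallerCacheReleaseEvicted(success bool)
}
```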
Force-pushed from f6610df to 7b79d30.
/test edge-lint
@paul-maidment: The following test failed, say
Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
i.log.WithError(err).Errorf("failed to prune hard link %s", finfo.path)
continue
What is the purpose of this continue?
default:
	release, err := i.get(releaseID, releaseIDMirror, pullSecret, ocRelease, ocpVersion, clusterID)
	majorMinorVersion, err := ocRelease.GetMajorMinorVersion(i.log, releaseID, releaseIDMirror, pullSecret)
I think this should be done outside the for-select loop.
majorMinorVersion, err := ocRelease.GetMajorMinorVersion(i.log, releaseID, releaseIDMirror, pullSecret)
if err != nil {
	i.log.Warnf("unable to get majorMinorVersion to record metric for %s falling back to full URI", releaseID)
	majorMinorVersion = releaseID
As discussed, releaseID should not be a value here. We can set unknown or undefined (or something similar) if we cannot extract the value from the release (which should never happen anyway).
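A tiny sketch of that suggestion (the helper itself is hypothetical and not part of the PR): fall back to a fixed "unknown" placeholder rather than the full release ID, so label cardinality stays bounded:

```go
package installercache // assumed package name, for illustration only

// versionLabel returns the label value used when recording the metric: the
// extracted major.minor version when available, otherwise a fixed "unknown"
// placeholder instead of the full release ID.
func versionLabel(majorMinor string, err error) string {
	if err != nil {
		return "unknown"
	}
	return majorMinor
}
```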
if err == nil {
	i.metricsAPI.InstallerCacheGetReleaseOK(majorMinorVersion)
We agreed on having 2 metrics:
assisted_installer_cache_get_release{hit="(true|false)"}
assisted_installer_cache_get_release_error{error="(timeout|whatever|other/unknown/undefined)"}
I still think the error metric won't be of much (or any) use, but we definitely do not want to count all requests there. What I had in mind is that hits + misses + errors should add up to the total number of requests to this function.
Did I misunderstand anything?
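For clarity, a rough sketch of what that agreed two-metric shape could look like with prometheus/client_golang. The metric names and labels come from the comment above; the help strings, registration, and recording helper are assumptions:

```go
package metrics

import (
	"strconv"

	"github.com/prometheus/client_golang/prometheus"
)

var (
	// Successful fetches, partitioned by cache hit/miss.
	cacheGetRelease = prometheus.NewCounterVec(
		prometheus.CounterOpts{
			Name: "assisted_installer_cache_get_release",
			Help: "Installer cache release fetches, partitioned by cache hit",
		},
		[]string{"hit"},
	)
	// Failed fetches, partitioned by error kind (e.g. "timeout" or "other").
	cacheGetReleaseError = prometheus.NewCounterVec(
		prometheus.CounterOpts{
			Name: "assisted_installer_cache_get_release_error",
			Help: "Installer cache release fetch errors, partitioned by error kind",
		},
		[]string{"error"},
	)
)

func init() {
	prometheus.MustRegister(cacheGetRelease, cacheGetReleaseError)
}

// recordGetRelease records each request exactly once, so hits + misses + errors
// add up to the total number of calls.
func recordGetRelease(hit bool, errKind string) {
	if errKind != "" {
		cacheGetReleaseError.WithLabelValues(errKind).Inc()
		return
	}
	cacheGetRelease.WithLabelValues(strconv.FormatBool(hit)).Inc()
}
```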
@@ -39,6 +40,10 @@ const (
	counterFilesystemUsagePercentage = "assisted_installer_filesystem_usage_percentage"
	histogramMonitoredHostsDurationMs = "assisted_installer_monitored_hosts_duration_ms"
	histogramMonitoredClustersDurationMs = "assisted_installer_monitored_clusters_duration_ms"
	counterInstallerCacheGetRelease = "assisted_installer_cache_get_release"
	counterInstallerCacheReleaseCached = "assisted_installer_cache_get_release_cached"
I think this is unused
@@ -310,8 +325,10 @@ func (i *Installers) evictFile(filePath string) error {
	i.log.Infof("evicting binary file %s due to storage pressure", filePath)
	err := os.Remove(filePath)
	if err != nil {
		i.metricsAPI.InstallerCacheReleaseEvicted(false)
Shouldn't we measure this in the outer method evict()? https://github.com/openshift/assisted-service/pull/7281/files#diff-cf73de10b668273452bf6aeb594f0385d3f3ea5b1a851c0f577d2d8989e6cccdR320
If we really want to track the number of files evicted, we should change the metric type to a bucketed one.
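If the count were moved to the outer evict() as suggested, a bucketed metric could look roughly like the sketch below; the metric name, buckets, and helper are assumptions rather than anything in the PR:

```go
package metrics

import "github.com/prometheus/client_golang/prometheus"

// Illustrative sketch of the "bucketed" suggestion: observe how many files each
// eviction pass removes, rather than incrementing a per-file success/failure counter.
var cacheFilesEvicted = prometheus.NewHistogram(prometheus.HistogramOpts{
	Name:    "assisted_installer_cache_files_evicted_per_pass",
	Help:    "Number of cached installer binaries removed per eviction pass",
	Buckets: []float64{0, 1, 2, 5, 10, 20},
})

func init() {
	prometheus.MustRegister(cacheFilesEvicted)
}

// recordEvictionPass would be called once from the outer evict() with the number
// of files actually removed in that pass.
func recordEvictionPass(filesRemoved int) {
	cacheFilesEvicted.Observe(float64(filesRemoved))
}
```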