-
Notifications
You must be signed in to change notification settings - Fork 116
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add grpc metrics #414
Add grpc metrics #414
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for your contribution.
The code changes looks good to me -- but adding just any sort of test would be good.
@srenatus Done |
} | ||
if output.Status.Code != int32(code.Code_OK) { | ||
t.Fatal("Expected request to be allowed but got:", output) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know if the test scaffolding is capable enough -- but could we check that the desired prometheus metric is actually exported, and has a non-zero value after this incoming request?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There seems to be a panic in this test https://github.com/open-policy-agent/opa-envoy-plugin/actions/runs/5271008838/jobs/9531286549?pr=414#step:3:262
@srenatus @ashutosh-narkar Could you please review PR and my last fixes |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for working on this @antgubarev. Few comments inline. Also these metrics should be part of the status update and decision log. It would be great if we have a test to verify that.
internal/internal.go
Outdated
GRPCMaxRecvMsgSize int `json:"grpc-max-recv-msg-size"` | ||
GRPCMaxSendMsgSize int `json:"grpc-max-send-msg-size"` | ||
SkipRequestBodyParse bool `json:"skip-request-body-parse"` | ||
EnablePrometheusMetrics bool `json:"enable-prometheus-metrics"` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: Can we call this something like EnablePerformanceMetrics
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure. Done
internal/internal.go
Outdated
@@ -237,6 +240,15 @@ func (p *envoyExtAuthzGrpcServer) listen() { | |||
addr = "grpc://" + addr | |||
} | |||
|
|||
if p.cfg.EnablePrometheusMetrics { | |||
summaryAuthzDuration := prometheus.NewSummary(prometheus.SummaryOpts{ | |||
Name: "grpc_authz_request_duration_seconds", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you please help me understand the reason to use NewSummary
v/s something like SummaryVec
or HistogramVec
? The latter would be useful if we want aggregates afaik. Similar to this.
Nit: Can we change grpc_authz_request_duration_seconds
to grpc_request_duration_seconds
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We have only one value. Vec is for multiple values. It's a single reason. I will change to SummaryVec if you want.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we have multiple instances of the plugin and we need to aggregate the results, will we be able to do this?
Also is there any impact from the change in default behavior of Summary as indicated in this comment? It would be helpful if we have more tests in any case.
// A Summary captures individual observations from an event or sample stream and
// summarizes them in a manner similar to traditional summary statistics: 1. sum
// of observations, 2. observation count, 3. rank estimations.
//
// A typical use-case is the observation of request latencies. By default, a
// Summary provides the median, the 90th and the 99th percentile of the latency
// as rank estimations. However, the default behavior will change in the
// upcoming v1.0.0 of the library. There will be no rank estimations at all by
// default. For a sane transition, it is recommended to set the desired rank
// estimations explicitly.
//
// Note that the rank estimations cannot be aggregated in a meaningful way with
// the Prometheus query language (i.e. you cannot average or add them). If you
// need aggregatable quantiles (e.g. you want the 99th percentile latency of all
// queries served across all instances of a service), consider the Histogram
// metric type. See the Prometheus documentation for more details.
//
// To create Summary instances, use NewSummary.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, you're right. I fixed it. Thanks!
@antgubarev please let us know if you need any help with this. This would be a good one to get in. |
Thanks! I don't need any help but I need free time ) I'm going to finish it in the next few days. |
@ashutosh-narkar I can't find a way to test status plugin :( . Could you help me? |
@ashutosh-narkar @srenatus Could you give your opinion about this PR |
Thanks for working on this @antgubarev. I'll take a look next week. |
grpc-max-recv-msg-size: 40194304 # default: 1024 * 1024 * 4 | ||
grpc-max-send-msg-size: 2147483647 # default: max Int | ||
skip-request-body-parse: false # default: false | ||
enable-performance-metrics: false # default: false. Adds `grpc_request_duration_seconds` prometheus histogram metric |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@antgubarev here we are stating that by enabling this we'll get a metric called grpc_request_duration_seconds
which is a histogram metric. This looks good. But below we've defined grpc_request_duration_seconds
as a prometheus.Summary
which does not seem right. I would have imagined something like a prometheus.HistogramVec
similar to what OPA does. This would be a more useful metric than a summary as users typically want to understand the latency distribution of their requests. WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ashutosh-narkar We've already used prometheus.HistogramVec
not Summary
:)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for making the changes @antgubarev. Can you please look into the test failure. Also these metrics should be part of the status update and decision log. It would be great if we have a test to verify that.
@ashutosh-narkar I'm sorry it's my fault. I accidentally reverted more commits than I needed. I fixed it! |
@antgubarev thanks so much for working on this. The changes look good. Can you please squash your commits and rebase? Thanks. |
Signed-off-by: Anton Gubarev <[email protected]>
@ashutosh-narkar Done. Thanks for your review. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the contribution @antgubarev!
fix the issue #648