
Add a traitlet to disable recording HTTP request metrics #1472

Merged 3 commits into jupyter-server:main on Nov 5, 2024

Conversation

yuvipanda (Contributor):

Since this records a series of metrics for each HTTP handler class, this quickly leads to an explosion of cardinality and makes storing metrics quite difficult. For example, just accessing the metrics endpoint creates the following 17 metrics:

```
http_request_duration_seconds_bucket{handler="jupyter_server.base.handlers.PrometheusMetricsHandler",le="0.005",method="GET",status_code="200"} 9.0
http_request_duration_seconds_bucket{handler="jupyter_server.base.handlers.PrometheusMetricsHandler",le="0.01",method="GET",status_code="200"} 9.0
http_request_duration_seconds_bucket{handler="jupyter_server.base.handlers.PrometheusMetricsHandler",le="0.025",method="GET",status_code="200"} 9.0
http_request_duration_seconds_bucket{handler="jupyter_server.base.handlers.PrometheusMetricsHandler",le="0.05",method="GET",status_code="200"} 9.0
http_request_duration_seconds_bucket{handler="jupyter_server.base.handlers.PrometheusMetricsHandler",le="0.075",method="GET",status_code="200"} 9.0
http_request_duration_seconds_bucket{handler="jupyter_server.base.handlers.PrometheusMetricsHandler",le="0.1",method="GET",status_code="200"} 9.0
http_request_duration_seconds_bucket{handler="jupyter_server.base.handlers.PrometheusMetricsHandler",le="0.25",method="GET",status_code="200"} 9.0
http_request_duration_seconds_bucket{handler="jupyter_server.base.handlers.PrometheusMetricsHandler",le="0.5",method="GET",status_code="200"} 9.0
http_request_duration_seconds_bucket{handler="jupyter_server.base.handlers.PrometheusMetricsHandler",le="0.75",method="GET",status_code="200"} 9.0
http_request_duration_seconds_bucket{handler="jupyter_server.base.handlers.PrometheusMetricsHandler",le="1.0",method="GET",status_code="200"} 9.0
http_request_duration_seconds_bucket{handler="jupyter_server.base.handlers.PrometheusMetricsHandler",le="2.5",method="GET",status_code="200"} 9.0
http_request_duration_seconds_bucket{handler="jupyter_server.base.handlers.PrometheusMetricsHandler",le="5.0",method="GET",status_code="200"} 9.0
http_request_duration_seconds_bucket{handler="jupyter_server.base.handlers.PrometheusMetricsHandler",le="7.5",method="GET",status_code="200"} 9.0
http_request_duration_seconds_bucket{handler="jupyter_server.base.handlers.PrometheusMetricsHandler",le="10.0",method="GET",status_code="200"} 9.0
http_request_duration_seconds_bucket{handler="jupyter_server.base.handlers.PrometheusMetricsHandler",le="+Inf",method="GET",status_code="200"} 9.0
http_request_duration_seconds_count{handler="jupyter_server.base.handlers.PrometheusMetricsHandler",method="GET",status_code="200"} 9.0
http_request_duration_seconds_sum{handler="jupyter_server.base.handlers.PrometheusMetricsHandler",method="GET",status_code="200"} 0.009019851684570312
```

This is what has stalled prior attempts at collecting metrics from jupyter_server usefully in multitenant deployments (see berkeley-dsep-infra/datahub#1977).

This PR adds a traitlet that allows hub admins to turn these metrics off.
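For deployments that want to opt out, here is a minimal sketch of what this could look like in `jupyter_server_config.py`; the trait name below is an assumption for illustration and should be checked against the released ServerApp options (e.g. via `jupyter server --help-all`):

```python
# jupyter_server_config.py
# Assumed trait name for illustration only; the PR adds a boolean traitlet
# that defaults to True (metrics recorded) and can be turned off by admins.
c.ServerApp.record_http_request_metrics = False
```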

jupyter_server/serverapp.py (outdated review thread, resolved)
Co-authored-by: Zachary Sailer <[email protected]>
```diff
@@ -41,13 +41,16 @@ def _scrub_uri(uri: str) -> str:
     return uri


-def log_request(handler):
+def log_request(handler, record_prometheus_metrics=True):
```
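For context, here is a rough, self-contained sketch of the kind of gating this keyword enables. It is not the PR's actual code; the histogram simply mirrors the `http_request_duration_seconds` series shown in the description:

```python
from prometheus_client import Histogram

# Sketch only: a per-handler request-duration histogram, mirroring the series
# named in the PR description.
HTTP_REQUEST_DURATION_SECONDS = Histogram(
    "http_request_duration_seconds",
    "Duration of HTTP requests in seconds",
    ["method", "handler", "status_code"],
)


def log_request(handler, record_prometheus_metrics=True):
    # ... the usual access-log line would be emitted here ...
    if record_prometheus_metrics:
        HTTP_REQUEST_DURATION_SECONDS.labels(
            method=handler.request.method,
            handler=f"{type(handler).__module__}.{type(handler).__name__}",
            status_code=handler.get_status(),
        ).observe(handler.request.request_time())
```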
Zsailer (Member):

Expanding the signature here makes me a little more cautious to merge. This technically expands a public API, which would normally be saved for a major release. However, we rarely make a major release of Jupyter Server (because it requires a lot of work to coordinate with our subprojects).

On the other hand, this change seems small enough that it likely shouldn't trigger a major release. We could also argue that this API is likely unused by anyone outside Jupyter Server, since it's really specific to Jupyter Server. The only people this might affect are folks that monkeypatch this method, which is discouraged.

I think we can proceed as long as we communicate clearly that there's a "possible breaking change" when this is released.

yuvipanda (Contributor, Author):
fwiw, that's why I made it a default arg and set the default to the value that causes no behavior change when omitted! So a log_request(handler) call from any other project would behave exactly as before.

Zsailer (Member), Nov 5, 2024:

Totally, and this is great!

The specific case I was thinking about was where someone patches log_request to intercept our logger. Something like this:

```python
import jupyter_server.log

# Create a custom function to monkeypatch jupyter server's log_request
def log_request(handler):
    ...
    # custom logic

jupyter_server.log.log_request = log_request
```

A Jupyter Server using this monkeypatch would fail after this PR is released, right?

This is definitely a discouraged thing to do 😅, but because this function is public, the reason this fails feels like a breach of contract.

yuvipanda (Contributor, Author):

ah interesting. that would fail, but I always assume that if you monkeypatch and something fails, that's on you :D the risk and reward of monkeypatching...

But regardless, I agree this one is ok here. If people do want to override log_request, IMO the way to do that is to override the tornado setting log_function instead - and people doing that will not be affected by this change.
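For reference, here is a minimal sketch of that override path, assuming it is wired up through `ServerApp.tornado_settings` in a config file (the logger name and log format are illustrative):

```python
# jupyter_server_config.py
import logging

access_log = logging.getLogger("my_access_log")  # illustrative logger name


def my_log_request(handler):
    # Called by Tornado after each request completes when set as `log_function`.
    access_log.info(
        "%d %s %s",
        handler.get_status(),
        handler.request.method,
        handler.request.uri,
    )


c.ServerApp.tornado_settings = {"log_function": my_log_request}
```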

> I think we can proceed as long as we communicate clearly that there's a "possible breaking change" when this is released.

Is there anything you'd like me to do to make this possible?

Zsailer (Member):

> if you monkeypatch and something fails, that's on you :D

Agreed 👍

Nope, I think this is good to go.

I mostly raised/noted this here in the thread so we can cross-link it if someone reports a "bug" after release. It's not a bug, but a consequence of a monkeypatch 😃 Documenting it here is enough for future reference.

Thanks @yuvipanda!

yuvipanda (Contributor, Author):

yay, ty @Zsailer!

@Zsailer Zsailer merged commit 045dc46 into jupyter-server:main Nov 5, 2024
36 checks passed