Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Possible changes in web observations #32897

Closed
olga-larina opened this issue May 26, 2024 · 14 comments
Closed

Possible changes in web observations #32897

olga-larina opened this issue May 26, 2024 · 14 comments
Labels
for: stackoverflow A question that's better suited to stackoverflow.com in: web Issues in web modules (web, webmvc, webflux, websocket) status: invalid An issue that we don't feel is valid theme: observability An issue related to observability and tracing

Comments

@olga-larina
Copy link

olga-larina commented May 26, 2024

Hello.
I have some questions about web observation issues.
We updated from Spring Boot 2.x to Spring Boot 3.1.4 and faced the problems with metrics. After investigating we found related issues in micrometer-metrics and spring-web (for example, #31417, #31388, #27587, micrometer-metrics/micrometer#3874). Updating versions solved our problems.

But we found several places, where problems can arise (maybe we are wrong).
Can you please have a look at them?

Thanks!

private void doOnTerminate(ServerRequestObservationContext context) {
ServerHttpResponse response = context.getResponse();
if (response != null) {
if (response.isCommitted()) {
this.observation.stop();
}
else {
response.beforeCommit(() -> {
this.observation.stop();
return Mono.empty();
});
- should we stop observation if response is null? (same in
private void doOnTerminate(ServerRequestObservationContext context) {
)

- volatile HttpStatusCode statusCode?

- volatile HttpHeaders readOnlyHeaders?

- volatile String pathPattern, volatile boolean connectionAborted?

protected KeyValue outcome(ServerRequestObservationContext context) {
if (context.isConnectionAborted()) {
- same logic as in
protected KeyValue status(ServerRequestObservationContext context) {
if (context.isConnectionAborted() && (context.getResponse() == null || !context.getResponse().isCommitted())) {
?

@olga-larina olga-larina changed the title Possible changes in observations Possible changes in web observations May 26, 2024
@spring-projects-issues spring-projects-issues added the status: waiting-for-triage An issue we've not yet triaged or decided on label May 26, 2024
@olga-larina
Copy link
Author

And I have a question about "active" metrics (for example, http.client.requests.active or http.server.requests.active). Should I create another issue for it?

Observation is started at the beginning of the request (server or client). And at this time LongTaskTimer for "xxx.active" metrics with tags is created. But at this time we have only default tag values (for example, http status code) from org.springframework.http.server.reactive.observation.DefaultServerRequestObservationConvention or org.springframework.web.reactive.function.client.DefaultClientRequestObservationConvention. And onStop tags in LongTaskTimer are not updated. Is it a problem? Or that is how it should be?

Related code on micrometer:

https://github.com/micrometer-metrics/micrometer/blob/23b6c43d9ce7a758dd5aa4620c776358f3a86039/micrometer-core/src/main/java/io/micrometer/core/instrument/observation/DefaultMeterObservationHandler.java#L72-L79

@snicoll
Copy link
Member

snicoll commented May 26, 2024

Thanks for getting in touch, but it feels like this is a question that would be better suited to Stack Overflow. As mentioned in the guidelines for contributing, we prefer to use the issue tracker only for bugs and enhancements. Feel free to update this issue with a link to the re-posted question (so that other people can find it) or add some more details if you feel this is a genuine bug.

If you intended to report an issue, then this should be one issue with instructions to reproduce. As it stands, this issue is not actionable.

@snicoll snicoll closed this as not planned Won't fix, can't repro, duplicate, stale May 26, 2024
@snicoll snicoll added for: stackoverflow A question that's better suited to stackoverflow.com and removed status: waiting-for-triage An issue we've not yet triaged or decided on labels May 26, 2024
@bclozel
Copy link
Member

bclozel commented May 26, 2024

Also see micrometer-metrics/micrometer#5147

As Stéphane pointed out already, you can provide a sample application that demonstrates the problem, if there is one. As for volatile members, it's only useful if those are read/written concurrently, which is not the case here. I don't think the previous generation of instrumentation was using those, did you encounter an issue in the past with it? Feel free to share a link to your question on StackOverflow and I'll have a look.

@olga-larina
Copy link
Author

olga-larina commented May 27, 2024

This is the example which shows that xxx_active metrics are created only with default values from DefaultServerRequestObservationConvention and DefaultClientRequestObservationConvention (because they are created on start of the observation). (Use App.kt to start the application, TestCheck.kt to query test endpoint, curl localhost:8080/actuator/prometheus to view metrics)

Also it shows that fields in AbstractServerHttpResponse and ServerRequestObservationContext can be accessed from different threads. This is reproduced by using publishOn in the custom filter (TestFilter), but without it some fields are also read/written from different threads. You can see it in debug mode. There has already been the concurrency issue in AbstractServerHttpResponse #27587, where commitActions were changed to CopyOnWriteArrayList.

If you call TestCheck with numThreads=100 and numRequests=10000 for example, the metrics with outcome="UNKNOWN" are created (numThreads and numRequests can vary). This is the same issue as #31388 for status code. But the changes were made only for status. That's why if the connection is aborted, outcome becomes UNKNOWN.

All of these points are reproduced in Spring boot 3.1.10 and 3.2.4.

I don`t know how to reproduce null response in

private void doOnTerminate(ServerRequestObservationContext context) {
ServerHttpResponse response = context.getResponse();
if (response != null) {
if (response.isCommitted()) {
this.observation.stop();
}
else {
response.beforeCommit(() -> {
this.observation.stop();
return Mono.empty();
});
and
private void doOnTerminate(ServerRequestObservationContext context) {

But it looks like observation should be stopped here, because it is started in doFirst.

Can you please have a look? Or I should publish it on StackOverflow?

untitled.zip

@bclozel bclozel reopened this May 27, 2024
@bclozel
Copy link
Member

bclozel commented May 27, 2024

This is effectively conflating 3 different problems in a single issue, we'll try and work from here.

@bclozel
Copy link
Member

bclozel commented May 27, 2024

This is the example which shows that xxx_active metrics are created only with default values from DefaultServerRequestObservationConvention and DefaultClientRequestObservationConvention (because they are created on start of the observation). (Use App.kt to start the application, TestCheck.kt to query test endpoint, curl localhost:8080/actuator/prometheus to view metrics)

It's not ideal, but this is the expected behavior. Long task timers collect tags as soon as the obversation is started. In the case of HTTP server observations, nothing much is available at that point besides the HTTP request basics. The only way to collect more data would be to start the observation only when the request is mapped to a controller method - this means that we would not measure HTTP routing for regular timers - that would be incorrect in my opinion.

Micrometer made it possible to disable such timers and Spring Boot has a dedicated option for this. "Active" timers make sense in many cases, but I agree that here this has little value.

@snicoll snicoll added status: waiting-for-triage An issue we've not yet triaged or decided on and removed for: stackoverflow A question that's better suited to stackoverflow.com labels May 27, 2024
@bclozel
Copy link
Member

bclozel commented May 27, 2024

Also it shows that fields in AbstractServerHttpResponse and ServerRequestObservationContext can be accessed from different threads. This is reproduced by using publishOn in the custom filter (TestFilter), but without it some fields are also read/written from different threads. You can see it in debug mode. There has already been the concurrency issue in AbstractServerHttpResponse #27587, where commitActions were changed to CopyOnWriteArrayList.

It's expected to have objects read/written from different threads as this is the concurrency model chosen by our Reactive stack. State should not be read/written concurrently by different threads though.

#27587 did not solve a concurrency problem, see #27587 (comment)

#31388 did not solve a concurrency issue either, we have just refined the "UNKNOWN" outcome to be only triggered if the response was not fully written to the network. This reduces the amount of "false positives" in metrics. As you can see, the fix did not change concurrency visibility for state nor switch to a concurrent collection implementation.

If you call TestCheck with numThreads=100 and numRequests=10000 for example, the metrics with outcome="UNKNOWN" are created (numThreads and numRequests can vary). This is the same issue as #31388 for status code. But the changes were made only for status. That's why if the connection is aborted, outcome becomes UNKNOWN.

Indeed, it seems that using a WebFilter like the following changes the behavior:

@Component
class TestFilter: WebFilter {
    private val executorService = Executors.newFixedThreadPool(2)
    private val scheduler = Schedulers.fromExecutor(executorService)

    override fun filter(exchange: ServerWebExchange, chain: WebFilterChain): Mono<Void> {
        return chain.filter(exchange).publishOn(scheduler)
    }
}

I don't believe this is due to concurrent modifications though. In debug mode, I can see that org.springframework.web.server.adapter.HttpWebHandlerAdapter.ObservationSignalListener#doOnCancel is being called. This means a CANCEL signal is flowing through and we interpret that as the HTTP client closing the connection before the server had a chance to complete the exchange. In that sense, adding volatile to various places would not fix anything as this is not due to concurrent read/writes of state. The observation convention is not reading outdated state, the observation context is rightfully updated because of a signal - the observation is stopped right after on the same thread and metrics information is collected right away.

Scheduling the handling of the HTTP exchange on an executor (and not on the Netty workers) is quite surprising to me and this could be the cause of the problem here, especially on a bounded executor. Maybe @simonbasle or @rstoyanchev have an opinion about this?

@bclozel
Copy link
Member

bclozel commented May 27, 2024

Now about the last point in this issue

I don't know how to reproduce null response in

private void doOnTerminate(ServerRequestObservationContext context) {
ServerHttpResponse response = context.getResponse();
if (response != null) {
if (response.isCommitted()) {
this.observation.stop();
}
else {
response.beforeCommit(() -> {
this.observation.stop();
return Mono.empty();
});

and
private void doOnTerminate(ServerRequestObservationContext context) {

But it looks like observation should be stopped here, because it is started in doFirst.

This is a nullability refinement introduced in c531a8a#diff-564f3e4af414eeaaef0b11ac1d0394443b248683e5000bde0b225576ab991c8c. Effectively, at this point, if the observation has been started, the response cannot be null. So unless you can demonstrate a concrete use case where a problem happens with this, there is nothing to be fixed in my opinion.

@olga-larina
Copy link
Author

#31388 did not solve a concurrency issue either, we have just refined the "UNKNOWN" outcome to be only triggered if the response was not fully written to the network. This reduces the amount of "false positives" in metrics. As you can see, the fix did not change concurrency visibility for state nor switch to a concurrent collection implementation.

Yes, it did not solve a concurrency issue. This issue fixed UNKNOWN status. But the changes were made only for status (not outcome). That's why if the connection is aborted, outcome becomes UNKNOWN. Is it ok?

@bclozel
Copy link
Member

bclozel commented May 27, 2024

Yes, it did not solve a concurrency issue. This issue fixed UNKNOWN status. But the changes were made only for status (not outcome). That's why if the connection is aborted, outcome becomes UNKNOWN. Is it ok?

No, it's the opposite. If the server receives a CANCEL signal, this means the connection has been closed/aborted by the client. As a result, we mark the observation context as "connection aborted" and this is later reported as "outcome unknown" in the observation. #31388 merely reduced the number of those by being more lenient: we consider that "outcome: UNKNOWN" should happen when the connection is closed by the client and the response has not been flushed already. In your case, your TestFilter my trigger this case very often because of how it's scheduled. It doesn't mean the information is wrong, though.

@olga-larina
Copy link
Author

No, it's the opposite. If the server receives a CANCEL signal, this means the connection has been closed/aborted by the client. As a result, we mark the observation context as "connection aborted" and this is later reported as "outcome unknown" in the observation. #31388 merely reduced the number of those by being more lenient: we consider that "outcome: UNKNOWN" should happen when the connection is closed by the client and the response has not been flushed already. In your case, your TestFilter my trigger this case very often because of how it's scheduled. It doesn't mean the information is wrong, though.

This logic was applied only for status, not outcome. Outcome is always unknown for aborted context.

protected KeyValue status(ServerRequestObservationContext context) {
if (context.isConnectionAborted() && (context.getResponse() == null || !context.getResponse().isCommitted())) {
return STATUS_UNKNOWN;
}
return (context.getResponse() != null && context.getResponse().getStatusCode() != null) ?
KeyValue.of(LowCardinalityKeyNames.STATUS, Integer.toString(context.getResponse().getStatusCode().value())) :
STATUS_UNKNOWN;
}

protected KeyValue outcome(ServerRequestObservationContext context) {
if (context.isConnectionAborted()) {
return HTTP_OUTCOME_UNKNOWN;
}
if (context.getResponse() != null && context.getResponse().getStatusCode() != null) {
return HttpOutcome.forStatus(context.getResponse().getStatusCode());
}
return HTTP_OUTCOME_UNKNOWN;

@bclozel
Copy link
Member

bclozel commented May 27, 2024

Sorry, in my previous comments I should have mentioned "status: UNKNOWN" instead of "outcome: UNKNOWN". This is what #31388 was about. If the response is committed, it means that we're confident that the response status would be what it is (so, not UNKNOWN) even if the client does not receive the response.

Now, you're asking about making the "ouctome" not "UNKNOWN" in those cases. If the connection is aborted, this can mean that:

  1. the client read the response or part of the response and bailed out before the server had a chance to close the connection as it should
  2. there was a network problem and the client could not receive the response entirely

From the server point of view, we cannot know which case it is. What "outcome" value would you use for this case? 1) would be "SUCCESSFUL", but 2) would definitely be a "FAILURE". Please advise.

@bclozel bclozel added in: web Issues in web modules (web, webmvc, webflux, websocket) theme: observability An issue related to observability and tracing labels May 27, 2024
@olga-larina
Copy link
Author

On the one hand it seems that "outcome" should correspond "status", because sometimes we receive the metrics with status=200 and outcome=UNKNOWN (in the plain code, without such filters as in the example). But on the other hand it should be unknown (and it may be useful to have this value).

Thank you very much for investigating my issue and explaining the details.

@bclozel
Copy link
Member

bclozel commented Jun 4, 2024

If I'm not mistaken, we have discussed and solved all questions here.

@bclozel bclozel closed this as not planned Won't fix, can't repro, duplicate, stale Jun 4, 2024
@bclozel bclozel added status: invalid An issue that we don't feel is valid for: stackoverflow A question that's better suited to stackoverflow.com and removed status: waiting-for-triage An issue we've not yet triaged or decided on labels Jun 4, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
for: stackoverflow A question that's better suited to stackoverflow.com in: web Issues in web modules (web, webmvc, webflux, websocket) status: invalid An issue that we don't feel is valid theme: observability An issue related to observability and tracing
Projects
None yet
Development

No branches or pull requests

4 participants