
prevent already retried message from being retried in a new batch #4775

Closed

Conversation

PhilBastian
Contributor

Fix for a scenario where the retry API ([Route("errors/retry")]) is called multiple times, resulting in two batches processing the same message and thus creating a duplicate message in the input queue.

@PhilBastian
Contributor Author

@danielmarbach I have requested your review on this, since this 'fix' has broken the test When_there_is_one_poison_message_it_is_removed_from_batch_and_the_status_is_Complete, and all the tests in this area were created by you. Can you give some guidance on how to avoid breaking the existing batching/staging/forwarding process while still fixing the issue in question?

From https://github.com/Particular/ServiceControl/blob/master/docs/bulk-retries-design.md#done, the duplication issue shouldn't have been possible in the first place, but that's not what's happening

@danielmarbach
Contributor

I can only look into this in more depth later today or tomorrow. It is quite likely, though, that my name is only in these files because I was involved in the split of the instances and the .NET migration, which caused me to touch MANY files. I'll have to bootstrap my memory first.

Quick question, though: given that indexes in RavenDB are inherently BASE (eventually consistent), is this even fixable?

@PhilBastian
Contributor Author

I'm not sure what you mean by the indexes... However, given that the FailedMessage entries retain their retry state until the next failure, checking for that state as part of processing the batch does fix the issue. I just don't understand the intended use case of the test that is now broken by the code change.
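For illustration, a minimal sketch of the kind of state filter being described. All type and member names here (FailedMessage, FailedMessageStatus, RetryIssued) are assumptions for the sketch, not the actual ServiceControl API:

```csharp
using System.Collections.Generic;
using System.Linq;

// Illustrative stand-ins, not the real ServiceControl types.
enum FailedMessageStatus { Unresolved, RetryIssued, Resolved }

class FailedMessage
{
    public string Id { get; set; }
    public FailedMessageStatus Status { get; set; }
}

static class BatchFilterSketch
{
    // Skip documents that already have a retry issued so they are not staged
    // a second time by an overlapping batch.
    public static List<FailedMessage> FilterForStaging(IEnumerable<FailedMessage> candidates) =>
        candidates.Where(m => m.Status != FailedMessageStatus.RetryIssued).ToList();
}
```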

@danielmarbach
Contributor

To clarify my understanding: we are using the message IDs to create a deterministic request ID, to make sure that when the same message IDs are submitted multiple times we land on the same batch document based on the retry session ID. So this is supposed to dedup cases like

First attempt [42, 43]
Second attempt [42, 43]

but would not work for

First attempt [42, 43]
Second attempt [43, 42]

or

First attempt [42, 43]
Second attempt [41, 42, 43]

or any permutations of those.
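To make the shape of the problem concrete, here is a hedged sketch of an order-sensitive, content-derived request ID. This is an assumption about the behaviour being described, not the actual ServiceControl implementation:

```csharp
using System;
using System.Collections.Generic;
using System.Security.Cryptography;
using System.Text;

static class RequestIdSketch
{
    // Hash the submitted message ids in the order they arrive. Identical
    // submissions collapse onto the same request id; reordered or extended
    // submissions do not.
    public static string DeterministicRequestId(IEnumerable<int> messageIds) =>
        Convert.ToHexString(
            SHA256.HashData(
                Encoding.UTF8.GetBytes(string.Join(",", messageIds))));
}

// DeterministicRequestId(new[] { 42, 43 }) == DeterministicRequestId(new[] { 42, 43 })  -> deduplicated
// DeterministicRequestId(new[] { 43, 42 })      differs -> ordering defeats the dedup
// DeterministicRequestId(new[] { 41, 42, 43 })  differs -> a superset defeats the dedup
```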

So the goal is to solve the problem of effectively filtering out messages that are retried at the same time within multiple batches?

@danielmarbach
Contributor

I don’t believe the proposed change is a valid fix. The core idea behind staging is that if a failure occurs during dispatch to the transport, the entire staging attempt is retried for all messages within the batch. By default, we attempt to hand over all messages to the transport. If that fails, we fall back to concurrent individual sends, allowing each dispatch operation and its associated failed message retry to be handled separately.
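As a rough sketch of that dispatch strategy (the interface, names, and signatures are invented for illustration and are not taken from the codebase):

```csharp
using System.Collections.Generic;
using System.Threading.Tasks;

// Placeholder transport seam for the sketch; not the real abstraction.
interface ISketchTransport
{
    Task SendBatch(IReadOnlyCollection<string> messageIds);
    Task Send(string messageId);
}

static class StagingDispatchSketch
{
    // Try to hand the whole staged batch to the transport in one operation.
    // If that fails, fall back to concurrent individual sends so each
    // dispatch (and its failed-message retry) can be handled separately.
    public static async Task Dispatch(ISketchTransport transport, IReadOnlyCollection<string> stagedMessageIds)
    {
        try
        {
            await transport.SendBatch(stagedMessageIds);
            return;
        }
        catch
        {
            // Fall through to individual sends.
        }

        var sends = new List<Task>();
        foreach (var id in stagedMessageIds)
        {
            sends.Add(transport.Send(id));
        }
        await Task.WhenAll(sends);
    }
}
```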

The test appears to simulate the processor executing a series of scheduled recoverability operations. However, instead of being triggered by a timer, the test invokes these executions directly in a while loop. Since the loop forces the processor to exceed the maximum execution limit of five, it eventually removes the poison message from the batch.

By modifying the query to process only unresolved messages, you unintentionally broke the scenario where a pending retry fails during dispatch, and the next execution restages those pending retries. In subsequent iterations, all documents are already marked as pending retry.

To achieve what you're aiming for, the retries gateway would need to deduplicate these cases, though doing so efficiently in memory might be challenging. Currently, the retry manager creates a "deduplication record" using the requestId, but it still executes staging regardless of whether such a record already exists. To address this, you would likely need to separate out the message IDs that are already staged in another batch, add only the remaining ones to the new retry attempt, and adjust the failed message count of the new batch accordingly.
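A minimal sketch of what that deduplication step could look like, using hypothetical names and ignoring the in-memory efficiency concern:

```csharp
using System.Collections.Generic;
using System.Linq;

static class RetryDedupSketch
{
    // Given the ids requested in a new retry attempt and the ids already
    // staged in open batches, stage only the ones not covered elsewhere and
    // size the new batch accordingly. `openBatchMessageIds` is a stand-in for
    // whatever bookkeeping the retry manager would have to maintain.
    public static (List<string> ToStage, int FailedMessageCount) BuildNewBatch(
        IEnumerable<string> requestedMessageIds,
        ISet<string> openBatchMessageIds)
    {
        var toStage = requestedMessageIds
            .Where(id => !openBatchMessageIds.Contains(id))
            .ToList();

        return (toStage, toStage.Count);
    }
}
```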

I wonder though if we are trying to solve something staging and retrying was never designed for...

@PhilBastian
Contributor Author

Replaced by #4789.

@PhilBastian PhilBastian deleted the 1412_retry_duplication branch February 10, 2025 04:45