prevent already retried message from being retried in a new batch #4775
Conversation
@danielmarbach I have requested your review on this, since this 'fix' has broken the test. From https://github.com/Particular/ServiceControl/blob/master/docs/bulk-retries-design.md#done, the duplication issue shouldn't have been possible in the first place, but that's not what's happening.
I can only look into this in more depth later today or tomorrow. It is quite likely, though, that my name is only in these files because I was involved in the split of the instances and the .NET migration, which caused me to touch MANY files. I have to bootstrap my memories first. Quick question, though: given that indexes in RavenDB are inherently BASE (eventually consistent), is this even fixable?
I'm not sure what you mean by the indexes... However, given that the FailedMessage entries retain their retry state until the next failure, checking for that state as part of processing the batch does fix the issue. I just don't understand what the intended use case was for the test that has now been broken by the code change.
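For context, a minimal, self-contained sketch of the idea described above: skip documents that are already marked for retry when assembling a new batch. The type and status names are modeled on ServiceControl's FailedMessage/FailedMessageStatus but this is not the actual query from the PR.

```csharp
using System.Collections.Generic;
using System.Linq;

public enum FailedMessageStatus
{
    Unresolved,
    RetryIssued,
    Resolved,
    Archived
}

public class FailedMessage
{
    public string UniqueMessageId { get; set; }
    public FailedMessageStatus Status { get; set; }
}

public static class BatchBuilder
{
    // Only messages still in the Unresolved state are eligible for a new batch;
    // anything already marked RetryIssued is filtered out, which is what prevents
    // the same message from being dispatched twice.
    public static IReadOnlyList<FailedMessage> SelectForBatch(IEnumerable<FailedMessage> candidates) =>
        candidates.Where(m => m.Status == FailedMessageStatus.Unresolved).ToList();
}
```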
To clarify my understanding: we are using the message IDs to create a deterministic request ID, so that when the same message IDs are submitted multiple times we land on the same batch document based on the retry session ID. So this is supposed to dedup cases where the exact same set of IDs is submitted again, e.g. a first attempt with [42, 43] followed by another attempt with [42, 43], but it would not work for submissions that differ from that set, or for any permutations of those. So the goal is to solve the problem of effectively filtering out messages that are retried at the same time within multiple batches?
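A hypothetical illustration of such a deterministic request ID (this is an assumption about the mechanism, not the actual ServiceControl implementation): hash the message IDs in the order they were submitted, so resubmitting the exact same list yields the same ID and lands on the same batch document. Because order is significant here, a permutation of the same IDs would produce a different request ID and would not be deduplicated, matching the caveat above.

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

public static class RetryRequestId
{
    public static string From(params string[] messageIds)
    {
        // Join the IDs as submitted and hash the result into a stable identifier.
        var canonical = string.Join(";", messageIds);
        using var sha = SHA256.Create();
        var hash = sha.ComputeHash(Encoding.UTF8.GetBytes(canonical));
        return Convert.ToHexString(hash);
    }
}

// RetryRequestId.From("42", "43") == RetryRequestId.From("42", "43")  -> same batch document
// RetryRequestId.From("42", "43") != RetryRequestId.From("43", "42")  -> a permutation is not deduped
```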
I don't believe the proposed change is a valid fix. The core idea behind staging is that if a failure occurs during dispatch to the transport, the entire staging attempt is retried for all messages within the batch. By default, we attempt to hand over all messages to the transport. If that fails, we fall back to concurrent individual sends, allowing each dispatch operation and its associated failed message retry to be handled separately.

The test appears to simulate the processor executing a series of scheduled recoverability operations. However, instead of being triggered by a timer, the test invokes these executions directly in a while loop. Since the loop forces the processor to exceed the maximum execution limit of five, it eventually removes the poison message from the batch.

By modifying the query to process only unresolved messages, you unintentionally broke the scenario where a pending retry fails during dispatch and the next execution restages those pending retries. In subsequent iterations, all documents are already marked as pending retry.

To achieve what you're aiming for, the retries gateway would need to deduplicate these cases, though doing so efficiently in memory might be challenging. Currently, the retry manager creates a "deduplication record" using the requestId, but still executes staging regardless of whether a record exists. To address this, you would likely need to separate out already staged message IDs in a new retry attempt, add only those not already part of another batch, and adjust the failed message count in the new batch accordingly.

I wonder though if we are trying to solve something staging and retrying was never designed for...
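A hedged sketch of the deduplication step suggested in the last paragraph (names invented for illustration; this is not the retries gateway's actual API): given the IDs requested for a new retry attempt and the IDs already staged in other open batches, only the remainder goes into the new batch, and the failed message count is adjusted to match.

```csharp
using System.Collections.Generic;
using System.Linq;

public sealed record DedupedBatch(IReadOnlyList<string> MessageIdsToStage, int FailedMessageCount);

public static class RetryBatchDeduplicator
{
    public static DedupedBatch Deduplicate(
        IEnumerable<string> requestedMessageIds,
        ISet<string> alreadyStagedInOtherBatches)
    {
        // Drop IDs that are already part of another batch so they are not staged twice.
        var toStage = requestedMessageIds
            .Distinct()
            .Where(id => !alreadyStagedInOtherBatches.Contains(id))
            .ToList();

        // The new batch only accounts for the messages it actually stages.
        return new DedupedBatch(toStage, toStage.Count);
    }
}
```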
Replaced with #4789
Fix for a scenario where the retry API ([Route("errors/retry")]) is called multiple times, resulting in two batches processing the same message and thus creating a duplicate message in the input queue.