
prevent already retried message from being retried in a new batch #4775

Closed

Conversation

PhilBastian
Contributor

Fix for a scenario where the retry API ([Route("errors/retry")]) is called multiple times, resulting in two batches processing the same message and thus creating a duplicate message in the input queue.

@PhilBastian
Contributor Author

@danielmarbach I have requested your review on this, since this 'fix' has broken the test When_there_is_one_poison_message_it_is_removed_from_batch_and_the_status_is_Complete, and all the tests in this area were created by you. Can you give some guidance on how to avoid breaking the existing batching/staging/forwarding process while still fixing the issue in question?

From https://github.com/Particular/ServiceControl/blob/master/docs/bulk-retries-design.md#done, the duplication issue shouldn't have been possible in the first place, but that's not what's happening

@danielmarbach
Contributor

I can only look into this in more depth later today or tomorrow. It is quite likely, though, that my name is only in these files because I was involved in the split of the instances and the .NET migration, which caused me to touch MANY files. I'll have to bootstrap my memory first.

Quick question, though: given that indexes in RavenDB are inherently BASE (eventually consistent), is this even fixable?

@PhilBastian
Contributor Author

I'm not sure what you mean by the indexes... However, given that the FailedMessage entries retain their retry state until the next failure, checking for that state as part of processing the batch does fix the issue. I just don't understand the intended use case of the test that is now broken by the code change.
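For illustration, a minimal sketch of the kind of state filter being described. All type and member names here (FailedMessage, FailedMessageStatus, RetryIssued) are assumptions for the sketch, not the actual ServiceControl API:

```csharp
using System.Collections.Generic;
using System.Linq;

// Illustrative stand-ins, not the real ServiceControl types.
enum FailedMessageStatus { Unresolved, RetryIssued, Resolved }

class FailedMessage
{
    public string Id { get; set; }
    public FailedMessageStatus Status { get; set; }
}

static class BatchFilterSketch
{
    // Skip documents that already have a retry issued so they are not staged
    // a second time by an overlapping batch.
    public static List<FailedMessage> FilterForStaging(IEnumerable<FailedMessage> candidates) =>
        candidates.Where(m => m.Status != FailedMessageStatus.RetryIssued).ToList();
}
```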

@danielmarbach
Contributor

To clarify my understanding: we are using the message IDs to create a deterministic request ID, to make sure that when the same message IDs are submitted multiple times we land on the same batch document based on the retry session ID. So this is supposed to dedup cases like

First attempt [42, 43]
Second attempt [42, 43]

but would not work for

First attempt [42, 43]
Second attempt [43, 42]

or

First attempt [42, 43]
Second attempt [41, 42, 43]

or any permutations of those.
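To make the shape of the problem concrete, here is a hedged sketch of an order-sensitive, content-derived request ID. This is an assumption about the behaviour being described, not the actual ServiceControl implementation:

```csharp
using System;
using System.Collections.Generic;
using System.Security.Cryptography;
using System.Text;

static class RequestIdSketch
{
    // Hash the submitted message ids in the order they arrive. Identical
    // submissions collapse onto the same request id; reordered or extended
    // submissions do not.
    public static string DeterministicRequestId(IEnumerable<int> messageIds) =>
        Convert.ToHexString(
            SHA256.HashData(
                Encoding.UTF8.GetBytes(string.Join(",", messageIds))));
}

// DeterministicRequestId(new[] { 42, 43 }) == DeterministicRequestId(new[] { 42, 43 })  -> deduplicated
// DeterministicRequestId(new[] { 43, 42 })      differs -> ordering defeats the dedup
// DeterministicRequestId(new[] { 41, 42, 43 })  differs -> a superset defeats the dedup
```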

So the goal is to solve the problem of effectively filtering out messages that are retried at the same time within multiple batches?

@danielmarbach
Contributor

I don’t believe the proposed change is a valid fix. The core idea behind staging is that if a failure occurs during dispatch to the transport, the entire staging attempt is retried for all messages within the batch. By default, we attempt to hand over all messages to the transport. If that fails, we fall back to concurrent individual sends, allowing each dispatch operation and its associated failed message retry to be handled separately.
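As a rough sketch of that dispatch strategy (the interface, names, and signatures are invented for illustration and are not taken from the codebase):

```csharp
using System.Collections.Generic;
using System.Threading.Tasks;

// Placeholder transport seam for the sketch; not the real abstraction.
interface ISketchTransport
{
    Task SendBatch(IReadOnlyCollection<string> messageIds);
    Task Send(string messageId);
}

static class StagingDispatchSketch
{
    // Try to hand the whole staged batch to the transport in one operation.
    // If that fails, fall back to concurrent individual sends so each
    // dispatch (and its failed-message retry) can be handled separately.
    public static async Task Dispatch(ISketchTransport transport, IReadOnlyCollection<string> stagedMessageIds)
    {
        try
        {
            await transport.SendBatch(stagedMessageIds);
            return;
        }
        catch
        {
            // Fall through to individual sends.
        }

        var sends = new List<Task>();
        foreach (var id in stagedMessageIds)
        {
            sends.Add(transport.Send(id));
        }
        await Task.WhenAll(sends);
    }
}
```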

The test appears to simulate the processor executing a series of scheduled recoverability operations. However, instead of being triggered by a timer, the test invokes these executions directly in a while loop. Since the loop forces the processor to exceed the maximum execution limit of five, it eventually removes the poison message from the batch.

By modifying the query to process only unresolved messages, you unintentionally broke the scenario where a pending retry fails during dispatch, and the next execution restages those pending retries. In subsequent iterations, all documents are already marked as pending retry.

To achieve what you're aiming for, the retries gateway would need to deduplicate these cases, though doing so efficiently in memory might be challenging. Currently, the retry manager creates a "deduplication record" using the requestId, but it still executes staging regardless of whether such a record already exists. To address this, you would likely need to separate out the message IDs that are already staged in another batch, add only the remaining ones to the new retry attempt, and adjust the failed message count of the new batch accordingly.
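A minimal sketch of what that deduplication step could look like, using hypothetical names and ignoring the in-memory efficiency concern:

```csharp
using System.Collections.Generic;
using System.Linq;

static class RetryDedupSketch
{
    // Given the ids requested in a new retry attempt and the ids already
    // staged in open batches, stage only the ones not covered elsewhere and
    // size the new batch accordingly. `openBatchMessageIds` is a stand-in for
    // whatever bookkeeping the retry manager would have to maintain.
    public static (List<string> ToStage, int FailedMessageCount) BuildNewBatch(
        IEnumerable<string> requestedMessageIds,
        ISet<string> openBatchMessageIds)
    {
        var toStage = requestedMessageIds
            .Where(id => !openBatchMessageIds.Contains(id))
            .ToList();

        return (toStage, toStage.Count);
    }
}
```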

I wonder though if we are trying to solve something staging and retrying was never designed for...

@PhilBastian
Contributor Author

Replaced by #4789.

@PhilBastian PhilBastian deleted the 1412_retry_duplication branch February 10, 2025 04:45