Problems with partition rebalancing #328
Comments
Hi @dasilvaKevin, Why did you override the consumer property partition.assignment.strategy? Thanks,
Hi @LGouellec, Thank you for your reply. We have overridden the consumer property partition.assignment.strategy to roundrobin for testing, because we have the same problem with range.
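For context, Streamiz sits on top of Confluent.Kafka / librdkafka, where this override corresponds roughly to the following sketch (broker address, group id and topic are placeholder assumptions, and this is not the Streamiz-internal wiring):

using Confluent.Kafka;

var consumerConfig = new ConsumerConfig
{
    BootstrapServers = "localhost:9092",   // placeholder
    GroupId = "product-category-stream",   // placeholder
    // Maps to the librdkafka property partition.assignment.strategy;
    // possible values are Range, RoundRobin and CooperativeSticky.
    PartitionAssignmentStrategy = PartitionAssignmentStrategy.RoundRobin
};

using var consumer = new ConsumerBuilder<string, string>(consumerConfig).Build();
consumer.Subscribe("product");

Range and RoundRobin are eager strategies (all partitions are revoked and reassigned on every rebalance), while CooperativeSticky rebalances incrementally; that difference comes up again later in this thread.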
Hi @LGouellec, We have not reproduced the same problem, but we do have two other cases. First problem: we started with one pod (log0), produced "product13", and this message was processed. We then started two pods, produced "product15" to "product19", and all messages were processed. Second problem: a message was produced but is not processed, and we have a lag on the source topic. The log2 file contains the logs of the remaining pod.
Hi @dasilvaKevin,

1st Problem : { timestamp =

2nd Problem : { timestamp = 2024-06-13T08:56:49.3788938Z, log = { level = Debug, logger = Streamiz.Kafka.Net.Processors.SinkProcessor, original = stream-task[0|0]|processor[KSTREAM-SINK-0000000006]- Process<String,ProductWithCategory> message with key Product20 and stream.ProductWithCategory with record metadata [topic:product|partition:0|offset:4] }, message = stream-task[0|0]|processor[KSTREAM-SINK-0000000006]- Process<String,ProductWithCategory> message with key Product20 and stream.ProductWithCategory with record metadata [topic:product|partition:0|offset:4], metadata = { message_template = stream-task[0|0]|processor[KSTREAM-SINK-0000000006]- Process<String,ProductWithCategory> message with key Product20 and stream.ProductWithCategory with record metadata [topic:product|partition:0|offset:4] }, ecs = { version = 1.5.0 }, event = { severity = 1, timezone = Coordinated Universal Time, created = 2024-06-13T08:56:49.3788968Z }, process = { thread = { id = 16 }, pid = 1, name = ProductCategoryStream, executable = /app/Shared.dll } }

If not, please enable the debug librdkafka log (config.Debug = "broker,topic,msg").
Hi @LGouellec, Thanks for the clarification on the first problem. For the second problem, when we enable the librdkafka log (config.Debug = "broker,topic,msg"), we encounter a System.NullReferenceException: 'Object reference not set to an instance of an object.' at Confluent.Kafka.Consumer`2.get_Name() in Confluent.Kafka\Consumer.cs: line 134.
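For reference, the debug flag mentioned above is a comma-separated list of librdkafka debug contexts; a minimal sketch of a Streamiz configuration with it enabled (application id and broker address are placeholder assumptions):

using Streamiz.Kafka.Net;
using Streamiz.Kafka.Net.SerDes;

var config = new StreamConfig<StringSerDes, StringSerDes>();
config.ApplicationId = "product-category-stream";   // placeholder
config.BootstrapServers = "kafka:9092";             // placeholder
// Forwarded to librdkafka; other useful contexts include consumer, cgrp,
// fetch and protocol, or "all" for everything (very verbose).
config.Debug = "broker,topic,msg";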
Hi @dasilvaKevin, Can you share the full stack trace of the exception, please?
Hi @LGouellec,
Really weird, this exception is caught: https://github.com/LGouellec/kafka-streams-dotnet/blob/0635ab5795dceb86f5a92b2b4ce68dee5d4d5346/core/Kafka/Internal/KafkaLoggerAdapter.cs#L72
Which version are you using? Can you try to override the conf:
Hi @LGouellec, We use version 1.5.1. For the configuration, we override the ClientID. Here's our configuration:
Please also add the following to have more debug logs:
Hi @LGouellec,
Hi @dasilvaKevin,
Hi @LGouellec, We have the same error with the latest RC.
Hi @dasilvaKevin, Can you open a ticket here (confluent-kafka-dotnet) and ask why Consumer.Name can throw a NullReferenceException?
Hi @LGouellec, I found a workaround:

private string GetName(IClient client)
{
// FOR FIX
var name = "";
try
{
if (client.Handle == null || client.Handle.IsInvalid)
return "Unknown";
name = client.Name;
}
catch (NullReferenceException)
{
name = "Unknown";
}
return name;
}

I reproduced the issue where a message is not consumed and there is lag in my consumer group.
Hi, I've encountered a similar issue in my tests. A few seconds later, the related partition from TB is also assigned to the consumer. When this happens, the consumer looks like it is consuming messages from TB, but nothing happens because the partition is not associated with any task.
Hey @EmanueleAlbero, Can you share your logs when this issue appears? It seems like another problem, not directly related to this original issue, but maybe I'm wrong. Appreciate it,
CI.Processor.Debug.JustPublished.Reducted.log
Attached are the logs of the app just published on ESK. The issue is usually with
If I delete the running pod and launch it again, everything works fine.
and so on. Otherwise, as in the attached logs, if it receives source2 first (line 244) it creates the task with only 1 partition, and when it receives source1 it is ignored (line 3810). I've tried to apply the fix you wrote for the issue I'm commenting on, but it didn't solve my problem. Thanks a lot for your incredible effort on this library!
Hi, I found out that the issue of not receiving all the required topics at once was due to the consumer partition assignment strategy being set to CooperativeSticky.
Hi, I have found a possible cause for the messages that are not consumed/sent. The application I'm working on uses the default value
But when a rebalancing occurs, the record collector tries to dispose the producer.
Once disposed, at the next partition assignment the already-disposed producer is assigned again, causing issues and access violations.
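As a rough illustration of that failure mode (not Streamiz's actual record collector code; broker address, topic and values are placeholders), reusing a Confluent.Kafka producer after it has been disposed fails because the underlying librdkafka handle is already destroyed:

using System;
using Confluent.Kafka;

var producerConfig = new ProducerConfig { BootstrapServers = "localhost:9092" }; // placeholder

IProducer<string, string> producer = new ProducerBuilder<string, string>(producerConfig).Build();

// Simulates the record collector closing the shared producer during a rebalance.
producer.Dispose();

try
{
    // A task that still holds the old reference keeps trying to produce.
    producer.Produce("product", new Message<string, string> { Key = "key", Value = "value" });
}
catch (Exception ex)
{
    // Any further use of the disposed instance fails.
    Console.WriteLine($"Producing after Dispose failed: {ex.GetType().Name}");
}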
@EmanueleAlbero I will take a look next week.
Hey @EmanueleAlbero,
If you join multiple topics, Range is better, to be sure all co-partitioned tasks are assigned to the same instance. When a rebalancing occurs, the record collector is flushed first and then closed, but only if there are no other partitions using the producer instance. It would be easier and better to separate both issues. Btw, I'm currently conducting a satisfaction survey to understand how I can serve you better, and I would love to get your feedback on the product. Your insights are invaluable and will help us shape the future of the product to better meet your needs. The survey will only take a few minutes, and your responses will be completely confidential. Thank you for your time and feedback!
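A hypothetical sketch of that rule (flush on every revocation, dispose only once no partition still uses the shared producer); the class and member names are illustrative assumptions, not Streamiz internals:

using System;
using System.Threading;
using Confluent.Kafka;

public sealed class SharedProducer
{
    private readonly IProducer<byte[], byte[]> _producer;
    private int _refCount;

    public SharedProducer(IProducer<byte[], byte[]> producer) => _producer = producer;

    // Called when a task/partition starts using the shared producer.
    public IProducer<byte[], byte[]> Acquire()
    {
        Interlocked.Increment(ref _refCount);
        return _producer;
    }

    // Called when a task loses its partitions during a rebalance:
    // flush first, dispose only if no other partition still uses the instance.
    public void Release()
    {
        _producer.Flush(TimeSpan.FromSeconds(10));
        if (Interlocked.Decrement(ref _refCount) == 0)
        {
            _producer.Dispose();
        }
    }
}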
Description
We are having problems with partition rebalancing. We use Kubernetes and we have two replicas for a service with a stream.
One of the problems is message loss.
We have this problem when one of the two pods is down and the second has not yet taken over. If a message is produced at that time and the pod restarts, then the message is not processed.
In some cases, messages are published in duplicate, but we have not identified when this happens.
In other cases, both replicas get stuck and we have to restart a single pod to unblock the situation. Here is our configuration.
How to reproduce
Run the stream on two pods.
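The reporters' actual configuration and topology are not included above; purely as a hypothetical sketch, a minimal Streamiz application deployed as two replicas (both pods share the same ApplicationId, so they join one consumer group) could look like the following, with topic names and broker address as placeholders:

using System.Threading.Tasks;
using Streamiz.Kafka.Net;
using Streamiz.Kafka.Net.SerDes;

public static class Program
{
    public static async Task Main()
    {
        var config = new StreamConfig<StringSerDes, StringSerDes>();
        config.ApplicationId = "product-category-stream"; // identical on both pods
        config.BootstrapServers = "kafka:9092";           // placeholder

        var builder = new StreamBuilder();
        // Placeholder topology: read the source topic and forward it downstream.
        builder.Stream<string, string>("product")
               .To("product-with-category");

        var stream = new KafkaStream(builder.Build(), config);
        await stream.StartAsync();
    }
}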