cdc: fetching specific topics in RefreshMetadata during Dial() caused error #116872
Posted a question on the sarama repo; hoping we can get more insights there - IBM/sarama#2755
cc @cockroachdb/cdc
Patch (cockroachdb#114740) changed Dial() in pkg/ccl/changefeedccl/sink_kafka.go to refresh metadata exclusively for kafkaSink.topics() rather than for all topics. This triggered a sarama error (an invalid replication factor reported by the Kafka servers). It is hard to get to the bottom of it given that there are too many unknowns on the sarama and kafka side (more in cockroachdb#116872). We decided to revert Dial() back to its previous behaviour in the sarama code and will continue investigating afterwards. Part of: cockroachdb#116872 Fixes: cockroachdb#116358 Release note: None
116414: changefeedccl: revert Dial() to fetch all metadata topics r=jayshrivastava a=wenyihu6

Patch (#114740) changed Dial() in pkg/ccl/changefeedccl/sink_kafka.go to refresh metadata exclusively for kafkaSink.topics() rather than for all topics. This triggered a sarama error (an invalid replication factor reported by the Kafka servers). It is hard to get to the bottom of it given that there are too many unknowns on the sarama and kafka side (more in #116872). We decided to revert Dial() back to its previous behaviour in the sarama code and will continue investigating afterwards. Part of: #116872 Fixes: #116358 Release note: None

116894: storage: increase testing shard count r=jbowens a=itsbilal

We're starting to sporadically see failures when the test shard for pkg/storage times out as a whole. This change doubles the shard count to give each test more headroom before it times out. Fixes #116692. Epic: none Release note: None

116910: testserver: disable tenant randomization under race in multi-node clusters r=yuzefovich a=yuzefovich

We now run `race` builds in the EngFlow environment, which has either 1 CPU or 2 CPU executors. If we have a multi-node cluster and then start a default test tenant, it's very likely for that environment to be overloaded and lead to unactionable test failures. This commit prevents this from happening by disabling tenant randomization under race in multi-node clusters. As a result, it optimistically unskips a few tests that we had skipped due to those unactionable failures. Fixes: #115619. Release note: None

116956: sql: skip Test{Experimental}RelocateVoters under duress r=yuzefovich a=yuzefovich

There is little value in running these tests in complex configs. Fixes: #116939. Release note: None

Co-authored-by: Wenyi Hu <[email protected]>
Co-authored-by: Bilal Akhtar <[email protected]>
Co-authored-by: Yahor Yuzefovich <[email protected]>
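The revert above boils down to which topic list Dial() hands to sarama's client.RefreshMetadata: an empty list asks the broker for metadata on every topic, while a specific list scopes the request. A minimal sketch of that choice, where `refreshTopics` is a hypothetical helper and not code from sink_kafka.go:

```go
package main

import "fmt"

// refreshTopics mirrors the decision reverted in Dial(): a nil/empty
// slice makes sarama send a metadata request with no topics, which
// Kafka interprets as "all topics", while a non-empty slice restricts
// the request to the sink's topics. Names here are illustrative only.
func refreshTopics(fetchAll bool, sinkTopics []string) []string {
	if fetchAll {
		// Reverted behaviour: fetch metadata for every topic.
		return nil
	}
	// Behaviour introduced by #114740: fetch only the sink's topics.
	return sinkTopics
}

func main() {
	fmt.Println(refreshTopics(true, []string{"t1", "t2"}))  // [] (all topics)
	fmt.Println(refreshTopics(false, []string{"t1", "t2"})) // [t1 t2]
}
```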
Now that we have fixed the kafka auth setup in roachtests, we can change Dial() to fetch metadata for specific topics only. Fixes: cockroachdb#116872 Part of: cockroachdb#118952 Release note: none
@wenyihu6 RE:
That's not an authentication failure; that's the Kafka broker's connection to ZooKeeper attempting a TLS connection and finding that the remote ZooKeeper cluster did not have TLS enabled. It's just an INFO-level log to let you know. Kafka will still work despite this, just using PLAIN for communication with ZooKeeper.
@wenyihu6 RE:
The difference here is that by refreshing metadata with an explicit set of topics, you go into the Kafka code path that permits automatic topic creation (which is disabled when you've done Metadata.Full and passed an empty topic list): https://github.com/IBM/sarama/blob/5f63a84f47c39bf08a1c276f1f6b5f1d754e9fc3/client.go#L1022-L1028. The fact that you're getting 'Replication-factor is invalid' in that scenario suggests that the number of running/active brokers in your Kafka cluster is fewer than the default replication factor, so Kafka returns INVALID_REPLICATION_FACTOR to sarama because it doesn't have enough active brokers to satisfy the create request.
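The broker-side check dnwe describes can be sketched as follows. This is a hypothetical Go mirror of the validation, not Kafka's actual (Scala) code:

```go
package main

import (
	"errors"
	"fmt"
)

// errInvalidReplicationFactor stands in for Kafka's
// INVALID_REPLICATION_FACTOR error code as sarama surfaces it.
var errInvalidReplicationFactor = errors.New("kafka server: Replication-factor is invalid")

// validateCreate mimics the broker-side rule: an (auto-)create request
// is rejected when the cluster has fewer live brokers than the
// requested (or default) replication factor.
func validateCreate(liveBrokers, replicationFactor int) error {
	if replicationFactor > liveBrokers {
		return errInvalidReplicationFactor
	}
	return nil
}

func main() {
	// One live broker cannot host three replicas, so the create fails.
	fmt.Println(validateCreate(1, 3))
	// Three live brokers can satisfy replication factor 3.
	fmt.Println(validateCreate(3, 3))
}
```

This matches the symptom in the issue: with no brokers running successfully, any replication factor greater than zero fails the check.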
Hi Dominic! Thanks for taking the time to address my inquiries! I'm curious about how sarama handles Metadata.Full set to true. Does kafka dynamically determine the available topics upon receiving a metadata refresh request here? Do you happen to recall any relevant changes made to sarama's metadata handling? For context, we recently upgraded sarama versions. I suspect the changes made in IBM/sarama#2645 and IBM/sarama@c42b2e0, but I'm having trouble spotting the root cause.
@wenyihu6 sure. When requesting metadata from Kafka, it's actually the protocol that supports optionally asking for specific topics, or alternatively sending a general metadata request without topics, in which case the broker returns all known brokers and topics for the whole cluster. In terms of what's changed in sarama since v1.38.1, I'd suggest that v1.41.0 was the most significant release, as it corrected the use of a large number of compatible but backlevel protocol versions, ensuring that we send newer versions where they are supported by the cluster (and bound by the Version field in sarama.Config). Are you still seeing issues now, or have you resolved them with the corrected backend config?
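The protocol semantics dnwe describes can be sketched with a simplified stand-in for the Kafka Metadata request. This is a hypothetical type for illustration, not sarama's actual API:

```go
package main

import "fmt"

// metadataRequest is a simplified stand-in for Kafka's Metadata
// request: an empty topic list means "describe every topic in the
// cluster", while a populated list scopes the response and, on newer
// protocol versions, carries the flag permitting auto topic creation.
type metadataRequest struct {
	Topics                 []string
	AllowAutoTopicCreation bool
}

// describesAllTopics reports whether the broker would treat this
// request as a full-cluster metadata fetch.
func (r metadataRequest) describesAllTopics() bool {
	return len(r.Topics) == 0
}

func main() {
	full := metadataRequest{}                               // Metadata.Full-style refresh
	scoped := metadataRequest{Topics: []string{"my-topic"}} // specific topics only
	fmt.Println(full.describesAllTopics(), scoped.describesAllTopics()) // true false
}
```

This is why the two refresh styles behave differently: only the scoped request ever reaches the code path where the broker may try (and fail) to auto-create a topic.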
Hi @dnwe! Thanks for your reply! We've fixed the misconfiguration in our kafka cluster and confirmed there are no issues with the sarama code. This error had existed in our kafka setup for some time, but it was never returned to us until the recent sarama upgrade surfaced it. So I'm wondering if you happen to know what changed in the newer releases. The error we got was https://github.com/IBM/sarama/blob/5f63a84f47c39bf08a1c276f1f6b5f1d754e9fc3/client.go#L1080. No servers were able to set up successfully in our kafka cluster due to the misconfiguration, so this error is expected. I'm just curious about what was improved in the new version that surfaced this error in this case.
My guess would be basically what I referenced earlier: a combination of sarama sending newer protocol versions and the backing cluster consequently performing more validation, or returning more information that required broker-to-broker communication.
Makes sense. Thanks for helping me understand! I will close this and the sarama issue I opened.
Closing - upon investigation, we found that the error we are encountering here is due to some misconfiguration in our kafka roachtest set up. More details in #119077 (comment). |
Summary: This is a similar issue to #118525. The certificate we generated for the kafka test cluster was no longer valid after hostname verification of servers became enabled by default in newer kafka releases.
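The root cause above is an ordinary hostname-verification failure: once the server's identity is checked against the certificate, a cert whose SANs don't cover the broker's advertised name or IP is rejected. Go's standard library performs the same check, which a small self-contained example can demonstrate (the hostname, IP, and `newTestCert` helper are illustrative, not taken from the roachtest fixture):

```go
package main

import (
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/x509"
	"crypto/x509/pkix"
	"fmt"
	"math/big"
	"net"
	"time"
)

// newTestCert builds a self-signed certificate whose SANs cover one
// DNS name and one IP address, mimicking a test-cluster certificate.
func newTestCert() *x509.Certificate {
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		panic(err)
	}
	tmpl := &x509.Certificate{
		SerialNumber: big.NewInt(1),
		Subject:      pkix.Name{CommonName: "kafka-test"},
		NotBefore:    time.Now(),
		NotAfter:     time.Now().Add(time.Hour),
		DNSNames:     []string{"kafka.test.internal"},
		IPAddresses:  []net.IP{net.ParseIP("10.142.0.32")},
	}
	der, err := x509.CreateCertificate(rand.Reader, tmpl, tmpl, &key.PublicKey, key)
	if err != nil {
		panic(err)
	}
	cert, err := x509.ParseCertificate(der)
	if err != nil {
		panic(err)
	}
	return cert
}

func main() {
	cert := newTestCert()
	// Verification succeeds only for names/IPs listed in the SANs; a
	// name outside the SANs is rejected, which is the failure mode a
	// stale cluster certificate hits once hostname checking is on.
	fmt.Println(cert.VerifyHostname("kafka.test.internal")) // matches DNS SAN
	fmt.Println(cert.VerifyHostname("10.142.0.32"))         // matches IP SAN
	fmt.Println(cert.VerifyHostname("other.host") != nil)   // rejected: true
}
```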
Describe the problem
When calling refreshMetadata(sink.Topics()) instead of passing an empty argument, sarama returns

pq: kafka server: Replication-factor is invalid

for the kafka-auth roachtest. Kafka has always been logging errors in logs/kafka/server.log, but they don't surface in the sarama code, e.g.:

[2023-12-18 01:30:06,771] INFO [SocketServer listenerType=ZK_BROKER, nodeId=1001] Failed authentication with /10.142.0.32 (channelId=10.142.0.32:9094-10.142.0.32:34616-0) (SSL handshake failed) (org.apache.kafka.common.network.Selector)

Explicitly asking to fetch metadata for the specific topics during Dial() somehow revealed the error, causing sarama to return pq: kafka server: Replication-factor is invalid.

Three fixes for kafka-auth have been found so far. We are going with the third option, to match the previous behaviour, since there are too many unknowns here:

1. It is unclear which part of the sarama code triggered the authentication error, and why sarama swallowed the error.
2. When refreshMetadata is called with an empty argument, sarama doesn't do anything on its side but just passes a request with empty topics to the broker (https://github.com/IBM/sarama/blob/228c94e423f068fa75293e12f68cb6cf249d38d3/client.go#L1034). So my guess is that kafka never really refreshed metadata for these topics during Dial() before (maybe the kafka topics hadn't been created yet?).
3. If we have been failing authentication with kafka since the start, why do other kafka tests work?
4. It is unclear whether RefreshMetadata inside EmitResolvedTimestamp works as expected either.

Jira issue: CRDB-34830