
cdc: fetching specific topics in RefreshMetadata during Dial() caused error #116872

Closed
wenyihu6 opened this issue Dec 20, 2023 · 11 comments
Labels
A-cdc Change Data Capture C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. T-cdc

Comments

@wenyihu6
Contributor

wenyihu6 commented Dec 20, 2023

Describe the problem

When calling refreshMetadata(sink.Topics()) instead of passing an empty argument, sarama returns pq: kafka server: Replication-factor is invalid in the kafka-auth roachtest.

Kafka has always been logging errors in logs/kafka/server.log, but they never surfaced through the sarama code:

[2023-12-18 01:30:06,771] INFO [SocketServer listenerType=ZK_BROKER, nodeId=1001] Failed authentication with /10.142.0.32 (channelId=10.142.0.32:9094-10.142.0.32:34616-0) (SSL handshake failed) (org.apache.kafka.common.network.Selector)

Explicitly asking to fetch metadata for the specific topics during Dial() somehow revealed the error, causing sarama to return pq: kafka server: Replication-factor is invalid.

Three fixes for kafka-auth found so far:

  1. Adding ssl.endpoint.identification.algorithm= to the kafka config in the roachtest (https://stackoverflow.com/questions/54903381/kafka-failed-authentication-due-to-ssl-handshake-failed), or
  2. Using a different kafka version, or
  3. Passing an empty argument to refreshMetadata during Dial().

We are going with the third option to match the previous behaviour, since there are too many unknowns here:

  1. It is unclear which part of the sarama code triggered the authentication error, and why sarama swallowed it.
  2. When refreshMetadata is called with an empty argument, sarama doesn't do anything special on its side; it just passes a request with empty topics to the broker (https://github.com/IBM/sarama/blob/228c94e423f068fa75293e12f68cb6cf249d38d3/client.go#L1034). So my guess is that kafka never really refreshed metadata for these topics during Dial() before (maybe the kafka topics hadn't been created yet?).
  3. If we had been failing authentication with kafka from the start, why do the other kafka tests work?
  4. It is unclear whether RefreshMetadata inside EmitResolvedTimestamp works as expected either.
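For readers unfamiliar with sarama's API, here is a minimal stdlib-only Go sketch of the choice being discussed. sarama's RefreshMetadata(topics ...string) treats an empty list as a full-cluster metadata refresh, while a non-empty list scopes the request to those topics. The helper name and flag below are hypothetical, not from the cockroach code base:

```go
package main

import "fmt"

// topicsForRefresh is a hypothetical helper sketching the decision above:
// returning nil corresponds to calling client.RefreshMetadata() with no
// arguments (full-cluster refresh, the pre-patch behaviour), while returning
// the sink's topics corresponds to client.RefreshMetadata(sink.Topics()...).
func topicsForRefresh(matchPreviousBehaviour bool, sinkTopics []string) []string {
	if matchPreviousBehaviour {
		// Option 3: pass an empty argument, as Dial() did before the patch.
		return nil
	}
	// The patched behaviour: scope the metadata request to specific topics.
	return sinkTopics
}

func main() {
	sink := []string{"orders", "users"}
	fmt.Println(len(topicsForRefresh(true, sink))) // 0: full metadata refresh
	fmt.Println(topicsForRefresh(false, sink))     // scoped to the sink's topics
}
```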

Jira issue: CRDB-34830

@wenyihu6 wenyihu6 added the C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. label Dec 20, 2023
@wenyihu6
Contributor Author

Posted a question on the sarama repo; hopefully we can get more insights there - IBM/sarama#2755

@wenyihu6 wenyihu6 added the A-cdc Change Data Capture label Dec 20, 2023
@blathers-crl blathers-crl bot added the T-cdc label Dec 20, 2023

blathers-crl bot commented Dec 20, 2023

cc @cockroachdb/cdc

wenyihu6 added a commit to wenyihu6/cockroach that referenced this issue Dec 20, 2023
Patch (cockroachdb#114740) changed Dial() in pkg/ccl/changefeedccl/sink_kafka.go to refresh
metadata exclusively for kafkaSink.topics() rather than for all topics. This
triggered a sarama error (an invalid replication factor in Kafka servers). It
seems hard to get to the bottom of it given that there are too many unknowns
on the sarama and kafka side (more in cockroachdb#116872). We decided to revert to what
Dial() previously did in the sarama code and will continue investigating afterwards.

Part of: cockroachdb#116358, cockroachdb#116872
Release note: None
wenyihu6 added a commit to wenyihu6/cockroach that referenced this issue Dec 21, 2023
Patch (cockroachdb#114740) changed Dial() in pkg/ccl/changefeedccl/sink_kafka.go to refresh
metadata exclusively for kafkaSink.topics() rather than for all topics. This
triggered a sarama error (an invalid replication factor in Kafka servers). It
seems hard to get to the bottom of it given that there are too many unknowns
on the sarama and kafka side (more in cockroachdb#116872). We decided to revert to what
Dial() previously did in the sarama code and will continue investigating afterwards.

Part of: cockroachdb#116872
Fixes: cockroachdb#116358
Release note: None
craig bot pushed a commit that referenced this issue Dec 21, 2023
116414: changefeedccl: revert Dial() to fetch all metadata topics  r=jayshrivastava a=wenyihu6

Patch (#114740) changed Dial() in pkg/ccl/changefeedccl/sink_kafka.go to refresh
metadata exclusively for kafkaSink.topics() rather than for all topics. This
triggered a sarama error (an invalid replication factor in Kafka servers). It
seems hard to get to the bottom of it given that there are too many unknowns
on the sarama and kafka side (more in #116872). We decided to revert to what
Dial() previously did in the sarama code and will continue investigating afterwards.

Part of: #116872
Fixes: #116358
Release note: None

116894: storage: increase testing shard count r=jbowens a=itsbilal

We're starting to sporadically see failures when the test shard for pkg/storage times out as a whole. This change doubles the shard count to give each test more headroom before it times out.

Fixes #116692.

Epic: none

Release note: None

116910: testserver: disable tenant randomization under race in multi-node clusters r=yuzefovich a=yuzefovich

We now run `race` builds in the EngFlow environment which has either 1 CPU or 2 CPU executors. If we have a multi-node cluster and then start a default test tenant, then it's very likely for that environment to be overloaded and lead to unactionable test failures. This commit prevents this from happening by disabling the tenant randomization under race in multi-node clusters. As a result, it optimistically unskips a few tests that we skipped due to those unactionable failures.

Fixes: #115619.

Release note: None

116956: sql: skip Test{Experimental}RelocateVoters under duress r=yuzefovich a=yuzefovich

There is little value in running these tests in complex configs.

Fixes: #116939.

Release note: None

Co-authored-by: Wenyi Hu <[email protected]>
Co-authored-by: Bilal Akhtar <[email protected]>
Co-authored-by: Yahor Yuzefovich <[email protected]>
@exalate-issue-sync exalate-issue-sync bot assigned wenyihu6 and unassigned wenyihu6 Jan 8, 2024
wenyihu6 added a commit to wenyihu6/cockroach that referenced this issue Feb 8, 2024
Now that we have fixed kafka auth setup in roachtests, we can change Dial() to
fetch metadata for specific topics only.

Fixes: cockroachdb#116872
Part of: cockroachdb#118952
Release note: none
@dnwe

dnwe commented Feb 11, 2024

@wenyihu6 RE:

[2023-12-18 01:30:06,771] INFO [SocketServer listenerType=ZK_BROKER, nodeId=1001] Failed authentication with /10.142.0.32 (channelId=10.142.0.32:9094-10.142.0.32:34616-0) (SSL handshake failed)

That's not an authentication failure; that's the Kafka broker's connection to ZooKeeper attempting a TLS connection and finding that the remote ZooKeeper cluster did not have TLS enabled. It's just an INFO-level log to let you know; Kafka will still work despite this, just using PLAIN for communication with ZooKeeper

@dnwe

dnwe commented Feb 11, 2024

@wenyihu6 RE:

When calling refreshMetadata(sink.Topics()) instead of an empty argument, sarama returns pq: kafka server: Replication-factor is invalid for kafka-auth roachtest.

The difference here is that by refreshing metadata with an explicit set of topics, you go into the Kafka code path that permits automatic topic creation (which is disabled when you've done Metadata.Full and passed an empty topic list)

https://github.com/IBM/sarama/blob/5f63a84f47c39bf08a1c276f1f6b5f1d754e9fc3/client.go#L1022-L1028

The fact that you're getting 'Replication-factor is invalid' in that scenario suggests that the number of running/active brokers you have in the Kafka cluster is fewer than the default replication factor, so Kafka is returning INVALID_REPLICATION_FACTOR to Sarama because it doesn't have enough active brokers to satisfy the create request
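The broker-side check described above can be sketched in stdlib-only Go. This is an illustrative model, not Kafka's actual code; the error text mirrors what sarama surfaced, and the broker counts are assumptions for illustration:

```go
package main

import (
	"errors"
	"fmt"
)

// errInvalidReplicationFactor mirrors the INVALID_REPLICATION_FACTOR error
// that Kafka returned to sarama in this issue.
var errInvalidReplicationFactor = errors.New("kafka server: Replication-factor is invalid")

// validateAutoCreate sketches the broker-side validation described above:
// auto-creating a topic requires at least defaultReplicationFactor live
// brokers, otherwise the create request is rejected.
func validateAutoCreate(liveBrokers, defaultReplicationFactor int) error {
	if liveBrokers < defaultReplicationFactor {
		return errInvalidReplicationFactor
	}
	return nil
}

func main() {
	// With inter-broker TLS failing, few (or no) brokers register as live,
	// so auto-creation against a default replication factor fails.
	fmt.Println(validateAutoCreate(1, 3)) // rejected: not enough live brokers
	fmt.Println(validateAutoCreate(3, 3)) // accepted
}
```

This also explains why the empty-topic-list (Metadata.Full) path never hit the error: with auto-creation out of the picture, the broker never ran this validation.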

@wenyihu6
Contributor Author

@wenyihu6 RE:

When calling refreshMetadata(sink.Topics()) instead of an empty argument, sarama returns pq: kafka server: Replication-factor is invalid for kafka-auth roachtest.

The difference here is that by refreshing metadata with an explicit set of topics, you go into the Kafka code path that permits automatic topic creation (which is disabled when you've done Metadata.Full and passed an empty topic list)

https://github.com/IBM/sarama/blob/5f63a84f47c39bf08a1c276f1f6b5f1d754e9fc3/client.go#L1022-L1028

The fact that you're getting 'Replication-factor is invalid' in that scenario suggests that the number of running/active brokers you have in the Kafka cluster is fewer than the default replication factor, so Kafka is returning INVALID_REPLICATION_FACTOR to Sarama because it doesn't have enough active brokers to satisfy the create request

Hi Dominic!

Thanks for taking the time to address my inquiries!

I'm curious about how sarama handles Metadata.Full set to true. Does kafka dynamically determine the available topics upon receiving a metadata refresh request here?

Do you happen to recall any changes made to Sarama's NewClient or RefreshMetadata since v1.38.1 that alter the original behaviour and begin invoking some kafka code path?

For context, we recently upgraded from v1.38.1 to v1.42.1, and some tests started failing. Upon investigation, we realized there was a misconfiguration in our kafka server, preventing kafka brokers from starting and inter-communicating. We've found how to fix the issue, but I'm interested in learning how RefreshMetadata and NewClient might have changed such that they do slightly more work and surface the error back to us.

I'm suspecting the changes made in IBM/sarama#2645 and IBM/sarama@c42b2e0, but I'm having trouble spotting the root cause.

@dnwe

dnwe commented Feb 11, 2024

@wenyihu6 sure, when requesting metadata from Kafka it's actually the protocol that supports optionally asking for specific topics, or alternatively sending a general metadata request without topics, in which case the broker will return all known brokers and topics for the whole cluster

In terms of what’s changed in Sarama since v1.38.1, I’d suggest that v1.41.0 was the most significant release as it corrected the use of a large number of compatible but backlevel protocol versions — ensuring that we send newer versions where they are supported by the cluster (and bound by Version field in sarama.Config)

Are you still seeing issues now or have you resolved them with the corrected backend config?

@wenyihu6
Contributor Author

@wenyihu6 sure, when requesting metadata from Kafka it’s actually the protocol that supports optionally asking for specific topics or alternatively sending a general metadata request without topics and then the broker will return all known brokers and topics for the whole cluster

In terms of what’s changed in Sarama since v1.38.1, I’d suggest that v1.41.0 was the most significant release as it corrected the use of a large number of compatible but backlevel protocol versions — ensuring that we send newer versions where they are supported by the cluster (and bound by Version field in sarama.Config)

Are you still seeing issues now or have you resolved them with the corrected backend config?

Hi @dnwe!

Thanks for your reply!

We've fixed the misconfiguration in our kafka cluster and confirmed there are no issues with the sarama code. This error has existed in our kafka setup for some time, but it was never returned to us until the recent sarama upgrade surfaced it. So I'm wondering if you happen to know what changed in NewClient or RefreshMetadata recently that revealed the problem. No worries if there is too much going on beneath the surface to pinpoint the root cause in this case.

The error we got was https://github.com/IBM/sarama/blob/5f63a84f47c39bf08a1c276f1f6b5f1d754e9fc3/client.go#L1080. No brokers were able to start successfully in our kafka cluster due to the misconfiguration, so this error is expected. I'm just curious about what was improved in the new version that surfaced this error in this case.

@dnwe

dnwe commented Feb 11, 2024

My guess would be basically what I referenced earlier. A combination of Sarama sending newer protocol versions and the backing cluster hence performing more validation or returning more info that required broker-to-broker communication

@wenyihu6
Contributor Author

My guess would be basically what I referenced earlier. A combination of Sarama sending newer protocol versions and the backing cluster hence performing more validation or returning more info that required broker-to-broker communication

Makes sense. Thanks for helping me understand here! I will close this and the sarama issue I opened up.

@wenyihu6
Contributor Author

Closing - upon investigation, we found that the error we encountered here was due to a misconfiguration in our kafka roachtest setup. More details in #119077 (comment).

@wenyihu6
Contributor Author

Summary:

This is a similar issue to #118525. The certificate we generated for the kafka test cluster was no longer valid after hostname verification of servers became enabled by default (kafka 2.0). As a result, some inter-broker communication had been failing with a hostname verification error for some time, but this error wasn't raised to the user until we tried to fetch specific topics during Dial().
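For reference, the workaround mentioned earlier in this thread corresponds to the following kafka server config fragment. The ssl.endpoint.identification.algorithm key is a real Kafka broker setting; treating it as empty disables hostname verification (its default has been "https" since Kafka 2.0). This is a sketch of the workaround, not the fix this issue ultimately settled on (regenerating valid certificates):

```properties
# Disable hostname verification for TLS connections by setting the
# endpoint identification algorithm to an empty string. Note: this
# weakens man-in-the-middle protection; the proper fix is to issue
# certificates whose SANs match the broker hostnames.
ssl.endpoint.identification.algorithm=
```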
