
Failed to connect to node due to an error during SSL handshake (DNS in certificate is not matching in both GCE/Azure) #9883

Closed
timtimb0t opened this issue Jan 21, 2025 · 6 comments · Fixed by #9912

@timtimb0t
Contributor

Packages

Scylla version: 2025.1.0~dev-20250119.1ef2d9d07692 with build-id e16c81a8ccdafe6190a5846930a953c8c8312a12

Kernel Version: 6.8.0-1020-azure

Issue description

During cluster startup the following error occurs:

2025-01-20 12:20:45.646: (CassandraStressEvent Severity.CRITICAL) period_type=end event_id=f3a7bf93-9a77-40f2-8af4-2571d7796900 duration=2s: node=Node longevity-tls-1tb-7d-master-loader-node-965572fc-eastus-1 [None | 10.0.0.9]
stress_cmd=cassandra-stress write cl=QUORUM n=275050075 -schema 'replication(strategy=NetworkTopologyStrategy,replication_factor=3) compaction(strategy=LeveledCompactionStrategy)' -mode cql3 native -rate threads=150 -col 'size=FIXED(200) n=FIXED(5)' -pop seq=1..275050075
errors:
Stress command completed with bad status 1:   Failed to connect to node due to an error during SSL handshake javax.net.ssl.SSLHandshakeException: No subject alternative DNS name matching longevity-tls-1tb-7d-master-db-node-965572fc-eastus-2.internal.cloudapp.net found.
java.lang.RuntimeException: com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (tried: longevity-tls-1tb-7d-master-db-node-965572fc-eastus-1.internal.cloudapp.net/10.0.0.5:9042 (com.datastax.driver.core.exceptions.TransportException: [longevity-tls-1tb-7d-master-db-node-965572fc-eastus-1.internal.cloudapp.net/10.0.0.5:9042] Error writing), longevity-tls-1tb-7d-master-db-node-965572fc-eastus-2.internal.cloudapp.net/10.0.0.6:9042 (com.datastax.driver.core.exceptions.TransportException: [longevity-tls-1tb-7d-master-db-node-965572fc-eastus-2.internal.cloudapp.net/10.0.0.6:9042] Error writing), longevity-tls-1tb-7d-master-db-node-965572fc-eastus-3.internal.cloudapp.net/10.0.0.7:9042 (com.datastax.driver.core.exceptions.TransportException: [longevity-tls-1tb-7d-master-db-node-965572fc-eastus-3.internal.cloudapp.net/10.0.0.7:9042] Error writing), longevity-tls-1tb-7d-master-db-node-965572fc-eastus-4.internal.cloudapp.net/10.0.0.8:9042 [only showing errors of first 3 hosts, use getErrors() for more details])

Impact

The SCT test itself never starts.

How frequently does it reproduce?

Describe the frequency with which this issue can be reproduced.

Installation details

Cluster size: 4 nodes (Standard_L16s_v3)

Scylla Nodes used in this run:

  • longevity-tls-1tb-7d-master-db-node-965572fc-eastus-4 (null | 10.0.0.8) (shards: 14)
  • longevity-tls-1tb-7d-master-db-node-965572fc-eastus-3 (null | 10.0.0.7) (shards: 14)
  • longevity-tls-1tb-7d-master-db-node-965572fc-eastus-2 (null | 10.0.0.6) (shards: 14)
  • longevity-tls-1tb-7d-master-db-node-965572fc-eastus-1 (null | 10.0.0.5) (shards: 14)

OS / Image: /subscriptions/6c268694-47ab-43ab-b306-3c5514bc4112/resourceGroups/SCYLLA-IMAGES/providers/Microsoft.Compute/images/scylla-2025.1.0-dev-x86_64-2025-01-20T02-10-43 (NO RUNNER: NO RUNNER)

Test: longevity-1tb-5days-azure-test
Test id: 965572fc-211f-479b-93a5-56706d3f5167
Test name: scylla-master/tier1/longevity-1tb-5days-azure-test
Test method: longevity_test.LongevityTest.test_custom_time
Test config file(s):

Logs and commands
  • Restore Monitor Stack command: $ hydra investigate show-monitor 965572fc-211f-479b-93a5-56706d3f5167
  • Restore monitor on AWS instance using Jenkins job
  • Show all stored logs command: $ hydra investigate show-logs 965572fc-211f-479b-93a5-56706d3f5167

Logs:

Jenkins job URL
Argus

@timtimb0t
Contributor Author

Hi @dimakr, could you please take a look at this issue?

@dimakr
Contributor

dimakr commented Jan 21, 2025

Hi @dimakr, could you please take a look at this issue?

on it

@dimakr
Contributor

dimakr commented Jan 22, 2025

The issue is backend-specific: it reproduces on GCE and Azure, but not on AWS.

@dimakr
Contributor

dimakr commented Jan 22, 2025

On AWS we put the hostname and the public and internal FQDNs into the SAN extension of the server certificate, e.g.:

        X509v3 extensions:
            X509v3 Subject Alternative Name: 
                DNS:PR-provision-test-tls-enab-db-node-afff4a9d-2, IP Address:10.4.2.101, IP Address:34.254.100.198, DNS:ec2-34-254-100-198.eu-west-1.compute.amazonaws.com, DNS:ip-10-4-2-101.eu-west-1.compute.internal

But on GCE and Azure, for some reason we have only hostnames:

GCE
---
        X509v3 extensions:
            X509v3 Subject Alternative Name: 
                DNS:pr-provision-test-tls-enab-db-node-f8eb1cab-0-1, IP Address:10.142.0.8, IP Address:104.196.108.173, DNS:pr-provision-test-tls-enab-db-node-f8eb1cab-0-1, DNS:pr-provision-test-tls-enab-db-node-f8eb1cab-0-1

Azure
---
        X509v3 extensions:
            X509v3 Subject Alternative Name: 
                DNS:pr-provision-test-tls-enab-db-node-02ef9e26-eastus-1, IP Address:10.0.0.5, DNS:pr-provision-test-tls-enab-db-node-02ef9e26-eastus-1, DNS:pr-provision-test-tls-enab-db-node-02ef9e26-eastus-1

We need to properly obtain the FQDNs for GCE and Azure in SCT before putting them in the server certificates.
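
For illustration, a minimal sketch of what the fix amounts to, using Python's cryptography package; the node name, FQDNs, and IPs below are made-up examples, and the final builder call only indicates where SCT's certificate-generation code would attach the extension:

    # Sketch: put the hostname, both FQDNs and the node IPs into the SAN,
    # mirroring what the AWS certificates already contain.
    # All names and addresses below are hypothetical examples.
    import ipaddress
    from cryptography import x509

    hostname = "example-db-node-1"
    private_fqdn = "example-db-node-1.internal.cloudapp.net"
    public_fqdn = "example-db-node-1.eastus.cloudapp.azure.com"
    private_ip, public_ip = "10.0.0.5", "104.196.108.173"

    san = x509.SubjectAlternativeName([
        x509.DNSName(hostname),
        x509.DNSName(private_fqdn),
        x509.DNSName(public_fqdn),
        x509.IPAddress(ipaddress.ip_address(private_ip)),
        x509.IPAddress(ipaddress.ip_address(public_ip)),
    ])
    # The SAN would then be attached wherever SCT issues the server certificate,
    # e.g. cert_builder.add_extension(san, critical=False)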

@fruch
Contributor

fruch commented Jan 23, 2025

@dimakr

What was changed? Is hostname verification now checked by default?

Please disable it for GCE and Azure until we can figure this out.

We never used DNS on either backend.

@fruch changed the title from "Failed to connect to node due to an error during SSL handshake" to "Failed to connect to node due to an error during SSL handshake (DNS in certificate is not matching in both GCE/Azure)" on Jan 23, 2025
@dimakr
Contributor

dimakr commented Jan 23, 2025

We never used DNS on either backend.

@fruch
This is basically the reason. I didn't notice it in the initial PR, when peer verification was enabled, as I was testing on AWS.
The driver seems to always resolve the IPs (which we use in the c-s command) to DNS names. But for GCE and Azure nodes we don't put real DNS names into the cluster.node.public_dns_name / cluster.node.private_dns_name attributes; simple hostnames are set there instead. These hostnames are then put into the SAN extension of the certificates, so peer verification reasonably fails.
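
As a minimal illustration (not the driver's actual code) of why the handshake fails, assuming made-up names: the name the driver resolves for a node is its FQDN, while the certificate's SAN carries only the short hostname:

    # Hypothetical values: what the GCE/Azure certificates carry today vs.
    # what the driver resolves the contact IP to.
    san_dns_names = ["example-db-node-1"]
    resolved_peer = "example-db-node-1.internal.cloudapp.net"

    def matches_san(peer_name: str, san_names: list[str]) -> bool:
        # Simplified exact-match check; real hostname verification also handles wildcards.
        return peer_name.lower() in (name.lower() for name in san_names)

    print(matches_san(resolved_peer, san_dns_names))  # False -> SSLHandshakeException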

What I tried: during node setup we can get the DNS names (a short sketch follows the list):

  • For GCE:
    • the private DNS name is saved as an instance metadata attribute and can be requested with something like self.query_gce_metadata("instance/hostname");
    • the public DNS name is not kept in metadata (this is also mentioned in the GCE docs), but we can resolve it at runtime.
  • For Azure, DNS names are not kept in metadata, but we can resolve them at runtime.
  • For both backends, the DNS names always follow the same pattern (if no custom DNS is configured), so we can build the names from patterns (e.g. {vm_name}.{region}.internal.cloudapp.net for Azure, etc.).
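
Roughly, these options could look like the following in Python; the helper names are hypothetical, while the GCE metadata path and the Azure name pattern are the ones mentioned above:

    # Sketch of the three approaches from the list above.
    import socket
    import requests

    def gce_private_fqdn() -> str:
        # GCE stores the private FQDN in instance metadata ("instance/hostname").
        resp = requests.get(
            "http://metadata.google.internal/computeMetadata/v1/instance/hostname",
            headers={"Metadata-Flavor": "Google"},
            timeout=5,
        )
        resp.raise_for_status()
        return resp.text.strip()

    def fqdn_from_reverse_dns(ip: str) -> str:
        # Resolve the DNS name at runtime via a reverse lookup (usable on GCE and Azure).
        return socket.gethostbyaddr(ip)[0]

    def azure_private_fqdn(vm_name: str, region: str) -> str:
        # Build the name from the default pattern when no custom DNS is configured.
        return f"{vm_name}.{region}.internal.cloudapp.net"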

Additionally, the Java driver has a resolve-contact-points option that we could probably try to set from c-s to disable name resolution, so that peer verification would rely on the IPs in the SAN extension (when the extension has no DNS names). But this solution is probably not OK.
