
Failed to connect to node due to an error during SSL handshake (DNS in certificate is not matching in both GCE/Azure) #9883

Closed
timtimb0t opened this issue Jan 21, 2025 · 6 comments · Fixed by #9912

@timtimb0t
Contributor

Packages

Scylla version: 2025.1.0~dev-20250119.1ef2d9d07692 with build-id e16c81a8ccdafe6190a5846930a953c8c8312a12

Kernel Version: 6.8.0-1020-azure

Issue description

During cluster startup the following error occurs:

2025-01-20 12:20:45.646: (CassandraStressEvent Severity.CRITICAL) period_type=end event_id=f3a7bf93-9a77-40f2-8af4-2571d7796900 duration=2s: node=Node longevity-tls-1tb-7d-master-loader-node-965572fc-eastus-1 [None | 10.0.0.9]
stress_cmd=cassandra-stress write cl=QUORUM n=275050075 -schema 'replication(strategy=NetworkTopologyStrategy,replication_factor=3) compaction(strategy=LeveledCompactionStrategy)' -mode cql3 native -rate threads=150 -col 'size=FIXED(200) n=FIXED(5)' -pop seq=1..275050075
errors:
Stress command completed with bad status 1:   Failed to connect to node due to an error during SSL handshake javax.net.ssl.SSLHandshakeException: No subject alternative DNS name matching longevity-tls-1tb-7d-master-db-node-965572fc-eastus-2.internal.cloudapp.net found.
java.lang.RuntimeException: com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (tried: longevity-tls-1tb-7d-master-db-node-965572fc-eastus-1.internal.cloudapp.net/10.0.0.5:9042 (com.datastax.driver.core.exceptions.TransportException: [longevity-tls-1tb-7d-master-db-node-965572fc-eastus-1.internal.cloudapp.net/10.0.0.5:9042] Error writing), longevity-tls-1tb-7d-master-db-node-965572fc-eastus-2.internal.cloudapp.net/10.0.0.6:9042 (com.datastax.driver.core.exceptions.TransportException: [longevity-tls-1tb-7d-master-db-node-965572fc-eastus-2.internal.cloudapp.net/10.0.0.6:9042] Error writing), longevity-tls-1tb-7d-master-db-node-965572fc-eastus-3.internal.cloudapp.net/10.0.0.7:9042 (com.datastax.driver.core.exceptions.TransportException: [longevity-tls-1tb-7d-master-db-node-965572fc-eastus-3.internal.cloudapp.net/10.0.0.7:9042] Error writing), longevity-tls-1tb-7d-master-db-node-965572fc-eastus-4.internal.cloudapp.net/10.0.0.8:9042 [only showing errors of first 3 hosts, use getErrors() for more details])

Impact

The SCT test itself never starts.

How frequently does it reproduce?

Describe the frequency with which this issue can be reproduced.

Installation details

Cluster size: 4 nodes (Standard_L16s_v3)

Scylla Nodes used in this run:

  • longevity-tls-1tb-7d-master-db-node-965572fc-eastus-4 (null | 10.0.0.8) (shards: 14)
  • longevity-tls-1tb-7d-master-db-node-965572fc-eastus-3 (null | 10.0.0.7) (shards: 14)
  • longevity-tls-1tb-7d-master-db-node-965572fc-eastus-2 (null | 10.0.0.6) (shards: 14)
  • longevity-tls-1tb-7d-master-db-node-965572fc-eastus-1 (null | 10.0.0.5) (shards: 14)

OS / Image: /subscriptions/6c268694-47ab-43ab-b306-3c5514bc4112/resourceGroups/SCYLLA-IMAGES/providers/Microsoft.Compute/images/scylla-2025.1.0-dev-x86_64-2025-01-20T02-10-43 (NO RUNNER: NO RUNNER)

Test: longevity-1tb-5days-azure-test
Test id: 965572fc-211f-479b-93a5-56706d3f5167
Test name: scylla-master/tier1/longevity-1tb-5days-azure-test
Test method: longevity_test.LongevityTest.test_custom_time
Test config file(s):

Logs and commands
  • Restore Monitor Stack command: $ hydra investigate show-monitor 965572fc-211f-479b-93a5-56706d3f5167
  • Restore monitor on AWS instance using Jenkins job
  • Show all stored logs command: $ hydra investigate show-logs 965572fc-211f-479b-93a5-56706d3f5167

Logs:

Jenkins job URL
Argus

@timtimb0t
Contributor Author

Hi @dimakr, could you please take a look at this issue?

@dimakr
Contributor

dimakr commented Jan 21, 2025

Hi @dimakr, could you please take a look at this issue?

on it

@dimakr
Contributor

dimakr commented Jan 22, 2025

The issue is backend-specific: it reproduces on GCE and Azure, but not on AWS.

@dimakr
Contributor

dimakr commented Jan 22, 2025

On AWS we put the hostname and the public and internal FQDNs into the SAN extension of the server certificate, e.g.:

        X509v3 extensions:
            X509v3 Subject Alternative Name: 
                DNS:PR-provision-test-tls-enab-db-node-afff4a9d-2, IP Address:10.4.2.101, IP Address:34.254.100.198, DNS:ec2-34-254-100-198.eu-west-1.compute.amazonaws.com, DNS:ip-10-4-2-101.eu-west-1.compute.internal

But on GCE and Azure, for some reason we have only hostnames:

GCE
---
        X509v3 extensions:
            X509v3 Subject Alternative Name: 
                DNS:pr-provision-test-tls-enab-db-node-f8eb1cab-0-1, IP Address:10.142.0.8, IP Address:104.196.108.173, DNS:pr-provision-test-tls-enab-db-node-f8eb1cab-0-1, DNS:pr-provision-test-tls-enab-db-node-f8eb1cab-0-1

Azure
---
        X509v3 extensions:
            X509v3 Subject Alternative Name: 
                DNS:pr-provision-test-tls-enab-db-node-02ef9e26-eastus-1, IP Address:10.0.0.5, DNS:pr-provision-test-tls-enab-db-node-02ef9e26-eastus-1, DNS:pr-provision-test-tls-enab-db-node-02ef9e26-eastus-1

We need to properly obtain the FQDNs for GCE and Azure in SCT before putting them in the server certificates.
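
For illustration, a minimal sketch of what the fix amounts to, using Python's cryptography package; the node name, FQDNs, and IPs below are made-up examples, and the final builder call only indicates where SCT's certificate-generation code would attach the extension:

    # Sketch: put the hostname, both FQDNs and the node IPs into the SAN,
    # mirroring what the AWS certificates already contain.
    # All names and addresses below are hypothetical examples.
    import ipaddress
    from cryptography import x509

    hostname = "example-db-node-1"
    private_fqdn = "example-db-node-1.internal.cloudapp.net"
    public_fqdn = "example-db-node-1.eastus.cloudapp.azure.com"
    private_ip, public_ip = "10.0.0.5", "104.196.108.173"

    san = x509.SubjectAlternativeName([
        x509.DNSName(hostname),
        x509.DNSName(private_fqdn),
        x509.DNSName(public_fqdn),
        x509.IPAddress(ipaddress.ip_address(private_ip)),
        x509.IPAddress(ipaddress.ip_address(public_ip)),
    ])
    # The SAN would then be attached wherever SCT issues the server certificate,
    # e.g. cert_builder.add_extension(san, critical=False)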

@fruch
Contributor

fruch commented Jan 23, 2025

@dimakr

What was changed? Is hostname verification now checked by default?

Please disable it for GCE and Azure until we can figure this out.

We never used DNS on either backend.

@fruch changed the title from "Failed to connect to node due to an error during SSL handshake" to "Failed to connect to node due to an error during SSL handshake (DNS in certificate is not matching in both GCE/Azure)" on Jan 23, 2025
@dimakr
Contributor

dimakr commented Jan 23, 2025

We never used DNS on either backend.

@fruch
This is basically the reason. I didn't notice it in the initial PR, when peer verification was enabled, as I was testing on AWS.
The driver seems to always resolve the IPs (which we use in the c-s command) to DNS names. But for GCE and Azure nodes we don't put real DNS names into the cluster.node.public_dns_name / cluster.node.private_dns_name attributes; simple hostnames are set there instead. These hostnames are then put into the SAN extension of the certificates, so peer verification reasonably fails.
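
As a minimal illustration (not the driver's actual code) of why the handshake fails, assuming made-up names: the name the driver resolves for a node is its FQDN, while the certificate's SAN carries only the short hostname:

    # Hypothetical values: what the GCE/Azure certificates carry today vs.
    # what the driver resolves the contact IP to.
    san_dns_names = ["example-db-node-1"]
    resolved_peer = "example-db-node-1.internal.cloudapp.net"

    def matches_san(peer_name: str, san_names: list[str]) -> bool:
        # Simplified exact-match check; real hostname verification also handles wildcards.
        return peer_name.lower() in (name.lower() for name in san_names)

    print(matches_san(resolved_peer, san_dns_names))  # False -> SSLHandshakeException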

What I tried: during node setup we can get the DNS names (a short sketch follows the list):

  • For GCE:
    • the private DNS name is saved as an instance metadata attribute and can be requested with something like self.query_gce_metadata("instance/hostname");
    • the public DNS name is not kept in metadata (this is also mentioned in the GCE docs), but we can resolve it at runtime.
  • For Azure, DNS names are not kept in metadata, but we can resolve them at runtime.
  • For both backends, the DNS names always follow the same pattern (if no custom DNS is configured), so we can build the names from patterns (e.g. {vm_name}.{region}.internal.cloudapp.net for Azure, etc.).
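
Roughly, these options could look like the following in Python; the helper names are hypothetical, while the GCE metadata path and the Azure name pattern are the ones mentioned above:

    # Sketch of the three approaches from the list above.
    import socket
    import requests

    def gce_private_fqdn() -> str:
        # GCE stores the private FQDN in instance metadata ("instance/hostname").
        resp = requests.get(
            "http://metadata.google.internal/computeMetadata/v1/instance/hostname",
            headers={"Metadata-Flavor": "Google"},
            timeout=5,
        )
        resp.raise_for_status()
        return resp.text.strip()

    def fqdn_from_reverse_dns(ip: str) -> str:
        # Resolve the DNS name at runtime via a reverse lookup (usable on GCE and Azure).
        return socket.gethostbyaddr(ip)[0]

    def azure_private_fqdn(vm_name: str, region: str) -> str:
        # Build the name from the default pattern when no custom DNS is configured.
        return f"{vm_name}.{region}.internal.cloudapp.net"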

Additionally, the Java driver has a resolve-contact-points option that we could probably try to set from c-s to disable name resolution, so that peer verification would rely on the IPs in the SAN extension (when the extension has no DNS names). But this solution is probably not OK.
