Xrootd TLS handshake failed: not an SSL/TLS record #7628

Open
ageorget opened this issue Jul 31, 2024 · 12 comments
Labels
xroot Targeting affects support for the xroot protocol (either door or pool)

Comments

@ageorget

ageorget commented Jul 31, 2024

Hi,

Over the last few days we have observed more and more SSL/TLS errors like the following on the CMS XRootD doors, without any change on the dCache side (9.2.22 on the doors, OpenJDK 17.0.10):

Jul 31 14:10:40 ccdcacli501 dcache@xrootd-ccdcacli501Domain[15154]: 31 Jul 2024 14:10:40 (xrootd-ccdcacli501) [] Server: TLS handshake failed: io.netty.handler.ssl.NotSslRecordException: not an SSL/TLS record.
Jul 31 14:10:40 ccdcacli501 dcache@xrootd-ccdcacli501Domain[15154]: 31 Jul 2024 14:10:40 (xrootd-ccdcacli501) [door:xrootd-ccdcacli501@xrootd-ccdcacli501Domain:AAYeifuYzUA] TLS handshake failed: io.netty.handler.ssl.NotSslRecordException: not an SSL/TLS record.
Jul 31 14:10:40 ccdcacli501 dcache@xrootd-ccdcacli501Domain[15154]: 31 Jul 2024 14:10:40 (xrootd-ccdcacli501) [door:xrootd-ccdcacli501@xrootd-ccdcacli501Domain:AAYeifuYzUA] Uncaught exception in thread xrootd-net-2
Jul 31 14:10:40 ccdcacli501 dcache@xrootd-ccdcacli501Domain[15154]: io.netty.handler.codec.DecoderException: io.netty.handler.ssl.NotSslRecordException: not an SSL/TLS record
Jul 31 14:10:40 ccdcacli501 dcache@xrootd-ccdcacli501Domain[15154]:         at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:499)
Jul 31 14:10:40 ccdcacli501 dcache@xrootd-ccdcacli501Domain[15154]:         at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:290)
Jul 31 14:10:40 ccdcacli501 dcache@xrootd-ccdcacli501Domain[15154]:         at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444)
Jul 31 14:10:40 ccdcacli501 dcache@xrootd-ccdcacli501Domain[15154]:         at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
Jul 31 14:10:40 ccdcacli501 dcache@xrootd-ccdcacli501Domain[15154]:         at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
Jul 31 14:10:40 ccdcacli501 dcache@xrootd-ccdcacli501Domain[15154]:         at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410)
Jul 31 14:10:40 ccdcacli501 dcache@xrootd-ccdcacli501Domain[15154]:         at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:440)
Jul 31 14:10:40 ccdcacli501 dcache@xrootd-ccdcacli501Domain[15154]:         at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
Jul 31 14:10:40 ccdcacli501 dcache@xrootd-ccdcacli501Domain[15154]:         at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919)
Jul 31 14:10:40 ccdcacli501 dcache@xrootd-ccdcacli501Domain[15154]:         at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:166)
Jul 31 14:10:40 ccdcacli501 dcache@xrootd-ccdcacli501Domain[15154]:         at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:788)
Jul 31 14:10:40 ccdcacli501 dcache@xrootd-ccdcacli501Domain[15154]:         at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:724)
Jul 31 14:10:40 ccdcacli501 dcache@xrootd-ccdcacli501Domain[15154]:         at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:650)
Jul 31 14:10:40 ccdcacli501 dcache@xrootd-ccdcacli501Domain[15154]:         at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:562)
Jul 31 14:10:40 ccdcacli501 dcache@xrootd-ccdcacli501Domain[15154]:         at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997)
Jul 31 14:10:40 ccdcacli501 dcache@xrootd-ccdcacli501Domain[15154]:         at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
Jul 31 14:10:40 ccdcacli501 dcache@xrootd-ccdcacli501Domain[15154]:         at org.dcache.util.CDCThreadFactory.lambda$newThread$0(CDCThreadFactory.java:42)
Jul 31 14:10:40 ccdcacli501 dcache@xrootd-ccdcacli501Domain[15154]:         at java.base/java.lang.Thread.run(Thread.java:840)
Jul 31 14:10:40 ccdcacli501 dcache@xrootd-ccdcacli501Domain[15154]: Caused by: io.netty.handler.ssl.NotSslRecordException: not an SSL/TLS record
Jul 31 14:10:40 ccdcacli501 dcache@xrootd-ccdcacli501Domain[15154]:         at io.netty.handler.ssl.ReferenceCountedOpenSslEngine.unwrap(ReferenceCountedOpenSslEngine.java:1189)
Jul 31 14:10:40 ccdcacli501 dcache@xrootd-ccdcacli501Domain[15154]:         at io.netty.handler.ssl.ReferenceCountedOpenSslEngine.unwrap(ReferenceCountedOpenSslEngine.java:1426)
Jul 31 14:10:40 ccdcacli501 dcache@xrootd-ccdcacli501Domain[15154]:         at io.netty.handler.ssl.ReferenceCountedOpenSslEngine.unwrap(ReferenceCountedOpenSslEngine.java:1469)
Jul 31 14:10:40 ccdcacli501 dcache@xrootd-ccdcacli501Domain[15154]:         at io.netty.handler.ssl.SslHandler$SslEngineType$1.unwrap(SslHandler.java:223)
Jul 31 14:10:40 ccdcacli501 dcache@xrootd-ccdcacli501Domain[15154]:         at io.netty.handler.ssl.SslHandler.unwrap(SslHandler.java:1353)
Jul 31 14:10:40 ccdcacli501 dcache@xrootd-ccdcacli501Domain[15154]:         at io.netty.handler.ssl.SslHandler.decodeNonJdkCompatible(SslHandler.java:1257)
Jul 31 14:10:40 ccdcacli501 dcache@xrootd-ccdcacli501Domain[15154]:         at io.netty.handler.ssl.SslHandler.decode(SslHandler.java:1297)
Jul 31 14:10:40 ccdcacli501 dcache@xrootd-ccdcacli501Domain[15154]:         at io.netty.handler.codec.ByteToMessageDecoder.decodeRemovalReentryProtection(ByteToMessageDecoder.java:529)
Jul 31 14:10:40 ccdcacli501 dcache@xrootd-ccdcacli501Domain[15154]:         at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:468)
Jul 31 14:10:40 ccdcacli501 dcache@xrootd-ccdcacli501Domain[15154]:         ... 17 common frames omitted

The strange thing is that the access log reports the transaction as OK and nothing else:

level=INFO ts=2024-07-31T14:10:40.712+0200 event=org.dcache.xrootd.connection.start session=door:xrootd-ccdcacli501@xrootd-ccdcacli501Domain:AAYeifuYzUA socket.remote=128.142.196.239:15788 socket.local=134.158.209.241:1094
level=INFO ts=2024-07-31T14:10:40.714+0200 event=org.dcache.xrootd.request session=door:xrootd-ccdcacli501@xrootd-ccdcacli501Domain:AAYeifuYzUA request=protocol response=ok
level=INFO ts=2024-07-31T14:10:40.719+0200 event=org.dcache.xrootd.connection.end session=door:xrootd-ccdcacli501@xrootd-ccdcacli501Domain:AAYeifuYzUA

Adrien

@samuambroj

samuambroj commented Jul 31, 2024

Hi,

We have commented this issue today in the dCache Tier-1 Meeting with @lemora and @kofemann.
On Monday, our CMS representative complained that the SAM tests for Xrootd were failing with timeouts, although only about 50% of the time.
The first thing we found in the logs was the TLS handshake errors you mention. But after searching hard for culprits, @XMol and I found a clear correlation with the start of the failing CMS Xrootd SAM tests: a sudden spike in the number of TCP connections in CLOSE_WAIT state on the Xrootd doors. After restarting the doors, the hanging TCP sockets were gone and the SAM Xrootd tests recovered. Why did it happen? Maybe a client sending too many transfers, a network glitch... We do not know, and we have been unable to find a clear reason in the logs.
A detailed explanation, clearly showing that sockets in CLOSE_WAIT state and timeouts on incoming connections are correlated, can be found in this article from 2016.
So, @ageorget, could you please verify that you do not have long-lasting TCP sockets in CLOSE_WAIT state on your Xrootd doors?
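
For example (a minimal sketch, assuming ss and netstat are available and the doors listen on the default port 1094):

# Count CLOSE_WAIT sockets on the xrootd door port (sketch; adjust the port if needed)
ss -Htan state close-wait '( sport = :1094 )' | wc -l
# or, with the older tooling:
netstat -ant | awk '$6 == "CLOSE_WAIT" && $4 ~ /:1094$/' | wc -l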

I forgot to mention: dCache 9.2.22, openjdk 17 and RHEL-8.

Cheers,
Samuel

@ageorget
Author

Hi Samuel,

Good to know. We have also experienced timeouts with the CMS SAM tests for Xrootd since yesterday afternoon, with the same observation: about 50% of the SAM tests failed with timeouts.

Checking read of SAM dataset test file
root://[email protected]:1094//pnfs/in2p3.fr/data/cms/disk/data/store/mc/SAM/GenericTTbar/AODSIM/CMSSW_9_2_6_91X_mcRun1_realistic_v2-v1/00000/CE860B10-5D76-E711-BCA8-FA163EAA761A.root
[E] xrootd file open error, XRootDStatus.code=306 "[FATAL] Redirect limit has been reached"

https://monit-grafana.cern.ch/goto/UgDTDSrIg?orgId=20
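
For the record, the failing SAM read can be reproduced by hand with client-side debugging turned on, to see where the redirect loop happens (a sketch; <door-host> stands for the redacted door above, and a valid CMS proxy or token is assumed):

# Sketch: re-run the SAM read with XRootD client debug logging enabled
XRD_LOGLEVEL=Debug xrdcp -f \
  'root://<door-host>:1094//pnfs/in2p3.fr/data/cms/disk/data/store/mc/SAM/GenericTTbar/AODSIM/CMSSW_9_2_6_91X_mcRun1_realistic_v2-v1/00000/CE860B10-5D76-E711-BCA8-FA163EAA761A.root' \
  /dev/null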

But I don't know whether it's related to this SSL/TLS issue, because I have been finding these errors in the Xrootd door log since the end of May...

Stephan from CMS just answered that the ETF team may have picked up a newer, broken xrootd client version yesterday. They are trying to downgrade it to fix the SAM tests.

Adrien

@samuambroj

Hi Adrien,

I am not sure, because we had a full site downtime last week, reinstalled all the dCache servers hosting the doors, and do not keep the old door logs; but I think we have also seen these TLS handshake errors on the doors in the past.
Do you happen to have historical access to TCP states on the Xrootd doors? Just to see whether you also had hanging sockets in CLOSE_WAIT state during the time the SAM tests were experiencing timeouts.
As mentioned before, we do not know the reason for it, but your comment about a broken xrootd client could help identify the origin of the problem ;-)
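
If nothing is collecting TCP state metrics already, a simple way to build such a history (a sketch, assuming ss and the default door port 1094) is:

# Sketch: log the CLOSE_WAIT count on the door port once a minute for later correlation
while sleep 60; do
    echo "$(date -Is) close_wait=$(ss -Htan state close-wait '( sport = :1094 )' | wc -l)"
done >> /var/log/xrootd-close-wait.log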

Cheers,
Samuel

@ageorget
Author

ageorget commented Jul 31, 2024

Do you happen to have historical access to TCP states on the Xrootd doors? Just to see whether you also had hanging sockets in CLOSE_WAIT state during the time the SAM tests were experiencing timeouts.

No, I don't see any suspect CLOSE_WAIT TCP connections on the xrootd doors:
[screenshot: TCP connection state counts on the xrootd doors]

@DmitryLitvintsev
Member

DmitryLitvintsev commented Jul 31, 2024

Just to add to this discussion: we see the same errors from the etf-07.cern.ch hosts.
We also see spikes of CLOSE_WAIT connections, during which the xrootd door uses up all available worker threads.
I am investigating.

We upgraded to Alma9 and deployed 9.2.22. We had not seen CLOSE_WAIT connections in large numbers before the upgrade.

Another bit: the "not an SSL/TLS record" errors definitely come from the SAM tests and do not seem to correlate with the CLOSE_WAIT situation. I have started to monitor the threads used by the xrootd door; the thread spikes occur maybe once every two days, whereas these "not an SSL/TLS record" errors occur more frequently. To me it looks like the server thinks the client is not using an SSL connection (as if you used http instead of https with WebDAV).
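
One way to check that hypothesis would be to capture the traffic from the ETF client seen in the access log (128.142.196.239) and inspect offline whether the bytes sent after the protocol request are actually a TLS ClientHello (a sketch, assuming tcpdump is available on the door host):

# Sketch: capture xrootd traffic from the ETF host for offline inspection in Wireshark
tcpdump -i any -s 0 -w /tmp/xrootd-etf.pcap 'tcp port 1094 and host 128.142.196.239'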

We do not restart the door; it recovers by itself.

Dmitry

@DmitryLitvintsev
Member

The strange thing is that the access log reports the transaction as OK and nothing else:

level=INFO ts=2024-07-31T14:10:40.712+0200 event=org.dcache.xrootd.connection.start session=door:xrootd-ccdcacli501@xrootd-ccdcacli501Domain:AAYeifuYzUA socket.remote=128.142.196.239:15788 socket.local=134.158.209.241:1094
level=INFO ts=2024-07-31T14:10:40.714+0200 event=org.dcache.xrootd.request session=door:xrootd-ccdcacli501@xrootd-ccdcacli501Domain:AAYeifuYzUA request=protocol response=ok
level=INFO ts=2024-07-31T14:10:40.719+0200 event=org.dcache.xrootd.connection.end session=door:xrootd-ccdcacli501@xrootd-ccdcacli501Domain:AAYeifuYzUA

Adrien

When a connection to the xrootd door is made, a few things are exchanged. I think all that the access-log entry means is that the protocol call succeeded; the server then threw the exception, so no proper entry was written to the log.

@DmitryLitvintsev
Member

This is the time distribution of errors on the xrootd door at the CMS site:

select date_trunc('hour', datestamp) as time, count(*), sum(case when errorcode <> 0 then 1 else 0 end) as errors from billinginfo where datestamp > current_date  and initiator like 'door:Xrootd-%'  group by time order by time asc; 
          time          | count | errors 
------------------------+-------+--------
 2024-07-31 00:00:00-05 | 10045 |     19
 2024-07-31 01:00:00-05 | 10782 |      9
 2024-07-31 02:00:00-05 |  8972 |     15
 2024-07-31 03:00:00-05 | 12413 |     22
 2024-07-31 04:00:00-05 | 15987 |     17
 2024-07-31 05:00:00-05 | 15684 |     21
 2024-07-31 06:00:00-05 | 20845 |     23
 2024-07-31 07:00:00-05 | 16651 |     16
 2024-07-31 08:00:00-05 | 12087 |     18
 2024-07-31 09:00:00-05 | 12716 |     19
 2024-07-31 10:00:00-05 | 18714 |     14
 2024-07-31 11:00:00-05 |  2398 |     52
(12 rows)

We experienced a spike in the thread count on the Xrootd door at about 2 AM. All errors between 2 AM and 4 AM looked like:

 connectiontime |                                   errormessage                                    
----------------+-----------------------------------------------------------------------------------
        1572899 | Client disconnected without closing file.
        5285789 | Client disconnected without closing file.
        2066893 | Client disconnected without closing file.
        3363181 | Post-processing failed: File size mismatch (expected=1014285505, actual=92274688)
        3913444 | Client disconnected without closing file.
        2919113 | Client disconnected without closing file.
        2729012 | Client disconnected without closing file.
        3629109 | Client disconnected without closing file.
        5069413 | Client disconnected without closing file.
         300000 | No connection from client after 300 seconds. Giving up.
        3937066 | Client disconnected without closing file.
         867855 | Client disconnected without closing file.
        3048245 | Client disconnected without closing file.
        2413614 | Client disconnected without closing file.
        6421148 | Client disconnected without closing file.
        2246886 | Client disconnected without closing file.
         104141 | Client disconnected without closing file.
        2622108 | Post-processing failed: File size mismatch (expected=104937297, actual=50331648)
        1927925 | Post-processing failed: File size mismatch (expected=1014285505, actual=41943040)
         997368 | Client disconnected without closing file.
        4115686 | Post-processing failed: File size mismatch (expected=303559167, actual=100663296)
        3136051 | Post-processing failed: File size mismatch (expected=3996099675, actual=67108864)
        2230348 | Post-processing failed: File size mismatch (expected=129602261, actual=50331648)
        5012781 | Post-processing failed: File size mismatch (expected=303513742, actual=117440512)
        2040690 | Post-processing failed: File size mismatch (expected=366576405, actual=33554432)
         430277 | Client disconnected without closing file.
        1143861 | Client disconnected without closing file.
        1312073 | Client disconnected without closing file.
        2132199 | Client disconnected without closing file.
         212419 | Client disconnected without closing file.
         291383 | Client disconnected without closing file.
        2243652 | Post-processing failed: File size mismatch (expected=3984752381, actual=41943040)
        1905640 | Post-processing failed: File size mismatch (expected=366576405, actual=41943040)
        2311224 | Post-processing failed: File size mismatch (expected=3996099675, actual=58720256)
        1921058 | Post-processing failed: File size mismatch (expected=91600215, actual=50331648)
        3827332 | Post-processing failed: File size mismatch (expected=2213161106, actual=75497472)
(36 rows)

Again, this is NOT related to the SSL/TLS issue, but I think it is related to the CLOSE_WAIT situation.

https://github.com/DmitryLitvintsev/scripts/blob/master/bash/dcache/dcache_thread_count.sh

I use the above script to monitor thread counts, and I trigger a thread dump of the domain plus a netstat -apn when the thread count exceeds 900.

(This is because the default xrootd.limits.threads is 1000; if you have a higher setting in your layout/config, adjust the threshold accordingly.)

I would like you to run this script at your site.
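
In case it helps, the core of the approach is roughly the following (a simplified sketch, not the linked script itself; the domain name and threshold are examples):

# Simplified sketch: dump threads and sockets when the domain's thread count exceeds the threshold
THRESHOLD=900
PID=$(pgrep -f xrootdDomain | head -1)            # example domain name, adjust to your layout
THREADS=$(ps -o nlwp= -p "$PID" | tr -d ' ')      # number of threads in the JVM
if [ "$THREADS" -gt "$THRESHOLD" ]; then
    jstack "$PID" > /tmp/xrootd-threads-$(date +%s).txt   # requires a JDK on the host
    netstat -apn > /tmp/xrootd-netstat-$(date +%s).txt
fi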

@ageorget
Author

ageorget commented Aug 1, 2024

Well, the errors stopped yesterday afternoon, as soon as CMS rolled back their Xrootd client version, I think.

         time          | count | errors 
------------------------+-------+--------
 2024-07-31 09:00:00+02 |   869 |    208
 2024-07-31 10:00:00+02 |  2358 |    253
 2024-07-31 11:00:00+02 |   906 |    243
 2024-07-31 12:00:00+02 |  2418 |    192
 2024-07-31 13:00:00+02 |  1375 |    112
 2024-07-31 14:00:00+02 |  1639 |    139
 2024-07-31 15:00:00+02 |   716 |     20
 2024-07-31 16:00:00+02 |  1913 |     48
 2024-07-31 17:00:00+02 |  5279 |      0
 2024-07-31 18:00:00+02 |  1410 |      0
 2024-07-31 19:00:00+02 |  1296 |     10
 2024-07-31 20:00:00+02 |  2070 |     11
 2024-07-31 21:00:00+02 |  1607 |      2
 2024-07-31 22:00:00+02 |  1655 |      2
 2024-07-31 23:00:00+02 |  1738 |      0
 2024-08-01 00:00:00+02 |  5221 |     10
 2024-08-01 01:00:00+02 |  5946 |     20
 2024-08-01 02:00:00+02 |  6517 |      9
 2024-08-01 03:00:00+02 |  5867 |     11
 2024-08-01 04:00:00+02 |  5677 |      4
 2024-08-01 05:00:00+02 |  4668 |      5
 2024-08-01 06:00:00+02 |  3168 |      2
 2024-08-01 07:00:00+02 |  2844 |      3
 2024-08-01 08:00:00+02 |  1642 |      2
 2024-08-01 09:00:00+02 |  1309 |     17
 2024-08-01 10:00:00+02 |  1189 |      0
 2024-08-01 11:00:00+02 |  1306 |      0
 2024-08-01 12:00:00+02 |  1752 |      5
 2024-08-01 13:00:00+02 |  1547 |      3
 2024-08-01 14:00:00+02 |  1213 |     10

Now I only get errors on the doors like: Server: TLS handshake failed: java.nio.channels.ClosedChannelException.

@samuambroj

Same behaviour at KIT. The errors with TLS handshake failed: not an SSL/TLS record stopped yesterday around 16:20 h (CEST).

@kofemann
Member

kofemann commented Aug 1, 2024

@ageorget @samuambroj Thanks for the update. Nonetheless, we (dcache) should find out what's going on there anyway....

@kofemann added the xroot label on Aug 1, 2024
@DmitryLitvintsev
Member

The SSL errors look like a non-SSL connection being made to a host that expects SSL.
I think we need to handle that exception so that we don't generate a stack trace.
I will look into it.

@geonmo

geonmo commented Sep 6, 2024

The issue with multiple CLOSE_WAITs and peak CPU utilization on the xrootd server for CMS AAA is believed to be caused by a bug in OpenSSL 3.0 (the C library used via voms_api to extract VOMS information).
We had a similar issue at our site and found that an xrootd configuration with the VOMS extraction part removed did not generate CLOSE_WAIT sockets.
The related OpenSSL issues and PRs were finalized in July; we are currently testing the openssl 3.2-dev branch instead of the latest release.
For one week we have not experienced CLOSE_WAIT sockets or increased CPU utilization in xrootd.
However, we have recently experienced authentication-related errors in dCache, and some users have reported TLS-related errors, so the CLOSE_WAIT issue and the TLS error issue seem to have different causes.
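
A quick way to check which OpenSSL build a running xrootd process has actually loaded, versus the system default (a sketch; the process name is an example):

# Sketch: show the libssl/libcrypto shared objects mapped by a running xrootd process
PID=$(pgrep -o -x xrootd)                          # example process name
grep -hE 'libssl|libcrypto' /proc/$PID/maps | awk '{print $6}' | sort -u
openssl version                                    # system default OpenSSL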
