Connection was interrupted while the page was loading #439
Comments
@jwr Hi Jan, sorry to hear about the trouble! Thanks for the detailed report, the screenshots are helpful. No obvious ideas come to mind yet, though I'm happy to dig more. In the meantime, you mentioned:
So it sounds like this is a recent development. Has anything possibly relevant changed recently? E.g. updated Sente, http-kit, Ring middleware, etc. If so, that could help narrow down where to look.
That's the first thing I checked. Absolutely nothing changed server-side recently, and that includes not just the app and libraries, but also the entire server stack. I even rebooted the servers just to be sure. There were no recent changes in the app (ClojureScript) either.

The things that were changing that I know of: user database sizes (the database gets sent after the initial connect, so that can be a factor), as these keep increasing constantly, and obviously user machines (system updates, browser updates, etc.). The database sizes were the first thing I suspected, but one of the users is able to log in and work normally on one machine, but not on several others. As for browsers and operating systems, these are entirely out of my control and they do change frequently. But if users report problems on multiple machines and browsers, including some that were not updated recently, I would probably look elsewhere.

In other words, trying to track down what changed doesn't lead me anywhere, so I tried understanding where the error message comes from and what produces it. Unfortunately, that search wasn't fruitful either: most other reports I found were old and/or unrelated.

One thing that seems to appear is a 45s interval between retries. I am not sure where that comes from. Could Sente have internal timeouts that come into play? But then again, some of my users sometimes load data for longer than 45s, so I would have seen this problem earlier.

By this point I even started suspecting networks and firewalls along the way. The only approach I can think of right now is trying to understand "how could this possibly happen".
Okay, great - thanks for confirmation. That's the ideal situation and should help tree-shake possibilities :-) One potential explanation that comes to mind then is that something's changed with browser behaviour, but would need to dig further to advise on the likeliest causes. Do I understand correctly that this isn't head-on-fire urgent? If so, I'll aim to investigate further tomorrow and will update you.
Thank you for offering to help! No, this is not a total showstopper, because it doesn't affect all connections, just a select few. I don't think changes in browser behavior are to blame. I gathered information from one of the affected users and:

Location A:
Location B:
Location C (network connectivity via Starlink):

This would seem to indicate something network-related or timing-related. But I don't even understand what the error message means: "connection interrupted while the page was loading". Does this mean a connection was established and then interrupted? Was the interruption unexpected or caused by a timeout in the browser? I'm completely baffled.
Thanks to the kindness of my users, I now have traces of a normal page/app reload and a failed one, in Chrome. There are no console messages appearing in Chrome, but something happens to the websocket, too.

This is what should normally happen:

(the long wait is normal: loading a database can easily take a minute, it took about 46s in this case it seems)

Now, this is how a failed page/app reload looks:

The only difference between these two screenshots is the network. Both were taken on the same (Windows) PC with Chrome, minutes apart. The failed one was over a cellular hotspot connection, and the working one was over a Starlink connection.
After some more debugging, deploying versions with extended logging, and a number of tests: it appears that it is Sente that is killing the websocket connections, specifically because of a ws-ping timeout.

On one hand, I feel stupid, because that sounds obvious. On the other hand, my mental model of how ws-ping works was different (I thought pings get sent only on an idle connection, not while waiting for a response), and I thought ws-ping had a default 5s timeout.

I still don't understand why this happens, where the (roughly) 45s timeout comes from (I relied on Sente defaults, and I can't find a 45s value anywhere, only 5s), why this issue doesn't affect many more users, or why it recently started affecting the two users that reported it. But I do know right now that adding an explicit
Hi @jwr, thanks for all the extra info - will investigate now and come back to you 👍
Will address all your questions, just want to confirm a few details so long.
Edit to add:
Thanks!
Without entering the thick of it, I would like to point the finger at carrier-grade NAT, a practice by ISPs that allows them to share small pools of public addresses among many end users. Some time ago, I was observing constant interruptions of websocket connections. I called my ISP and they told me that they were aware of the problem, explained it was due to carrier-grade NAT, and upgraded my subscription. No more carrier-grade NAT: I now get assigned a public IP address that is not shared (still dynamic though). There were no more interruptions of websocket connections.
@danielsz Hi Daniel, thanks for the extra data point. Could I ask you to please create another issue describing your experience in more detail? For example - did you see this specifically affect Sente, how did it manifest, etc.? I've not heard of the phenomenon before so any pointers you can give would be handy. In principle even if it's caused by an ISP - I'd still consider something like that to be a Sente bug since Sente needs to be able to ~gracefully work around problems it's likely to encounter in the real world.
Sure, I'll try, but that was a long time ago and I can't reproduce the problem to check additional details.
@danielsz Understood, but however little you can remember would be helpful - and it'd be nice to at least have a dedicated issue open so that if anyone else encounters the same thing we can start collecting experiences in one place.
Yes. Although I would not focus on the "log in" part too much — for example the Chrome traces above were for reloads of an authenticated session, so "logging in" did not factor into it.
Yes. Specifically, when the app loads and there is an authenticated session (so, after log in, or if the session is already authenticated), there will be a "data load" request with a large response, and that can take anywhere from single seconds to more than a minute.
Yes. Although it is even slower for some other users. So, it isn't like these users crossed a threshold, and the others did not.
Both, really.
That's hard to estimate right now — my guess would be hundreds of kB to single megabytes.
Right after login, or when initializing the app, requesting data is one of the first things the app does.
Yes, it's behind nginx, and I've looked at the timeouts there, but cannot find anything that would be applicable. Also, from what I understand from the logs after improving logging, it seems that it is Sente that is closing the connection. That's what the "Client ws-ping to server timed-out, will cycle WebSocket now" message would indicate, right?

As to @danielsz's comment, there might be something to it. I had another customer who reported the same problem several weeks ago. I couldn't help him much, but he contacted his ISP, and they changed something, which caused things to work for him again. That would fit the "ISP NAT breaking websocket connections" hypothesis. I don't think this is what we're looking at in these specific two cases, but I think in general this is something that can happen.

EDIT: Also, I've been told that my app does not work in China, from behind their firewall. I haven't investigated this.
Edit to add: I sent this before seeing your latest response, please ignore anything irrelevant. @jwr Hi Jan, to update from my side:
In the meantime, to answer a few of your own questions:
I can't think of any obvious source of a 45s interval if you're using Sente's defaults. The relevant Sente timers I can think of would be: Server-side:
Client-side:
The general logic of the ping behaviour is: Server-side:
Client-side:
In other words both the server and client will:
Now it's possible you're seeing 45s as a result of some interaction of other timers - but nothing intentional/obvious comes to mind, so I'd first want to rule out timers from other layers in your stack (e.g. nginx, etc.).
Pings are only sent when idle, but when they're sent without a response - that's taken as a signal that something's wrong with the connection. I.e. pings are used to distinguish between:
Note that the treat-missing-pong-as-disconnected logic is only currently enabled by default for client->server pings, not server->client pings. The latter was only added in a recent version of Sente, and for reasons explained here I didn't want to enable this by default yet. It can be enabled manually.
Interesting. Just to confirm:
Great, thanks for the answers 👍 My current sorted hunches would be:
We could rule out (1) based on your http-kit server version and/or config (notably thread count). And we could rule out (2) if you could maybe share the relevant parts of your nginx config. (Feel free to email if there's anything in there you'd rather not post publicly). If we can rule out both, I'll continue down the chain.
I use http-kit 2.7.0, mostly with defaults. The only parameters to I can reproduce the problem when running my app locally (http-kit only, no nginx proxying, single client connection) and connecting with Chrome with network throttling set to "Fast 3G". Here are the relevant logs, edited for clarity, note that the log includes both server-side and client-side:
What seems to be happening is that the socket gets opened, and my software immediately sends a

After about 40s (note the client-side timestamps can be different from server-side ones) Sente sends a ws-ping message, which is received on the server. And 5s later, once the

So, the 45s interval comes from the sum of

After that, the cycle repeats, and the data load never completes, because it never has the chance to arrive in full.

There is still much that I do not understand here. I don't understand the Sente concept of an 'idle connection'. And my mental model of a Sente connection and pings was incorrect (though to be honest I never gave it much thought): I thought of a Sente connection like a TCP connection, where "activity" is defined as any data bytes being sent or received. In other words, I thought a Sente connection that is receiving data would be "active".

I also do not understand why this only came up recently. I have many users with much longer load times. Somehow this interaction does not always come into play.

I think with the current behavior of

I hope this moves us forward! I also hope some of this can result in an improvement to Sente for everyone.
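The reconnect loop described above can be sketched as a small model. This is only an illustration, not Sente's implementation: the 40s and 5s figures are the intervals observed in the logs in this comment, the function names are hypothetical, and the model assumes the pong never gets through while the large response is still streaming.

```python
def seconds_until_cycle(idle_before_ping_s: float, pong_timeout_s: float) -> float:
    """Time from request to forced reconnect if the pong never arrives."""
    return idle_before_ping_s + pong_timeout_s


def attempt_that_completes(transfer_s: float,
                           idle_before_ping_s: float = 40.0,  # observed in the logs, not a documented default
                           pong_timeout_s: float = 5.0,       # observed in the logs
                           max_attempts: int = 5):
    """Return the attempt number on which the load finishes, or None if it never does."""
    deadline = seconds_until_cycle(idle_before_ping_s, pong_timeout_s)
    for attempt in range(1, max_attempts + 1):
        # Assumption: every reconnect restarts the transfer from zero,
        # so each attempt faces exactly the same deadline.
        if transfer_s < deadline:
            return attempt
    return None


print(seconds_until_cycle(40.0, 5.0))
print(attempt_that_completes(30.0))
print(attempt_that_completes(60.0))
```

Under these assumptions a load either fits inside the first window or never completes, which matches the "cycle repeats" behaviour in the logs: retrying does not help, because nothing carries over between attempts.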
Hi Jan,
That will be a problem if you've got slow synchronous handlers. http-kit 2.7.0 only allocates 4 worker threads by default, and so can easily become starved of threads in this case. If that happens, it won't be able to respond to client ping requests - causing clients to disconnect. Would suggest you set http-kit server's
👍
This is the problem that I'm pointing out above. Your http-kit server should be able to reply with a pong if it's not thread-starved.
Unless your payload is very large and connection very slow, thread starvation seems a much more likely cause to me. I.e. my hunch is that your slow Ring handler isn't spending the majority of its time on WebSocket IO but on preparing the response. Would recommend some simple profiling to be sure. Tufte is one option, but some simple ad hoc
One possible explanation would be that as your concurrent user count and/or Ring handler costs have increased, you're running into thread starvation more often. Slow or flaky connections may be especially sensitive since they'll have the additional network delay to contend with.
I don't believe that your usage pattern should be a problem for the ping behaviour. You might want to tweak the client-side My advice would be to try bumping http-kit server's Please let me know how that goes.
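To make the thread-starvation hypothesis concrete, here is a minimal queueing sketch. It is a simplified model and not http-kit code: it assumes handlers are served FIFO by a fixed pool and that a ping has to wait for a free worker like any other request. The function name and the pool size of 16 are hypothetical; only the figure of 4 default worker threads comes from the comment above.

```python
import heapq


def ping_wait_seconds(worker_count: int, handler_seconds: list[float]) -> float:
    """Seconds a ping arriving at t=0 waits for a free worker.

    handler_seconds lists the run time of each handler already queued
    ahead of the ping, in FIFO order.
    """
    # Each entry is the time at which one worker becomes free.
    free_at = [0.0] * worker_count
    heapq.heapify(free_at)
    for duration in handler_seconds:
        start = heapq.heappop(free_at)           # earliest available worker takes the job
        heapq.heappush(free_at, start + duration)
    return free_at[0]                            # earliest moment any worker is free


slow_handlers = [30.0, 30.0, 30.0, 30.0]         # four slow synchronous handlers
print(ping_wait_seconds(4, slow_handlers))       # pool fully occupied
print(ping_wait_seconds(16, slow_handlers))      # spare workers remain
```

With four workers and four slow handlers, the ping waits far longer than a 5s pong timeout; with spare workers it is served immediately. A single client with a single slow request would not trigger this, which is consistent with the local reproduction reported later in the thread.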
Well, now that I have the problem reproducible, testing this hypothesis is easy. I added I would be surprised if it did: right now I am testing in a local setup, so there is a dedicated http-kit server with a single Chrome client. That single client downloads some static content and then opens a single websocket connection. There are no other clients connecting and no other traffic. I wouldn't expect that to lead to thread starvation.

Looking back at how ws-ping works, I am not sure how we can expect the ping response to make it back in time to the client, if the network is slow and the websocket connection is busy transmitting a large amount of data. If the response is stuck behind, say, several megabytes of data, and transmitting that data takes longer than 5s, it is going to time out, right?

I'm looking at this line in Sente: https://github.com/taoensso/sente/blob/a51a54a6d0372e7284e0c322b2c75e3804dbe1f8/src/taoensso/sente.cljc#L1511C25-L1511C63. It seems that this
Hi Jan, your explanation makes sense - thanks for all the work debugging 👍 I realise this is time away from your business, so probably frustrating.
👍 Though I'll note that it's possible to have even a single webpage issue multiple HTTP requests to different endpoints. Since we're talking about only 4 threads, it's not too difficult to get starved if there's expensive endpoints being hit.
A lot depends on how large the data and how slow the connection. If possible, it'd really be helpful to get some real numbers. Could you maybe check on the payload size in your tests demonstrating the problem? (Again, assuming I'm not missing some difficulty in checking that number). As an example, let's say a typical payload is 2MB and we're on a 1Mbit/sec connection. That'd mean ~16 secs to do the transfer (2MB is 16Mbit). Will that cause a disconnection? It depends on when the request is sent. The worst case with default options is:
I'd expect that to disconnect since the transfer will be in flight during the precise period that a pong is expected. But if the payload is 10MB on the same connection, then it doesn't even matter when the request is sent - since the transfer time (80 seconds) will certainly overlap the pong window and lead to a disconnect. If you are potentially talking about payloads of this kind of size (and/or connections this slow), then that definitely sounds like the source of trouble.

My first recommendation in that case would be to move the large payloads off Sente entirely. The big benefit of Sente/WebSockets is the ability to easily have ~bidirectional real-time comms. It had actually not crossed my mind before that someone might use a Sente channel for large data transfers so I hadn't considered the implications. It might work, to a point - but your example does highlight one of the issues. You might be able to try tuning the timeouts, etc. - but even in the best case you'd still ultimately be tying up your WebSocket channel for no benefit.

I'd recommend instead using your Sente channel only for small data (max transfer of a few seconds), and for signalling. E.g. the server could signal to the client that it should request payload X via Ajax, then the client can make that a separate request and leave Sente's channel open for notifications, etc. My own applications always use a mix of Sente and Ajax, since with Ajax you also have all the usual benefits of response caching, etc.

Does that make sense? Would that be viable in your case? If not, please let me know why and I'll consider alternative ideas. If it's any help, there's a convenient Ajax util in Sente to alias this.

I'll note: Sente's documentation definitely should make it clear to avoid large data transfers. I'm really sorry about the oversight! I'll get the documentation updated tomorrow.
As an aside: I would strongly recommend keeping the higher http-kit thread count, since that's undoubtedly going to lead to trouble at some stage even if it wasn't the cause of the trouble here.
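The transfer-time arithmetic in this comment can be checked with a short sketch. The payload sizes and link speed are the ones quoted above (read as megabytes over a megabit-per-second link); the function names, the ping times, and the rule "a transfer that covers the whole pong window causes a disconnect" are simplifying assumptions made for illustration, not Sente's actual logic.

```python
def transfer_seconds(payload_megabytes: float, link_megabits_per_s: float) -> float:
    """Idealised transfer time: 8 bits per byte, no protocol overhead."""
    return payload_megabytes * 8 / link_megabits_per_s


def covers_pong_window(transfer_start_s: float, transfer_s: float,
                       ping_sent_s: float, pong_timeout_s: float) -> bool:
    """True if the transfer is in flight for the entire window in which a pong is expected."""
    transfer_end_s = transfer_start_s + transfer_s
    window_end_s = ping_sent_s + pong_timeout_s
    return transfer_start_s <= ping_sent_s and transfer_end_s >= window_end_s


small = transfer_seconds(2, 1)    # 2MB over 1Mbit/sec
large = transfer_seconds(10, 1)   # 10MB over 1Mbit/sec
print(small, large)

# Hypothetical ping times, chosen only to show both outcomes.
print(covers_pong_window(0.0, small, ping_sent_s=5.0, pong_timeout_s=5.0))
print(covers_pong_window(0.0, small, ping_sent_s=20.0, pong_timeout_s=5.0))
print(covers_pong_window(0.0, large, ping_sent_s=40.0, pong_timeout_s=5.0))
```

The shorter transfer only collides with the pong window for some ping times, while the longer one covers it for any ping sent during the first 75 seconds, which is the distinction the comment draws between the two payload sizes.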
Hi Jan, some updates:
Next time I'm on batched Sente work, I'll pursue the other items on the checklist.
To provide some context: I don't really have the option of sending data via different channels. The whole point of using Sente in my application is to tie the client application to RethinkDB changefeeds. Here is a somewhat simplified explanation: when a user logs in, a changefeed is set up in the database to that user's data. That changefeed receives the initial data and then all subsequent changes. That changefeed is also tied to the Sente websocket connection. This needs to be transactional: you get all the data as of a certain point in time, and then get all the changes to that data. There is no way to safely and correctly do this in two separate operations. Of course things are much more involved than what I described (multiple changefeeds per user, etc.), but the general concept holds.

I am working on a rewrite that will use FoundationDB instead of RethinkDB. Given all the downsides of websockets I plan to stop using them altogether in the future. FoundationDB lets me implement similar safe transactional changefeed functionality using a distributed database, but also without the burden of persistent database changefeed connections or persistent websockets. Polling architectures are generally simpler and more resilient, so that's what I plan to move to.

In the meantime, I will keep increased ws-ping timeouts, and also look at splitting the large data message into smaller ones, which should be doable. This will result in more frequent updates to Sente's activity concept ( I keep thinking that what would solve the problem right away would be a way to update
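Splitting one large message into smaller ones, as mentioned above, could look roughly like the following. This is a generic sketch and not Sente functionality: the chunk envelope (`seq`, `total`, `data`) and both function names are hypothetical, and a real implementation would also need to tag chunks with a message id if several large messages can be in flight at once.

```python
def split_into_chunks(payload: bytes, chunk_size: int) -> list[dict]:
    """Split payload into ordered chunks, each small enough to send quickly."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    # An empty payload still yields one (empty) chunk so the receiver sees a message.
    total = max(1, (len(payload) + chunk_size - 1) // chunk_size)
    return [
        {"seq": i, "total": total, "data": payload[i * chunk_size:(i + 1) * chunk_size]}
        for i in range(total)
    ]


def reassemble(chunks: list[dict]) -> bytes:
    """Rebuild the payload once every chunk has arrived, in any order."""
    ordered = sorted(chunks, key=lambda c: c["seq"])
    if not ordered or len(ordered) != ordered[0]["total"]:
        raise ValueError("missing chunks")
    return b"".join(c["data"] for c in ordered)


payload = bytes(range(256)) * 10                  # 2560 bytes of sample data
chunks = split_into_chunks(payload, 1000)
print(len(chunks))
print(reassemble(list(reversed(chunks))) == payload)
```

Each chunk would then be sent as its own message, so the receiving side sees activity every few hundred milliseconds or seconds instead of one long silent transfer. Whether that is enough to keep the ping logic satisfied depends on how the library tracks activity, which this sketch does not model.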
Thanks for the extra info. I'm not familiar with RethinkDB so can't comment in detail, but that does sound unfortunate that you seem to have such limited control over how data is sent.
Just to make sure we're on the same page: besides the unsuitability for large data transfers, what downsides do you have in mind?
Well, if it's possible to split the large data message into smaller ones, that would certainly help the present issue.
I'm not entirely clear on what you're comparing here, but in case it's relevant - just double checking that you're aware that you can also disable WebSockets on Sente and just run it over long-polling?
Not with a WebSocket as far as I'm aware, but I haven't looked into it in detail.
I do have control over how data is sent (I'm the one sending it), but I have to worry about correctness. Writing chat apps is easy, writing ERP apps less so :-) The key here is that the client needs to get the full data (up to a certain point in db-transactional-time) and then a stream of subsequent changes. On the server, that's a single "establish a changefeed" database operation. I can either map it roughly 1:1 to a client websocket connection by just dumping the data over the connection to the clients, or maintain a (costly and complex) system for caching that information and providing it to clients over AJAX calls. The RethinkDB changefeed system solves a difficult problem really well, and together with Sente was a good solution in my case for more than 8 years now. Unfortunately, RethinkDB did not become fashionable (unlike the substantially worse MongoDB which it was often compared to), and it doesn't get much development anymore. That's why I'm planning to migrate to another database. Another way to approach this kind of problem is a bi-temporal database (get db state to a specified point in time, then poll for changes afterwards). Or any database with a data model that allows for detecting changes after a point in time in a correct way, which is what I'm working towards with FoundationDB. But my current database does not allow me to ask for changes up to a point in time and then get updates after that point in time in a performant and transactionally correct way.
Multiple things:
This is why, as I'm redesigning the data model to take advantage of the incredible features that FoundationDB offers, I am also making sure that I will be able to move to a simple polling model. If the cost of checking for database changes is nearly 0, polling an endpoint is a great solution and would let me get rid of a lot of complexity. Yes, I am aware that I can use Sente with long-polling only, but much as I like Sente, I'd rather not use it if I can, like with every other piece of code. It wouldn't bring many advantages in that case. Now, trying to slowly wrap this up:
Hi Jan, thanks for the detailed and thoughtful reply - that definitely helps me understand if there's anything I can improve on Sente's end (or if there's anything constructive I can suggest). In short: your current plan sounds reasonable to me given what I understand of your architecture and objectives. This is good to know about, thanks for the link!
In case it's helpful while you're redesigning things, I'll note that http-kit isn't particularly well suited to large data transfers in general. One example: its IO is single threaded, so there's a limit to how much data can ultimately be served by a single instance. In my experience that limit is rarely hit in real-world applications, but it can happen if you have high user loads with high IO per user. This is most likely in cases where you have large servers (>16 core) where it's possible to produce sustained heavy IO that dominates over CPU load. But a lot depends on the particular load type and application architecture. I've recently added a very thorough benchmark suite to http-kit which may be informative if this is ever something you'd want to explore.
My main concerns would be
In principle that'd be easy to do, but would push a fair amount of incidental complexity to application authors that I'd prefer to avoid. If there were a desire to officially support large payloads through Sente, I think my first inclination would be to add auto chunking on Sente's side. But ultimately:
I'm 100% in support of this conclusion. If you're not benefitting from the specific advantages that Sente offers, then far better to remove it from your stack. The less software you can run the better 👍
Feel free to close if you're satisfied with your current workaround 👍 And feel free to ping any time if you have other questions or if there's some other way that I can assist.
I am looking for help/advice, because I can't track this problem down. Some of my users started running into problems and can't log into the application. I can't reproduce the issue and it's relatively rare: so far two reports out of several thousand worldwide users. It doesn't seem to be browser-related, because these users see problems in both Chrome and Firefox, and on Windows and Linux.
The visible symptom is that there are these messages in the Firefox console:
Here is a screenshot of the Firefox console, from a user. This shows the login flow: there is a 101 request establishing a socket connection (this seems to succeed), then a login POST request, which normally results in the client calling `sente/chsk-reconnect!` and a new websocket being established. But I have no idea why there are four websocket connection attempts in the screenshot, nor why the errors happen.

And this is the normal, expected flow:
This is rare and not reproducible for me. So far the reports are from users with noticeable latencies (450ms and 1.2s RTT). But I can't narrow it down further. The initial server->client message is fairly large for these users, but not the largest (and I've already increased the `:max-ws` parameter in http-kit). I increased the Sente timeout values for messages (though I don't think I can control any timeouts on `sente/chsk-reconnect!`).

The same failure happens in Chrome, although I don't have a console screenshot.
Any hints or ideas would be much appreciated.