GetOrchestrator slowness when PM Sender data is not cached #2849
I want to add log lines pulled from the Titan Node orchestrator while testing this. These log lines have since been removed, but they show where in the code the OrchestratorInfo response generation slowed down. You can also see that every other step of generating the response takes 4ms of the 484ms. When |
Been running this update for 48 hours, all streams and node stats look fine. |
Thanks @stronk-dev, I'm going to get someone to start looking at this today |
Looks like we have genuine test failures here, could you take a look please @stronk-dev? |
The tests have not been updated yet. We were looking for feedback on moving the eth RPC calls in the OrchestratorInfo response generation to the background, on changing the cache time from 60 seconds to 1 week, and on changing the cleanup interval from 1 minute to 10 minutes. There appears to be a blockWatcher that accounts for broadcaster reserve changes, but with a cache time of only 60 seconds it did not operate for very long. In Titan's testing, the OrchestratorInfo response would slow down at the 1-2 minute mark because of the eth RPC calls. The gist is that this change can significantly speed up OrchestratorInfo responses, which are required at the start of each stream and periodically during streams. The goal is to provide more reliable response times and, possibly, to shorten the initial session pool building to speed up first segment transcoding. |
I'm trying to catch up on this issue. So, if I understand correctly, there is a bug where, when a GetOrchestrator call and an RPC request happen at the same time, the GetOrchestrator response is delayed for some reason? Is that correct? I'm actually not sure how changing the timeouts solves this issue. Could you provide a more detailed description of that? About your questions: I don't know, I would need to dig into the code, but first I'm trying to understand the actual issue. |
There's no bug, but because the timeout is only 1 minute, an Orchestrator often has to do a blocking RPC call in order to respond to a GetOrchestrator request. In this case the Orchestrator needs up-to-date reserve info on the B making the GetOrchestrator request in order to respond. Depending on the speed, distance and availability of the L2 RPC provider (which can be especially problematic for regions like Singapore), this causes large fluctuations in response times to GetOrchestrator requests. Sometimes the response time is high enough for the Orchestrator to go above the threshold a B has set to populate their working set of Orchestrators. You can see here when Titan applied this patch and his response times immediately became excellent. Doing blocking RPC calls during a request like that is inefficient anyway. It's a good idea to make sure any data an O needs to respond to B requests is always up-to-date and available. We're trying to figure out here:
|
Ok, I understand, thanks for the explanation Marco. One thing I don't understand is why Other than that, my understanding is that, in theory, you should never need to refresh the cache because it's updated with the block feed here. I'm actually not sure why we're cleaning up this cache at all 🤔 That said, I think this PR is ok; to be on the safe side, we can refresh the cache from time to time. I'll put some comments inline. Other than that, I'd love to understand my question above ☝️ |
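The idea that a block feed can keep cached entries current, so that they never need a refetch, can be sketched as follows. This is an illustration only and does not reflect how go-livepeer's block watcher is implemented: `reserveEvent` and `applyEvents` are hypothetical names, and the cache is a plain map.

```go
package main

import "fmt"

// reserveEvent is a hypothetical event decoded from a new block.
type reserveEvent struct {
	sender     string
	newReserve int64
}

// applyEvents updates only senders that are already cached and
// returns how many events changed a cached entry.
func applyEvents(cache map[string]int64, events <-chan reserveEvent) int {
	updated := 0
	for ev := range events {
		if _, ok := cache[ev.sender]; !ok {
			continue // sender is not cached, nothing to keep in sync
		}
		cache[ev.sender] = ev.newReserve
		updated++
	}
	return updated
}

func main() {
	cache := map[string]int64{"0xB": 100}
	events := make(chan reserveEvent, 2)
	events <- reserveEvent{"0xB", 80}
	events <- reserveEvent{"0xC", 5}
	close(events)

	n := applyEvents(cache, events)
	fmt.Println("updated:", n, "reserve for 0xB:", cache["0xB"], "entries:", len(cache))
}
```

Under this model an entry stays correct for as long as it remains in the cache, which is why a longer cache time pairs naturally with event-driven updates.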
Added the comments. In general, I understand the issue and this PR, so I think we should get it merged.
It generates B-specific ticket parameters and has a special reject case if the Broadcaster has insufficient reserves to do a payout: Lines 315 to 326 in d8ad5c7
Lines 348 to 354 in d8ad5c7
go-livepeer/core/orchestrator.go Lines 226 to 236 in f350fba
Lines 192 to 201 in f350fba
Lines 264 to 274 in f350fba
go-livepeer/pm/sendermonitor.go Lines 155 to 162 in f350fba
Not sure how important these ticket parameters are for the B. But if the SenderMonitor keeps this data up-to-date anyway, increasing the timeout would be a nice first step |
84bec1f to 540af3b
In order to get this PR ready for merge, I think it's just 1 minor question ATM: Since |
Yes, in my opinion, we can remove it from |
Modified. I guess the added print for when a sender info gets cleared could be nice to keep in there |
Waiting for the CI build and then I'll merge! Thanks for the PR @stronk-dev |
Codecov Report
@@ Coverage Diff @@
## master #2849 +/- ##
=============================================
Coverage 56.53269% 56.53269%
=============================================
Files 89 89
Lines 19456 19456
=============================================
Hits 10999 10999
Misses 7849 7849
Partials 608 608
|
What does this pull request do?
Titan node had some curious insights into the relationship between your Eth RPC provider and the response times to GetOrchestrator requests, and started investigating. @ad-astra-video assisted, and they found that during a GetOrchestrator request a blocking RPC call could be made to check the float of the Broadcaster.
This commit is pulled from @ad-astra-video's fork of go-livepeer. It increases how long B info is cached and makes explicit calls to refresh data in the background every 10 minutes.
My questions for the person reviewing this:
cleanupInterval required, or are the blockwatchers already keeping all relevant B data up-to-date as long as the PM sender is cached? (go-livepeer/pm/sendermonitor.go, Lines 329 to 332 in 951f6e6)
cleanupInterval and are there any other calls we might want to do?