opengl performance improvements #1410
Instead of repeatedly calling glClientWaitSync with a 1 nanosecond timeout, call it with a 20 ms timeout and a flush. This decreases average GPU utilisation on my testbench by about 10 percentage points (~85% -> ~75%, 4 * 1080i5000 on a K620).
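The change described in this commit can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: `wait_result`, `wait_for_fence`, the `wait_once` callable, and the `max_calls` cap are all hypothetical names introduced here. In real code, `wait_once` would presumably wrap `glClientWaitSync(fence, GL_SYNC_FLUSH_COMMANDS_BIT, timeout_ns)`, whose timeout is given in nanoseconds.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>

// Hypothetical stand-in for the GLenum values returned by glClientWaitSync
// (GL_ALREADY_SIGNALED, GL_CONDITION_SATISFIED, GL_TIMEOUT_EXPIRED, GL_WAIT_FAILED).
enum class wait_result { already_signaled, condition_satisfied, timeout_expired, wait_failed };

// Waits for a fence using one long timeout per call instead of spinning
// with 1 ns timeouts. `wait_once(timeout_ns)` is a hypothetical callable;
// in real code it would wrap
//   glClientWaitSync(fence, GL_SYNC_FLUSH_COMMANDS_BIT, timeout_ns).
// `max_calls` is an assumption of this sketch so the loop cannot hang forever.
template <typename WaitOnce>
bool wait_for_fence(WaitOnce&& wait_once,
                    std::chrono::nanoseconds per_call = std::chrono::milliseconds(20),
                    int max_calls = 50)
{
    const auto timeout_ns = static_cast<std::uint64_t>(per_call.count());
    for (int i = 0; i < max_calls; ++i) {
        switch (wait_once(timeout_ns)) {
            case wait_result::already_signaled:
            case wait_result::condition_satisfied:
                return true;  // the GPU has passed the fence
            case wait_result::wait_failed:
                return false; // GL error; the caller decides what to do
            case wait_result::timeout_expired:
                break;        // blocked for the full timeout; wait again
        }
    }
    return false; // gave up after max_calls timeouts
}
```

With a 20 ms timeout the calling thread blocks inside the driver rather than polling in a tight loop, which is the behaviour the commit message credits for the lower utilisation.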
c4f3c9a to f1ae1dc
When running, do the blending via OpenGL; this is a tad faster.
f1ae1dc to e710ba2
Since we keep filling the command buffer, there is no need to flush, so we can safely forgo it. This marginally improves performance.
Trying this with 4x 1080i50 channels (each playing 2 AMB) on Ubuntu 22.04 with a GTX 1060, I am seeing GPU usage go from 40-45% to 38-42%, which is not a significant improvement. What GPU and OS are you using? On Windows it gets stuck in an error loop when playing any media with
It has been quite a while (~10 years) since I have had to think about optimising CUDA code, but from what I remember, branching is only an issue when threads in the same cluster take different routes. So for us, different branches being used for each frame being composited should have no major impact?
What is the cost of frequently switching shaders? Some layers on a channel could be on the fast shader and some on the slow shader.
As it currently stands, I am not convinced that this will give a noticeable performance benefit to most users, so I am not convinced it is worth the extra complexity.
Low-hanging fruit: change the glClientWaitSync behavior
Optimize the fragment shader to discard when invisible (alpha < 0.01)
Change the shader (less branching caused by non-uniform flow control)
Using a separate shader program as a "fast path" and using the OpenGL blend func (executed on the GPU ROP, fewer texture reads) shaves off another ~5%
Total improvement: from ~80-85% to ~50-55% utilization when running 4 HD 50i channels, each with 2 layers, which is close to the GPU utilization of 2.1.
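To make the "discard when invisible" and "blend func" items above more concrete, here is a small CPU-side model of what such a fast path computes per fragment. It is a sketch under stated assumptions, not the shader from this PR: it assumes premultiplied alpha and a blend setup of `glBlendFunc(GL_ONE, GL_ONE_MINUS_SRC_ALPHA)` with the default `GL_FUNC_ADD` equation, neither of which is confirmed by the description. The names `rgba`, `blend_over`, and `composite_fast_path` are hypothetical.

```cpp
#include <cassert>
#include <cmath>

// RGBA colour with premultiplied alpha (an assumption of this sketch).
struct rgba { float r, g, b, a; };

// Threshold from the list above: alpha < 0.01 is treated as invisible.
constexpr float invisible_alpha = 0.01f;

// What the fixed-function blender computes for
// glBlendFunc(GL_ONE, GL_ONE_MINUS_SRC_ALPHA) with GL_FUNC_ADD:
//   result = src * 1 + dst * (1 - src.a)
inline rgba blend_over(rgba src, rgba dst)
{
    const float k = 1.0f - src.a;
    return { src.r + dst.r * k, src.g + dst.g * k, src.b + dst.b * k, src.a + dst.a * k };
}

// CPU model of the fast path: an invisible fragment is skipped (the
// shader's `discard`), otherwise the blender combines source and
// destination without the shader reading the destination texture.
inline rgba composite_fast_path(rgba src, rgba dst)
{
    if (src.a < invisible_alpha)
        return dst; // discarded: the destination stays untouched
    return blend_over(src, dst);
}
```

The point of the model is that the destination read happens in the blend stage rather than as an extra texture fetch in the shader, which matches the "fewer texture reads" remark; whether that translates into the reported saving depends on the GPU and driver.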