v22 slower than v21? #47

Open
Boulder08 opened this issue May 6, 2020 · 14 comments

@Boulder08

As I measured here: https://forum.doom9.org/showthread.php?p=1910541#post1910541, the new version with speed improvements seems to be slower than the previous one. Are the CPU instruction sets detected properly? I noticed that the code doing the detection is quite old and may not cope with these new-generation AMD Ryzens (I'm running a 3900X).

@dubhater
Owner

dubhater commented May 6, 2020

Only if AMD changed the way they signal AVX2 support. I don't think they did?

Because of the parameters you used, neither Super nor Degrain1 is using the new AVX2 code, which means it's Analyse that got slower.

Do you see a difference between v21 and v22 when you run Analyse on a 16 bit clip?

@Boulder08
Author

The difference seems to be consistent.

Analyse 16-bits, v22 26.22 fps
Analyse 16-bits, v21 28.07 fps
Analyse 8-bits, v22 55.74 fps
Analyse 8-bits, v21 57.96 fps
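
To put the gap in relative terms, the figures above can be converted to a percentage change. This is only a minimal sketch: `percent_change` is a hypothetical helper name, and the inputs are the fps values posted above.

```python
def percent_change(new_fps, old_fps):
    """Relative change of new_fps versus old_fps, in percent."""
    return (new_fps - old_fps) / old_fps * 100.0


# fps values copied from the measurements above: (v22, v21)
results = {
    "Analyse 16-bits": (26.22, 28.07),
    "Analyse 8-bits": (55.74, 57.96),
}

for name, (v22, v21) in results.items():
    print(f"{name}: v22 is {percent_change(v22, v21):+.1f}% vs v21")
```

That works out to roughly 6.6% slower at 16 bits and 3.8% slower at 8 bits.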

Which functions in MSuper or MDegrainx are supposed to be optimized? I could test them as well.

@dubhater
Owner

dubhater commented May 6, 2020

Degrain with 8 bit clips, Super with sharp=0 or 2.

@Boulder08
Author

Same thing with those, v21 is faster.

sharp=2, v22 55.97 fps
sharp=2, v21 58.17 fps
sharp=2, Degrain 8-bits, v22 64.51 fps
sharp=2, Degrain 8-bits, v21 66.67 fps

@Boulder08
Author

Just for fun, I checked what x264 shows:
x264 [info]: using cpu capabilities: MMX2 SSE2Fast SSSE3 SSE4.2 AVX FMA3 BMI2 AVX2

So at least it's working properly.

@sekrit-twc

Is v22 compiled with Visual Studio faster than v21? See attached.

vapoursynth-mvtools.zip

@Boulder08
Author

Yes, it seems to be faster. Compared to those first tests with 8-bit Analyse and 16-bit degraining, I got 60.43 fps as the result.

@sekrit-twc

sekrit-twc commented May 29, 2020

Tried compiling with GCC 9 on Linux. v22 is running faster than v21 for me. Maybe the issue is related to MinGW and cross-compilation.

Script from Doom9 thread:

```python
import vapoursynth as vs

core = vs.core
core.num_threads = 1  # run single-threaded

core.std.LoadPlugin("/home/user/src/vapoursynth-mvtools/.libs/libmvtools.so")

# Blank 8-bit YUV 4:2:0 clip, looped 100 times
c = core.std.BlankClip(format=vs.YUV420P8) * 100
s = core.mv.Super(c, pel=2, chroma=True, rfilter=4, sharp=1)

# Backward and forward motion vectors, one frame apart
kwargs = {"blksize": 16, "overlap": 8, "search": 5, "searchparam": 8, "pelsearch": 8, "truemotion": False}
b1v = core.mv.Analyse(s, isb=True, delta=1, **kwargs)
f1v = core.mv.Analyse(s, isb=False, delta=1, **kwargs)

# Degrain using the vectors computed above
kwargs = {"thsad": 200, "thsadc": 100, "limit": 1, "limitc": 2, "thscd1": 300, "thscd2": 80}
c = core.mv.Degrain1(c, s, b1v, f1v, **kwargs)
c.set_output()
```
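
For anyone who wants to time a script like this from Python, a minimal sketch follows. It is an assumption about how one might benchmark, not how the numbers in this thread were obtained. `measure_fps` is a hypothetical helper that takes any callable accepting a frame index.

```python
import time


def measure_fps(request_frame, num_frames):
    """Request frames 0..num_frames-1 in order and return frames per second.

    request_frame is any callable that takes a frame index and blocks
    until that frame has been produced.
    """
    start = time.perf_counter()
    for n in range(num_frames):
        request_frame(n)
    elapsed = time.perf_counter() - start
    return num_frames / elapsed
```

With the script above, `measure_fps(c.get_frame, 100)` would time the first 100 frames sequentially, which fits the single-threaded setup.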

Profiler results. Units are perf "cycles" events, which is a proxy for time. In this script, the AVX2 code is offering negligible speedup, because the bulk of the compute is not in SIMD code anyway, due to the mv.Super mode. The fps gains are instead coming from templating and specializing the control flow for the motion estimation.

Kernels

| Symbol | v21 | v23 | v23 / v21 |
| --- | --- | --- | --- |
| `HorizontalBicubic` | 34449 | 43948 | 1.275741 |
| `VerticalBicubic` | 17062 | 17427 | 1.021393 |
| `ToPixels_uint16_t_uint8_t` | 11051 | 12583 | 1.13863 |
| `SADWrapperU8_AVX2<16u, 16u>::sad_u8_avx2` | 6707 | 8806 | 1.312957 |
| `__memset_avx2_erms` | 7171 | 7903 | 1.102078 |
| `SADWrapperU8<8u, 8u>::sad_u8_sse2` | 12621 | 6507 | 0.515569 |
| `Degrain_avx2<1, 16, 16>` | 9277 | 6028 | 0.649779 |
| `Degrain_avx2<1, 8, 8>` | 5395 | 4100 | 0.759963 |
| `RB2Cubic` | 4513 | 3595 | 0.796588 |
| `copyBlock<16u, 16u>` | 3974 | 3351 | 0.843231 |
| `overlaps_avx2<16, 16>` | 5026 | 3166 | 0.629924 |
| `overlaps_avx2<8, 8>` | 2753 | 2484 | 0.902288 |
| `copyBlock<8u, 8u>` | 2755 | 2294 | 0.832668 |
| `__memmove_avx_unaligned_erms` | 1934 | 1923 | 0.994312 |
| `PadReferenceFrame` | 895 | 1206 | 1.347486 |
| `LimitChanges_sse2` | 930 | 914 | 0.982796 |
| SUM | 126513 | 126235 | 0.997803 |

Control Flow, v21

| Symbol | Cycles |
| --- | --- |
| `pobExpandingSearch` | 41311 |
| `pobSearchMVs` | 32305 |
| `pobUMHSearch` | 25482 |
| `mvdegrainGetFrame<1>` | 17290 |
| `pobInterpolatePrediction` | 11247 |
| `mvpGetAbsolutePointerPel2` | 3681 |
| `pobHex2Search` | 2951 |
| `pobLumaSAD` | 2006 |
| `mvpGetAbsolutePointerPel1` | 1989 |
| `mvpGetAbsolutePointer` | 1455 |
| `pobRefine` | 1331 |
| SUM | 141048 |

Control Flow, v23

| Symbol | Cycles |
| --- | --- |
| `pobExpandingSearch<0, 0>` | 36834 |
| `pobUMHSearch<0, 1>` | 28107 |
| `mvdegrainGetFrame<1>` | 13456 |
| `doPobSearchMVs<0, 1>` | 11239 |
| `pobFetchPredictors` | 6858 |
| `pobInterpolatePrediction` | 5472 |
| `pobExpandingSearch<0, 1>` | 4954 |
| `doPobSearchMVs<0, 0>` | 3511 |
| `mvpGetAbsolutePointerPel2` | 2903 |
| `pobHex2Search<0, 1>` | 2792 |
| `pobGetRefBlockU<1>` | 1970 |
| `mvpGetAbsolutePointerPel1` | 1938 |
| `pobGetRefBlockV<1>` | 1883 |
| `mvpGetAbsolutePointer` | 1606 |
| `pobRefine<0, 1>` | 802 |
| SUM | 124325 |
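
The ratio column and the two totals can be cross-checked with a few lines of Python. This is a minimal sketch: `ratio` is a hypothetical helper name, and only a handful of rows are copied from the tables above for illustration.

```python
def ratio(v23_cycles, v21_cycles):
    """v23 / v21; a value below 1.0 means v23 spent fewer cycles."""
    return v23_cycles / v21_cycles


# (v21, v23) cycle counts copied from the profiler tables above
rows = {
    "HorizontalBicubic": (34449, 43948),
    "SADWrapperU8<8u, 8u>::sad_u8_sse2": (12621, 6507),
    "Degrain_avx2<1, 16, 16>": (9277, 6028),
    "Kernels SUM": (126513, 126235),
    "Control flow SUM": (141048, 124325),
}

for name, (v21, v23) in rows.items():
    print(f"{name}: {ratio(v23, v21):.3f}")
```

The kernel total is nearly flat (about 0.998) while the control-flow total drops to about 0.88, which matches the point that the gains come from specializing the control flow rather than from the SIMD kernels.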

@dubhater
Owner

Which compiler flags did you use? (And Autotools or Meson?)

@sekrit-twc

Default autotools build (./configure && make).

@dubhater
Owner

Hmm. The default with Makefile.am is -O2. Meson defaults to -O3. I compiled the v22 and v23 DLLs using Meson. (I don't know about the older ones.) Perhaps that's what makes it slower?

@4re

4re commented May 30, 2020

I ran some tests with the above script, and for me r22 and r23 are slightly faster than r21 (~4%).

GCC 10 builds are ~10% bigger than GCC 9 but just a tiny bit faster (~2%).

On my Zen 2 CPU I used -march=native -O2 -ftree-vectorize -fdevirtualize-at-ltrans -flto=16 -pipe. Note that -O2 in GCC 10 is slightly different (it now includes -finline-functions).

@dubhater
Owner

dubhater commented Jun 6, 2020

@Boulder08 Here is v23 compiled with -O2 instead of -O3. That's the only difference. Please test again.
vapoursynth-mvtools-v23-O2-win64.zip

@Boulder08
Author

2500 frames of a test script of analysis and degraining in 16 bits:
v23-normal: 12.62 fps
v23-O2: 12.07 fps
v23-clang build from doom9: 13.32 fps

So the -O2 build was definitely slower.
