-
Notifications
You must be signed in to change notification settings - Fork 4.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Never use heap for return buffers #112060
base: main
Are you sure you want to change the base?
Conversation
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch |
a72ec93
to
e00043b
Compare
34edd40
to
f9506f7
Compare
f9506f7
to
d379d38
Compare
/azp run Fuzzlyn |
Azure Pipelines successfully started running 1 pipeline(s). |
/azp run runtime-coreclr jitstress, runtime-coreclr gcstress0x3-gcstress0xc, runtime-coreclr gcstress-extra, runtime-coreclr libraries-jitstress, runtime-coreclr libraries-pgo, Fuzzlyn, runtime-coreclr pgostress, runtime-coreclr outerloop |
Azure Pipelines successfully started running 8 pipeline(s). |
What's are the worst-case examples of this regression in real-world code? |
I'll check later this week. Presumably, we regress cases where large structs don't contain gc pointers and we introduce a redundant memory move for them when they're saved on the heap. Perhaps, we can complicate things and only guarantee return-buffer-on-stack for structs with gc pointers in the worst case? Hopefully, the PR improves more than it potentially regresses, especially on arm64 where all moves are a bit cheaper (due to better atomicity guarantees of SIMD moves, paired loads/stores) while calls a bit more expensive. Also, hopefully, follow up changes will remove some redundant copies. |
We had complicated scheme like that. The CLR ABI doc has mentions of it - look for IsStructRequiringStackAllocRetBuf - that decided the return buffer convention based on different factors. It was a bug farm that did not survive the test of time. |
Updated stats for write-barriers after #112227 was merged (it is supposed to help reducing the number of bulk barriers): aspnet-win-x64 SPMI collection:
Looks like the aspnet collection has too many missed contexts currently (so the actual numbers are likely 5-10% higher)
|
@EgorBot -arm -amd -windows_intel using System.Runtime.CompilerServices;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
// Structs with GC references
[InlineArray(128)] struct StructGC_1024B { object _; }
[InlineArray(32)] struct StructGC_256B { object _; }
[InlineArray(16)] struct StructGC_128B { object _; }
[InlineArray(8)] struct StructGC_64B { object _; }
[InlineArray(4)] struct StructGC_32B { object _; }
// Structs without GC references
[InlineArray(128)] struct StructNoGC_1024B { long _; }
[InlineArray(32)] struct StructNoGC_256B { long _; }
[InlineArray(16)] struct StructNoGC_128B { long _; }
[InlineArray(8)] struct StructNoGC_64B { long _; }
[InlineArray(4)] struct StructNoGC_32B { long _; }
static class Utils
{
[MethodImpl(MethodImplOptions.NoInlining)]
public static T Create<T>() where T : struct => default;
[MethodImpl(MethodImplOptions.NoInlining)]
public static T Create<T>(T copy) where T : struct => copy;
}
public class BoxBenchmarks
{
object boxedStruct;
[Benchmark] public void Box_StructGC_1024B() => boxedStruct = Utils.Create<StructGC_1024B>();
[Benchmark] public void Box_StructGC_256B() => boxedStruct = Utils.Create<StructGC_256B>();
[Benchmark] public void Box_StructGC_128B() => boxedStruct = Utils.Create<StructGC_128B>();
[Benchmark] public void Box_StructGC_64B() => boxedStruct = Utils.Create<StructGC_64B>();
[Benchmark] public void Box_StructGC_32B() => boxedStruct = Utils.Create<StructGC_32B>();
[Benchmark] public void Box_StructNoGC_1024B() => boxedStruct = Utils.Create<StructNoGC_1024B>();
[Benchmark] public void Box_StructNoGC_256B() => boxedStruct = Utils.Create<StructNoGC_256B>();
[Benchmark] public void Box_StructNoGC_128B() => boxedStruct = Utils.Create<StructNoGC_128B>();
[Benchmark] public void Box_StructNoGC_64B() => boxedStruct = Utils.Create<StructNoGC_64B>();
[Benchmark] public void Box_StructNoGC_32B() => boxedStruct = Utils.Create<StructNoGC_32B>();
}
public class StackBenchmarks
{
[Benchmark] public void Stack_StructGC_1024B() => _ = Utils.Create<StructGC_1024B>(default);
[Benchmark] public void Stack_StructGC_256B() => _ = Utils.Create<StructGC_256B>(default);
[Benchmark] public void Stack_StructGC_128B() => _ = Utils.Create<StructGC_128B>(default);
[Benchmark] public void Stack_StructGC_64B() => _ = Utils.Create<StructGC_64B>(default);
[Benchmark] public void Stack_StructGC_32B() => _ = Utils.Create<StructGC_32B>(default);
[Benchmark] public void Stack_StructNoGC_1024B() => _ = Utils.Create<StructNoGC_1024B>(default);
[Benchmark] public void Stack_StructNoGC_256B() => _ = Utils.Create<StructNoGC_256B>(default);
[Benchmark] public void Stack_StructNoGC_128B() => _ = Utils.Create<StructNoGC_128B>(default);
[Benchmark] public void Stack_StructNoGC_64B() => _ = Utils.Create<StructNoGC_64B>(default);
[Benchmark] public void Stack_StructNoGC_32B() => _ = Utils.Create<StructNoGC_32B>(default);
} |
A histogram for struct sizes in BCL (and the number of structs with gc pointers). ~45% of all structs are less-or-equal than 16 bytes. With GC pointers: 1134 Struct Size HistogramSize in bytes - number of structs Size: 1, Count: 72 |
So there are 2 use-cases:
So far, I wasn't able to find any benchmark in dotnet/performance suit that regresses because of it (I was trying benchmarks which had jit diffs). Neither I was able to see any impact on OrchardCMS benchmark. The original issue stated that TechEmpower Fortunes should see an improvement, unfortunately, I wasn't able to kick off a run yet to see that. Presumably, we can land additional follow-up improvements to benefit from stack-only return buffer like was mentioned before. So not sure what exactly we should do here, I am fine with leaving things as is in Main or we can just merge it, spin for a week or two, if dotnet/performance and aspnet PerfLab results won't be motivating we can just revert it. Unfortunately, we still don't have a reliable way to run dotnet/performance benchmarks in parallel - all attempts in the past had issues with triaging potential improvements/regressions due to noise as we can't use the same script we use for Tuesday/Thursday triaging yet. PS: #112060 (comment) implies that for 85% of structs, the copy is basically 2 mov instructions (with avx512). |
Status quo incentives shape common code patterns. Authors are less inclined to use structs if there is a perfomance cost for doing so. Green field design considerations in ASP.NET Core, Npgsql and libraries based on them (e.g. OrchardCMS) are rooted in - at best - early .NET Core era JIT limitations: redundant copies, bad copying perf, no physical promotion/enregistration, and so on. IMO as long as there are no notable regressions - while good portions of all the measured code is class/heap focused - it seems worthwhile to allow for more struct optimizations rather than fewer (especially if knock-on optimizations can eliminate more copies). Stack-centric and shared-nothing kind of code will continue to be a better fit for the increasingly higher core counts.
My measurement was done with a new driver - not Npgsql (of which I'm also a maintainer though) - it's a reset of my previous work on https://github.com/NinoFloris/Slon. That work is almost ready for TE CI runs. |
@NinoFloris thanks! I'll try benchmarks with Npgsql and TE Fortunes specifically and post the results here |
You can also delete this and this.
I am fine with merging this change and keeping it even if we are not able to find a motivating perf improvement. I think it is more reasonable calling convention in general. The change unifies how we are dealing with arguments passed by reference and return buffers. |
Can you also update runtime/src/coreclr/jit/rationalize.cpp Lines 160 to 163 in cee8434
to TYP_I_IMPL ?
Also, if you want to you can fix up most of the pointer -> byref changes made in However, I'm also ok with leaving that as is and I can clean it up some other time. But can you please change this part of the docs: runtime/docs/design/features/tailcalls-with-helpers.md Lines 262 to 264 in cee8434
|
Final diffs (slightly better when I added "a call is returning a byref struct -> no copy needed". I tried to benchmark the PR against Fortunes and FortunesEF but unfortunately, RPS is just too flaky to be able to detect changes (+/-5%). If no other concern/feedback, can it be approved then? @jkotas @jakobbotsch |
src/coreclr/jit/compiler.h
Outdated
@@ -10793,6 +10793,7 @@ class Compiler | |||
STRESS_MODE(POISON_IMPLICIT_BYREFS) \ | |||
STRESS_MODE(STORE_BLOCK_UNROLLING) \ | |||
STRESS_MODE(THREE_OPT_LAYOUT) \ | |||
STRESS_MODE(NONHEAP_RET_BUFFER) \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we run any of these JIT stress modes with naot? If yes, we may need the helper implemented for naot too.
Also, I am not sure about the durable value of this stress mode and helper. I understand that the helper was useful when implementing the change. Do you think that there is high enough probability that we will passing the heap pointers for return buffers by mistake without noticing it in other ways?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we run any of these JIT stress modes with naot? If yes, we may need the helper implemented for naot too.
Can't find any evidence that we run jitstress for NAOT even in outerloop and we definitely have no GCStress for it (#107850)
Also, I am not sure about the durable value of this stress mode and helper. I understand that the helper was useful when implementing the change. Do you think that there is high enough probability that we will passing the heap pointers for return buffers by mistake without noticing it in other ways?
I had it locally for R2R too. It seems my test apps fail badly if I remove the importer
code that makes local copies instead of passing heap pointer, even without any explicit stress mode (and the helper), so I presume I can delete it
7b17d37
to
0764818
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks!
@jakobbotsch @dotnet/jit-contrib does the jit side look good (beside leaving a few things to follow up PRs as improvements on top of it) |
CI experiment for #111127
Was:
Now:
where the write barrier is put at the callsite if needed (presumably, it happens rarely)
Updated stats for write-barriers after #112227 was merged (it is supposed to help reducing the number of bulk barriers):
aspnet-win-x64 SPMI collection:
Looks like the aspnet collection has too many missed contexts currently (so the actual numbers are likely 5-10% higher)
MihuBot (PMI for BCL):