[WIP] AtomicBuffer weak compareAndSet methods. #330
The existing CAS methods are fine for x86 because of its strong memory model and because a CAS there does not fail spuriously: the instruction locks the cache line.
But on ARM and other ISAs with a weak memory model, like RISC-V, the current methods give suboptimal performance.
On ARM, a CAS can fail spuriously because the mechanism is optimistic (LL/SC, load-linked/store-conditional). So when the compiler sees a regular compareAndSet, it internally emits a loop to ensure that the operation won't fail spuriously. But if the compareAndSet is already used within a CAS loop, you get a loop in a loop.
e.g.
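A minimal sketch of such a CAS loop. It uses `java.util.concurrent.atomic.AtomicLong` as a stand-in for AtomicBuffer, and `updateMax` is a hypothetical helper chosen for illustration, not code from this PR:

```java
import java.util.concurrent.atomic.AtomicLong;

final class StrongCasLoop {
    // Hypothetical helper: raises `max` to `candidate` if candidate is larger.
    // compareAndSet is the strong form: it only returns false when the
    // current value really differs from the expected value.
    static long updateMax(AtomicLong max, long candidate) {
        long current;
        do {
            current = max.get();
            if (candidate <= current) {
                return current; // nothing to update
            }
        } while (!max.compareAndSet(current, candidate)); // outer retry loop
        return candidate;
    }

    public static void main(String[] args) {
        AtomicLong max = new AtomicLong(5);
        System.out.println(updateMax(max, 9)); // prints 9
    }
}
```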
When this code is compiled for ARM, it leads to a loop in a loop (I still need to get my hands on some ARM hardware to confirm this).
It is better to use the following so that the nested loop is not added:
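A sketch of the same pattern with a weak CAS, again using `AtomicLong` as a stand-in for AtomicBuffer (an assumption; `updateMax` is a hypothetical helper). `weakCompareAndSetVolatile` is available on `AtomicLong` since Java 9 and may fail spuriously, which the surrounding loop already handles:

```java
import java.util.concurrent.atomic.AtomicLong;

final class WeakCasLoop {
    // Hypothetical helper: raises `max` to `candidate` if candidate is larger.
    // The weak form may return false even when the value matched, so it must
    // always sit inside a retry loop like this one.
    static long updateMax(AtomicLong max, long candidate) {
        long current;
        do {
            current = max.get();
            if (candidate <= current) {
                return current; // nothing to update
            }
        } while (!max.weakCompareAndSetVolatile(current, candidate));
        return candidate;
    }

    public static void main(String[] args) {
        AtomicLong max = new AtomicLong(5);
        System.out.println(updateMax(max, 9)); // prints 9
    }
}
```

Because the retry loop absorbs spurious failures, the result is the same as with the strong variant; only the generated code may differ.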
Both of these methods have volatile memory semantics; the weak variant is purely an optimization.
In C++ this is made more obvious by having compare_exchange_weak and compare_exchange_strong.
For ultimate performance, methods have also been added with even weaker memory ordering semantics, like weakCompareAndSetRelease. I have not checked the assembly, but I assume that on x86 this will still lead to some lock-prefixed instruction and therefore won't buy you much, while on ARM it can be implemented more efficiently.
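To illustrate the weaker ordering, here is a sketch using `AtomicLong.weakCompareAndSetRelease` (Java 9+) as a stand-in for the AtomicBuffer method of the same name. The `increment` helper and the choice of `getAcquire` for the read are assumptions made for the example; whether release ordering is sufficient depends on the caller:

```java
import java.util.concurrent.atomic.AtomicLong;

final class ReleaseCasLoop {
    // Hypothetical helper: increments the counter with release semantics on
    // the write. The weak CAS may fail spuriously, so it is retried in a loop.
    static long increment(AtomicLong counter) {
        long current;
        do {
            current = counter.getAcquire();
        } while (!counter.weakCompareAndSetRelease(current, current + 1));
        return current + 1;
    }

    public static void main(String[] args) {
        AtomicLong counter = new AtomicLong(0);
        increment(counter);
        System.out.println(increment(counter)); // prints 2
    }
}
```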