Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[libc++] Take advantage of trivial relocation in std::vector::erase #116268

Draft
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

ldionne
Copy link
Member

@ldionne ldionne commented Nov 14, 2024

In vector::erase(iter) and vector::erase(iter, iter), we can take
advantage of a type being trivially relocatable to open up a gap
in the vector and then relocate the tail of the vector into that gap.

The benefit is that relocating an object is often more efficient than
move-assigning and then destroying the original object. For types that
can be relocated trivially but that are complicated enough for the
compiler not to optimize by itself (like std::string), this provides
around a 2x performance speedup in vector::erase (see below).

This optimization requires stopping the usage of Clang's __is_trivially_relocatable
builtin, which doesn't currently honour assignment operators like
is_trivially_copyable does and can lead us to perform incorrect
optimizations.

It is also worth noting that __uninitialized_allocator_relocate has to
be modified so that we can relocate into an overlapping range. This
has an unfortunate impact on its exception safety guarantees, which
needs to be investigated further.

Previous implementation

--------------------------------------------------------------------------------------
Benchmark                                            Time             CPU   Iterations
--------------------------------------------------------------------------------------
BM_erase_iter_in_middle/vector_int/1024           24.9 ns         24.9 ns     28042962
BM_erase_iter_in_middle/vector_int/4096            107 ns          107 ns      6590592
BM_erase_iter_in_middle/vector_int/10240           271 ns          265 ns      2733478
BM_erase_iter_in_middle/vector_string/1024         349 ns          349 ns      2005886
BM_erase_iter_in_middle/vector_string/4096        1410 ns         1406 ns       498355
BM_erase_iter_in_middle/vector_string/10240       3449 ns         3449 ns       201989
BM_erase_iter_at_start/vector_int/1024            47.1 ns         47.1 ns     14836261
BM_erase_iter_at_start/vector_int/4096             204 ns          204 ns      3430414
BM_erase_iter_at_start/vector_int/10240            504 ns          504 ns      1391373
BM_erase_iter_at_start/vector_string/1024          684 ns          684 ns      1025160
BM_erase_iter_at_start/vector_string/4096         2855 ns         2806 ns       254080
BM_erase_iter_at_start/vector_string/10240        7060 ns         7060 ns        94134

New implementation

--------------------------------------------------------------------------------------
Benchmark                                            Time             CPU   Iterations
--------------------------------------------------------------------------------------
BM_erase_iter_in_middle/vector_int/1024           26.0 ns         25.9 ns     27127367
BM_erase_iter_in_middle/vector_int/4096            105 ns          105 ns      6515204
BM_erase_iter_in_middle/vector_int/10240           259 ns          258 ns      2800795
BM_erase_iter_in_middle/vector_string/1024         148 ns          147 ns      4725706
BM_erase_iter_in_middle/vector_string/4096         608 ns          606 ns      1168205
BM_erase_iter_in_middle/vector_string/10240       1523 ns         1520 ns       459909
BM_erase_iter_at_start/vector_int/1024            47.1 ns         47.1 ns     14762513
BM_erase_iter_at_start/vector_int/4096             205 ns          205 ns      3403130
BM_erase_iter_at_start/vector_int/10240            507 ns          507 ns      1382716
BM_erase_iter_at_start/vector_string/1024          300 ns          300 ns      2327546
BM_erase_iter_at_start/vector_string/4096         1205 ns         1205 ns       580855
BM_erase_iter_at_start/vector_string/10240        4296 ns         4296 ns       162956

In particular, test everything with both a normal and a min_allocator,
add tests for a few corner cases and add tests with types that are
trivially relocatable. Also add tests that count the number of assignments
performed by vector::erase, since that is mandated by the Standard.

This patch is a preparation for optimizing vector::erase.
In vector::erase(iter) and vector::erase(iter, iter), we can take
advantage of a type being trivially relocatable to open up a gap
in the vector and then relocate the tail of the vector into that gap.

The benefit is that relocating an object is often more efficient than
move-assigning and then destroying the original object. For types that
can be relocated trivially but that are complicated enough for the
compiler not to optimize by itself (like std::string), this provides
around a 2x performance speedup in vector::erase (see below).

This optimization requires stopping the usage of Clang's __is_trivially_relocatable
builtin, which doesn't currently honour assignment operators like
is_trivially_copyable does and can lead us to perform incorrect
optimizations.

It is also worth noting that __uninitialized_allocator_relocate has to
be modified so that we can relocate into an overlapping range. This
has an unfortunate impact on its exception safety guarantees, which
needs to be investigated further.

   Previous implementation
   --------------------------------------------------------------------------------------
   Benchmark                                            Time             CPU   Iterations
   --------------------------------------------------------------------------------------
   BM_erase_iter_in_middle/vector_int/1024           24.9 ns         24.9 ns     28042962
   BM_erase_iter_in_middle/vector_int/4096            107 ns          107 ns      6590592
   BM_erase_iter_in_middle/vector_int/10240           271 ns          265 ns      2733478
   BM_erase_iter_in_middle/vector_string/1024         349 ns          349 ns      2005886
   BM_erase_iter_in_middle/vector_string/4096        1410 ns         1406 ns       498355
   BM_erase_iter_in_middle/vector_string/10240       3449 ns         3449 ns       201989
   BM_erase_iter_at_start/vector_int/1024            47.1 ns         47.1 ns     14836261
   BM_erase_iter_at_start/vector_int/4096             204 ns          204 ns      3430414
   BM_erase_iter_at_start/vector_int/10240            504 ns          504 ns      1391373
   BM_erase_iter_at_start/vector_string/1024          684 ns          684 ns      1025160
   BM_erase_iter_at_start/vector_string/4096         2855 ns         2806 ns       254080
   BM_erase_iter_at_start/vector_string/10240        7060 ns         7060 ns        94134

   New implementation
   --------------------------------------------------------------------------------------
   Benchmark                                            Time             CPU   Iterations
   --------------------------------------------------------------------------------------
   BM_erase_iter_in_middle/vector_int/1024           26.0 ns         25.9 ns     27127367
   BM_erase_iter_in_middle/vector_int/4096            105 ns          105 ns      6515204
   BM_erase_iter_in_middle/vector_int/10240           259 ns          258 ns      2800795
   BM_erase_iter_in_middle/vector_string/1024         148 ns          147 ns      4725706
   BM_erase_iter_in_middle/vector_string/4096         608 ns          606 ns      1168205
   BM_erase_iter_in_middle/vector_string/10240       1523 ns         1520 ns       459909
   BM_erase_iter_at_start/vector_int/1024            47.1 ns         47.1 ns     14762513
   BM_erase_iter_at_start/vector_int/4096             205 ns          205 ns      3403130
   BM_erase_iter_at_start/vector_int/10240            507 ns          507 ns      1382716
   BM_erase_iter_at_start/vector_string/1024          300 ns          300 ns      2327546
   BM_erase_iter_at_start/vector_string/4096         1205 ns         1205 ns       580855
   BM_erase_iter_at_start/vector_string/10240        4296 ns         4296 ns       162956
@ldionne
Copy link
Member Author

ldionne commented Nov 14, 2024

Note that this is stacked on top of #116265.

#else
//
// Note that we don't use Clang's __is_trivially_relocatable builtin because it doesn't honor the presence
// of non-trivial special members like assignment operators, or even a copy constructor, making it possible
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In particular, this change is required or else we fail the test I added in erase_iter.pass.cpp which counts the number of assignments. Because of this, and since the semantics of what it means to be trivially relocatable are still being actively debated, I think it makes sense for us to err on the safer side and use is_trivially_copyable for now.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TrackedAssignment has a trivial destructor and a trivial copy constructor, so it's trivially relocatable by destroying it and copy-constructing a new instance. Is the library code not permitted to do that? If not, would it make sense to have separate internal traits for "trivially assignment-relocatable" versus "trivially construction-relocatable" that only detect the case where you could relocate specifically using assignment versus specifically using construction?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the delay, the last days have been busy. I believe what you are suggesting is essentially what http://wg21.link/P2687R9 introduced with the replaceable concept -- or were you aware of that concept but you had something else in mind?

Since I wrote this patch, my understanding has evolved and so did P2687, so I created this draft PR to discuss with the P2687 authors. That PR implements the currently proposed semantics for P2687 (although with a IMO better library API).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants