
Fast (last) vector dimension being used by different threads might cause false cache sharing #198

Open
hdante opened this issue Sep 3, 2024 · 2 comments
Labels
bug Something isn't working



hdante commented Sep 3, 2024

Hello, this is a report of a conceptual performance issue in the code, currently without practical effects: when the locChi2 and locInd multi-dimensional vectors are declared, the dimension used for splitting the work across threads is the last one. This means contiguous elements of the same vector are accessed by different hardware threads. Since typical cache lines are 64 bytes wide, the same cache line is requested and "locked" by multiple hardware threads at once, which in practice serializes access to the chi2 values, slowing down and defeating the parallelism of the loops that use these vectors.

Instead, the dimension used for distributing work across threads should not be the last one, so that values dispatched to different threads are kept in different cache lines.

Before submitting
Please check the following:

  • I have described the situation in which the bug arose, including what code was executed, information about my environment, and any applicable data others will need to reproduce the problem.
  • I have included available evidence of the unexpected behavior (including error messages, screenshots, and/or plots) as well as a description of what I expected instead.
  • If I have a solution in mind, I have provided an explanation and/or pseudocode and/or task list.
@hdante hdante added the bug Something isn't working label Sep 3, 2024
@johannct
Member

johannct commented Sep 4, 2024

Do you think you can help us investigate this and the other performance issue you flagged? It would be great to have you in the developer team, given the input you provided so far.

@hdante
Author

hdante commented Sep 4, 2024

Hello, Johann, I can help you, but these are not real performance issues. For this issue, the relevant loop is already fast and small, and changing it won't noticeably impact performance, so it is a matter of recommended practice and avoiding future problems. I've reported two other issues that might be potential multi-threading bugs. I still need to report a real performance issue found in the hot loop, where dictionary lookups consume a noticeable fraction of the runtime; the expected speed-up there is around 10%.

I'll leave the suggestions for the reported issues as comments. For this issue, I'd solve it by changing the order of the affected vector's dimensions while assigning the first dimension for the thread ID:

// The thread index becomes the FIRST dimension, so each thread owns a
// contiguous block (number_threads and dimzg are defined earlier in the file):
vector<vector<vector<double>>> locChi2(number_threads, vector<vector<double>>(3, vector<double>(dimzg, 1.e9)));
vector<vector<vector<int>>> locInd(number_threads, vector<vector<int>>(3, vector<int>(dimzg, -1)));

Doing only this change does not guarantee zero false sharing: with the default dynamic memory allocator, the separate 1-D vectors of length dimzg might be aligned only on 16-byte boundaries, which is smaller than the 64-byte cache line, so two threads' vectors can still begin or end in the same line. But it already moves away from the previous situation, where false sharing was 100% guaranteed by design because values belonging to different threads were contiguous 8-byte elements. Another possible change would be to flatten this 3-D vector into a large 1-D vector, but given this partial solution, I wouldn't go any further to optimize an already low-cost loop.
