Add sub-RFC for increased availability of NUMA API #1545
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,119 @@ | ||
# -*- fill-column: 80; -*- | ||
|
||
#+title: Link ~tbbbind~ with Static HWLOC for NUMA API predictability | ||
|
||
*Note:* This document is a sub-RFC of the [[file:README.md][umbrella RFC about improving NUMA | ||
support]], specifically its "Increased availability of NUMA support" section. | ||
|
||
* Introduction | ||
oneTBB has a soft dependency on several variants of ~tbbbind~, which the library | ||
loads during the initialization stage. Each ~tbbbind~, in turn, has a hard | ||
dependency on a specific version of the HWLOC library [1, 2]. The soft | ||
dependency means that the library continues execution even if the system | ||
loader fails to resolve the hard dependency on HWLOC for ~tbbbind~. In this | ||
case, oneTBB does not discover the hardware topology. Instead, it defaults to | ||
viewing all CPU cores as uniform, consistent with TBB behavior when NUMA | ||
constraints are not used. As a result, the following code returns values that | ||
do not reflect the actual topology: | ||
|
||
#+begin_src C++ | ||
#include <oneapi/tbb/info.h> | ||
#include <vector> | ||
std::vector<oneapi::tbb::numa_node_id> numa_nodes = oneapi::tbb::info::numa_nodes(); | ||
std::vector<oneapi::tbb::core_type_id> core_types = oneapi::tbb::info::core_types(); | ||
#+end_src | ||
|
||
This lack of valid hardware topology, caused by the absence of a third-party | ||
library, is the major problem with the current oneTBB behavior. It is aggravated | ||
by the lack of diagnostics, which makes the problem difficult for developers to | ||
detect. As a result, the code continues to run but fails to use NUMA as intended. | ||
|
||
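One way an application could notice the silent fallback is to inspect the returned list. The sketch below is a minimal illustration and not part of the RFC. It assumes that, when topology discovery fails, ~oneapi::tbb::info::numa_nodes()~ returns a single element equal to ~oneapi::tbb::task_arena::automatic~, and that this value is -1. The helper name ~looks_like_topology_fallback~ and the ~automatic_id~ constant are hypothetical, and a plain ~int~ stands in for ~oneapi::tbb::numa_node_id~ so that the snippet compiles without oneTBB.

```cpp
#include <vector>

// Stand-in for oneapi::tbb::numa_node_id, which is an integral type.
using numa_node_id = int;

// Assumed sentinel: the value of oneapi::tbb::task_arena::automatic.
constexpr numa_node_id automatic_id = -1;

// Returns true when the list looks like the fallback result, that is,
// a single entry equal to the sentinel rather than a real node index.
bool looks_like_topology_fallback(const std::vector<numa_node_id>& nodes) {
    return nodes.size() == 1 && nodes.front() == automatic_id;
}
```

In real code, the argument would be the result of ~oneapi::tbb::info::numa_nodes()~. A true result only suggests that no ~tbbbind~ variant was loaded. It does not explain why.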
Dependency on a shared HWLOC library has the following benefits: | ||
1. Code reuse with all of its positive consequences: oneTBB relies on code | ||
that has already been tested and debugged, and the OS can share it among | ||
different processes, which improves cache locality and reduces memory | ||
footprint. That is the primary purpose of shared libraries. | ||
2. A drop-in replacement. Users can use their own version of HWLOC without | ||
recompiling oneTBB. This specific version of HWLOC could include a hotfix to | ||
support particular or new hardware that a customer has, but whose support is | ||
not yet upstreamed to the HWLOC project. Such support might never be | ||
upstreamed if the hardware is not going to be widely available. It could also | ||
be a development version of HWLOC that someone wants to test on their systems | ||
first. The same is possible with a static version, but it is more cumbersome | ||
because it requires recompiling every dependent component. | ||
|
||
The only disadvantage of depending on the HWLOC library dynamically is that | ||
developers who use oneTBB's NUMA support API need to make sure the library is | ||
available and can be found by oneTBB. Depending on the distribution model of a | ||
developer's code, this is achieved either by: | ||
1. Asking the end user to have the necessary version of the dependency | ||
pre-installed. | ||
2. Bundling the necessary HWLOC version together with other pieces of a product | ||
release. | ||
|
||
However, the requirement to fulfill one of the above steps for the NUMA API to | ||
start paying off may be considered an inconvenience and, more importantly, it | ||
is not always obvious that one of these steps is needed, especially given the | ||
silent fallback when the HWLOC library cannot be found in the environment. | ||
Review comment (suggested rewording): However, the need to complete one of the above steps for the NUMA API to function effectively may be seen as inconvenient. More importantly, it is not always immediately clear that these steps are required, especially due to the silent fallback behavior when the HWLOC library is not found in the environment. |
||
|
||
The proposal is to reduce the effect of this disadvantage of relying on a | ||
dynamic HWLOC library by statically linking HWLOC into one of the ~tbbbind~ | ||
libraries distributed together with oneTBB. At the same time, you retain the | ||
flexibility to supply a different version of the HWLOC library if needed. | ||
|
||
Since HWLOC 1.x is an older version and modern operating systems install HWLOC | ||
2.x by default, the probability of users being restricted to HWLOC 1.x is | ||
relatively small. Thus, we can reuse the filename of the ~tbbbind~ library | ||
linked to HWLOC 1.x for the library linked against a static HWLOC 2.x. | ||
|
||
* Proposal | ||
1. Replace the ~tbbbind~ variant that is currently linked dynamically against | ||
HWLOC 1.x with a variant linked statically against HWLOC 2.x. | ||
2. Add loading of that ~tbbbind~ variant as the last attempt to resolve the | ||
dependency on functionality provided by the ~tbbbind~ layer. | ||
3. Update the oneTBB documentation, including [[https://oneapi-src.github.io/oneTBB/search.html?q=tbb%3A%3Ainfo][these pages]], to | ||
detail the steps for identifying which ~tbbbind~ is being used. | ||
|
||
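The loading order described in step 2 can be sketched as a simple first-match search. This is an illustration under stated assumptions, not the oneTBB loader: ~pick_tbbbind~, ~default_candidates~, the ~try_load~ callback, and the candidate names are all hypothetical placeholders. The only point taken from the proposal is that the variant with a statically linked HWLOC is tried last, so a user-provided shared HWLOC still takes precedence.

```cpp
#include <functional>
#include <string>
#include <vector>

// Tries each candidate in order and returns the first one that loads.
// `try_load` is a hypothetical callback standing in for the platform
// loader (for example, a wrapper around dlopen or LoadLibrary).
// Returns an empty string when no candidate could be loaded.
std::string pick_tbbbind(
    const std::vector<std::string>& candidates,
    const std::function<bool(const std::string&)>& try_load) {
    for (const auto& name : candidates) {
        if (try_load(name)) {
            return name;
        }
    }
    return std::string();  // No variant loaded: topology stays undiscovered.
}

// Illustrative search order: variants that need a shared HWLOC come first,
// and the variant with HWLOC linked statically is the last resort.
// The names are placeholders, not the shipped file names.
std::vector<std::string> default_candidates() {
    return {"tbbbind_dynamic_hwloc_newer",
            "tbbbind_dynamic_hwloc_older",
            "tbbbind_static_hwloc"};
}
```

With this order, the static variant is selected only when every variant that depends on a shared HWLOC fails to load.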
** Advantages | ||
1. The proposed behavior introduces a fallback mechanism for resolving the HWLOC | ||
library dependency when it is not in the environment, while still preferring | ||
user-provided versions. As a result, the problematic oneTBB API usage works | ||
as expected, returning an enumerated list of actual NUMA nodes and core types | ||
on the system the code is running on. This holds provided that the loaded | ||
HWLOC library works on that system and that the application properly | ||
distributes all oneTBB binaries and sets up the environment so that the | ||
necessary variant of the ~tbbbind~ library can be found and loaded. | ||
2. Dropping support for HWLOC 1.x avoids introducing an additional ~tbbbind~ | ||
variant while maintaining support for widely used versions of HWLOC. | ||
|
||
** Disadvantages | ||
By default, there are still no diagnostics if you fail to set up the | ||
environment correctly with your version of HWLOC. However, specifying the | ||
~TBB_VERSION=1~ environment variable helps identify configuration issues quickly. | ||
|
||
* Alternative Handling for Missing System Topology | ||
An alternative behavior when the HWLOC library cannot be found is to be more | ||
explicit about the missing component: either issue a warning, or refuse to | ||
work and require one of the ~tbbbind~ variants to be loaded (e.g., by throwing | ||
an exception). | ||
Review comment (suggested rewording): An alternative approach to handle the absence of the HWLOC library is to adopt a more explicit response:
|
||
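The difference between the current silent behavior and the two alternatives can be sketched as a policy switch. Everything here is hypothetical and only illustrates the control flow: oneTBB has no such setting, and ~missing_topology_policy~, ~report_numa_nodes~, the message text, and the use of an empty ~discovered~ list to mean that no ~tbbbind~ variant was loaded are assumptions of this sketch.

```cpp
#include <iostream>
#include <stdexcept>
#include <vector>

// Hypothetical policy names; none of them is an existing oneTBB setting.
enum class missing_topology_policy { silent_fallback, warn, fail };

// Returns the NUMA node list to report. `discovered` is empty when no
// tbbbind variant could be loaded (an assumption of this sketch).
std::vector<int> report_numa_nodes(const std::vector<int>& discovered,
                                   missing_topology_policy policy) {
    if (!discovered.empty()) {
        return discovered;
    }
    switch (policy) {
    case missing_topology_policy::warn:
        // Alternative 1: keep running, but tell the user what is missing.
        std::cerr << "warning: HWLOC not found, NUMA topology is unavailable\n";
        break;
    case missing_topology_policy::fail:
        // Alternative 2: refuse to continue without topology information.
        throw std::runtime_error("HWLOC not found, NUMA topology is unavailable");
    case missing_topology_policy::silent_fallback:
        // Current behavior: no diagnostics at all.
        break;
    }
    return {-1};  // Single placeholder node, mirroring the uniform view.
}
```

The ~fail~ branch shows why throwing may break existing callers: code written against the silent behavior has no handler for the new exception.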
The following compares these alternative approaches with the proposed one. | ||
Review comment: Should this be a heading? Reply: Not necessarily. |
||
** Common Advantages | ||
- Explicitly indicates that the functionality being used does not work, instead | ||
of failing silently. | ||
- Avoids the need to distribute an additional variant of ~tbbbind~ library. | ||
|
||
** Common Disadvantages | ||
- Requires an additional step from the user to resolve the problem. In other | ||
words, it does not provide a complete solution to the problem. | ||
|
||
*** Disadvantages of Issuing a Warning | ||
- The warning may go unnoticed, especially if standard streams are closed. | ||
|
||
*** Disadvantages of Throwing an Exception | ||
- May break existing code that does not expect an exception to be thrown. | ||
- Requires introduction of an additional exception hierarchy. | ||
|
||
* References | ||
1. [[https://www.open-mpi.org/projects/hwloc/][HWLOC project main page]] | ||
2. [[https://github.com/open-mpi/hwloc][HWLOC project repository on GitHub]] |
But in fact, most Linux distributions ship obsolete hwloc versions, and relying on them does not provide benefits.
IMO, having the most up-to-date static HWLOC together with recent versions of oneTBB has benefits: fixes and new features are available to oneTBB immediately.
Thanks, I did not know that. I will consider this in the future changes.
I'd rewrite it to smth like this:
1. Reliability. Using a tested and debugged shared library, oneTBB benefits from established, reliable functionality.
2. Code Reuse. Reuse the same code across different processes, improving cache locality and reducing memory footprint, which is the primary purpose of shared libraries.
3. Drop-In Replacement. Use your version of HWLOC without recompiling oneTBB. It can be useful in the following cases: