Set number of threads for arrow #134
Comments
Hi @ivirshup, I think that arrow doesn't implement its own threadpool but instead relies on OpenMP for that. So I think controlling the number of OpenMP threads should work:

from threadpoolctl import ThreadpoolController
controller = ThreadpoolController()
with controller.limit(limits=1, user_api='openmp'):
    ...
|
Thanks for the response. I'm not sure what the specific implementation is, but that example doesn't seem to set the number of threads pyarrow sees. I'll demonstrate:

Using threadpoolctl after pyarrow import

import pyarrow as pa
print(pa.cpu_count())
from threadpoolctl import ThreadpoolController
controller = ThreadpoolController()
with controller.limit(limits=1, user_api="openmp"):
    print(pa.cpu_count())

Using threadpoolctl during pyarrow import

from threadpoolctl import ThreadpoolController
controller = ThreadpoolController()
with controller.limit(limits=1, user_api="openmp"):
    import pyarrow as pa
    print(pa.cpu_count())

Setting OMP_NUM_THREADS

import os
os.environ["OMP_NUM_THREADS"] = "1"
import pyarrow as pa
print(pa.cpu_count())
|
Right, I misinterpreted their doc. I looked into their source code and it appears that they implement their own threadpool, which can be configured by the I'm not sure yet if we want to explicitly support arrow. An alternative would be to allow custom controllers as requested here #137. |
I believe I prompted that 😆 |
@ivirshup #138 was merged in the If filename-based dynlib matching we could extend it to complement the filename match with a symbol name match as discussed in #138 (comment) but this is not yet implemented. |
Great! Thanks @ogrisel and @jeremiedbb! I'm a little unfamiliar with linking, as I've avoided learning much C++, but have given this a shot. It seems to work, but there's something a little strange going on. Here's what I've written:

import threadpoolctl, pyarrow as pa

class ArrowThreadPoolCtlController(threadpoolctl.LibController):
    user_api = "arrow"
    internal_api = "arrow"
    filename_prefixes = ("libarrow",)

    def get_num_threads(self):
        print(f"got {pa.cpu_count()} threads")
        return pa.cpu_count()

    def set_num_threads(self, num_threads):
        print(f"set to {num_threads} threads")
        pa.set_cpu_count(num_threads)

    def get_version(self):
        print("get_version called")
        return pa.__version__

    def set_additional_attributes(self):
        pass

threadpoolctl.register(ArrowThreadPoolCtlController)

with threadpoolctl.threadpool_limits(1):
    print(pa.cpu_count())

Here's the output:
This is from running it just once. This increases each time I register the class, so it could be nice if there was some level of uniqueness for controllers. Maybe this has to do with the number of dynlibs that start with the prefix? This was run in a conda environment which has these dylibs:
|
Ah, I think I'm starting to see. I think I'm getting a dynlib for all matching files as the expectation is that I am setting the threads directly using the dynlib CDLL object. I'm not sure I'm going to figure out how to do that. Maybe it could be done by calling the C++ methods for setting threads. I think just using |
The purpose of

from contextlib import contextmanager
import pyarrow as pa

@contextmanager
def limit_arrow(num_threads):
    # Remember the current pool size so it can be restored on exit.
    old_num_threads = pa.cpu_count()
    try:
        pa.set_cpu_count(num_threads)
        yield
    finally:
        pa.set_cpu_count(old_num_threads)

with limit_arrow(1):
    ...
|
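The same save/restore pattern can be written once and reused for both of arrow's thread pools. What follows is only a minimal sketch and not part of threadpoolctl: `limit_pool` and `limit_all` are hypothetical helper names, and the pyarrow calls shown in the trailing comment (`pa.io_thread_count` and `pa.set_io_thread_count` for the IO pool) are assumed to exist in the installed pyarrow version.

```python
from contextlib import ExitStack, contextmanager


@contextmanager
def limit_pool(getter, setter, num_threads):
    """Temporarily resize one pool through a getter/setter pair (hypothetical helper)."""
    old_num_threads = getter()
    try:
        setter(num_threads)
        yield
    finally:
        # Restore the previous size even if the body raised.
        setter(old_num_threads)


@contextmanager
def limit_all(pools, num_threads):
    """Apply the same limit to several (getter, setter) pairs at once."""
    with ExitStack() as stack:
        for getter, setter in pools:
            stack.enter_context(limit_pool(getter, setter, num_threads))
        yield


# Usage with pyarrow (assumes pyarrow is installed):
#
#   import pyarrow as pa
#   pools = [
#       (pa.cpu_count, pa.set_cpu_count),              # compute pool
#       (pa.io_thread_count, pa.set_io_thread_count),  # IO pool
#   ]
#   with limit_all(pools, 1):
#       ...
```

Like `limit_arrow`, this only works when pyarrow is importable in the current process, so it does not help with a separately bundled copy of arrow.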
That being said, I think it would still be interesting to support arrow directly. For instance, threadpoolctl provides a way to limit all supported libraries at once. Not having to write custom context managers for all libraries is nice. I've tried to use the symbols from the shared object, but there's a catch: since arrow is a C++ library, its symbol names are mangled :(
There are ways to demangle the name but it's gonna require some work to implement it in a robust and cross-platform way. |
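To make the mangling problem concrete: under the Itanium C++ ABI (used by GCC and Clang), a function inside a namespace is exported under an encoded name rather than its plain name. The sketch below builds such names for the simplest case only. `mangle_simple` is a hypothetical helper, and the C++ function names `arrow::GetCpuThreadPoolCapacity` and `arrow::SetCpuThreadPoolCapacity` are assumptions about arrow's C++ API that should be checked against the installed headers. MSVC uses a different scheme entirely, which is part of why a robust cross-platform lookup takes work.

```python
# Itanium ABI codes for a few builtin parameter types (not exhaustive).
_BUILTIN_CODES = {"void": "v", "int": "i", "bool": "b", "double": "d"}


def mangle_simple(qualified_name, param_types=("void",)):
    """Mangle a non-template free function in nested namespaces.

    Handles only the builtin parameter types listed above: no templates,
    no substitutions, no qualifiers, no class members.
    """
    parts = qualified_name.split("::")
    # Each name component is encoded as <length><identifier>.
    encoded = "".join(f"{len(part)}{part}" for part in parts)
    if len(parts) > 1:
        # Nested names are wrapped in N ... E.
        encoded = f"N{encoded}E"
    params = "".join(_BUILTIN_CODES[t] for t in param_types)
    return f"_Z{encoded}{params}"


# Assumed arrow C++ function names, for illustration only.
print(mangle_simple("arrow::GetCpuThreadPoolCapacity"))
print(mangle_simple("arrow::SetCpuThreadPoolCapacity", ("int",)))
```

This prints `_ZN5arrow24GetCpuThreadPoolCapacityEv` and `_ZN5arrow24SetCpuThreadPoolCapacityEi`, which is the kind of name a controller would have to look up on the loaded library instead of a plain C-style symbol.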
Yeah, this is really what I like about this library!
My concern is the case where calling pyarrow wouldn't work: if I'm calling some other program that calls out to pyarrow.compute, and I either don't have pyarrow in this environment or that program uses a bundled/separate version of arrow, the pyarrow approach doesn't work. Maybe the arrow devs would be interested in supporting this? |
ping @jorisvandenbossche, we'd like to have your opinion on that :) We're interested in adding support for The issue is that since arrow is a c++ library, the names of the symbols are mangled, see #134 (comment), making it hard to retrieve for threadpoolctl. I can see 3 alternatives:
|
Would setting the number of threads used by arrow be in-scope for this library?
(main docs on arrow thread pools)
arrow uses environment variables to set the number of threads used at import time, but then allows dynamically changing the number of threads via setter functions, like set_cpu_count. Notably, there are two separate thread pools: one for compute and one for IO.
Is this functionality in scope for this library? If so, it would be great to see this feature.