【Hackathon 7th No.1】 Integrate PaddlePaddle as a New Backend #704
base: master
Thanks! Regarding the error, does it happen only on the most recent PySR version, or on the previous one as well? There was a change in how multithreading is handled in juliacall, so I wonder if it's related. Just to check, are you launching this with Python multithreading or multiprocessing? I am assuming it is a bug somewhere in PySR, but I want to check whether there's anything non-standard in how you are launching things. Another thing to check: does it change if you import paddle first, before PySR? Or does the import order not matter? There is a known issue with PyTorch and JuliaCall where importing torch first prevents LLVM symbol conflicts (since Torch and Julia are compiled against different LLVM libraries, if I remember correctly). Numba has something similar. I'm not sure whether Paddle does too.
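To check which import order a given script actually used, one option is to inspect `sys.modules`, which is a dict and therefore keeps insertion order. The sketch below assumes that key order reflects the order in which modules were first imported (this can be perturbed if a module is deleted from `sys.modules` and re-imported); `imported_before` is a hypothetical helper, not part of PySR, juliacall, or Paddle.

```python
import sys
from typing import Optional


def imported_before(first: str, second: str) -> Optional[bool]:
    """Report whether module `first` was imported before module `second`.

    Hypothetical helper: relies on sys.modules being an insertion-ordered
    dict, so earlier keys were (normally) imported earlier.
    Returns None if either module has not been imported at all.
    """
    if first not in sys.modules or second not in sys.modules:
        return None
    order = list(sys.modules)
    return order.index(first) < order.index(second)


# Example usage inside a test script (module names are the ones discussed above):
# if imported_before("juliacall", "paddle"):
#     print("paddle was imported after juliacall")
```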
I believe the issue is indeed related to importing Paddle, as the error only occurs when
Quick follow-up: did you also try 0.19.4? Some juliacall changes were integrated into PySR in 0.19.4.
It looks like Julia uses LLVM 18 (https://github.com/JuliaLang/julia/blob/647753071a1e2ddbddf7ab07f55d7146238b6b72/deps/llvm.version#L8), or LLVM 17 in the previous release, whereas Paddle uses LLVM 11. I wonder if that is causing one of the issues. Can you try running some generic juliacall code, as shown in the guide at https://juliapy.github.io/PythonCall.jl/stable/juliacall/? Hopefully we can get a simpler example of the crash. I would be surprised if it affected only PySR (and not Julia more broadly), but it could very well be just PySR.
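One way to see whether two different LLVMs end up in the same process is to list the LLVM shared libraries that are actually mapped after importing both packages. This is a minimal, Linux-only sketch. It assumes the libraries have "LLVM" in their file names. If Paddle links LLVM statically into its own `.so`, it will not show up this way, and inspecting that library's exported symbols (e.g. with `nm -D`) would be needed instead. `loaded_llvm_libraries` is a hypothetical helper.

```python
import re
from pathlib import Path
from typing import List

_LLVM_RE = re.compile(r"llvm", re.IGNORECASE)


def loaded_llvm_libraries(maps_path: str = "/proc/self/maps") -> List[str]:
    """Return the shared objects mapped into this process whose file name
    mentions LLVM (Linux only; returns [] if maps_path does not exist)."""
    path = Path(maps_path)
    if not path.exists():
        return []
    found = set()
    for line in path.read_text().splitlines():
        # Format: address perms offset dev inode [pathname]
        fields = line.split(maxsplit=5)
        if len(fields) < 6:
            continue  # anonymous mapping, no backing file
        pathname = fields[5]
        if _LLVM_RE.search(Path(pathname).name):
            found.add(pathname)
    return sorted(found)


# e.g. run after `import paddle` and `from juliacall import Main as jl`:
# for lib in loaded_llvm_libraries():
#     print(lib)
```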
It still occurs.
I created a minimal reproducible example, shown in the code below:

```python
import numpy as np
import pandas as pd
from pysr import PySRRegressor
import paddle  # note: paddle is imported after pysr here

# Stop Paddle from installing its own C++-level signal handlers
paddle.disable_signal_handler()

X = pd.DataFrame(np.random.randn(100, 10))
y = np.ones(X.shape[0])

model = PySRRegressor(
    progress=True,
    max_evals=10000,
    model_selection="accuracy",
    extra_sympy_mappings={},
    output_paddle_format=True,
    # multithreading=True,
)
model.fit(X, y)
```

Interestingly, when running … To summarize, when the environment variable …
Are you able to build an MWE that only uses juliacall rather than PySR? Hopefully we can boil it down even further. Maybe you could do something like:

```python
from juliacall import Main as jl
import paddle

paddle.disable_signal_handler()

jl.seval("""
Threads.@threads for i in 1:5
    println(i)
end
""")
```

which is about the simplest way of using multithreading in Julia.
It works as expected:

```
➜ python test.py
1
2
3
4
5
```
Since those numbers appear in order, it seems like you might not have multithreading turned on for Julia. Normally they appear in a random order, because each one is printed by a different thread. (Note the environment variables I mentioned above.)
```
➜ PYTHON_JULIACALL_HANDLE_SIGNALS=yes PYTHON_JULIACALL_THREADS=auto python test.py
```
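The same two options can also be set from inside the script instead of on the command line. This has to happen before `juliacall` (and therefore `pysr`) is imported, because the options are read once, when the Julia runtime starts. Below is a sketch. `configure_juliacall` is a hypothetical helper, while the environment variable names are the ones used above.

```python
import os
import sys
from typing import Union


def configure_juliacall(threads: Union[int, str] = "auto", handle_signals: bool = True) -> None:
    """Set juliacall startup options via environment variables.

    Must be called before `juliacall` (or `pysr`, which imports it) is imported,
    because the options are only read when the Julia runtime starts.
    """
    if "juliacall" in sys.modules:
        raise RuntimeError("juliacall is already imported; configure it before importing pysr/juliacall")
    os.environ["PYTHON_JULIACALL_THREADS"] = str(threads)
    os.environ["PYTHON_JULIACALL_HANDLE_SIGNALS"] = "yes" if handle_signals else "no"


# Usage at the very top of test.py:
# configure_juliacall(threads="auto", handle_signals=True)
# from juliacall import Main as jl
# print(jl.seval("Threads.nthreads()"))  # should be > 1 if threading is on
```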
Excuse me, is the current problem that Paddle does not support Julia multithreading unless PYTHON_JULIACALL_HANDLE_SIGNALS=yes and PYTHON_JULIACALL_THREADS=auto are set? I also ran

```python
from juliacall import Main as jl

jl.seval("""
Threads.@threads for i in 1:5
    println(i)
end
""")
```

and the result is also 1 2 3 4 5.
Excuse me, is this issue still being followed up?
Hi, |
Also, @lijialin03, feel free to make a new PR if you're interested, as I'm not sure whether @AndPuQing is still working on this one.
OK, I understand. Thank you for your reply.
When the environment variable … Below is a minimal reproducible example that demonstrates the issue:

```python
# test.py
import paddle
from pysr import jl

paddle.disable_signal_handler()

jl.seval(
    """
using Distributed: Distributed, @spawnat, Future, procs, addprocs

macro sr_spawner(expr, kws...)
    # Extract parallelism and worker_idx parameters from kws
    @assert length(kws) == 2
    @assert all(ex -> ex.head == :(=), kws)
    @assert any(ex -> ex.args[1] == :parallelism, kws)
    @assert any(ex -> ex.args[1] == :worker_idx, kws)
    parallelism = kws[findfirst(ex -> ex.args[1] == :parallelism, kws)::Int].args[2]
    worker_idx = kws[findfirst(ex -> ex.args[1] == :worker_idx, kws)::Int].args[2]
    return quote
        if $(parallelism) == :multithreading
            Threads.@spawn($(expr))
        else
            error("Invalid parallel type ", string($(parallelism)), ".")
        end
    end |> esc
end

@sr_spawner begin
    println("Hello, world!")
end parallelism=:multithreading worker_idx=1
"""
)
```

```
PYTHON_JULIACALL_HANDLE_SIGNALS=yes PYTHON_JULIACALL_THREADS=auto python test.py
```

Interestingly, this problem is not 100% reproducible, and the script may need to be run several times to trigger it (it looks like some kind of race between threads). Currently, this PR passes the unit tests by default. This error is only encountered when
Since this reduces the problem to an interaction between PaddlePaddle and PythonCall.jl, with no need for PySR in the MWE, perhaps we should copy this issue to both the PaddlePaddle repository and the PythonCall.jl repository? (Note that Distributed.jl is in the Julia standard library.) I think the code can also be reduced further: there's no need for the sr_spawner macro, since all it does is generate code. You can use
When I change the code as below to make sure the print is displayed, no errors are encountered.
And when I use the code below to check the number of threads, it is equal to PYTHON_JULIACALL_THREADS.
What is more, when I change the code as below, the number of "Hello, world!" lines printed is also correct.
So, what is your environment? Could it be causing the problem? Mine is:

```
Distributor ID: Ubuntu
Copyright (c) 2005-2022 NVIDIA Corporation
paddlepaddle-gpu==3.0.0b1
```
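Comparing environments is easier if both sides print the same report. The sketch below uses `importlib.metadata`. The package list is an assumption, using the distribution names as published on PyPI. The OS and CUDA details above look like `lsb_release` and `nvcc --version` output, and those would still have to be collected separately. `environment_report` is a hypothetical helper.

```python
import platform
from importlib import metadata
from typing import Dict, Iterable, Optional

DEFAULT_PACKAGES = ("pysr", "juliacall", "paddlepaddle", "paddlepaddle-gpu", "numpy")


def environment_report(packages: Iterable[str] = DEFAULT_PACKAGES) -> Dict[str, Optional[str]]:
    """Collect Python, OS and package versions; missing packages map to None."""
    report: Dict[str, Optional[str]] = {
        "python": platform.python_version(),
        "platform": platform.platform(),
    }
    for name in packages:
        try:
            report[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = None
    return report


# Usage:
# for key, value in environment_report().items():
#     print(f"{key}: {value}")
```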
Could you post this in the PythonCall.jl issues and also the Paddle issues? Since it doesn't seem to be caused by PySR itself, it's probably best to move the issue to one of those (you can cc me on those issues).
@AndPuQing Hello, developer, and thank you for your participation! Since your hackathon task is largely complete, its PR has been locked. Please finalize the locked PR as soon as possible and make sure it is merged before January 3, 2025. PRs not merged by the deadline will not be eligible for the bounty.
We identified the cause of the random core dumps with the help of @lijialin03. The issue arises because … As of now, the only viable solution I've found is to move … For reference, see the pytest documentation on PYTHONPATH.
Do we know exactly why paddle must be imported before pysr? Is it due to conflicting symbols exported by a .so file? Or is it purely that the signal handlers conflict? I feel like there should be a way to solve this for both import orders. Maybe we can ask the Paddle team for help on their GitHub issues or forums?
OK. An issue will be opened in the Paddle repository later. Also, for reference, do we know why torch must be imported after juliacall? This may help troubleshoot the bug.
So I think (?) the PyTorch issue is mostly resolved, or at least I haven't seen it myself in a while. Basically, PyTorch is compiled against a different LLVM than Julia, and PyTorch used RTLD_GLOBAL when loading its C library, which exports all of its LLVM symbols globally (and Julia's library loading then conflicts with them). There is a nice writeup on this issue here: pytorch/pytorch#78829 (comment), which might help with debugging the Paddle one if they are related. I think one of the solutions suggested to the PyTorch team was to use
Thanks, this might help us a lot!
@AndPuQing Hello, developer, and thank you for your participation! Your hackathon task is largely complete, and the content required by the task has been delivered. However, as discussed above, the conflict within the Paddle framework itself is difficult to resolve in the short term, so the PR cannot be merged for now. This is not the participant's fault, so the bounty for this task will still be awarded.
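To probe whether the Paddle crash has the same cause as the PyTorch one, one experiment is to check and control the flags CPython passes to `dlopen()` when it loads extension modules. The sketch below is Unix-only and is built on `sys.getdlopenflags` and `sys.setdlopenflags`. Whether importing Paddle with RTLD_LOCAL actually avoids the crash is untested. These flags only affect extension modules loaded through Python's import system. Libraries that a package loads itself (for example via `ctypes.CDLL(..., mode=...)`) keep whatever mode that call requests. `dlopen_flags` and `uses_rtld_global` are hypothetical helpers.

```python
import os
import sys
from contextlib import contextmanager
from typing import Iterator


def uses_rtld_global() -> bool:
    """True if CPython currently loads extension modules with RTLD_GLOBAL."""
    return bool(sys.getdlopenflags() & os.RTLD_GLOBAL)


@contextmanager
def dlopen_flags(flags: int) -> Iterator[None]:
    """Temporarily change the dlopen() flags CPython uses for extension modules
    (Unix only), restoring the previous flags afterwards."""
    previous = sys.getdlopenflags()
    sys.setdlopenflags(flags)
    try:
        yield
    finally:
        sys.setdlopenflags(previous)


# Experiment (untested): import paddle with local symbol visibility, then juliacall.
# with dlopen_flags(os.RTLD_NOW | os.RTLD_LOCAL):
#     import paddle
# from juliacall import Main as jl
```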
This PR introduces PaddlePaddle backend support to the PySR library as part of my contribution to the PaddlePaddle Hackathon 7th.
Key Changes:
Related Issues:
Issue Encountered:
During the integration, I discovered that the method `SymbolicRegression.equation_search` causes a termination by signal, as shown in the following code snippet:

Interestingly, this error does not occur when I set `PYTHON_JULIACALL_THREADS=1`. Below are some of the steps I took to investigate the issue.

After the termination, I checked `dmesg`, which returned the following:

I also used `gdb --args python -m pysr test paddle` to inspect the stack trace. The output is as follows:

To be honest, I'm not very familiar with Julia, but the issue seems to be related to multithreading within the SymbolicRegression library. I ran similar tests with the Torch backend, and below are the results:
I am encountering some challenges understanding the internal behavior of the SymbolicRegression library. I would greatly appreciate any guidance or suggestions on how to resolve this issue.