
[BUG]: EXCEPTION_ACCESS_VIOLATION during garbage collection in PySR #661

Open · zzccchen opened this issue on Jul 5, 2024 · 22 comments
Labels: bug (Something isn't working)

@zzccchen commented Jul 5, 2024

What happened?

The program crashed while using PySR, with an error message indicating a memory access violation (EXCEPTION_ACCESS_VIOLATION). This error occurred during the garbage collection process.

Version

v0.19.0

Operating System

Windows

Package Manager

pip

Interface

Script (i.e., python my_script.py)

Relevant log output

[ Info: Automatically setting `--heap-size-hint=2730M` on each Julia process. You can configure this with the `heap_size_hint_in_bytes` parameter.
[ Info: Importing SymbolicRegression on workers as well as extensions Bumper, LoopVectorization.
[ Info: Finished!
[ Info: Copying definition of loss_fnc to workers...
[ Info: Finished!
[ Info: Started!
32.1%┣█████████████████████████████████████████████████████████████████████████████████████████████████████████████                                                                                                                                                                                                                                      ┫ 1.0k/3.2k [00:40<01:26, 25it/s]
Please submit a bug report with steps to reproduce this fault, and any error messages that follow (in their entirety). Thanks.
Exception: EXCEPTION_ACCESS_VIOLATION at 0x7ffa6106a6b0 -- gc_mark_outrefs at C:/workdir/src\gc.c:2527 [inlined]
gc_mark_and_steal at C:/workdir/src\gc.c:2746
in expression starting at none:0
gc_mark_outrefs at C:/workdir/src\gc.c:2527 [inlined]
gc_mark_and_steal at C:/workdir/src\gc.c:2746
gc_mark_loop_parallel at C:/workdir/src\gc.c:2885
jl_gc_mark_threadfun at C:/workdir/src\partr.c:142
uv__thread_start at /workspace/srcdir/libuv\src/win\thread.c:111
beginthreadex at C:\Windows\System32\msvcrt.dll (unknown line)
endthreadex at C:\Windows\System32\msvcrt.dll (unknown line)
BaseThreadInitThunk at C:\Windows\System32\KERNEL32.DLL (unknown line)
RtlUserThreadStart at C:\Windows\SYSTEM32\ntdll.dll (unknown line)
Allocations: 9815735891 (Pool: 9517376769; Big: 298359122); GC: 69400

Extra Info

turbo=True, bumper=True

@zzccchen added the bug label on Jul 5, 2024
@MilesCranmer (Owner)

Can you try with turbo=False, bumper=False? Those options are experimental and get PySR to use libraries which are bleeding edge. When they work, they are really fast, but they can also cause crashes (especially on Windows).

@zzccchen (Author) commented Jul 9, 2024

Unfortunately, I tried with turbo=False, bumper=False and the crash still occurred.

@zzccchen (Author) commented Jul 9, 2024

Could the automatic --heap-size-hint=2730M setting be causing this problem?

@MilesCranmer (Owner)

Hm, can you show the rest of your code?

@zzccchen (Author) commented Jul 9, 2024

from pysr import PySRRegressor

# data load code

X_123e = data_X_123e.to_numpy()
y_123e = data_y_123e.to_numpy()

sr_model = PySRRegressor(
    binary_operators=[
        "*",
        "+",
        "-",
        "/",
    ],
    unary_operators=["square", "cube", "exp", "log", "sqrt"],
    maxsize=80, 
    maxdepth=10,  
    niterations=100, 
    populations=32, 
    population_size=100, 
    ncycles_per_iteration=550, 
    constraints={
        "/": (-1, 9),
        "^": (-1, 5),
        "exp": 6,
        "square": 6,
        "cube": 6,
        "log": 6,
        "sqrt": 6,
        "abs": 9,
    },
    nested_constraints={
        "square": {"square": 0, "cube": 0, "exp": 1},
        "cube": {"square": 0, "cube": 0, "exp": 1},
        "exp": {"square": 0, "cube": 0, "exp": 0},
        "sqrt": {"sqrt": 0, "log": 0},
        "log": {"log": 0},
    },
    complexity_of_operators={
        "square": 2,
        "cube": 3,
        "exp": 3,
        "log": 3,
        "sqrt": 2,
    },
    complexity_of_constants=4,
    adaptive_parsimony_scaling=150.0,
    weight_add_node=0.79,
    weight_insert_node=5.1,
    weight_delete_node=1.7,
    weight_do_nothing=0.21,
    weight_mutate_constant=0.048,
    weight_mutate_operator=0.47,
    weight_swap_operands=0.1,
    weight_randomize=0.23,
    weight_simplify=0.5,
    weight_optimize=0.5,
    crossover_probability=0.066,
    perturbation_factor=0.076,
    cluster_manager=None,
    precision=32,
    turbo=True,
    bumper=True,
    progress=True,
    elementwise_loss="""
    function loss_fnc(prediction, target)
        percentage_error = abs((prediction - target) / target) * 100
        return percentage_error
    end
    """,
    multithreading=False,
    equation_file=symbol_regression_csv_path,
)

complexity_of_variables = [] # list of complexity
sr_model.fit(
    X_123e, y_123e, complexity_of_variables=complexity_of_variables
)

Here is the main code of the workflow.

@zzccchen (Author) commented Jul 9, 2024

I also run the above code inside a multi-layer loop, to test different feature datasets and the stability of the symbolic regression results. A single iteration takes about 2.2 minutes. The program crashes after running for 3-4 hours, i.e., after roughly 80-110 rounds.

@MilesCranmer (Owner) commented Jul 9, 2024

That looks good. Great to see all those options being used! 🙂

(Random comment: your elementwise loss divides by the target, so make sure the targets are > 0; otherwise one target will dominate. But I’m assuming you’re aware of that!)
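For example, one way to guard against a target at or near zero, purely as a sketch (the epsilon floor here is arbitrary, not a recommendation), would be:

from pysr import PySRRegressor

# Illustrative only: a MAPE-style loss with a floor on |target|.
guarded_loss = """
function loss_fnc(prediction, target)
    return abs((prediction - target) / max(abs(target), 1f-8)) * 100
end
"""
sr_model = PySRRegressor(elementwise_loss=guarded_loss)  # other options as in your script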

Other comment: can you try with multithreading=True? With it set to False, and with procs>0 (the default), it will use multiple Julia processes. But if you just use multi-threading instead, it will start up much faster and hopefully be more stable. With multi-processing it is launching new Julia processes every single time it searches. (This is a weakness in the current codebase; I would like to eventually store the processes within PySRRegressor so multiprocessing has fast startup too.)

You can also set multithreading=False, procs=0 to use serial mode.
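For reference, the three modes are selected roughly like this (just a sketch; procs=8 is an arbitrary example worker count):

from pysr import PySRRegressor

# Multithreading: a single Julia process with many threads (fast startup).
model_mt = PySRRegressor(multithreading=True)

# Multiprocessing (multithreading=False with procs > 0): separate Julia
# worker processes are launched for every search.
model_mp = PySRRegressor(multithreading=False, procs=8)

# Serial mode: no parallelism at all.
model_serial = PySRRegressor(multithreading=False, procs=0)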

But it’s curious that it crashes. Since it runs for a few hours, did you notice anything else happening, like the memory usage gradually increasing over that time and not going down?

@zzccchen (Author) commented Jul 9, 2024

If I use multithreading instead of multiprocessing, the calculation speed drops from 30 it/s to 7 it/s on my device, which is a bit unacceptable to me. In addition, I have made sure that my y_true values are all greater than 0. And the memory usage does not fluctuate when the program crashes; it occupies only about 30% of the total memory.

@MilesCranmer (Owner) commented Jul 9, 2024

Maybe try multithreading=True again, but this time, before loading PySR, set a larger thread count:

import os
os.environ["PYTHON_JULIACALL_THREADS"] = str(num_cores * 2)

Where num_cores is the number of CPU cores. The factor of 2 is so there’s some redundancy, but you could try more or fewer depending on performance.

The default behavior of PySR is to start Julia with --threads='auto', which is actually fewer than the number of available cores (so it doesn’t take up the whole CPU). For higher performance you can increase this.

The full list of available juliacall environment variables is here: https://juliapy.github.io/PythonCall.jl/stable/juliacall/#julia-config
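Putting it together, the ordering matters; roughly (using os.cpu_count() as a stand-in for your physical core count):

import os

num_cores = os.cpu_count() or 8  # stand-in; substitute your physical core count
os.environ["PYTHON_JULIACALL_THREADS"] = str(num_cores * 2)

# Import PySR only after the environment variable is set, so the embedded
# Julia runtime picks up the thread count at startup.
from pysr import PySRRegressor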

@zzccchen (Author)

I tried

import os
os.environ["PYTHON_JULIACALL_THREADS"] = "64"
# or
os.environ["PYTHON_JULIACALL_THREADS"] = "64"
os.environ["PYTHON_JULIACALL_PROCS"] = "64"

But it did not improve the calculation speed; processor usage was only 20-30%. I am using a 24-core/32-thread i9-14900K processor.

@MilesCranmer (Owner)

To confirm, this was before importing PySR, right? As a test, if you set it to 1, the CPU usage should only be 1 core.

Also note that the PROCS env variable won’t have any effect.

@zzccchen (Author)

I had a similar problem after giving up on Windows and moving to Ubuntu 24.04 LTS. I also used a tool (TM5) to test the memory: after testing for 1 hour there were no errors and the temperature was stable at 45℃, so it doesn't seem to be a hardware problem. This problem is so strange.

Traceback (most recent call last):
  File "/home/zc/Documents/GitHub/MLPIP/notebooks/TC/S2_symbol_regression/S202_sr_123e.py", line 192, in <module>
    sr_model.fit(
  File "/home/zc/miniconda3/envs/MLPIP_ENV_PIP/lib/python3.11/site-packages/pysr/sr.py", line 2088, in fit
    self._run(X, y, runtime_params, weights=weights, seed=seed)
  File "/home/zc/miniconda3/envs/MLPIP_ENV_PIP/lib/python3.11/site-packages/pysr/sr.py", line 1890, in _run
    out = SymbolicRegression.equation_search(
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/zc/.julia/packages/PythonCall/S5MOg/src/JlWrap/any.jl", line 223, in __call__
    return self._jl_callmethod($(pyjl_methodnum(pyjlany_call)), args, kwargs)
       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Unhandled Task ERROR: IOError: read: connection reset by peer (ECONNRESET)
Stacktrace:
  [1] wait_readnb(x::Sockets.TCPSocket, nb::Int64)
    @ Base ./stream.jl:410
  [2] (::Base.var"#wait_locked#739")(s::Sockets.TCPSocket, buf::IOBuffer, nb::Int64)
    @ Base ./stream.jl:949
  [3] unsafe_read(s::Sockets.TCPSocket, p::Ptr{UInt8}, nb::UInt64)
    @ Base ./stream.jl:955
  [4] unsafe_read
    @ ./io.jl:774 [inlined]
  [5] unsafe_read(s::Sockets.TCPSocket, p::Base.RefValue{NTuple{4, Int64}}, n::Int64)
    @ Base ./io.jl:773
  [6] read!
    @ ./io.jl:775 [inlined]
  [7] deserialize_hdr_raw
    @ ~/miniconda3/envs/MLPIP_ENV_PIP/julia_env/pyjuliapkg/install/share/julia/stdlib/v1.10/Distributed/src/messages.jl:167 [inlined]
  [8] message_handler_loop(r_stream::Sockets.TCPSocket, w_stream::Sockets.TCPSocket, incoming::Bool)
    @ Distributed ~/miniconda3/envs/MLPIP_ENV_PIP/julia_env/pyjuliapkg/install/share/julia/stdlib/v1.10/Distributed/src/process_messages.jl:172
  [9] process_tcp_streams(r_stream::Sockets.TCPSocket, w_stream::Sockets.TCPSocket, incoming::Bool)
    @ Distributed ~/miniconda3/envs/MLPIP_ENV_PIP/julia_env/pyjuliapkg/install/share/julia/stdlib/v1.10/Distributed/src/process_messages.jl:133
 [10] (::Distributed.var"#103#104"{Sockets.TCPSocket, Sockets.TCPSocket, Bool})()
    @ Distributed ~/miniconda3/envs/MLPIP_ENV_PIP/julia_env/pyjuliapkg/install/share/julia/stdlib/v1.10/Distributed/src/process_messages.jl:121
juliacall.JuliaError: TaskFailedException
Stacktrace:
  [1] wait
    @ ./task.jl:352 [inlined]
  [2] fetch
    @ ./task.jl:372 [inlined]
  [3] _main_search_loop!(state::SymbolicRegression.SearchUtilsModule.SearchState{Float32, Float32, Node{Float32}, Distributed.Future, Distributed.RemoteChannel}, datasets::Vector{Dataset{Float32, Float32, Matrix{Float32}, Vector{Float32}, Nothing, @NamedTuple{}, Nothing, Nothing, Nothing, Nothing}}, ropt::SymbolicRegression.SearchUtilsModule.RuntimeOptions{:multiprocessing, 1, true}, options::Options{SymbolicRegression.CoreModule.OptionsStructModule.ComplexityMapping{Int64, Vector{Int64}}, DynamicExpressions.OperatorEnumModule.OperatorEnum, Node, true, true, nothing, StatsBase.Weights{Float64, Float64, Vector{Float64}}})
    @ SymbolicRegression ~/.julia/packages/SymbolicRegression/9q4ZC/src/SymbolicRegression.jl:882
  [4] _equation_search(datasets::Vector{Dataset{Float32, Float32, Matrix{Float32}, Vector{Float32}, Nothing, @NamedTuple{}, Nothing, Nothing, Nothing, Nothing}}, ropt::SymbolicRegression.SearchUtilsModule.RuntimeOptions{:multiprocessing, 1, true}, options::Options{SymbolicRegression.CoreModule.OptionsStructModule.ComplexityMapping{Int64, Vector{Int64}}, DynamicExpressions.OperatorEnumModule.OperatorEnum, Node, true, true, nothing, StatsBase.Weights{Float64, Float64, Vector{Float64}}}, saved_state::Nothing)
    @ SymbolicRegression ~/.julia/packages/SymbolicRegression/9q4ZC/src/SymbolicRegression.jl:599
  [5] equation_search(datasets::Vector{Dataset{Float32, Float32, Matrix{Float32}, Vector{Float32}, Nothing, @NamedTuple{}, Nothing, Nothing, Nothing, Nothing}}; niterations::Int64, options::Options{SymbolicRegression.CoreModule.OptionsStructModule.ComplexityMapping{Int64, Vector{Int64}}, DynamicExpressions.OperatorEnumModule.OperatorEnum, Node, true, true, nothing, StatsBase.Weights{Float64, Float64, Vector{Float64}}}, parallelism::String, numprocs::Int64, procs::Nothing, addprocs_function::Nothing, heap_size_hint_in_bytes::Nothing, runtests::Bool, saved_state::Nothing, return_state::Bool, verbosity::Int64, progress::Bool, v_dim_out::Val{1})
    @ SymbolicRegression ~/.julia/packages/SymbolicRegression/9q4ZC/src/SymbolicRegression.jl:571
  [6] equation_search
    @ ~/.julia/packages/SymbolicRegression/9q4ZC/src/SymbolicRegression.jl:449 [inlined]
  [7] #equation_search#26
    @ ~/.julia/packages/SymbolicRegression/9q4ZC/src/SymbolicRegression.jl:412 [inlined]
  [8] equation_search
    @ ~/.julia/packages/SymbolicRegression/9q4ZC/src/SymbolicRegression.jl:360 [inlined]
  [9] #equation_search#28
    @ ~/.julia/packages/SymbolicRegression/9q4ZC/src/SymbolicRegression.jl:442 [inlined]
 [10] pyjlany_call(self::typeof(equation_search), args_::Py, kwargs_::Py)
    @ PythonCall.JlWrap ~/.julia/packages/PythonCall/S5MOg/src/JlWrap/any.jl:36
 [11] _pyjl_callmethod(f::Any, self_::Ptr{PythonCall.C.PyObject}, args_::Ptr{PythonCall.C.PyObject}, nargs::Int64)
    @ PythonCall.JlWrap ~/.julia/packages/PythonCall/S5MOg/src/JlWrap/base.jl:72
 [12] _pyjl_callmethod(o::Ptr{PythonCall.C.PyObject}, args::Ptr{PythonCall.C.PyObject})
    @ PythonCall.JlWrap.Cjl ~/.julia/packages/PythonCall/S5MOg/src/JlWrap/C.jl:63

    nested task error: Distributed.ProcessExitedException(423)
    Stacktrace:
      [1] try_yieldto(undo::typeof(Base.ensure_rescheduled))
        @ Base ./task.jl:931
      [2] wait()
        @ Base ./task.jl:995
      [3] wait(c::Base.GenericCondition{ReentrantLock}; first::Bool)
        @ Base ./condition.jl:130
      [4] wait
        @ ./condition.jl:125 [inlined]
      [5] take_buffered(c::Channel{Any})
        @ Base ./channels.jl:477
      [6] take!(c::Channel{Any})
        @ Base ./channels.jl:471
      [7] take!(::Distributed.RemoteValue)
        @ Distributed ~/miniconda3/envs/MLPIP_ENV_PIP/julia_env/pyjuliapkg/install/share/julia/stdlib/v1.10/Distributed/src/remotecall.jl:726
      [8] remotecall_fetch(f::Function, w::Distributed.Worker, args::Distributed.RRID; kwargs::@Kwargs{})
        @ Distributed ~/miniconda3/envs/MLPIP_ENV_PIP/julia_env/pyjuliapkg/install/share/julia/stdlib/v1.10/Distributed/src/remotecall.jl:461
      [9] remotecall_fetch(f::Function, w::Distributed.Worker, args::Distributed.RRID)
        @ Distributed ~/miniconda3/envs/MLPIP_ENV_PIP/julia_env/pyjuliapkg/install/share/julia/stdlib/v1.10/Distributed/src/remotecall.jl:454
     [10] remotecall_fetch
        @ ~/miniconda3/envs/MLPIP_ENV_PIP/julia_env/pyjuliapkg/install/share/julia/stdlib/v1.10/Distributed/src/remotecall.jl:492 [inlined]
     [11] call_on_owner
        @ ~/miniconda3/envs/MLPIP_ENV_PIP/julia_env/pyjuliapkg/install/share/julia/stdlib/v1.10/Distributed/src/remotecall.jl:565 [inlined]
     [12] fetch(r::Distributed.Future)
        @ Distributed ~/miniconda3/envs/MLPIP_ENV_PIP/julia_env/pyjuliapkg/install/share/julia/stdlib/v1.10/Distributed/src/remotecall.jl:619
     [13] (::SymbolicRegression.var"#67#72"{SymbolicRegression.SearchUtilsModule.SearchState{Float32, Float32, Node{Float32}, Distributed.Future, Distributed.RemoteChannel}, Int64, Int64})()
        @ SymbolicRegression ~/.julia/packages/SymbolicRegression/9q4ZC/src/SymbolicRegression.jl:984

@MilesCranmer (Owner) commented Jul 15, 2024

Just to confirm, there is no crash now? Just that this message is printed?

I see this message sometimes during testing. So far, it has seemed to be harmless, and has never caused a crash – it simply indicates that one of the worker processes has exited, due to the search returning, and the @async fetch call on that worker failed.

However, if this is what is causing the error, perhaps it is not harmless, and we should close the asynchronous fetch tasks before the worker processes are killed.

@MilesCranmer (Owner)

I do think it would be better if there were a way to make multithreading faster, e.g., by increasing PYTHON_JULIACALL_THREADS before importing PySR. Windows multiprocessing seems to occasionally have issues for unknown reasons, and has been quite hard to debug, whereas multithreading has been quite stable.

@zzccchen (Author)

This message appears when the search reaches about 30%, and then the search stops. I can try to reproduce it again to see if it crashes. Also, would using the Slurm backend help avoid this problem?

@MilesCranmer (Owner)

Thanks. So if this reproduces on Ubuntu, it seems like a deeper issue. Can you share your data so that I can reproduce it on my machine? If there is some script I can run which reproduces the error exactly on my computer, it will be easier to help debug it.

Also, the more minimal the code, the easier it will be for me to debug it. So perhaps try (1) reducing the dataset size, (2) creating conditions that cause the error to occur earlier during training, or (3) using fewer parameters of PySR.

I guess it might be hard to make a smaller MWE, but (2) would be most useful.
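As a rough sketch of what such a reproduction script could look like (the random data and reduced settings below are placeholders, not your actual setup):

import numpy as np
from pysr import PySRRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5)).astype(np.float32)
y = (X[:, 0] ** 2 + np.exp(X[:, 1])).astype(np.float32)

# Many short searches instead of one long run, to hit the crash sooner.
for round_idx in range(200):
    model = PySRRegressor(
        niterations=5,
        populations=8,
        multithreading=False,  # same multiprocessing mode as the failing runs
        progress=False,
    )
    model.fit(X, y)
    print(f"round {round_idx} finished")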


The Slurm backend is only relevant if you’re using a Slurm computing cluster, and it won’t be available otherwise.

@zzccchen (Author)

> To confirm, this was before importing PySR, right? As a test, if you set it to 1, the CPU usage should only be 1 core.
>
> Also note that the PROCS env variable won’t have any effect.

I have confirmed this point. If I set os.environ["PYTHON_JULIACALL_THREADS"] = "1", it warns: Warning: You are using multithreading mode, but only one thread is available. Try starting julia with --threads=auto.

@zzccchen (Author)

> Thanks. So if this reproduces on Ubuntu, it seems like a deeper issue. Can you share your data so that I can reproduce it on my machine? If there is some script I can run which reproduces the error exactly on my computer, it will be easier to help debug it.
>
> Also, the more minimal the code, the easier it will be for me to debug it. So perhaps try (1) reducing the dataset size, (2) creating conditions that cause the error to occur earlier during training, or (3) using fewer parameters of PySR.
>
> I guess it might be hard to make a smaller MWE, but (2) would be most useful.
>
> The Slurm backend is only relevant if you’re using a Slurm computing cluster, and it won’t be available otherwise.

Thank you very much. I need to apply for permission before the relevant code and data can be provided. In addition, I have an Ubuntu 20 server running single-node Slurm; in a preliminary test its calculation speed was consistent with multiprocessing. I can test on that device to confirm whether this is a device-specific problem.

@zzccchen (Author)

> Just to confirm, there is no crash now? Just that this message is printed?
>
> I see this message sometimes during testing. So far, it has seemed to be harmless, and has never caused a crash – it simply indicates that one of the worker processes has exited, due to the search returning, and the @async fetch call on that worker failed.
>
> However, if this is what is causing the error, perhaps it is not harmless, and we should close the asynchronous fetch tasks before the worker processes are killed.

I have confirmed that this error interrupts the search process. I temporarily bypassed the crash with a try...except Exception block in the Python code, but the memory allocated by Julia was not released, so my memory filled up after three crashes. Could a try-finally block in the Julia source code improve the stability of the program?

error_log.txt
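For reference, the bypass looks roughly like this (a simplified sketch of my loop, not the exact code; feature_sets stands in for the datasets being iterated over):

for feature_set in feature_sets:  # hypothetical outer loop over feature datasets
    try:
        sr_model.fit(X_123e, y_123e, complexity_of_variables=complexity_of_variables)
    except Exception as exc:  # catches the juliacall.JuliaError shown above
        print(f"search crashed, skipping this round: {exc!r}")
        continue  # note: the memory held by the Julia workers is not released here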

@zzccchen (Author)

I think I have found a temporary workaround for the time being, which is to manually kill the Julia processes after each search:

import os
import time

time.sleep(10)              # give the search time to finish and write its results
os.system("killall julia")  # kill any leftover Julia worker processes

@MilesCranmer (Owner) commented Jul 18, 2024

Thanks. That is good to know.

I do think the way SymbolicRegression.jl launches processes is a bit problematic for large-scale use-cases at the moment. The way it works is that it calls addprocs from within SymbolicRegression.equation_search. This was designed for convenience of users, especially on the Python side, but as far as I can tell it's not well-supported behavior in Julia, which means it needs to do some very fragile things like manually copying function definitions to workers.

It would be better if PySR used one of the following alternative strategies:

  1. For big jobs, use MPI directly, via MPI.jl. This would require the user to call mpiexec manually, rather than launching the multi-process search from a single Python session. However, it is nice that MPI is supported as a standard on every cluster, so we wouldn't need to rely on different cluster-manager-specific scripts.
  2. Explore @oschulz's ParallelProcessingTools.jl as an alternative. This uses an elastic manager – which is actually designed for the things PySR is doing, like adding and removing workers. (Right now PySR basically misuses Distributed.jl to start new processes, send code to them, and finally kill them at the end of a search. It works and it's convenient, but I'm not sure it is a sustainable solution.)
  3. Start the workers from the Python side, rather than within Julia (see the sketch at the end of this comment). Basically, the PySRRegressor object itself would call addprocs and store the processes as an attribute of the regressor object. It could then pass these to equation_search via the procs keyword argument, in which case SymbolicRegression.jl will simply use them.
    • However, this would require rewriting some of the Python side of things so that each jl.seval is called with an @everywhere in front of it – thus executing each Julia snippet on all processes. This also means that it would be harder for users to use jl.seval themselves.
    • This approach would also mean that we could wrap PySR in a Julia module, rather than the current approach of running everything in Julia's Main context – which might interfere with other Python+Julia packages in the future.

I'm not sure how much work each of these options would be. They might be fairly easy to get working though. But it would definitely require some Julia coding (if you are up for it).
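A very rough sketch of what option 3 could look like from the Python side (the jl import path and the exact calls are assumptions about the current layout, not working code in the package today):

from pysr.julia_import import jl  # assumption: the juliacall Julia module exposed by PySR

jl.seval("using Distributed")
jl.seval("addprocs(8)")                           # start workers once, up front
jl.seval("@everywhere using SymbolicRegression")  # load the package on every worker
# The regressor would then pass these workers to equation_search via the
# `procs` keyword, instead of launching new processes on every fit.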

@MilesCranmer (Owner)

Just going to keep this open until there's a better solution than a manual workaround. Ideally the workaround shouldn't be needed.
