Stack overflow in Julia nightly, Mac OS #57149

Open
longemen3000 opened this issue Jan 24, 2025 · 5 comments

Comments

@longemen3000
Contributor

Hello,

I've been consistently hitting stack overflows in my tests for Clapeyron.jl on the Julia nightly, macOS-latest, aarch64 GitHub Actions worker. The warning is the following:

Warning: detected a stack overflow; program state may be corrupted, so further execution might be unreliable.

And then it hangs until the test timeout.

@vtjnash
Member

vtjnash commented Jan 24, 2025

Duplicate of #55513

@vtjnash vtjnash marked this as a duplicate of #55513 Jan 24, 2025
@vtjnash vtjnash closed this as completed Jan 24, 2025
@vtjnash vtjnash reopened this Jan 24, 2025
@vtjnash
Member

vtjnash commented Jan 24, 2025

Sorry, I realized that was only one of the issues. It also causes the OrcJIT to crash internally while running

build_eosmodel at /Users/jameson/.julia/packages/Clapeyron/d5Dw3/src/utils/macros.jl:678
    678     set_reference_state!(model,verbose = verbose)

due to an excessive number of function calls. Since it was the OrcJIT that crashed, Julia is still holding the OrcJIT-related locks after the StackOverflow and thus is unable to continue with code generation (e.g. it cannot generate the code to print the error). We used to have workarounds for this (only letting OrcJIT get one function at a time), but that limited scalability. This is an OrcJIT implementation issue, and not something we can solve downstream.

@longemen3000
Contributor Author

Is there something I can do as a package developer, to at least reduce how often the issue occurs?

@vtjnash
Member

vtjnash commented Jan 24, 2025

Nothing particularly simple, as I don't think we have any tooling written to help with that sort of analysis. You can try to avoid generating an excessive number of methods with distinct signatures (I don't know any specifics here for Clapeyron). For example, you could try adding @nospecialize to some arguments and ::SomeType annotations to others, especially in functions that are recursive over the structure of a type, value, or macro argument. Similarly, make sure you aren't calling (or defining) methods that have many parameters with a Union sort of type signature, whether implicitly (because you pass many things of different kinds) or explicitly (because you write the type assertions). Some of these patterns can also originate in the libraries you use.
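As a rough illustration of the @nospecialize and ::SomeType suggestions above, here is a minimal sketch. The component types and function names are hypothetical and not taken from Clapeyron; whether changes like these actually keep a given package under the OrcJIT limit is an assumption that would need to be checked case by case.

```julia
# Hypothetical component types, standing in for a package's own structs.
struct Water end
struct Methane end
struct Ethanol end

# By default, Julia may compile a separate specialization for each
# concrete argument type this function is called with.
describe_specialized(x) = string("component: ", nameof(typeof(x)))

# @nospecialize asks the compiler not to specialize on the type of `x`,
# so differently typed callers can share compiled code.
function describe_generic(@nospecialize(x))
    return string("component: ", nameof(typeof(x)))
end

# A concrete ::SomeType annotation restricts the method to one argument
# type, so there is only one signature to compile for.
count_components(names::Vector{String}) = length(names)

# Recursion over the structure of a tuple: each distinct tuple type
# reached during the recursion can produce another specialization.
total_recursive(t::Tuple{}) = 0
total_recursive(t::Tuple) = first(t) + total_recursive(Base.tail(t))

# A loop-based alternative that is not recursive over the tuple type.
function total_loop(@nospecialize(t::Tuple))
    acc = 0
    for v in t
        acc += v
    end
    return acc
end

components = Any[Water(), Methane(), Ethanol()]
labels = [describe_generic(c) for c in components]
```

Both variants of each function return the same values; the difference is only in how many distinct signatures the compiler is asked to generate code for. @nospecialize usually trades some runtime performance for less compilation, so it tends to fit setup or construction code better than hot inner loops.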

Unfortunately, it seems the measured upper limit before OrcJIT crashes is only about 1700 function calls present, in the best of circumstances (typically probably only half of that), so that is a rather small limit.

@oscardssmith
Member

Why is macOS aarch64 the only affected platform? Also, why isn't the SciML ecosystem running into this?
