
InexactError: Int64(1.0) on macOS (CI) #55878

Open

lmiq opened this issue Sep 25, 2024 · 17 comments
Labels
compiler:effects effect analysis heisenbug This bug occurs unpredictably rr trace wanted An rr trace would help with debugging this issue - you can help out by creating one

Comments

@lmiq
Contributor

lmiq commented Sep 25, 2024

I experienced this (non-reproducible) error in a CI run on the prerelease (now 1.11), on the macOS platform:

 ERROR: LoadError: TaskFailedException

    nested task error: InexactError: Int64(1.0)

The line where the error occurred was:

            index = floor(Int, xi) + 1

in which xi is a Float64. As far as I understand, the result of floor should always be exactly representable as an Int, so the error should not occur.
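
For context, a minimal sketch of the expected semantics (illustrative values, not taken from the failing run):

    floor(Int, 1.0)       # returns 1; integer-valued Float64s convert exactly
    floor(Int, 0.999999)  # returns 0; flooring happens before the conversion to Int
    floor(Int, 1.0e19)    # throws InexactError, since the floored value does not fit in Int64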

When I reported this on Zulip, @jlumpe (sorry if the tag is wrong) mentioned having observed similar errors twice:

https://github.com/mikeingold/MeshIntegrals.jl/actions/runs/10965929580/job/30452792024?pr=73#step:7:191
https://github.com/JoshuaLampert/DispersiveShallowWater.jl/actions/runs/10811172112/job/29990016675?pr=150#step:7:8511

But those were with 1.10.

Thus, although this is hard to reproduce, it might be something worth investigating.

@oscardssmith oscardssmith added the rr trace wanted An rr trace would help with debugging this issue - you can help out by creating one label Sep 25, 2024
@oscardssmith
Member

Something is obviously going horrifically wrong here. It will be very hard to debug without an rr trace.

@mbauman
Member

mbauman commented Sep 25, 2024

Are you using xi in a muladd in your case? In the linked CI runs, both are happening in cospi or sincospi, where the nearby muladd looks suspiciously like it might result in something like #55785?

@oscardssmith
Member

the really confusing thing here is that the error message shows the number as being an int...

@lmiq
Contributor Author

lmiq commented Sep 25, 2024

The function where the error appeared for me was this one:

@inline function particle_cell(x::SVector{N}, box::Box) where {N}
    CartesianIndex(
        ntuple(N) do i
            xmin = box.computing_box[1][i]
            xi = (x[i] - xmin) / box.cell_size[i]
            index = floor(Int, xi) + 1
            return index
        end
    )
end

(from here: https://github.com/m3g/CellListMap.jl/blob/f85ede5118462331d23303833249f84ac6942ff3/src/Box.jl#L584)

@giordano
Contributor

Tangentially related: since, as far as I understand, @testset sets the RNG state internally, it'd be nice if @testset also printed that state to let people reproduce the failure.

@lmiq
Contributor Author

lmiq commented Sep 25, 2024

Unfortunately I reran the job and I can't reproduce it, but I'm using StableRNGs, so I'm not sure that would really help.

@mbauman
Member

mbauman commented Sep 25, 2024

the really confusing thing here is that the error message shows the number as being an int

Yes, that's very fast-mathy. Int conversion does the following:

    %1 = call double @llvm.trunc.f64(double %0)   ; truncate toward zero
    %2 = fsub double %0, %1                       ; fractional part
    %3 = fcmp une double %2, 0.000000e+00         ; is it nonzero?
    br i1 %3, label %error_path, label %ok_path   ; if so, take the error path

And then the %error_path uses %0. If that fsub fuses into an fma with an inlined/prior multiplication to create %0, you could get a different answer for the br. But this should be reproducible with a stable RNG, so it may be a red herring.
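
To illustrate (a constructed example, not the code from the failing run): if the fsub were contracted with a prior multiplication, the check could take the error path even though the rounded value is exactly 1.0, which would surface as something like InexactError: Int64(1.0):

    x = 1/3
    p = x * 3.0             # rounds to exactly 1.0, so trunc(p) == p
    p - trunc(p)            # 0.0 -> the unfused fsub lets the conversion succeed
    fma(x, 3.0, -trunc(p))  # small nonzero (-2^-54) -> a fused fsub would branch to the error path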

@giordano
Contributor

That's not annotated with contract, though, is it?

@mbauman
Member

mbauman commented Sep 25, 2024

Correct. This could be dependent upon surrounding code if it inlines, but neither sinpi nor sincospi inline, so I don't think that's the reason here.

@oscardssmith
Member

oscardssmith commented Sep 25, 2024

Even if it is, that doesn't actually explain anything. This error can only occur if the compiler proves that xi is outside of [-0x1p63, 0x1p63), which 1.0 obviously isn't, so this bug requires a pretty dramatic miscompile.
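
For reference, the boundaries of that range under plain Float64/Int64 semantics (nothing specific to this bug):

    floor(Int, prevfloat(0x1p63))  # largest Float64 below 2^63: converts fine
    floor(Int, -0x1p63)            # -2^63 == typemin(Int64): converts fine
    floor(Int, 0x1p63)             # 2^63 is out of range: throws InexactError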

@oscardssmith oscardssmith added heisenbug This bug occurs unpredictably compiler:effects effect analysis labels Sep 25, 2024
@lmiq
Contributor Author

lmiq commented Oct 17, 2024

@oscardssmith
Member

Does rr work for Mac? I really want to see what is going on here.

@lmiq
Contributor Author

lmiq commented Oct 17, 2024

And again: https://github.com/m3g/ComplexMixtures.jl/actions/runs/11391544466/job/31695572484

It seems that I can get these errors with some frequency. If someone can give me a hint on how to provide further information (I don't have a Mac, unfortunately), I will be glad to help.

If someone has a Mac, maybe with some luck the bug can be reproduced by running the testset where this happened for me last time (below), or maybe just by running tons of floor(Int, 100.0 * rand()) calls or something similar (a sketch of such a loop follows the test code).

    using ComplexMixtures
    using PDBTools: readPDB, select
    using ComplexMixtures.Testing: data_dir
    using Test

    # Test simple three-molecule system: cross correlation
    atoms = readPDB("$data_dir/toy/cross.pdb")
    protein = AtomSelection(select(atoms, "protein and model 1"), nmols=1)
    water = AtomSelection(select(atoms, "resname WAT and model 1"), natomspermol=3)
    traj = Trajectory("$data_dir/toy/cross.pdb", protein, water, format="PDBTraj")

    for nthreads in [1,2], lastframe in [1, 2], low_memory in [true, false]
        options = Options(;
            seed=321,
            StableRNG=true,
            nthreads,
            silent=true,
            n_random_samples=10^5,
            lastframe,
        )
        R = mddf(traj, options; low_memory)
        @test R.volume.total == 27000.0
        @test R.volume.domain ≈ R.volume.total - R.volume.bulk
        @test isapprox(R.volume.domain, (4π / 3) * R.dbulk^3; rtol=0.01)
        @test R.density.solute ≈ 1 / R.volume.total
        @test R.density.solvent ≈ 3 / R.volume.total
        @test R.density.solvent_bulk ≈ 2 / R.volume.bulk
        @test sum(R.md_count) ≈ 1
        @test sum(R.coordination_number) ≈ 51
        C = coordination_number(traj, options; low_memory)
        @test C.volume.total == R.volume.total
        @test C.volume.domain ≈ 0.0
        @test C.density.solute ≈ 1 / C.volume.total
        @test C.density.solvent ≈ 3 / C.volume.total
        @test C.density.solvent_bulk == 0.0
        @test C.md_count == R.md_count
        @test coordination_number(C) == coordination_number(R)
    end
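
And a sketch of the brute-force alternative mentioned above (a hypothetical harness, not code from the package):

    # Hammer the same floor(Int, ...) pattern; an InexactError or an
    # out-of-range index here would flag the miscompile.
    function hammer(n)
        for _ in 1:n
            xi = 100.0 * rand()
            index = floor(Int, xi) + 1
            1 <= index <= 100 || error("unexpected index $index for xi = $xi")
        end
    end
    hammer(10^9)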

@giordano
Contributor

I don't have a Mac

Usual reminder that https://github.com/mxschmitt/action-tmate lets you log into GitHub Actions runners.

@lmiq
Contributor Author

lmiq commented Oct 18, 2024

I didn't know that existed.

I logged into a macos-latest machine and ran hundreds of billions of floor(Int, 1000*rand()) operations without getting any error. I also ran the tests of my package a few times, also without any error.

Unfortunately, this is really hard to reproduce. I thought I would get it failing easily, since I got the error often today when pushing new commits to the repo.

I don't know if that helps, but this is the CI workflow of the repo where the issue appears: https://github.com/m3g/ComplexMixtures.jl/blob/main/.github/workflows/ci.yml

and this is the CI workflow of the tmate session I used most recently: https://github.com/lmiq/tmate/blob/main/.github/workflows/ci.yml

Maybe there is some configuration difference that someone can pinpoint.

@IanButterworth
Member

I've not reviewed this in detail, but one of your failing CI runs had coverage on and bounds checking forced on. You might need those to reproduce?

See the args that the packages were precompiled for, in the logs.
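
For example (a hedged sketch; the exact flags should be read off the failing run's logs), forcing both settings locally could look like:

    using Pkg
    Pkg.test("ComplexMixtures"; coverage=true, julia_args=["--check-bounds=yes"])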

@IanButterworth
Member

For reproducing locally: julia-runtest runs now show how to rerun with the same config, see julia-actions/julia-runtest#124.
