Failing test on Julia v1.10 neural networks with CUDA #578

Open
giordano opened this issue Jan 19, 2025 · 3 comments

@giordano
Member

I'm opening this ticket with the goal of making CI all green. The only (systematic?) error I see is https://buildkite.com/julialang/reactant-dot-jl/builds/3529#01947d51-fb0d-4ea7-b297-05c1cacec5db/241-843

E0119 07:11:07.708265 3775037 buffer_comparator.cc:156] Difference at 32: -0.138306, expected 1.27086
E0119 07:11:07.708338 3775037 buffer_comparator.cc:156] Difference at 33: 2.6427, expected 1.29896
E0119 07:11:07.708346 3775037 buffer_comparator.cc:156] Difference at 34: -0.0935357, expected 1.16674
E0119 07:11:07.708355 3775037 buffer_comparator.cc:156] Difference at 35: -5.72594e-15, expected 1.14534
E0119 07:11:07.708366 3775037 buffer_comparator.cc:156] Difference at 36: -0.00014386, expected 1.23717
E0119 07:11:07.708396 3775037 buffer_comparator.cc:156] Difference at 38: 0.530177, expected 1.03532
E0119 07:11:07.708413 3775037 buffer_comparator.cc:156] Difference at 39: 4.34351, expected 1.12292
E0119 07:11:07.708431 3775037 buffer_comparator.cc:156] Difference at 40: 2.32142, expected 1.20271
E0119 07:11:07.708440 3775037 buffer_comparator.cc:156] Difference at 41: -4.29512e-06, expected 1.32259
E0119 07:11:07.708450 3775037 buffer_comparator.cc:156] Difference at 42: -0.0126644, expected 1.14966
2025-01-19 07:11:07.708473: E external/xla/xla/service/gpu/autotuning/gemm_fusion_autotuner.cc:1082] Results do not match the reference. This is likely a bug/unexpected loss of precision.
E0119 07:11:07.710511 3775037 buffer_comparator.cc:156] Difference at 32: -0.138306, expected 1.27086
E0119 07:11:07.710540 3775037 buffer_comparator.cc:156] Difference at 33: 2.6427, expected 1.29896
E0119 07:11:07.710548 3775037 buffer_comparator.cc:156] Difference at 34: -0.0935357, expected 1.16674
E0119 07:11:07.710556 3775037 buffer_comparator.cc:156] Difference at 35: -5.72594e-15, expected 1.14534
E0119 07:11:07.710566 3775037 buffer_comparator.cc:156] Difference at 36: -0.00014386, expected 1.23717
E0119 07:11:07.710575 3775037 buffer_comparator.cc:156] Difference at 38: 0.530177, expected 1.03532
E0119 07:11:07.710585 3775037 buffer_comparator.cc:156] Difference at 39: 4.34351, expected 1.12292
E0119 07:11:07.710594 3775037 buffer_comparator.cc:156] Difference at 40: 2.32142, expected 1.20271
E0119 07:11:07.710604 3775037 buffer_comparator.cc:156] Difference at 41: -4.29512e-06, expected 1.32259
E0119 07:11:07.710614 3775037 buffer_comparator.cc:156] Difference at 42: -0.0126644, expected 1.14966
2025-01-19 07:11:07.710628: E external/xla/xla/service/gpu/autotuning/gemm_fusion_autotuner.cc:1082] Results do not match the reference. This is likely a bug/unexpected loss of precision.
E0119 07:11:07.712635 3775037 buffer_comparator.cc:156] Difference at 64: -0.00532919, expected 1.14767
E0119 07:11:07.712664 3775037 buffer_comparator.cc:156] Difference at 65: 5.69504, expected 1.22237
E0119 07:11:07.712672 3775037 buffer_comparator.cc:156] Difference at 66: 1.58326, expected 1.10278
E0119 07:11:07.712680 3775037 buffer_comparator.cc:156] Difference at 68: 13.1101, expected 1.15004
E0119 07:11:07.712690 3775037 buffer_comparator.cc:156] Difference at 69: 0.766964, expected 1.12814
E0119 07:11:07.712699 3775037 buffer_comparator.cc:156] Difference at 70: 5.61118, expected 1.26758
E0119 07:11:07.712708 3775037 buffer_comparator.cc:156] Difference at 71: -2.24383e-18, expected 1.23554
E0119 07:11:07.712719 3775037 buffer_comparator.cc:156] Difference at 72: 0.15351, expected 1.04266
E0119 07:11:07.712728 3775037 buffer_comparator.cc:156] Difference at 74: 0.186936, expected 1.13546
E0119 07:11:07.712737 3775037 buffer_comparator.cc:156] Difference at 75: 0.770386, expected 1.21905
2025-01-19 07:11:07.712749: E external/xla/xla/service/gpu/autotuning/gemm_fusion_autotuner.cc:1082] Results do not match the reference. This is likely a bug/unexpected loss of precision.
E0119 07:11:07.714718 3775037 buffer_comparator.cc:156] Difference at 64: -0.00532919, expected 1.14767
E0119 07:11:07.714732 3775037 buffer_comparator.cc:156] Difference at 65: 5.69504, expected 1.22237
E0119 07:11:07.714738 3775037 buffer_comparator.cc:156] Difference at 66: 1.58326, expected 1.10278
E0119 07:11:07.714743 3775037 buffer_comparator.cc:156] Difference at 68: 13.1101, expected 1.15004
E0119 07:11:07.714748 3775037 buffer_comparator.cc:156] Difference at 69: 0.766964, expected 1.12814
E0119 07:11:07.714754 3775037 buffer_comparator.cc:156] Difference at 70: 5.61118, expected 1.26758
E0119 07:11:07.714759 3775037 buffer_comparator.cc:156] Difference at 71: -2.24383e-18, expected 1.23554
E0119 07:11:07.714764 3775037 buffer_comparator.cc:156] Difference at 72: 0.15351, expected 1.04266
E0119 07:11:07.714782 3775037 buffer_comparator.cc:156] Difference at 74: 0.186936, expected 1.13546
E0119 07:11:07.714785 3775037 buffer_comparator.cc:156] Difference at 75: 0.770386, expected 1.21905
2025-01-19 07:11:07.714792: E external/xla/xla/service/gpu/autotuning/gemm_fusion_autotuner.cc:1082] Results do not match the reference. This is likely a bug/unexpected loss of precision.
E0119 07:11:07.716652 3775037 buffer_comparator.cc:156] Difference at 128: -0.0170399, expected 1.21455
E0119 07:11:07.716666 3775037 buffer_comparator.cc:156] Difference at 129: -1.30294e-05, expected 1.28192
E0119 07:11:07.716674 3775037 buffer_comparator.cc:156] Difference at 130: 5.08182, expected 1.27711
E0119 07:11:07.716679 3775037 buffer_comparator.cc:156] Difference at 131: -1.81729e-10, expected 1.26972
E0119 07:11:07.716684 3775037 buffer_comparator.cc:156] Difference at 132: 1.50692, expected 1.15995
E0119 07:11:07.716689 3775037 buffer_comparator.cc:156] Difference at 133: 5.29193, expected 1.16513
E0119 07:11:07.716694 3775037 buffer_comparator.cc:156] Difference at 134: -2.25432e-21, expected 1.14136
E0119 07:11:07.716699 3775037 buffer_comparator.cc:156] Difference at 135: 1.61094, expected 1.19622
E0119 07:11:07.716703 3775037 buffer_comparator.cc:156] Difference at 136: -0.0260566, expected 1.15325
E0119 07:11:07.716722 3775037 buffer_comparator.cc:156] Difference at 137: 2.21617, expected 1.25995
2025-01-19 07:11:07.716730: E external/xla/xla/service/gpu/autotuning/gemm_fusion_autotuner.cc:1082] Results do not match the reference. This is likely a bug/unexpected loss of precision.
E0119 07:11:07.718597 3775037 buffer_comparator.cc:156] Difference at 128: -0.0170399, expected 1.21455
E0119 07:11:07.718611 3775037 buffer_comparator.cc:156] Difference at 129: -1.30294e-05, expected 1.28192
E0119 07:11:07.718616 3775037 buffer_comparator.cc:156] Difference at 130: 5.08182, expected 1.27711
E0119 07:11:07.718621 3775037 buffer_comparator.cc:156] Difference at 131: -1.81729e-10, expected 1.26972
E0119 07:11:07.718625 3775037 buffer_comparator.cc:156] Difference at 132: 1.50692, expected 1.15995
E0119 07:11:07.718631 3775037 buffer_comparator.cc:156] Difference at 133: 5.29193, expected 1.16513
E0119 07:11:07.718637 3775037 buffer_comparator.cc:156] Difference at 134: -2.25432e-21, expected 1.14136
E0119 07:11:07.718645 3775037 buffer_comparator.cc:156] Difference at 135: 1.61094, expected 1.19622
E0119 07:11:07.718650 3775037 buffer_comparator.cc:156] Difference at 136: -0.0260566, expected 1.15325
E0119 07:11:07.718656 3775037 buffer_comparator.cc:156] Difference at 137: 2.21617, expected 1.25995
2025-01-19 07:11:07.718663: E external/xla/xla/service/gpu/autotuning/gemm_fusion_autotuner.cc:1082] Results do not match the reference. This is likely a bug/unexpected loss of precision.
E0119 07:11:07.720531 3775037 buffer_comparator.cc:156] Difference at 128: -0.0170399, expected 1.21455
E0119 07:11:07.720546 3775037 buffer_comparator.cc:156] Difference at 129: -1.30294e-05, expected 1.28192
E0119 07:11:07.720550 3775037 buffer_comparator.cc:156] Difference at 130: 5.08182, expected 1.27711
E0119 07:11:07.720554 3775037 buffer_comparator.cc:156] Difference at 131: -1.81729e-10, expected 1.26972
E0119 07:11:07.720558 3775037 buffer_comparator.cc:156] Difference at 132: 1.50692, expected 1.15995
E0119 07:11:07.720561 3775037 buffer_comparator.cc:156] Difference at 133: 5.29193, expected 1.16513
E0119 07:11:07.720565 3775037 buffer_comparator.cc:156] Difference at 134: -2.25432e-21, expected 1.14136
E0119 07:11:07.720569 3775037 buffer_comparator.cc:156] Difference at 135: 1.61094, expected 1.19622
E0119 07:11:07.720573 3775037 buffer_comparator.cc:156] Difference at 136: -0.0260566, expected 1.15325
E0119 07:11:07.720578 3775037 buffer_comparator.cc:156] Difference at 137: 2.21617, expected 1.25995
2025-01-19 07:11:07.720584: E external/xla/xla/service/gpu/autotuning/gemm_fusion_autotuner.cc:1082] Results do not match the reference. This is likely a bug/unexpected loss of precision.
E0119 07:11:07.722451 3775037 buffer_comparator.cc:156] Difference at 256: 4.92301, expected 1.12643
E0119 07:11:07.722466 3775037 buffer_comparator.cc:156] Difference at 257: 4.87487, expected 1.20989
E0119 07:11:07.722471 3775037 buffer_comparator.cc:156] Difference at 258: 3.31381, expected 1.24958
E0119 07:11:07.722474 3775037 buffer_comparator.cc:156] Difference at 259: -0.0336357, expected 1.33563
E0119 07:11:07.722478 3775037 buffer_comparator.cc:156] Difference at 260: 8.18605, expected 1.05137
E0119 07:11:07.722482 3775037 buffer_comparator.cc:156] Difference at 261: 5.70035, expected 1.15042
E0119 07:11:07.722485 3775037 buffer_comparator.cc:156] Difference at 262: 2.09771, expected 1.1152
E0119 07:11:07.722489 3775037 buffer_comparator.cc:156] Difference at 263: 1.83666, expected 1.21897
E0119 07:11:07.722492 3775037 buffer_comparator.cc:156] Difference at 264: 0.0268196, expected 1.05355
E0119 07:11:07.722496 3775037 buffer_comparator.cc:156] Difference at 265: 8.56637, expected 1.24402
2025-01-19 07:11:07.722520: E external/xla/xla/service/gpu/autotuning/gemm_fusion_autotuner.cc:1082] Results do not match the reference. This is likely a bug/unexpected loss of precision.
E0119 07:11:07.724389 3775037 buffer_comparator.cc:156] Difference at 256: 4.92301, expected 1.12643
E0119 07:11:07.724402 3775037 buffer_comparator.cc:156] Difference at 257: 4.87487, expected 1.20989
E0119 07:11:07.724406 3775037 buffer_comparator.cc:156] Difference at 258: 3.31381, expected 1.24958
E0119 07:11:07.724409 3775037 buffer_comparator.cc:156] Difference at 259: -0.0336357, expected 1.33563
E0119 07:11:07.724413 3775037 buffer_comparator.cc:156] Difference at 260: 8.18605, expected 1.05137
E0119 07:11:07.724416 3775037 buffer_comparator.cc:156] Difference at 261: 5.70035, expected 1.15042
E0119 07:11:07.724419 3775037 buffer_comparator.cc:156] Difference at 262: 2.09771, expected 1.1152
E0119 07:11:07.724423 3775037 buffer_comparator.cc:156] Difference at 263: 1.83666, expected 1.21897
E0119 07:11:07.724427 3775037 buffer_comparator.cc:156] Difference at 264: 0.0268196, expected 1.05355
E0119 07:11:07.724433 3775037 buffer_comparator.cc:156] Difference at 265: 8.56637, expected 1.24402
2025-01-19 07:11:07.724439: E external/xla/xla/service/gpu/autotuning/gemm_fusion_autotuner.cc:1082] Results do not match the reference. This is likely a bug/unexpected loss of precision.
E0119 07:11:07.726308 3775037 buffer_comparator.cc:156] Difference at 513: 2.49547, expected 1.15973
E0119 07:11:07.726322 3775037 buffer_comparator.cc:156] Difference at 514: -0.0821428, expected 1.0959
E0119 07:11:07.726326 3775037 buffer_comparator.cc:156] Difference at 515: 0.0716812, expected 1.25584
E0119 07:11:07.726329 3775037 buffer_comparator.cc:156] Difference at 516: 4.80085, expected 1.16786
E0119 07:11:07.726333 3775037 buffer_comparator.cc:156] Difference at 517: 0.535988, expected 1.23813
E0119 07:11:07.726336 3775037 buffer_comparator.cc:156] Difference at 518: 8.14411, expected 1.1318
E0119 07:11:07.726339 3775037 buffer_comparator.cc:156] Difference at 519: -0.00232611, expected 1.02703
E0119 07:11:07.726344 3775037 buffer_comparator.cc:156] Difference at 520: 7.27455, expected 1.1116
E0119 07:11:07.726347 3775037 buffer_comparator.cc:156] Difference at 521: -0, expected 1.14691
E0119 07:11:07.726351 3775037 buffer_comparator.cc:156] Difference at 522: 3.30098, expected 1.16739
2025-01-19 07:11:07.726357: E external/xla/xla/service/gpu/autotuning/gemm_fusion_autotuner.cc:1082] Results do not match the reference. This is likely a bug/unexpected loss of precision.
E0119 07:11:07.728241 3775037 buffer_comparator.cc:156] Difference at 513: 2.49547, expected 1.15973
E0119 07:11:07.728254 3775037 buffer_comparator.cc:156] Difference at 514: -0.0821428, expected 1.0959
E0119 07:11:07.728258 3775037 buffer_comparator.cc:156] Difference at 515: 0.0716812, expected 1.25584
E0119 07:11:07.728262 3775037 buffer_comparator.cc:156] Difference at 516: 4.80085, expected 1.16786
E0119 07:11:07.728265 3775037 buffer_comparator.cc:156] Difference at 517: 0.535988, expected 1.23813
E0119 07:11:07.728269 3775037 buffer_comparator.cc:156] Difference at 518: 8.14411, expected 1.1318
E0119 07:11:07.728272 3775037 buffer_comparator.cc:156] Difference at 519: -0.00232611, expected 1.02703
E0119 07:11:07.728275 3775037 buffer_comparator.cc:156] Difference at 520: 7.27455, expected 1.1116
E0119 07:11:07.728281 3775037 buffer_comparator.cc:156] Difference at 521: -0, expected 1.14691
E0119 07:11:07.728287 3775037 buffer_comparator.cc:156] Difference at 522: 3.30098, expected 1.16739
2025-01-19 07:11:07.728292: E external/xla/xla/service/gpu/autotuning/gemm_fusion_autotuner.cc:1082] Results do not match the reference. This is likely a bug/unexpected loss of precision.
2025-01-19 07:11:48.187100: I external/xla/xla/stream_executor/cuda/subprocess_compilation.cc:346] ptxas warning : Registers are spilled to local memory in function 'gemm_fusion_dot_136', 144 bytes spill stores, 144 bytes spill loads
E0119 07:11:48.227028 3775037 buffer_comparator.cc:156] Difference at 0: 279.579, expected 311.298
E0119 07:11:48.227086 3775037 buffer_comparator.cc:156] Difference at 2: 275.812, expected 309.55
E0119 07:11:48.227091 3775037 buffer_comparator.cc:156] Difference at 4: 281.699, expected 314.286
2025-01-19 07:11:48.227103: E external/xla/xla/service/gpu/autotuning/gemm_fusion_autotuner.cc:1082] Results do not match the reference. This is likely a bug/unexpected loss of precision.
Lux.jl Integration: Test Failed at /var/lib/buildkite-agent/builds/gpuci-14/julialang/reactant-dot-jl/test/nn/lux.jl:66
  Expression: ≈(dps1, dps2, atol = 0.001, rtol = 0.01)
   Evaluated: Float32[-0.017089598 0.021821085 -0.021864016; 0.017089598 -0.021821085 0.021864016] ≈ ConcreteRArray{Float32, 2}(Float32[-0.03380743 -0.0052448246 -0.03258616; 0.03380743 0.0052448246 0.03258616]) (atol=0.001, rtol=0.01)
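For reference, here is a minimal sketch of how the failing `≈` check evaluates. It assumes plain `Array`s stand in for the `ConcreteRArray`, and it relies on Julia's documented rule for array `isapprox`: the norm of the difference is compared against `max(atol, rtol * max(norm(x), norm(y)))`. The names `err` and `tol` are only illustrative; the values are copied from the failure message above.

```julia
using LinearAlgebra

# Values copied from the "Evaluated:" line of the failing test.
dps1 = Float32[-0.017089598 0.021821085 -0.021864016;
                0.017089598 -0.021821085 0.021864016]
dps2 = Float32[-0.03380743 -0.0052448246 -0.03258616;
                0.03380743 0.0052448246 0.03258616]

atol, rtol = 0.001, 0.01

# For arrays, isapprox compares norms rather than checking element by element.
err = norm(dps1 - dps2)
tol = max(atol, rtol * max(norm(dps1), norm(dps2)))

println("norm of difference = ", err)
println("allowed tolerance  = ", tol)
println("isapprox           = ", isapprox(dps1, dps2; atol = atol, rtol = rtol))
```

With these numbers the norm of the difference is well above the tolerance, so this does not look like a borderline rounding failure.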

I'm tentatively assigning it to @avik-pal since it's about Lux.

@giordano giordano added the bug Something isn't working label Jan 19, 2025
@avik-pal
Collaborator

xref #444

@giordano
Member Author

giordano commented Jan 19, 2025

If it's the same issue, then it doesn't seem to be fixed, contrary to #444 (comment)? Edit: I see it was just reopened.

@giordano
Member Author

I was looking into this yesterday. Interestingly, the tests pass for me on aarch64 and fail only on x86_64.
