Occasionally recurring bug: Invalid Int32: "280375465082892" (ArgumentError)
#14496
This looks very similar to https://forum.crystal-lang.org/t/mysterious-spec-failure-in-ecosystem-test/6748. The compiler stack trace here might be helpful.
I'm pretty sure it's the same indeed. @refi64 noticed this: https://discord.com/channels/591460182777790474/611671115835768862/1199390683220414474
How can I do that?
I mean the file you already shared: backtrace-int32-error.txt
The stack trace indicates the invalid literal appears inside code generated from a macro run. @wout Could you look in your sources to find out which files are being run like that? Probably
Sure! Here's lib: In src there's nothing:
Do you render a template? Such as ECR, for instance?
No, in this particular project we're using Lucky HTML exclusively. But I did notice this (
I tracked down this problem to be a memory corruption in LLVM type structures. I'm still not 100% sure how this is happening, but I have a theory (more below). This corruption ends up impacting code generation for macro runs and member offset calculations. I've seen two separate macro run failures:
The offset calculation issue is what manifests as the `Invalid Int32` error reported here.

Anyway, what happens is that the call to LLVM's

As to why this is happening, as I said, I'm not 100% sure. But my guess right now is that the array of members for LLVM struct types is not being properly marked by the GC, and that its memory is being reused and overwritten by other parts of the code. One thing that sticks out: the type structure itself is allocated by LLVM, but the array of its elements is allocated by Crystal. So my guess is that the interior of the LLVM type structure (where the pointer to the array buffer is stored) is somehow invisible to the GC.

I tested running with the GC disabled (by setting the env variable
That array of elements ultimately goes to
Interesting, that's one less thing to worry about then. But then I have no idea yet how the corruption is happening 😁
If you'd like a consistent way to reproduce this error, check out: https://github.com/sol-vin/minievents/actions/runs/9007590688/job/24748627867. This job consistently causes one of the two errors. It is tied to something to do with using

Also, during my time working on this issue I encountered a fairly random
I managed to get a pretty small test case that reproduces the issue consistently (at least on my machine, running Ubuntu Linux with the release version of Crystal 1.12.1). What I haven't been able to do is reproduce the issue with Crystal compiled from source. For reference:
I also tried building the compiler in release mode, and with static linking, but still no luck reproducing the issue. This was with BDWGC installed from distro packages. I'm now trying with one built from source as well. The test case itself consists of a couple of files:

```crystal
# main.cr
COUNT = 6

{{ run("./gen", "#{COUNT}") }}

macro read_sample(exe)
  {{ run(exe, "main.cr").stringify }}
end

{% for i in 1..COUNT %}
  puts read_sample("./read{{i}}")
{% end %}
```

```crystal
# gen.cr
(1..(ARGV[0].to_i)).each do |i|
  File.write("read#{i}.cr", "puts #{Random.rand}\nputs File.read(ARGV[0])")
end
puts ARGV[0].to_i
```

The randomness is there to avoid Crystal caching the result of the macro runs, but it's not really important. The first macro run,

But anyway, this forces the compiler to repeatedly compile the generated

For the compiler built from source, I pushed the count up to 50 and still didn't trigger the bug.
Small update: I built a compiler from source using the base Docker image and steps from the distribution scripts, and I can reproduce the issue with it, consistently. Just using the GC compiled from source did not do it. But as before, running with
Oooh, that's interesting. The Dockerfile has actually changed quite significantly since the last release. It doesn't build libgc anymore and instead uses the libraries provided by the system packages. Previously, including in the 1.12 releases, libgc was built twice: once for the compiler itself, linking musl, and once for distributing with the release compiler, linked against glibc. For reference, see the diff of the latest changes. Since this issue reproduces with the latest version of the Dockerfile, it doesn't seem to depend on our custom libgc build.
BIG UPDATE: I found the root of the problem, and I think I understand why it's happening, with 98% certainty (I reserve the remaining 2% because it's late and I'm really tired by now 😴). We're getting a cache poisoning situation in LLVM's

Now, the

I hope the problem is clear by now. After some number of iterations of

When Crystal tries to compile some program again (either another macro run, or the main program), it encounters the implementation for

I'm 100% sure of the cache poisoning (I added some debugging code and was able to reproduce the issue) and, as I said, 98% sure about all the rest. What I don't know is what's the proper way to fix it.
I'm not sure. On the Crystal side, the instances and lifetimes are hard to follow as well. I think we're creating new LLVM contexts, not necessarily to create a new program (search for
Wow, that's amazing detective work! The best thing would be to post this on the LLVM forum and hear what they have to say about it.
So the crux of the issue is that we're keeping LLVMDataLayout references after we've disposed of the LLVMContext: the former keeps a cache of LLVMTypeRefs allocated in the context's bump allocator, which are freed when we dispose of the LLVMContext.

We do that during codegen: each compilation unit (aka module) has its own LLVMTargetMachine and thus its own LLVMDataLayout. A question is: why do we keep creating and deleting LLVM contexts? What about using a single one? LLVM contexts are not thread-safe, but the semantic phase is single-threaded anyway (at worst, each thread could have its own context). Alternatively, we could ask LLVM for the data layout string and interpret it ourselves. That would duplicate what LLVM is already doing (prone to bugs), but we wouldn't have to create LLVMTypeRefs until we reach the codegen phase.
I just encountered this during a GitHub CI deployment. Unfortunately I can't offer much more than the stack trace.
Thanks.
In order to validate any fix or workaround for this issue, we need to establish a clear path for reproduction. We have a small test case in #14496 (comment), but only a subset of compiler builds seems to be affected (though when a build is affected, it's very consistent). All instances I am aware of are with the release compiler from https://github.com/crystal-lang/crystal/releases or

Current nightly builds (with LLVM 18.1.6) error as well, but with a different message:
Some more details to understand the flow of the issue on the Crystal side (aka thinking & writing out loud).

The

This will create, then reuse on subsequent calls, a

The next step is to compile the program with that host compiler. This will create a new program that replaces the

And here comes something related to the issue as detailed by @ggiraldez: while we create the program, we reuse the compiler's target_machine, which is memoized. This means we reuse the same LLVM::TargetData to compile the main program and each macro run, despite creating types from distinct LLVM contexts (one per program).

Looking at LLVM::IR::DataLayout, I understand it keeps an internal cache (inside LLVM), as @ggiraldez explained (I got confused by the name and thought it was our own local cache in LLVMTyper).

From this additional investigation, I think using a fresh LLVM::TargetMachine for each new Crystal::Program, instead of sharing a single instance, should fix the issue, as we'd use a new LLVM::TargetData (aka LLVM::IR::DataLayout) with a fresh internal cache, ready to create & cache new types in the LLVM::Context of each program.
I'm able to reproduce the bug with local compiler builds via distribution-scripts, using the old build configuration from https://github.com/crystal-lang/distribution-scripts/tree/a9e0c6c12987b8b01b17e71c38f489e45937e1bf (i.e. without the recent changes since 1.12):

```sh
git clone https://github.com/crystal-lang/distribution-scripts && cd distribution-scripts
git checkout a9e0c6c12987b8b01b17e71c38f489e45937e1bf
make -C linux build PREVIOUS_CRYSTAL_VERSION=1.11.1 CRYSTAL_SHA1=23f1c53342fddef61600890ee5db174396889
tar -xf build/crystal-23f1c53342fddef61600890ee5db174396889-1-linux-x86_64.tar
```

Release mode is not necessary. The current build configuration (https://github.com/crystal-lang/distribution-scripts/tree/7a013f14ed64e7e569b5e453eab02af63cf62b61) with LLVM 18 only errors when built in release mode. So apparently between LLVM 15 and LLVM 18 there are some changes which result in slightly different error behaviour.
I have a simple patch that creates a new

With this patch, the error no longer reproduces, neither with the old build configuration (LLVM 15) nor with the new one (LLVM 18; in release mode).
The fix from #14694 is now available in the current nightly build. It fixes the reproduction, but I'd like to see confirmation that it works for the real use cases as well.
Fixed by #14694 |
Bug Report
This is a bug I've been dealing with for years now, and unfortunately, it's very hard to reproduce. Until now, it mostly happened with deployments, but today, I got it again in development. It's in a Lucky project that has been fine in development, but after updating Crystal this morning, it wouldn't run. Here's the full backtrace:
backtrace-int32-error.txt
The weird thing about this bug is that it's inconsistent; it only pops up occasionally. It seems that sometimes the Int32 in question is valid; other times, it's not. When I had it in deploys, it behaved the same way. One deployment would fail, and the next would go through. Sometimes multiple deploys would fail consecutively; sometimes it would be fine many times in a row.
At least one person I spoke to in the Lucky Discord has the same issue; they simply re-deploy until the deployment succeeds. I mentioned it before in the Crystal Discord, but I've never been able to create an example that would consistently fail.
It's the same as now in development. It happened once, but now I can boot the app repeatedly without issues. When I had it in deployment, switching from the crystal-alpine package to crystal-debian solved it for me, and it hasn't returned since January this year.
Crystal version
Crystal 1.12.1 [4cea101] (2024-04-11)
LLVM: 15.0.7
Default target: x86_64-unknown-linux-gnu
OS
elementary OS 7.1 Horus x86_64 (Ubuntu 22.04 LTS)