Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Binary corruption and control characters added in printed binaries #8685

Closed
lhoguin opened this issue Jul 24, 2024 · 17 comments
Closed

Binary corruption and control characters added in printed binaries #8685

lhoguin opened this issue Jul 24, 2024 · 17 comments
Assignees
Labels
not a bug Issue is determined as not a bug by OTP team:VM Assigned to OTP team VM

Comments

@lhoguin
Copy link
Contributor

lhoguin commented Jul 24, 2024

Describe the bug
There seems to be some corruption related to binaries. When printing the binary or writing it to a file, we get either the correct <<"cq">> or the incorrects <<"cq", 0, 0>>, <<"cq", 0, 1>> or <<"cq", 0, 2>>.

To Reproduce
On one terminal:

git clone https://github.com/rabbitmq/rabbitmq-server
cd rabbitmq-server
make -C deps/rabbit run-broker

On another:

git clone https://github.com/rabbitmq/rabbitmq-perf-test
cd rabbitmq-perf-test
make run ARGS="-ad false -f persistent -u cq -C 1 -c 1 -mp priority=1 -qa x-max-priority=2 -y 0"

Then open the logs file (path provided at the end of startup in the first terminal) and search for {resource, to find the bad binaries.

Sometimes you will get the correct {resource,<<"/">>,queue,<<"cq">>}, and sometimes something like {resource,<<"/">>,queue,<<99,113,0,2>>}, even (and especially) in the same crash report!

Also can happen with another message that you can find searching after unclean shutdown where in this case the bad binary is formatted and we see control characters (assuming your editor prints them).

Expected behavior
Just get <<"cq">> and not extra bytes. When printed, no control characters after the expected string.

Affected versions
Tested on OTP-26.1.2 and OTP-27.0.1.

Additional context
Is it harmless?

@lhoguin lhoguin added the bug Issue is reported as a bug label Jul 24, 2024
@lhoguin lhoguin changed the title Control characters added in printed binaries Binary corruption and control characters added in printed binaries Jul 24, 2024
@jhogberg jhogberg self-assigned this Jul 24, 2024
@jhogberg jhogberg added the team:VM Assigned to OTP team VM label Jul 24, 2024
@lucioleKi
Copy link
Contributor

lucioleKi commented Jul 24, 2024

Hi! Could you print the size of the binary? It will help us narrow it down to the printing routine.

@lhoguin
Copy link
Contributor Author

lhoguin commented Jul 24, 2024

Adding byte_size(QueueName) to the formatted log (that contains after unclean shutdown) says 4 bytes (instead of the expected 2). So it seems to be a bit more than just formatting then.

@frej
Copy link
Contributor

frej commented Jul 24, 2024

Completely by random I today found and fixed an error #8686 in the alias analysis that could lead to corruption of binaries if you append to a binary coming from a literal map. Does the corruption go away if you compile with +no_ssa_opt_destructive_update?

@lhoguin
Copy link
Contributor Author

lhoguin commented Jul 25, 2024

The corruption remains.

Note that this binary is not appended to. It is parsed from data from the socket and kept from that point in a record, never to be modified. It may or may not be a sub-binary.

Anyway it should be fairly trivial to reproduce with the steps I provided. It happens 100% of the time for us and multiple people have confirmed it.

@frej
Copy link
Contributor

frej commented Jul 25, 2024

Note that this binary is not appended to. It is parsed from data from the socket and kept from that point in a record, never to be modified. It may or may not be a sub-binary.

Right, then it's not #8686. I must confess that I didn't look at the reproducer, I saw this when I was about to leave for the day, and wanted to stop anyone wasting time trying to triage a bug which could be my fault.

@mkuratczyk
Copy link
Contributor

Please note that we've now merged a fix to the issue that was causing the process crash, which led to the corrupted line being logged. To reproduce now, you need to checkout a commit from before the fix:
git checkout 4863bc3b8febc6cf488c9a99ad5ea6854bf586da.

@lhoguin
Copy link
Contributor Author

lhoguin commented Jul 25, 2024

Note that the after unclean shutdown log line shows the problem regardless of a crash occurring, although I'm not sure this one happens 100% of the time.

@lucioleKi
Copy link
Contributor

Thank you for all the additional info. We don't have a machine that could build RabbitMQ unfortunately. (For security reasons we're advised against cloning a repo without Ericsson's approval.) We would really appreciate it if you could isolate this issue more. This could very well be a bug in our code, and it's probably more severe than a printing problem. It's just difficult to track it down and fix it with the info we have now.

@lhoguin
Copy link
Contributor Author

lhoguin commented Jul 26, 2024

There didn't use to be this constraint before. We've submitted bugs in the past that needed RabbitMQ to reproduce, and I've personally submitted Cowboy issues that needed cloning Cowboy or Cowlib to reproduce. These are some of the biggest Erlang projects so perhaps you could request a permanent exception from Ericsson if that's really an issue.

We can try to simplify but it will take time and we don't know if we will succeed as this seems to be the only binary that's misbehaving over the entire RabbitMQ codebase.

@josevalim
Copy link
Contributor

josevalim commented Jul 30, 2024

I was curious about the root cause and I wanted to take a look, but debugging the whole rabbitmq project is a bit daunting if you never did it before. Do you folks have any pointers about where the root cause can be?

Also, how are you printing it using the io module? Does it also show up differently on erlang:display/1? It may be worth checking it out as it is done directly in C instead of going through the IO system.

@lhoguin
Copy link
Contributor Author

lhoguin commented Jul 30, 2024

Hello Jose, I will try your suggestion but I expect more of the same: the erroneous binary has been printed through logger[1] and also seen in crash logs, written to disk via file:write_file("/tmp/file", Bin) (no formatting) as well as byte_size(Bin) reporting a size of 4 instead of 2.

[1] https://github.com/rabbitmq/rabbitmq-server/blob/main/deps/rabbit/src/rabbit_classic_queue_index_v2.erl#L258-L264

Note that while at the moment we can only reproduce using RabbitMQ, there actually isn't that much of RabbitMQ in use to reproduce: sending one message to one priority queue. Because of how priority queues are implemented as multiple queues internally, all in a single process, and each internal queue has this binary in its state (again, all part of a bigger state for this one process), we think the problem is related to that.

@josevalim
Copy link
Contributor

Thanks, @lhoguin. And you are right, if byte_size returned 4, there is no need for display!

@frej
Copy link
Contributor

frej commented Jul 31, 2024

Note that this binary is not appended to. It is parsed from data from the socket and kept from that point in a record, never to be modified. It may or may not be a sub-binary.

@lhoguin: Have you verified (using either tracing or a plain io:format trace printout) that the binary is non-corrupted when it is inserted in the record (or does a printout make the corruption go away)?

As I feel responsible for the compiler's destructive update optimizations (and not running on Ericsson hardware), I have tried to reproduce this using both OTP-26.1.2 (OTP-26 was the first version supporting destructive update of binaries) and OTP-27.0.1, but unfortunately I'm unable to build rabbitmq-server successfully. What really would help is if you could do a bisect and find out which OTP commit breaks your system

@lhoguin
Copy link
Contributor Author

lhoguin commented Jul 31, 2024

@frej I will see if I can get to it before the end of the week. I am off for two weeks after that. Otherwise maybe @mkuratczyk will have time to try to bisect when he's back from vacation on Monday.

Alternatively, you may want to try one of the pre-built RabbitMQs at https://github.com/rabbitmq/rabbitmq-server/releases/tag/v4.0.0-beta.3

@frej
Copy link
Contributor

frej commented Aug 1, 2024

@lhoguin before spending the time for a full bisect, just figuring out when the binary is corrupted (as I referred to earlier) is probably worthwhile.

Alternatively, you may want to try one of the pre-built RabbitMQs at https://github.com/rabbitmq/rabbitmq-server/releases/tag/v4.0.0-beta.3

Unfortunately pre-built blobs aren't very helpful when hunting down potential miscompilation :(

@mkuratczyk
Copy link
Contributor

Good news: turns out this is totally our "fault". It's not a bug at all - it's a 10+ years old intentional behaviour in a rarely used RabbitMQ feature. No-one currently working on RabbitMQ was aware of it and it didn't even occur to us, that this could be deliberate. :)

Feel free to close this issue. I can't do it, since it was created by @lhoguin.

For those interested: when a classic queues is declared with x-max-priority, multiple "subqueues" are created, one for each priority. Those subqueues append their priority value to the queue's name (<<NameBin/binary, 0, P:8>>) and those mutated subqueue names are kept in the main queue's state:
https://github.com/rabbitmq/rabbitmq-server/blob/main/deps/rabbit/src/rabbit_priority_queue.erl#L103-L104

That's why we were seeing both the correct resource value ({resource,<<"/">>,queue,<<"cq">>}) as well as what looked like a corrupted value ({resource,<<"/">>,queue,<<99,113,0,2>>}).

@dgud dgud closed this as completed Aug 5, 2024
@dgud dgud added not a bug Issue is determined as not a bug by OTP and removed bug Issue is reported as a bug labels Aug 5, 2024
@lhoguin
Copy link
Contributor Author

lhoguin commented Aug 19, 2024

Sigh. Sorry for the noise.

keyhanjk added a commit to keyhanjk/rabbitmq-server that referenced this issue Sep 3, 2024
for better documenting according to this issue (not bug)
erlang/otp#8685
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
not a bug Issue is determined as not a bug by OTP team:VM Assigned to OTP team VM
Projects
None yet
Development

No branches or pull requests

7 participants