Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix Missing Serial Data Under Heavy Load (RDT-1099) #334

Merged
merged 1 commit into from
Jan 21, 2025

Conversation

erhankur
Copy link
Contributor

Description

Under heavy loads, the _listen() function produces truncated or corrupted log output.

Example test:
In a normal sequence, each log line should contain a contiguous block of data like ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/.
However, in the snippet below, we see a break in the expected pattern:

2025-01-19 23:02:17 7b40::ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/
2025-01-19 23:02:17 7b80::ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/
2025-01-19 23:02:17 7bc0::ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/
2025-01-19 23:02:17 7c00::ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/
2025-01-19 23:02:17 7c40::ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/
2025-01-19 23:02:17 7c80::ABCDEFGHIJqrstuvwxyz0123456789+/
2025-01-19 23:02:17 bc40::ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/
2025-01-19 23:02:17 bc80::ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/
2025-01-19 23:02:17 bcc0::ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/
2025-01-19 23:02:17 bd00::ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/

In line 7c80, the string abruptly jumps from ABCDEFGHIJ to qrstuvwxyz0123456789+/, indicating partial or misaligned writes—a classic sign of log corruption. Furthermore, the next line jumps from data-len identifier 7c80 to bc40, which hints at missing data in between. Under high throughput, _listen() may process chunks so rapidly that they become interleaved or partially dropped.

During the internal CI tests, it appears in the hw_stack_guard_cpu1 test with the following snippet:

AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAABAAAAAFAOUDkAAAAAAAAATPr0TwAAAAABAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAEAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAHFjwTwAAAAAAAAAAAAAAACqq8E8AAAAAAODxT2QAAAB+0AJA
AAAAAAAAAAA09PRPynsBQAgAAAA09PRPkPf0T7B5AUAAAAAAAAAAAAEAAAA8fwJA

Adding 100ms sleep at the end of each loop iteration fixes the truncation and data missing.

Related

Testing


Checklist

Before submitting a Pull Request, please ensure the following:

  • 🚨 This PR does not introduce breaking changes.
  • All CI checks (GH Actions) pass.
  • Documentation is updated as needed.
  • Tests are updated or added as necessary.
  • Code is well-commented, especially in complex areas.
  • Git history is clean — commits are squashed to the minimum necessary.

@erhankur erhankur changed the title Prevent serial data missing Fix Missing Serial Data Under Heavy Load Jan 19, 2025
@github-actions github-actions bot changed the title Fix Missing Serial Data Under Heavy Load Fix Missing Serial Data Under Heavy Load (RDT-1099) Jan 19, 2025
@horw
Copy link
Member

horw commented Jan 20, 2025

Is it a very common error for now? It looks like adding 0.1 might help in some cases, but it won't solve the issue itself.

@erhankur
Copy link
Contributor Author

erhankur commented Jan 20, 2025

I’m not sure which part of the code is causing the problem, but I've noticed that as the listener processes more data, it slows down and eventually starts missing data. This might be related to the queue size. On my macOS, it's 32 KB, and the issue appears around offset 0x7C80. There seems to be a correlation.

Adding sleep helps to see more data at the console without corruption. (At least in 30 seconds time period)

@hfudev
Copy link
Member

hfudev commented Jan 20, 2025

@erhankur Thanks for the PR.

The serial object sets the read/write interval here. If the queue is empty, the `_listen' function is skipped.

Could you check if changing the interval in serial.py from 0.05 to 0.1 helps? I think either setting the timeout to 0.1 in the _listen function or in serial.py should both work, but we shouldn't set it in both places.

@erhankur
Copy link
Contributor Author

Unfortunately, increasing the timeout only in serial.py/_event_loop didn't help. Removing it and adding to dut_factory/_listen didn't help, either. It works if they have both sleep.

@hfudev hfudev merged commit 8cc5879 into espressif:main Jan 21, 2025
4 of 5 checks passed
@hfudev
Copy link
Member

hfudev commented Jan 21, 2025

okay. let's merge it. Thank you @erhankur!

@horw
Copy link
Member

horw commented Jan 22, 2025

@erhankur Unfortunately, this implementation led to extremely slow chunk reading when used under QEMU. time.sleep is breaking QEMU testing because we put many small chunks into the MessageQueue, and when we read from it, we process them one by one with sleep.

@erhankur
Copy link
Contributor Author

                    s = self._s.read_all()
                    self._q.put(s)

While I was debugging, len(s) was maximum 1020 bytes. Do you say if there is more than 1020 bytes we need to get more and put to the queue? If there is a way to increase max chunk size, it would be also a solution.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants