Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(tests): Add core dump watch and analysis #94

Open
wants to merge 4 commits into
base: dev
Choose a base branch
from
Open

Conversation

uristdwarf
Copy link
Collaborator

This commit aims to improve the behavior and improve the output of test results. When a core dump occurs, it terminates the test immediately (regardless whether it was SaunaFS or not that caused it, the environment is unreliable for accurate test results). It then uses GDB to analyze the core dump(s).

There's a few issues with this: First it requires modifying /proc/sys/kernel/core_pattern, which will affect the whole system. While I've tried to ensure that the original pattern is restored after failures and core dumps, I'm not completely confident it will. Second is the fact that GDB might not be needed to print the backtrace: LD_PRELOAD=/lib/libSegFault.so may be a better alternative, but I didn't have time to test it.

Thus this feature is hidden behind a feature flag as experimental until these issues are solved.

@uristdwarf
Copy link
Collaborator Author

uristdwarf commented May 24, 2024

Mhm, it seems I made a mistake with the commits. I need to fix them.

@uristdwarf uristdwarf marked this pull request as draft May 24, 2024 13:22
@uristdwarf uristdwarf marked this pull request as ready for review May 24, 2024 13:24
@uristdwarf
Copy link
Collaborator Author

uristdwarf commented May 27, 2024

Dammit, it dropped my suggestions commit when I forced pushed.

@uristdwarf
Copy link
Collaborator Author

Fixed

Copy link
Contributor

@ralcolea ralcolea left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did only a minor suggestion. However, I couldn't test the feature. I tried to induce segfaults errors using division by zero or index out of range scenarios in Ganesha tests, but I couldn't see the creation of coredumps in the expected folder or logged information useful about the situation.

Besides, I saw a permission denied error for coredump_restore function. Below is the screenshot showing the error.
Screenshot from 2024-05-31 13-55-27

@@ -154,6 +154,8 @@ apt_packages=(
libnfsidmap-dev
libnsl-dev
libsqlite3-dev
xfslibs-dev
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To have support for all platforms, new added packages should be added also for Fedora if they are available.

  • xfslibs-dev package is different in Fedora: xfsprogs-devel. Therefore, it must also be added to the list of dnf_packages.

  • In the case of inotify-tools package, is common for both platforms. Therefore, it should be moved to common_packages.

What about gdb package? Is not needed for accessing coredumps info?

In all cases, the list of packages is sorted alphabetically, so this order should be respected.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

GDB was probably included as a dependency in other packages, but I've added a explicit reference to it.

@uristdwarf
Copy link
Collaborator Author

I tried to induce segfaults errors using division by zero or index out of range scenarios in Ganesha tests, but I couldn't see the creation of coredumps in the expected folder or logged information useful about the situation.

Besides, I saw a permission denied error for coredump_restore function. Below is the screenshot showing the error.

You need to allow saunafstests to modify /sys/kernel/core_pattern with sudo and tee. The setup_machine.sh test was updated to accommodate this, but for existing setups you need to manually add the permissions.

Another way to test this is by setting /sys/kernel/core_pattern manually to $coredump_pattern (which is currently /tmp/temp-cores/core-%e-%p-%t. The script will detect you have the core pattern set and will use it. You might want to make note of your current core_pattern if you want to restore it, although you could just reboot if you forgot.

@ralcolea
Copy link
Contributor

ralcolea commented Jun 3, 2024

You need to allow saunafstests to modify /sys/kernel/core_pattern with sudo and tee. The setup_machine.sh test was updated to accommodate this, but for existing setups you need to manually add the permissions.

I did this manually by adding the same rule you provided in the setup_machine.sh script but it didn't work.
What I didn't try was to modify the /sys/kernel/core_pattern. I could try to see if it works.

I set the core_pattern to /tmp/temp-cores/core-%e-%p-%t, but I couldn't see any info related with coredumps.
I don't hesitate that it works, but I would like that other team member could test this feature to confirm that is working in other environments.

@uristdwarf
Copy link
Collaborator Author

I set the core_pattern to /tmp/temp-cores/core-%e-%p-%t, but I couldn't see any info related with coredumps.
I don't hesitate that it works, but I would like that other team member could test this feature to confirm that is working in other environments.

Interesting. I'll need to test it a little bit on an actual VM then. Could you post the entire log for a single test that should coredump?

@ralcolea
Copy link
Contributor

ralcolea commented Jun 4, 2024

@uristdwarf below I share the patch for the test I used to validate the feature and the log generated when running the test. Please, let me know if you need something else.
coredumps.patch
coredumps-failing.log

@uristdwarf
Copy link
Collaborator Author

@uristdwarf below I share the patch for the test I used to validate the feature and the log generated when running the test. Please, let me know if you need something else. coredumps.patch coredumps-failing.log

Mhm... I don't know if shared objects have the same behavior as executable regarding core dumps, there's a trap log from the kernel which I've not seen before. In any case, it should be also supported for shared objects if possible.

Could you quickly check if using the same code in src/main/main.cc produces the expected result? Try any test involving master/chunkserver/metalogger (like SanityChecks*crc_error_fixing

@uristdwarf
Copy link
Collaborator Author

uristdwarf commented Jun 4, 2024

Also, are you sure you ran the test with COREDUMP_WATCH=1 saunafs-tests ...? I think it should at least tell that watches are setup from inotify, like this:

$ inotifywait -m "/tmp/" -e create -e moved_to
Setting up watches.
Watches established.

Otherwise, if it failed then it should at least print a message stating it could not setup the feature. But from the logs I couldn't see any of the messages.

@ralcolea
Copy link
Contributor

ralcolea commented Jun 4, 2024

Also, are you sure you ran the test with COREDUMP_WATCH=1 saunafs-tests ...? I think it should at least tell that watches are setup from inotify, like this:

Actually no, I don't remember to set this COREDUMP_WATCH=1, sorry 😅
I tested the following command, but I got the same results:

COREDUMP_WATCH=1; saunafs-tests --gtest_filter="GaneshaTests.test_nfs_ganesha_copy*"

Is there any other way to setup this variable?

@uristdwarf
Copy link
Collaborator Author

Need to figure out why this doesn't work on Ubuntu desktop at least, putting back to draft

@uristdwarf uristdwarf marked this pull request as draft June 5, 2024 11:20
aNeutrino
aNeutrino previously approved these changes Jun 11, 2024
Copy link
Contributor

@aNeutrino aNeutrino left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

This commit aims to improve the behavior and improve the output of test
results. When a core dump occurs, it terminates the test immediately
(regardless whether it was SaunaFS or not that caused it, the
environment is unreliable for accurate test results). It then uses
GDB to analyze the core dump(s).

There's a few issues with this: First it requires modifying
`/proc/sys/kernel/core_pattern`, which will affect the whole system.
While I've tried to ensure that the original pattern is restored after
failures and core dumps, I'm not completely confident it will. Second is
the fact that GDB might not be needed to print the backtrace:
`LD_PRELOAD=/lib/libSegFault.so` may be a better alternative, but I
didn't have time to test it.

Thus this feature is hidden behind a feature flag as experimental until
these issues are solved.

Co-authored-by: aNeutrino <[email protected]>
Copy link
Contributor

@aNeutrino aNeutrino left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants