Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Linux: zpl_tmpfile: ensure ACL init succeeds before d_tmpfile #16625

Closed
wants to merge 1 commit into from

Conversation

snajpa
Copy link
Contributor

@snajpa snajpa commented Oct 8, 2024

Motivation and Context

Other than #16612, this is the only place I see where trouble similar to #16608 could arise. One hint pointing to tmpfile is that the crashes are now way more frequent with 6.10 kernel, which introduced O_TMPFILE support for overlayfs, which agrees with that with our workload, if there's a general bug with O_TMPFILE, we'd be hitting it lot more often.

I'm not sure all the supported kernel versions will be okay with these modifications, I'm posting this as a draft to let the bots chew on it to find out.

Description

Linux: zpl_tmpfile: ensure ACL init succeeds before d_tmpfile

Inode should be linked to dentry and hashed only after both
zpl_xattr_security_init and zpl_init_acl were successful, which
apparently isn't always true for example when memory is tight.

How Has This Been Tested?

Still waiting for vpsAdminOS container templates build to complete, if it does, then this fixes #16608. The major PITA here is that it takes 8.5hrs for it to fail currently :D

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • Documentation (a change to man pages or other documentation)

Checklist:

Copy link
Contributor

@behlendorf behlendorf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice find. This should be fine on all supported kernels but let's let the CI confirm that.

module/os/linux/zfs/zpl_inode.c Outdated Show resolved Hide resolved
@behlendorf behlendorf added the Status: Code Review Needed Ready for review and testing label Oct 8, 2024
@snajpa snajpa marked this pull request as ready for review October 9, 2024 01:52
@snajpa
Copy link
Contributor Author

snajpa commented Oct 9, 2024

As it turns out, this along with #16612 is insufficient, now I see the VERIFY tripped, the igrab in zget still can work out to NULL. So it's not fixing my problem :-D

@snajpa
Copy link
Contributor Author

snajpa commented Oct 9, 2024

cannot open 'testpool': pool I/O is currently suspended <- this is interesting, but I think unrelated to what I'm doing in the PR? will force-push to run the tests again

@snajpa
Copy link
Contributor Author

snajpa commented Oct 10, 2024

Oh I like this one, from alma9:
(fwiw I think it's unrelated but hits at dnode dirty detection that still is AFAIK unsolved)

  [ 3768.354941] ZTS run /usr/share/zfs/zfs-tests/tests/functional/cp_files/cp_stress
  [ 3887.080148] (node->next == LIST_POISON1) is equivalent to (node->prev == LIST_POISON2)
  [ 3887.080152] PANIC at list.h:187:list_link_active()
  [ 3887.084495] Showing stack for process 566921
  [ 3887.086033] CPU: 2 PID: 566921 Comm: seekflood Tainted: P           OE     -------  ---  5.14.0-427.37.1.el9_4.x86_64 #1
  [ 3887.089643] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
  [ 3887.094929] Call Trace:
  [ 3887.096030]  <TASK>
  [ 3887.097275]  dump_stack_lvl+0x34/0x48
  [ 3887.098626]  spl_panic+0xd1/0xe9 [spl]
  [ 3887.100051]  ? srso_alias_return_thunk+0x5/0xfbef5
  [ 3887.101621]  ? srso_alias_return_thunk+0x5/0xfbef5
  [ 3887.103386]  ? srso_alias_return_thunk+0x5/0xfbef5
  [ 3887.104841]  ? dbuf_rele_and_unlock+0x270/0x520 [zfs]
  [ 3887.112559]  ? srso_alias_return_thunk+0x5/0xfbef5
  [ 3887.114002]  ? srso_alias_return_thunk+0x5/0xfbef5
  [ 3887.116556]  ? srso_alias_return_thunk+0x5/0xfbef5
  [ 3887.119314]  ? srso_alias_return_thunk+0x5/0xfbef5
  [ 3887.123656]  ? dnode_hold_impl+0x8df/0x10c0 [zfs]
  [ 3887.127820]  spl_assert+0x17/0x20 [zfs]
  [ 3887.131812]  list_link_active+0x53/0x70 [zfs]
  [ 3887.136131]  dnode_is_dirty+0x5d/0x190 [zfs]
  [ 3887.140516]  dmu_offset_next+0x67/0x210 [zfs]
  [ 3887.145355]  zfs_holey_common+0xa0/0x1a0 [zfs]
  [ 3887.149392]  zfs_holey+0x4b/0x80 [zfs]
  [ 3887.153150]  zpl_llseek+0x89/0xc0 [zfs]
  [ 3887.157083]  ksys_lseek+0x68/0xb0
  [ 3887.159401]  do_syscall_64+0x5c/0x90
  [ 3887.162380]  ? srso_alias_return_thunk+0x5/0xfbef5
  [ 3887.166402]  ? syscall_exit_work+0x103/0x130
  [ 3887.170262]  ? srso_alias_return_thunk+0x5/0xfbef5
  [ 3887.173055]  ? syscall_exit_to_user_mode+0x22/0x40
  [ 3887.177277]  ? srso_alias_return_thunk+0x5/0xfbef5
  [ 3887.182171]  ? do_syscall_64+0x69/0x90
  [ 3887.184482]  ? do_syscall_64+0x69/0x90
  [ 3887.187779]  ? do_syscall_64+0x69/0x90
  [ 3887.190170]  ? syscall_exit_to_user_mode+0x22/0x40
  [ 3887.193358]  ? srso_alias_return_thunk+0x5/0xfbef5
  [ 3887.195979]  ? do_syscall_64+0x69/0x90
  [ 3887.197177]  ? do_syscall_64+0x69/0x90
  [ 3887.198384]  ? sysvec_apic_timer_interrupt+0x3c/0x90
  [ 3887.199916]  entry_SYSCALL_64_after_hwframe+0x72/0xdc

Inode should be linked to dentry and hashed only after both
zpl_xattr_security_init and zpl_init_acl were successful, which
apparently isn't always true for example when memory is tight.

Signed-off-by: Pavel Snajdr <[email protected]>
@satmandu
Copy link
Contributor

Did this end up fixing your crashes?

@snajpa
Copy link
Contributor Author

snajpa commented Oct 29, 2024

@satmandu nope, current progress here #16324 (comment)

@snajpa snajpa closed this Oct 29, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Status: Code Review Needed Ready for review and testing
Projects
None yet
Development

Successfully merging this pull request may close these issues.

General Protection Fault while looking up a file
3 participants