Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

zfs_deadman_failmode = continue can panic in zfs_refcount_destroy_many #378

Open
lundman opened this issue May 14, 2024 · 5 comments
Open

Comments

@lundman
Copy link

lundman commented May 14, 2024

To trigger the bug referenced in #364 (comment) and probably also openzfs#15802

* set zfs_deadman_failmode = continue
* set zfs_deadman_synctime_ms to something low (60000 -> 1000; 60s down to 1s).
* create zvol, with compression=zstd-19
* copy large (1g+) text file to zvol (format mount, whatever)

Will see message:

zio_deadman_impl(): slow zio[12]
zio_deadman(): zio_wait restarting hung I/O for pool 'tank'

I can confirm that
https://github.com/openzfs/zfs/blob/master/module/zfs/zio.c#L2205-L2208

		if (failmode == ZIO_FAILURE_MODE_CONTINUE &&
		    taskq_empty_ent(&pio->io_tqent)) {
			zio_interrupt(pio);
		}

are called a few times, until eventually

https://github.com/openzfsonwindows/openzfs/blob/development/module/zfs/refcount.c#L98

VERIFY3( rc->rc_count == number ) failed (5120 == 0)
Break instruction exception - code 80000003 (first chance)
OpenZFS!zfs_refcount_destroy_many+0x90:

With the stack:

>	OpenZFS!zfs_refcount_destroy_many+0x90 [C:\src\openzfs\module\zfs\refcount.c @ 98] 	C/C++/ASM
 	OpenZFS!zfs_refcount_destroy+0x17 [C:\src\openzfs\module\zfs\refcount.c @ 113] 	C/C++/ASM
 	OpenZFS!abd_fini_struct+0x68 [C:\src\openzfs\module\zfs\abd.c @ 160] 	C/C++/ASM
 	OpenZFS!abd_free+0xdd [C:\src\openzfs\module\zfs\abd.c @ 320] 	C/C++/ASM
 	OpenZFS!zio_pop_transforms+0x76 [C:\src\openzfs\module\zfs\zio.c @ 452] 	C/C++/ASM
 	OpenZFS!zio_done+0x1392 [C:\src\openzfs\module\zfs\zio.c @ 4795] 	C/C++/ASM
 	OpenZFS!__zio_execute+0x2f4 [C:\src\openzfs\module\zfs\zio.c @ 2363] 	C/C++/ASM
 	OpenZFS!zio_execute+0x30f [C:\src\openzfs\module\zfs\zio.c @ 2274] 	C/C++/ASM
 	OpenZFS!vdev_mirror_io_start+0x4d3 [C:\src\openzfs\module\zfs\vdev_mirror.c @ 694] 	C/C++/ASM
 	OpenZFS!zio_vdev_io_start+0x12b [C:\src\openzfs\module\zfs\zio.c @ 3946] 	C/C++/ASM
 	OpenZFS!__zio_execute+0x2f4 [C:\src\openzfs\module\zfs\zio.c @ 2363] 	C/C++/ASM
 	OpenZFS!zio_execute+0x30f [C:\src\openzfs\module\zfs\zio.c @ 2274] 	C/C++/ASM
 	OpenZFS!taskq_thread+0x51a [C:\src\openzfs\module\os\windows\spl\spl-taskq.c @ 2083] 	C/C++/ASM

However, note that

1: kd> dt rc
Local var @ 0xffff978c7383e548 Type refcount*
0xffff878f`487195a0 
   +0x000 rc_count         : 0

So we die in the VERIFY because rc_count is 5120, but when I inspect rc_count it is now 0. Certainly feels like a race. I additionally had a breakpoint at:
https://github.com/openzfsonwindows/openzfs/blob/development/module/zfs/zio.c#L2420
but it did not get triggered.

@lundman
Copy link
Author

lundman commented May 14, 2024

adjacent (non-boring) stacks

   4.003894  ffff878f3b5a8040 0000000 Blocked    nt!KiSwapContext+0x76
                                        nt!KiSwapThread+0xab5
                                        nt!KiCommitThreadWait+0x137
                                        nt!KeWaitForSingleObject+0x256
                                        OpenZFS!spl_mutex_enter+0x11f
                                        OpenZFS!spl_cv_wait+0xf8
                                        OpenZFS!txg_wait_synced_impl+0x290
                                        OpenZFS!txg_wait_synced+0x29
                                        OpenZFS!dmu_tx_wait+0x2ac
                                        OpenZFS!dmu_tx_assign+0x17e
                                        OpenZFS!zvol_os_write_zv+0x3c9
                                        OpenZFS!wzvol_WkRtn+0x54e
                                        OpenZFS!wzvol_GeneralWkRtn+0x7b
                                        nt!IopProcessWorkItem+0x13c
                                        nt!ExpWorkerThread+0x155
                                        nt!PspSystemThreadStartup+0x57
                                        nt!KiStartSystemThread+0x34

   4.0029fc  ffff878f36f75040 0000938 Blocked    nt!KiSwapContext+0x76
                                        nt!KiSwapThread+0xab5
                                        nt!KiCommitThreadWait+0x137
                                        nt!KeRemovePriQueue+0x1ff
                                        nt!ExpWorkerThread+0xd3
                                        nt!PspSystemThreadStartup+0x57
                                        nt!KiStartSystemThread+0x34

   4.003490  ffff878f39963040 0000000 Blocked    nt!KiSwapContext+0x76
                                        nt!KiSwapThread+0xab5
                                        nt!KiCommitThreadWait+0x137
                                        nt!KeWaitForSingleObject+0x256
                                        OpenZFS!spl_mutex_enter+0x11f
                                        OpenZFS!spl_cv_wait+0xf8
                                        OpenZFS!txg_wait_synced_impl+0x290
                                        OpenZFS!txg_wait_synced+0x29
                                        OpenZFS!dmu_tx_wait+0x2ac
                                        OpenZFS!dmu_tx_assign+0x17e
                                        OpenZFS!zvol_os_write_zv+0x3c9
                                        OpenZFS!wzvol_WkRtn+0x54e
                                        OpenZFS!wzvol_GeneralWkRtn+0x7b
                                        nt!IopProcessWorkItem+0x13c

   4.003260  ffff878f3b02c040 0000003 READY      nt!KiSwapContext+0x76
                                        nt!KeYieldExecution+0xf11
                                        nt!KeDelayExecutionThread+0x147
                                        OpenZFS!kpreempt+0x2f
                                        OpenZFS!kmem_magazine_destroy+0x1ce
                                        OpenZFS!kmem_depot_ws_reap+0x1ca
                                        OpenZFS!kmem_cache_reap+0x122
                                        OpenZFS!taskq_thread+0x51a

   4.000cf4  ffff878f3f6d6080 0000000 RUNNING    OpenZFS!zfs_refcount_destroy_many+0x90
                                        OpenZFS!zfs_refcount_destroy+0x17
                                        OpenZFS!abd_fini_struct+0x68
                                        OpenZFS!abd_free+0xdd
                                        OpenZFS!zio_pop_transforms+0x76
                                        OpenZFS!zio_done+0x1392
                                        OpenZFS!zio_execute+0x30f
                                        OpenZFS!zio_execute+0x30f
                                        OpenZFS!vdev_mirror_io_start+0x4d3
                                        OpenZFS!zio_vdev_io_start+0x12b
                                        OpenZFS!zio_execute+0x30f
                                        OpenZFS!zio_execute+0x30f
                                        OpenZFS!taskq_thread+0x51a


   4.002f34  ffff878f43491080 0000000 READY      nt!KiSwapContext+0x76
                                        nt!KiDispatchInterrupt+0x67e
                                        nt!KiDpcInterrupt+0x350
                                        nt!KeReleaseSpinLock+0x2d
                                        OpenZFS!spl_mutex_exit+0xfd
                                        OpenZFS!__zfs_dbgmsg+0xba
                                        OpenZFS!__dprintf+0x29e
                                        OpenZFS!zio_deadman+0x13c
                                        OpenZFS!zio_wait+0x69e
                                        OpenZFS!dsl_pool_sync+0x2f9
                                        OpenZFS!spa_sync_iterate_to_convergence+0x109
                                        OpenZFS!spa_sync+0x66d
                                        OpenZFS!txg_sync_thread+0x45b
                                        nt!PspSystemThreadStartup+0x57

   4.000d3c  ffff878f3a48d080 0000000 Blocked    nt!KiSwapContext+0x76
                                        nt!KiSwapThread+0xab5
                                        nt!KiCommitThreadWait+0x137
                                        nt!KeWaitForSingleObject+0x256
                                        OpenZFS!spl_mutex_enter+0x11f
                                        OpenZFS!spl_cv_wait+0xf8
                                        OpenZFS!txg_wait_synced_impl+0x290
                                        OpenZFS!txg_wait_synced+0x29
                                        OpenZFS!dmu_tx_wait+0x2ac
                                        OpenZFS!dmu_tx_assign+0x17e
                                        OpenZFS!zvol_os_write_zv+0x3c9
                                        OpenZFS!wzvol_WkRtn+0x54e
                                        OpenZFS!wzvol_GeneralWkRtn+0x7b
                                        nt!IopProcessWorkItem+0x13c


   4.002908  ffff878f3e6f62c0 0000000 READY      nt!KiSwapContext+0x76
                                        nt!KiSwapThread+0xab5
                                        nt!KiCommitThreadWait+0x137
                                        nt!KeWaitForSingleObject+0x256
                                        OpenZFS!spl_mutex_enter+0x11f
                                        OpenZFS!spl_cv_wait+0xf8
                                        OpenZFS!txg_wait_synced_impl+0x290
                                        OpenZFS!txg_wait_synced+0x29
                                        OpenZFS!dmu_tx_wait+0x2ac
                                        OpenZFS!dmu_tx_assign+0x17e
                                        OpenZFS!zvol_os_write_zv+0x3c9
                                        OpenZFS!wzvol_WkRtn+0x54e
                                        OpenZFS!wzvol_GeneralWkRtn+0x7b
                                        nt!IopProcessWorkItem+0x13c


   4.00325c  ffff878f3a126040 0000000 STANDBY    nt!KiSwapContext+0x76
                                        nt!KiSwapThread+0xab5
                                        nt!KiCommitThreadWait+0x137
                                        nt!KeWaitForSingleObject+0x256
                                        OpenZFS!spl_mutex_enter+0x11f
                                        OpenZFS!spl_cv_wait+0xf8
                                        OpenZFS!txg_wait_synced_impl+0x290
                                        OpenZFS!txg_wait_synced+0x29
                                        OpenZFS!dmu_tx_wait+0x2ac
                                        OpenZFS!dmu_tx_assign+0x17e
                                        OpenZFS!zvol_os_write_zv+0x3c9
                                        OpenZFS!wzvol_WkRtn+0x54e
                                        OpenZFS!wzvol_GeneralWkRtn+0x7b
                                        nt!IopProcessWorkItem+0x13c
                                        nt!ExpWorkerThread+0x155
                                        nt!PspSystemThreadStartup+0x57
                                        nt!KiStartSystemThread+0x34

   4.002c90  ffff878f38f06040 0000000 Blocked    nt!KiSwapContext+0x76
                                        nt!KiSwapThread+0xab5
                                        nt!KiCommitThreadWait+0x137
                                        nt!KeWaitForSingleObject+0x256
                                        OpenZFS!spl_mutex_enter+0x11f
                                        OpenZFS!spl_cv_wait+0xf8
                                        OpenZFS!txg_wait_synced_impl+0x290
                                        OpenZFS!txg_wait_synced+0x29
                                        OpenZFS!dmu_tx_wait+0x2ac
                                        OpenZFS!dmu_tx_assign+0x17e
                                        OpenZFS!zvol_os_write_zv+0x3c9
                                        OpenZFS!wzvol_WkRtn+0x54e
                                        OpenZFS!wzvol_GeneralWkRtn+0x7b
                                        nt!IopProcessWorkItem+0x13c

   4.001bd4  ffff878f3b0c6580 0000000 Blocked    nt!KiSwapContext+0x76
                                        nt!KiSwapThread+0xab5
                                        nt!KiCommitThreadWait+0x137
                                        nt!KeWaitForSingleObject+0x256
                                        OpenZFS!spl_mutex_enter+0x11f
                                        OpenZFS!spl_cv_wait+0xf8
                                        OpenZFS!txg_wait_synced_impl+0x290
                                        OpenZFS!txg_wait_synced+0x29
                                        OpenZFS!dmu_tx_wait+0x2ac
                                        OpenZFS!dmu_tx_assign+0x17e
                                        OpenZFS!zvol_os_write_zv+0x3c9
                                        OpenZFS!wzvol_WkRtn+0x54e
                                        OpenZFS!wzvol_GeneralWkRtn+0x7b
                                        nt!IopProcessWorkItem+0x13c
                                        nt!ExpWorkerThread+0x155
                                        nt!PspSystemThreadStartup+0x57
                                        nt!KiStartSystemThread+0x34

   4.0013ec  ffff878f3b0b45c0 0000000 Blocked    nt!KiSwapContext+0x76
                                        nt!KiSwapThread+0xab5
                                        nt!KiCommitThreadWait+0x137
                                        nt!KeWaitForSingleObject+0x256
                                        OpenZFS!spl_mutex_enter+0x11f
                                        OpenZFS!spl_cv_wait+0xf8
                                        OpenZFS!txg_wait_synced_impl+0x290
                                        OpenZFS!txg_wait_synced+0x29
                                        OpenZFS!dmu_tx_wait+0x2ac
                                        OpenZFS!dmu_tx_assign+0x17e
                                        OpenZFS!zvol_os_write_zv+0x3c9
                                        OpenZFS!wzvol_WkRtn+0x54e
                                        OpenZFS!wzvol_GeneralWkRtn+0x7b
                                        nt!IopProcessWorkItem+0x13c

   4.001704  ffff878f42a13080 0000000 Blocked    nt!KiSwapContext+0x76
                                        nt!KiSwapThread+0xab5
                                        nt!KiCommitThreadWait+0x137
                                        nt!KeWaitForSingleObject+0x256
                                        OpenZFS!spl_mutex_enter+0x11f
                                        OpenZFS!spl_cv_wait+0xf8
                                        OpenZFS!txg_wait_synced_impl+0x290
                                        OpenZFS!txg_wait_synced+0x29
                                        OpenZFS!dmu_tx_wait+0x2ac
                                        OpenZFS!dmu_tx_assign+0x17e
                                        OpenZFS!zvol_os_write_zv+0x3c9
                                        OpenZFS!wzvol_WkRtn+0x54e
                                        OpenZFS!wzvol_GeneralWkRtn+0x7b
                                        nt!IopProcessWorkItem+0x13c
                                        nt!ExpWorkerThread+0x155
                                        nt!PspSystemThreadStartup+0x57
                                        nt!KiStartSystemThread+0x34

   4.0036e8  ffff878f3fbd6080 0000000 RUNNING    nt!write_char+0x46
                                        nt!write_string+0x48
                                        nt!output_l+0x5cc
                                        nt!snprintf+0x69
                                        OpenZFS!__dprintf+0x1ee
                                        OpenZFS!txg_wait_synced_impl+0x200
                                        OpenZFS!txg_wait_synced+0x29
                                        OpenZFS!dmu_tx_wait+0x2ac
                                        OpenZFS!dmu_tx_assign+0x17e
                                        OpenZFS!zvol_os_write_zv+0x3c9
                                        OpenZFS!wzvol_WkRtn+0x54e
                                        OpenZFS!wzvol_GeneralWkRtn+0x7b
                                        nt!IopProcessWorkItem+0x13c

@lundman
Copy link
Author

lundman commented May 22, 2024

Applying openzfs#16216 and running through the test above, the ASSERT has not been triggered. The breakpoint at zio_wait restarting hung I/O for pool 'tank' has kicked in so I feel it should have had a panic after that. So it does appear to help with this problem, even if it isn't obvious to me why that is.

@lundman
Copy link
Author

lundman commented May 22, 2024

Woo commenting never fails!

VERIFY3( rc->rc_count == number ) failed (5120 == 0)
Break instruction exception - code 80000003 (first chance)
OpenZFS!zfs_refcount_destroy_many+0x90:

So we are in fact tracking two problems in openzfs#15802

1: kd> dt rc
Local var @ 0xffff978c74d95748 Type refcount*
0xffff878f`85da9ba0 
   +0x000 rc_count         : 0
   +0x008 rc_mtx           : kmutex
   +0x038 rc_tree          : avl_tree
   +0x058 rc_removed       : list
   +0x078 rc_removed_count : 0
   +0x07c rc_tracked       : 0 ( B_FALSE )

same thing where rc_count is actually 0, even if the ASSRT print says otherwise.

buffer:

FFFF878F3A7C3080: zio.c:2236:zio_deadman(): zio_wait restarting hung I/O for pool 'BOOM'
FFFF878F3E55C080: dprintf: dnode.c:2470:dnode_diduse_space(): ds=BOOM/vol obj=1 dn=FFFF878F49F435E0 dnp=FFFF878F49F24200 used=586098688 delta=2048
FFFF878F3E55C080: SPL: vmem_freelist_insert_sort_by_time: at marker (bucket_4096)(steps: 0) p->vs_start, end == 0, 4096
FFFF878F3E55C080: SPL: vmem_freelist_insert_sort_by_time: at marker (bucket_heap)(steps: 0) p->vs_start, end == 0, 4096
FFFF878F3E55C080: SPL: vmem_freelist_insert_sort_by_time: at marker (bucket_4096)(steps: 0) p->vs_start, end == 0, 4096
FFFF878F3E55C080: SPL: vmem_freelist_insert_sort_by_time: at marker (bucket_heap)(steps: 0) p->vs_start, end == 0, 4096
FFFF878F3BDFB080: dprintf: dsl_dataset.c:291:dsl_dataset_block_kill(): freeing ds=81 DVA[0]=<0:421e7a00:c00> [L0 zvol object] fletcher4 zstd unencrypted LE contiguous unique single size=4000L/c00P birth=81L/81P fill=1 cksum=0000
013e082cd2e0:00022b37cff68f1a:023e5ed0cc6a6648:b6f79395539380b9
FFFF878F3BDFB080: dprintf: dsl_dataset.c:145:dsl_dataset_block_born(): ds=FFFF878F49BF1000 DVA[0]=<0:ac8e2c00:1400> [L0 zvol object] fletcher4 zstd unencrypted LE contiguous unique single size=4000L/1400P birth=105L/105P fill=1
cksum=0000025d95e0c23b:000641ab5cf96f7c:0a80ce5969bbb04a:37e02aab71c09ebd
FFFF878F3A7C3080: zio.c:2236:zio_deadman(): zio_wait restarting hung I/O for pool 'BOOM'
FFFF878F3A7C3080: zio.c:2236:zio_deadman(): zio_wait restarting hung I/O for pool 'BOOM'
FFFF878F3A7C3080: zio.c:2236:zio_deadman(): zio_wait restarting hung I/O for pool 'BOOM'
FFFF878F3A7C3080: zio.c:2236:zio_deadman(): zio_wait restarting hung I/O for pool 'BOOM'
FFFF878F3E55C080: dprintf: dnode.c:2470:dnode_diduse_space(): ds=BOOM/vol obj=1 dn=FFFF878F49F435E0 dnp=FFFF878F49F24200 used=586100736 delta=-1024
FFFF878F3E55C080: SPL: vmem_freelist_insert_sort_by_time: at marker (bucket_4096)(steps: 0) p->vs_start, end == 0, 4096
FFFF878F3E55C080: SPL: vmem_freelist_insert_sort_by_time: at marker (bucket_heap)(steps: 0) p->vs_start, end == 0, 4096
FFFF878F3BDFB080: dprintf: dsl_dataset.c:291:dsl_dataset_block_kill(): freeing ds=81 DVA[0]=<0:421e9400:1000> [L0 zvol object] fletcher4 zstd unencrypted LE contiguous unique single size=4000L/1000P birth=81L/81P fill=1 cksum=00
0001d898ceee0b:000405240751c14a:056995fd2da433e0:6e23f7873626b125
FFFF878F3BDFB080: dprintf: dsl_dataset.c:145:dsl_dataset_block_born(): ds=FFFF878F49BF1000 DVA[0]=<0:ac8e4000:c00> [L0 zvol object] fletcher4 zstd unencrypted LE contiguous unique single size=4000L/c00P birth=105L/105P fill=1 ck
sum=000001536f6b6065:000221f16aedc85e:0224ef65e5714350:9b9edd3b7e772116
FFFF878F3A7C3080: zio.c:2236:zio_deadman(): zio_wait restarting hung I/O for pool 'BOOM'
FFFF878F3E55C080: dprintf: dnode.c:2470:dnode_diduse_space(): ds=BOOM/vol obj=1 dn=FFFF878F49F435E0 dnp=FFFF878F49F24200 used=586099712 delta=512
FFFF878F3E55C080: SPL: vmem_freelist_insert_sort_by_time: at marker (bucket_4096)(steps: 0) p->vs_start, end == 0, 4096
FFFF878F3E55C080: SPL: vmem_freelist_insert_sort_by_time: at marker (bucket_heap)(steps: 0) p->vs_start, end == 0, 4096
FFFF878F3E55C080: SPL: vmem_freelist_insert_sort_by_time: at marker (bucket_4096)(steps: 0) p->vs_start, end == 0, 4096
FFFF878F3E55C080: SPL: vmem_freelist_insert_sort_by_time: at marker (bucket_heap)(steps: 0) p->vs_start, end == 0, 4096
FFFF878F3A7C3080: zio.c:2197:zio_deadman_impl(): slow zio[11]: zio=FFFF878F636B5E00x timestamp=10539098126100 delta=42800 queued=0 io=10539098127900 path=/dev/physicaldrive0 last=10539077025100 type=2 priority=3 flags=0x184080 stage=0x200000 pipeline=0x2e00000 pipeline-trace=0x200001 objset=81 object=1
level=0 blkid=72629 offset=2899201024 size=5120 error=0
VERIFY3( rc->rc_count == number ) failed (5120 == 0)

size and rc_count both being 5120 is curious.

@amotin
Copy link

amotin commented May 23, 2024

I haven't looked what is that I/O, but it seems reasonable to me that if some ZIO haven't completed, its buffer is likely not freed. "failmode=continue" is a direct request for troubles.

@lundman
Copy link
Author

lundman commented May 23, 2024

Yeah I assume the IO was scheduled, but not yet completed - the deadman triggers and the IO is restarted. It would seem this isn't done as easily as the code is.
I would prefer it didn't panic of course. It can be triggered on a system without tuning, it is just harder. If we don't want to support failmode=continue then we should probably remove it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants