roachtest: fix TestVMPreemptionPolling data race #135312

Open · wants to merge 1 commit into master

Conversation

DarrylWong (Contributor)

This changes monitorForPreemptedVMs to read pollPreemptionInterval just once, so when unit tests change the value we don't run into concurrent access.

Fixes: #135267
Epic: none
Release note: none

@cockroach-teamcity (Member)

This change is Reviewable

DarrylWong marked this pull request as ready for review November 15, 2024 18:56
DarrylWong requested a review from a team as a code owner November 15, 2024 18:56
DarrylWong requested review from srosenberg and nameisbhaskar and removed request for a team November 15, 2024 18:56
@@ -2139,13 +2139,14 @@ func monitorForPreemptedVMs(ctx context.Context, t test.Test, c cluster.Cluster,
 	if c.IsLocal() || !c.Spec().UseSpotVMs {
 		return
 	}
+	interval := pollPreemptionInterval
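
For illustration only, here is a self-contained sketch of the pattern this one-line change applies; the function body and names below are simplified stand-ins, not the actual roachtest code. The package-level interval is read exactly once, before the polling goroutine starts, so a later write by a unit test cannot race with the goroutine:

package main

import (
	"context"
	"log"
	"time"
)

// pollPreemptionInterval is a package-level knob that unit tests overwrite
// to poll faster; reading it from a goroutine while a test writes it is the
// data race being fixed.
var pollPreemptionInterval = 5 * time.Minute

func monitorForPreemptedVMs(ctx context.Context) {
	// Read the knob once, on the caller's goroutine, and only use the local
	// copy inside the goroutine below.
	interval := pollPreemptionInterval
	go func() {
		ticker := time.NewTicker(interval)
		defer ticker.Stop()
		for {
			select {
			case <-ctx.Done():
				return
			case <-ticker.C:
				log.Println("checking for preempted VMs")
			}
		}
	}()
}

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()
	monitorForPreemptedVMs(ctx)
	time.Sleep(time.Second)
}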
@srosenberg (Member) commented on Nov 16, 2024

I was going to mention on the previous PR that we might need a mutex. In general, when there are multiple hooks (for mocking), it might be best to stick them in a mutex-protected struct.

That aside, I'm a bit puzzled by the data race in [1]. The subtests are executed sequentially; btw, you'd still have a race if t.Parallel() were enabled. The write on L754 races with the read from the previous subtest, right? That would imply that the runCtx passed into monitorForPreemptedVMs wasn't cancelled upon completion. If that's the case, we have a leaking goroutine.

[1] #135267

@DarrylWong (Contributor, Author) commented on Nov 16, 2024

Ah yes, I was going to but forgot to leave a comment saying something along the lines of "I'm not really sure how this fixes the data race, because I'm not 100% sure what the data race is".

That would imply that the runCtx passed into monitorForPreemptedVMs wasn't cancelled upon completion. If that's the case, we have a leaking goroutine.

That was my initial thought too, but I commented out the first of the two tests and was still getting the data race. Even weirder, when both tests are running, the data race always happens on the second. That would suggest it relies on both tests running, but it doesn't. To further add to that, commenting out the second test causes the first to start failing with a data race.

Not really sure what to make of that 😅 I also tried adding a defer leaktest.AfterTest which didn't catch anything.

The subtests are executed sequentially; btw, you'd still have a race if t.Parallel() were enabled

it might be best to stick them in a mutex-protected struct.

Yeah, I thought of that, but figured it would be unfortunate to have to add a mutex for a unit-test-only hook. I can go that route though if we think it's safer.

@srosenberg (Member)

We both missed an essential hint in the data race report: Goroutine 741 (finished). At the time the data race report is printed [1], the goroutine running monitorForPreemptedVMs has already exited. This confirms that the runCtx is indeed cancelled.

We can also conclude that it's the write of the second subtest which races with the read of the first subtest. The sequence of (scheduled) events is roughly as follows. The goroutine of the first subtest exits before monitorForPreemptedVMs exits; at this point, tsan has recorded a read from the corresponding goroutine id. Next, the goroutine for the second subtest is executed; tsan records the write. A data race is detected. Subsequently, monitorForPreemptedVMs exits. The reporter iterates through the implicated goroutines, printing the stack traces, at which point Goroutine 741 already finished :)
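
To make the sequence concrete, here is a minimal standalone reproduction of this race pattern under go test -race; the names are hypothetical, not the actual TestVMPreemptionPolling code. Each subtest overwrites a package-level interval that the previous subtest's background goroutine reads without synchronization, so the second subtest's write is unordered with respect to that earlier read, even though the goroutine has already finished by the time the report is printed.

package racedemo

import (
	"context"
	"testing"
	"time"
)

// pollInterval stands in for pollPreemptionInterval.
var pollInterval = time.Minute

func startMonitor(ctx context.Context) {
	go func() {
		// Unsynchronized read of the package-level knob inside the goroutine.
		ticker := time.NewTicker(pollInterval)
		defer ticker.Stop()
		<-ctx.Done()
	}()
}

func TestPreemptionPollingRace(t *testing.T) {
	for _, name := range []string{"first", "second"} {
		t.Run(name, func(t *testing.T) {
			// Each subtest writes the knob. The write in the second subtest
			// has no happens-before edge to the read in the first subtest's
			// (by now finished) goroutine, so tsan reports a race.
			pollInterval = 10 * time.Millisecond
			ctx, cancel := context.WithCancel(context.Background())
			defer cancel()
			startMonitor(ctx)
			time.Sleep(20 * time.Millisecond)
		})
	}
}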

Your fix works because the read in monitorForPreemptedVMs must happen before the first subtest goroutine exits, and transitively it happens before the second subtest does the write. s.Run essentially establishes the happens-before order; from runTest,

monitorForPreemptedVMs(runCtx, t, c, l)
// This is the call to actually run the test.
s.Run(runCtx, t, c)

Thus, the only read in monitorForPreemptedVMs now happens before s.Run exits, and by extension before runTest does.

Yeah, I thought of that, but figured it would be unfortunate to have to add a mutex for a unit-test-only hook. I can go that route though if we think it's safer.

It might be a good idea since the reason the current fix works is rather non-trivial.

[1] https://github.com/llvm/llvm-project/blob/8c7c8eaa1933d24c1eb869ba85469908547e3677/compiler-rt/lib/tsan/rtl/tsan_report.cpp#L424
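
For reference, a minimal sketch of the mutex-protected alternative being discussed; the struct and method names are illustrative, not an existing roachtest API:

package main

import (
	"sync"
	"time"
)

// testingKnobs groups the hooks that unit tests override behind a mutex,
// so test writes and monitor reads are always synchronized.
type testingKnobs struct {
	mu                     sync.Mutex
	pollPreemptionInterval time.Duration
}

var knobs = testingKnobs{pollPreemptionInterval: 5 * time.Minute}

// PollPreemptionInterval returns the current polling interval.
func (k *testingKnobs) PollPreemptionInterval() time.Duration {
	k.mu.Lock()
	defer k.mu.Unlock()
	return k.pollPreemptionInterval
}

// SetPollPreemptionInterval is called by tests to speed up polling.
func (k *testingKnobs) SetPollPreemptionInterval(d time.Duration) {
	k.mu.Lock()
	defer k.mu.Unlock()
	k.pollPreemptionInterval = d
}

func main() {
	knobs.SetPollPreemptionInterval(50 * time.Millisecond)
	_ = knobs.PollPreemptionInterval()
}

Either approach removes the race; the mutex-protected struct has the advantage that any future mocking hooks added to it get synchronization for free, rather than relying on the happens-before argument above.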

Development

Successfully merging this pull request may close these issues.

cmd/roachtest: TestVMPreemptionPolling failed