roachtest: fix TestVMPreemptionPolling data race #135312

Open · wants to merge 1 commit into master

Conversation

DarrylWong (Contributor)

This changes monitorForPreemptedVMs to read pollPreemptionInterval just once, so when unit tests change the value we don't run into concurrent access.

Fixes: #135267
Epic: none
Release note: none

@cockroach-teamcity (Member)

This change is Reviewable

DarrylWong marked this pull request as ready for review November 15, 2024 18:56
DarrylWong requested a review from a team as a code owner November 15, 2024 18:56
DarrylWong requested review from srosenberg and nameisbhaskar and removed request for a team November 15, 2024 18:56
@@ -2139,13 +2139,14 @@ func monitorForPreemptedVMs(ctx context.Context, t test.Test, c cluster.Cluster,
 	if c.IsLocal() || !c.Spec().UseSpotVMs {
 		return
 	}
+	interval := pollPreemptionInterval
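
For illustration only, here is a self-contained sketch of the pattern this one-line change applies; the function body and names below are simplified stand-ins, not the actual roachtest code. The package-level interval is read exactly once, before the polling goroutine starts, so a later write by a unit test cannot race with the goroutine:

package main

import (
	"context"
	"log"
	"time"
)

// pollPreemptionInterval is a package-level knob that unit tests overwrite
// to poll faster; reading it from a goroutine while a test writes it is the
// data race being fixed.
var pollPreemptionInterval = 5 * time.Minute

func monitorForPreemptedVMs(ctx context.Context) {
	// Read the knob once, on the caller's goroutine, and only use the local
	// copy inside the goroutine below.
	interval := pollPreemptionInterval
	go func() {
		ticker := time.NewTicker(interval)
		defer ticker.Stop()
		for {
			select {
			case <-ctx.Done():
				return
			case <-ticker.C:
				log.Println("checking for preempted VMs")
			}
		}
	}()
}

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()
	monitorForPreemptedVMs(ctx)
	time.Sleep(time.Second)
}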
@srosenberg (Member) commented on Nov 16, 2024

I was going to mention on the previous PR that we might need a mutex. In general, when there are multiple hooks (for mocking), it might be best to stick them in a mutex-protected struct.

That aside, I'm a bit puzzled by the data race in [1]. The subtests are executed sequentially; btw, you'd still have a race if t.Parallel() were enabled. The write on L754 races with the read from the previous subtest, right? That would imply that the runCtx passed into monitorForPreemptedVMs wasn't cancelled upon completion. If that's the case, we have a leaking goroutine.

[1] #135267

@DarrylWong (Contributor, Author) commented on Nov 16, 2024

Ah yes, I was going to but forgot to leave a comment saying something along the lines of "I'm not really sure how this fixes the data race, because I'm not 100% sure what the data race is".

That would imply that the runCtx passed into monitorForPreemptedVMs wasn't cancelled upon completion. If that's the case, we have a leaking goroutine.

That was my initial thought too, but I commented out the first of the two tests and was still getting the data race. Even weirder, when both tests are running, the data race always happens on the second. That would suggest it relies on both tests running, but it doesn't. To further add to that, commenting out the second test causes the first to start failing with a data race.

Not really sure what to make of that 😅 I also tried adding a defer leaktest.AfterTest which didn't catch anything.

The subtests are executed sequentially; btw, you'd still have a race if t.Parallel() were enabled

it might be best to stick them in a mutex-protected struct.

Yeah, I thought of that, but figured it would be unfortunate to have to add a mutex for a unit-test-only hook. I can go that route though if we think it's safer.

@srosenberg (Member)

We both missed an essential hint in the data race report: Goroutine 741 (finished). At the time the data race report is printed [1], the goroutine running monitorForPreemptedVMs has already exited. This confirms that the runCtx is indeed cancelled.

We can also conclude that it's the write of the second subtest which races with the read of the first subtest. The sequence of (scheduled) events is roughly as follows. The goroutine of the first subtest exits before monitorForPreemptedVMs exits; at this point, tsan has recorded a read from the corresponding goroutine id. Next, the goroutine for the second subtest is executed; tsan records the write. A data race is detected. Subsequently, monitorForPreemptedVMs exits. The reporter iterates through the implicated goroutines, printing the stack traces, at which point Goroutine 741 already finished :)
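
To make the sequence concrete, here is a minimal standalone reproduction of this race pattern under go test -race; the names are hypothetical, not the actual TestVMPreemptionPolling code. Each subtest overwrites a package-level interval that the previous subtest's background goroutine reads without synchronization, so the second subtest's write is unordered with respect to that earlier read, even though the goroutine has already finished by the time the report is printed.

package racedemo

import (
	"context"
	"testing"
	"time"
)

// pollInterval stands in for pollPreemptionInterval.
var pollInterval = time.Minute

func startMonitor(ctx context.Context) {
	go func() {
		// Unsynchronized read of the package-level knob inside the goroutine.
		ticker := time.NewTicker(pollInterval)
		defer ticker.Stop()
		<-ctx.Done()
	}()
}

func TestPreemptionPollingRace(t *testing.T) {
	for _, name := range []string{"first", "second"} {
		t.Run(name, func(t *testing.T) {
			// Each subtest writes the knob. The write in the second subtest
			// has no happens-before edge to the read in the first subtest's
			// (by now finished) goroutine, so tsan reports a race.
			pollInterval = 10 * time.Millisecond
			ctx, cancel := context.WithCancel(context.Background())
			defer cancel()
			startMonitor(ctx)
			time.Sleep(20 * time.Millisecond)
		})
	}
}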

Your fix works because the read in monitorForPreemptedVMs must happen before the first subtest goroutine exits, and transitively it happens before the second subtest does the write. s.Run essentially establishes the happens-before order; from runTest,

monitorForPreemptedVMs(runCtx, t, c, l)
// This is the call to actually run the test.
s.Run(runCtx, t, c)

Thus, the only read in monitorForPreemptedVMs now happens before s.Run exits, and by extension before runTest does.

Yeah, I thought of that, but figured it would be unfortunate to have to add a mutex for a unit-test-only hook. I can go that route though if we think it's safer.

It might be a good idea since the reason the current fix works is rather non-trivial.

[1] https://github.com/llvm/llvm-project/blob/8c7c8eaa1933d24c1eb869ba85469908547e3677/compiler-rt/lib/tsan/rtl/tsan_report.cpp#L424
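
For reference, a minimal sketch of the mutex-protected alternative being discussed; the struct and method names are illustrative, not an existing roachtest API:

package main

import (
	"sync"
	"time"
)

// testingKnobs groups the hooks that unit tests override behind a mutex,
// so test writes and monitor reads are always synchronized.
type testingKnobs struct {
	mu                     sync.Mutex
	pollPreemptionInterval time.Duration
}

var knobs = testingKnobs{pollPreemptionInterval: 5 * time.Minute}

// PollPreemptionInterval returns the current polling interval.
func (k *testingKnobs) PollPreemptionInterval() time.Duration {
	k.mu.Lock()
	defer k.mu.Unlock()
	return k.pollPreemptionInterval
}

// SetPollPreemptionInterval is called by tests to speed up polling.
func (k *testingKnobs) SetPollPreemptionInterval(d time.Duration) {
	k.mu.Lock()
	defer k.mu.Unlock()
	k.pollPreemptionInterval = d
}

func main() {
	knobs.SetPollPreemptionInterval(50 * time.Millisecond)
	_ = knobs.PollPreemptionInterval()
}

Either approach removes the race; the mutex-protected struct has the advantage that any future mocking hooks added to it get synchronization for free, rather than relying on the happens-before argument above.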

Development

Successfully merging this pull request may close these issues.

cmd/roachtest: TestVMPreemptionPolling failed