Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix data race due to trimming ManagedFields in resource syncer #891

Merged
merged 1 commit into from
Apr 17, 2024

Conversation

tpantelis
Copy link
Contributor

This was observed in a unit test elsewhere that configures a resync period. On re-sync, the K8s DeltaFIFO retrieves every object from the cache store and re-queues them. It also invokes the transform function which results in a data race when the resource syncer tries to nil the ManagedFields. This is because the object instance is the same as that stored in the cache which was previously accessed by another thread. So mutating it without the protection of a lock is unsafe. This is really an issue with the DeltaFIFO - it is documented that the "TransformFunc sees the object before any other actor, and it is now safe to mutate the object in place instead of making a copy" however this is not the case on a re-sync. The DeltaFIFO should either elide the TransformFunc on re-sync or make a copy of the object retrieved from the cache store.

As a workaround, the resource syncer should only set ManagedFields to nil if it is non-nil, which will be the case a a re-sync. Added a unit test to cover this case.

Also the object passed to the TransformFunc could be a DeletedFinalStateUnknown so we need to handle that as well.

This was observed in a unit test elsewhere that configures a resync
period. On resync, the K8s DeltaFIFO retrieves every object from the
cache store and re-queues them. It also invokes the transform function
which results in a data race when the resource syncer tries to nil the
ManagedFields. This is because the object instance is the same as that
stored in the cache which was previously accessed by another thread. So
mutating it without the protection of a lock is unsafe. This is really
an issue with the DeltaFIFO - it is documented that the "TransformFunc
sees the object before any other actor, and it is now safe to mutate the
object in place instead of making a copy" however this is not the case
on a re-sync. The DeltaFIFO should either elide the TransformFunc on
re-sync or make a copy of the object retrieved from the cache store.

As a workaround, the resource syncer should only set ManagedFields to
nil if it is non-nil, which will be the case a a re-sync. Added a unit
test to cover this case.

Also the object passed to the TransformFunc could be a
DeletedFinalStateUnknown so we need to handle that as well.

Signed-off-by: Tom Pantelis <[email protected]>
@submariner-bot
Copy link
Contributor

🤖 Created branch: z_pr891/tpantelis/transform_fn_data_race

Expect(resourceutils.MustToMeta(obj).GetManagedFields()).Should(BeNil())

// Sleep a little so a re-sync occurs and doesn't cause a data race.
time.Sleep(200)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn’t something be checked after the sleep?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's nothing we can check here but Ginkgo will fail with a data race if it occurs. So we're just trying to trigger a data race here.

@skitt
Copy link
Member

skitt commented Apr 16, 2024

Has this been reported upstream?

@tpantelis
Copy link
Contributor Author

tpantelis commented Apr 16, 2024

Has this been reported upstream?

kubernetes/kubernetes#124337

@tpantelis tpantelis merged commit 5cfcef6 into submariner-io:devel Apr 17, 2024
16 checks passed
@submariner-bot
Copy link
Contributor

🤖 Closed branches: [z_pr891/tpantelis/transform_fn_data_race]

@tpantelis tpantelis deleted the transform_fn_data_race branch July 17, 2024 16:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
No open projects
Status: Done
Development

Successfully merging this pull request may close these issues.

4 participants