
Fix concurrent map panic on metadata #2

Merged: 2 commits merged into containerd:main from fix-concurrent-map-write-panic on Jan 8, 2025

Conversation

@djdongjin (Member) commented Dec 26, 2024:

Fixes containerd/containerd#11138 (requires a release here, then updating otelttrpc in containerd).

If the ctx has a ttrpc.MD, inject will reuse the same ttrpc.MD across requests, causing a concurrent map write panic.

This PR adds a UT for concurrent client/server requests to capture the issue, and fixes it by cloning the ttrpc.MD on inject.

Running the UT without the fix (commit 2) reproduces the panic:

$ go test ./... -race
...
WARNING: DATA RACE
Read at 0x00c000114550 by goroutine 57:
  github.com/containerd/otelttrpc.inject()
--- FAIL: TestClientCallServerConcurrent (0.02s)
    testing.go:1399: race detected during execution of test
...

With the fix, the test passes and CI is green:

$ go test ./... -race
?       github.com/containerd/otelttrpc/internal        [no test files]
ok      github.com/containerd/otelttrpc (cached)
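For context, a minimal standalone sketch of the racy pattern the test exercises (hypothetical code, not the PR's actual TestClientCallServerConcurrent): N goroutines share one ctx carrying the same ttrpc.MD, and each writes to it the way the pre-fix inject did.

package main

import (
	"context"
	"sync"

	"github.com/containerd/ttrpc"
)

func main() {
	// One ctx carrying one shared ttrpc.MD: the shape of the pre-fix bug.
	ctx := ttrpc.WithMetadata(context.Background(), ttrpc.MD{})

	var wg sync.WaitGroup
	for i := 0; i < 10; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			md, _ := ttrpc.GetMetadata(ctx)
			// Without cloning first, every goroutine writes to the same
			// map; `go run -race` flags this as a data race.
			md.Set("traceparent", "00-dummy")
		}()
	}
	wg.Wait()
}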

@djdongjin djdongjin force-pushed the fix-concurrent-map-write-panic branch 2 times, most recently from 9f57df4 to 97fca02 on December 26, 2024 21:46
metadata_supplier.go (outdated review thread, resolved)
@djdongjin (Member, Author) commented:

cc @cpuguy83 @just1not2

func inject(ctx context.Context, propagators propagation.TextMapPropagator, req *ttrpc.Request) context.Context {
	md, ok := ttrpc.GetMetadata(ctx)
	if !ok {
		md = make(ttrpc.MD)
	} else {
		// make a copy to avoid concurrent read/write panic
		md = copyMD(md)
	}
	// ...
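copyMD itself isn't shown in the snippet; a plausible sketch of the helper (not the exact PR code, and the PR later switched to ttrpc.MD.Clone()) follows. Since ttrpc.MD is a map[string][]string, copying means a new map plus new value slices:

// copyMD copies the map and its value slices so later writes can't
// touch the MD still referenced by the caller's ctx.
func copyMD(md ttrpc.MD) ttrpc.MD {
	out := make(ttrpc.MD, len(md))
	for k, v := range md {
		out[k] = append([]string(nil), v...)
	}
	return out
}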
A Member commented:
I feel like this should be implemented in ttrpc.GetMetadata directly.

@djdongjin (Member, Author) replied Dec 26, 2024:

I feel letting the caller decide whether to copy (only when modification is needed) would be more flexible, and it's more similar to http.Header.

E.g., if ttrpc.GetMetadata returned a copy directly, then every extract call would also make a copy (which is unnecessary, since extract only reads, IIUC):

func extract(ctx context.Context, propagators propagation.TextMapPropagator) context.Context {
	md, ok := ttrpc.GetMetadata(ctx)
	if !ok {
		md = make(ttrpc.MD)
	}
	// ...

We would also need to modify other places (e.g., in containerd) where the caller already makes a copy itself:

https://github.com/containerd/containerd/blob/c3efa0cb339ffc2d6c9f03d7fb4e5e6577d0194b/pkg/namespaces/ttrpc.go#L42-L44
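For comparison, a small sketch of the net/http convention the comment alludes to (function and header names here are illustrative, not containerd code): http.Header is also a map type, and the standard library leaves copying to the caller via Header.Clone().

package example

import "net/http"

// addTraceHeader clones the header map only because it intends to
// modify it; pure readers use req.Header directly with no copy.
func addTraceHeader(req *http.Request) {
	h := req.Header.Clone()
	h.Set("X-Example-Trace", "example")
	req.Header = h
}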

Member replied:

Fair enough.

Member commented:

Since we'll need to do cross-package synchronization anyway (at least merge this, preferably tag a patch release, then update containerd to point to the new release), maybe while we're at it we should go all the way and:

  1. review and merge ttrpc #177, and tag a new patch release
  2. switch here to using the new ttrpc.MD.Clone() introduced by #177, merge this, and tag a new patch release
  3. point containerd to the newly tagged releases and update the ttrpc.MD-copying code in containerd to use the same ttrpc.MD.Clone()

@cpuguy83 WDYT?

@djdongjin (Member, Author) replied:

This sounds great to me 👍. Happy to update this PR after containerd/ttrpc#177 is merged/released, and to update the containerd code once both PRs are merged/released.

@djdongjin (Member, Author) commented:

Updated this PR to use the new MD.Clone from containerd/ttrpc#177.

Thanks @fuweid for helping make the quick release! (https://github.com/containerd/ttrpc/releases/tag/v1.2.7)
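Presumably the updated inject path now reads roughly like this (a sketch based on this thread, not the exact merged diff), with ttrpc.MD.Clone() from v1.2.7 replacing the local copyMD helper:

md, ok := ttrpc.GetMetadata(ctx)
if !ok {
	md = make(ttrpc.MD)
} else {
	// Clone so we never mutate the MD shared through the caller's ctx.
	md = md.Clone()
}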

interceptor_test.go (outdated review thread, resolved)
@djdongjin djdongjin force-pushed the fix-concurrent-map-write-panic branch from 97fca02 to b9ec520 on December 27, 2024 17:42
@djdongjin djdongjin force-pushed the fix-concurrent-map-write-panic branch from b9ec520 to eb0c89c on December 27, 2024 17:45
@djdongjin djdongjin force-pushed the fix-concurrent-map-write-panic branch 2 times, most recently from 312dbd7 to 0d7ed46 on December 30, 2024 01:40
@djdongjin djdongjin requested review from klihub and cpuguy83 December 30, 2024 01:58
@djdongjin (Member, Author) commented:

kindly ping @klihub, PTAL when you have some time 🙏 thanks!

interceptor_test.go (outdated review thread, resolved)
example/go.mod (review thread, resolved)
@djdongjin djdongjin force-pushed the fix-concurrent-map-write-panic branch from 0d7ed46 to 2ba3be1 on January 2, 2025 16:59
@klihub (Member) commented Jan 2, 2025:

> kindly ping @klihub, PTAL when you have some time 🙏 thanks!

I'm fine in general with this PR. It is a necessary fix, and now that we export MD.Clone() from ttrpc itself and we use it here, I think this is good.

I have two small nits remaining. One is about getting rid of that post-go1.22 comment by changing the code so that its behavior is independent of the Go version used. The other is a more general one about bumping dependencies... and I'm not sure about that one at the deps-detail level: I haven't looked at all the changes there to, for instance, compare the bumps against the versions used in containerd itself, or anything similar. @cpuguy83 Any thoughts on that one?

@djdongjin (Member, Author) commented Jan 2, 2025:

Hi @klihub, thanks for the comments.

> One is about getting rid of that post-go1.22 comment by changing the code so that its behavior is independent of the Go version used.

Good point, updated as you suggested.

> The other is a more general one about bumping dependencies... and I'm not sure about that one at the deps-detail level.

Also replied inline: only ttrpc 1.2.2 -> 1.2.7 was manually updated; all the others came from go mod tidy. :)


On a side note, my personal preference (on deps) is to bump regularly (i.e., not to stay at the minimum version, the way the go version in go.mod does):

  1. I feel older deps tend to have, or be impacted by, more vulnerabilities.
  2. Updating regularly keeps each bump small. Otherwise, every bump becomes a minor/major update that includes many commits/changes.
  3. Unlike the go binary (which a user typically shares across all their Go projects and likely upgrades more regularly), a user will most likely inherit whichever dependency version we use into their individual project (unless they explicitly bump it or another of their deps requires a newer version), which makes point 1 impact our users as well.

But this is only my view :) happy to discuss more and follow the guidance of the project/community, thanks!
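For reference, the shape of the dependency change being discussed, as an illustrative go.mod fragment (only the ttrpc v1.2.2 -> v1.2.7 bump is from this thread; the second module's version is a made-up placeholder):

module github.com/containerd/otelttrpc

require (
	github.com/containerd/ttrpc v1.2.7 // bumped manually from v1.2.2
	go.opentelemetry.io/otel v1.31.0 // placeholder; pulled forward by `go mod tidy`
)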

@djdongjin djdongjin requested a review from klihub January 2, 2025 17:17
@cpuguy83 (Member) commented Jan 2, 2025:

> @cpuguy83 Any thoughts on that one?

Seems like there isn't really a choice here since the bumps are coming from ttrpc itself.

@djdongjin (Member, Author) commented:

Hi @klihub 👋 kindly ping, let me know if you have any other comments/suggestions on this PR, thank you!

@klihub (Member) commented Jan 7, 2025:

> Hi @klihub 👋 kindly ping, let me know if you have any other comments/suggestions on this PR, thank you!

@djdongjin I think this is fine now.

@mikebrow (Member) left a review comment:

I don't see any locking in MD.Clone, so we could still hit concurrent read/write operations during the cloning func.. IOW, do we need to serialize reads/writes to the metadata here?

Is there any similar chance of a parallel problem on calls to extract?

@cpuguy83 (Member) commented Jan 7, 2025:

The solution is to clone if you need to modify anything.
So there shouldn't be concurrent read/write.

@djdongjin (Member, Author) commented Jan 7, 2025:

> I don't see any locking in MD.Clone, so we could still hit concurrent read/write operations during the cloning func.. IOW, do we need to serialize reads/writes to the metadata here?

Yeah, as Brian mentioned, we only clone when we need to make modifications, so a write always happens after the read (clone).

> Is there any similar chance of a parallel problem on calls to extract?

IIUC, the case you describe is:

  1. a server extract (reads MD_0)
  2. N clients inject (read/clone MD_0, write MD_I)
  3. a server extract (reads MD_I)

This PR fixes the issue in parallel step 2 (otherwise the N clients would all write to MD_0). I don't think there can be read/write map panics: step 1 and the concurrent step-2 calls are all reads of MD_0, and step 3 can only happen after a step-2 call completes its clone (and writes MD_I). (Let me know if this makes sense; it may sound a little confusing :))

@mikebrow (Member) commented Jan 8, 2025:

>> I don't see any locking in MD.Clone, so we could still hit concurrent read/write operations during the cloning func.. IOW, do we need to serialize reads/writes to the metadata here?
>
> Yeah, as Brian mentioned, we only clone when we need to make modifications, so a write always happens after the read (clone).
>
>> Is there any similar chance of a parallel problem on calls to extract?
>
> IIUC, the case you describe is:
>
>   1. a server extract (reads MD_0)
>   2. N clients inject (read/clone MD_0, write MD_I)
>   3. a server extract (reads MD_I)
>
> This PR fixes the issue in parallel step 2 (otherwise the N clients would all write to MD_0). I don't think there can be read/write map panics: step 1 and the concurrent step-2 calls are all reads of MD_0, and step 3 can only happen after a step-2 call completes its clone (and writes MD_I).

:-) On the N clients case... inject A starts cloneA of the metadata at ctx address "prior", and the change happens on cloneA's address, which is set into the ctx after performing inject... in parallel, inject B started cloneB of address "prior" before inject A overwrote "prior" with cloneA... inject B finishes... and overwrites cloneA with cloneB...

Is that ^ possible?

@djdongjin (Member, Author) commented Jan 8, 2025:

I think it's not possible, because neither inject A nor inject B will overwrite "prior": ttrpc.WithMetadata in inject (or rather the underlying context.WithValue) always returns a copy of the original ctx.

So in this case:

  • The ctx "prior" will always keep its original metadata (say MD_0).
  • inject A clones MD_0 (from ctx "prior"), gets MD_A, attaches MD_A to a new ctx inject_A, and returns ctx inject_A.
  • inject B clones MD_0 (from ctx "prior"), gets MD_B, attaches MD_B to a new ctx inject_B, and returns it.

WDYT @mikebrow
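A tiny runnable illustration of that context-tree behavior (hypothetical key and values, standard library only, not code from this PR):

package main

import (
	"context"
	"fmt"
)

type mdKey struct{}

func main() {
	// WithValue derives a child context and never mutates the parent,
	// so "prior" keeps MD_0 no matter how many injects branch off it.
	prior := context.WithValue(context.Background(), mdKey{}, "MD_0")
	injectA := context.WithValue(prior, mdKey{}, "MD_A")
	injectB := context.WithValue(prior, mdKey{}, "MD_B")

	fmt.Println(prior.Value(mdKey{}))   // MD_0
	fmt.Println(injectA.Value(mdKey{})) // MD_A
	fmt.Println(injectB.Value(mdKey{})) // MD_B
}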

@mikebrow (Member) commented Jan 8, 2025:

> I think it's not possible, because neither inject A nor inject B will overwrite "prior": ttrpc.WithMetadata in inject (or rather the underlying context.WithValue) always returns a copy of the original ctx.
>
> So in this case:
>
>   • The ctx "prior" will always keep its original metadata (say MD_0).
>   • inject A clones MD_0 (from ctx "prior"), gets MD_A, attaches MD_A to a new ctx inject_A, and returns ctx inject_A.
>   • inject B clones MD_0 (from ctx "prior"), gets MD_B, attaches MD_B to a new ctx inject_B, and returns it.
>
> WDYT @mikebrow

Ah, I see now.. so each inject creates a new context, thus there are N different copies in a context "tree", if you will, and any reader will be in their own branch of the tree... "func WithValue(parent Context, key, val any) Context: WithValue returns a copy of parent in which the value associated with key is val." LGTM

@mikebrow (Member) left a review:

LGTM

@mikebrow mikebrow merged commit ef43615 into containerd:main Jan 8, 2025
7 checks passed