-
Notifications
You must be signed in to change notification settings - Fork 427
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Attempt to use cgroup v2 prior to cgroup v1 in path resolver #1978
Conversation
Signed-off-by: Dom Del Nano <[email protected]>
// TODO(ddelnano): this aids in testing the LegacyCGroupPathResolver. Determine what | ||
// to do with this. | ||
if (cgroupv2 || FLAGS_force_cgroup2_mode) { | ||
return cgv2_base_path; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's unfortunate that this code branch needs to stay to support the LeagcyCGroupPathResolverTest.Cgroup2Format test. Was the intention to remove the legacy resolver at some point? Is that something that could be considered now?
Tested on OVH Cloud - MKS 1.29.3
|
@@ -28,11 +28,21 @@ | |||
#include "src/shared/metadata/cgroup_path_resolver.h" | |||
|
|||
DEFINE_bool(force_cgroup2_mode, true, "Flag to force assume cgroup2 fs for testing purposes"); | |||
DEFINE_bool(force_cgroup1_mode, false, "Flag to force use of cgroup v1 fs"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We used to do this more, but using flags for testing is generally error prone, because they persist across tests. I'd recommend finding a way to avoid flags like this. This goes for the force_cgroup2_mode
as well, if we're going to try to fix this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The idea behind this was if it's worth extensively testing the new logic ordering (v2 before v1), then this flag would be used to restore the old behavior in the event there was a distro/system that had issues. Since the tests expected that the check would happen in the previous ordering, it also needed to be used there as well.
|
||
namespace px { | ||
namespace md { | ||
|
||
StatusOr<std::string> CGroupBasePath(std::string_view sysfs_path) { | ||
std::string cgv2_base_path = absl::StrCat(sysfs_path, "/cgroup"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need to test extensively if we change this. Systems can have both cgroup and cgroup2 if I recall correctly, and this change has to potential to break things.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unfortunately the OVH case is one where cgroup v1 and v2 are both present, but our logic fails for it.
I agree that this needs to be tested very carefully. After thinking about this more, I think it will be difficult to be confident that all edge cases are covered even if time is invested in testing the known cases.
I'm going to think about this more and see if there is a safer option for handling the OVH case. If the cgroup v1 code doesn't short circuit and allows reaching the cgroup v2 logic, then testing this reordering can be avoided.
// TODO(ddelnano): this aids in testing the LegacyCGroupPathResolver. Determine what | ||
// to do with this. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should the default value of FLAGS_force_cgroup2_mode
be false? I would think that makes more sense. Maybe that helps with cleaning up the tests.
But I also think using flags in tests is error prone, as mentioned above, so we should try to get away from that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was worried about the possibility of FLAGS_force_cgroup2_mode
's default true value meant this code path could be used in production clusters. In theory, the caller should not parse a cgv2_base_path
correctly (since the filesystem check didn't indicate cgroup v2). However, since both v1 and v2 can be active and this OVH issue shows that these cases are complicated, I'm a little worried about removing this.
@@ -32,6 +32,7 @@ constexpr std::string_view kContainerID = | |||
"a7638fe3934b37419cc56bca73465a02b354ba6e98e10272542d84eb2014dd62"; | |||
|
|||
TEST(CGroupPathResolver, GKEFormat) { | |||
FLAGS_force_cgroup1_mode = true; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we must set flags in test, then use PX_SET_FOR_SCOPE
to make sure it doesn't leak out.
#1981 is the approach we will be going with. I'm working on addressing the final comments. |
Summary: Update cgroup resolver to try all possible base paths For certain environments, like OVH cloud's managed kubernetes service, the auto discovery mechanism fails due to a confusion between cgroup v1 and v2. This is fixed by attempting all cgroup base paths rather than trying the first to match and failing if it doesn't work. This is the second iteration of this change (#1978 was the earlier attempt). Relevant Issues: N/A Type of change: /kind bug Test Plan: Community user deployed this change and verified their OVH MKS cluster is working now --------- Signed-off-by: Dom Del Nano <[email protected]>
Summary: Update cgroup resolver to try all possible base paths For certain environments, like OVH cloud's managed kubernetes service, the auto discovery mechanism fails due to a confusion between cgroup v1 and v2. This is fixed by attempting all cgroup base paths rather than trying the first to match and failing if it doesn't work. This is the second iteration of this change (pixie-io#1978 was the earlier attempt). Relevant Issues: N/A Type of change: /kind bug Test Plan: Community user deployed this change and verified their OVH MKS cluster is working now --------- Signed-off-by: Dom Del Nano <[email protected]> GitOrigin-RevId: ff83032
Summary: Attempt to use cgroup v2 prior to cgroup v1 in path resolver
For certain environments, like OVH cloud's managed kubernetes service, the auto discovery mechanism fails due to a confusion between cgroup v1 and v2. This is fixed by attempting to resolve the cgroup v2 path before v1. This change also introduces a new flag
--force_cgroup1_mode
, which can restore the old ordering if this causes any unforeseen issues.Relevant Issues: N/A
Type of change: /kind bug
Test Plan: Community user deployed this change and verified their OVH MKS cluster is working now
Changelog Message: Update cgroup path resolution to prefer cgroup v2 over v1 to fix an issue with OVH cloud Kubernetes clusters