Secrets not being removed after duplicate scan-vulnerabilityreport-job creation fails #1038
Comments
Hello again, I had a hunch that maybe the image could not be pulled or analysed by trivy. I looked into the logs of the created jobs and used the following command to trigger trivy manually:

```
TRIVY_USERNAME=<artifactory-user> TRIVY_PASSWORD=<artifactory-pw> \
trivy \
  --quiet client \
  --format json \
  --remote http://trivy:4975 \
  <private-repo>/<image>:<tag>
```

It actually works and delivers a vulnerability report, so I have to assume this is an issue with Starboard. Maybe something to do with the short-lived nature of the analysed pods? I'm out of ideas for now. Is there an option to ignore certain images? That would be a workaround for now.
Thank you for sharing all these details @MPritsch. It's very helpful. If you disabled configuration auditing and you still see some secrets being created by Starboard, it must be the Trivy plugin trying to scan private container images that refer to ImagePullSecrets. This mechanism is explained in https://aquasecurity.github.io/starboard/v0.14.1/integrations/private-registries/. One problem with the current implementation that may cause excessive secret creation in case of errors is that we don't use deterministic names for the secrets created programmatically and associated with scan jobs (see here). Instead we use UUIDs, which are not cache-friendly. We had similar issues with the Conftest plugin, and when we switched to deterministic names they were mitigated. There is no way to ignore certain images in v0.14.1, but there is an open discussion in #670 where you can share your feedback.
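The idea behind deterministic names, as a minimal sketch (this is not Starboard's actual naming scheme; the prefix and inputs below are made up for illustration): derive the temporary Secret's name from the scanned workload, so a retried scan reuses the same object instead of creating a fresh UUID-named Secret on every attempt.

```
# Sketch: a stable Secret name derived from namespace/pod/container.
# The same input always yields the same name, so retries reuse it instead of piling up.
workload="dev/my-app-7d9f/app"
suffix="$(printf '%s' "$workload" | sha1sum | cut -c1-10)"
echo "scan-vulnerabilityreport-${suffix}"
```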
Thank you for explaining your current implementation. You are saying that switching secrets to deterministic names would result in fewer secrets being created, hence we would only see a single secret per failed scan job. Is this something on your roadmap to be implemented? While I understand that this would at least prevent a lot of the secrets from being created, it would not address the root cause. The pods I analysed are created in short intervals and only live for a few seconds most of the time, therefore the names of these pods are also different each time. Looking into the mechanism explanation (https://aquasecurity.github.io/starboard/v0.14.1/integrations/private-registries/):
We fail at the 3rd point, when the scan job is being created. My question is: why are steps 1-3 repeated 3 times at the same time?
Using deterministic names for secrets is just a mitigation that, in my opinion, we should put in place ASAP. The controller uses a cache for storing Kubernetes objects retrieved from the API server, and with random names we do not leverage caching at all. Beyond that, by creating so many secrets with random names we don't help the Kubernetes garbage collector reclaim orphaned resources. But there is more to be done in order to handle all types of failures. The tricky one is when we create a scan Job but the scan Pod hasn't been created, for various reasons, e.g. resource limits or some admission controller blocking scan Jobs. In such cases, the Controller Manager would create a Kubernetes Event (not the same event as received by controllers) that indicates the root cause of the problem, as explained in #871. Unfortunately, in the current implementation the operator waits for scan Job success or failure and neglects the Kubernetes Event. I think it's also worth revisiting the logic and, as you mentioned, seeing if we can remove secrets programmatically in case of failures.
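For anyone debugging this locally, the Events and ownerReferences mentioned above can be inspected with kubectl; the namespace below is an assumption, so adjust it to wherever the scan jobs and secrets are created in your install.

```
# Events emitted by the Controller Manager for scan Jobs, e.g. when the Pod could
# not be created because of resource limits or an admission controller.
kubectl get events -n starboard-operator --field-selector involvedObject.kind=Job

# Leftover Secrets without ownerReferences are never reclaimed by the Kubernetes
# garbage collector once the scan Job is gone.
kubectl get secrets -n starboard-operator \
  -o custom-columns=NAME:.metadata.name,OWNER:.metadata.ownerReferences[0].kind
```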
We'll release Starboard with deterministic names for secrets in v0.15.0. Here's the PR: #1069
What steps did you take and what happened:
Install Starboard with Trivy in ClientServer mode on a private EKS cluster and let it scan all namespaces and pods. We disabled ConfigAuditReports after we suspected this would solve the issue, which it did for one cluster, but another one still produces secrets. This issue is related to #936; I think it's better to create a new ticket with more detailed information, as the other one is already marked as "won't fix".
After a day there are 2000 secrets left behind. This makes querying secrets unresponsive after a few days, and they need to be deleted manually. This means we won't be able to use Starboard in our clusters, as it effectively breaks working with secrets in them.
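As a stop-gap until the operator cleans these up itself, the leftover Secrets can be removed manually. The namespace and name pattern below are placeholders; review the list before piping anything into delete.

```
# Inspect the oldest Secrets in the affected namespace first.
kubectl get secrets -n <namespace> --sort-by=.metadata.creationTimestamp

# Then delete the ones matching whatever naming the leftover scan-job Secrets share.
kubectl get secrets -n <namespace> -o name \
  | grep "<pattern>" \
  | xargs kubectl delete -n <namespace>
```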
This issue only happens for one pod in 2 namespaces and with different image tags (think of dev and tst namespaces). Every other pod being scanned doesn't leave a secret behind. Usually these pods get created automatically and only live for a few seconds. In the given example, the pod lives for a few minutes.
While the logs below show that a vulnerability report was created and the request succeeded with HTTP 201, it never actually appears in our report overview.
Starboard tries to create the scan-vulnerabilityreport job twice for the same Pod and fails the second time. Before each attempt it creates a secret. Only the secret of the successful job is deleted; the secret for the failed job is not and stays in the cluster forever.
Within these secrets are the credentials trivy needs to pull the image from a private registry. We checked the credentials; they are valid and work with our registry.
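To confirm what a leftover Secret contains, it can be inspected directly; the names below are placeholders, and the values are base64-encoded registry credentials, so treat the output as sensitive.

```
# Print only the data keys/values of one leftover Secret.
kubectl get secret <leftover-secret-name> -n <namespace> -o jsonpath='{.data}'; echo
```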
The logs of the starboard-operator show no mention of this, even with debug logs enabled. Using the audit logs of the EKS cluster we were able to track down this issue.
In the following image I show our audit logs. I removed confidential information and hope that you will still be able to gather enough from the context.
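For reference, one way to pull the relevant entries out of the EKS audit logs is a CloudWatch Logs Insights query; this assumes audit logging is enabled for the cluster and shipped to the default /aws/eks/<cluster-name>/cluster log group, and that the fields follow the standard Kubernetes audit event schema.

```
# Find recent "create secret" calls in the kube-apiserver audit log.
# start-query returns a queryId; fetch the rows with "aws logs get-query-results --query-id <id>".
aws logs start-query \
  --log-group-name "/aws/eks/<cluster-name>/cluster" \
  --start-time "$(date -d '1 hour ago' +%s)" \
  --end-time "$(date +%s)" \
  --query-string 'fields @timestamp, verb, objectRef.namespace, objectRef.name
    | filter @logStream like /kube-apiserver-audit/
    | filter verb = "create" and objectRef.resource = "secrets"
    | sort @timestamp desc'
```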
What did you expect to happen:
Starboard should delete secrets after failed job creations. Maybe it should also not attempt a second/third creation of a job that already exists.
Anything else you would like to add:
I can provide further information or logs. However, I will need to censor certain information, which makes it unfeasible to provide whole log chains.
Environment:
- Starboard version (use `starboard version`): Helm: 0.14.1, Image: 0.9.1
- Kubernetes version (use `kubectl version`): 1.21