Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Error 'container not running' on 'writing container filesystem' step #1213

Open
tkrishtop opened this issue Nov 13, 2024 · 4 comments
Open
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@tkrishtop
Copy link
Contributor

Bug Description

One of Arkady's Telco partners has particularly large images, 2-4GB. They statistically frequently have an error 'container not running'. It seems to be related to the fact that preflight tries to run a container before having it fully-pulled.

Version and Command Invocation

1.10.0

Steps to Reproduce:

failed: [jumphost] (item={'started': 1, 'finished': 0, 'ansible_job_id': '761330612078.1855963', 'results_file': '/var/lib/dci-openshift-app-agent/.ansible_async/761330612078.1855963', 'changed': True, 'failed': False, 'item': [{'container_image': 'quay.io/nepravee/cmm-aimpaps:CMM24.11.0.0'}, {'changed': True, 'path': '/tmp/preflight_container_artifacts_3u3uvfok', 'uid': 1002, 'gid': 981, 'owner': 'dci-openshift-app-agent', 'group': 'dci-openshift-app-agent', 'mode': '0700', 'state': 'directory', 'secontext': 'unconfined_u:object_r:user_tmp_t:s0', 'size': 6, 'invocation': {'module_args': {'state': 'directory', 'prefix': 'preflight_container_artifacts_', 'suffix': '', 'path': None}}, 'failed': False, 'item': {'container_image': 'quay.io/nepravee/cmm-aimpaps:CMM24.11.0.0'}, 'ansible_loop_var': 'item'}, ' '], 'ansible_loop_var': 'item'}) => {"ansible_job_id": "761330612078.1855963", "ansible_loop_var": "async_result_item", "async_result_item": {"ansible_job_id": "761330612078.1855963", "ansible_loop_var": "item", "changed": true, "failed": false, "finished": 0, "item": [{"container_image": "quay.io/nepravee/cmm-aimpaps:CMM24.11.0.0"}, {"ansible_loop_var": "item", "changed": true, "failed": false, "gid": 981, "group": "dci-openshift-app-agent", "invocation": {"module_args": {"path": null, "prefix": "preflight_container_artifacts_", "state": "directory", "suffix": ""}}, "item": {"container_image": "quay.io/nepravee/cmm-aimpaps:CMM24.11.0.0"}, "mode": "0700", "owner": "dci-openshift-app-agent", "path": "/tmp/preflight_container_artifacts_3u3uvfok", "secontext": "unconfined_u:object_r:user_tmp_t:s0", "size": 6, "state": "directory", "uid": 1002}, " "], "results_file": "/var/lib/dci-openshift-app-agent/.ansible_async/761330612078.1855963", "started": 1}, "attempts": 2, "changed": true, "cmd": ["podman", "run", "--rm", "--privileged", "-e", "PFLT_JUNIT=true", "-e", "PFLT_LOGLEVEL=trace", "-e", "PFLT_ARTIFACTS=/artifacts", "-e", "PFLT_LOGFILE=/artifacts/preflight.log", "-e", "PFLT_DOCKERCONFIG=/opt/config.json", "-v", "/tmp/preflight_container_artifacts_3u3uvfok:/artifacts", "-v", "/tmp/preflight_tmp_dir.mzhit2o4/config.json:/opt/config.json", "quay.io/opdev/preflight:1.10.0", "check", "container", "quay.io/nepravee/cmm-aimpaps:CMM24.11.0.0"], "delta": "0:00:05.162823", "end": "2024-09-27 07:54:11.786187", "finished": 1, "msg": "non-zero return code", "rc": 143, "start": "2024-09-27 07:54:06.623364", "stderr": "time=\"2024-09-27T12:54:06Z\" level=debug msg=\"config file not found, proceeding without it\"
time=\"2024-09-27T12:54:06Z\" level=info msg=\"certification library version\" version=\"1.10.0 <commit: c9048da99aae76ddee5a708edcc94e14c034cd1d>\"
time=\"2024-09-27T12:54:06Z\" level=trace msg=\"entering preflight keychain Resolve\"
time=\"2024-09-27T12:54:07Z\" level=info msg=\"running checks for quay.io/nepravee/cmm-aimpaps:CMM24.11.0.0 for platform amd64\"
time=\"2024-09-27T12:54:07Z\" level=info msg=\"target image\" image=\"quay.io/nepravee/cmm-aimpaps:CMM24.11.0.0\"
time=\"2024-09-27T12:54:07Z\" level=debug msg=\"pulling image from target registry\"
time=\"2024-09-27T12:54:07Z\" level=trace msg=\"entering preflight keychain Resolve\"
time=\"2024-09-27T12:54:08Z\" level=debug msg=\"created temporary directory\" path=/tmp/preflight-72102398
time=\"2024-09-27T12:54:08Z\" level=debug msg=\"exporting and flattening image\"
time=\"2024-09-27T12:54:08Z\" level=debug msg=\"extracting container filesystem\" path=/tmp/preflight-72102398/fs
time=\"2024-09-27T12:54:08Z\" level=debug msg=\"writing container filesystem\" outputDirectory=/tmp/preflight-72102398/fs
time=\"2024-09-27T07:54:11-05:00\" level=error msg=\"container not running\"", "stderr_lines": ["time=\"2024-09-27T12:54:06Z\" level=debug msg=\"config file not found, proceeding without it\"", "time=\"2024-09-27T12:54:06Z\" level=info msg=\"certification library version\" version=\"1.10.0 <commit: c9048da99aae76ddee5a708edcc94e14c034cd1d>\"", "time=\"2024-09-27T12:54:06Z\" level=trace msg=\"entering preflight keychain Resolve\"", "time=\"2024-09-27T12:54:07Z\" level=info msg=\"running checks for quay.io/nepravee/cmm-aimpaps:CMM24.11.0.0 for platform amd64\"", "time=\"2024-09-27T12:54:07Z\" level=info msg=\"target image\" image=\"quay.io/nepravee/cmm-aimpaps:CMM24.11.0.0\"", "time=\"2024-09-27T12:54:07Z\" level=debug msg=\"pulling image from target registry\"", "time=\"2024-09-27T12:54:07Z\" level=trace msg=\"entering preflight keychain Resolve\"", "time=\"2024-09-27T12:54:08Z\" level=debug msg=\"created temporary directory\" path=/tmp/preflight-72102398", "time=\"2024-09-27T12:54:08Z\" level=debug msg=\"exporting and flattening image\"", "time=\"2024-09-27T12:54:08Z\" level=debug msg=\"extracting container filesystem\" path=/tmp/preflight-72102398/fs", "time=\"2024-09-27T12:54:08Z\" level=debug msg=\"writing container filesystem\" outputDirectory=/tmp/preflight-72102398/fs", "time=\"2024-09-27T07:54:11-05:00\" level=error msg=\"container not running\""], "stdout": "", "stdout_lines": []}
failed: [jumphost] (item={'started': 1, 'finished': 0, 'ansible_job_id': '867988553009.1855982', 'results_file': '/var/lib/dci-openshift-app-agent/.ansible_async/867988553009.1855982', 'changed': True, 'failed': False, 'item': [{'container_image': 'quay.io/nepravee/cmm-aimipps:CMM24.11.0.0'}, {'changed': True, 'path': '/tmp/preflight_container_artifacts_rlrtl_ob', 'uid': 1002, 'gid': 981, 'owner': 'dci-openshift-app-agent', 'group': 'dci-openshift-app-agent', 'mode': '0700', 'state': 'directory', 'secontext': 'unconfined_u:object_r:user_tmp_t:s0', 'size': 6, 'invocation': {'module_args': {'state': 'directory', 'prefix': 'preflight_container_artifacts_', 'suffix': '', 'path': None}}, 'failed': False, 'item': {'container_image': 'quay.io/nepravee/cmm-aimipps:CMM24.11.0.0'}, 'ansible_loop_var': 'item'}, ' '], 'ansible_loop_var': 'item'}) => {"ansible_job_id": "867988553009.1855982", "ansible_loop_var": "async_result_item", "async_result_item": {"ansible_job_id": "867988553009.1855982", "ansible_loop_var": "item", "changed": true, "failed": false, "finished": 0, "item": [{"container_image": "quay.io/nepravee/cmm-aimipps:CMM24.11.0.0"}, {"ansible_loop_var": "item", "changed": true, "failed": false, "gid": 981, "group": "dci-openshift-app-agent", "invocation": {"module_args": {"path": null, "prefix": "preflight_container_artifacts_", "state": "directory", "suffix": ""}}, "item": {"container_image": "quay.io/nepravee/cmm-aimipps:CMM24.11.0.0"}, "mode": "0700", "owner": "dci-openshift-app-agent", "path": "/tmp/preflight_container_artifacts_rlrtl_ob", "secontext": "unconfined_u:object_r:user_tmp_t:s0", "size": 6, "state": "directory", "uid": 1002}, " "], "results_file": "/var/lib/dci-openshift-app-agent/.ansible_async/867988553009.1855982", "started": 1}, "attempts": 1, "changed": true, "cmd": ["podman", "run", "--rm", "--privileged", "-e", "PFLT_JUNIT=true", "-e", "PFLT_LOGLEVEL=trace", "-e", "PFLT_ARTIFACTS=/artifacts", "-e", "PFLT_LOGFILE=/artifacts/preflight.log", "-e", "PFLT_DOCKERCONFIG=/opt/config.json", "-v", "/tmp/preflight_container_artifacts_rlrtl_ob:/artifacts", "-v", "/tmp/preflight_tmp_dir.mzhit2o4/config.json:/opt/config.json", "quay.io/opdev/preflight:1.10.0", "check", "container", "quay.io/nepravee/cmm-aimipps:CMM24.11.0.0"], "delta": "0:00:04.910753", "end": "2024-09-27 07:54:11.786600", "finished": 1, "msg": "non-zero return code", "rc": 143, "start": "2024-09-27 07:54:06.875847", "stderr": "time=\"2024-09-27T12:54:07Z\" level=debug msg=\"config file not found, proceeding without it\"
time=\"2024-09-27T12:54:07Z\" level=info msg=\"certification library version\" version=\"1.10.0 <commit: c9048da99aae76ddee5a708edcc94e14c034cd1d>\"
time=\"2024-09-27T12:54:07Z\" level=trace msg=\"entering preflight keychain Resolve\"
time=\"2024-09-27T12:54:07Z\" level=info msg=\"running checks for quay.io/nepravee/cmm-aimipps:CMM24.11.0.0 for platform amd64\"
time=\"2024-09-27T12:54:07Z\" level=info msg=\"target image\" image=\"quay.io/nepravee/cmm-aimipps:CMM24.11.0.0\"
time=\"2024-09-27T12:54:07Z\" level=debug msg=\"pulling image from target registry\"
time=\"2024-09-27T12:54:07Z\" level=trace msg=\"entering preflight keychain Resolve\"
time=\"2024-09-27T12:54:08Z\" level=debug msg=\"created temporary directory\" path=/tmp/preflight-1331259970
time=\"2024-09-27T12:54:08Z\" level=debug msg=\"exporting and flattening image\"
time=\"2024-09-27T12:54:08Z\" level=debug msg=\"extracting container filesystem\" path=/tmp/preflight-1331259970/fs
time=\"2024-09-27T12:54:08Z\" level=debug msg=\"writing container filesystem\" outputDirectory=/tmp/preflight-1331259970/fs
time=\"2024-09-27T07:54:11-05:00\" level=error msg=\"container not running\"", "stderr_lines": ["time=\"2024-09-27T12:54:07Z\" level=debug msg=\"config file not found, proceeding without it\"", "time=\"2024-09-27T12:54:07Z\" level=info msg=\"certification library version\" version=\"1.10.0 <commit: c9048da99aae76ddee5a708edcc94e14c034cd1d>\"", "time=\"2024-09-27T12:54:07Z\" level=trace msg=\"entering preflight keychain Resolve\"", "time=\"2024-09-27T12:54:07Z\" level=info msg=\"running checks for quay.io/nepravee/cmm-aimipps:CMM24.11.0.0 for platform amd64\"", "time=\"2024-09-27T12:54:07Z\" level=info msg=\"target image\" image=\"quay.io/nepravee/cmm-aimipps:CMM24.11.0.0\"", "time=\"2024-09-27T12:54:07Z\" level=debug msg=\"pulling image from target registry\"", "time=\"2024-09-27T12:54:07Z\" level=trace msg=\"entering preflight keychain Resolve\"", "time=\"2024-09-27T12:54:08Z\" level=debug msg=\"created temporary directory\" path=/tmp/preflight-1331259970", "time=\"2024-09-27T12:54:08Z\" level=debug msg=\"exporting and flattening image\"", "time=\"2024-09-27T12:54:08Z\" level=debug msg=\"extracting container filesystem\" path=/tmp/preflight-1331259970/fs", "time=\"2024-09-27T12:54:08Z\" level=debug msg=\"writing container filesystem\" outputDirectory=/tmp/preflight-1331259970/fs", "time=\"2024-09-27T07:54:11-05:00\" level=error msg=\"container not running\""], "stdout": "", "stdout_lines": []}

Expected Result

preflight running normally 100% of time instead of random occurrences of 'container not running'

Actual Result

random occurrences of 'container not running'

Additional Context

The code source producing an error seems to be here. Should we explore the crane options to ensure that the image was completely pulled and include them into preflight code?

@tkrishtop tkrishtop added the kind/bug Categorizes issue or PR as related to a bug. label Nov 13, 2024
@acornett21
Copy link
Contributor

acornett21 commented Nov 13, 2024

@tkrishtop Is this for preflight check container? preflight doesn't run the container as part of check container, does this happen outside of DCI against this image? To me this looks like DCI is killing this pod due to resource constraints. We see the same thing in Konflux for large images.

@tkrishtop
Copy link
Contributor Author

tkrishtop commented Nov 14, 2024

Hi @acornett21 thank you for checking.

Is this for preflight check container

yes, here is preflight.log

time="2024-10-02T10:30:06Z" level=debug msg="config file not found, proceeding without it"
time="2024-10-02T10:30:06Z" level=info msg="certification library version" version="1.10.0 <commit: c9048da99aae76ddee5a708edcc94e14c034cd1d>"
time="2024-10-02T10:30:07Z" level=info msg="running checks for quay.io/XXX/YYY:tag for platform amd64"
time="2024-10-02T10:30:07Z" level=info msg="target image" image="quay.io/XXX/YYY:tag"
time="2024-10-02T10:30:07Z" level=debug msg="pulling image from target registry"
time="2024-10-02T10:30:07Z" level=debug msg="created temporary directory" path=/tmp/preflight-985576278
time="2024-10-02T10:30:07Z" level=debug msg="exporting and flattening image"
time="2024-10-02T10:30:07Z" level=debug msg="extracting container filesystem" path=/tmp/preflight-985576278/fs
time="2024-10-02T10:30:07Z" level=debug msg="writing container filesystem" outputDirectory=/tmp/preflight-985576278/fs

To me this looks like DCI is killing this pod due to resource constraints.

The error code 143 suggests that the podman process was terminated by a SIGTERM (signal 15).
There could be one of two situations here:

  • Resource Constraints
  • Preflight timeout killing podman pull

When we test having images pre-pulled locally, the error disappears. That makes me think that we really deal with preflight killing podman pull here and not with the resource constraints.

We see the same thing in Konflux for large images.

Do you use any workarounds to fix the issue?

@acornett21
Copy link
Contributor

For clarity, preflight doesn't use podman, so there is not container in container type situation here. We use crane as a library.

Since this works with cached images, I'm going to make the logical assumption that this is the same problem as konflux with resources on the host. Can you check the CPU/Memory profile on the host to see if DCI is killing this?

For Konflux, users can increase the CPU/Memory of their pipeline or task. I'd assume DCI has a similar feature.

@acornett21
Copy link
Contributor

I also just tested the below image a few times which is almost 8GB and preflight has no issue (assuming you have enough /tmp storage space)

  • docker.io/sagemathinc/cocalc-docker:latest

I tested this directly on a host with the preflight binary. So no container, CI, etc concerns. This makes me think it's more of a CI issue, since this is double the size of the largest image in question.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/bug Categorizes issue or PR as related to a bug.
Projects
None yet
Development

No branches or pull requests

2 participants