
Update Elastic Agent docker image for azure containers runtime #82

Closed
jlind23 opened this issue Feb 7, 2022 · 46 comments · Fixed by #3084, #3576 or #3778
Labels
Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team

Comments

@jlind23
Contributor

jlind23 commented Feb 7, 2022

Describe the enhancement:
The Elastic Agent Docker image currently does not work out of the box on the Azure container runtime.

The docker-entrypoint script is executed, but it cannot launch elastic-agent because it cannot find the binary. And it cannot find the binary because it does not have enough permissions to access it.

This is because the Elastic Agent user is part of the root group, yet the Azure container runtime appears to apply some additional security restriction.

This is the workaround provided so far by @eedugon:

FROM docker.elastic.co/beats/elastic-agent:7.16.2
USER root
RUN chown -R :elastic-agent /usr/share/elastic-agent
USER elastic-agent
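
For reference, building and running an image from that Dockerfile might look like this (the image tag is just an example, not an official one):

docker build -t elastic-agent-azure-fix:7.16.2 .
docker run --rm elastic-agent-azure-fix:7.16.2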

We should ensure that this fix doesn't break anything related to OpenShift.

More information can be found here: #147

Describe a specific use case for the enhancement or feature:

@jlind23 jlind23 added the Team:Elastic-Agent-Control-Plane label Feb 7, 2022
@elasticmachine
Contributor

Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)

@jsoriano
Member

jsoriano commented Feb 7, 2022

We should ensure that this fix doesn't break anything related to OpenShift.

The actual requirement in OpenShift (and other hardened Kubernetes environments) is that it should be possible to run the image with arbitrary users (in docker run, with arbitrary user IDs passed via --user=<user id>).
The recommendation for this is to assign the required files to the root group, which is the group Docker assigns to users that don't have a group.
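
As an illustration, the usual pattern for such arbitrary-UID environments looks roughly like the sketch below (layered on top of the released image for clarity; this is not the actual Beats/Agent Dockerfile):

FROM docker.elastic.co/beats/elastic-agent:7.16.2
USER root
# Give the root group ownership and the same permissions as the owner, so an
# arbitrary UID running with gid 0 can still read and write these files.
RUN chgrp -R 0 /usr/share/elastic-agent && chmod -R g=u /usr/share/elastic-agent
USER elastic-agent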

@jsoriano
Member

jsoriano commented Feb 8, 2022

One thing to try, suggested by @eedugon, would be to make the default user the owner of the files.

This way, when running in an environment that uses the default user but without the root group, as seems to be the case in Azure, the user will have permissions because it is the owner. When running in an environment that randomizes UIDs, such as OpenShift, the user will have permissions because the files belong to the root group.

This would require adding --chown={{ .user }}:root when copying the directories:
https://github.com/elastic/beats/blob/68872a0e7f8aa1b0eab7cbd46e8088dc71bd67fb/dev-tools/packaging/templates/docker/Dockerfile.tmpl#L136-L140
https://github.com/elastic/beats/blob/68872a0e7f8aa1b0eab7cbd46e8088dc71bd67fb/dev-tools/packaging/templates/docker/Dockerfile.elastic-agent.tmpl#L145
https://github.com/elastic/beats/blob/68872a0e7f8aa1b0eab7cbd46e8088dc71bd67fb/dev-tools/packaging/templates/docker/Dockerfile.elastic-agent.tmpl#L156

This could also allow removing --groups 0 from useradd, which could help with #147.
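
Simplified, the suggested change to those COPY lines would look roughly like this (illustration only, not the actual template content; the path is abbreviated):

COPY --chown={{ .user }}:root --from=home /usr/share/elastic-agent /usr/share/elastic-agent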

@jlind23 jlind23 transferred this issue from elastic/beats Mar 7, 2022
@narph
Contributor

narph commented May 30, 2022

I spent too much time on this one; just to circle back, when running:

az container create --resource-group group1  --name elastic-agent --image docker.elastic.co/beats/elastic-agent:7.16.2  --restart-policy Never --ip-address Public  --environment-variables FLEET_ENROLLMENT_TOKEN='...' FLEET_ENROLL='1' FLEET_URL='...'

it fails to create the container with:

Start streaming logs:
[WARN  tini (19)] Tini is not running as PID 1 and isn't registered as a child subreaper.
Zombie processes will not be re-parented to Tini, so zombie reaping won't work.
To fix the problem, use the -s option or set the environment variable TINI_SUBREAPER to register Tini as a child subreaper, or run Tini as PID 1.
/usr/local/bin/docker-entrypoint: line 11: exec: elastic-agent: not found

I tried to create a small reproduction to validate that useradd refuses to add the elastic-agent user to the root group, as claimed here: https://github.com/elastic/sdh-beats/issues/1603#issuecomment-1012408050.

# Prepare home in a different stage to avoid creating additional layers on
# the final image because of permission changes.
FROM ubuntu:20.04 AS home

RUN mkdir -p /usr/share/elastic-agent/data /usr/share/elastic-agent/data/elastic-agent-4a0299/logs
RUN touch /usr/share/elastic-agent/test-file.txt

FROM ubuntu:20.04

COPY --chown=elastic-agent:root --from=home /usr/share/elastic-agent /usr/share/elastic-agent

# Elastic Agent needs group permissions in the home itself to be able to
# create fleet.yml when running as non-root.
# RUN chmod 0770 /usr/share/elastic-agent

RUN groupadd --gid 1000 elastic-agent
RUN useradd -M --uid 1000 --gid 1000 --groups 0 --home /usr/share/elastic-agent elastic-agent
RUN usermod -a -G root elastic-agent

USER elastic-agent

WORKDIR /usr/share/elastic-agent
ENTRYPOINT ["/usr/bin/sleep", "1d"]

Running id or groups under the elastic-agent account never lists the root group.
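
For anyone who wants to check this locally, the reproduction image can be inspected like this (the tag is arbitrary; --entrypoint overrides the sleep entrypoint above):

docker build -t agent-groups-repro .
docker run --rm --entrypoint id agent-groups-repro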

There does not seem to be anything special about the root group on Azure though, as the following

az container create --resource-group control-plane  --name ubuntu2 --image ubuntu:20.04  --debug --restart-policy Never --command-line "/usr/bin/getent group root"

returns root:x:0: in the logs.

One interesting observation is that the order of the COPY of the home directory (with --chown) and the groupadd/useradd commands matters.

COPY --chown=elastic-agent:root --from=home /usr/share/elastic-agent /usr/share/elastic-agent

RUN groupadd --gid 1000 elastic-agent
RUN useradd -M --uid 1000 --gid 1000 --groups 0 --home /usr/share/elastic-agent elastic-agent
RUN usermod -a -G root elastic-agent

With this ordering, ls -al in elastic-agent's home returns the following, indicating ownership by root:

elastic-agent@SandboxHost-637895294345371045:~$ ls -al
total 12
drwxr-xr-x 3 root root 4096 May 30 17:42 .
drwxr-xr-x 1 root root 4096 May 30 17:42 ..
drwxr-xr-x 3 root root 4096 May 30 17:11 data
-rw-r--r-- 1 root root    0 May 30 17:11 test-file.txt

Swapping these around, though, updates the ownership to elastic-agent correctly:

RUN groupadd --gid 1000 elastic-agent
RUN useradd -M --uid 1000 --gid 1000 --groups 0 --home /usr/share/elastic-agent elastic-agent
RUN usermod -a -G root elastic-agent
COPY --chown=elastic-agent:root --from=home /usr/share/elastic-agent /usr/share/elastic-agent

ls -al now shows:

total 12
drwxr-xr-x 3 elastic-agent root 4096 May 30 17:48 .
drwxr-xr-x 1 root          root 4096 May 30 17:48 ..
drwxr-xr-x 3 elastic-agent root 4096 May 30 17:11 data
-rw-r--r-- 1 elastic-agent root    0 May 30 17:11 test-file.txt

I will try to rebuild the image to test the swap and will update the issue soon.

Just adding --chown={{ .user }}:root as described in #82 (comment) did not work.
I also tried, unsuccessfully, to use the flags in https://docs.microsoft.com/en-us/cli/azure/container?view=azure-cli-latest#az-container-create to assign the user to a group.


Regarding the Tini error, it looks like the container reserves PID 1 for the pause process.

A few workarounds I've tested (successfully) here:

@jsoriano
Member

jsoriano commented May 30, 2022

it looks like the container reserves PID 1 for the pause process.

This is interesting; it might give a clue about how to reproduce this environment with a normal local Docker. It seems Azure is using pause containers. These "pause" containers are a strategy used by container orchestrators to set up namespaces before starting the actual container, and to maintain them if the containers unexpectedly die. The pause container contains a single process, /pause, which does nothing and lives for the whole lifetime of the scheduled workload.

In Kubernetes, when you start a pod, a pause container is started first, and then all the containers in the pod are attached to the network namespace of that container.
Here in Azure it seems that the PID namespace is also shared between the pause container and the actual container.

$ docker run --name mypausecontainer -d k8s.gcr.io/pause
$ docker run --pid=container:mypausecontainer -it --rm docker.elastic.co/elastic-agent/elastic-agent-complete:8.2.0
[WARN  tini (10)] Tini is not running as PID 1 and isn't registered as a child subreaper.
Zombie processes will not be re-parented to Tini, so zombie reaping won't work.
To fix the problem, use the -s option or set the environment variable TINI_SUBREAPER to register Tini as a child subreaper, or run Tini as PID 1.
{"log.level":"info","@timestamp":"2022-05-30T18:46:54.569Z","log.origin":{"file.name":"cmd/run.go","file.line":153},"message":"APM instrumentation disabled","ecs.version":"1.6.0"}
...

I didn't manage to reproduce the "not found" errors though; they are probably related to permissions.

@jsoriano
Member

jsoriano commented May 30, 2022

Update: I managed to reproduce the Azure errors with a local Docker by also using random UIDs and GIDs:

$ docker run --name mypausecontainer -d k8s.gcr.io/pause
$ docker run --pid=container:mypausecontainer --user 213321443:32143 -it --rm docker.elastic.co/elastic-agent/elastic-agent-complete:8.2.0
[WARN  tini (10)] Tini is not running as PID 1 and isn't registered as a child subreaper.
Zombie processes will not be re-parented to Tini, so zombie reaping won't work.
To fix the problem, use the -s option or set the environment variable TINI_SUBREAPER to register Tini as a child subreaper, or run Tini as PID 1.
/usr/local/bin/docker-entrypoint: line 11: exec: elastic-agent: not found

I don't have a solution, but this may make it easier to test this kind of environment locally and in CI.

@narph
Contributor

narph commented May 31, 2022

@jsoriano, thanks. I am trying to test a potential solution, but the Azure container create APIs seem to be down; I am getting "service unavailable". Hopefully it will be back soon so I can test.

@narph narph mentioned this issue Jun 1, 2022
@narph
Contributor

narph commented Jun 1, 2022

The good news: the Azure container service is back online, and the workaround in #493 would have worked for the Azure containers and OpenShift as well.

The bad news: due to #398, which adds a check that the file owner is root (https://github.com/elastic/elastic-agent/blob/main/internal/pkg/agent/vault/seed.go#L33), we get:

...,"message":"Starting enrollment to URL: ....","ecs.version":"1.6.0"}
Error: fail to enroll: could not save enrollment information: non-root file owner
For help, please see our troubleshooting guide at https://www.elastic.co/guide/en/fleet/8.3/fleet-troubleshooting.html
Error: enrollment failed: exit status 1

This is because the check requires both the owning user and the owning group to be root.

uid, _ := stat.UID()
gid, _ := stat.GID()
if uid == 0 && gid == 0 {
	return true, nil
}

However, with my PR #493, and also with the workaround initially suggested by @eedugon, the ownership gets updated to:

drwxr-xr-x 3 elastic-agent root 4096 May 30 17:48 .

Can we be more lenient here and just ensure the files are owned by the root group rather than the root user? cc @aleksmaus
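
For illustration, a relaxed check along those lines might look like the sketch below. This is not the agent's actual code (the real check lives in internal/pkg/agent/vault/seed.go and uses an internal stat helper); isFileGroupRoot is a hypothetical name and the sketch is Linux-only:

package main

import (
	"fmt"
	"os"
	"syscall"
)

// isFileGroupRoot reports whether the file at path is owned by the root group
// (gid 0), regardless of the owning user.
func isFileGroupRoot(path string) (bool, error) {
	info, err := os.Stat(path)
	if err != nil {
		return false, err
	}
	sys, ok := info.Sys().(*syscall.Stat_t)
	if !ok {
		return false, fmt.Errorf("cannot read ownership of %s on this platform", path)
	}
	return sys.Gid == 0, nil
}

func main() {
	ok, err := isFileGroupRoot("/usr/share/elastic-agent/fleet.yml")
	fmt.Println(ok, err)
}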

@ph any feedback here is welcome

@narph
Contributor

narph commented Jun 1, 2022

Update:

When just removing the check on the root user, it still fails. I logged the group ID and user ID inside the Azure container:

Error: fail to enroll: could not save enrollment information: the real group is 1000, user is 1000

Both the user and the group are elastic-agent.

Removing the entire isFileOwnerRoot check worked, and the agent is up and running.
If anyone is interested, test with mariana12/img:hello3.

Conclusion:

  • Can we safely remove isFileOwnerRoot?
  • Should we update the owner of the elastic-agent directory as implemented by #493 for all Docker images, or create a separate Docker image for Azure?
  • Should we leave it for now as a known limitation?

@jlind23
Contributor Author

jlind23 commented Jun 1, 2022

My two cents:

  • I do not want to create an Azure-specific Docker image.
  • Considering the unknown impact that removing isFileOwnerRoot may have, I would rather leave it as is for now and document it as a known limitation.

@ph @cmacknz @joshdover I may need your thoughts here.

@ph
Contributor

ph commented Jun 1, 2022

Regarding the vault check, @aleksmaus has relaxed it.

IMHO:

  • I would not create another Docker image; we would open a Pandora's box.
  • I've suggested to Aleks making it possible to disable the check. We used to have BEAT_STRICT_PERMS in Beats; we could do the same here and allow people to set that environment variable when running in a container.

Maybe we could have different runtime profiles: one for desktop, one for containers, others?

@blakerouse
Contributor

It sounds like a check that should always be relaxed when running in a container. The Elastic Agent already knows that it's running in a container, because it's started with the container subcommand. So we could have that permission check always relaxed when running from that subcommand.
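
As a rough illustration of that idea (hypothetical names, not the agent's actual command wiring), the gate could be as simple as checking the invoked subcommand:

package main

import (
	"fmt"
	"os"
)

// strictPermsEnabled is a hypothetical helper: strict ownership checks stay on
// for normal runs and are skipped when the agent was started with the
// "container" subcommand.
func strictPermsEnabled(args []string) bool {
	return !(len(args) > 1 && args[1] == "container")
}

func main() {
	fmt.Println("strict permission checks:", strictPermsEnabled(os.Args))
}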

@ph
Contributor

ph commented Jun 2, 2022

@blakerouse good point, this subcommand already "implies" the container profile I mentioned before, and run would be the default environment.

@narph
Contributor

narph commented Jun 7, 2022

Hi folks,

The isFileOwnerRoot check has been rolled back in main completely, so we don't need to address the failure to enroll anymore.

We still have the initial issue that in Azure we need to update the default file permissions so that the elastic-agent user actually has access to the elastic-agent binary.

/usr/local/bin/docker-entrypoint: line 11: exec: elastic-agent: not found

My PR #493 fixes this, but it changes the file ownership throughout the Docker container.

As I mentioned above, that means we either create a dedicated Docker image for the Azure containers, or we further investigate the effects these changes will have on the other platforms and introduce them in the official Docker image. Otherwise, we document it as a known limitation.

The majority seems to be against making major changes to the Docker image at the moment, or maintaining a separate image; can we confirm that these are the conclusions for now?

@jlind23
Contributor Author

jlind23 commented Jun 7, 2022

@narph I do confirm. Let's document this behavior for now and let users tweak their configuration.
@ph @pierrehilbert any objections?

@jsoriano
Member

jsoriano commented Jun 8, 2022

My PR #493 fixes this but updates the file ownership completely in the docker container.

I think the ownership changes in this PR make sense and should be fine. They are in line with what @eedugon suggested, and the files continue to be owned by the root group, following the OpenShift recommendations.
This is also consistent with other Elastic images: in the Kibana and Elasticsearch images, files are owned by the kibana/elasticsearch users and the root group.
We may want to make this change in any case.

@orvillelim

orvillelim commented Jul 25, 2022

This could be a little unrelated, but it might help someone.

In my case, the fix was removing the /usr/share/elastic-agent:/usr/share/elastic-agent volume mount from the Docker Compose YAML.
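
For context, a minimal sketch of the kind of Compose service being described (the service name and everything except the volume line are assumptions):

services:
  elastic-agent:
    image: docker.elastic.co/beats/elastic-agent:8.2.0
    volumes:
      - /usr/share/elastic-agent:/usr/share/elastic-agent   # removing this bind mount resolved the issue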

Host details:
NAME="Ubuntu"
VERSION="18.04.1 LTS (Bionic Beaver)"
Elastic agent: elastic-agent:8.2.0
Docker version 20.10.17, build 100c701
Docker Compose version v2.6.0

Not completely sure why that works, though.

@sunileman
Copy link

sunileman commented Oct 4, 2022

I tried the workaround on an Azure Container Instance and it didn't work:

FROM docker.elastic.co/beats/elastic-agent:8.4.2
USER root
RUN chown -R :elastic-agent /usr/share/elastic-agent
USER elastic-agent

Any other suggestions?

@jlind23 jlind23 assigned michalpristas and unassigned narph Jan 26, 2023
@jsoriano
Member

jsoriano commented Jan 27, 2023

@111andre111 found a similar issue when using elastic-package in Docker for Mac. When enabling the experimental containerd image store, elastic-agent doesn't start, with:

/usr/local/bin/docker-entrypoint: line 14: exec: elastic-agent: not found

The Kibana, Elasticsearch and package-registry containers start in this environment, so at least this part seems to be an issue specific to the Elastic Agent images. Applying the permission changes in Mariana's PR (#493) may solve it.

Maybe Azure is also using containerd under the hood.

@jlind23 could this be re-prioritized? This may affect all containerd-based environments and not only Azure, and it would break all developer environments on Mac and Windows if the containerd image store gets enabled by default.

@jlind23
Contributor Author

jlind23 commented Jan 27, 2023

@jsoriano I added this to one of our next sprints in order to fix it.

@cmacknz cmacknz added the QA:Ready For Testing label Jul 24, 2023
@cmacknz
Member

cmacknz commented Jul 24, 2023

@elastic/fleet-qasource-external can you please test installing Elastic Defend in a Docker container to see if it continues to install correctly after this change?

We likely only need to test one of these integrations to see if this change had any consequences, since the requirement to be root is the same for all of them.

@harshitgupta-qasource

Hi @cmacknz

Thank you for the update.

We have re-validated this on the latest 8.9.0 BC5 Kibana cloud environment and made the observations below:

Observations:

  • On adding the Elastic Defend integration to the Docker agent policy, the Docker agent goes to an Unhealthy state.

  • However, as per the last information we had and test case #C150159, the Elastic Defend integration is not supported for Docker agents.

  • We have observed the same Docker agent behavior with Elastic Defend on earlier Kibana versions too.

Build details:
VERSION: 8.9.0
BUILD: 64715
COMMIT: beb56356c5c037441f89264361302513ff5bd9f8

Screenshot: (attached)

Please let us know if anything else is required from our end.

Thanks

@cmacknz
Member

cmacknz commented Jul 25, 2023

Thanks, I think that is a strong hint that what was suggested in #82 (comment) is happening: inputs that require root no longer work in containers after this change.

Elastic Defend isn't supported in containers, but the beta Defend for Containers integration is, and it also has this requirement. @elastic/fleet-qasource-external can you test installing the Defend for Containers integration as well?

@harshitgupta-qasource

Hi @cmacknz

Thank you for the update.

We have re-validated this on the latest 8.9.0 release build and made the observations below:

Observations:

  • On adding the Defend for Containers integration to the Docker agent policy, the Docker agent goes to an Unhealthy status.

Build details:
VERSION: 8.9.0
BUILD: 64715
COMMIT: beb56356c5c037441f89264361302513ff5bd9f8

Screenshot: (attached)

Please let us know if anything else is required from our end.

Thanks

@cmacknz
Member

cmacknz commented Jul 27, 2023

Thank you. Cloud Defend is broken by this change, then. It also seems to have affected Synthetics.

@michalpristas we need to revert #3084 while we sort this out.

@111andre111
Contributor

111andre111 commented Jul 28, 2023

@cmacknz @michalpristas
I kind of disagree that this PR needs to be reverted.

The nature of what has been changed would not really be an issue for a privileged Docker container.
I didn't start the container as privileged, but for Kubernetes you can see the manifest that needs to be downloaded (and probably updated to the newer image versions) here:
https://www.elastic.co/guide/en/security/8.9/kubernetes-dashboard.html#_download_and_modify_the_daemonset_manifest
There you can see that the manifest starts the container as privileged.

I am not really sure what exactly was tested by @harshitgupta-qasource.

The PR itself doesn't really change anything from a user perspective. It just changes the permissions on /usr/share/elastic-agent appropriately, and that is only relevant for non-privileged containers.

In my opinion it's somewhat expected that Defend would not run in a non-privileged scenario.
The comment here,
#82 (comment),
is only a symptom of that and goes back to this investigation point,
#82 (comment),
where basically containerd cuts away the supplementary group. Some of that has been mentioned earlier.

And here,
https://github.com/elastic/elastic-agent/pull/3084/files#diff-21d89ed5947be16887bfdba934bd3af27ef7252551adccd996a9bdc961222f39R147
we add the elastic-agent user to the root group as a supplementary group with --groups 0.

That could probably be changed to the primary group.
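
For example, a sketch only (UID, home path, and other flags taken from the reproduction Dockerfile earlier in this issue, not from the real packaging templates):

RUN useradd -M --uid 1000 --gid 0 --home /usr/share/elastic-agent elastic-agent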

But long story short: Elastic Defend not working in the unprivileged case is probably expected, and I should have been testing on a privileged container, which I didn't. Maybe something similar is needed for Synthetics, and that didn't really change the situation from before this PR.

Unprivileged, the user is always elastic-agent, both before and after the change.

@111andre111
Contributor

111andre111 commented Jul 28, 2023

I did some tests without containerd, and I think the pure-Docker test scenario should be excluded from this ticket. I don't know if a pure Docker Fleet installation is even supported with Elastic Endpoint, or whether some extra capabilities are needed. I think that should be investigated with the Endpoint, Profiler and Synthetics teams.

At least under Kubernetes, as mentioned in the last comment, the deployment stays healthy when rolling out Elastic Defend and Elastic Defend for Containers (for the latter I needed to add some capabilities, but that's it).

And in Kubernetes, for Elastic Endpoint, that is solved with some sidecar containers.

So this scenario could probably be tested with the new image.

@harshitgupta-qasource

Hi @111andre111

We have re-validated this on the latest 8.9.0 release build; please find below the detailed steps we followed for this test:

Steps to reproduce:

  1. Deploy an Ubuntu ARM64 Machine.
  2. Install Docker Engine on Ubuntu.
  3. Run Docker:
sudo docker run \
--env FLEET_ENROLL=1 \
--env FLEET_URL=https://...................................:443 \
--env FLEET_ENROLLMENT_TOKEN=<Enrollment Token> \
--env ELASTIC_AGENT_TAGS=docker,qa \
--rm docker.elastic.co/beats/elastic-agent:8.9.0

  4. The Docker agent is installed in Kibana.
  5. Now add the Defend for Containers integration to the Docker agent's policy.
  6. Observe that after adding the Defend for Containers integration to the Docker agent policy, the Docker agent goes to an Unhealthy status.

Build details:
VERSION: 8.9.0
BUILD: 64715
COMMIT: beb56356c5c037441f89264361302513ff5bd9f8

Screenshot: (attached)

Agent Diagnostic logs:
elastic-agent-diagnostics-2023-07-31T06-18-21Z-00.zip

Please let us know if we are missing anything here.
Thanks!

@michalpristas
Contributor

michalpristas commented Aug 1, 2023

Yeah, I did the revert at the end of my day as a quick remedy, but
8.9 does not even contain the change. Looking at it, I don't see a reason for this revert either, as @111andre111 said.
The agent is not running and was not running as root; what changed is just the permissions for the config file, and the user even kept its group 0 membership.

@111andre111
Contributor

@harshitgupta-qasource To be honest, I don't think that docker command would even work with Elastic Defend; as I mentioned earlier, it needs privileged rights.
So the docker command needs, in my opinion, at least --privileged --user 0.
I am not sure that's the right approach to test Elastic Defend or Elastic Defend for Containers in a pure Docker environment. That's why I said earlier that we should ask for input from the responsible teams, probably in this case @kevinlog.

@cmacknz
Member

cmacknz commented Aug 1, 2023

CC @andrewvc since I believe this change broke the Synthetics container, or at least reverting it fixed one of their E2E tests.

I don't believe we can bring this change back without confirming that everything that worked before still works afterwards, and ensuring our documentation reflects any changes to the way the containers need to be started.

@michalpristas
Contributor

Can we get the failures from E2E?

@cmacknz
Member

cmacknz commented Aug 2, 2023

Sent the link privately.

@kevinlog

kevinlog commented Aug 4, 2023

@111andre111

I am not sure that's the right approach to test Elastic Defend or Elastic Defend for Containers in a pure Docker environment.

I think you are correct here. I don't believe you can run Elastic Defend in a Docker container. @ferullo @nfritts could give more info.

@jlind23
Contributor Author

jlind23 commented Aug 16, 2023

@michalpristas @pierrehilbert what's the latest here? Were you able to look at the E2E failures and see what the problem was?

@michalpristas
Contributor

No progress.

@cmacknz
Member

cmacknz commented Nov 23, 2023

Re-opening this as the fix needed to be reverted due to problems with ECE deployments. See #3711.

@JanKnipp

Hi,
the same issue seems to occur on Azure Container Apps; at least I'm seeing the error "/usr/local/bin/docker-entrypoint: line 14: exec: elastic-agent: not found". The strange thing is that I already have a running instance within Azure Container Apps, and the only difference I can see in my tests is that one of the environments is using "Workload profiles". I'll have a look at it during the day to see if I can reproduce the working version.

@amolnater-qasource amolnater-qasource removed the QA:Ready For Testing label May 14, 2024