all service containers of a service are not fetched in /etc/opts/hosts file #5

Open
ghost opened this issue Aug 27, 2017 · 4 comments

Comments

@ghost

ghost commented Aug 27, 2017

I created a service with 16 containers and ran an MPI task from the master node. I noticed that not all of the service containers were taking load. I then opened the /etc/opts/hosts file, which is supposed to list all service containers, and found that most of the time 2-3 containers are missing from it.

I have figured out that the issue is with the "netstat -t" command inside get_hosts: it cannot resolve all container names, so it returns fewer addresses most of the time.
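
A quick way to confirm the symptom before launching a job is to count the entries in the hosts file against the number of containers you expect. The sketch below is only an illustration: it assumes the file holds one host per line (blank lines and `#` comments ignored), and `read_hosts` and `missing_count` are hypothetical helper names, not part of this project.

```python
from pathlib import Path


def read_hosts(path):
    """Return the non-empty, non-comment lines of a hostfile, in file order."""
    entries = []
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            entries.append(line)
    return entries


def missing_count(path, expected):
    """Return how many unique entries the hostfile is short of `expected`."""
    return max(0, expected - len(set(read_hosts(path))))
```

For the case described above, `missing_count("/etc/opts/hosts", 16)` would return 2 or 3 whenever the listing is incomplete, and 0 when all containers are present.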

@NLKNguyen
Owner

Are you using the Single Host or Multi Host orchestration? And which version of Docker?

I've noticed that in the Multi Host solution the services are sometimes slow to all become available, and I have to rerun the commands to get them all up.

Any alternative suggestion to netstat -t is welcome. At some point I'll look into the newer Docker releases (I haven't checked since January but heard some big noise in the summer) to see whether anything has been updated that provides a better solution here.
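
Rerunning by hand can be replaced by polling until the expected number of hosts shows up. This is a rough sketch rather than the project's behaviour: `wait_for_hosts` is a hypothetical helper, and `discover` stands for whatever discovery step is in use (netstat, dig, or anything else that returns a list of hosts).

```python
import time


def wait_for_hosts(discover, expected, attempts=10, delay=2.0, sleep=time.sleep):
    """Call discover() until it yields at least `expected` unique hosts.

    Returns the sorted unique hosts on success and raises TimeoutError
    if the count is still short after `attempts` tries.
    """
    hosts = set()
    for attempt in range(attempts):
        hosts = set(discover())
        if len(hosts) >= expected:
            return sorted(hosts)
        if attempt < attempts - 1:
            sleep(delay)  # give late services time to come up
    raise TimeoutError(
        f"only {len(hosts)} of {expected} hosts found after {attempts} attempts"
    )
```

Failing loudly when the list is short seems safer than starting an MPI job on a partial host list, but the retry count and delay here are arbitrary and would need tuning.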

@ghost
Author

ghost commented Aug 30, 2017

I am using Multi Host and the Docker version is 1.16.0.
"netstat" is slow and it is not picking up all the container addresses.
As a workaround, I made a local script which prepares the list of hosts and scp's the file into the master container before I log in and start the MPI task from inside. The commands
"docker service ps --no-trunc master-service-name"
"docker service ps --no-trunc worker-service-name"
give all the literals required to prepare the hostFile.

I did it in Java/Python, but to keep your project as it is, it would be better to use another shell script to populate the file.
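
For anyone wanting to try the same idea, here is a minimal Python sketch. It is not the script described above. It assumes that Swarm names each task container `<task name>.<task id>`, and that those names resolve from the master container on the shared overlay network; both are worth verifying on your setup. `container_names` and `list_service_containers` are hypothetical names.

```python
import subprocess


def container_names(ps_output):
    """Turn lines of '<task name> <task id>' into '<task name>.<task id>'.

    Lines that do not have exactly two whitespace-separated fields are skipped.
    """
    names = []
    for line in ps_output.splitlines():
        parts = line.split()
        if len(parts) == 2:
            name, task_id = parts
            names.append(f"{name}.{task_id}")
    return names


def list_service_containers(service):
    """Ask a Swarm manager for the running tasks of `service`."""
    cmd = [
        "docker", "service", "ps", "--no-trunc",
        "--filter", "desired-state=running",
        "--format", "{{.Name}} {{.ID}}",
        service,
    ]
    result = subprocess.run(cmd, check=True, capture_output=True, text=True)
    return container_names(result.stdout)
```

`list_service_containers` has to run where the Docker CLI can reach a Swarm manager, which matches the approach of building the file outside the containers and copying it in.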

@lzhou-arch

lzhou-arch commented Apr 7, 2018

I noticed a similar issue while running MPI jobs. Some of the worker nodes occasionally go missing from /etc/opts/hosts. This doesn't cause problems for a short MPI job, but some longer jobs hang forever.

Any ideas for recovering the hung jobs?

@simonholgate
Contributor

This might be a similar issue to #4 and netstat.

I've produced a solution using dig based on https://stackoverflow.com/questions/49446165/how-to-get-all-ip-addresses-on-a-docker-network

I'll make a pull request.
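
Until the pull request is up, here is a rough illustration of the DNS-based idea, which may differ from what the linked answer and the eventual patch do. Inside a Swarm overlay network, Docker's embedded DNS answers `tasks.<service>` with the addresses of that service's tasks, which is what `dig +short tasks.<service>` would print. The sketch does the same lookup from Python; `task_ips` is a hypothetical name, and the `resolve` parameter exists only so the function can be exercised without a real network.

```python
import socket


def task_ips(service, resolve=socket.getaddrinfo):
    """Return the unique IPv4 addresses behind tasks.<service>, sorted as strings."""
    infos = resolve(f"tasks.{service}", None, socket.AF_INET, socket.SOCK_STREAM)
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return sorted({info[4][0] for info in infos})
```

This only works from a container attached to the same network as the service, since the `tasks.` name is served by Docker's internal resolver.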
