too many open files #58

Open
parmentelat opened this issue Jan 15, 2018 · 13 comments

Comments

@parmentelat

Today, and not for the first time, I spotted occurrences of this message in the logs:

Jan 14 16:07:45 thurst dockerd-current[1574]: OSError: [Errno 24] Too many open files

This clearly happens after a few weeks of operation. For the time being it is another reason to reboot the nbhosting box every so often; every couple of weeks, or maybe even every week, feels about right.

It is not very clear at this point how to handle this properly; the Django app does not seem to be the culprit here, although even that is unclear.

@parmentelat

interesting read, possibly/probably related:
elastic/elasticsearch#18022

@parmentelat commented Jan 16, 2018

ulimit -n shows the allowed number of open files; this seems to be a per-process setting

the odd thing is that from within a container we can see a very decent limit, in the 10**6 range, but on the host it shows only 1024

besides, docker seems to have a --ulimit option; however, as per the above, the container-side limit does not seem to be our problem
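For the record, a sketch of how that option could be used to raise the limit for a single container at run time; IMAGE is just a placeholder for the actual image name:

# sketch only: start a container with an explicit open-file limit (soft:hard)
# and check what it sees from the inside
docker run --rm -it --ulimit nofile=65536:65536 IMAGE bash -c 'ulimit -n'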

@parmentelat commented Jan 17, 2018

the global kernel setting is /proc/sys/fs/file-max
right now thurst has 79190755; thermals has 26372286 (roughly one third)
it looks as if it is proportional to the amount of memory
that does not seem to be the problem here

I'm getting the same numbers from within a container, so again the settings seem fine
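For completeness, a quick way to compare that global setting on the host and from inside a running container (CONTAINER is a placeholder):

# host-side global limit, two equivalent views
cat /proc/sys/fs/file-max
sysctl fs.file-max
# same value as seen from inside a running container
docker exec CONTAINER cat /proc/sys/fs/file-max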

@parmentelat

all this boils down to: the culprit seems to be the per-process limit in the root context. Could it affect dockerd?

@parmentelat commented Jan 19, 2018

indeed dockerd does complain; as an experiment I did a live change with

root@thurst ~ (master *) # ulimit -n -S
1024
root@thurst ~ (master *) # ulimit -n -H
4096
root@thurst ~ (master *) # ulimit -n 20480
root@thurst ~ (master *) # ulimit -n -H
20480
root@thurst ~ (master *) # ulimit -n -S
20480

which did not seem to fix the problem, at least not when done live without restarting anything


EDIT:

  • this seems to affect the current shell only - and I guess its descendants
  • the systemd service file for docker correctly specifies a decent number for that ulimit
root@thurst ~ (master *) # grep LimitNOFILE /usr/lib/systemd/system/docker.service
LimitNOFILE=1048576
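Should that value ever need changing, a sketch of how it could be overridden with a systemd drop-in instead of editing the unit file shipped by the package (the value shown simply mirrors the existing one):

# create a drop-in at /etc/systemd/system/docker.service.d/override.conf
# containing:
#   [Service]
#   LimitNOFILE=1048576
systemctl edit docker
# then reload and restart so the new limit applies to dockerd
systemctl daemon-reload
systemctl restart docker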

@parmentelat commented Jan 19, 2018

also maybe worth investigating, as I guess it could be related: the absence of /etc/docker/daemon.json triggers this

Jan 19 10:47:30 thurst dockerd-current[1571]: time="2018-01-19T10:47:30.203203015+01:00" level=info msg="Got signal to reload configuration, reloading from: /etc/docker/daemon.json"
Jan 19 10:47:30 thurst dockerd-current[1571]: time="2018-01-19T10:47:30.203267653+01:00" level=error msg="open /etc/docker/daemon.json: no such file or directory"

@parmentelat

made an attempt at creating an empty /etc/docker/daemon.json on thermals
this does not seem to create any additional problem when done live; however, after a reboot it all ends up with nbhosting emitting an internal error 500, so that's not good at all

@parmentelat

useful commands to monitor the issue

# running containers
root@thurst ~ (master *) # docker ps | grep -v IMAGE | wc -l
41
# number of files open by the docker daemon
root@thurst ~ (master *) # ls  /proc/$(pgrep dockerd)/fd | wc -l
242
# or 
root@thurst ~ (master *) # lsof -p $(pgrep dockerd) | wc -l
287
# total number of open files in the system
root@thurst ~ (master *) # lsof | wc -l
1148699
# cat /proc/sys/fs/file-max
79190755

don't get it...
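A hedged sketch of a small loop that could log those counters periodically, to pinpoint when the leak starts; the interval and log path are arbitrary choices:

# append a timestamped snapshot of the counters every 10 minutes
while true; do
    {
        date -Is
        echo "containers:  $(docker ps -q | wc -l)"
        echo "dockerd fds: $(ls /proc/$(pgrep dockerd)/fd | wc -l)"
        echo "open files:  $(lsof | wc -l)"
    } >> /var/log/nbhosting-fd-watch.log
    sleep 600
done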

@parmentelat commented Jan 22, 2018

as of jan 22 2018
I am adding lsof inside the flotpython3 image to help troubleshoot this leakage

also:

  • box was rebooted on jan. 20 circa 08:48
  • and issued its first 'too many open files' around jan. 20 at 23:19
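For the record, a hedged way to pin down that first occurrence straight from the journal of the current boot (the message text is the one quoted at the top of this issue):

# first 'Too many open files' entry since the last reboot
journalctl -b | grep -m1 'Too many open files'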

@parmentelat

dockers=$(docker ps --format '{{.Names}}')
for docker in $dockers; do
    echo -n ==================== $docker " "
    docker exec -ti $docker lsof | wc -l
done
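A variant of the same idea, as a sketch: drop the TTY flags so it can run non-interactively, and sort containers by open-file count to spot the worst offenders first:

# count open files per container, biggest consumers first
for d in $(docker ps --format '{{.Names}}'); do
    printf '%s %s\n' "$d" "$(docker exec "$d" lsof 2>/dev/null | wc -l)"
done | sort -k2 -rn | head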

@parmentelat

possible breakthrough; when inside a container:

root@f60d53d63043:~# ulimit -n
1048576

BUT

root@f60d53d63043:~# su - uid39096
No directory, logging in with HOME=/
uid39096@f60d53d63043:/$ id
uid=39096(uid39096) gid=39096(uid39096) groups=39096(uid39096)
uid39096@f60d53d63043:/$ ulimit -n
1024
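To figure out where that 1024 comes from, a couple of checks that could be run inside the container (the user name is the one from the transcript above; results may of course vary with the image):

# any explicit nofile settings shipped in the image?
grep -r nofile /etc/security/limits.conf /etc/security/limits.d/ 2>/dev/null
# limits actually seen by a non-root session
su - uid39096 -c 'grep "open files" /proc/self/limits'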

@parmentelat

in the commit below, an attempt to get this right is made in the Dockerfile for the MOOC in production

parmentelat/flotpython-archive@f141802

the image we use is based on Ubuntu

when run as root, processes in the container have a decent limit of 1024*1024 open files,
but regular user processes see a limit of only 1024 open files, which apparently is not enough for, I expect, the Jupyter server process

the proposed patch is to create a limits file in /etc/security/limits.d/ to increase this, as sketched below
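For reference, a sketch of what such a drop-in could contain; the file name and the exact values are assumptions, chosen to match the root-side limit observed above:

# /etc/security/limits.d/nofile.conf (name and values are assumptions)
*    soft    nofile    1048576
*    hard    nofile    1048576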

users who need a higher limit might be the ones who play a lot with pdflatex when they download the notebooks as PDF. What is truly amazing, though, is that some containers have been observed with as many as several hundred thousand entries in lsof.

Putting this deployment under observation; the new image is rolled out lazily, so it takes a good day before it is really effective.

@parmentelat

unfortunately, I could see the bad symptom again today.

it actually shows up immediately when you look at the CPU load; my conjecture is that when the issue triggers, the CPU load starts a steady climb. See the load chart on nbhosting.inria.fr for Feb. 3 2018: between 22:00 and 23:00 the load rises from 0 to 2 and never comes back down.

there were only 2 active containers when I ran into this issue just now, Feb. 6 circa 10:20; surprisingly, one of them had been up for 2 days, which should not happen under normal circumstances.

I restarted the docker daemon, which also caused the containers to shut down; this was of course enough to bring the box back to normal: the load is back to 0 and I can no longer see a 'Too many open files' message in the journal.
