too many open files #58

Open
parmentelat opened this issue Jan 15, 2018 · 13 comments

Comments

@parmentelat

Today, and not for the first time, I spotted occurrences of this message in the logs:

Jan 14 16:07:45 thurst dockerd-current[1574]: OSError: [Errno 24] Too many open files

This clearly happens after a few weeks of operation. For the time being it is another reason to reboot the nbhosting box every so often; every couple of weeks, or maybe even every week, feels about right.

It is not very clear at this point how to handle this properly; the Django app does not seem to be the culprit here, although even that is unclear.

@parmentelat

interesting read, possibly/probably related:
elastic/elasticsearch#18022

@parmentelat commented Jan 16, 2018

ulimit -n shows the allowed number of open files; this seems to be a per-process setting

the odd thing is that from within a container we can see a very decent limit, in the 10**6 range, but on the host it shows only 1024

besides, docker seems to have a --ulimit option; however, as per the above, the container-side limit does not seem to be our problem
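For the record, a sketch of how that option could be used to raise the limit for a single container at run time; IMAGE is just a placeholder for the actual image name:

# sketch only: start a container with an explicit open-file limit (soft:hard)
# and check what it sees from the inside
docker run --rm -it --ulimit nofile=65536:65536 IMAGE bash -c 'ulimit -n'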

@parmentelat commented Jan 17, 2018

the global kernel setting is /proc/sys/fs/file-max
right now thurst has 79190755; thermals has 26372286 (roughly one third)
it looks as if it is proportional to the amount of memory
that does not seem to be the problem here

I'm getting the same numbers from within a container, so again the settings seem fine
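For completeness, a quick way to compare that global setting on the host and from inside a running container (CONTAINER is a placeholder):

# host-side global limit, two equivalent views
cat /proc/sys/fs/file-max
sysctl fs.file-max
# same value as seen from inside a running container
docker exec CONTAINER cat /proc/sys/fs/file-max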

@parmentelat

all this boils down to: the culprit seems to be the per-process limit in the root context. Could it affect dockerd?

@parmentelat commented Jan 19, 2018

indeed dockerd does complain; as an experiment I did a live change with

root@thurst ~ (master *) # ulimit -n -S
1024
root@thurst ~ (master *) # ulimit -n -H
4096
root@thurst ~ (master *) # ulimit -n 20480
root@thurst ~ (master *) # ulimit -n -H
20480
root@thurst ~ (master *) # ulimit -n -S
20480

which did not seem to fix the problem, at least not when done live without restarting anything


EDIT:

  • this seems to affect the current shell only - and I guess its descendants
  • the systemd service file for docker correctly specifies a decent number for that ulimit
root@thurst ~ (master *) # grep LimitNOFILE /usr/lib/systemd/system/docker.service
LimitNOFILE=1048576
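Should that value ever need changing, a sketch of how it could be overridden with a systemd drop-in instead of editing the unit file shipped by the package (the value shown simply mirrors the existing one):

# create a drop-in at /etc/systemd/system/docker.service.d/override.conf
# containing:
#   [Service]
#   LimitNOFILE=1048576
systemctl edit docker
# then reload and restart so the new limit applies to dockerd
systemctl daemon-reload
systemctl restart docker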

@parmentelat commented Jan 19, 2018

also maybe worth investigating, as I guess it could be related: the absence of /etc/docker/daemon.json triggers this

Jan 19 10:47:30 thurst dockerd-current[1571]: time="2018-01-19T10:47:30.203203015+01:00" level=info msg="Got signal to reload configuration, reloading from: /etc/docker/daemon.json"
Jan 19 10:47:30 thurst dockerd-current[1571]: time="2018-01-19T10:47:30.203267653+01:00" level=error msg="open /etc/docker/daemon.json: no such file or directory"

@parmentelat

made an attempt at creating an empty /etc/docker/daemon.json on thermals
this does not seem to create any additional problem when done live; however, after a reboot it all ends up with nbhosting emitting an internal error 500, so that's not good at all

@parmentelat

useful commands to monitor the issue

# running containers
root@thurst ~ (master *) # docker ps | grep -v IMAGE | wc -l
41
# number of files open by the docker daemon
root@thurst ~ (master *) # ls  /proc/$(pgrep dockerd)/fd | wc -l
242
# or 
root@thurst ~ (master *) # lsof -p $(pgrep dockerd) | wc -l
287
# total number of open files in the system
root@thurst ~ (master *) # lsof | wc -l
1148699
# cat /proc/sys/fs/file-max
79190755

don't get it...
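A hedged sketch of a small loop that could log those counters periodically, to pinpoint when the leak starts; the interval and log path are arbitrary choices:

# append a timestamped snapshot of the counters every 10 minutes
while true; do
    {
        date -Is
        echo "containers:  $(docker ps -q | wc -l)"
        echo "dockerd fds: $(ls /proc/$(pgrep dockerd)/fd | wc -l)"
        echo "open files:  $(lsof | wc -l)"
    } >> /var/log/nbhosting-fd-watch.log
    sleep 600
done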

@parmentelat commented Jan 22, 2018

as of jan 22 2018
I am adding lsof inside the flotpython3 image to help troubleshoot this leakage

also:

  • box was rebooted on jan. 20 circa 08:48
  • and issued its first 'too many open files' around jan. 20 at 23:19
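For the record, a hedged way to pin down that first occurrence straight from the journal of the current boot (the message text is the one quoted at the top of this issue):

# first 'Too many open files' entry since the last reboot
journalctl -b | grep -m1 'Too many open files'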

@parmentelat

dockers=$(docker ps --format '{{.Names}}')
for docker in $dockers; do
    echo -n ==================== $docker " "
    docker exec -ti $docker lsof | wc -l
done
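A variant of the same idea, as a sketch: drop the TTY flags so it can run non-interactively, and sort containers by open-file count to spot the worst offenders first:

# count open files per container, biggest consumers first
for d in $(docker ps --format '{{.Names}}'); do
    printf '%s %s\n' "$d" "$(docker exec "$d" lsof 2>/dev/null | wc -l)"
done | sort -k2 -rn | head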

@parmentelat

possible breakthrough; when inside a container:

root@f60d53d63043:~# ulimit -n
1048576

BUT

root@f60d53d63043:~# su - uid39096
No directory, logging in with HOME=/
uid39096@f60d53d63043:/$ id
uid=39096(uid39096) gid=39096(uid39096) groups=39096(uid39096)
uid39096@f60d53d63043:/$ ulimit -n
1024
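To figure out where that 1024 comes from, a couple of checks that could be run inside the container (the user name is the one from the transcript above; results may of course vary with the image):

# any explicit nofile settings shipped in the image?
grep -r nofile /etc/security/limits.conf /etc/security/limits.d/ 2>/dev/null
# limits actually seen by a non-root session
su - uid39096 -c 'grep "open files" /proc/self/limits'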

@parmentelat

in the commit below, an attempt to get this right is made in the Dockerfile for the MOOC in production

parmentelat/flotpython-archive@f141802

the image we use is based on Ubuntu

when run as root, processes in the container have a decent limit of 1024*1024 open files,
but regular user processes see a limit of only 1024 open files, which apparently is not enough for, I expect, the Jupyter server process

the proposed patch is to create a limits file in /etc/security/limits.d/ to increase this, as sketched below
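For reference, a sketch of what such a drop-in could contain; the file name and the exact values are assumptions, chosen to match the root-side limit observed above:

# /etc/security/limits.d/nofile.conf (name and values are assumptions)
*    soft    nofile    1048576
*    hard    nofile    1048576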

users who need a higher limit might be the ones who play a lot with pdflatex when they download the notebooks as PDF. What is truly amazing, though, is that some containers have been observed with as many as several hundred thousand entries in lsof.

Putting this deployment under observation; the new image is rolled out lazily, so it takes a good day before it is really effective.

@parmentelat

unfortunately, I could see the bad symptom again today.

it actually shows up immediately when you look at the CPU load; my conjecture is that when the issue triggers, the CPU load starts a steady climb. See the load chart on nbhosting.inria.fr for Feb. 3 2018: between 22:00 and 23:00 the load rises from 0 to 2 and never comes back down.

there were only 2 active containers when I ran into this issue just now, Feb. 6 circa 10:20; surprisingly, one of them had been up for 2 days, which should not happen under normal circumstances.

I restarted the docker daemon, which also caused the containers to shut down; this was of course enough to bring the box back to normal: the load is back to 0 and I can no longer see a 'Too many open files' message in the journal.
