
[critical] K or kweb or something opens up too many files on fslweb #19

Open
pdaian opened this issue Nov 3, 2014 · 12 comments

Comments

@pdaian
Member

pdaian commented Nov 3, 2014

From Joel in IT:

Grigore,

Fslweb is back up and serving the website. We are, unfortunately, not in control of the network update schedule, but we do apologize for the outage. That said, here is some technical information regarding what happened.

The logs were recording many messages of this form:

Oct 26 03:30:33 fslweb NetworkManager[1795]: <warn> error parsing timestamps file '/var/lib/NetworkManager/timestamps': Too many open files
Oct 26 03:30:33 fslweb NetworkManager[1795]: <warn> error saving timestamp: Failed to create file '/var/lib/NetworkManager/timestamps.F7ZEOX': Too many open files

These messages appeared for days, and perhaps months, prior to the outage, which leads me to believe an issue was already present before the network went down.

Nov  1 05:40:40 fslweb NetworkManager[1795]: <warn> sysctl: failed to open '/proc/sys/net/ipv6/conf/eth0/accept_ra': (24) Too many open files
Nov  1 05:40:40 fslweb NetworkManager[1795]: <error> [1414838440.4364] [nm-device.c:3486] nm_device_update_ip4_address(): couldn't open control socket.
Nov  1 05:40:40 fslweb NetworkManager[1795]: <error> [1414838440.4476] [nm-system.c:771] nm_system_device_is_up_with_iface(): couldn't open control socket.
Nov  1 05:40:40 fslweb NetworkManager[1795]: <info> (eth0): bringing up device.

The lines above show the time of the actual network outage. You can see that the interface fails to come back up due to a lack of available file handles. It then spams the same line repeatedly, right up until I restarted the machine this morning.

Nov  2 03:45:52 fslweb NetworkManager[1795]: <error> [1414921552.22183] [nm-system.c:771] nm_system_device_is_up_with_iface(): couldn't open control socket.

Checking the max open file handles, 1,620,366 is the number of files the system will open concurrently.  That's a million and a half open files.  From checking the backup stats on the machine it looks like the machine itself has almost 7 million files in just 115GB of space.  This leads me to believe that the issue that caused the machine to not come back up after the networking outage was the open files, not something directly related to the network.  I need to run to a meeting, but I'll provide additional information this afternoon.

Joel
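Two different limits are in play here, and they are easy to conflate: the 1,620,366 figure is the system-wide cap on open file handles (Linux's fs/file-max), while the "(24) Too many open files" errors in the logs are EMFILE, raised when a single process exhausts its own descriptor limit. A minimal sketch for inspecting both, assuming Linux and a stock Python:

```python
import resource

# Per-process descriptor limit: EMFILE (errno 24, "Too many open files")
# is raised when one process exceeds its soft limit.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"per-process: soft={soft}, hard={hard}")

# System-wide cap on open file handles (Linux-only), the kind of
# number quoted above (1,620,366 on fslweb).
with open("/proc/sys/fs/file-max") as fh:
    print("system-wide:", int(fh.read().strip()))
```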
@kheradmand
Contributor

I guess the anon folder is for saving unregistered users' files. If that's the case, we should remove those files once the anonymous user's session ends (or, for example, after 24 hours).

Right now we have ~1M (925,975) files in the /srv/kweb/kfiles/anon folder.

@kheradmand
Contributor

Also we have this function:

def get_file_meta():
    ...
    return open(collection.get_collection_path() + path + file + '.meta').read()

I'm not sure whether Python automatically closes these files or not.

@pdaian
Member Author

pdaian commented Nov 4, 2014

Python's garbage collector is supposed to close file handles, so I'm not sure why that isn't happening if this is a Python issue. The garbage collector is definitely working, because our memory isn't growing without bound.
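This matches CPython's behavior: handles are closed by reference counting the moment the last reference drops, not by the cyclic garbage collector, so leaks would mostly show up under other interpreters or when something keeps a reference alive. A small sketch that observes the implicit close, assuming CPython; TrackedFile and read_leaky are made-up names for illustration:

```python
import io
import os
import tempfile

closed = []

class TrackedFile(io.FileIO):
    """FileIO subclass that records when its handle gets closed."""
    def close(self):
        if not self.closed:
            closed.append(True)
        super().close()

path = tempfile.mkstemp()[1]
with open(path, "w") as f:
    f.write("meta")

def read_leaky(p):
    # No explicit close: the handle is reclaimed only when the
    # interpreter decides the file object is garbage.
    return TrackedFile(p).read()

data = read_leaky(path)
# On CPython the refcount hits zero as read_leaky returns, so close()
# has already run by this point; on PyPy or Jython it could be
# deferred until an eventual GC cycle, letting handles pile up.
os.remove(path)
```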

@kheradmand
Contributor

Then that is most probably not the problem :)

@pdaian
Member Author

pdaian commented Nov 4, 2014

It still may be the issue 👍 ... could be a bug in Python or maybe we're running a really old version. I agree with you that we need to do file cleanup though, it's been on my list for a while but right now it's just a manual thing. Worth noting the anon folder has reached over 10GB before (~1 year of use) with no problems. I did clean up all the files a few days before the crash so maybe this was somehow a consequence of that. Going to be hard to say without some more investigation.

@kheradmand
Contributor

One thing I've noticed in the file list that Joel sent is that all the files in, for example, ./kweb/kfiles/anon/8a9dfe95-fe26-4883-b974-c239b2db4064/ were open on the server. The only command I've found so far in your code that touches all those files is shutil.copytree, but that doesn't help.

@kheradmand
Contributor

Oh, I just realized that Joel said those files were 'created', not 'open'. Do you have access to the list of currently open files on the server? I don't.

@pdaian
Member Author

pdaian commented Nov 4, 2014

You can do sudo lsof if you have root access. Nothing out of the ordinary there right now, well under 7k files open.
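Without root, a process can at least watch its own descriptor table: on Linux every open descriptor appears as an entry under /proc/&lt;pid&gt;/fd, which is essentially what lsof reads. A sketch (Linux-only; open_fd_count is a hypothetical helper that kweb could log periodically):

```python
import os
import tempfile

def open_fd_count(pid="self"):
    # Each entry under /proc/<pid>/fd is one open descriptor held by
    # that process (Linux-only; other pids need read permission).
    return len(os.listdir(f"/proc/{pid}/fd"))

before = open_fd_count()
handles = [tempfile.TemporaryFile() for _ in range(5)]
grew_by = open_fd_count() - before  # one descriptor per open temp file
for h in handles:
    h.close()
```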

@kheradmand
Contributor

Hmmm
Thanks :)

@kheradmand
Contributor

So, as far as I've investigated, no new files get opened over time. But just in case, I created a cronjob that checks for newly opened files every day.

@grosu
Member

grosu commented Nov 4, 2014

Guys, if we should get a new fslweb machine/server, please let me know.

Grigore



@pdaian
Member Author

pdaian commented Nov 4, 2014

@grosu I don't think a new server is required. We will eventually definitely need to move kweb to the cloud though, because if we get several people using K at once it's already too much CPU for any single machine to handle well.
