
location_report_builder getting stuck on get_filedir_count() #417

Open
marxjohnson opened this issue Apr 23, 2021 · 6 comments

Comments

@marxjohnson
Contributor

We have noticed that our cron is running lots of instances of the generate_status_report scheduled task that never seem to complete.

Doing some digging, I have found that location_report_builder reaches the stage where it runs $filesystem->get_filedir_count(), which runs the following shell command: find /srv/learn2syst.open.ac.uk/www/moodledata/filedir -type f | grep -c /
For some reason, the task hangs at this point and never completes. There is no error output. Our container running the cron script remains active, so it evidently doesn't consider the cron run complete.

Stranger still, Moodle does seem to think the scheduled task is complete: it continues to run additional scheduled tasks, including further instances of generate_status_report. Watching runningtasks.php shows the task run for about 3-4 minutes, then disappear.

@marxjohnson
Contributor Author

Doing a bit more digging, it appears that the generate_status_report task does eventually complete, but this section takes about an hour to complete on our filesystem.

$rowcount = $filesystem->get_filedir_count();
$rowsum = $filesystem->get_filedir_size();

In this case $rowcount is 495263 and $rowsum is 3436428000.

So it's slow rather than dying, but it's still puzzling that Moodle doesn't seem to realise the task is still running, and continues to run additional instances alongside other tasks.
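The two numbers above can be sanity-checked outside Moodle with the same kind of `find` pipeline the plugin shells out to. A minimal sketch (the `filedir_stats` helper name and the awk size sum are my own, not part of the plugin; `grep -c /` counts one line per file, since every absolute path printed by `find` contains a slash):

```shell
# filedir_stats: print "<file count> <total bytes>" for the directory in $1.
# Hypothetical helper for reproducing the report's two measurements by hand.
filedir_stats() {
    dir="$1"
    # Same count get_filedir_count() effectively performs:
    count=$(find "$dir" -type f | grep -c / || true)
    # Total bytes: one size per line (GNU find -printf), summed with awk:
    bytes=$(find "$dir" -type f -printf '%s\n' | awk '{s += $1} END {print s + 0}')
    echo "$count $bytes"
}
```

With a cold page cache, walking hundreds of thousands of inodes is dominated by disk seeks, which could explain an hour-long runtime that looks like a hang.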

@brendanheywood
Contributor

@marxjohnson did you ever get to the bottom of this? It sounds more like a lock factory problem than an objectfs issue?

@marxjohnson
Contributor Author

@brendanheywood I didn't get to the bottom of it before I left the OU, and I haven't looked into it since.

@sammarshallou Are you still having problems with this?

@sammarshallou
Contributor

@marxjohnson @brendanheywood We still have this task disabled on the live server; I guess nobody has needed the status report...

I made it run on acct yesterday. It appeared normally in the 'running tasks' page, and although I had to go home before it finished, it did complete after just over 2 hours, and when I look now it's no longer showing on the running tasks page or anything like that. The log from task logs looks like this:

Execute scheduled task: Object status report generator task (tool_objectfs\task\generate_status_report)
... started 16:30:10. Current memory use 7.7 MB.
... used 58 dbqueries
... used 7607.4464600086 seconds
Scheduled task complete: Object status report generator task (tool_objectfs\task\generate_status_report)

One strange thing: we have custom cron logs for each cron runner, which I still use because you can reload them to monitor progress during a run rather than only afterwards. That log should contain a duplicate of the above, but it is cut off after the start:

Execute scheduled task: Object status report generator task (tool_objectfs\task\generate_status_report)
... started 14:17:02. Current memory use 4.7 MB.

I can't really understand why this log file would get cut off given that the process obviously didn't crash, but anyway, it's presumably something to do with our infrastructure rather than an indication of any problem with the task.

So in summary: it would obviously be nice if the task didn't take 2 hours to run, but other than that it looks OK.

@brendanheywood
Contributor

OK, it sounds like there are a few things here and this issue should be split up. The slowness of generate_status_report due to the SQL already has a few issues elsewhere, like #596.

The result of get_filedir_count should be small, certainly not millions of files. This can depend on the settings, e.g. if a large threshold is set for the size of files to be moved to object storage. Is this set high?
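Since files at or above the size threshold should have been migrated out of filedir, splitting the local count at the threshold is one quick way to check whether migration is keeping up. A sketch, not anything the plugin provides (the `count_by_threshold` helper is hypothetical; with `find -size`, the `c` suffix means bytes, `-N` means strictly less than N, and `+N` strictly greater than N):

```shell
# count_by_threshold: print "<below> <at-or-above>" file counts for the
# directory in $1, split at the byte threshold in $2.
count_by_threshold() {
    dir="$1"
    threshold="$2"
    # Files strictly smaller than the threshold (expected to stay local):
    small=$(find "$dir" -type f -size "-${threshold}c" | grep -c / || true)
    # Files at or above it, i.e. strictly larger than threshold - 1 bytes
    # (expected to have been moved to object storage):
    large=$(find "$dir" -type f -size "+$((threshold - 1))c" | grep -c / || true)
    echo "$small $large"
}
```

A large second number would suggest the mover tasks are behind, rather than the threshold being set high.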

@sammarshallou
Contributor

The size is the default, 10240. I checked a couple of random filedir directories (/xx/yy) and they both had approx. 20 files in them, almost all of which were smaller than that size, so I think it's working. Multiplying by the ~64K directories gives about 1.3 million files total in filedir. So it's not 'millions', but it is a million.
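The back-of-envelope maths works out: 20 files x 256 x 256 directories is about 1.31 million. The same estimate can be made cheaply by sampling a handful of leaf directories instead of walking the whole tree, assuming the standard two-level aa/bb contenthash layout. A rough sketch (the `estimate_filedir_total` helper is my own, and it assumes the path contains no spaces):

```shell
# estimate_filedir_total: extrapolate a total file count for the directory
# in $1 from a small sample of its aa/bb leaf directories.
estimate_filedir_total() {
    dir="$1"
    sampled=0
    files=0
    # Sample up to 8 second-level directories (word-splitting is fine here
    # only because filedir paths contain no whitespace):
    for leaf in $(find "$dir" -mindepth 2 -maxdepth 2 -type d | head -n 8); do
        n=$(find "$leaf" -type f | grep -c / || true)
        files=$((files + n))
        sampled=$((sampled + 1))
    done
    [ "$sampled" -gt 0 ] || { echo 0; return; }
    # 65536 = 256 * 256 possible two-hex-digit directory pairs.
    echo $(( files * 65536 / sampled ))
}
```

Against a full filedir this touches only a few directories, so it gives a near-instant ballpark where the real count takes hours.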

When I mentioned this task in a standup, the developer who knows about infrastructure said it was also expensive to run frequently, due to AWS storage costs or something (I'm not sure he's right; it's possible he was thinking of a different task, this was just a quick chat). Anyway, we are cool with leaving it disabled and running it manually only if required, so it's really OK for us that it takes 2 hours.
