Avoid resource leaking in scanners #220
Comments
If any of the above does not make sense, please drop a comment explaining better options.

It seems that we can fix most of the mentioned issues (except parallel files parsing) just by replacing the lazy `readFile` with a strict version.

By the way, do we need
Comments for Monday:
Yep, I agree. I wrote a lot of text in the issue description, but likely the best fix is that simple. Cool that you tried it (hope it was relatively quick to do), and it's interesting to see that we gained so much in memory usage. Btw, how did you see the number of open file descriptors?
Yeah, we ended up with a discrepancy between the documentation and the code, and it is confusing; it would be nice to address this. But I suppose, if we switch to the strict `readFile`… What do you think?
I used
I tried to parallelize file parsing in my repo of 100 markdown files, and found that we can get a certain speedup from it; in my case it's about one second (but I could easily have missed something). So I think, for small files, we can use one thread for IO: read those files strictly and create a spark for parsing, then get to the next file; this way one of our threads will always only read files, and the others will parse them, freeing the memory of processed files. We can distinguish small files from big ones by querying the file's size from the filesystem. So, about

By the way, in our current program almost all of the repo scanning in fact occurs in this line: xrefcheck/src/Xrefcheck/Command.hs, line 82 (at commit 997c438).
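To make the one-reader-plus-sparks idea concrete, here is a minimal sketch (not xrefcheck's actual code; `Result` and `parseFile` are hypothetical stand-ins for the real scanner types): the main thread performs strict reads only, and each file's parsing is sparked so other capabilities can pick it up.

```haskell
{-# LANGUAGE BangPatterns #-}

import Control.Parallel (par)
import qualified Data.Text as T
import qualified Data.Text.IO as T

-- Stand-in for the real (FileInfo, [ScanError]) result.
newtype Result = Result Int

-- Stand-in for the real markdown parser.
parseFile :: T.Text -> Result
parseFile = Result . T.length

scanAll :: [FilePath] -> IO [Result]
scanAll [] = pure []
scanAll (path : rest) = do
  !content <- T.readFile path     -- strict read: the descriptor is closed here
  let res = parseFile content     -- unevaluated thunk holding the file's text
  res `par` ((res :) <$> scanAll rest)  -- spark parsing, move on to the next file
```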
Feel free to read on Monday
I see, this concrete approach is worth remembering 🤔 You rightly spotted that we have two desired ways here, I agree. And nice that you tried out parsing parallelization.

This is what I wanted to hear when I created this ticket 👍

After working on some projects I became a fan of the idea that, instead of providing a concrete narrow interface for the extension's logic (which is what those projects did), it is better to provide a generic interface (to a reasonable extent) plus bridges between that generic interface and particular use cases. In our case, this would mean: leave

This way:
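As a purely illustrative sketch of the "generic interface + bridges" shape (none of these names come from xrefcheck itself):

```haskell
import qualified Data.Text as T

-- Generic interface: a scanner is anything that turns file contents into
-- references plus scan errors.
newtype Scanner ref err = Scanner
  { runScanner :: T.Text -> ([ref], [err]) }

-- Bridge for a narrower use case: scanners that only report references and
-- never fail can be lifted into the generic interface.
fromInfallible :: (T.Text -> [ref]) -> Scanner ref err
fromInfallible f = Scanner (\content -> (f content, []))
```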
Yeah
Problem: Currently we use the lazy `readFile`, and xrefcheck opens all the files immediately for reading, while file descriptors are a finite resource. Solution: Replaced the lazy `readFile` with the strict version.
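For illustration only (this is not xrefcheck's code), the difference between the two reading modes for `Text` might look like this:

```haskell
import qualified Data.Text as T
import qualified Data.Text.IO as T          -- strict reading
import qualified Data.Text.Lazy as TL
import qualified Data.Text.Lazy.IO as TL    -- lazy reading

-- Lazy: the descriptor stays open until the lazy Text is fully forced
-- (or garbage collected), so scanning many files this way can hit the
-- open-file-descriptor limit.
readLazily :: FilePath -> IO ()
readLazily path = do
  contents <- TL.readFile path
  print (TL.take 10 contents)   -- only the beginning is demanded here

-- Strict: the whole file is read and the descriptor is closed before
-- readFile returns; the cost is holding the full contents in memory.
readStrictly :: FilePath -> IO ()
readStrictly path = do
  contents <- T.readFile path
  print (T.length contents)
```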
Btw, this issue is real, see e.g. LIGO repository (failing job)
Clarification and motivation
Problems
This is an issue that I kept in mind for a long time, and since xrefcheck is becoming a serious tool, I think we should have this issue handled.
It is a known minor issue of `readFile` that the file descriptor remains open until the full produced lazy bytestring/text is processed.

An open file descriptor is a resource, and OSes tend to have a limit on the number of file descriptors that can be opened, and maybe a separate limit per process too, meaning that we should be cautious not to exceed these limits.

In the markdown scanner, we use `readFile` and seemingly do not force the computation (we apply parsing, but its result is not forced), meaning that the file descriptor won't be closed before the result of scanning is requested; in the case of large repositories this may be a severe issue.

Another issue similar to that: our markdown parser works on a strict `Text`, meaning that during parsing the entire file's content will be kept in memory, and memory can be treated as a resource too.

How severe is this issue? I'm not sure; documentation usually does not take much space, probably even if it is auto-generated. Documentation size is fundamentally limited by how much a human being can read, so we do not expect dozens of gigabytes here; let's set this concern aside.
Proposed solution
Conceptually, we want two things. One is to minimize the period between the `readFile` start and full parsing, as this is the time when we hold the resource (first the opened descriptor, then the file content in memory).

So from a `ScanAction` object we probably expect that it reads the file as quickly as possible, maybe putting its full content in memory, but is free to return a not fully evaluated result (the `(FileInfo, [ScanError])` pair); forcing it is the caller's responsibility.

What should be done in code
First, let's document the expectations of `ScanAction` and add them to the haddock of this type.

Next, we should make sure that after the `readFile` call finishes, the file is fully read. Since we do not parallelize file reads (for I/O against the filesystem this AFAIU does not make much sense), this will automatically resolve our issue with too many open file descriptors.
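A rough sketch of what these two steps could look like (the real `ScanAction` in xrefcheck may take additional arguments; the haddock wording and helper names here are only a suggestion):

```haskell
import qualified Data.Text as T
import qualified Data.Text.IO as T

-- Stand-ins for xrefcheck's real types.
data FileInfo  = FileInfo
data ScanError = ScanError

-- | Scans a file for references.
--
-- The action must read the file eagerly (e.g. with the strict 'T.readFile'),
-- so that the file descriptor is closed by the time the action returns.
-- The returned pair, however, may be lazy; forcing it is the caller's
-- responsibility.
type ScanAction = FilePath -> IO (FileInfo, [ScanError])

markdownScanner :: ScanAction
markdownScanner path = do
  content <- T.readFile path      -- strict read: descriptor closed here
  pure (parseMarkdown content)    -- parsing result may stay unevaluated
  where
    -- Stand-in for the real markdown parser.
    parseMarkdown :: T.Text -> (FileInfo, [ScanError])
    parseMarkdown _ = (FileInfo, [])
```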
Acceptance criteria
- `ScanAction` is updated to include the mentioned comments about the laziness of each inner action (file read and parsing).
- `ScanAction` is updated respectively.