Avoid resource leaking in scanners #220

Open
Martoon-00 opened this issue Nov 20, 2022 · 7 comments · May be fixed by #236
Comments


Martoon-00 commented Nov 20, 2022

Clarification and motivation

Problems

This is an issue I have kept in mind for a long time, and since xrefcheck is becoming a serious tool, I think we should get it handled.

It is a known minor issue of the lazy readFile that the file descriptor remains open until the produced lazy bytestring/text has been fully processed.

An open file descriptor is a resource, and OSes tend to limit the number of file descriptors that can be open at once (possibly with a separate per-process limit too), meaning that we should be careful not to exceed these limits.

In the markdown scanner, we use readFile and seemingly do not force the computation (we apply parsing, but its result is not forced), meaning that the file descriptor won't be closed until the result of scanning is requested; for large repositories this may become a severe issue.
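For illustration, a minimal standalone sketch of the hazard (hypothetical code, not the actual xrefcheck implementation):

```haskell
import qualified Data.Text.Lazy as TL
import qualified Data.Text.Lazy.IO as TLIO

-- Hypothetical "scan": the descriptor opened by the lazy readFile stays
-- alive until the returned lazy text is fully consumed, which happens
-- only when the caller eventually forces the result.
scanOne :: FilePath -> IO Int
scanOne path = do
  contents <- TLIO.readFile path     -- opens a file descriptor
  pure (length (TL.lines contents))  -- nothing is forced yet, fd stays open

main :: IO ()
main = do
  -- Mapping this over a large repository opens one descriptor per file
  -- before any of the results is demanded.
  results <- mapM scanOne ["a.md", "b.md"]  -- hypothetical paths
  print (sum results)  -- only here are the files drained and fds released
```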


A related issue: our markdown parser works on a strict Text, meaning that the entire file content is kept in memory during parsing, and memory can be treated as a resource too.

How severe is this issue? I'm not sure; documentation usually does not take much space, probably even when it is auto-generated. Documentation size is fundamentally limited by how much a human being can read, so we do not expect dozens of gigabytes here; let's set this concern aside.

Proposed solution

Conceptually, we want two things:

  1. Minimize the time between the start of readFile and the end of parsing, as this is when we hold the resource (first the open descriptor, then the file content in memory).
  2. In theory, make parsing of different files as parallel as possible.

So from a ScanAction we probably expect that it reads the file as quickly as possible, perhaps loading its full content into memory, but it is free to return a not fully evaluated result (the (FileInfo, [ScanError]) pair). Forcing it is the caller's responsibility.

What should be done in code

First, let's document the expectations for ScanAction by adding them to the haddock of this type.
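For example, the haddock could be extended along these lines (the wording is only a sketch; the type signature is the existing one):

```haskell
-- | Way to parse a file.
--
-- The action must finish reading the file (and release its descriptor)
-- before returning, e.g. by reading the contents strictly.  It is free,
-- however, to return a not fully evaluated pair; forcing the result is
-- the caller's responsibility.
type ScanAction = FilePath -> IO (FileInfo, [ScanError])
```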

Next, we should make sure that once the readFile call finishes, the file has been fully read.

Since we do not parallelize file reads (for I/O against the filesystem this AFAIU does not make much sense), this will automatically resolve our issue with too many open file descriptors.
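For illustration, a minimal sketch of such a scan action (hypothetical code; parse stands for whatever parser a particular scanner uses):

```haskell
import qualified Data.Text as T
import qualified Data.Text.IO as TIO

-- Hypothetical scan action built around a strict read: the file is fully
-- read and its descriptor released before the action returns, while the
-- parse result itself may stay unevaluated until the caller forces it.
scanWithStrictRead :: (T.Text -> result) -> FilePath -> IO result
scanWithStrictRead parse path = do
  contents <- TIO.readFile path  -- strict: the whole file is read here
  pure (parse contents)          -- descriptor already closed at this point
```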

Acceptance criteria

  • Documentation of ScanAction is updated to include the comments above about the laziness of each inner action (file read and parsing).
  • The current implementation of ScanAction is updated accordingly.
Martoon-00 added the good first issue label on Nov 20, 2022

Martoon-00 commented Nov 20, 2022

If any of the above does not make sense, please drop a comment explaining better options.


YuriRomanowski commented Dec 9, 2022

It seems that we can fix most of the mentioned issues (except parallel file parsing) just by replacing the lazy readFile with the strict version. I tried it on a repo with 100 markdown files: the old version opens all 100 file descriptors immediately, while the new one (with strict readFile) keeps only 4 extra fds open. Run time improves slightly too, and so does peak allocated memory (more significantly, by about 1.5x).

YuriRomanowski commented

By the way, do we need readFile inside ScanAction at all? The type description already says "Way to parse a file", and from the type signature FilePath -> IO (FileInfo, [ScanError]) we can see that there will always be a readFile. The only thing we don't know about it is exactly which datatype will be used for the text (ByteString, Text, strict/lazy, etc.), but that does not seem so important. How about making ScanAction totally pure?
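Roughly something like this (the names below are hypothetical, and FileInfo and ScanError are the existing xrefcheck types, assumed in scope):

```haskell
import qualified Data.Text as T
import qualified Data.Text.IO as TIO

-- A fully pure scanner: no IO, just "contents in, results out".
type PureScanAction = T.Text -> (FileInfo, [ScanError])

-- The current IO-based shape can then be recovered in one place, which
-- also fixes how files are read (here: strictly).
toIOScanAction :: PureScanAction -> FilePath -> IO (FileInfo, [ScanError])
toIOScanAction scan path = scan <$> TIO.readFile path
```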


Martoon-00 commented Dec 10, 2022

Comments for Monday:

It seems that we can fix most of the mentioned issues (except parallel files parsing) just by replacing the lazy readFile with the strict version.

Yep, I agree. I wrote a lot of text in the issue description, but likely the best fix is that simple.

Cool that you tried it (I hope it was relatively quick to do), and it's interesting to see that we gained so much on memory.

Btw, how did you see the number of open file descriptors?

By the way, do we need readFile inside ScanAction at all
Even in the type description we can read Way to parse a file, and from the type signature

Yeah, we ended up with a discrepancy between the documentation and the code, and it is confusing; it would be nice to address this.

But I suppose that switching to strict readFile already gives up on interleaving reading and parsing. In the case of Markdown, we can assume that the file content will fit into memory without concern, but in the general case this assumption does not hold, and one may want to process the input on the fly, which seems impossible with a separated Data.Text.IO.readFile and parsing step.

What do you think?


YuriRomanowski commented Dec 11, 2022

Btw, how did you see the number of open file descriptors?

I used ls -l /proc/$PID/fd, counting the output's lines, and adapted it a bit to get a continuous time series.

What do you think?

I tried to parallelize file parsing in my 100-markdown-file repo and found that we can get a certain speedup from it; in my case it's about one second (but I could easily have missed something).

So I think for small files we can use one thread for IO: read the file strictly, create a spark for parsing, then move on to the next file. This way one of our threads will always only read files, while the others parse them, freeing the memory of the processed ByteStrings with the files' contents. For big files we can use the lazy reading/parsing pipeline, which is in fact what we are using currently (though there we cannot get any benefit from parallelism).

We can distinguish small files from big ones by querying the file's size from the filesystem.

So, about ScanAction: it will be pure in this scheme, and we'll decide how to read a file based on its size.
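Roughly like this (only a sketch: par comes from the parallel package, getFileSize from System.Directory, and the size threshold is an arbitrary illustrative value):

```haskell
import Control.Parallel (par)
import qualified Data.Text as T
import qualified Data.Text.IO as TIO
import qualified Data.Text.Lazy as TL
import qualified Data.Text.Lazy.IO as TLIO
import System.Directory (getFileSize)

-- Hypothetical threshold separating "small" files from "big" ones.
smallFileLimit :: Integer
smallFileLimit = 1024 * 1024  -- 1 MiB, an arbitrary value

-- Small files: read strictly (the descriptor is released right away) and
-- spark the parse, so another capability can evaluate it while the reading
-- thread moves on to the next file.
-- Big files: fall back to the lazy reading/parsing pipeline.
scanBySize
  :: (T.Text -> result)   -- ^ parser for strictly read contents
  -> (TL.Text -> result)  -- ^ parser usable on lazily read contents
  -> FilePath
  -> IO result
scanBySize parseStrict parseLazy path = do
  size <- getFileSize path
  if size <= smallFileLimit
    then do
      contents <- TIO.readFile path  -- strict read, fd closed here
      let result = parseStrict contents
      result `par` pure result       -- spark the parse, return it unevaluated
    else parseLazy <$> TLIO.readFile path  -- lazy pipeline for big files
```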

By the way, in our current program almost all the repo scanning in fact occurs in this line

whenJust (nonEmpty $ sortBy (compare `on` seFile) scanErrs) $ reportScanErrs
Haskell! ¯\_(ツ)_/¯

Martoon-00 commented

Feel free to read on Monday

I used ls -l /proc/$PID/fd, counting the output's lines, and adapted it a bit to get continuous time series.

I see, this concrete approach is worth remembering for me 🤔


You rightly spotted that we have two desirable approaches here, I agree. And it's nice that you tried parallelizing the parsing.

and create a spark for parsing

This is what I wanted to hear when I created this ticket 👍

So, about ScanAction, it will be pure in this scheme, and we'll decide how to read file from its size.

After working on some projects, I became a fan of the idea that instead of providing a concrete narrow interface for the extension's logic (which is what those projects did), it is better to provide a generic interface (to a reasonable extent) plus bridges between that generic interface and particular use cases.

In our case this would mean: leave ScanAction a generic IO thing, and add methods for lifting the interface you described into this ScanAction, letting these methods serve as a sort of smart constructor for this type.

This way:

  • The developer of a new scanner can use unusual solutions if they suddenly need to, like calling command-line utilities or skipping part of the file (this is a weak argument for now, though; mostly everything is doable without the IO monad).

  • We can provide different helpers like readStrictlyAndParse, readAndParseOnFly, or selectScanActionBasedOnFileLength, and the implementor of the scanner will be able to compose these to get what they find most appropriate for their language (see the sketch after this list). We can also provide a recommended smart constructor with the behaviour you described.

    So I think it should be up to the developer to decide on a particular scheme (but we can state what we expect and give our recommendations). E.g. in the Markdown case we currently cannot benefit from the lazy solution because our parser accepts strict text; potential problems with memory consumption we will have to mitigate in some other way.

  • Whether to create a spark in the separated approach should, I believe, also depend on the particular scanner. Sometimes creating a spark is justified, sometimes it is just overhead. Sometimes even reading the file is unnecessary: given the file extension, we know the file provides no links or anchors.
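A sketch of what such bridges could look like, using the helper names from above (the signatures are only guesses; FileInfo and ScanError are our existing types, assumed in scope):

```haskell
import qualified Data.Text as T
import qualified Data.Text.IO as TIO
import qualified Data.Text.Lazy as TL
import qualified Data.Text.Lazy.IO as TLIO
import System.Directory (getFileSize)

-- The generic interface stays as it is: scanners live in IO.
type ScanAction = FilePath -> IO (FileInfo, [ScanError])

-- Bridge for scanners whose parser wants the whole file in memory.
readStrictlyAndParse :: (T.Text -> (FileInfo, [ScanError])) -> ScanAction
readStrictlyAndParse parse path = parse <$> TIO.readFile path

-- Bridge for scanners that can consume the input incrementally.
readAndParseOnFly :: (TL.Text -> (FileInfo, [ScanError])) -> ScanAction
readAndParseOnFly parse path = parse <$> TLIO.readFile path

-- Pick one of two prepared actions depending on the file size.
selectScanActionBasedOnFileLength :: Integer -> ScanAction -> ScanAction -> ScanAction
selectScanActionBasedOnFileLength limit small big path = do
  size <- getFileSize path
  if size <= limit then small path else big path
```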

By the way, in our current program almost all the repo scanning in fact occurs in this line

Yeah :trollface:

YuriRomanowski added a commit that referenced this issue Dec 11, 2022
Problem: Currently we use lazy `readFile`, and xrefcheck opens all
the files immediately for reading, while file descriptors are a
finite resource.

Solution: Replace lazy `readFile` with the strict version.
YuriRomanowski linked a pull request on Dec 11, 2022 that will close this issue
Martoon-00 commented

Btw, this issue is real, see e.g. LIGO repository (failing job)
