Storing results and memory errors #304

Open
quaquel opened this issue Oct 31, 2023 · 2 comments

Comments

@quaquel
Owner

quaquel commented Oct 31, 2023

With #299 we get even better support for running on HPC. However, the existing way in which results are stored does not scale well once you go to a very large number of experiments or create high-dimensional data. Presently, the results are stored as a collection of CSVs wrapped in a tarball. The main advantage of this is that the results are easy to extract and open with any text editor or even Excel. It is also a very convenient way of storing results in a cross-platform, cross-language way. However, it breaks down with large outputs because you will run into memory errors.
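To make the memory problem concrete, here is a minimal sketch of building a tarball of CSVs entirely in memory. This is not the actual save_results implementation, and the mapping-of-CSV-texts signature is made up for illustration; it only shows why the approach is memory-hungry: every serialized CSV plus the whole archive live in RAM at once.

```python
import io
import tarfile


def save_results_in_memory(csv_texts, path):
    """Illustrative sketch: csv_texts is a hypothetical mapping of
    filename -> CSV text; the real save_results signature differs."""
    buffer = io.BytesIO()  # the whole gzipped archive accumulates here, in RAM
    with tarfile.open(fileobj=buffer, mode="w:gz") as tar:
        for name, text in csv_texts.items():
            data = text.encode()  # full CSV held in memory as well
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    with open(path, "wb") as f:
        f.write(buffer.getvalue())  # flushed to disk only at the very end
```

With many experiments or high-dimensional outcomes, the in-memory buffer grows with the full size of the results, which is where the memory errors come from.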

A short-term solution is to change save_results. It currently builds the entire tarball in memory before flushing it to disk. A slightly more memory-efficient approach is to create a directory on disk, write each CSV file to it, and then turn the entire directory into a tarball. Some memory profiling is needed to establish how much of a difference this actually makes.
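The short-term fix could look something like the following sketch (again with a hypothetical mapping-of-CSV-texts signature, not the real save_results): each CSV is written to a temporary directory first, so only one file's contents is in memory at a time, and the tarball is then streamed from files already on disk.

```python
import os
import tarfile
import tempfile


def save_results_via_directory(csv_texts, path):
    """Illustrative sketch: write CSVs to a temp directory, then tar the
    directory. csv_texts is a hypothetical mapping of filename -> CSV text."""
    with tempfile.TemporaryDirectory() as tmpdir:
        for name, text in csv_texts.items():
            # only one CSV is held in memory at a time
            with open(os.path.join(tmpdir, name), "w") as f:
                f.write(text)
        # tarfile streams each file from disk, avoiding a big in-memory buffer
        with tarfile.open(path, "w:gz") as tar:
            for name in sorted(os.listdir(tmpdir)):
                tar.add(os.path.join(tmpdir, name), arcname=name)
```

The peak memory then scales with the largest single CSV rather than with the total size of all results.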

A longer-term solution is to add other storage backends where results are flushed to disk as they come in. This avoids having to build up the very large results dataset in memory. The basic machinery for this is in place because of the callback keyword argument that is passed to perform_experiments. However, it probably requires a minor rethink of how to handle the serialization of all classes of outcomes (i.e., to_disk and from_disk). Depending on the chosen storage solution, a slightly different serialization will be required.
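As an illustration of what such a streaming storage backend might look like, here is a hypothetical callback sketch. The actual callback interface of perform_experiments and the to_disk/from_disk serialization hooks differ; this only shows the general idea of flushing each experiment's outcomes to disk as they arrive, so nothing accumulates in memory.

```python
import csv
import os


class FileFlushingCallback:
    """Hypothetical callback sketch, not the workbench's real API:
    appends each experiment's outcomes to a CSV on disk as they arrive."""

    def __init__(self, directory, fieldnames):
        os.makedirs(directory, exist_ok=True)
        self.path = os.path.join(directory, "outcomes.csv")
        self.fieldnames = fieldnames
        with open(self.path, "w", newline="") as f:
            csv.DictWriter(f, fieldnames=fieldnames).writeheader()

    def __call__(self, experiment_id, outcomes):
        # flush one experiment's results immediately; nothing is retained
        row = {"experiment_id": experiment_id, **outcomes}
        with open(self.path, "a", newline="") as f:
            csv.DictWriter(f, fieldnames=self.fieldnames).writerow(row)
```

A design like this would also make partially completed runs recoverable, since everything written so far is already on disk.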

@steipatr
Contributor

steipatr commented Nov 6, 2023

A benefit of the longer-term solution would also be that an error in the experiments (due to an edge case, a divide-by-zero, etc.) doesn't mean you have to redo all experiments.

@quaquel
Owner Author

quaquel commented Nov 10, 2023

I ran a quick test using memray. In my test case, peak memory usage went from 2.6 GB to 1.9 GB, a reduction of roughly 27%. Creating a directory, writing all results to it, and then turning that directory into a tarball thus seems to be an easy way to get a substantial reduction in memory usage.
