
Improve concatenation of netCDF files with simulation pickup #3818

Open

wants to merge 5 commits into base: main
Conversation

josuemtzmo
Collaborator

@josuemtzmo josuemtzmo commented Oct 3, 2024

Hello,

I've worked on implementing automatic concatenation of the output when a simulation is picked up from a checkpoint with the flag splitting_files on. This feature addresses the fact that otherwise the simulation output filename needs to be changed manually each time the simulation is restarted.

In other words, if a simulation is set to write output to file.nc with the flag splitting_files, several files will be created, such as file_part1.nc and file_part2.nc. If overwrite_output is true, these files will be rewritten and the data will be deleted. If overwrite_output is false, the simulation will crash since it will not find the original file.nc. The new code ensures that if overwrite_output is false, the model appends the output to the last file, i.e. file_part2.nc.

Before working more on this (i.e. including joining output in jld2), I'm wondering whether this would be useful to implement in the other schedulers and merge into master. What do you think @glwagner @tomchor @navidcy?

@glwagner
Member

glwagner commented Oct 3, 2024

Let me try to reframe the problem; tell me if you agree.

I think it's a real issue that the output files are created when the output writers are instantiated. Because of this, we find ourselves having to write overwrite_existing=false when we pick up from a checkpoint, but overwrite_existing=true otherwise.

This is a hack for sure. It's a failure of the checkpointer design --- the whole point of the design is to make checkpointing easy. We should be able to change just one line. I even think we should be able to pick up from a checkpoint with an environment variable so we don't have to change the run script at all. That would be more robust.

What you're describing sounds like an additional problem of this deficiency.

The fix seems straightforward: we just need to introduce the concept of output "initialization" and delay creating the output file until run!. At that point we know whether we are picking up from a checkpoint, and obviously, if we are picking up from a checkpoint, we don't want to delete any existing files.

It seems like if we introduce initialization we can also handle file splitting. What do you think? If you want to help we can get started on it. I was putting it off myself because I don't have immediate needs for checkpointing but that will probably change pretty soon.
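The delayed-initialization idea could be sketched like this (hypothetical names, not Oceananigans API): a writer records its target file when constructed but only touches the disk when `initialize!` is called from `run!`, at which point the pickup status is known.

```julia
# Hypothetical sketch of deferred output initialization. `DeferredWriter`
# and `initialize!` are illustrative names, not Oceananigans API.
mutable struct DeferredWriter
    filepath::String
    initialized::Bool
end

DeferredWriter(filepath) = DeferredWriter(filepath, false)  # no file touched yet

function initialize!(writer::DeferredWriter; pickup=false)
    writer.initialized && return writer
    if pickup && isfile(writer.filepath)
        # Picking up from a checkpoint: keep the existing file and append later.
    else
        # Fresh run: (re)create an empty file.
        open(io -> nothing, writer.filepath, "w")
    end
    writer.initialized = true
    return writer
end
```

Constructing the writer creates nothing on disk; only `initialize!` does, so the overwrite-versus-append decision is made once the pickup status is known.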

Separately, I also don't like overwrite_existing more broadly, because I feel it's boilerplate, which we are discussing a bit over on #3793.

@josuemtzmo
Collaborator Author

I see the interest in creating a handler for all the outputs of the model. Is the idea to make a wrapper that parses the preexisting writers, i.e. NetCDFOutputWriter, JLD2OutputWriter, and Checkpointer, or do you plan to make a new structure that handles all of them?

I agree that keeping a simple way to create Checkpointers and OutputWriters should be a priority, particularly when the user needs to restart the simulation multiple times on an HPC system while keeping file sizes, wall times, and chunks manageable. Currently, this can be handled by the user creating unique output folders or changing the name of the output file for each resubmission of the simulation. Unfortunately, this complicates post-processing of the output. I think the best solution for outputs and checkpoints would be to automatically append to the last output file from the previous resubmission (currently not supported).

I agree that the initialisation of the writer should not create a file; I find that quite confusing. I believe it would be simple to change by only executing initialize_nc_file! and initialize_jld2_file! once the model is running, instead of twice: once when the writers are initialised and again at run time (write_output!).

Despite this, I still see value in having a flag overwrite_existing, likely set to false by default (I will comment in the discussion about this), and in ensuring that we do not change this flag at run time, since that may result in issues and data loss.

Regarding my changes in this PR, the function is_output_splitted! may still be useful, since it returns the last file to append output to, which is necessary to handle file splitting in a general output function.

@glwagner
Member

glwagner commented Oct 3, 2024

I see the interest in creating a handler for all the outputs of the model. Is the idea to make a wrapper that parses the preexisting writers, i.e. NetCDFOutputWriter, JLD2OutputWriter

If you are referring to #3793 my intent is just to introduce an additional wrapper on top of the existing writers. It's merely an alternative to adding output writers manually to simulation.output_writers, which I find inelegant.

I believe it would be simple to change by only executing initialize_nc_file! and initialize_jld2_file! once the model is running, instead of twice: once when the writers are initialised and again at run time (write_output!).

I agree that it's simple to design.

Despite this, I still see value in having a flag overwrite_existing, likely set to false by default (I will comment in the discussion about this), and in ensuring that we do not change this flag at run time, since that may result in issues and data loss.

Why would it result in (unintended) data loss? Because it's common to mistakenly re-run a simulation?

Regarding my changes in this PR, the function is_output_splitted! may still be useful, since it returns the last file to append output to, which is necessary to handle file splitting in a general output function.

Okay, I'll take a look.

@glwagner
Member

glwagner commented Oct 3, 2024

Currently, this can be handled by the user creating unique output folders or changing the name of the output file for each resubmission of the simulation.

Trying to understand this workflow. You're saying that for a single run, you would create multiple directories, with a new directory for each time a simulation is restarted?

@josuemtzmo
Collaborator Author

josuemtzmo commented Oct 3, 2024

Currently in my workflow, I define an environment variable that contains a string like $init_$end.nc, where init is the initial time and end is the stop_time of the simulation. In the first instantiation of the model for 60 days, this string is appended to my output filename, e.g. vel_field_, resulting in something like vel_field_0_60.nc. For the next pickup the filename is changed to vel_field_60_120.nc. This works well when the files are not split, i.e. file_splitting = NoFileSplitting(). However, when splitting the file, e.g. file_splitting = TimeInterval(30days), the first instantiation results in the following files:

vel_fields_0_60_part1.nc  vel_fields_0_60_part2.nc  vel_fields_0_60_part3.nc

then the next pickup will result in files:

vel_fields_60_120_part1.nc  vel_fields_60_120_part3.nc  vel_fields_60_120_part5.nc
vel_fields_60_120_part2.nc  vel_fields_60_120_part4.nc

Of these files, the first two parts (vel_fields_60_120_part1.nc and vel_fields_60_120_part2.nc) are empty, and vel_fields_60_120_part3.nc contains data from day 61 to day 91. This makes loading the data for post-processing a bit tricky, since some of the files are empty, and the uneven sizes on disk make chunking the data more difficult.

If I don't change the name each time I pick up, then with overwrite_existing=false it crashes, and with overwrite_existing=true it rewrites all the files.

Another workflow would be to create a new folder for each pickup while keeping the same filename, but splitting the files would result in the same file-duplication issue.
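The filename construction in this workflow can be sketched as follows. `INIT_TIME` and `END_TIME` are hypothetical environment-variable names chosen for illustration; the defaults reproduce the first segment described above.

```julia
# Sketch of the environment-variable filename workflow described above.
# INIT_TIME / END_TIME are hypothetical variable names; each resubmission
# of the job would export new values before launching Julia.
init = get(ENV, "INIT_TIME", "0")
stop = get(ENV, "END_TIME", "60")
filename = "vel_field_" * init * "_" * stop * ".nc"
# First segment:  INIT_TIME=0,  END_TIME=60  -> vel_field_0_60.nc
# Next pickup:    INIT_TIME=60, END_TIME=120 -> vel_field_60_120.nc
```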

I hope this description of the workflow is clearer.

@josuemtzmo
Collaborator Author

josuemtzmo commented Oct 3, 2024

Perhaps there is no need for an environment variable that modifies the filename; instead, this is something that could be done when the user is using a checkpoint. But rather than using this convoluted workflow, which requires renaming and deleting files, I think the optimal implementation is to append to the previously existing file when using checkpoints. If the filename is vel_fields.nc, the expected files after the second pickup should be:

vel_fields_part1.nc  vel_fields_part3.nc  vel_fields_part5.nc 
vel_fields_part2.nc  vel_fields_part4.nc 

@josuemtzmo
Collaborator Author

josuemtzmo commented Oct 3, 2024

A possible issue with this implementation may occur if the simulation crashes after already writing some output to the last file. However, in the first call of the writer we could check whether the last stored time is smaller than the current time to be appended.
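That safety check could be sketched like this (pure index logic only; reading the time variable from the NetCDF file is left out, and `truncation_index` is a hypothetical helper name):

```julia
# Sketch of the crash-recovery check proposed above: when picking up, keep
# only the stored times strictly before the pickup time; everything from the
# pickup time onward will be rewritten by the restarted run.
function truncation_index(stored_times, pickup_time)
    # Index of the last entry that predates the pickup; `nothing` if the
    # whole file would be rewritten.
    return findlast(t -> t < pickup_time, stored_times)
end

stored_times = [0.0, 30.0, 60.0, 90.0]  # simulation crashed after writing t = 90
pickup_time = 61.0                      # checkpoint restores the run at t = 61
truncation_index(stored_times, pickup_time)  # → 3 (keep t = 0, 30, 60)
```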

@glwagner
Member

glwagner commented Oct 3, 2024

I hope this description of the workflow is clearer.

It's clear!

I thought the optimal implementation is to append to the previous existing file when using checkpoints.

Of course I agree with this. I think our basic principle should be that there is no difference in workflow, either for running or analysis, between scripts for checkpointing or not. Also, I think that overwrite_existing should not delete your files if you are picking up from a checkpoint (regardless of what it is set to). Do you agree?

If I don't change the name each time I pickup, with the overwrite_existing=false it crashes, and with overwrite_existing=true it rewrites all the files.

If we have a system that does not overwrite files when picking up from a checkpoint, do you think that it will be necessary to have overwrite_existing=false by default? In this case, the only time we risk overwriting files is when we re-run a simulation from the beginning.

Assume for the following discussion that the "overwriting issue" is solved for checkpointing:

In my experience, I typically set up a script with overwrite_existing=true from the very beginning. This is because when we first write a script, we are prototyping. I have never then gone back and changed overwrite_existing=false because of some concern about overwriting data. It's inconvenient, in fact, to set overwrite_existing=false.

My thought is that it makes more sense to ask people to intentionally request overwrite_existing=false in the rare case that this is desired. The key insight is that we spend the vast majority of our time prototyping. I would even argue from an economic standpoint that the productivity saved by this setting outweighs any rare instances of lost data, if they would ever actually occur due to changing the default.

Also, even for this purpose I don't really want a "non-overwriting guarantee" as a feature. I would prefer something like unique_filenames = true, which I could then "turn on" if I got into a situation where I was re-running expensive simulations.

@glwagner
Member

glwagner commented Oct 3, 2024

Also curious --- do you split files to make them easier to download? I'm wondering if there is a better way. For example it would not be hard to design a utility that would automatically separate a file along some axis (such as the time axis) to produce smaller files for the purpose of downloading.

For example something like

download_chunk(download_spec, time_window=(0, 60days))

where download_spec has the filename, remote directory, remote address, etc.
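The window-selection part of such a utility could be sketched as follows (`time_window_indices` is a hypothetical helper; the actual copying of the slice into a smaller NetCDF file, and the `download_spec` plumbing, are left out):

```julia
# Sketch of the `download_chunk` idea: find the index range of the time
# axis that falls inside the requested window, which a utility could then
# use to copy just that slice into a smaller file for download.
function time_window_indices(times, window)
    t0, t1 = window
    return findall(t -> t0 <= t <= t1, times)
end

times = 0.0:10.0:120.0                  # stand-in for the file's time variable
time_window_indices(times, (0.0, 60.0)) # → [1, 2, 3, 4, 5, 6, 7]
```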

@josuemtzmo
Collaborator Author

Of course I agree with this. I think our basic principle should be that there is no difference in workflow, either for running or analysis, between scripts for checkpointing or not. Also, I think that overwrite_existing should not delete your files if you are picking up from a checkpoint (regardless of what it is set to). Do you agree?

Yes, I completely agree, if the simulation picks up from a checkpoint it should not delete data.

Assume for the following discussion that the "overwriting issue" is solved for checkpointing:

In my experience, I typically set up a script with overwrite_existing=true from the very beginning. This is because when we first write a script, we are prototyping. I have never then gone back and changed overwrite_existing=false because of some concern about overwriting data. It's inconvenient, in fact, to set overwrite_existing=false.

My thought is that it makes more sense to ask people to intentionally request overwrite_existing=false in the rare case that this is desired. The key insight is that we spend the vast majority of our time prototyping. I would even argue from an economic standpoint that the productivity saved by this setting outweighs any rare instances of lost data, if they would ever actually occur due to changing the default.

Now it is clear. If we handle picking up simulations properly, I completely agree with you that the best option is for the user to manually set overwrite_existing=false to protect their own data. I also follow the same workflow when creating a simulation: I rely on overwrite_existing=true until I reach the "production" phase of the simulation.

Also curious --- do you split files to make them easier to download?

I mostly do it to keep a consistent size and chunking across files for post-processing, but supporting both would make the user workflow more flexible, particularly when running long simulations with multiple pickups. The post-processing you suggest would likely be useful, but in that case it may be more useful to add support for Zarr files among the possible simulation outputs (https://github.com/JuliaIO/Zarr.jl).

@josuemtzmo
Collaborator Author

Ok, now that this is clear, I'll clone and modify PR #3793.

@glwagner
Member

glwagner commented Oct 4, 2024

but in that case it may be more useful to add support for Zarr files among the possible simulation outputs (https://github.com/JuliaIO/Zarr.jl).

A ZarrOutputWriter would be pretty cool!

@glwagner
Member

glwagner commented Oct 4, 2024

I mostly do it to keep a consistent size and chunks across files for post-processing

Ok! Curious how this will evolve; we eventually hope to support FieldTimeSeries for NetCDF, which will give us some options to design various productivity features...

@josuemtzmo
Collaborator Author

PR #3820 builds upon PR #3793. I will close this PR in favour of #3793 and #3820. I will create an issue for the ZarrOutputWriter.

@glwagner
Member

glwagner commented Oct 4, 2024

Sorry, I don't understand. Why are we closing this? I prefer small PRs to large ones and this one seems good.

@josuemtzmo josuemtzmo reopened this Oct 4, 2024
@glwagner
Member

glwagner commented Oct 4, 2024

@ali-ramadhan might be good for you to take a look

Member

@glwagner glwagner left a comment


Do we test this?

@navidcy navidcy changed the title Improve concatenation of netcdf files with simulation pickup Improve concatenation of netCDF files with simulation pickup Oct 7, 2024
# Update schedule based on user input
is_output_splitted!(schedule, filepath) = nothing

function is_output_splitted!(schedule::TimeInterval, filepath, overwrite_existing)
Member

@ali-ramadhan ali-ramadhan Oct 7, 2024


I think my only comment is that I find this function name confusing: is_output_splitted! suggests it will return a boolean true or false, but it actually modifies a schedule and returns a part number and a filepath?

@ali-ramadhan
Member

Thanks for working on this @josuemtzmo! I do tend to avoid file splitting since one file, even if it is huge, simplifies data analysis. And most of the time, the data analysis can be done on the fly alleviating the need for huge outputs.

Do we test this?

I don't think so. Definitely a good idea to do so since a wrong implementation can result in data loss.

Usually Julia would stop and then re-run the entire script, so would a test look something like: set up a simulation with a checkpointer, run it for some iterations with some file-splitting output, then set up the exact same simulation (copy-paste) and run it for some more iterations and more file splitting, then check that all the output is correct?

There may be some edge cases too, e.g. zero or only one output actuation after picking up, or before the initial simulation ends.
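The test described above might be outlined like this. This is an untested sketch assuming the behavior this PR aims to enable (appending to split files on pickup); the grid size, output choice, schedules, filenames, and the use of IterationInterval for file_splitting are illustrative, not verified against the PR.

```julia
# Hedged outline of a pickup test for split NetCDF output. Constructor
# keywords follow standard Oceananigans usage, but this sketch is untested.
using Oceananigans

function run_segment(stop_iteration; pickup)
    grid = RectilinearGrid(size=(4, 4, 4), extent=(1, 1, 1))
    model = NonhydrostaticModel(; grid)
    simulation = Simulation(model; Δt=1, stop_iteration)

    simulation.output_writers[:fields] =
        NetCDFOutputWriter(model, (; u = model.velocities.u);
                           filename = "test_pickup.nc",
                           schedule = IterationInterval(1),
                           file_splitting = IterationInterval(5),
                           overwrite_existing = !pickup)

    simulation.output_writers[:checkpointer] =
        Checkpointer(model; schedule=IterationInterval(10), prefix="test_checkpoint")

    run!(simulation; pickup)
end

run_segment(10, pickup=false)  # first run: should produce the first parts
run_segment(20, pickup=true)   # pickup: should append new parts, not overwrite
# Finally: open each part and check that iterations are contiguous,
# no part is empty, and no pre-pickup data was overwritten.
```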

@josuemtzmo
Collaborator Author

@ali-ramadhan Thanks for your suggestion; I finally had some time to work on this again. I agree that the name is not ideal; would it be better to call it find_existing_splitted_output?

@glwagner
Member

@ali-ramadhan Thanks for your suggestion; I finally had some time to work on this again. I agree that the name is not ideal; would it be better to call it find_existing_splitted_output?

What does the function do, and what does it return?

@josuemtzmo
Collaborator Author

josuemtzmo commented Oct 15, 2024

It returns an Int with the number of files, or file divisions, already performed, together with the name of the last file. If the files are file_part_1.nc, file_part_2.nc, file_part_3.nc, the output will be 3 and file_part_3.nc.

@glwagner
Member

It returns an Int with the number of files, or file divisions, already performed, together with the name of the last file. If the files are file_part_1.nc, file_part_2.nc, file_part_3.nc, the output will be 3 and file_part_3.nc.

How about calling it split_files or number_of_split_files?

It could also improve clarity to split it into two functions: one to return the number of split files, and the other to return the current_split_filename?
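The suggested split could be sketched like this; the helper names are hypothetical, the `_part` naming follows this PR's convention, and a shared prefix helper keeps the duplication small:

```julia
# Hypothetical sketch of the two-function split. A common helper computes
# the "_part" prefix, so neither public function duplicates the search logic.
split_file_prefix(filepath) = first(split(basename(filepath), ".")) * "_part"

function split_files(filepath, folder=dirname(filepath))
    prefix = split_file_prefix(filepath)
    return filter(startswith(prefix), readdir(folder))
end

number_of_split_files(filepath) = length(split_files(filepath))

function current_split_filename(filepath)
    files = split_files(filepath)
    isempty(files) && return nothing
    # Sort by numeric part index rather than lexicographically, so that
    # "_part10" correctly sorts after "_part2".
    partnum(f) = parse(Int, match(r"_part(\d+)", f).captures[1])
    return argmax(partnum, files)  # argmax(f, itr) requires Julia ≥ 1.7
end
```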

@josuemtzmo
Collaborator Author

I've been thinking about how to divide the function, but it would require duplicating the lines:

    filename = first(split(basename(filepath),".")) * "_part"
    filelist = filter(startswith(filename), readdir(folder))
    existing_files = length(filelist) > 0

elsewhere in the code. How important is avoiding duplication versus clarity?

@glwagner
Member

I've been thinking about how to divide the function, but it would require duplicating the lines:

    filename = first(split(basename(filepath),".")) * "_part"
    filelist = filter(startswith(filename), readdir(folder))
    existing_files = length(filelist) > 0

elsewhere in the code. How important is avoiding duplication versus clarity?

Clarity matters most. But usually there are no strong trade-offs: duplication improves clarity only if the duplicated code is short and easy to understand.

How is this:

filename = first(split(basename(filepath),".")) * "_part"

different from current_filename?

For clarity I think it's best to use more lines for complicated operations rather than concatenating many operations into a single line. That line does a lot of things at once: basename, split with ".", first, and finally concatenation with _part. That's quite a few things for one line.
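For example, the one-liner could be unpacked into named steps with the same behavior (the variable names here are illustrative):

```julia
# The single line above, unpacked step by step.
filepath = "output/vel_fields.nc"
file = basename(filepath)                         # "vel_fields.nc"
name_without_extension = first(split(file, "."))  # "vel_fields"
filename = name_without_extension * "_part"       # "vel_fields_part"
```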
