
Improve concatenation of netCDF files with simulation pickup #3818

Open

wants to merge 5 commits into base: main
Conversation

josuemtzmo
Collaborator

@josuemtzmo josuemtzmo commented Oct 3, 2024

Hello,

I've worked on implementing automatic concatenation of the output when a simulation is picked up from a checkpoint with the flag splitting_files on. This feature addresses the fact that otherwise the simulation output filename needs to be changed manually each time the simulation is restarted.

In other words, if a simulation is set to write output to file.nc with the flag splitting_files, several files will be created, such as file_part1.nc and file_part2.nc. If overwrite_output is true, these files will be rewritten and the data will be deleted. If overwrite_output is false, the simulation will crash since it will not find the original file.nc. The new code ensures that if overwrite_output is false, the model appends the output to the last file, i.e. file_part2.nc.

Before working more on this (i.e. including joining output in jld2), I'm wondering whether this would be useful to implement in the other schedulers and merge into master. What do you think @glwagner @tomchor @navidcy?

@glwagner
Member

glwagner commented Oct 3, 2024

Let me try to reframe the problem; tell me if you agree.

I think it's a real issue that the output files are created when the output writers are instantiated. Because of this, we find ourselves having to write overwrite_existing=false when we pick up from a checkpoint, but overwrite_existing=true otherwise.

This is a hack for sure. It's a failure of the checkpointer design --- the whole point of the design is to make checkpointing easy. We should be able to change just one line. I even think we should be able to pick up from a checkpoint with an environment variable so we don't have to change the run script at all. That would be more robust.

What you're describing sounds like an additional problem of this deficiency.

The fix seems straightforward: we just need to introduce the concept of output "initialization" and delay creating the output file until run!. At that point we know whether we are picking up from a checkpoint, and obviously, if we are picking up from a checkpoint, we don't want to delete any existing files.

It seems like if we introduce initialization we can also handle file splitting. What do you think? If you want to help we can get started on it. I was putting it off myself because I don't have immediate needs for checkpointing but that will probably change pretty soon.
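The delayed-initialization idea could be sketched like this (hypothetical names, not Oceananigans API): a writer records its target file when constructed but only touches the disk when `initialize!` is called from `run!`, at which point the pickup status is known.

```julia
# Hypothetical sketch of deferred output initialization. `DeferredWriter`
# and `initialize!` are illustrative names, not Oceananigans API.
mutable struct DeferredWriter
    filepath::String
    initialized::Bool
end

DeferredWriter(filepath) = DeferredWriter(filepath, false)  # no file touched yet

function initialize!(writer::DeferredWriter; pickup=false)
    writer.initialized && return writer
    if pickup && isfile(writer.filepath)
        # Picking up from a checkpoint: keep the existing file and append later.
    else
        # Fresh run: (re)create an empty file.
        open(io -> nothing, writer.filepath, "w")
    end
    writer.initialized = true
    return writer
end
```

Constructing the writer creates nothing on disk; only `initialize!` does, so the overwrite-versus-append decision is made once the pickup status is known.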

Separately, I also don't like overwrite_existing more broadly, because I feel it's boilerplate, which we are discussing a bit over on #3793.

@josuemtzmo
Collaborator Author

I see the interest in creating a handler for all the outputs of the model. Is the idea to make a wrapper that parses the preexisting writers, i.e. NetCDFOutputWriter, JLD2OutputWriter, and Checkpointer, or do you plan to make a new structure that handles all of them?

I agree that keeping a simple way to create Checkpointers and OutputWriters should be a priority, particularly when the user needs to restart the simulation multiple times on an HPC system while keeping file sizes, wall times, and chunks manageable. Currently, this can be handled by the user creating unique output folders or changing the name of the output file for each resubmission of the simulation. Unfortunately, this complicates post-processing of the output. I think the best solution for outputs and checkpoints would be to automatically append to the last output file from the previous resubmission (currently not supported).

I agree that the initialisation of the writer should not create a file; I find that quite confusing. I believe it would be simple to change by only executing initialize_nc_file! and initialize_jld2_file! once the model is running, instead of twice: once when the writers are initialised and again at run time (write_output!).

Despite this, I still see value in having a flag overwrite_existing, likely set to false by default (I will comment in the discussion about this), and in ensuring that we do not change this flag at run time, since that may result in issues and data loss.

Regarding my changes in this PR, the function is_output_splitted! may still be useful, since it returns the last file to append output to, which is necessary to handle file splitting in a general output function.

@glwagner
Member

glwagner commented Oct 3, 2024

I see the interest in creating a handler for all the outputs of the model. Is the idea to make a wrapper that parses the preexisting writers, i.e. NetCDFOutputWriter, JLD2OutputWriter

If you are referring to #3793 my intent is just to introduce an additional wrapper on top of the existing writers. It's merely an alternative to adding output writers manually to simulation.output_writers, which I find inelegant.

I believe it would be simple to change by only executing initialize_nc_file! and initialize_jld2_file! once the model is running, instead of twice: once when the writers are initialised and again at run time (write_output!).

I agree that it's simple to design.

Despite this, I still see value in having a flag overwrite_existing, likely set to false by default (I will comment in the discussion about this), and in ensuring that we do not change this flag at run time, since that may result in issues and data loss.

Why would it result in (unintended) data loss? Because it's common to mistakenly re-run a simulation?

Regarding my changes in this PR, the function is_output_splitted! may still be useful, since it returns the last file to append output to, which is necessary to handle file splitting in a general output function.

Okay, I'll take a look.

@glwagner
Member

glwagner commented Oct 3, 2024

Currently, this can be handled by the user creating unique output folders or changing the name of the output file for each resubmission of the simulation.

Trying to understand this workflow. You're saying that for a single run, you would create multiple directories, with a new directory for each time a simulation is restarted?

@josuemtzmo
Collaborator Author

josuemtzmo commented Oct 3, 2024

Currently in my workflow, I define an environment variable that contains a string like $init_$end.nc, where init is the initial time and end is the stop_time of the simulation. In the first instantiation of the model for 60 days, this string is appended to my output filename, e.g. vel_field_, resulting in something like vel_field_0_60.nc. For the next pickup the filename is changed to vel_field_60_120.nc. This works well when the files are not split, i.e. file_splitting = NoFileSplitting(). However, when splitting the file, e.g. file_splitting = TimeInterval(30days), the first instantiation results in the following files:

vel_fields_0_60_part1.nc  vel_fields_0_60_part2.nc  vel_fields_0_60_part3.nc

then the next pickup will result in files:

vel_fields_60_120_part1.nc  vel_fields_60_120_part3.nc  vel_fields_60_120_part5.nc
vel_fields_60_120_part2.nc  vel_fields_60_120_part4.nc

Of these files, the first two parts (vel_fields_60_120_part1.nc and vel_fields_60_120_part2.nc) are empty, and vel_fields_60_120_part3.nc contains data from day 61 to day 91. This makes loading the data for post-processing a bit tricky, since some of the files are empty, and the uneven sizes on disk make chunking the data more difficult.

If I don't change the name each time I pick up, then with overwrite_existing=false it crashes, and with overwrite_existing=true it rewrites all the files.

Another workflow would be to create a new folder for each pickup while keeping the same filename, but splitting the files would result in the same file-duplication issue.
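The filename construction in this workflow can be sketched as follows. `INIT_TIME` and `END_TIME` are hypothetical environment-variable names chosen for illustration; the defaults reproduce the first segment described above.

```julia
# Sketch of the environment-variable filename workflow described above.
# INIT_TIME / END_TIME are hypothetical variable names; each resubmission
# of the job would export new values before launching Julia.
init = get(ENV, "INIT_TIME", "0")
stop = get(ENV, "END_TIME", "60")
filename = "vel_field_" * init * "_" * stop * ".nc"
# First segment:  INIT_TIME=0,  END_TIME=60  -> vel_field_0_60.nc
# Next pickup:    INIT_TIME=60, END_TIME=120 -> vel_field_60_120.nc
```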

I hope this description of the workflow is clearer.

@josuemtzmo
Collaborator Author

josuemtzmo commented Oct 3, 2024

Perhaps there is no need for an environment variable that modifies the filename; instead, this is something that could be done when the user is using a checkpoint. But rather than using this convoluted workflow, which requires renaming and deleting files, I think the optimal implementation is to append to the previously existing file when using checkpoints. If the filename is vel_fields.nc, the expected files after the second pickup should be:

vel_fields_part1.nc  vel_fields_part3.nc  vel_fields_part5.nc 
vel_fields_part2.nc  vel_fields_part4.nc 

@josuemtzmo
Collaborator Author

josuemtzmo commented Oct 3, 2024

A possible issue with this implementation may occur if the simulation crashes after already writing some output to the last file. However, in the first call of the writer we could check whether the last stored time is smaller than the current time to be appended.
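That safety check could be sketched like this (pure index logic only; reading the time variable from the NetCDF file is left out, and `truncation_index` is a hypothetical helper name):

```julia
# Sketch of the crash-recovery check proposed above: when picking up, keep
# only the stored times strictly before the pickup time; everything from the
# pickup time onward will be rewritten by the restarted run.
function truncation_index(stored_times, pickup_time)
    # Index of the last entry that predates the pickup; `nothing` if the
    # whole file would be rewritten.
    return findlast(t -> t < pickup_time, stored_times)
end

stored_times = [0.0, 30.0, 60.0, 90.0]  # simulation crashed after writing t = 90
pickup_time = 61.0                      # checkpoint restores the run at t = 61
truncation_index(stored_times, pickup_time)  # → 3 (keep t = 0, 30, 60)
```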

@glwagner
Member

glwagner commented Oct 3, 2024

I hope this description of the workflow is clearer.

It's clear!

I thought the optimal implementation is to append to the previous existing file when using checkpoints.

Of course I agree with this. I think our basic principle should be that there is no difference in workflow, either for running or analysis, between scripts for checkpointing or not. Also, I think that overwrite_existing should not delete your files if you are picking up from a checkpoint (regardless of what it is set to). Do you agree?

If I don't change the name each time I pickup, with the overwrite_existing=false it crashes, and with overwrite_existing=true it rewrites all the files.

If we have a system that does not overwrite files when picking up from a checkpoint, do you think that it will be necessary to have overwrite_existing=false by default? In this case, the only time we risk overwriting files is when we re-run a simulation from the beginning.

Assume for the following discussion that the "overwriting issue" is solved for checkpointing:

In my experience, I typically set up a script with overwrite_existing=true from the very beginning. This is because when we first write a script, we are prototyping. I have never then gone back and changed overwrite_existing=false because of some concern about overwriting data. It's inconvenient, in fact, to set overwrite_existing=false.

My thought is that it makes more sense to ask people to intentionally request overwrite_existing=false in the rare case that this is desired. The key insight is that we spend the vast majority of our time prototyping. I would even argue from an economic standpoint that the productivity saved by this setting outweighs any rare instances of lost data, if they would ever actually occur due to changing the default.

Also, even for this purpose I don't really want a "non-overwriting guarantee" as a feature. I would prefer something like unique_filenames = true, which I could then "turn on" if I got into a situation where I was re-running expensive simulations.

@glwagner
Member

glwagner commented Oct 3, 2024

Also curious --- do you split files to make them easier to download? I'm wondering if there is a better way. For example it would not be hard to design a utility that would automatically separate a file along some axis (such as the time axis) to produce smaller files for the purpose of downloading.

For example something like

download_chunk(download_spec, time_window=(0, 60days))

where download_spec has the filename, remote directory, remote address, etc.
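The window-selection part of such a utility could be sketched as follows (`time_window_indices` is a hypothetical helper; the actual copying of the slice into a smaller NetCDF file, and the `download_spec` plumbing, are left out):

```julia
# Sketch of the `download_chunk` idea: find the index range of the time
# axis that falls inside the requested window, which a utility could then
# use to copy just that slice into a smaller file for download.
function time_window_indices(times, window)
    t0, t1 = window
    return findall(t -> t0 <= t <= t1, times)
end

times = 0.0:10.0:120.0                  # stand-in for the file's time variable
time_window_indices(times, (0.0, 60.0)) # → [1, 2, 3, 4, 5, 6, 7]
```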

@josuemtzmo
Collaborator Author

Of course I agree with this. I think our basic principle should be that there is no difference in workflow, either for running or analysis, between scripts for checkpointing or not. Also, I think that overwrite_existing should not delete your files if you are picking up from a checkpoint (regardless of what it is set to). Do you agree?

Yes, I completely agree, if the simulation picks up from a checkpoint it should not delete data.

Assume for the following discussion that the "overwriting issue" is solved for checkpointing:

In my experience, I typically set up a script with overwrite_existing=true from the very beginning. This is because when we first write a script, we are prototyping. I have never then gone back and changed overwrite_existing=false because of some concern about overwriting data. It's inconvenient, in fact, to set overwrite_existing=false.

My thought is that it makes more sense to ask people to intentionally request overwrite_existing=false in the rare case that this is desired. The key insight is that we spend the vast majority of our time prototyping. I would even argue from an economic standpoint that the productivity saved by this setting outweighs any rare instances of lost data, if they would ever actually occur due to changing the default.

Now it is clear. If we handle picking up simulations properly, I completely agree with you that the best option is for the user to manually set overwrite_existing=false to protect their own data. I also follow the same workflow when creating a simulation: I rely on overwrite_existing=true until I reach the "production" phase of the simulation.

Also curious --- do you split files to make them easier to download?

I mostly do it to keep a consistent size and chunking across files for post-processing, but supporting both would make the user workflow more flexible, particularly when running long simulations with multiple pickups. The post-processing you suggest would likely be useful, but in that case it may be more useful to add support for Zarr files among the possible simulation outputs (https://github.com/JuliaIO/Zarr.jl).

@josuemtzmo
Collaborator Author

Ok, now that this is clear, I'll clone and modify PR #3793.

@glwagner
Member

glwagner commented Oct 4, 2024

but in that case it may be more useful to add support for Zarr files among the possible simulation outputs (https://github.com/JuliaIO/Zarr.jl).

A ZarrOutputWriter would be pretty cool!

@glwagner
Member

glwagner commented Oct 4, 2024

I mostly do it to keep a consistent size and chunks across files for post-processing

Ok! Curious how this will evolve; we eventually hope to support FieldTimeSeries for NetCDF, which will give us some options to design various productivity features...

@josuemtzmo
Collaborator Author

PR #3820 builds upon PR #3793. I will close this PR in favour of #3793 and #3820. I will create an issue for the ZarrOutputWriter.

@glwagner
Member

glwagner commented Oct 4, 2024

Sorry, I don't understand. Why are we closing this? I prefer small PRs to large ones and this one seems good.

@josuemtzmo josuemtzmo reopened this Oct 4, 2024
@glwagner
Member

glwagner commented Oct 4, 2024

@ali-ramadhan might be good for you to take a look

Member

@glwagner glwagner left a comment


Do we test this?

@navidcy navidcy changed the title Improve concatenation of netcdf files with simulation pickup Improve concatenation of netCDF files with simulation pickup Oct 7, 2024
# Update schedule based on user input
is_output_splitted!(schedule, filepath) = nothing

function is_output_splitted!(schedule::TimeInterval, filepath, overwrite_existing)
Member

@ali-ramadhan ali-ramadhan Oct 7, 2024


I think my only comment is that I find this function name confusing: is_output_splitted! suggests it will return a boolean true or false, but it actually modifies a schedule and returns a part number and a filepath?

@ali-ramadhan
Member

Thanks for working on this @josuemtzmo! I do tend to avoid file splitting since one file, even if it is huge, simplifies data analysis. And most of the time, the data analysis can be done on the fly alleviating the need for huge outputs.

Do we test this?

I don't think so. Definitely a good idea to do so since a wrong implementation can result in data loss.

Usually Julia would stop and then re-run the entire script, so would a test look something like: set up a simulation with a checkpointer, run it for some iterations with some file-splitting output, then set up the exact same simulation (copy-paste) and run it for some more iterations and more file splitting, then check that all the output is correct?

There may be some edge cases too, e.g. zero or only one output actuation after picking up, or before the initial simulation ends.
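The test described above might be outlined like this. This is an untested sketch assuming the behavior this PR aims to enable (appending to split files on pickup); the grid size, output choice, schedules, filenames, and the use of IterationInterval for file_splitting are illustrative, not verified against the PR.

```julia
# Hedged outline of a pickup test for split NetCDF output. Constructor
# keywords follow standard Oceananigans usage, but this sketch is untested.
using Oceananigans

function run_segment(stop_iteration; pickup)
    grid = RectilinearGrid(size=(4, 4, 4), extent=(1, 1, 1))
    model = NonhydrostaticModel(; grid)
    simulation = Simulation(model; Δt=1, stop_iteration)

    simulation.output_writers[:fields] =
        NetCDFOutputWriter(model, (; u = model.velocities.u);
                           filename = "test_pickup.nc",
                           schedule = IterationInterval(1),
                           file_splitting = IterationInterval(5),
                           overwrite_existing = !pickup)

    simulation.output_writers[:checkpointer] =
        Checkpointer(model; schedule=IterationInterval(10), prefix="test_checkpoint")

    run!(simulation; pickup)
end

run_segment(10, pickup=false)  # first run: should produce the first parts
run_segment(20, pickup=true)   # pickup: should append new parts, not overwrite
# Finally: open each part and check that iterations are contiguous,
# no part is empty, and no pre-pickup data was overwritten.
```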

@josuemtzmo
Collaborator Author

@ali-ramadhan Thanks for your suggestion; I finally had some time to work on this again. I agree that the name is not ideal; would it be better to call it find_existing_splitted_output?

@glwagner
Member

@ali-ramadhan Thanks for your suggestion; I finally had some time to work on this again. I agree that the name is not ideal; would it be better to call it find_existing_splitted_output?

What does the function do, and what does it return?

@josuemtzmo
Collaborator Author

josuemtzmo commented Oct 15, 2024

It returns an Int with the number of files, or file divisions, already performed, together with the name of the last file. If the files are file_part_1.nc, file_part_2.nc, file_part_3.nc, the output will be 3 and file_part_3.nc.

@glwagner
Member

It returns an Int with the number of files, or file divisions, already performed, together with the name of the last file. If the files are file_part_1.nc, file_part_2.nc, file_part_3.nc, the output will be 3 and file_part_3.nc.

How about calling it split_files or number_of_split_files?

It could also improve clarity to split it into two functions: one to return the number of split files, and the other to return the current_split_filename?
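The suggested split could be sketched like this; the helper names are hypothetical, the `_part` naming follows this PR's convention, and a shared prefix helper keeps the duplication small:

```julia
# Hypothetical sketch of the two-function split. A common helper computes
# the "_part" prefix, so neither public function duplicates the search logic.
split_file_prefix(filepath) = first(split(basename(filepath), ".")) * "_part"

function split_files(filepath, folder=dirname(filepath))
    prefix = split_file_prefix(filepath)
    return filter(startswith(prefix), readdir(folder))
end

number_of_split_files(filepath) = length(split_files(filepath))

function current_split_filename(filepath)
    files = split_files(filepath)
    isempty(files) && return nothing
    # Sort by numeric part index rather than lexicographically, so that
    # "_part10" correctly sorts after "_part2".
    partnum(f) = parse(Int, match(r"_part(\d+)", f).captures[1])
    return argmax(partnum, files)  # argmax(f, itr) requires Julia ≥ 1.7
end
```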

@josuemtzmo
Collaborator Author

I've been thinking about how to divide the function, but it would require duplicating the lines:

    filename = first(split(basename(filepath),".")) * "_part"
    filelist = filter(startswith(filename), readdir(folder))
    existing_files = length(filelist) > 0

elsewhere in the code. How important is avoiding duplication versus clarity?

@glwagner
Member

I've been thinking about how to divide the function, but it would require duplicating the lines:

    filename = first(split(basename(filepath),".")) * "_part"
    filelist = filter(startswith(filename), readdir(folder))
    existing_files = length(filelist) > 0

elsewhere in the code. How important is avoiding duplication versus clarity?

Clarity matters most. But usually there are no strong trade-offs: duplication improves clarity only if the duplicated code is short and easy to understand.

How is this:

filename = first(split(basename(filepath),".")) * "_part"

different from current_filename?

For clarity I think it's best to use more lines for complicated operations rather than concatenating many operations into a single line. That line does a lot of things at once: basename, split with ".", first, and finally concatenation with _part. That's quite a few things for one line.
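For example, the one-liner could be unpacked into named steps with the same behavior (the variable names here are illustrative):

```julia
# The single line above, unpacked step by step.
filepath = "output/vel_fields.nc"
file = basename(filepath)                         # "vel_fields.nc"
name_without_extension = first(split(file, "."))  # "vel_fields"
filename = name_without_extension * "_part"       # "vel_fields_part"
```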
