-
Notifications
You must be signed in to change notification settings - Fork 55
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Flush and close output files that are full #3032
Conversation
This commit addresses an issue with empty output files that should have been flushed and closed. If the simulation exits abnormally before a file has been flushed then the file will be empty. Before, we left it up to SCORPIO to decide the optimal time to flush output. In this commit we force a file that is full, i.e. max_snapshots has been reached, to be flushed and closed before moving on. This should ensure that all full files are written before any chance of an abnormal exit.
@bartgol I noticed you already had a Since this PR is dependent on #3031 I am labeling it as WIP until that is merged. But in the meantime I wanted to solicit any comments on this approach. |
|
I'm guessing you wanted to link a different PR? 3032 is this PR... |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we need to use snapshot_fits
to accommodate also a different storage type.
@@ -86,7 +86,10 @@ struct IOFileSpecs { | |||
// If positive, flush the output file every these many snapshots | |||
int flush_frequency = std::numeric_limits<int>::max(); | |||
|
|||
// bool file_is_full () const { return num_snapshots_in_file>=max_snapshots_in_file; } | |||
bool file_is_full () const { | |||
return storage.num_snapshots_in_file>=storage.max_snapshots_in_file; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This only works if the storage type is based on max number of snapshots. It will not work in case one chooses "one_month" or "one_year" (the former being quite appealing for single month data).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The fact that this only works for max-snaps based storage may very well be the case why it was commented out. I believe I switched to using snapshot_fits
precisely for this reason.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is an excellent point. Thank you for pointing this out.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the simulation writes a restart file and the rpointers are all consistent, and then the simulation fails, will a file that is still open (for a year, for example) still have the risk of containing 0s instead of flushed data in a time period that is prior to the valid restart? I.e., could we still end up with 0 data?
Perhaps this is already done, but it seems to me the only way to guarantee things is by flushing all files prior to writing the rpointer file.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ambrad we also have a separate todo item: every time we write a rhist file, also flush the corresponding output file.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(And the other todo item is significantly more important than this fwiw)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ambrad was this just a question to ensure we're not forgetting anything, or do you have a scenario where you think that would/should be the case (so flushing at rhist write would not be enough)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we have a flush_all
type call that is initiated whenever a restart is written? Wouldn't that remedy the concern of having all 0's if a fail occurs after a restart is written?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@bartgol, right, to ensure we're not forgetting anything. But I agree with Aaron that if the AD has a list of open write-mode files, why not just iterate through the list and flush them all before writing the rpointer file? You'd then skip more file-specific flushing. I don't think there can be an open write-mode file that shouldn't be flushed at a restart write.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think the AD has a handle to all files directly. But each output manger has a handle to its output and rhist files, so when the rhist file is written, it can flush the output one too.
And as Naser said, in EAMxx (for some technical reasons) we always write a rhist file, even if there is no "restart data" (e.g., for INSTANT output) and even if we just wrote in the output file. So flushing the .h file when the corresponding .rhist is flushed, should cover every output file.
@@ -550,6 +550,12 @@ void OutputManager::run(const util::TimeStamp& timestamp) | |||
if (filespecs.file_needs_flush()) { | |||
flush_file (filespecs.filename); | |||
} | |||
|
|||
// Check if we have hit the max number of snapshots and need to close the file | |||
if (filespecs.file_is_full()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of using this method, we should use
if (filespecs.storage.snapshot_fits (m_output_control.next_write_ts))
Caveat: you need to ensure that m_output_control.compute_next_write_ts()
has been called before you attempt to use next_write_ts
. I believe at the point where you added these mods, this will be the case, but you may want to double check.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, just confirmed that compute_next_write_ts()
is called before this chunk of code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@bartgol do you mean the negation of snapshot_fits?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, yes, good call. @AaronDonahue see andrew's comment, so don't just copy paste the line I wrote.
One more thing: if you switch to closing the file as soon as it's full, then you can also get rid of these lines if (filespecs.is_open and not filespecs.storage.snapshot_fits(snapshot_start)) {
release_file(filespecs.filename);
filespecs.close();
} since we should never hit this scenario anymore. |
doh! Too many open windows next to each other. Thanks for the correction, I meant 3031 |
@bartgol do I need the |
Aren't you closing the file when you first find out it was full? |
Status Flag 'Pull Request AutoTester' - Testing Jenkins Projects: Pull Request Auto Testing STARTING (click to expand)Build InformationTest Name: SCREAM_PullRequest_Autotester_Weaver
Jenkins Parameters
Build InformationTest Name: SCREAM_PullRequest_Autotester_Mappy
Jenkins Parameters
Using Repos:
Pull Request Author: AaronDonahue |
Status Flag 'Pull Request AutoTester' - Jenkins Testing: 1 or more Jobs FAILED Note: Testing will normally be attempted again in approx. 2 Hrs. If a change to the PR source branch occurs, the testing will be attempted again on next available autotester run. Pull Request Auto Testing has FAILED (click to expand)Build InformationTest Name: SCREAM_PullRequest_Autotester_Weaver
Jenkins Parameters
Build InformationTest Name: SCREAM_PullRequest_Autotester_Mappy
Jenkins Parameters
SCREAM_PullRequest_Autotester_Weaver # 6128 FAILED (click to see last 100 lines of console output)
SCREAM_PullRequest_Autotester_Mappy # 5890 FAILED (click to see last 100 lines of console output)
|
Oh duh. Sorry, when I read this line just to delete it I thought it was "closing the FileSpecs" object which I didn't remember doing. But yes, I already have this line... |
Looks like we have a fail. I'll investigate |
Status Flag 'Pull Request AutoTester' - Testing Jenkins Projects: Pull Request Auto Testing STARTING (click to expand)Build InformationTest Name: SCREAM_PullRequest_Autotester_Weaver
Jenkins Parameters
Build InformationTest Name: SCREAM_PullRequest_Autotester_Mappy
Jenkins Parameters
Using Repos:
Pull Request Author: AaronDonahue |
Status Flag 'Pull Request AutoTester' - Jenkins Testing: 1 or more Jobs FAILED Note: Testing will normally be attempted again in approx. 2 Hrs. If a change to the PR source branch occurs, the testing will be attempted again on next available autotester run. Pull Request Auto Testing has FAILED (click to expand)Build InformationTest Name: SCREAM_PullRequest_Autotester_Weaver
Jenkins Parameters
Build InformationTest Name: SCREAM_PullRequest_Autotester_Mappy
Jenkins Parameters
SCREAM_PullRequest_Autotester_Weaver # 6151 PASSED (click to see last 100 lines of console output)
SCREAM_PullRequest_Autotester_Mappy # 5908 FAILED (click to see last 100 lines of console output)
|
Status Flag 'Pull Request AutoTester' - User Requested Retest - Label AT: RETEST will be reset after testing. |
Status Flag 'Pull Request AutoTester' - Testing Jenkins Projects: Pull Request Auto Testing STARTING (click to expand)Build InformationTest Name: SCREAM_PullRequest_Autotester_Weaver
Jenkins Parameters
Build InformationTest Name: SCREAM_PullRequest_Autotester_Mappy
Jenkins Parameters
Using Repos:
Pull Request Author: AaronDonahue |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor comments, mainly to push you to clarify changes/comments.
Also, would you please rebase this PR if more commits are needed? That way we could get rid of that already merged changes from showing up here...
|
||
void set_dt (const double dt_in) { | ||
EKAT_REQUIRE_MSG (dt==0 or dt==dt_in, | ||
"[IOControl::set_dt] Error! Cannot reset dt once it is set.\n"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Like above.
// computed next_write_ts=last_write_ts (in terms of date:time, the num_steps is correct). | ||
// This means that at that time we deemed that the next_write_ts definitely fit in the same | ||
// file as last_write_ts (date/time are the same!), which may or may not be true for non NumSnaps | ||
// storage. To fix this, we recompute next_write_ts here, and close the file if it doesn't. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
minor: doesn't ... what? fit?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall, this seems a bit too complex? Some questions:
- when did we the "currently open file"?
- why can't we have all the info we need to determine if we can close it after write? If we are deciding to flush and close full files (per title of PR), then can't we deduce that if the number of snaps in file == max number of snaps then close based on that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's true for storage type "NumSnaps". But with type "one_month" (say), we can't say if the file is full unless we know the time stamp of the next write.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So, at a given time step, we know (can calculate) its time stamp, right? Then, why can't we deduce if one_month or one_day is ending here? Do we not know dt?
I understand the logic can be too convoluted, but it is still doable, no?
We don't have to do it now, but trying to understand if it is doable at all (ignoring the fact that we may choose not to make the code super ugly for some corner case)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We cannot compute the next time stamp during t0 output. The driver does not have info about dt
during the init sequence, which is when the OM is setup. When I originally designed the driver, I wanted to separate concerns as much as possible. In my mind, dt was a "run" time param, not an "init" time param (I didn't even know if dt could in principle change dynamically down the road).
If you want to compute next_write_ts
during t0 output (which, again, happens during the init sequence), we need to pass dt to the driver init methods (from the f90 cpl interface). We can of course do that. And all in all, it may make the code simpler. It's a slightly deeper interface change though, so we could do it as a follow up PR.
// In case REST_OPT=nsteps, don't count t0 output as one of those steps | ||
// NOTE: for m_output_control, it doesn't matter, since it'll be reset to 0 before we return |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmmmmm.... I see where things get weird! I wonder if shifting the indexing altogether can help?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tbc, this small issue is unrelated from closing the file at the right time. To be honest, I think we could flat out rm the line that updates nsamples_since_last_write for checkpoint control: it is not used anyways! And since, as the comment states, for output control it doesn't matter, we may as well remove this if block altogether...
@@ -118,6 +118,10 @@ class OutputManager | |||
void finalize(); | |||
|
|||
long long res_dep_memory_footprint () const; | |||
|
|||
// For debug and testing purposes |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mind elaborating what debug and testing we mean here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, I want to be able to verify the correctness of the control/filespecs structs during unit tests. The comment was meant to say "this is not really needed at runtime".
Status Flag 'Pull Request AutoTester' - Jenkins Testing: all Jobs PASSED Pull Request Auto Testing has PASSED (click to expand)Build InformationTest Name: SCREAM_PullRequest_Autotester_Weaver
Jenkins Parameters
Build InformationTest Name: SCREAM_PullRequest_Autotester_Mappy
Jenkins Parameters
|
Status Flag 'Pre-Merge Inspection' - - This Pull Request Requires Inspection... The code must be inspected by a member of the Team before Testing/Merging |
All Jobs Finished; status = PASSED, target_sha=0422f5754bde808b99f738c9dca1af4379634fbe, However Inspection must be performed before merge can occur... |
Status Flag 'Pull Request AutoTester' - Testing Jenkins Projects: Pull Request Auto Testing STARTING (click to expand)Build InformationTest Name: SCREAM_PullRequest_Autotester_Weaver
Jenkins Parameters
Build InformationTest Name: SCREAM_PullRequest_Autotester_Mappy
Jenkins Parameters
Using Repos:
Pull Request Author: AaronDonahue |
Status Flag 'Pull Request AutoTester' - Jenkins Testing: all Jobs PASSED Pull Request Auto Testing has PASSED (click to expand)Build InformationTest Name: SCREAM_PullRequest_Autotester_Weaver
Jenkins Parameters
Build InformationTest Name: SCREAM_PullRequest_Autotester_Mappy
Jenkins Parameters
|
All Jobs Finished; status = PASSED, target_sha=75ef2edca2837f1f61601a83fba2b559b26df85f, However Inspection must be performed before merge can occur... |
1 similar comment
All Jobs Finished; status = PASSED, target_sha=75ef2edca2837f1f61601a83fba2b559b26df85f, However Inspection must be performed before merge can occur... |
Status Flag 'Pull Request AutoTester' - Testing Jenkins Projects: Pull Request Auto Testing STARTING (click to expand)Build InformationTest Name: SCREAM_PullRequest_Autotester_Weaver
Jenkins Parameters
Build InformationTest Name: SCREAM_PullRequest_Autotester_Mappy
Jenkins Parameters
Using Repos:
Pull Request Author: AaronDonahue |
Status Flag 'Pull Request AutoTester' - Error: Jenkins Jobs - A user has pushed a change to the PR before testing completed. NEW EVENT 'committed', ID C_kwDOCEfuetoAKDAzM2QzMWUxMjg5OGNlN2YwZjdiYjA1NzMyZWEyMzNiYjliOGNhYzM... The Jenkins Jobs will be shutdown; Testing of this PR must occur again. |
Status Flag 'Pull Request AutoTester' - Jenkins Testing: 1 or more Jobs FAILED Note: Testing will normally be attempted again in approx. 2 Hrs. If a change to the PR source branch occurs, the testing will be attempted again on next available autotester run. Pull Request Auto Testing has FAILED (click to expand)Build InformationTest Name: SCREAM_PullRequest_Autotester_Weaver
Jenkins Parameters
Build InformationTest Name: SCREAM_PullRequest_Autotester_Mappy
Jenkins Parameters
SCREAM_PullRequest_Autotester_Weaver # 6171 ERROR (click to see last 100 lines of console output)
SCREAM_PullRequest_Autotester_Mappy # 5923 ERROR (click to see last 100 lines of console output)
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🎉
The merge-base changed after approval.
I contributed to the PR, so I won't be a reviewer anymore
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
by order of the peaky blinders
The merge-base changed after approval.
I'm not sure what gh is doing with this weird dismissal of the review. If testing passes, we'll just merge manually. |
Status Flag 'Pull Request AutoTester' - Testing Jenkins Projects: Pull Request Auto Testing STARTING (click to expand)Build InformationTest Name: SCREAM_PullRequest_Autotester_Weaver
Jenkins Parameters
Build InformationTest Name: SCREAM_PullRequest_Autotester_Mappy
Jenkins Parameters
Using Repos:
Pull Request Author: AaronDonahue |
Status Flag 'Pull Request AutoTester' - Jenkins Testing: all Jobs PASSED Pull Request Auto Testing has PASSED (click to expand)Build InformationTest Name: SCREAM_PullRequest_Autotester_Weaver
Jenkins Parameters
Build InformationTest Name: SCREAM_PullRequest_Autotester_Mappy
Jenkins Parameters
|
Status Flag 'Pre-Merge Inspection' - - This Pull Request Requires Inspection... The code must be inspected by a member of the Team before Testing/Merging |
All Jobs Finished; status = PASSED, target_sha=41f563d7ec3e4e1727d25267796e9beac13ffb12, However Inspection must be performed before merge can occur... |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
by order of the peaky blinders
The merge-base changed after approval.
This commit causes the output manager to flush and close a file if the max number of snapshots has been reached.
This is mainly an issue when the simulation exits abnormally before SCORPIO has flushed a file. The result is that the output file exists, but is empty. Now, the output manager will check if the maximum number of snapshots allowable in the file has been reached, and if so it forces the file to be flushed and closed.
Fixes #3026