Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improving thread crash logic #406

Open
wants to merge 10 commits into
base: dev
Choose a base branch
from

Conversation

mabruzzo
Copy link
Collaborator

This introduces 3 changes to the behavior revolving around fixing thread-crashes (i.e. density or energy drops to a negative value). For the uninitiated, the default behavior in a thread-crash, when DENSITY_FLOOR isn't defined, is to call Average_Cell_All_Fields, which overrides the values in the crashed cell with the values from the neighboring cells.

The 3 changes include:

  1. A major synchronization bugfix:
    • we moved the call to Average_Cell_All_Fields from the Update_Conserved_Variables_3D kernel to the newly created PostUpdate_Conserved_Correct_Crashed_3D kernel. We always launch PostUpdate_Conserved_Correct_Crashed_3D after Update_Conserved_Variables_3D
    • Previously, we could be averaging the values of neighboring cells while they were in the middle of an update.
    • This was producing large spikes in Energy (1000 times larger than a supernova) in our simulations. This fix seemed to fix a little more than I was expecting (I still need to finish examining that).
  2. A minor bugfix:
    • Previously Average_Cell_All_Fields always looked for neighbors, regardless of where the central cell was located on a grid. So if you were at the edge of the active zone, you might include the values of cells in the ghost-zone. This was problematic when dealing with a crashed-cell at the hydro update when the ghost zones had not been updated yet (it's unclear if this commonly happened).
    • Care was taken to maintain existing behavior in Average_Slow_Cells, which could apply Average_Cell_All_Fields to a slow-cell at identified at any location on the grid (in the ghost zone or active zone).1 My change also happens
  3. A minor improvement:
    • Previously the Average_Cell_All_Fields would ignore any neighboring cells with a non-positive density or pressure.
    • This PR extends the behavior to also ignore any cells that would be considered a "slow" cell and will need to be separately averaged. (When the AVERAGE_SLOW_CELLS macro isn't defined, no cell is considered a "slow" cell)

The implementation of the third change got a little hairy. In principle I think it's a good design choice (especially for reducing ifdef statements). But, with the benefit of hindsight, maybe I should have done something simpler. Definitely open to feedback on this point!

Footnotes

  1. In this case we continue for a cell on the edge of the active zone, we will include cells in the ghost zone in the average. As I write this out, I'm not totally sure the ghost zones are refreshed between the invocation of the Hydro-Solver and the call to Average_Slow_Cells. That is something to followup on!

mabruzzo added 6 commits July 22, 2024 13:36
In slightly more detail, we invoke `Average_Cell_All_Fields` to get the average values from the neighbors to overwrite the value in a crashed cell or (when the `AVERAGE_SLOW_CELLS` macro is defined).

  - We have already been excluding any neighboring cell from this average that had crashed.

  - The key inovation here is that we now exclude a neighboring cell that we would consider a slow cell (when the `AVERAGE_SLOW_CELLS` macro is defined).

To implement this, we pass around the SlowCellConditionCheck object, which implements methods to check whether a cell is considered slow or not. When the `AVERAGE_SLOW_CELLS` macro is defined, that method does the full check. When the macro isn't defined, the logic short-circuits and always returns that the cell isn't slow.
@mabruzzo
Copy link
Collaborator Author

pre-commit.ci autofix

Copy link
Collaborator

@helenarichie helenarichie left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this looks great! I have a few minor comments, but other than those I think it's ready to go.

// Step 3b: Address any crashed threads
hipLaunchKernelGGL(PostUpdate_Conserved_Correct_Crashed_3D, dim1dGrid, dim1dBlock, 0, 0, dev_conserved, nx, ny, nz,
x_off, y_off, z_off, n_ghost, gama, n_fields, slow_check);
GPU_Error_Check();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is very minor, but there's a GPU_Error_Check following this kernel launch in the Simple integrator, but not in Van Leer, so it might be good to either remove this one or add one to VL so that the behavior is consistent between the two.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch! I explicitly thought I fixed that (but I must have seen it and not acted on it)

template <bool verbose = false>
__device__ bool is_slow(Real E, Real d, Real d_inv, Real vx, Real vy, Real vz, Real gamma) const
{
return this->max_dti_if_slow(E, d, d_inv, vx, vy, vz, gamma) >= 0.0;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another minor question/comment, but is it possible for max_dti_if_slow to ever be zero here? If not, just for the sake of clarity, it might be good to change the condition to max_dti_if_slow > 0 instead of max_dti_if_slow >= 0.

I know that hydroInverseCrossingTime can return zero, but I still don't think it's possible for (max_dti > max_dti_slow) ? max_dti : -1.0; to evaluate to zero unless someone sets max_dti_slow to a negative number. Again, very minor, but I found it a little confusing.

In line with my comment about setting max_dti_if_slow, requiring that it's positive would make it so that (max_dti > max_dti_slow) ? max_dti : -1.0; never evaluated to zero and clarify this ambiguity.

@@ -152,7 +153,9 @@ void Grid3D::Initialize(struct Parameters *P)

#ifdef AVERAGE_SLOW_CELLS
H.min_dt_slow = 1e-100; // Initialize the minumum dt to a tiny number
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I definitely don't think this needs to happen with this PR, but Evan and I have talked about how it would be good to set min_dt_slow from the input parameter file instead of at compile time. Additionally, as I mentioned below, I think it would be good to require that it's a positive number. This is definitely beyond the scope of this PR, so feel free to ignore-just wanted to make a note of it while we're overhauling cell averaging!

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Absolutely! Especially since most of the logic that actually sets min_dt_slow is currently buried in the gravity functions.


if (max_dti > max_dti_slow) {
if (max_dti >= 0) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again, I think max_dti > 0 would be clearer here.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Style-related comment since this is a solitary bit of new code-it would be good for the naming of the physics variables in here to follow the style guide.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants