Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Test examples in parallel #417

Merged
merged 36 commits into from
Jul 26, 2024
Merged

Test examples in parallel #417

merged 36 commits into from
Jul 26, 2024

Conversation

JDBetteridge
Copy link
Member

Given that most users will probably try and run an example from the examples directory in parallel, enable parallel testing for all examples.

I do not anticipate this adding significant time to CI runs as the parallel tests will benefit from having a hot cache.

Any test in unit-tests or integration-tests can be made to run in parallel by using the Python decorator:

@pytest.mark.parallel(nprocs=4)

The above example would run the test with 4 MPI ranks.

Fixes #377.

Requires #416 to be addressed.

@JDBetteridge JDBetteridge marked this pull request as ready for review August 2, 2023 16:29
@JDBetteridge
Copy link
Member Author

Tests take ~3min longer, seems like a win

Copy link
Contributor

@tommbendall tommbendall left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is great, thanks! It raises the question of which of the other unit tests or integration tests should be in parallel. The nice thing about the examples is that they cover most of the capabilities so this should give us reasonable coverage.

Just one small suggestion...

gusto/io.py Outdated Show resolved Hide resolved
@JDBetteridge JDBetteridge force-pushed the JDBetteridge/parallel_tests branch from 4882454 to 5ad6f86 Compare August 3, 2023 10:45
@JDBetteridge
Copy link
Member Author

I agree, it's probably worth having a discussion at some point which of the unit/integration tests should be run in parallel. Using the pytest marker makes it very easy to update existing tests.

A short term suggestion would be: If you find any parallel bugs add a parallel test, or update an existing test to run in parallel.

@JDBetteridge JDBetteridge marked this pull request as draft August 7, 2023 13:57
@JDBetteridge JDBetteridge force-pushed the JDBetteridge/parallel_tests branch from 29987b9 to 6d477e7 Compare August 11, 2023 11:15
@JDBetteridge JDBetteridge force-pushed the JDBetteridge/parallel_tests branch from 39206c6 to 30b8f5d Compare August 29, 2023 15:38
@JDBetteridge JDBetteridge marked this pull request as ready for review October 9, 2023 10:08
@JDBetteridge
Copy link
Member Author

Twice in a row! Nice!

I will pull out my debugging code and then this is ready for review

@tommbendall
Copy link
Contributor

I'd really like us to start parallel testing by default, so I thought it would be helpful to update the branch.

But as we saw a few months ago, we can still get failures. On this latest run it was the skamarock_klemp_hydrostatic test, and it looks like other failures we've seen:

On the first call to the mixed solver, one rank seems to fall behind the others and not begin the ImplicitSolver_solve. Then all the ranks hang. Output from most log files:

2024-03-13 08:36:20,335 INFO     Semi-implicit Quasi-Newton: Mixed solve (0, 1)
2024-03-13 08:36:20,335 INFO     Compressible linear solver: rho average solve
2024-03-13 08:36:20,340 INFO     Compressible linear solver: Exner average solve
2024-03-13 08:36:20,345 INFO     Compressible linear solver: hybridized solve
2024-03-13 08:36:20,359 DEBUG            Residual norms for ImplicitSolver_condensed_field_ solve
2024-03-13 08:36:20,359 DEBUG            0 KSP unpreconditioned resid norm 4.413507124086e+07 true resid norm 4.413507124086e+07 ||r(i)||/||b|| 1.000000000000e+00
2024-03-13 08:36:20,362 DEBUG            1 KSP unpreconditioned resid norm 5.900637824684e+01 true resid norm 5.900637824491e+01 ||r(i)||/||b|| 1.336949881035e-06
2024-03-13 08:36:20,364 DEBUG            2 KSP unpreconditioned resid norm 5.041852127746e-05 true resid norm 5.042033608880e-05 ||r(i)||/||b|| 1.142409758753e-12
2024-03-13 08:36:20,376 DEBUG        Residual norms for ImplicitSolver_ solve
2024-03-13 08:36:20,376 DEBUG        0 KSP no resid norm 4.765315244582e+07 true resid norm 5.046385477193e-05 ||r(i)||/||b|| 1.058982505498e-12

Output from one log file:

2024-03-13 08:36:20,335 INFO     Semi-implicit Quasi-Newton: Mixed solve (0, 1)
2024-03-13 08:36:20,335 INFO     Compressible linear solver: rho average solve
2024-03-13 08:36:20,340 INFO     Compressible linear solver: Exner average solve
2024-03-13 08:36:20,345 INFO     Compressible linear solver: hybridized solve
2024-03-13 08:36:20,359 DEBUG            Residual norms for ImplicitSolver_condensed_field_ solve
2024-03-13 08:36:20,359 DEBUG            0 KSP unpreconditioned resid norm 4.413507124086e+07 true resid norm 4.413507124086e+07 ||r(i)||/||b|| 1.000000000000e+00
2024-03-13 08:36:20,362 DEBUG            1 KSP unpreconditioned resid norm 5.900637824684e+01 true resid norm 5.900637824491e+01 ||r(i)||/||b|| 1.336949881035e-06
2024-03-13 08:36:20,364 DEBUG            2 KSP unpreconditioned resid norm 5.041852127746e-05 true resid norm 5.042033608880e-05 ||r(i)||/||b|| 1.142409758753e-12

@tommbendall
Copy link
Contributor

In the failing example that I posted above, the hanging comes between the condensed_field solve (the preconditioning step for the hybridized solver) -- which suggests to me that it could be worth reviewing whether the solver options are valid in parallel for our domains:

    solver_parameters = {'mat_type': 'matfree',
                         'ksp_type': 'preonly',
                         'pc_type': 'python',
                         'pc_python_type': 'firedrake.SCPC',
                         'pc_sc_eliminate_fields': '0, 1',
                         # The reduced operator is not symmetric
                         'condensed_field': {'ksp_type': 'fgmres',
                                             'ksp_rtol': 1.0e-8,
                                             'ksp_atol': 1.0e-8,
                                             'ksp_max_it': 100,
                                             'pc_type': 'gamg',
                                             'pc_gamg_sym_graph': None,
                                             'mg_levels': {'ksp_type': 'gmres',
                                                           'ksp_max_it': 5,
                                                           'pc_type': 'bjacobi',
                                                           'sub_pc_type': 'ilu'}}}

@JDBetteridge
Copy link
Member Author

I wonder if we're caching something we shouldn't on a mesh hierarchy. We recently fixed failing tests in Firedrakeland by ensuring a fresh mesh hierarchy each time. That approach wouldn't work here but might be a good place to start. It would be good to whittle this down to a minimal failing example to work on.

@tommbendall
Copy link
Contributor

I don't think we actually use a mesh hierarchy in any tests -- we only use algebraic multigrid and not geometric, although this is definitely something that we need to do to improve performance.

I agree about the minimum failing example, the hard thing is that sometimes it will pass!!

@JDBetteridge
Copy link
Member Author

Some recent changes in Firedrake my have fixed some parallel bugs. I'm running CI again and crossing my fingers 🤞 .

If this still doesn't work we should try tackling this at the hackathon.

Copy link
Contributor

@tommbendall tommbendall left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This has been talked about extensively offline. Thanks Jack and I approve!

@tommbendall tommbendall merged commit d4622d6 into main Jul 26, 2024
4 checks passed
@tommbendall tommbendall deleted the JDBetteridge/parallel_tests branch August 26, 2024 10:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

there is no test for running in parallel
2 participants