Monitor #4

parcollet · 2020-08-07T19:08:49Z

Draft of the little monitor class.
To be reviewed.
More tests may be needed. TBD.

phdum

Hi!

I went through the code in this merge request in detail again this evening and have a few minor improvements. I've implemented these as four commits on my fork https://github.com/phdum/mpi/tree/monitor. (I'm not sure how best to add these commits to the merge request.)

- The only actual problem I found was that request_emergency_stop() could be called multiple times from the same MPI rank. This would try to send multiple MPI messages, even though only one was registered. I've fixed this so that request_emergency_stop() does nothing if local_stop is already set.
- I've expanded the tests, including a check of the above point. This also adds test cases that were not covered before: request_emergency_stop occurring on the root itself, and multiple (2) request_emergency_stop calls.
- I've polished the comments to hopefully make the logic easier to follow.
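
For reference, the guard from the first point can be sketched roughly as follows. This is a minimal illustration, not the code in the branch: `monitor_sketch`, `messages_sent`, `sent()` and `messages_after()` are hypothetical names, and a plain counter stands in for the actual MPI message. Only `request_emergency_stop()` and `local_stop` are taken from the discussion above.

```cpp
#include <cassert>

// Hypothetical stand-in for the monitor class discussed in this review.
// The real class communicates over MPI; here a counter replaces the send.
class monitor_sketch {
  bool local_stop   = false; // true once this rank has requested a stop
  int messages_sent = 0;     // number of simulated notifications to the root

  public:
  void request_emergency_stop() {
    if (local_stop) return; // already requested on this rank: nothing to do
    local_stop = true;
    ++messages_sent; // placeholder for the single message sent to the root
  }

  bool stop_requested() const { return local_stop; }
  int sent() const { return messages_sent; }
};

// Calls request_emergency_stop() `calls` times on a fresh object and
// reports how many notifications were issued.
inline int messages_after(int calls) {
  monitor_sketch m;
  for (int i = 0; i < calls; ++i) m.request_emergency_stop();
  return m.sent();
}
```

With this guard, `messages_after(2)` yields the same count as `messages_after(1)`, which is the property the expanded tests are meant to check.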

Apart from that everything looks good to me. At least the logical paths I could think of testing all hold together.

I've not looked at the changes to macros.hpp and comm_split.cpp which are a part of this merge request, but not part of the monitor code.

Thanks!

phdum · 2020-09-14T15:26:26Z

I've now pushed my commits to the branch, which updated the pull request.


parcollet · 2020-09-22T16:19:00Z

Ok, fine with the finalize change. Fine to merge.

parcollet requested a review from Wentzell August 7, 2020 19:08

Wentzell requested review from phdum and removed request for Wentzell September 8, 2020 14:55

phdum reviewed Sep 12, 2020


parcollet and others added 13 commits September 21, 2020 15:22

[macros] Add the macros file

f97898f

- same as nda, with a guard, not pragma once, for multiple inclusion

[monitor] First Draft

be9d81f

- A simple class to monitor failure/exception on a node, used mainly for MC classes in TRIQS at this stage.

[monitor] add test

b6aa881

Correction. PhD comments.

be355c8

- Clean up the API a bit

Add license

11abe54

Add abort for communicator

a24b788

[monitor] add test case when root itself fails

87c726c

* Has different code logic for communication when root fails
* Small cleanup for readability in code and console output

[monitor] emergency_stop should be called once

35301c6

* further calls should not send extra signals
* add to test function

[monitor] polish of comments; no code change

b47d593

[monitor] change tests to allow variable failures

cffd8ad

* test for 0, 1, 2 failures on nodes

Remove .gitignore from .gitignore

fa7cd39

Change monitor.finalize() to void return type

f220aa6

- rename finalize -> finalize_communications
- rename should_stop -> emergency_occured
- use emergency_occured() to get final result also after finalization

Fix Copyright header for comm_split test

57b35a8

Wentzell force-pushed the monitor branch from dfed420 to 57b35a8 Compare September 22, 2020 15:29

Wentzell merged commit dbf2ac1 into TRIQS:unstable Sep 22, 2020
