Draft for JOSS (#90)
* Add first draft of joss paper

* Modify CI

* Remove API specific text

* Add sentence regarding mpi collectives in summary

* Move MPI collectives up

* Another readthrough

* Some doi fixes and to author list and cursive
jorgensd authored Mar 6, 2024
1 parent d2cf316 commit 4046323
Showing 5 changed files with 229 additions and 0 deletions.
22 changes: 22 additions & 0 deletions .github/workflows/build_joss_paper.yml
@@ -0,0 +1,22 @@
name: Build JOSS paper
on: [push]

jobs:
  paper:
    runs-on: ubuntu-latest
    name: Paper Draft
    steps:
      - name: Checkout
        uses: actions/checkout@v4

      - name: Build draft PDF
        uses: openjournals/openjournals-draft-action@master
        with:
          journal: joss
          paper-path: ./joss-paper/paper.md

      - name: Upload
        uses: actions/upload-artifact@v4
        with:
          name: paper
          path: ./joss-paper/paper.pdf
6 changes: 6 additions & 0 deletions joss-paper/README.md
@@ -0,0 +1,6 @@
# How to generate paper

```bash
docker run --rm --volume $(pwd):/data --user $(id -u):$(id -g) --env JOURNAL=joss openjournals/inara
```

36 changes: 36 additions & 0 deletions joss-paper/joss-checklist.md
@@ -0,0 +1,36 @@
### Conflict of interest

- [x] I confirm that I have read the [JOSS conflict of interest policy](https://joss.readthedocs.io/en/latest/submitting.html#conflict-of-interest-policy-for-authors) and that: I have no COIs with reviewing this work or that any perceived COIs have been waived by JOSS for the purpose of this review.

### Code of Conduct

- [x] I confirm that I read and will adhere to the [JOSS code of conduct](https://joss.theoj.org/about#code_of_conduct).

### General checks

- [x] **Repository:** Is the source code for this software available at the <a target="_blank" href="https://github.com/jorgensd/adios4dolfinx">repository url</a>?
- [x] **License:** Does the repository contain a plain-text LICENSE file with the contents of an [OSI approved](https://opensource.org/licenses/alphabetical) software license?
- [x] **Contribution and authorship:** Has the submitting author made major contributions to the software? Does the full list of paper authors seem appropriate and complete?

### Functionality

- [x] **Installation:** Does installation proceed as outlined in the documentation?
- [x] **Functionality:** Have the functional claims of the software been confirmed?
- [ ] **Performance:** If there are any performance claims of the software, have they been confirmed? (If there are no claims, please check off this item.)

### Documentation

- [x] **A statement of need:** Do the authors clearly state what problems the software is designed to solve and who the target audience is?
- [x] **Installation instructions:** Is there a clearly-stated list of dependencies? Ideally these should be handled with an automated package management solution.
- [x] **Example usage:** Do the authors include examples of how to use the software (ideally to solve real-world analysis problems).
- [x] **Functionality documentation:** Is the core functionality of the software documented to a satisfactory level (e.g., API method documentation)?
- [x] **Automated tests:** Are there automated tests or manual steps described so that the functionality of the software can be verified?
- [x] **Community guidelines:** Are there clear guidelines for third parties wishing to 1) Contribute to the software 2) Report issues or problems with the software 3) Seek support

### Software paper

- [x] **Summary:** Has a clear description of the high-level functionality and purpose of the software for a diverse, non-specialist audience been provided?
- [x] **A statement of need:** Does the paper have a section titled 'Statement of need' that clearly states what problems the software is designed to solve, who the target audience is, and its relation to other work?
- [x] **State of the field:** Do the authors describe how this software compares to other commonly-used packages?
- [x] **Quality of writing:** Is the paper well written (i.e., it does not require editing for structure, language, or writing quality)?
- [x] **References:** Is the list of references complete, and is everything cited appropriately that should be cited (e.g., papers, datasets, software)? Do references in the text use the proper [citation syntax](https://rmarkdown.rstudio.com/authoring_bibliographies_and_citations.html#citation_syntax)?
78 changes: 78 additions & 0 deletions joss-paper/paper.bib
@@ -0,0 +1,78 @@
@unpublished{Baratta:2023,
author = {Baratta, Igor A. and Dean, Joseph P. and Dokken, Jørgen S. and Habera, Michal and Hale, Jack and Richardson, Chris N. and Rognes, Marie E. and Scroggs, Matthew W. and Sime, Nathan and Wells, Garth N.},
title = {DOLFINx: The next generation FEniCS problem solving environment},
language = {English},
year = {2023},
doi = {10.5281/zenodo.10447666}
}


@article{Godoy:2020,
title = {ADIOS 2: The Adaptable Input Output System. A framework for high-performance data management},
journal = {SoftwareX},
volume = {12},
pages = {100561},
year = {2020},
issn = {2352-7110},
doi = {10.1016/j.softx.2020.100561},
author = {William F. Godoy and Norbert Podhorszki and Ruonan Wang and Chuck Atkins and Greg Eisenhauer and Junmin Gu and Philip Davis and Jong Choi and Kai Germaschewski and Kevin Huck and Axel Huebl and Mark Kim and James Kress and Tahsin Kurc and Qing Liu and Jeremy Logan and Kshitij Mehta and George Ostrouchov and Manish Parashar and Franz Poeschel and David Pugmire and Eric Suchyta and Keichi Takahashi and Nick Thompson and Seiji Tsutsumi and Lipeng Wan and Matthew Wolf and Kesheng Wu and Scott Klasky}
}


@conference{Habera:2018,
author = {Habera, Michal and Zilian, Andreas and Hale, Jack and Richardson, Chris N. and Blechta, Jan and DeMarle, Dave},
year = {2018},
title = {{XDMF and ParaView: checkpointing format}},
booktitle = {{FEniCS Conference 2018: Book of Abstracts}},
url = {https://hdl.handle.net/10993/35848}
}

@misc{Ham:2024,
title = {Efficient N-to-M Checkpointing Algorithm for Finite Element Simulations},
author = {David A. Ham and Vaclav Hapla and Matthew G. Knepley and Lawrence Mitchell and Koki Sagiyama},
year = {2024},
eprint = {2401.05868},
archiveprefix = {arXiv},
doi = {10.48550/arXiv.2401.05868}
}


@article{Rathgeber:2016,
author = {Rathgeber, Florian and Ham, David A. and Mitchell, Lawrence and Lange, Michael and Luporini, Fabio and McRae, Andrew T. T. and Bercea, Gheorghe-Teodor and Markall, Graham R. and Kelly, Paul H. J.},
title = {Firedrake: Automating the Finite Element Method by Composing Abstractions},
year = {2016},
issue_date = {September 2017},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
volume = {43},
number = {3},
issn = {0098-3500},
doi = {10.1145/2998441},
journal = {ACM Trans. Math. Softw.},
month = {dec},
articleno = {24},
numpages = {27}
}

@article{Scroggs:2022,
author = {Scroggs, Matthew W. and Dokken, J\o{}rgen S. and Richardson, Chris N. and Wells, Garth N.},
title = {Construction of Arbitrary Order Finite Element Degree-of-Freedom Maps on Polygonal and Polyhedral Cell Meshes},
year = {2022},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
volume = {48},
number = {2},
issn = {0098-3500},
doi = {10.1145/3524456},
journal = {ACM Trans. Math. Softw.},
month = {may},
articleno = {18},
numpages = {23}
}

@misc{MPI-Forum:2012,
author = {MPI-Forum},
year = {2012},
title = {{MPI: A Message-Passing Interface Standard. Version 3.0}},
url = {https://www.mpi-forum.org/docs/mpi-3.0/mpi30-report.pdf}
}
87 changes: 87 additions & 0 deletions joss-paper/paper.md
@@ -0,0 +1,87 @@
---
title: 'ADIOS4DOLFINx: A framework for checkpointing in FEniCS'
tags:
  - Python
  - finite element simulations
  - checkpointing
authors:
  - name: Jørgen Schartum Dokken
    orcid: 0000-0001-6489-8858
    corresponding: true
    affiliation: 1
affiliations:
  - name: Simula Research Laboratory
    index: 1
date: 6 March 2024
bibliography: paper.bib

---

# Summary

We introduce a checkpointing framework for the latest version of the FEniCS project, known as DOLFINx.
The framework leverages the data-centric approach of DOLFINx along with a state-of-the-art adaptable Input/Output system called ADIOS2.
Several variations of checkpointing are supported, including *N-to-M* checkpointing of function data, storage of mesh partitioning information for *N-to-N* checkpointing, and snapshot checkpointing for RAM reduction during simulation.
All MPI operations use MPI-3 neighborhood collectives.

# Statement of need

The ability to start, stop and resume simulations is becoming increasingly important with the growing use of supercomputers for solving scientific and engineering problems.
A rising number of large-scale problems are deployed on high-performance, distributed-memory computing systems, and users tend to run more demanding simulations.
These are often non-linear and time-dependent, and typically require thousands of CPU hours.
As such simulations might contain bugs or produce unphysical solutions, the ability to run parts of a simulation, inspect the result, and then resume the simulation becomes a key factor for efficient development.
If such issues are discovered early on, the simulation can be terminated, saving the developer time, money and energy usage.

The proposed framework enables users of the FEniCS project [@Baratta:2023] to store solutions during simulation, and read them in at their convenience to resume simulations at a later stage.
Several checkpointing methods are implemented, including *N-to-M* checkpointing, which means saving data from a program executed with N processes, and loading it back in on M processes.
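
As a rough illustration of the idea (not the ADIOS4DOLFINx implementation), the following plain-Python sketch simulates N writers that each store a contiguous chunk of a global array, and M readers that each load a contiguous chunk of the same array. The helper names and the block distribution used by `local_range` are assumptions made for this example only.

```python
def local_range(rank: int, size: int, n: int) -> tuple[int, int]:
    """Half-open range [start, stop) of a global array of length n owned by
    `rank` when the array is split into `size` contiguous blocks."""
    base, rest = divmod(n, size)
    start = rank * base + min(rank, rest)
    stop = start + base + (1 if rank < rest else 0)
    return start, stop


def write_checkpoint(local_chunks: list[list[float]]) -> list[float]:
    """Simulate N processes writing their contiguous chunks to one global array."""
    global_array: list[float] = []
    for chunk in local_chunks:  # chunk i is owned by process i
        global_array.extend(chunk)
    return global_array


def read_checkpoint(global_array: list[float], num_readers: int) -> list[list[float]]:
    """Simulate M processes, each reading its own contiguous chunk."""
    n = len(global_array)
    return [
        global_array[slice(*local_range(rank, num_readers, n))]
        for rank in range(num_readers)
    ]


# Data distributed over N = 3 writer processes ...
data = [float(i) for i in range(10)]
written = [data[slice(*local_range(rank, 3, len(data)))] for rank in range(3)]
stored = write_checkpoint(written)
# ... and read back in on M = 2 reader processes.
loaded = read_checkpoint(stored, 2)
```

The stored array does not depend on N, which is what allows a later run to choose M freely.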

Functionality for *N-to-M* checkpointing was implemented for the old version of DOLFIN by @Habera:2018.
However, this functionality is not present in the newest version of the FEniCS Project [@Baratta:2023].
The storage principles in ADIOS4DOLFINx are based on the ideas present in that implementation.
However, the implementation for non-Lagrangian finite element spaces differs substantially, due to the usage of dof-permutations [@Scroggs:2022].
Additionally, all global MPI calls in the old implementation have been reimplemented with scalable MPI communication using the MPI-3 neighborhood collectives [@MPI-Forum:2012].

The framework introduces several new methods for storing partitioning information for *N-to-N* checkpointing with arbitrary ghosting, as well as very lightweight snapshot checkpoints.
A similar framework for *N-to-M* checkpointing was implemented by @Ham:2024 for the finite element framework Firedrake [@Rathgeber:2016].
That framework differs from ADIOS4DOLFINx in several ways due to the different internal structures of DOLFINx and Firedrake.

# Functionality

The software is written as a Python extension to DOLFINx, which can be installed using the Python package installer `pip`, either directly from the GitHub repository or as the [ADIOS4DOLFINx](https://pypi.org/project/adios4dolfinx/) package from the Python Package Index.
The following features are supported:

- Snapshot checkpointing
- *N-to-M* checkpointing with mesh storage
- *N-to-M* checkpointing without mesh storage
- *N-to-N* checkpointing storing partitioning information

A *snapshot checkpoint* is a checkpoint that is only valid during the run of a simulation.
It is lightweight (only the local portion of the global dof array is written to file) and uses the *Local Array* feature in ADIOS2 [@Godoy:2020] to store data local to each MPI process.
This feature is intended for use cases where many solutions have to be kept until a post-processing step at the end of a simulation, or as a fall-back mechanism when restarting a diverging iterative solver.
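
The fall-back use case can be sketched in plain Python. The `snapshots` dictionary below is a stand-in for process-local storage, and `snapshot`, `restore` and the diverging toy update are hypothetical names invented for this illustration; they are not part of ADIOS4DOLFINx or ADIOS2.

```python
import math

# Stand-in for process-local storage: one entry per (rank, name).
snapshots: dict[tuple[int, str], list[float]] = {}


def snapshot(rank: int, name: str, local_dofs: list[float]) -> None:
    """Store a copy of the local part of the dof array; no communication needed."""
    snapshots[(rank, name)] = list(local_dofs)


def restore(rank: int, name: str) -> list[float]:
    """Read the local dofs back; only meaningful for the same rank in the same run."""
    return list(snapshots[(rank, name)])


def toy_update(u: list[float], step: float) -> list[float]:
    """Toy iteration that blows up when the step is too large."""
    return [x - step * x**3 for x in u]


rank = 0
u = [1.0, 2.0, 3.0]
snapshot(rank, "u", u)  # last known good state

step = 1.0
for _ in range(3):
    u = toy_update(u, step)

diverged = any(not math.isfinite(x) or abs(x) > 1e6 for x in u)
if diverged:
    # Fall back to the snapshot and retry with a smaller step.
    u = restore(rank, "u")
    step = 0.01
    u = toy_update(u, step)
```

Because each process only touches its own data, such a snapshot cannot be read back with a different number of processes.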

An *N-to-M* checkpoint is a checkpoint that can be written with N processes and read back in with M processes.
Two versions of this checkpoint are supported: one that stores the mesh together with the data, and one without mesh storage.
The reason for this split is that when a mesh is read into DOLFINx and passed to an appropriate partitioner, the ordering of the mesh nodes (coordinates) and connectivity (cells) is changed.
Writing these back into *Global Arrays* requires MPI communication to ensure contiguous writing of data.

The *N-to-M* checkpoint with mesh storage exclusively writes contiguous chunks of data owned by the current process to an ADIOS2 *Global Array* that can be read in with a different number of processes at a later stage.
This operation requires no MPI-communication.

In many cases, the input mesh might stem from an external mesh generator and is stored together with mesh entity markers in an external file, for instance an XDMF file.
To avoid duplication of this mesh data, a stand-alone file that can be associated with the XDMF file for a later restart can be created.
This method requires some MPI neighborhood collective calls to move data from the process that currently owns it to the process that stores it as a *Global Array* in contiguous chunks.
Both *N-to-M* checkpoint routines use the same API to read in checkpoints at a later instance.
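
The redistribution step can be illustrated with a small simulation in plain Python. Each process holds values tagged with their index in the input ordering, and every value is routed to the process that owns the contiguous chunk containing that index. The function names and the block distribution are assumptions made for this sketch, and Python lists stand in for the MPI neighborhood collectives.

```python
def owner_of(index: int, size: int, n: int) -> int:
    """Rank owning `index` when n entries are split into `size` contiguous blocks."""
    base, rest = divmod(n, size)
    boundary = rest * (base + 1)  # the first `rest` ranks own base + 1 entries
    if index < boundary:
        return index // (base + 1)
    return rest + (index - boundary) // base


def redistribute(local_data: list[dict[int, float]], n: int) -> list[list[float]]:
    """Route (input index, value) pairs to the owning process and return the
    contiguous chunk that each process would write."""
    size = len(local_data)
    inbox: list[dict[int, float]] = [{} for _ in range(size)]
    for rank_data in local_data:
        for input_index, value in rank_data.items():
            inbox[owner_of(input_index, size, n)][input_index] = value
    # Each process sorts what it received by input index before writing.
    return [[inbox[rank][i] for i in sorted(inbox[rank])] for rank in range(size)]


# After partitioning, the six entries are scattered over two processes.
scattered = [
    {4: 40.0, 0: 0.0, 3: 30.0},  # held by process 0
    {1: 10.0, 5: 50.0, 2: 20.0},  # held by process 1
]
chunks = redistribute(scattered, n=6)
```

In a real run only processes that actually exchange data need to communicate, which is the motivation for using neighborhood rather than global collectives.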

In certain scenarios, mesh partitioning might be time-consuming, for instance when a developer runs the same problem repeatedly with the same number of processes.
As DOLFINx supports custom partitioning [@Baratta:2023], we use this feature to read in partition data from a previous run.
As opposed to the checkpoints in the old version of DOLFIN, these checkpoints handle any ghosting, whether it is a custom ghosting provided by the user or the shared-facet mode provided by DOLFINx.
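
A minimal sketch of this idea, in plain Python and independent of the DOLFINx API: the destination ranks computed by a partitioner (including extra ranks for ghosted cells) are stored once, and a later run with the same number of processes reads them back instead of partitioning again. `toy_partitioner`, `write_partition`, `read_partition` and the JSON file layout are hypothetical and chosen only for illustration.

```python
import json
import tempfile
from pathlib import Path


def toy_partitioner(num_cells: int, size: int) -> list[list[int]]:
    """Stand-in for a graph partitioner. For each cell, the first entry is the
    owning rank and any further entries are ranks that hold the cell as a ghost."""
    destinations = []
    for cell in range(num_cells):
        owner = cell * size // num_cells
        ranks = [owner]
        # Ghost the cell on the next rank if the following cell has another owner.
        if cell + 1 < num_cells and (cell + 1) * size // num_cells != owner:
            ranks.append(owner + 1)
        destinations.append(ranks)
    return destinations


def write_partition(path: Path, destinations: list[list[int]], size: int) -> None:
    """Store the partition together with the number of processes it was made for."""
    path.write_text(json.dumps({"size": size, "destinations": destinations}))


def read_partition(path: Path, size: int) -> list[list[int]]:
    """Reuse a stored partition; only valid for the same number of processes."""
    stored = json.loads(path.read_text())
    if stored["size"] != size:
        raise ValueError("Partition was written with a different number of processes")
    return stored["destinations"]


with tempfile.TemporaryDirectory() as tmp:
    partition_file = Path(tmp) / "partition.json"
    first_run = toy_partitioner(num_cells=8, size=2)
    write_partition(partition_file, first_run, size=2)
    second_run = read_partition(partition_file, size=2)  # no partitioner call
    try:
        read_partition(partition_file, size=3)
        mismatch_detected = False
    except ValueError:
        mismatch_detected = True
```

Storing the full list of destination ranks per cell, rather than only the owner, is what lets the ghosting be reproduced as well.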

# Examples

A large variety of examples covering all the functions in adios4dolfinx is available at [https://jorgensd.github.io/adios4dolfinx](https://jorgensd.github.io/adios4dolfinx).

# Acknowledgements

We acknowledge the valuable feedback on the documentation and manuscript by Thomas M. Surowiec and Halvor Herlyng.
Additionally, we acknowledge the scientific discussion regarding feature development and code contributions by Henrik N. Finsberg and Francesco Ballarin.

# References
