WIP: Resilience #42

Open
wants to merge 27 commits into
base: main
27 commits
de96fcd
Initial resilience branch
engelmannc Oct 28, 2024
0bb7e0f
Initial resilience description
engelmannc Nov 2, 2024
1490a55
Minor fix.
engelmannc Nov 2, 2024
ae44641
Added stub for first resilience design pattern
engelmannc Nov 5, 2024
e320eec
Added stub for first resilience design pattern
engelmannc Nov 5, 2024
85b3429
Transferred first resilience pattern (Rollback).
engelmannc Nov 15, 2024
5c3abca
Updated rollback pattern and added reinitialization pattern.
engelmannc Dec 3, 2024
9cd7ee0
Added n-modular redundancy pattern.
engelmannc Dec 3, 2024
6de2dc3
Added active/standby pattern and fixed some other patterns.
engelmannc Dec 9, 2024
a36b87c
Minor corrections.
engelmannc Dec 9, 2024
b38c04b
Added more resilience design patterns and corrected existing ones.
engelmannc Dec 12, 2024
4fd70d6
Added rollforward resilience pattern and fixed others.
engelmannc Dec 18, 2024
368dbb8
Added rollforward resilience pattern and fixed others.
engelmannc Dec 18, 2024
aa040d9
Added n-version design resilience pattern and fixed others.
engelmannc Dec 18, 2024
93d6a2a
Added recovery block resilience pattern.
engelmannc Dec 18, 2024
2c047c5
Fixed recovery block resilience pattern.
engelmannc Dec 18, 2024
7db5831
Added resilience design patterns intro.
engelmannc Jan 13, 2025
fd79c8d
Added introductory descriptions to each resilience design pattern.
engelmannc Jan 13, 2025
379df90
Minor corrections
engelmannc Jan 13, 2025
4573df8
Added forward error correction pattern
engelmannc Jan 13, 2025
cb07945
Added forward error correction pattern
engelmannc Jan 13, 2025
2e6cf00
Added statement about exclusiion of Self Stabilization patterns
engelmannc Jan 13, 2025
46c735f
Added citation and additional resilience pattern composition text.
engelmannc Jan 14, 2025
f0a69ee
Added additional introductory text to each resilience design pattern.
engelmannc Jan 14, 2025
b740c91
Added the remaining missing descriptions to the resilience design pat…
engelmannc Jan 16, 2025
efb96c9
Revisited performance/reliability/availability for Monitoring, Predic…
engelmannc Jan 16, 2025
ba8819b
Revisited performance/reliability/availability for Monitoring, Predic…
engelmannc Jan 16, 2025
338 changes: 338 additions & 0 deletions bibliography.bib
@@ -957,3 +957,341 @@ @misc{olcf:ace
year = "2024",
url = "https://docs.olcf.ornl.gov/ace_testbed/"
}

@article{daly06higher,
author = "John T. Daly",
title = "A higher order estimate of the optimum checkpoint interval for restart dumps",
journal = "Future Generation Computer Systems",
volume = "22",
number = "3",
pages = "303--312",
year = 2006,
issn = "0167-739X",
doi = "10.1016/j.future.2004.11.016"
}

@inproceedings{Bautista-Gomez:2011,
author = {Bautista-Gomez, Leonardo and Tsuboi, Seiji and Komatitsch, Dimitri and Cappello, Franck and Maruyama, Naoya and Matsuoka, Satoshi},
title = {{FTI}: High Performance Fault Tolerance Interface for Hybrid Systems},
booktitle = {Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis},
series = {SC '11},
year = {2011},
isbn = {978-1-4503-0771-0},
location = {Seattle, Washington},
pages = {32:1--32:32},
articleno = {32},
numpages = {32}
}

@inproceedings{ansel2009dmtcp,
title={{DMTCP}: Transparent Checkpointing for Cluster Computations
and the Desktop},
author={Ansel, Jason and Arya, Kapil and Cooperman, Gene},
booktitle={2009 IEEE International Symposium on Parallel
\& Distributed Processing (IPDPS'09)},
pages={1--12},
year={2009},
organization={IEEE},
address = "Rome, Italy",
}

@inproceedings{Fiala:2012,
author = "David Fiala
and Frank Mueller
and Christian Engelmann
and Kurt Ferreira
and Ron Brightwell
and Rolf Riesen",
title = "Detection and Correction of Silent Data Corruption for
Large-Scale High-Performance Computing",
booktitle = "Proceedings of the $25^{th}$ IEEE/ACM International
Conference on High Performance Computing, Networking, Storage
and Analysis (SC) 2012",
pages = "78:1--78:12",
month = nov # "~10-16, ",
year = "2012",
address = "Salt Lake City, UT, USA",
publisher = "ACM Press, New York, NY, USA",
isbn = "978-1-4673-0804-5",
doi = "10.1109/SC.2012.49"
}

@article{he09symmetric,
author = "Xubin (Ben) He
and Li Ou
and Christian Engelmann
and Xin Chen
and Stephen L. Scott",
title = "Symmetric Active/Active Metadata Service for High
Availability Parallel File Systems",
journal = "Journal of Parallel and Distributed Computing (JPDC)",
volume = "69",
number = "12",
pages = "961--973",
month = dec,
year = "2009",
publisher = "Elsevier B.V., Amsterdam, The Netherlands",
issn = "0743-7315",
doi = "10.1016/j.jpdc.2009.08.004"
}

@inproceedings{yu06benefits,
author = "Weikuan Yu and Ranjit Noronha and Shuang Liang and Dhabaleswar K. Panda",
title = "Benefits of High Speed Interconnects to Cluster File Systems: {A} Case Study with {Lustre}",
booktitle = "Proceedings of the $20^{th}$ {IEEE} International Parallel and Distributed Processing Symposium ({IPDPS}) 2006",
pages = "8--15",
month = apr # "~25-29, ",
year = "2006",
address = "Rhodes Island, Greece",
publisher = "IEEE Computer Society",
isbn = "1-4244-0054-6"
}

@inproceedings{yoo03slurm,
author = "Andy B. Yoo and Morris A. Jette and Mark Grondona",
title = "{SLURM}: {S}imple {Linux} Utility for Resource Management",
booktitle = "Lecture Notes in Computer Science: Proceedings of the $9^{th}$ International Workshop on Job Scheduling Strategies for Parallel Processing ({JSSPP}) 2003",
volume = "2862",
pages = "44--60",
month = jun # "~24, ",
year = "2003",
address = "Seattle, WA, USA",
publisher = "Springer",
isbn = "978-3-540-20405-3",
issn = "0302-9743"
}

@misc{IPMI,
title = "Intelligent Platform Management Interface ({IPMI}), v2.0 specification",
howpublished = "\url{http://www.intel.com/content/www/us/en/servers/ipmi/ipmi-home.html}",
year = "2015",
}

@article{Ganglia,
author = {Massie, Matthew L and Chun, Brent N and Culler, David E},
title = {The ganglia distributed monitoring system: design, implementation, and experience},
journal = {Parallel Computing},
volume = {30},
number = {7},
pages = {817--840},
year = {2004},
issn = {0167-8191},
doi = {10.1016/j.parco.2004.04.001},
url = {http://ganglia.info/},
}

@article{pla06drbd,
author = "Pedro Pla",
title = "{DRBD} in a {Heartbeat}",
journal = "Linux Journal (LJ)",
month = sep,
year = "2006",
url = "http://www.linuxjournal.com/article/9074",
abstract = "How to build a redundant, high-availability system with DRBD and Heartbeat."
}

@Misc{vampir,
title = "Vampir - Performance Optimization",
howpublished = "\url{https://vampir.eu}",
}

@inproceedings{engelmann09proactive,
author = "Christian Engelmann
and Geoffroy R. Vall{\'e}e
and Thomas Naughton
and Stephen L. Scott",
title = "Proactive Fault Tolerance Using Preemptive Migration",
booktitle = "Proceedings of the $17^{th}$ Euromicro International
Conference on Parallel, Distributed, and network-based
Processing (PDP) 2009",
pages = "252--257",
month = feb # "~18-20, ",
year = "2009",
address = "Weimar, Germany",
publisher = "IEEE Computer Society, Los Alamitos, CA, USA",
isbn = "978-0-7695-3544-9",
issn = "1066-6192",
doi = "10.1109/PDP.2009.31"
}

@article{wang12proactive,
author = "Chao Wang
and Frank Mueller
and Christian Engelmann
and Stephen L. Scott",
title = "Proactive Process-Level Live Migration and Back Migration in
{HPC} Environments",
journal = "Journal of Parallel and Distributed Computing (JPDC)",
volume = "72",
number = "2",
pages = "254--267",
month = feb,
year = "2012",
publisher = "Elsevier B.V., Amsterdam, The Netherlands",
issn = "0743-7315",
doi = "10.1016/j.jpdc.2011.10.009"
}

@inproceedings{nagarajan07proactive,
author = "Arun B. Nagarajan
and Frank Mueller
and Christian Engelmann
and Stephen L. Scott",
title = "Proactive Fault Tolerance for {HPC} with {Xen}
Virtualization",
booktitle = "Proceedings of the $21^{st}$ ACM International
Conference on Supercomputing (ICS) 2007",
pages = "23--32",
month = jun # "~16-20, ",
year = "2007",
address = "Seattle, WA, USA",
publisher = "ACM Press, New York, NY, USA",
isbn = "978-1-59593-768-1",
doi = "10.1145/1274971.1274978"
}

@inproceedings{liang06blue,
author = "Yinglung Liang
and Yanyong Zhang
and A. Sivasubramaniam
and M. Jette and R. Sahoo",
title = "{Blue Gene/L} Failure Analysis and Prediction Models",
booktitle = "Proceedings of the $36^{th}$ IEEE/IFIP International Conference
on Dependable Systems and Networks (DSN) 2006",
pages = "425--434",
month = jun,
year = "2006",
publisher = "IEEE Computer Society, Los Alamitos, CA, USA",
issn = "1530-0889",
doi = "10.1109/DSN.2006.18"
}

@inproceedings{sahoo03critical,
author = "Sahoo, R. K.
and Oliner, A. J.
and Rish, I.
and Gupta, M.
and Moreira, J. E.
and Ma, S.
and Vilalta, R.
and Sivasubramaniam, A.",
title = "Critical Event Prediction for Proactive Management in Large-scale
Computer Clusters",
booktitle = "Proceedings of the $9^{th}$ ACM SIGKDD International Conference
on Knowledge Discovery and Data Mining (KDD) 2003",
pages = "426--435",
year = "2003",
address = "Washington, DC, USA",
publisher = "ACM Press, New York, NY, USA",
isbn = "1-58113-737-0",
doi = "10.1145/956750.956799"
}

@misc{Nvidia:DPR,
author={Nvidia},
title={Dynamic Page Retirement, Reference Guide vR352},
howpublished={\url{https://docs.nvidia.com/deploy/pdf/Dynamic_Page_Retirement.pdf}},
year={2015},
}

@article{Bland:2013:IJHPCA,
author = {Bland, Wesley and Bouteiller, Aurelien and Herault, Thomas and Bosilca, George and Dongarra, Jack},
title = {Post-failure recovery of {MPI} communication capability: Design and rationale},
volume = {27},
number = {3},
pages = {244--254},
year = {2013},
journal = {International Journal of High Performance Computing Applications}
}

@article{Chien:2016,
author = {A Chien and P Balaji and N Dun and A Fang and H Fujita and K Iskra and Z Rubenstein and Z Zheng and J Hammond and I Laguna and D Richards and A Dubey and B van Straalen and M Hoemmen and M Heroux and K Teranishi and A Siegel},
title = {Exploring versioned distributed arrays for resilience in scientific applications: global view resilience},
journal = {The International Journal of High Performance Computing Applications},
pages = {1094342016664796},
year = {2016},
doi = {10.1177/1094342016664796},
}

@article{ltaief08fault,
author = "Hatem Ltaief and Edgar Gabriel and Marc Garbey",
title = "{F}ault Tolerant Algorithms for Heat Transfer Problems",
journal = "Journal of Parallel and Distributed Computing (JPDC)",
volume = "68",
number = "5",
pages = "663--677",
year = "2008",
publisher = "Elsevier",
issn = "0743-7315",
doi = "10.1016/j.jpdc.2007.09.004"
}

@inproceedings{Chung:2011:SC,
author = {Chung, Jinsuk and Lee, Ikhwan and Sullivan, Michael and Ryoo, Jee Ho and Kim, Dong Wan and Yoon, Doe Hyun and Kaplan, Larry and Erez, Mattan},
title = {Containment domains: a scalable, efficient, and flexible resilience scheme for exascale systems},
booktitle = {Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis},
year = {2012},
location = {Salt Lake City, Utah},
pages = {58:1--58:11},
numpages = {11},
}

@Misc{Mellanox:2011,
author={{Mellanox Technologies}},
title={Mellanox {I}nfiniBand {FDR} 56{G}b/s For Server and Storage Interconnect Solutions},
howpublished={\url{http://www.mellanox.com/related-docs/whitepapers/WP_InfiniBand_FDR.pdf}},
year={2011},
}

@book{Moon:2005,
title={Error Correction Coding: Mathematical Methods and Algorithms},
author={Moon, Todd K},
year={2005},
publisher={Wiley-Interscience},
isbn={9780471648000},
doi={10.1002/0471739219}
}

@article{Huang:1984,
author={Kuang-Hua Huang and J. A. Abraham},
journal={IEEE Transactions on Computers},
title={Algorithm-Based Fault Tolerance for Matrix Operations},
year={1984},
volume={C-33},
number={6},
pages={518--528},
month=jun,
}

@inproceedings{jeong203d,
author = "Haewon Jeong
and Yaoqing Yang
and Christian Engelmann
and Vipul Gupta
and Tze Meng Low
and Pulkit Grover
and Viveck Cadambe
and Kannan Ramchandran",
title = "{3D} Coded {SUMMA}: {C}ommunication-Efficient and Robust
Parallel Matrix Multiplication",
booktitle = "Lecture Notes in Computer Science: Proceedings of the
$26^{th}$ European Conference on Parallel and Distributed Computing
(Euro-Par) 2020",
volume = "12247",
pages = "392--407",
month = aug # "~24-28, ",
year = "2020",
address = "Warsaw, Poland",
publisher = "Springer Verlag, Berlin, Germany",
isbn = "978-3-030-57674-5",
doi = "10.1007/978-3-030-57675-2\_25"
}

@misc{ibm:chipkill,
author = "Timothy J. Dell",
title = "A White Paper on the Benefits of Chipkill-Correct {ECC} for {PC} Server Main Memory",
year = "1997",
howpublished = "{IBM Microelectronics Division}",
url = "http://www.ece.umd.edu/courses/enee759h.S2003/references/chipkill_white_paper.pdf"
}
4 changes: 3 additions & 1 deletion sos/logical/errors.rst
@@ -309,7 +309,9 @@ User-defined response
user-defined response would request additional resources ahead of time and
perform user-defined actions when required. Additionally, delegation may
also be an appropriate user-defined response, such as when another task is
better equipped to handkle an error or failure.
better equipped to handle an error or failure. See
:ref:`intersect:arch:sos:logical:resilience` for different options for a
user-defined response.

There is also the aspect of error and failure handling in the
:ref:`intersect:arch:sos:operational`, specifically with
1 change: 1 addition & 0 deletions sos/logical/index.rst
@@ -20,3 +20,4 @@ between the different components. The logical view defines
systems/index
adapters
errors
resilience/index