Incorporating Fault-Tolerance Awareness into System-Level Modeling and Simulation
Event Type
Workshop
Online Only
Extreme Scale Computing
Reliability and Resiliency
Time
Sunday, 14 November 2021, 3:35pm - 4pm CST
Location
Online
Description
As the design space for supercomputers grows, modeling and simulation (MODSIM) becomes more important for system-level design space exploration (DSE). Furthermore, extreme-scale systems and newer technologies can lead to higher fault rates, which negatively affect system performance. System performance predictions should therefore include the effects of faults and fault-tolerance (FT) techniques in order to inform system design.
BE-SST is an existing MODSIM methodology and workflow that uses abstraction to simplify and accelerate MODSIM for DSE. This paper documents the incorporation of fault-tolerance awareness into BE-SST, which adds the ability to model the checkpointing costs of a full system using the previously validated BE-SST MODSIM approach. We present the process used to extend BE-SST, enabling the creation and validation of fault-tolerance-aware (FT-aware) performance models, which can be used in BE-SST to predict the effects of system and application parameters on FT overhead.
Additionally, this paper presents a case study in which full-system performance, comprising application, hardware, and FT technique, is simulated using BE-SST. We validate both FT-aware and non-FT-aware performance models against actual system performance, finding an average percent error of less than 17% for individual functions and 20% for full application runs, demonstrating that the approach accelerates the exploration and reduction of a large design space with an acceptable level of accuracy. Finally, we analyze how the FT-aware and non-FT-aware models differ and suggest DSE use cases based on their differences.
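The validation metric quoted above can be illustrated with a minimal sketch. It assumes "average percent error" means the mean of absolute percent errors between predicted and measured runtimes; the abstract does not state the exact definition, and the function names and sample values below are hypothetical, not taken from the paper or from BE-SST.

```python
def percent_error(predicted, measured):
    """Absolute percent error of one prediction relative to its measurement."""
    if measured == 0:
        raise ValueError("measured value must be nonzero")
    return abs(predicted - measured) / abs(measured) * 100.0


def average_percent_error(pairs):
    """Mean absolute percent error over (predicted, measured) pairs.

    Assumption: this is the averaging used for validation; the source
    abstract does not spell out the formula.
    """
    errors = [percent_error(p, m) for p, m in pairs]
    if not errors:
        raise ValueError("need at least one (predicted, measured) pair")
    return sum(errors) / len(errors)


# Hypothetical runtimes in seconds (illustrative only, not paper data).
samples = [(11.0, 10.0), (18.0, 20.0), (4.0, 5.0)]
ape = average_percent_error(samples)
```

Under this assumed definition, a model would meet the reported accuracy for individual functions if `ape` stayed below 17 across the measured function timings.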

