Statistical Framework for Two-Party Acceptance Testing of HPC Systems for Reliability
Time: Sunday, 14 November 2021, 11:35am - 12pm CST
Description: HPC clusters and supercomputers are capital investments, and they undergo close scrutiny to ensure that the system meets agreed-upon metrics of performance, reliability, and usability. Once a system is delivered, it is carefully evaluated to verify that the agreed-upon specifications have been met. This evaluation is referred to as an "acceptance test." Both the HPC vendor and the data center buying the system have a vested interest in passing this test, though their goals are sometimes at odds. While the buyer wants the system that was agreed upon, the vendor wants the test to pass in a timely manner so that it can be paid. This creates a delicate balance in which both parties must agree on testing parameters that satisfy their goals and optimize their objectives.

This paper focuses on the reliability testing aspect of an acceptance test and outlines how test parameters are set for the length of the test and the number of acceptable failures. Several statistical approaches are presented that illuminate the relationships among salient model parameters and both vendor and buyer constraints. Techniques for accepting a less reliable machine, and the ramifications of doing so, are also presented. Finally, simulations analyze six different HPC workloads and how strongly accepting a less reliable machine would affect the buyer. The techniques presented in this paper can be used by data center operators and procurement teams to evaluate systems during acceptance testing, and by HPC vendors to minimize their risk of failure.
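
To make the trade-off between test length and number of acceptable failures concrete, the following is a minimal sketch and not the paper's actual model. It assumes failures arrive as a Poisson process with a constant mean time between failures (MTBF). The function names and the example numbers are hypothetical, chosen only for illustration.

```python
import math


def prob_pass(test_hours, max_failures, mtbf_hours):
    """Probability of seeing at most max_failures failures during the test.

    Assumption (not from the abstract): the failure count is Poisson
    distributed with mean test_hours / mtbf_hours.
    """
    mean = test_hours / mtbf_hours
    return sum(
        math.exp(-mean) * mean**k / math.factorial(k)
        for k in range(max_failures + 1)
    )


def vendor_risk(test_hours, max_failures, agreed_mtbf):
    """Chance that a machine meeting the agreed MTBF still fails the test."""
    return 1.0 - prob_pass(test_hours, max_failures, agreed_mtbf)


def buyer_risk(test_hours, max_failures, degraded_mtbf):
    """Chance that a machine with a worse MTBF still passes the test."""
    return prob_pass(test_hours, max_failures, degraded_mtbf)


if __name__ == "__main__":
    # Hypothetical numbers: a 30-day test, agreed MTBF of 240 hours,
    # and a "less reliable" machine with an MTBF of 120 hours.
    for allowed in range(0, 7):
        v = vendor_risk(720, allowed, 240)
        b = buyer_risk(720, allowed, 120)
        print(f"allowed failures={allowed}  vendor risk={v:.3f}  buyer risk={b:.3f}")
```

Under these assumptions, allowing more failures lowers the vendor's risk and raises the buyer's risk, which is the tension that the two parties must negotiate when setting test parameters.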
