SC21 Proceedings

The International Conference for High Performance Computing, Networking, Storage, and Analysis

Operational Data Analytics

Authors: Alessio Netti (Leibniz Supercomputing Centre), Michael Ott (Leibniz Supercomputing Centre), Rachel Palumbo (Oak Ridge National Laboratory (ORNL)), Torsten Wilde (Hewlett Packard Enterprise), Keiji Yamamoto (RIKEN)

Abstract: Many HPC sites are developing and deploying systems for operational data analytics (ODA) to help them understand and optimize their HPC operations. The complexity and sophistication of these systems, as well as the components of HPC operations they cover, vary significantly, and the terminology used is ambiguous. In this BoF, we want to discuss the current state of the practice in ODA and investigate future developments. We will introduce and leverage a new conceptual framework that establishes common terminology and scope, which will help fuel the discussion with the audience and paint a meaningful picture of ODA.

Long Description: The goal of ODA is to allow for the continuous monitoring, archiving, and analysis of near real-time performance data, providing immediately actionable information for multiple operational uses. As more and more HPC sites endeavour to roll out ODA systems to optimize their operations, common terminology and categories are needed to understand, compare, and discuss ODA activities.

To this end, members of the EEHPCWG have proposed a conceptual framework (A. Netti, W. Shin, M. Ott, T. Wilde and N. Bates. "A Conceptual Framework for HPC Operational Data Analytics." In Proceedings of the Energy Efficient HPC State of the Practice Workshop. IEEE, 2021) that provides both a holistic, integrative picture and an actionable subdivision of ODA to help the HPC community. The framework employs a staged model of data analytics capabilities that consists of four types: descriptive, diagnostic, predictive, and prescriptive. These capabilities are framed within the "4-Pillar Framework for Energy-Efficient HPC Data Centers", which consists of building infrastructure, system hardware, system software, and applications. Combining the two models creates a two-dimensional, four-by-four grid onto which users can map data analytics techniques, HPC ODA systems, and tools according to their scope (the four pillars) and type (the four types of analytics). While the scope indicates the comprehensiveness of an ODA capability, the type of analytics indicates its degree of sophistication, which helps establish staged roadmaps when planning HPC ODA systems.
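The four-by-four grid described above can be pictured as a small lookup structure. The following is only a minimal sketch under stated assumptions, not part of the published framework: the class names `Pillar`, `AnalyticsType`, and `OdaGrid`, the numeric stage ordering (taken from the order in which the four types are listed), and the example capability names are all hypothetical and chosen for illustration.

```python
from enum import Enum


class Pillar(Enum):
    """Scope axis: the four pillars named in the text."""
    BUILDING_INFRASTRUCTURE = "building infrastructure"
    SYSTEM_HARDWARE = "system hardware"
    SYSTEM_SOFTWARE = "system software"
    APPLICATIONS = "applications"


class AnalyticsType(Enum):
    """Type axis. The numeric values are an assumption: they follow the
    order in which the text lists the four types."""
    DESCRIPTIVE = 1
    DIAGNOSTIC = 2
    PREDICTIVE = 3
    PRESCRIPTIVE = 4


class OdaGrid:
    """Hypothetical 4x4 grid mapping ODA capabilities to (pillar, type) cells."""

    def __init__(self):
        # One empty list per cell: 4 pillars x 4 analytics types = 16 cells.
        self._cells = {(p, t): [] for p in Pillar for t in AnalyticsType}

    def place(self, name, pillar, analytics_type):
        """Record a capability in the cell for its scope and type."""
        self._cells[(pillar, analytics_type)].append(name)

    def cell(self, pillar, analytics_type):
        """Return a copy of the capabilities recorded in one cell."""
        return list(self._cells[(pillar, analytics_type)])

    def covered_pillars(self):
        """Pillars with at least one capability (a rough view of scope)."""
        return {p for (p, _), names in self._cells.items() if names}

    def highest_stage(self, pillar):
        """Most advanced analytics type present for a pillar, or None."""
        present = [t for t in AnalyticsType if self._cells[(pillar, t)]]
        return max(present, key=lambda t: t.value) if present else None


# Made-up example capabilities, for illustration only.
grid = OdaGrid()
grid.place("facility power dashboard",
           Pillar.BUILDING_INFRASTRUCTURE, AnalyticsType.DESCRIPTIVE)
grid.place("node failure forecasting",
           Pillar.SYSTEM_HARDWARE, AnalyticsType.PREDICTIVE)
grid.place("node health dashboard",
           Pillar.SYSTEM_HARDWARE, AnalyticsType.DESCRIPTIVE)

print(sorted(p.value for p in grid.covered_pillars()))
print(grid.highest_stage(Pillar.SYSTEM_HARDWARE))
```

In such a sketch, the set of covered pillars gives a rough view of how comprehensive a site's ODA deployment is, while the highest analytics type reached per pillar gives a rough view of its sophistication.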

In this BoF, we will use the proposed HPC ODA framework to establish common terminology and categories for understanding, comparing, and discussing ODA activities in production deployments at leading-edge HPC facilities. Furthermore, we plan to engage with the audience and stimulate discussion of their own ODA use cases. This structured discussion will allow us, as well as the audience, to gain a valuable and understandable picture of the state of the practice in ODA, highlighting trends, requirements, and common pitfalls.

