EAR
EAR provides four main added values:
Power and environmental system monitoring and job accounting
Transparent runtime application performance and power monitoring
Dynamic application and cluster energy optimization through simple energy policies
Smart cluster energy and power capping to ensure your cluster does not consume more than what you decide
EAR 4.2 main features
Runtime Energy Optimization
Transparent, dynamic and lightweight runtime library with no user intervention
Automatic energy savings according to energy policies
Support for multiple jobs sharing a node
Application performance and energy accounting
Granularity: jobid, stepid, loop, user, node
Energy and Power Capping
System monitoring
Node temperature, power, effective frequency, ...
Automatic reporting of runtime hardware issues
Hints for system analysis and optimization
Hints for application analysis and optimization
Application signature and traces
Energy savings estimates reported to the DB
Application phases reported to the DB
Initial support for MPI load balancing
Support for relational and non-relational DBs
Detection and correction of erroneous power readings
EAR 4.2 main values
Monitor the system and the application while running through simple commands
Energy accounting and power monitoring
Automatic and dynamic monitoring and reporting of the performance and power characteristics of HPC and AI applications
Ensure nodes are performing as expected through periodic checks
Reduce the cluster power consumption by about 10% while minimizing the performance penalty, through dynamic runtime energy optimization
Ensure the cluster does not consume more than what you decide through energy and power capping
Robust, reliable and operational since:
August 2019 at LRZ on SuperMUC-NG, a 6480-node Intel cluster
May 2022 at SURF on Snellius, a hybrid cluster with Intel/NVIDIA GPUs + AMD partitions
Transparent job submission through a SLURM plugin
Support for Intel/AMD CPUs and NVIDIA GPUs
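
To make the cluster power capping idea from the list above more concrete, here is a minimal sketch of one possible way to keep the sum of per-node power limits under a cluster-wide budget. It is purely illustrative and is not EAR's actual algorithm: the function name `scale_to_cap`, the node names, and the wattage figures are all hypothetical, and a simple proportional scaling rule is assumed.

```python
# Illustrative only: proportional scaling of node power demands to fit a cap.
# This is NOT EAR's implementation; all names and numbers are hypothetical.

def scale_to_cap(node_power_w, cluster_cap_w):
    """Return per-node power limits (watts) whose sum does not exceed the cap.

    If the total demand already fits under the cap, demands are returned
    unchanged. Otherwise every node is scaled down by the same factor.
    """
    total = sum(node_power_w.values())
    if total <= cluster_cap_w:
        return dict(node_power_w)
    factor = cluster_cap_w / total
    return {node: watts * factor for node, watts in node_power_w.items()}


# Hypothetical demand of three nodes, in watts, against an 800 W budget.
demand = {"node01": 400.0, "node02": 300.0, "node03": 300.0}
limits = scale_to_cap(demand, cluster_cap_w=800.0)

for node, watts in sorted(limits.items()):
    print(f"{node}: {watts:.1f} W")
```

A real power capping system would likely also account for idle power floors, job priorities, and measured rather than requested power; the sketch only shows the budget constraint itself.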