The International Conference for High Performance Computing, Networking, Storage and Analysis
Performance Variability Due to Job Placement on Edison
Student: Dylan Wang (University of California, Davis)
Supervisor: Abhinav Bhatele (Lawrence Livermore National Laboratory)
Abstract: Some applications running on machines like the Edison supercomputer can suffer from high run-time variability. This makes debugging and optimization harder and makes it difficult to request accurate reservation times. Supercomputers with high performance variability are less useful for end users and less efficient for the supercomputing facility. The objective of this research is to characterize application run-time performance and identify the root cause of the variability on machines with the Aries network. We approach this problem by running two applications, MILC and AMG, while collecting the logical coordinates of every job's nodes in the system and the hardware counters on routers connected to our application's nodes. Running these applications at various sizes once per day gives us many unique allocations and system states with corresponding execution times. We then apply statistical analysis to the gathered data to try to correlate performance with placement and interference.
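
As a rough illustration of the last step, the sketch below computes a Pearson correlation between one placement feature and run time. It is not the analysis used in this work: the feature (number of distinct groups a job's nodes span), the (group, row, column) coordinate layout, and the function names are all assumptions made for the example, and the values in the usage lines are made up rather than measured.

```python
from math import sqrt


def groups_spanned(node_coords):
    """Count distinct groups among a job's nodes.

    Assumes each node is a hypothetical (group, row, col) tuple;
    the real logical-coordinate format may differ.
    """
    return len({coord[0] for coord in node_coords})


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Illustrative usage with made-up allocations and run times (not measurements).
allocations = [
    [(0, 0, 1), (0, 2, 3)],             # job confined to one group
    [(0, 0, 1), (1, 4, 2)],             # job spread over two groups
    [(0, 1, 1), (2, 0, 5), (3, 3, 3)],  # job spread over three groups
]
run_times_s = [100.0, 112.0, 130.0]
spread = [groups_spanned(a) for a in allocations]
r = pearson(spread, run_times_s)
```

A coefficient near zero would suggest this particular feature explains little of the variability; a single feature like this cannot separate placement effects from interference by other jobs, which is why router hardware counters are collected as well.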