Join us on April 30, 2025, for a webinar on the messaging software implementation on Aurora and how to choose the right programming environment. ALCF's Vitali Morozov will discuss affinity and process placement, which ensure that CPU cores, GPUs, NICs, and memory domains interact over the shortest path with maximum efficiency.
Aurora is an exascale supercomputer located at the Argonne Leadership Computing Facility. The system has 10,624 compute nodes, each with two 52-core Intel Xeon CPUs, six discrete Intel Data Center GPU Max Series GPUs, two DDR5 memory domains, two high-bandwidth memory domains, and eight Slingshot network cards. We use the Message Passing Interface (MPI) to program this machine; however, the complexity and scale of the system require special considerations to achieve the expected stability and performance. Intel has made significant contributions to improving the interaction of various system components within an MPI context, and the speaker will present the results of some of those contributions that users can apply in their own work. Some time will be spent on choosing the number of processes on a node, the distribution of processes on a node, compact or distributed placement of nodes in the system, collective operations, and recommended environment variables. Attendees are expected to use a large fraction of Aurora for production computations, so we will pay particular attention to the large-scale multi-node environment.
Vitali Morozov is a Senior Software Engineer at the Argonne Leadership Computing Facility. He received an M.S. in Mathematics and an M.S. in Computer Science from Novosibirsk State University, and a Ph.D. in Engineering from Ershov’s Institute for Informatics Systems, Novosibirsk, Russia. At Argonne since 2001, he has worked on computer simulation of plasma generation, plasma-material interactions, plasma thermal and optical properties, and applications to laser and discharge-produced plasmas. At the ALCF, he has been working on performance projections, performance analysis and simulation, studying hardware trends, and evaluating experimental and non-conventional hardware. He also ports applications to large-scale supercomputers and tunes them for those systems.