ALCF Training Team

Join us on May 28, 2025, for a webinar on Accelerating AI Training and Inference for Science on Aurora: Frameworks, Tools, and Best Practices.

In this developer session, we will provide an overview of the key AI frameworks, toolkits, and strategies for achieving high-performance training and inference for science on Aurora. We'll walk through examples of using PyTorch and TensorFlow on Aurora, followed by distributed training at scale using PyTorch with Distributed Data Parallel (DDP) and TensorFlow with Horovod, both driven by the oneCCL communication library. We will also show how to use Python effectively on Intel's GPUs with Data Parallel Extensions for Python (DPEP). To help you get the most out of the GPUs, we will cover best practices for profiling codes and identifying bottlenecks. Finally, we will cover pre-training, fine-tuning, and inference on Aurora, along with associated best practices.

Starts
Ends
America/Chicago
Online
Virtual meeting information to follow