Join us on October 23, 2024, for an overview of Distributed Asynchronous Object Storage (DAOS) and application best practices. DAOS is the primary file system on Aurora, with 230 PB of capacity and ~31 TB/s of bandwidth. Integrating your application to use DAOS for I/O and checkpointing is crucial. In this tutorial, ALCF's Gordon McPheeters, Paul Coffman, and Kaushik Velusamy will provide a technical overview of DAOS, explain how it differs from other parallel file systems such as Lustre and GPFS, introduce the tools DAOS offers, and show how your application can integrate with DAOS, along with best practices to follow. The speakers will also present results from other applications currently using DAOS.
Gordon McPheeters
Paul Coffman is a member of ALCF's Performance Engineering group. After a 25-year technical career with IBM, culminating with the Blue Gene/Q Software System Test team, he left IBM in 2015 to join the Catalyst group at the ALCF, optimizing scientific HPC applications on BG/Q Mira. In 2017, he transferred to the Performance Engineering group, where he focused on HPC I/O and messaging until he went on hiatus in 2018. He returned to his former position in 2023 to work on DAOS file system I/O performance and scalability on HPE-Intel Aurora.
Kaushik Velusamy is an Assistant Computer Scientist in the data science group at the Argonne Leadership Computing Facility, Argonne National Laboratory. His research focuses on optimizing data access performance for machine learning and exploring novel computer architectures. His current projects include scientific data management, DAOS, deep learning I/O (DLIO), HDF5, collective communications, distributed memory systems, parallel I/O, and large-scale distributed deep learning. He received his Ph.D. in Computer Science from the University of Maryland in 2021 under the guidance of Dr. Milton Halem and Dr. John Dorband.