Training Deep Learning Models Using ThinkSystem SR680a V3, SR780a V3, SR685a V3 Compute Nodes with DDN AI400X2 Storage Nodes

Reference Architecture

Published: 8 Sep 2024
Form Number: LP2021
PDF size: 42 pages, 3.2 MB

Abstract

Training deep learning models, including Generative AI (GenAI) models and their subset, Large Language Models (LLMs), requires data movement through the network to be efficient and rapid, with little to no data backup. Lenovo approaches this challenge with powerful, high-performance data center appliances (SR680a V3, SR780a V3, and SR685a V3) that support NVIDIA 8-way GPU configurations, paired with high-performance storage built on DDN's AI-optimized storage appliances.

This reference architecture treats data movement during training and GPU utilization as the primary design considerations. It uses the latest NVIDIA H100 and H200 GPUs along with an InfiniBand network topology to deliver the speeds necessary to train large, comprehensive models. The components of the architecture are described, an example of a scalable unit is provided, and the bill of materials for the design is included.
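The reference architecture itself contains no code; purely as illustration of the workload it targets, the following minimal sketch shows the kind of data-parallel training loop an 8-GPU node would run, assuming PyTorch with the NCCL backend (NCCL carries the gradient all-reduce traffic over InfiniBand/RDMA when available). The model, sizes, and file name are hypothetical and are not taken from the document.

import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # NCCL picks up RDMA transports such as InfiniBand automatically when
    # present, so inter-GPU gradient traffic avoids the host CPU where possible.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Toy layer standing in for a large model; each rank owns one GPU.
    model = torch.nn.Linear(4096, 4096).cuda(local_rank)
    model = DDP(model, device_ids=[local_rank])
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

    for step in range(10):
        inputs = torch.randn(32, 4096, device=local_rank)
        loss = model(inputs).square().mean()
        optimizer.zero_grad()
        loss.backward()   # gradients are all-reduced across all GPUs here
        optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()

On a single 8-way GPU node this would be launched as, for example, torchrun --nproc_per_node=8 train.py; the per-step all-reduce during backward() is the network-sensitive data movement the architecture is designed around.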

Table of Contents

1. Introduction
2. Architectural Overview
3. Compute Layer
4. DDN Storage Layer
5. Neptune Water Cooled Technology
Appendix: Lenovo Bill of Materials
Resources


Related product families

The following product families are related to this document: