Introduction
The AI research centre of one of Germany's most advanced computer-science universities needed to upgrade its storage infrastructure to cope with the growing demand from more than 20 world-class Machine Learning research groups.
The AI centre, in collaboration with NEC Deutschland GmbH, implemented a high-performance, multi-petabyte storage solution based on NVMe servers, leveraging xiRAID software and the Lustre file system. This case study explores the transition, benefits, and impact of this solution on the centre’s operations.
Challenge
As the centre’s AI research accelerated, so did the volume of data generated and the demand for fast access to it. The research groups are active in many fields of Machine Learning: the foundations of Artificial Intelligence, computer vision, AI in medicine, continual learning and sustainability, to name a few. Each group runs different workloads, creating the need for a more powerful and flexible storage infrastructure capable of handling sequential and random read and write peaks concurrently.
The centre’s previous set-up consisted of:
- 5 servers connected to a 48-bay JBOF with 1.92 TB Samsung SATA SSDs.
- Broadcom hardware RAID controllers protecting the drives.
- The Lustre file system.
This configuration proved inadequate for the steadily increasing capacity, throughput and I/O demands of the various AI workloads. The AI centre’s compute infrastructure consists of 15 NVIDIA DGX H100 supercomputers, 28 nodes with 8 x NVIDIA GeForce RTX 2080Ti each, and 40 nodes with 8 x NVIDIA A100 each. These complex and expensive compute nodes were often left waiting for storage, wasting precious compute resources, delaying project execution and reducing the number of projects the AI centre could take on.
xiRAID+Lustre Solution
To address this challenge, the centre partnered with NEC Deutschland GmbH, one of Europe’s leading providers of HPC and AI solutions.
The AI centre already had deep expertise in operating Lustre file systems and did not want to invest resources in learning to manage a different one. Indeed, adopting a different file system would have greatly delayed deployment, since the storage administrators would first have needed training. To gain the performance required for the new workloads, NEC recommended transitioning from SATA SSDs to NVMe SSDs.
At that point, hardware RAID was no longer an option, due to its limited performance.
NEC therefore recommended implementing Xinnor’s xiRAID Classic, a software RAID solution specifically designed to handle the high parallelism of NVMe devices. Thanks to its architecture, xiRAID protects NVMe drives from hardware failures without penalizing performance.
The deployed solution consists of:
First Location:
- Storage: 1.7 PB on Micron Max 7450 12.8 TB PCIe 4.0 U.3 NVMe drives.
- Nodes: 14 nodes (1 MDT with 4 drives, 13 OSTs with 10 drives).
- Connectivity: NDR 200.
- AI Hardware: 15 x NVIDIA DGX H100 supercomputers.
- Software: xiRAID+Lustre.
Second Location:
- Storage: 2.8 PB on Micron Max 7450 12.8 TB PCIe 4.0 U.3 NVMe drives.
- Nodes: 21 nodes (3 MDTs, 18 OSTs with 10 drives each).
- Connectivity: 100 Gbps (one port per server), 40 Gbps (one port per compute node, Intel x710).
- AI Hardware: 28 nodes with 8 x NVIDIA GeForce RTX 2080Ti each, 40 nodes with 8 NVIDIA A100 each.
- Software: xiRAID+Lustre.
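As a quick sanity check, the first location's quoted capacity follows directly from the drive counts listed above (this assumes the 1.7 PB figure refers to raw capacity across all MDT and OST drives):

```python
# Raw capacity check for the first location:
# 1 MDT with 4 drives + 13 OSTs with 10 drives each, 12.8 TB per drive.
drive_size_tb = 12.8
total_drives = 4 + 13 * 10            # 134 drives in total
raw_tb = total_drives * drive_size_tb
raw_pb = raw_tb / 1000                # decimal petabytes
print(f"{total_drives} drives, {raw_tb:.1f} TB raw = {raw_pb:.2f} PB")
# 134 drives at 12.8 TB each give about 1.7 PB, matching the quoted figure
```

Usable capacity is lower, since the 8+2 RAID 6 layout described below dedicates two drives per ten to parity.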
On the OSTs, xiRAID was configured as RAID 6, with 2 parity drives per 10 drives (8+2). One advantage of xiRAID is the flexibility to select the ideal RAID level and geometry for the specific customer requirement: in this case, the AI centre wanted a protection level guaranteeing continuous operation with up to 2 failed drives in the same RAID group.

Another xiRAID feature is a configurable chunk size, so the value that best fits the RAID geometry and expected workload can be chosen. This minimizes the number of read-modify-write operations, maximizing performance and extending drive life by reducing write amplification. In this specific case, to get the best write performance on the small files typical of Machine Learning projects (64–256 kB), the chunk size was set to 32 kB.
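The chunk-size reasoning above can be checked with a little arithmetic. The sketch below uses a deliberately simplified model (stripe-aligned writes only) to show why a 32 kB chunk suits 64–256 kB files in an 8+2 geometry:

```python
# Simplified model of an 8+2 RAID 6 stripe with a 32 kB chunk size.
DATA_DRIVES = 8
CHUNK_KB = 32
STRIPE_KB = DATA_DRIVES * CHUNK_KB   # 256 kB of data per full stripe

def stripe_writes(file_kb: int) -> tuple[int, int]:
    """Return (full-stripe writes, leftover kB needing read-modify-write)."""
    return file_kb // STRIPE_KB, file_kb % STRIPE_KB

# A 256 kB file fills exactly one stripe: parity can be computed from the
# new data alone, so no read-modify-write cycle is needed.
print(stripe_writes(256))                   # (1, 0)

# With a larger 128 kB chunk, a full stripe would be 8 * 128 = 1024 kB, so
# the same 256 kB file covers only part of a stripe and forces a
# read-modify-write on the parity drives.
print(256 // (8 * 128), 256 % (8 * 128))    # 0 256
```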
NEC Deutschland has played a crucial role in the design, implementation, and ongoing support of the storage infrastructure. The company's expertise in large-scale HPC and storage solutions and deep understanding of the centre’s requirements were instrumental in delivering a successful solution.
Scalability and Performance
The xiRAID+Lustre solution offers several key advantages for large-scale AI deployments:
- Performance: The combination of NVMe SSDs and xiRAID delivers exceptional performance, enabling rapid data access and analysis.
- Reliability: xiRAID's robust RAID protection ensures data integrity and fault tolerance, minimizing downtime and data loss.
- Flexibility: The Lustre parallel file system architecture provides flexible data access and sharing across multiple nodes and users.
- Scalability: The solution can be easily scaled to accommodate future growth by adding more storage nodes and expanding the Lustre filesystem.
- Ease of management: The integration of xiRAID with Lustre is straightforward, as Lustre is mounted on the block devices generated by xiRAID. RAID creation, management and destruction are extremely easy and well documented in the user manuals.
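As a rough illustration of that workflow, the commands below sketch creating a RAID 6 array and formatting it as a Lustre OST. The xicli flag names, device paths, array name, fsname and MGS address are all assumptions for illustration, not taken from the deployment described here; check them against the xiRAID and Lustre manuals for your installed versions.

```shell
# Sketch only: flags, names and paths are illustrative assumptions.

# Create an 8+2 RAID 6 array with a 32 kB strip size over ten NVMe drives
# (hypothetical drive and array names).
xicli raid create -n ost0 -l 6 -ss 32 \
    -d /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1 /dev/nvme4n1 \
       /dev/nvme5n1 /dev/nvme6n1 /dev/nvme7n1 /dev/nvme8n1 /dev/nvme9n1

# Format the resulting block device as a Lustre OST and mount it
# (hypothetical fsname, MGS node and index).
mkfs.lustre --ost --fsname=aifs --mgsnode=mgs@o2ib --index=0 /dev/xi_ost0
mkdir -p /mnt/ost0
mount -t lustre /dev/xi_ost0 /mnt/ost0
```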
Resiliency of the Solution
The AI centre had the chance to prove the resilience of the cluster when, due to a hardware problem, one of the metadata servers started rebooting regularly. It was critical to fix the problem while minimizing the service interruption. Xinnor’s engineers immediately took the call and provided instructions on how to move the RAID drives and the system drive from the failed server to a new one and re-install the xiRAID license on the new hardware. The operation went smoothly, and within about an hour the cluster was up and running again without any data loss.
Real-World Impact
The xiRAID+Lustre deployment was up and running in just a few hours and has delivered a significant boost to the centre's AI research capabilities. Models that took tens of minutes to load on the previous storage infrastructure are now available within a few seconds. The new infrastructure not only meets current demands but can also scale for future growth. With improved performance, stability, and ease of management, xiRAID has proven to be a crucial asset in supporting the university's ambitious AI projects.