Xinnor recently conducted independent validation testing of its xiRAID storage solution in NVIDIA's laboratory, focusing on performance optimization for AI and ML workloads.
One of the primary challenges in designing high-performance systems for AI and ML workloads is creating a storage subsystem that can deliver consistently high throughput without becoming a bottleneck during large-scale training or checkpointing operations.
For systems tailored to AI/ML workloads, the following characteristics are critical:
- High Throughput – to sustain demanding data-intensive training processes.
- Efficient Checkpoint Write Operations – ensuring fast and reliable model state preservation.
- Minimal File System and RAID Overhead – to maximize performance and reduce latency.
In this validation, we examine the Xinnor xiRAID storage solution configured and tested in NVIDIA's laboratory environment.
Testing methodology
To demonstrate efficient checkpointing and data transfer into the model by leveraging a fast storage tier that maintains maximum GPU utilization during model training, we applied the following:
- **Raw NVMe Device Testing.** We ran initial benchmarking of the raw devices to assess the baseline performance capabilities of the specific hardware platform.
- **Xinnor xiRAID Performance Evaluation.** We tested the backend at the xiRAID level both with and without the XFS file system to ensure that the RAID and file system layers introduce minimal overhead and that the achieved throughput remains close to that of raw NVMe drives.
- **Model Training Data Transfer Testing.** We evaluated data transfer performance during training using the MLPerf Storage 2.0 benchmark on a GPU configuration that sustains over 90% GPU utilization across all operational modes, without becoming CPU-bound.
- **Checkpointing Performance Evaluation.** We assessed checkpointing performance using MLPerf Storage 2.0 Checkpointing, demonstrating throughput metrics across different RAID operation modes: normal mode, degraded mode*, and rebuild mode*.
- Degraded mode — operation with reduced fault tolerance due to a disk failure.
- Rebuild mode — operation during data recovery after a failure.
- MLPerf Storage 2.0 — an industry-standard benchmark developed by the MLCommons consortium to evaluate storage system performance for AI/ML workloads. It measures end-to-end I/O performance in scenarios such as model training, checkpointing, and data loading, using realistic deep learning workflows.
System settings
Platform specification
- Server: Supermicro SYS-521GE-TNRT
- Memory: 2 TiB
- Processors: 2 × Intel Xeon Platinum 8580
- Storage: 8 × KIOXIA CD8P-R Series, 7.68 TB each
- Operating System: Ubuntu 24.04.3
- Linux Kernel: 6.11.0-1027-oem
Storage configuration
We logically split each of the eight NVMe drives into two namespaces:
- Namespace 1: 1 GB, allocated for the file system journal; four of these namespaces form a RAID 0 array (stripe size 16 KB).
- Namespace 2: 5.5 TB, allocated for the main data; all eight form a RAID 5 array (stripe size 128 KB).
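Assuming the namespace sizes stated above (and the create commands that follow, where the journal RAID 0 uses four drives and the data RAID 5 uses all eight), a quick sketch of the resulting array capacities:

```python
# Rough capacity check for the two arrays described above.
# Assumes the stated namespace sizes: 1 GB (journal) and 5.5 TB (data).
log_capacity_gb = 4 * 1            # RAID 0: member capacities simply add up
data_capacity_tb = (8 - 1) * 5.5   # RAID 5: one namespace's worth goes to parity

print(log_capacity_gb)   # 4
print(data_capacity_tb)  # 38.5
```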
We created the RAID arrays with the following commands:
xicli raid create -n log -l 0 -ss 16 -d /dev/nvme2n1 /dev/nvme3n1 /dev/nvme4n1 /dev/nvme5n1
xicli raid create -n data -l 5 -ss 128 -d /dev/nvme2n2 /dev/nvme3n2 /dev/nvme4n2 /dev/nvme5n2 /dev/nvme6n2 /dev/nvme7n2 /dev/nvme8n2 /dev/nvme9n2
Output of xicli raid show:
*(screenshot: RAID in reconstruction mode)*
We formatted the file system with XFS, with the journal placed on a separate RAID device.
The mkfs.xfs command was as follows:
mkfs.xfs -d su=128k,sw=7 -l logdev=/dev/xi_log,size=1G /dev/xi_xiraid5 -f -s size=4k
Thus, the main data array (RAID 5) operated under XFS with a stripe unit of 128 KB and a stripe width of 7, while the journal (logdev) was located on a separate RAID device (separate namespace).
This design minimized the impact of journaling on active data, particularly during intensive write operations.
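As a sanity check, the XFS geometry above implies a full-stripe write size that matches the 896 KB block size used in the FIO configurations in the appendix:

```python
# Full-stripe size for the RAID 5 data array under XFS.
su_kb = 128   # stripe unit = RAID chunk size
sw = 7        # stripe width = 8 drives - 1 parity

full_stripe_kb = su_kb * sw
print(full_stripe_kb)  # 896 -> bs=896k in the FIO test files
```

Writing in full-stripe multiples lets RAID 5 compute parity without the read-modify-write penalty.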
Performance results for raw NVMe drives, RAID array, and XFS file system
The table below provides results for the raw devices, the RAID array, and the XFS file system.
Before the measurements, each NVMe drive was overwritten twice using 128 KB blocks.
The FIO configuration files can be found in the appendix.
| Workload pattern | 8 drives | RAID | XFS |
|---|---|---|---|
| Sequential write | 45.3 GB/s | 37.6 GB/s | 33.0 GB/s |
| Sequential read | 53.1–83.4 GB/s* | 49.2–71.4 GB/s* | 49.0–70.9 GB/s* |
We observed no issues with sequential write performance. The theoretical maximum for this configuration is approximately 39.6 GB/s (7/8 of the raw eight-drive throughput, since one drive's worth of bandwidth in RAID 5 goes to parity), and the measured 37.6 GB/s corresponds to roughly 95% of that maximum.
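The arithmetic behind these figures can be reproduced directly from the measured raw-drive throughput:

```python
raw_write_gbs = 45.3                       # 8 raw drives, sequential write
theoretical_raid5 = raw_write_gbs * 7 / 8  # one drive's bandwidth goes to parity
measured_raid5 = 37.6

print(round(theoretical_raid5, 1))                         # 39.6
print(round(measured_raid5 / theoretical_raid5 * 100, 1))  # 94.9
```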
* For sequential reads, we observed fluctuating, unstable performance on the raw devices as well as at the RAID and XFS layers. Since our time on this system was limited, we did not investigate further and proceeded with the remaining tests.
MLPerf Storage 2.0 performance measurement results
MLPerf Storage checkpointing tests
These types of benchmarks evaluate the performance of saving and loading model checkpoints in large-scale training environments. They measure how quickly a system can write checkpoints (Save) and restore them (Load) while training massive models.
For MLPerf Storage checkpointing, we ran the tests using the following parameters in normal, degraded and reconstruction RAID operational modes:
mlpstorage checkpointing run -rd ch_r3 -m llama3-405b --client-host-memory-in-gb 2048 -np 38 -cf /mnt/mlperf/cp --allow-run-as-root --param parameters.checkpoint.fsync=true parameters.framework=pytorch parameters.model.parallelism.pipeline=32 parameters.model.parallelism.tensor=16
- `-rd ch_r3`: run directory (run descriptor) named ch_r3, used to group and label this run.
- `-m llama3-405b`: model selection; here it specifies the LLaMA 3 model with 405B parameters.
- `--client-host-memory-in-gb 2048`: amount of host memory allocated to the client: 2048 GB (2 TB).
- `-np 38`: number of processes.
- `-cf /mnt/mlperf/cp`: checkpoint folder; all checkpoints are stored here.
- `--param parameters.checkpoint.fsync=true`: forces a filesystem sync after writing a checkpoint, guaranteeing data integrity in case of a crash or failure.
- `parameters.model.parallelism.pipeline=32`: pipeline parallelism degree of 32; the model is split into 32 sequential partitions across devices.
- `parameters.model.parallelism.tensor=16`: tensor parallelism degree of 16; each layer's tensor operations are sharded across 16 devices.
The results were the following*:
| RAID operational mode | Save Throughput | Time | Relative efficiency to Normal mode |
|---|---|---|---|
| Normal RAID | 32.18 GB/s | 14.15 sec. | 100% |
| Degraded RAID | 31.27 GB/s | 14.55 sec. | 97% |
| Reconstruction RAID | 29.78 GB/s | 15.47 sec. | 93% |
*Unverified (Result not verified by MLCommons Association)
Taking into account the previously measured file system performance of 33 GB/s, we observe minimal performance degradation across all RAID operational modes, with Save throughput reaching 90–97% of the file system's own performance. Even compared with the theoretical maximum of 39.6 GB/s for the underlying RAID 5 array, ignoring the file system layer entirely, the Save results are very strong. Overall, the system maintains high efficiency under degraded and reconstruction conditions, showing only minor performance loss regardless of the RAID operational mode.
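The Save figures are also internally consistent: multiplying throughput by time in each mode implies a checkpoint of roughly 455–461 GB, i.e. the same checkpoint written at different speeds:

```python
# Implied checkpoint size = save throughput x save time, per RAID mode.
modes = {
    "normal":         (32.18, 14.15),
    "degraded":       (31.27, 14.55),
    "reconstruction": (29.78, 15.47),
}
for mode, (gbs, sec) in modes.items():
    print(mode, round(gbs * sec), "GB")
```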
MLPerf Storage Training tests
These benchmarks measure storage performance during end-to-end model training, focusing on how efficiently training data can be read and fed to accelerators. They evaluate I/O throughput, parallel data access, and prefetching under realistic deep learning workloads, showing how well a system sustains high-performance training at scale.
Since the Supermicro SYS-521GE-TNRT platform supports up to 10 GPUs, with 8 being the recommended number once the need for a high-speed network is taken into account, we conducted MLPerf Storage training tests using 8 and 12 emulated GPUs.
At the beginning, we generated a 75,000-file training dataset for the 3D U-Net model, using 12 emulated GPUs on the local machine, with the following command:
mlpstorage training datagen --hosts=127.0.0.1 --model unet3d --param dataset.num_files_train=75000 --num-accelerators 12 --results-dir /mnt/mlperf/results --data-dir /mnt/mlperf/ --allow-run-as-root
Then we ran the mlpstorage training tests using the following options:
mlpstorage training run --hosts 127.0.0.1 --num-client-hosts 1 --client-host-memory 2048 --num-accelerators 12 --accelerator-type h100 --model unet3d --data-dir /mnt/mlperf/ --results-dir /mnt/mlperf/results --allow-run-as-root --param dataset.num_files_train=75000 reader.odirect=true reader.read_threads=8 reader.prefetch_size=4
- `--num-client-hosts 1`: uses 1 client host.
- `--client-host-memory 2048`: allocates 2048 GB of memory for the client host.
- `--num-accelerators 8 (12)`: training run on 8 (or 12) emulated accelerators (GPUs).
- `--accelerator-type h100`: GPU type is NVIDIA H100.
- `--model unet3d`: trains the 3D U-Net model.
- `--data-dir /mnt/mlperf/`: location of the training dataset.
- `--results-dir /mnt/mlperf/results`: directory where training results are stored.
- Training parameters (`--param`):
  - `dataset.num_files_train=75000`: uses 75,000 training files.
  - `reader.odirect=true`: enables O_DIRECT I/O (direct disk access, bypassing the OS cache).
  - `reader.read_threads=8`: uses 8 threads for reading data.
  - `reader.prefetch_size=4`: prefetches 4 batches into memory to speed up training.
The main output parameter for MLPerf Training is Training Utilization. This metric indicates how efficiently resources are being used during model training on a given system. Training utilization for 8 and 12 GPUs is shown in the table below*:
| Number of Accelerators | Training Utilization, Normal RAID | Training Utilization, Degraded RAID | Training Utilization, Reconstruction RAID |
|---|---|---|---|
| 8 | 99.87% | 99.69% | 98.97% |
| 12 | 99.68% | 99.65% | 82.88% |
*Unverified (Result not verified by MLCommons Association)
In the MLPerf Training test, unlike MLPerf Checkpointing, most of the CPU resources are spent emulating the GPU accelerators, while the training workload itself consists mainly of read operations, which require no parity computation on the xiRAID side. As the results show, with the optimal configuration of 8 GPUs for this platform, training utilization remains consistently high across all RAID operational modes: 99.87% in normal mode, 99.69% in degraded mode, and 98.97% in reconstruction mode. With 12 GPUs, utilization remains excellent in normal (99.68%) and degraded (99.65%) modes, but drops to 82.88% during reconstruction, as CPU-based GPU emulation and the RAID rebuild compete for system resources. These results demonstrate that xiRAID maintains exceptional training performance even under fault conditions when the GPU count is matched to the platform's CPU capabilities.
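As an illustrative model (not MLPerf's exact accounting), utilization can be read as compute time divided by compute plus I/O stall time; the 12-GPU reconstruction figure then corresponds to roughly 0.21 s of I/O stall per second of emulated compute:

```python
def utilization(compute_s: float, io_stall_s: float) -> float:
    # Fraction of wall-clock time the (emulated) accelerator is busy.
    return compute_s / (compute_s + io_stall_s)

print(round(utilization(1.0, 0.2066) * 100, 2))  # 82.88
```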
Conclusion
Overall, the test showed that Xinnor xiRAID delivers exceptionally efficient checkpoint write throughput, operating close to the limits of the storage system itself, with minimal performance loss even in degraded or rebuild modes. Training performance further demonstrated that the storage system is capable of sustaining very high GPU utilization, close to 100%, and will not become a bottleneck for systems equipped with multiple GPUs.
Appendix
FIO configuration file for drives preconditioning
[global]
rw=write
bs=128K
iodepth=64
direct=1
ioengine=libaio
group_reporting
numa_cpu_nodes=0
numa_mem_policy=local
#runtime=600
loops=2
[job1]
filename=/dev/nvme2n2
...
[job8]
filename=/dev/nvme9n2
FIO configuration file for RAID tests
[global]
# rw=read for the read test, rw=write for the write test
rw=read
bs=896k
iodepth=64
direct=1
numjobs=8
verify=0
offset_increment=12%
ioengine=libaio
group_reporting
numa_cpu_nodes=0
numa_mem_policy=local
runtime=60
[job1]
filename=/dev/xi_data
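For reference on how this configuration spreads load: `offset_increment=12%` shifts each of the 8 jobs' starting offset by its job index, so the jobs access distinct regions of the array:

```python
numjobs = 8
offset_increment_pct = 12  # each job starts 12% further into the device

starts = [j * offset_increment_pct for j in range(numjobs)]
print(starts)  # [0, 12, 24, 36, 48, 60, 72, 84] percent into the device
```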
FIO configuration file for filesystem tests
[global]
rw=write
bs=896k
iodepth=64
direct=1
numjobs=16
verify=0
offset_increment=5%
ioengine=libaio
group_reporting
numa_cpu_nodes=1
numa_mem_policy=local
runtime=120
size=50G
[job1]
directory=/mnt/mlperf