Case Studies

About Hochschule Aalen

With five faculties, 60 diverse and future-oriented degree programs, more than 130 collaborations worldwide, and over 4,500 students, Aalen University of Applied Sciences – Technology and Business is one of the largest universities of applied sciences in Baden-Württemberg, Germany.

Challenge

The university required a high-performance, reliable, and scalable storage solution to accommodate its growing data needs and ensure business continuity. Aalen University faced the following challenges:

  • High-Performance Requirements: The university needed a storage system capable of delivering high throughput and low latency to support its demanding machine and deep learning applications.
  • Data Availability and Integrity: The storage solution had to be highly reliable and fault-tolerant to protect critical data from hardware failures.
  • Scalability: The solution needed to be easily scalable to accommodate future growth in data and compute resources.
  • Geographical Distribution: The university's infrastructure is spread across two physically separated locations, requiring a solution that could synchronize data between the sites.

Solution

To address these challenges, ABC Systems implemented a storage solution based on Xinnor's xiRAID and IBM Storage Scale, as shown in the layout below:

[Diagram: storage solution based on Xinnor's xiRAID and IBM Storage Scale]

Key components of the solution

The solution consists of the following components:

  • 2x Supermicro E2E NVMe servers, each with:
    • a single 24-core AMD EPYC 9224 (Genoa) CPU, customized for the ABC Speedway Storage Design
    • 10x 7.68TB NVMe PCIe 5.0 Kioxia CD8P KCD8XPUG7T68
    • Xinnor xiRAID Classic 4.1 in a RAID 6 configuration
    • IBM Storage Scale with Active File Management (AFM)
  • 1x NVIDIA DGX A100 connected to the first node over a 40 Gb/s Ethernet network
  • 2x workstations connected via 100 Gb/s Ethernet to the second node

The solution is based on IBM Storage Scale in its “Data Access” Edition. One of the benefits of the Data Access Edition is its support for Active File Management (AFM), which enables asynchronous, incremental data replication between storage nodes according to user-defined policies, ensuring data availability, consistency, and integrity.
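
As an illustration of how such a replication relationship can be defined, the sketch below creates an AFM fileset in single-writer mode that asynchronously pushes data to the other site over the GPFS protocol. The file system name (fs1), fileset name (ml_data), and target path are hypothetical, and the sketch assumes the remote file system is already remote-mounted via multi-cluster access; the actual AFM policies used at Aalen University are not reproduced here.

# Create an AFM fileset in single-writer mode (sw), replicating to the remote site
mmcrfileset fs1 ml_data --inode-space new -p afmmode=sw,afmtarget=gpfs:///gpfs/remote_fs1/ml_data

# Link the fileset into the local namespace
mmlinkfileset fs1 ml_data -J /gpfs/fs1/ml_data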

The other benefit is that the Data Access Edition gave ABC Systems the freedom to select its preferred hardware and data protection scheme.

When it comes to the data protection mechanism, ABC Systems had three options:

  1. IBM GPFS Native RAID (GNR). This option is not available for open-architecture solutions: the Data Access and Data Management Editions that run with GNR are restricted exclusively to proprietary hardware.
  2. IBM Erasure Code Edition (ECE). Besides the higher license cost, ECE requires a minimum of 4 NVMe storage nodes, with 6 being preferable. Implementing ECE would have significantly increased the complexity and cost of the solution without any expected benefit in terms of performance.
  3. Use a third-party RAID engine. ABC Systems decided against hardware RAID because a hardware RAID controller connects to the CPU through 16 PCIe lanes. Those 16 lanes would have restricted the lanes available for the NVMe SSDs or the network cards. Moreover, with 16 PCIe lanes a hardware RAID controller can drive at most 4 NVMe drives at full performance, since each drive uses 4 lanes (see the lane count after this list).
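
A quick lane count makes the constraint concrete: the 10 NVMe drives in each server need 10 × 4 = 40 PCIe lanes, whereas a hardware RAID controller sits behind a single x16 slot. Funnelling all drive traffic through those 16 lanes would either limit the array to 4 drives at full speed or require an additional PCIe switch, while a software RAID lets every drive keep its own 4 lanes directly to the CPU.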

To overcome all these limitations, ABC Systems selected Xinnor's xiRAID Classic. xiRAID Classic is a software RAID engine designed for NVMe drives that leverages the AVX instructions of modern x86 CPUs and combines them with Xinnor's lockless data path, which dynamically allocates stripes to CPU cores to avoid bottlenecks. In this way, xiRAID achieves maximum performance in both normal operation and degraded mode while minimizing the use of system resources.

xiRAID Classic exposes a block device that integrates seamlessly with any file system, so after creating the array in RAID level 6 (8 drives for data and 2 for parity), it was very straightforward to build the IBM Storage Scale file system on top of it.
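
A minimal sketch of that last step, assuming the RAID array appears as a block device named /dev/xi_stg01 (the actual device name depends on the xiRAID release) and using illustrative NSD, server, and file system names rather than the university's exact configuration:

# Contents of an NSD stanza file, nsd_stg01.stanza, describing the xiRAID block device
%nsd:
  device=/dev/xi_stg01
  nsd=nsd_stg01
  servers=storage01
  usage=dataAndMetadata
  failureGroup=1
  pool=system

# Register the NSD, create a file system with a 4M block size, and mount it
mmcrnsd -F nsd_stg01.stanza
mmcrfs gpfs01 -F nsd_stg01.stanza -B 4M -T /gpfs/gpfs01
mmmount gpfs01 -a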

Local performance test on a single server

After installing the servers at the university, ABC Systems tested the performance of the file system on top of xiRAID using a 4 MB block size.

RAID configuration:

xicli raid create -n stg01 -l 6 -d /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1 /dev/nvme4n1 /dev/nvme5n1 /dev/nvme6n1 /dev/nvme7n1 /dev/nvme8n1 /dev/nvme9n1 -ss 128 -mwe 1
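
Once the array is created, its state can be verified before building the file system on top of it; a quick check, assuming the xicli command set of the xiRAID Classic 4.1 release:

xicli raid show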

GPFS FS with gpfsperf:

/usr/lpp/mmfs/samples/perf/gpfsperf write  seq 120Gfile -n 120g -r 4m -th 10 -fsync -dio
  recSize 4M nBytes 120G fileSize 120G
  nProcesses 1 nThreadsPerProcess 10
  file cache flushed before test
  using direct I/O
  offsets accessed will cycle through the same file segment
  not using shared memory buffer
  not releasing byte-range token after open
  fsync at end of test
    Data rate was 42836998.24 Kbytes/sec, Op Rate was 10213.14 Ops/sec, Avg Latency was 0.977 milliseconds, thread utilization 0.997, bytesTransferred 128849018880

/usr/lpp/mmfs/samples/perf/gpfsperf read   seq 120Gfile -n 120g -r 4m -th 10 -fsync -dio
  recSize 4M nBytes 120G fileSize 120G
  nProcesses 1 nThreadsPerProcess 10
  file cache flushed before test
  using direct I/O
  offsets accessed will cycle through the same file segment
  not using shared memory buffer
  not releasing byte-range token after open
    Data rate was 80189171.64 Kbytes/sec, Op Rate was 19118.59 Ops/sec, Avg Latency was 0.513 milliseconds, thread utilization 0.980, bytesTransferred 128849018880

Single node performance summary:

                          Raw drive performance            GPFS gpfsperf -dio measured performance   Overall efficiency *
Sequential write (GB/s)   55 (specification performance)   42.8                                      78%
Sequential read (GB/s)    92 (measured performance)        80.2                                      87%

* The efficiency is calculated at the file system level and includes the RAID overhead. All 10 drives, including the 2 parity drives of the RAID 6 array, are considered in the raw figure, so the achieved efficiencies are very close to the maximum theoretical limits.
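
Spelled out, the efficiency is simply the ratio of measured file system throughput to raw aggregate drive throughput: 42.8 / 55 ≈ 0.78 (78%) for sequential writes and 80.2 / 92 ≈ 0.87 (87%) for sequential reads.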

These numbers exceed the installed network bandwidth to the clients, leaving ample room for expansion when new compute systems are added.
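
For comparison, the 40 Gb/s link to the DGX A100 corresponds to roughly 5 GB/s and each 100 Gb/s workstation link to roughly 12.5 GB/s, both far below the 42.8 GB/s write and 80.2 GB/s read throughput measured locally.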

Conclusion

The combination of well-balanced, cost-optimized NVMe servers, Xinnor xiRAID software RAID, and the IBM Storage Scale file system provides Aalen University with a very fast storage system for its machine learning and deep learning research projects.

The solution meets the university's requirements in terms of:

  1. Performance: The combination of NVMe drives, Xinnor xiRAID, and IBM Storage Scale enables the storage system to deliver high throughput. Additionally, using software RAID frees up precious PCIe lanes and removes the need for a PCIe switch, with a positive impact on latency.
  2. Data Integrity and Availability: Xinnor's xiRAID in a RAID 6 configuration provides robust data protection, while IBM Storage Scale's asynchronous replication ensures data availability and disaster recovery.
  3. Scalability: The solution is provisioned to handle more compute clients and can easily be scaled by adding more drives per node or more storage servers.
  4. Cost-Effectiveness: Xinnor's software-based RAID allows the Data Access Edition of IBM Storage Scale to be used on flexible hardware, minimizing the cost of the high-performance storage solution.