Introduction
Over the past year, the economics of NVMe storage have changed significantly. Since the summer of 2025, enterprise SSD prices have more than doubled, and industry forecasts suggest this trend will continue through 2026. According to TrendForce, strong AI-driven server demand and constrained supply are expected to drive high double-digit quarter-over-quarter price increases for both DRAM and enterprise SSDs throughout 2026. At the same time, lead times for high-capacity drives are stretching to six months or more. For many organizations running AI, HPC, and data-intensive workloads, expanding storage capacity is becoming slower and significantly more expensive. This sustained price increase, combined with the ongoing SSD shortage, is forcing data centers to rethink how they design and scale their storage.
With SSD prices rising and lead times increasing, simply adding more drives is no longer a practical solution. The focus is shifting toward using existing hardware more efficiently: getting more usable capacity, higher performance, and stronger reliability from the same NVMe infrastructure.
In practice, however, many deployments still fall short of this goal. A large portion of available performance and capacity is often lost to inefficient data protection schemes and RAID implementations that were not designed for today’s ultra-fast storage devices. In many environments this creates a hidden storage bottleneck that prevents organizations from fully exploiting modern NVMe performance.
In this article, we look at how rising NVMe prices are changing the way organizations think about storage design. We examine how different RAID approaches affect performance, capacity, and system efficiency, why traditional hardware and software RAID solutions struggle to scale with modern drives, and how Xinnor’s xiRAID architecture addresses these limitations through a lockless, CPU-efficient design.
Hardware RAID: Limited by PCIe Architecture
One of the first options many organizations consider for NVMe storage is a traditional hardware RAID controller. On paper, this seems like a straightforward solution: offload parity calculations to a dedicated card and keep storage management separate from the CPU.
However, hardware RAID was never designed for today’s ultra-fast NVMe drives. Most RAID controllers connect to the CPU through a single PCIe x16 interface, which caps bandwidth at roughly 63 GB/s. Since each NVMe drive typically requires four PCIe lanes to operate at full speed, this means a controller can effectively support only four drives before becoming a bottleneck. Beyond raw bandwidth limits, hardware RAID introduces additional constraints. It consumes a valuable PCIe slot, adds another ASIC into the data path, and imposes IOPS limitations defined by the controller itself. As a result, even when multiple high-performance SSDs are installed, the controller often becomes the choke point between applications and the storage. In modern NVMe servers, this architecture prevents systems from reaching anything close to their theoretical performance potential.
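The arithmetic behind this limit is easy to verify. The sketch below assumes PCIe Gen5 rates (32 GT/s per lane with 128b/130b encoding) and ignores protocol overhead, so real-world numbers are slightly lower:

```python
# Back-of-the-envelope check of the PCIe bottleneck described above.
# Assumes PCIe Gen5 link rates; actual usable throughput is a bit lower
# once protocol overhead is accounted for.

PCIE5_GBPS_PER_LANE = 32 * (128 / 130) / 8  # ~3.94 GB/s usable per lane

controller_bw = 16 * PCIE5_GBPS_PER_LANE    # x16 RAID controller uplink
drive_bw = 4 * PCIE5_GBPS_PER_LANE          # one NVMe drive on x4 lanes

drives_before_bottleneck = controller_bw / drive_bw
print(f"x16 uplink: {controller_bw:.1f} GB/s")           # 63.0 GB/s
print(f"drives at full speed: {drives_before_bottleneck:.0f}")  # 4
```

Four drives at full speed behind a card that sits in front of twenty-four is exactly the choke point described above.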
This makes hardware RAID a poor fit for high-performance NVMe deployments in modern server and data center architectures, where PCIe efficiency and scalability directly affect overall data center efficiency.
Software RAID (MDRAID and VROC): CPU and Memory Bottlenecks
Linux software RAID solutions such as MDRAID and Intel VROC avoid some of the hardware limitations by running directly on the host CPU. By removing the external RAID card, drives connect straight to the processor, eliminating the PCIe bottleneck introduced by hardware controllers.
However, this approach introduces a different set of problems. MDRAID relies on locking mechanisms to coordinate access to RAID stripes. Under high I/O load, these locks cause uneven CPU utilization, where some cores are fully saturated while others remain underused. As shown in the diagram, a small number of cores often become bottlenecks, limiting overall throughput. At the same time, software RAID frequently requires additional memory copies between RAM and CPU, wasting memory bandwidth and increasing latency. These extra data movements become especially costly at NVMe speeds.
In real deployments, software RAID on Linux can consume 30–40% of available CPU resources, and in degraded mode this figure can climb to 75%. This directly impacts data center efficiency, increases the SSD cost per usable terabyte, and raises the overall cost of server storage, working against long-term data center cost optimization. For systems running AI, analytics, or HPC workloads, it means valuable CPU cores are diverted from applications to storage management. While software RAID is flexible and widely available, its locking and memory overhead make it difficult to scale efficiently with modern NVMe drives.
GPU-Assisted RAID: Why Offloading Creates More Problems Than It Solves
To address CPU limitations, some vendors have attempted to offload RAID processing to a dedicated GPU. In this model, parity calculations and data handling are performed on the GPU, while SSDs connect directly to the GPU-based RAID engine. On paper, this promises high compute throughput and reduced host CPU load, and in controlled benchmarks it can deliver strong headline numbers.
In practice, however, this approach introduces several structural trade-offs. First, the GPU becomes a single point of failure: if the card fails, access to the entire storage array may be lost. Second, GPU-based RAID consumes a full PCIe slot and sixteen PCIe lanes, reducing the resources available for networking, accelerators, or high-speed interconnects. It also adds around 70 W of power consumption, increasing operational cost and cooling requirements. From a planning perspective, this additional power draw and PCIe consumption complicate server deployment at scale and work against data center cost optimization goals.

GPU-assisted RAID also increases latency. Data must travel from the application to the CPU, then to the GPU, and finally to the SSDs. This extra hop adds delay that is especially visible in random and small I/O workloads. As a result, random write performance is often weaker than expected, and in some cases lower than with optimized CPU-based solutions. In larger deployments, dedicating PCIe slots and lanes to RAID GPUs reduces system flexibility and often leads to installing additional cards, further wasting bandwidth and expansion capacity. Despite the added hardware, the overall performance gains are limited, while system complexity increases.
As a result, while GPU-assisted RAID can appear attractive in benchmarks, it often complicates system design and resource planning without delivering proportional benefits in production environments.
This has led to a different design approach: keeping RAID processing close to the data path and tightly integrated with the CPU, while removing unnecessary synchronization and data movement. Rather than offloading work to external devices or centralized subsystems, this model focuses on maximizing the efficiency of each core and each drive. Against this background, Xinnor developed xiRAID Classic as a CPU-native alternative to both hardware controllers and traditional software RAID. Instead of adding new layers to compensate for architectural limits, xiRAID focuses on removing them. The following section looks at how this design works in practice.
xiRAID Classic: A CPU-Native, Lockless RAID Architecture
As NVMe prices and DRAM costs continue to rise, storage efficiency is no longer just about performance; it is increasingly about how much useful work a system can deliver per unit of hardware investment. In 2025 alone, DRAM prices increased by more than 300%, and further increases are expected in 2026. Under these conditions, architectures that rely heavily on large caches and memory buffering become increasingly expensive to operate.
Instead of relying on dedicated hardware controllers, GPUs, or large memory caches, xiRAID runs directly on the host CPU and uses modern vector instructions to accelerate RAID processing. All parity and checksum calculations are performed using patented AVX2-based algorithms, allowing the system to achieve high throughput without specialized hardware. A key design principle of xiRAID is the elimination of unnecessary data movement. The engine operates without a write-back cache and avoids memory-to-memory copying. Data flows directly between applications, CPU cores, and NVMe drives, reducing latency and minimizing memory bandwidth consumption. By eliminating centralized locks and unnecessary memory copies, this design helps reduce storage latency under both sequential and highly parallel I/O workloads.
Unlike traditional software RAID, xiRAID does not rely on centralized stripe locks. Instead, stripe ownership is distributed dynamically across available CPU cores. This prevents hot spots, eliminates blocking, and allows all cores to participate evenly in I/O processing. As seen in the diagram below, workload is balanced across cores, with no single core becoming overloaded. Each core operates at low utilization while maintaining high aggregate throughput. This parallel, non-blocking design enables direct CPU-to-SSD data paths and sustained performance even under heavy load. With AVX2 acceleration, a lockless architecture, and zero-copy data handling, xiRAID Classic removes the bottlenecks seen in hardware RAID, traditional software RAID, and GPU-assisted solutions. By removing locking, reducing memory traffic, and keeping data on a direct CPU-to-drive path, xiRAID Classic enables practical storage performance optimization and helps organizations improve server performance without additional hardware.
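As an illustration of the idea (a minimal model, not Xinnor's actual implementation), distributed stripe ownership can be sketched as a deterministic stripe-to-core mapping with one queue per core, so no global lock sits on the hot path:

```python
# Illustrative sketch only: with a centralized stripe lock, every writer
# contends for one mutex; with distributed ownership, each stripe maps to
# exactly one worker core, so I/O to different stripes never blocks.

NUM_CORES = 8

def owner_core(stripe_id: int) -> int:
    # Deterministic stripe -> core mapping; real engines may use more
    # sophisticated schemes, but the effect is the same: no shared lock.
    return stripe_id % NUM_CORES

# Per-core work queues instead of one globally locked stripe table.
queues = {core: [] for core in range(NUM_CORES)}
for stripe in range(64):
    queues[owner_core(stripe)].append(stripe)

# Ownership is evenly balanced: every core handles the same share.
print([len(q) for q in queues.values()])  # [8, 8, 8, 8, 8, 8, 8, 8]
```

Because each stripe has exactly one owner, cores never wait on each other, which is what keeps per-core utilization low and even in the diagram above.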
To understand how this architectural approach translates into real performance, we compared xiRAID Classic with widely used hardware and software RAID solutions under identical NVMe configurations.
Turning Capacity into Performance
Many deployments today still rely on MDRAID or VROC in RAID10. The problem is simple: RAID10 immediately cuts usable capacity in half, with fifty percent of raw storage lost to mirroring. This makes the RAID layout a central element of storage optimization, especially when capacity expansion is constrained by rising drive prices. Consider a server equipped with 24 × 30 TB NVMe drives.
Capacity Comparison: MDRAID vs. xiRAID
| Configuration | Data Layout | Raw Capacity | Usable Capacity |
|---|---|---|---|
| RAID10 (MDRAID) | 12 data + 12 mirroring | 720 TB | 360 TB |
| RAID5 (xiRAID) | 23 data + 1 parity | 720 TB | 690 TB |
With xiRAID almost the entire raw capacity becomes usable. Compared to RAID10, that is close to 2x more usable space from the same hardware. This is not a trade-off where capacity is gained at the expense of speed. In this case, performance improves at the same time.
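The capacity figures in the table follow directly from the layout arithmetic:

```python
drives, size_tb = 24, 30
raw = drives * size_tb                   # 720 TB raw

usable_raid10 = raw / 2                  # mirroring: 360 TB usable
usable_raid5 = (drives - 1) * size_tb    # 23 data + 1 parity: 690 TB usable

print(raw, usable_raid10, usable_raid5)
print(f"gain: {usable_raid5 / usable_raid10:.2f}x")  # gain: 1.92x
```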
Workload Performance
| Workload | MDRAID (RAID10) | xiRAID (RAID5) |
|---|---|---|
| Sequential Read | 303 GB/s | 300 GB/s |
| Sequential Write | 84.5 GB/s | 150 GB/s |
| Random Read | 1.5M IOPS | 24M IOPS |
| Random Write | 0.9M IOPS | 4.8M IOPS |
Read throughput is similar in both cases, but xiRAID delivers nearly twice the write bandwidth. The advantage is far larger for random workloads, reflecting xiRAID’s higher parallelism and write efficiency.
The capacity advantage becomes even more meaningful when you look at total system cost. Let’s assume your target is 360 TB of usable storage using 30 TB NVMe drives. At current market prices, a single drive costs roughly $5,000, and in many cases even more.
Cost Comparison For 360 TB Usable Capacity
| Configuration | Raw Capacity | Usable Capacity | Drive Cost |
|---|---|---|---|
| RAID10 (MDRAID) | 720 TB | 360 TB | $120,000 |
| RAID5 (xiRAID) | 390 TB | 360 TB | $65,000 |
With MDRAID in RAID10, half of the installed capacity is lost to mirroring, which means reaching 360 TB of usable storage requires 720 TB of raw capacity and 24 drives. Using xiRAID in RAID5, the same usable capacity can be achieved with only 390 TB of raw storage and 13 drives. As a result, the drive cost per server is reduced by about $55,000, or roughly 46%, without sacrificing performance or reliability. In multi-rack AI clusters, this difference quickly scales into hundreds of thousands or millions of dollars, making storage architecture a key lever for infrastructure cost optimization rather than just a performance decision.
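The drive counts and savings can be reproduced with a few lines of arithmetic, assuming the roughly $5,000 per-drive price quoted above:

```python
import math

target_usable, size_tb, price = 360, 30, 5_000

# RAID10: usable = raw / 2, so twice the target in raw capacity is needed.
raid10_drives = math.ceil(2 * target_usable / size_tb)   # 24 drives
# RAID5 (single parity): usable = (n - 1) * drive size.
raid5_drives = math.ceil(target_usable / size_tb) + 1    # 13 drives

print(raid10_drives * price, raid5_drives * price)  # 120000 65000
savings = (raid10_drives - raid5_drives) * price
print(savings)  # 55000
```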
Real-World Deployments: From Ceph to xiRAID + Lustre
Beyond synthetic benchmarks, the impact of storage architecture becomes much clearer in production environments. One example comes from Friedrich Alexander University (FAU), which worked with Xinnor and MEGWARE to modernize its NVMe-based storage cluster.
In 2023, FAU deployed seven NVMe servers, each populated with 24× 7.68 TB Gen4 SSDs, using Ceph with double replication. While the setup provided resilience, it came at a high capacity cost: out of 768 TB of raw storage, only 580 TB was usable.
In 2024, part of the cluster was migrated to xiRAID combined with Lustre, using RAID10 for metadata targets and RAID6 for object storage targets. After validating the results, the university converted the full system to the new architecture.
With the same hardware and no other changes, the new configuration delivered both higher capacity and significantly better performance. After the migration, usable storage increased from 580 TB under Ceph’s double-replication scheme to 775 TB with xiRAID and Lustre. This represents a 33% increase in usable capacity without adding any drives, allowing FAU to make substantially better use of its existing NVMe infrastructure.
Performance Comparison
| Metric | CephFS | xiRAID + Lustre | Gain |
|---|---|---|---|
| Write throughput | 26.4 GB/s | 117.3 GB/s | 4.4× |
| Read throughput | 54.6 GB/s | 158.6 GB/s | 2.9× |
Source: https://xinnor.io/case-studies/fau/
The migration from Ceph to xiRAID resulted in three to four times higher throughput, depending on workload, while preserving the existing hardware investment.
After the initial FAU migration, the same approach was later applied at a much larger scale. This led to the deployment of the Helma cluster at NHR@FAU.
Real-World Deployments: The Helma Cluster at NHR@FAU
FAU later deployed a much larger AI storage platform to support one of Germany’s largest university GPU clusters. The Helma cluster at NHR@FAU was built using xiRAID and Lustre in a highly available configuration, protecting against both drive and server failures. Apart from the storage software stack, the system relies primarily on commodity hardware and open-source components.
Key characteristics of the cluster include:
- 5 PB of highly available storage
- Support for 768 GPUs
- Lustre-based architecture powered by xiRAID
- Commodity servers and open software stack
The test results were submitted to the IO500 benchmark, the industry-standard performance ranking for HPC and AI storage clusters. According to the official IO500 results, the Helma cluster ranked as the third-fastest production storage cluster worldwide and the fastest Lustre-based system at the time of submission.
The FAU Helma cluster is not only strong in absolute performance. It also stands out when compared directly with other large-scale deployments built on the same hardware platform.
Both the Helma system and the Leonardo cluster use the Celestica SC6100 server (branded as DDN ES400NVX in DDN deployments). This makes the comparison especially relevant, since the underlying hardware is essentially the same.
IO500 Results Comparison
| # | System | Solution (Vendor) | Score | Number of systems | Score / storage systems | Number of NVMe drives | Score / drive |
|---|---|---|---|---|---|---|---|
| 3 | Helma | xiRAID + Lustre (Xinnor) | 838.99 | 10 | 83.9 | 240 | 3.5 |
| 7 | Leonardo* | ExaScaler (DDN) | 648.96 | 29 | 22.4 | 688 | 0.9 |
| – | Ratio (Helma / Leonardo) | – | 1.3× | ~1/3 | 3.7× | ~1/3 | ~4× |
Note: In addition to the 29 NVMe systems, the Leonardo cluster also includes 180 HDDs for OST
Looking at the overall IO500 score, the Helma cluster already performs better than the comparable Leonardo deployment. The difference becomes even more clear when hardware usage is taken into account. Helma achieved its results using only 10 systems, compared to 29 in the DDN-based cluster, while relying on roughly one-third of the NVMe drives. Despite this, it delivers close to four times higher performance per server and nearly four times more performance per drive. This efficiency comes from how data protection is implemented. By minimizing overhead and avoiding architectural bottlenecks, xiRAID allows more of the available hardware to be used for real application workloads instead of parity processing and synchronization. These large-scale deployments highlight the benefits of efficient data protection. At the same time, data protection is only one part of overall system optimization.
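The per-system and per-drive figures in the table are straightforward to reproduce from the published scores:

```python
# Efficiency figures derived from the IO500 scores quoted above.
systems = {
    "Helma (xiRAID + Lustre)":  {"score": 838.99, "servers": 10, "drives": 240},
    "Leonardo (DDN ExaScaler)": {"score": 648.96, "servers": 29, "drives": 688},
}

for name, s in systems.items():
    per_server = s["score"] / s["servers"]
    per_drive = s["score"] / s["drives"]
    print(f"{name}: {per_server:.1f}/server, {per_drive:.1f}/drive")

helma, leo = systems.values()
ratio = (helma["score"] / helma["servers"]) / (leo["score"] / leo["servers"])
print(f"per-server advantage: {ratio:.1f}x")  # per-server advantage: 3.7x
```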
Further Ways to Reduce Overall Storage Deployment Costs
Optimizing data protection is only one part of the overall cost equation; there are several additional areas where modern storage systems can be made more efficient.
- Minimize CPU consumption. CPU prices continue to rise, and storage stacks that consume a large share of compute resources directly increase system cost. Reducing RAID overhead and avoiding excessive locking and memory copies helps preserve CPU cycles for applications.
- Share NVMe across multiple clients. Instead of dedicating high-performance NVMe drives to a single server, organizations can improve utilization through NVMe-oF. This makes it possible to expose the same storage pool to multiple clients, improving return on investment.
- Offload storage processing. Another approach is to move parts of the storage stack away from the host CPU. Running storage services on platforms such as NVIDIA BlueField DPUs can reduce CPU load and lower dependency on expensive DRAM and HBM.
xiRAID Opus: Optimized Performance in User Space
While xiRAID Classic focuses on efficient host-based RAID, some environments require even tighter control over latency, networking, and resource isolation. Alongside xiRAID Classic, Xinnor also offers xiRAID Opus, a user-space storage engine designed for environments that require the lowest possible latency and the highest levels of throughput.
xiRAID Opus is built on top of the SPDK framework and operates entirely in user space, including both the data path and drive connectivity. Instead of relying on kernel interrupts, it uses a polled-mode architecture to check for data availability. This approach reduces context switching and allows the system to make more efficient use of CPU resources.
To achieve maximum performance, xiRAID Opus dedicates one or more CPU cores exclusively to storage processing. RAID checksum and parity calculations are handled through Xinnor’s vectorized computation engine. On x86 platforms, this is implemented using AVX2 instructions, while ARM-based systems rely on vector registers. This allows xiRAID Opus to deliver consistent performance across different processor architectures, including DPUs and other accelerator-based platforms. In addition to RAID functionality, xiRAID Opus also integrates NVMe-oF initiator and target capabilities.
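Conceptually, a polled-mode data path can be sketched as follows (an assumed minimal model, not SPDK code): the dedicated core drains a completion queue without ever blocking, trading cycles on that core for lower latency and fewer context switches.

```python
# Illustrative polled-mode sketch. In a real engine the device posts
# completions asynchronously; here submissions complete immediately so
# the polling pattern itself is visible.
from collections import deque

completion_queue = deque()

def submit(io_id: int) -> None:
    # Stand-in for an NVMe submission; the "device" completes at once.
    completion_queue.append(io_id)

def poll_completions(max_batch: int = 32) -> list[int]:
    # Non-blocking: drain whatever has completed, never sleep or wait
    # on an interrupt.
    done = []
    while completion_queue and len(done) < max_batch:
        done.append(completion_queue.popleft())
    return done

for i in range(4):
    submit(i)
print(poll_completions())  # [0, 1, 2, 3]
```

The cost of this model is that the polling core is always busy, which is why xiRAID Opus dedicates specific cores to storage rather than sharing them with applications.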
Minimizing CPU Consumption with xiRAID Opus
One of the main design goals of xiRAID Opus is to reduce the CPU resources required for data protection and I/O processing. In AI and HPC systems, CPU cores are shared between storage, data preprocessing, orchestration, and application logic. Storage software that consumes a large portion of these resources increases power usage and limits overall system efficiency.
To evaluate this, Xinnor conducted joint testing with Intel on Intel® Xeon® 6 processors that combine performance cores (P-cores) and energy-efficient cores (E-cores). The results show that xiRAID Opus can deliver high throughput even when constrained to a small number of CPU cores, with per-core performance approaching NVMe backend limits in user space. The full benchmark results include detailed performance and efficiency comparisons between E-cores and P-cores across RAID10 and RAID6 configurations and can be found at the link below.
More importantly, the tests demonstrated that xiRAID Opus runs efficiently on E-cores. While P-cores still provide higher peak performance, xiRAID Opus allows E-cores to drive parity-protected NVMe storage close to saturation, making it possible to build high-performance storage nodes using lower-power CPUs. This improves performance per watt and reduces both power and infrastructure costs, while keeping high-performance cores available for application workloads.
Read more here: xiRAID Opus: maximizing performance and data protection on low-power Intel CPU
Sharing NVMe Across Multiple Clients with NVMe-oF
Improving local storage efficiency is only part of the challenge. In many modern data centers, NVMe drives are distributed across multiple servers, leading to what is often referred to as “dark flash”: high-performance storage capacity that remains underutilized because it is tied to a specific node. xiRAID Opus addresses this problem through integrated NVMe-oF support. By acting as both an initiator and a target, Opus allows protected NVMe volumes to be exposed over high-speed networks and shared among multiple clients.
In testing conducted with Supermicro systems and 400 Gb/s RoCE networks, xiRAID Opus demonstrated that remote access to protected volumes can closely match local performance. In these experiments, a single server acted as an NVMe-oF target, while a second server accessed the storage over the network.
Test Specifications:
- Sequential read limit: 14 GB/s per drive × 7 drives = 98 GB/s
- Sequential write limit: calculated taking into account the RAID6 parity penalty with 2 parity drives (5 data drives × 10 GB/s) = 50 GB/s
- Maximum network throughput: 400 Gb/s (~49 GB/s)
- xiRAID Opus was tested with varying numbers of cores
- FIO workload pattern: 7 jobs, 32 iodepth
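The expected ceilings in these specifications reduce to simple arithmetic; the sketch below uses the per-drive figures quoted above:

```python
drives, per_drive_read, per_drive_write = 7, 14, 10  # GB/s, Micron 9550
parity_drives = 2                                    # RAID6

drive_read_limit = drives * per_drive_read                      # 98 GB/s
drive_write_limit = (drives - parity_drives) * per_drive_write  # 50 GB/s
network_limit = 49                                   # ~400 Gb/s link, in GB/s

print(min(drive_read_limit, network_limit))   # 49 -> reads are network-bound
print(min(drive_write_limit, network_limit))  # 49 -> writes nearly tied
```

The read test is network-limited (49 GB/s vs. 98 GB/s available from the drives), which matches the measured 49.0 GB/s in the table below; writes sit just under both limits.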
Sequential Performance over NVMe-oF (RAID6, 7× Micron 9550)
| Pattern | Bandwidth | Network Utilization | Drive Utilization |
|---|---|---|---|
| Read, 1 Core | 49.0 GB/s | 100% | Network-limited |
| Write, 1 Core | 21.7 GB/s | 44% | 43% |
| Write, 2 Cores | 33.9 GB/s | 70% | 68% |
| Write, 4 Cores | 35.1 GB/s | 71% | 70% |
| Write, 8 Cores | 47.6 GB/s | 97% | 95% |
With just one core, sequential reads already saturate the 400 Gb/s network link. For writes, near-maximum throughput is achieved with eight cores. In other words, performance over the network is primarily limited by network capacity rather than by the storage stack itself.
This makes it possible to pool NVMe resources and dynamically allocate them to workloads that need them most, improving overall utilization and reducing the need for overprovisioning.
Random I/O Scaling and RAID6 Efficiency
Random workloads are typically more challenging for parity-based RAID systems due to read-modify-write penalties. For RAID6, theoretical maximum random write performance is approximately one-third of raw drive throughput.
Based on Micron 9550 specifications:
- Per-drive random write: ~380k IOPS
- 7 drives × 33% ≈ 877k IOPS theoretical maximum
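Using the article's roughly one-third write-penalty model, the ceiling works out as follows (dividing by 3 rather than multiplying by exactly 33% gives ≈887k instead of ≈877k; the difference is rounding):

```python
per_drive_iops = 380_000   # Micron 9550 random-write spec cited above
drives = 7
rmw_factor = 3             # the article's ~one-third RAID6 write model

theoretical_max = drives * per_drive_iops / rmw_factor
print(f"RAID6 random-write ceiling: {theoretical_max / 1000:.0f}k IOPS")
```

Against this ceiling, the 851k IOPS measured at eight cores corresponds to roughly 96–97% of the theoretical maximum.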
xiRAID Opus closely approaches this limit as CPU resources increase.
Random Write Performance over NVMe-oF (RAID6)
| Cores | IOPS | Latency (µs) | Drive Utilization |
|---|---|---|---|
| 1 | 292k | 765.03 | 33% |
| 2 | 470k | 475.52 | 53% |
| 4 | 669k | 333.37 | 76% |
| 8 | 851k | 261.86 | 97% |
Charts: random write performance and latency, and the performance-to-drive-utilization ratio, as a function of core count.
As more cores are added, both throughput and drive utilization increase steadily, while latency decreases. At eight cores, the system operates close to the theoretical RAID6 limit, indicating that software overhead is minimal.
Latency Benefits of User-Space Initiators
Throughput and IOPS are only part of the performance equation. For many AI and analytics workloads, predictable and low latency is equally important.
xiRAID Opus supports a user-space NVMe-oF initiator, allowing applications to bypass the kernel storage stack entirely. This removes multiple layers of buffering and scheduling, reducing end-to-end response times.
Random Read Latency: Kernel vs. User Space Initiator
| Cores | Kernel IOPS | Kernel Latency (µs) | Opus IOPS | Opus Latency (µs) |
|---|---|---|---|---|
| 1 | 662k | 6108.85 | 821k | 1168.43 |
| 2 | 1245k | 3150.22 | 1676k | 571.90 |
| 4 | 2281k | 1719.82 | 3246k | 295.02 |
| 8 | 4831k | 810.13 | 5342k | 178.98 |
| 16 | 7859k | 497.48 | 7296k | 130.82 |
| 32 | 8010k | 488.17 | 7171k | 133.04 |
Across all core counts, the xiRAID Opus user-space initiator delivers substantially lower latency than the kernel-based alternative. This difference becomes especially important for workloads that rely on frequent small I/O operations, such as model training, inference pipelines, and metadata-heavy applications. Beyond running in user space on the host, xiRAID Opus also enables a more radical deployment model: moving storage processing entirely off the CPU.
Offloading Storage Processing to DPUs
While xiRAID Opus already minimizes CPU overhead when running on the host, the architecture also enables a further level of optimization: moving data protection entirely onto a DPU. In collaboration with NVIDIA, Xinnor implemented xiRAID Opus on BlueField-3 DPUs. In this configuration:
- 5 Samsung PM9A3 3.84 TB NVMe drives, attached to the host system via one InfiniBand port
- Network drives are visible through the BlueField-3 200/400 Gb/s network ports
- xiRAID Opus runs on the BlueField-3 DPU and implements the RAID arrays there
- The RAID volumes are exposed to the host via NVIDIA SNAP
- On the host, fio sequential write jobs are executed on CPU NUMA node 0
- bs=256k, numjobs=4 per RAID, iodepth=64
Beyond RAID offload, NVIDIA’s BlueField DPU combined with xiRAID can also reduce reliance on limited HBM and DRAM by acting as a fast, shared, offloaded inference context memory platform: extending the memory hierarchy for long-context AI, offloading CPU and GPU tasks, and providing high-speed KV cache storage directly in the compute node.
The benchmarks compared three different deployment scenarios:
- MDRAID running on the host CPU
- xiRAID Classic running on the host CPU
- xiRAID Opus running on the BlueField-3 DPU
All three setups used the same drives and network configuration. The only difference was where RAID processing was executed. This makes it possible to isolate the effect of the storage stack and the offload model.
Sequential Write Performance (5× Samsung PM9A3, RAID)
MDRAID delivers the lowest throughput and consumes a noticeable share of host CPU resources. xiRAID Classic more than doubles write performance while keeping CPU usage moderate. When RAID processing is moved to the DPU, performance remains comparable to xiRAID Classic, but host CPU consumption drops to almost zero. This confirms that offloading does not come at the cost of throughput, while substantially reducing compute overhead.
Sequential Read Performance
A similar pattern appears in sequential read workloads. xiRAID Classic again delivers the highest raw throughput when running on the host. The DPU-based configuration reaches lower peak read bandwidth in this setup, mainly due to limitations in the current BlueField-3 software stack. However, even in this case, CPU utilization on the host remains below 1%, which is significantly lower than with either MDRAID or xiRAID Classic. These limitations are related to the current DPU software environment and are expected to improve with future BlueField generations.
Taken together, these three scenarios illustrate the trade-offs between different deployment models. MDRAID offers neither strong performance nor efficient resource usage. xiRAID Classic delivers the best overall throughput on the host, but still consumes CPU cycles that could otherwise be used by applications. xiRAID Opus on BlueField shifts data protection entirely off the host. While peak performance depends on the maturity of the DPU software stack, this approach provides three important benefits:
- First, it frees host CPU resources almost completely. Storage processing no longer competes with application workloads.
- Second, it improves isolation. RAID operations are separated from the host operating system, reducing interference and limiting the impact of failures or security issues in the application layer.
- Third, it enables more flexible system design. Protected volumes can be exposed locally or over NVMe-oF, allowing the same infrastructure to support both direct-attached and shared storage use cases.
Conclusion
Rising NVMe prices and longer lead times are changing how organizations approach storage design. Simply adding more drives is no longer the easiest way to scale. Efficiency in capacity usage, CPU consumption, and overall system utilization has become the key factor.
The results presented in this article show that many traditional approaches struggle in this environment. Hardware RAID is limited by PCIe bandwidth. Kernel-based software RAID suffers from locking and memory overhead. GPU-assisted RAID adds complexity and latency without consistently improving real-world performance.
xiRAID follows a different model. By keeping data protection close to the CPU, eliminating centralized locks, and minimizing data movement, xiRAID Classic delivers high throughput while preserving usable capacity. This allows systems to scale with modern NVMe hardware without introducing new bottlenecks.
For environments that require even lower latency, network-based access, or stronger isolation, xiRAID Opus extends this approach into user space and across fabrics. With support for NVMe-oF and DPU offloading, it enables flexible deployment models that reduce CPU load and improve resource utilization.
Production deployments at Friedrich Alexander University and in large IO500-ranked systems demonstrate how these architectural choices translate into practice: higher performance, better capacity efficiency, and lower total cost of ownership on standard hardware.
As NVMe becomes the default storage medium for AI, HPC, and data-intensive workloads, efficient software architectures will play an increasingly important role. Solutions that can deliver performance, reliability, and scalability without excessive hardware overhead will be best positioned to meet these demands.
By focusing on efficiency rather than brute-force scaling, organizations can apply proven cost reduction strategies to optimize IT spending and significantly lower the total cost of ownership of storage in modern NVMe-based data centers.