xiRAID Opus: Maximizing Performance and Data Protection on Low-Power Intel CPU

October 31, 2025

PCIe Gen 5 drives are exceptionally fast, but traditional parity RAID configurations require top-of-the-line, power-hungry processors to keep pace. As power becomes one of the most precious resources in data centers, it is increasingly preferable to run storage servers on low-power CPUs.

xiRAID Opus aims to disrupt the status quo by drastically lowering CPU requirements. In this solution brief, we collaborated with Intel to demonstrate how the low-power, energy-efficient E-cores of the Intel® Xeon® 6 processor achieve nearly full saturation of the storage backend. While P-cores remain faster, xiRAID Opus closes the performance gap and makes E-core CPUs a viable foundation for fast storage solutions serving AI and other highly demanding workloads.

Specifically, we compare xiRAID Opus RAID 6 and RAID 10 performance on E-cores and P-cores.

About xiRAID Opus, the NVMe composer with high-speed data protection

xiRAID Opus is a Linux user-space software solution that unifies local and network-attached NVMe drives into a high-performance, energy-efficient storage platform. It maximizes speed and reliability for demanding applications while minimizing hardware overhead, reducing power costs, and simplifying infrastructure management.

xiRAID Opus extends data protection functionality with:

  • Integrated network storage for seamless scaling
  • Native NVMe-oF initiator and target support across drives, volumes, and RAID arrays
  • Built-in VHOST virtualization
  • End-to-end QoS controls for predictable performance in shared environments

At its foundation, xiRAID Opus employs a Linux user-space datapath engine that bypasses the kernel I/O stack using polling mode to reduce latency and eliminate OS dependencies. This not only improves performance but also ensures frictionless Linux distribution updates—no kernel tuning required or compatibility issues expected.

Its polling-mode architecture delivers unbeatable performance:

  • 1M IOPS and 100 GB/s per CPU core with ultra-low latency
  • Linear scaling across cores and nodes
  • Consistently high performance in both normal and degraded/rebuild modes

This unique design unlocks the full potential of today's PCIe 5.0 NVMe drives and provides a future-ready path to PCIe 6.0 and 7.0, where traditional CPU locks and kernel overhead would otherwise become critical bottlenecks.

With its gRPC API, xiRAID Opus integrates seamlessly into automated, modular environments—whether in private or public clouds, HPC systems, or AI infrastructures.

This architecture enables fault-tolerant storage infrastructure using both local and remote media, delivers optimized performance in virtualized environments via VHOST, and provides network-accessible storage with comprehensive Quality of Service (QoS) controls. By disaggregating storage from compute resources, xiRAID Opus enhances resource utilization and scalability for enterprise, high-performance computing (HPC), and AI/ML workloads.

About Intel® Xeon® 6 E-cores CPU

The Intel® Xeon® 6 E-core CPU lineup, and specifically the Intel® Xeon® 6780E that we tested, represents the latest and greatest of Intel’s pursuit of efficiency.

In the following workloads, the CPU we tested provides up to 1.17 times higher performance and up to 1.46 times higher performance per watt compared to 5th Gen Intel® Xeon® Platinum processors:

  • Databases: MySQL OLTP (HammerDB)
  • Web hosting/backend/CDN: Java server-side throughput (w/ and w/o SLA)
  • Network forwarding and firewall: 5G UPF and NGFW
  • Media transcoding: SVT-HEVC, AVC, x265 codecs

Each of the broadly available CPUs in the lineup packs up to 144 cores, so a dual-socket compute node can offer 288 cores. Custom E-core CPUs built for cloud providers’ platforms deliver up to 288 cores per socket.

In general compute they outperform the aforementioned Xeon Platinum 5th Gen. CPUs by 2.4 times and deliver 1.6 times better performance per watt on average.

All of the above makes the Intel® Xeon® 6 series a great choice for many highly parallelized workloads, including those mentioned here and beyond.

About SanDisk DC SN861

Engineered for the future of mission-critical workloads, the SanDisk DC SN861 SSD is a cutting-edge PCIe Gen5 NVMe SSD that delivers exceptional performance tailored for enterprise applications. With capacity options of up to 15.36 TB, the drive is optimized for compute-intensive AI and machine learning environments, offering high-speed random reads and extremely low latency while maintaining minimal power consumption to maximize IOPS per watt. The DC SN861 is also enriched with a robust set of enterprise features such as Flexible Data Placement (FDP), support for OCP 2.0, and integrated safeguards like Power Loss Protection and End-to-End Data Path Protection. This comprehensive feature set makes the DC SN861 ideally suited for hyperscale, cloud, and enterprise data centers that demand both high performance and operational efficiency.

Test Environment

Server 1 (E-cores)

  • Platform: Hyperscalers Quanta S7Q
  • CPUs: 2x Intel® Xeon® 6780E
  • RAM: 16x 32GB Micron MTC20F1045S1RC64BDY (512GB)
  • Networking: 4x Mellanox ConnectX-5 MT4119 (100G)

Server 2 (P-cores)

  • Platform: Hyperscalers Quanta S7Q
  • CPUs: 2x Intel® Xeon® 6747P
  • RAM: 16x 32GB Micron MTC20F1045S1RC64BDY (512GB)
  • Networking: 4x Mellanox ConnectX-5 MT4119 (100G)

Drives

  • Model: 8x SanDisk SDS6BA176PSP9X3
  • Capacity: 8x 7.68 TB (61.44 TB total)

Testing Methodology

We used the industry-standard fio tool to test storage performance. Complete fio job configurations are available in the Appendix.

In our testing, the term Jobs refers to both the number of fio testing threads and the number of CPU cores dedicated to the xiRAID Opus engine. We deliberately synchronized these values to accurately measure resource efficiency. For example, when testing with numjobs=4, xiRAID Opus was bound to the same 4 CPU cores as fio with no access to additional system resources.

When considering the results presented below, it is important to note that xiRAID Opus runs on the same CPU threads as the fio plugin. Therefore, the actual resource consumption of xiRAID Opus is significantly lower than the reported values.
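As a minimal illustration (the core IDs and job file name below are placeholders; the complete job files and full core lists are given in the Appendix), a 4-job run binds fio to four cores, with the xiRAID Opus engine polling on the same four cores:

# fio pinned to cores 0-3, one job per core (cpus_allowed_policy=split
# is set in the job file); the xiRAID Opus engine uses the same 4 cores.
./fio --cpus_allowed=0-3 --numjobs=4 --iodepth=64 <jobfile>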

For sustained performance testing, we performed drive preconditioning before taking final measurements. The preconditioning process involved overwriting all drives twice using block sizes matched to the planned workload:

  • Random I/O testing: 4K blocks
  • Sequential throughput testing: Chunk size blocks (128K)

With preconditioning eliminating performance variables, sustained performance tests were run for 5 minutes.
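As a rough sketch of how this was driven (assuming the Appendix preconditioning job is saved as precondition.fio with its blocksize line parameterized as blocksize=${BS}, since fio expands environment variables in job files), the two preconditioning passes might look like this:

# Two full overwrites (loops=2, size=100% in the job file) with the block
# size matched to the upcoming workload
BS=4k ./fio precondition.fio      # before random I/O tests
BS=128k ./fio precondition.fio    # before sequential throughput tests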

The highest number of Jobs shown in the tables below is the optimal value, i.e., the point that provides the best scaling.

It's worth emphasizing that achieving absolute, world-record performance is not always necessary. In many practical scenarios, particularly in production environments, what truly matters is attaining performance close to the theoretical maximum while making efficient use of available resources. Striving for the last few percentage points of performance often comes at a disproportionate cost in terms of complexity, energy consumption, and hardware utilization.

Instead, optimizing for a balanced and sustainable approach—where computational resources are used judiciously and performance remains within an acceptable margin of peak theoretical limits—tends to offer better long-term value. This is especially relevant in constrained environments or large-scale systems, where maximizing throughput per watt, per dollar, or per unit of hardware is far more impactful than achieving record-breaking benchmarks that may not translate into practical benefits. In essence, the goal should often be smart efficiency, not brute-force speed.

Conducted tests

Interrupt coalescing (IC, see detailed description in Appendix) can ease the load on CPU cores to achieve higher peak performance. However, it does have a significant drawback—latencies can grow by up to a factor of 10.

To investigate the extent of this trade-off, we conducted sequential and random performance measurements on raw drives with IC disabled and enabled, as well as on RAID 6 and RAID 10 arrays. We measured different RAID levels to represent optimal system configurations for different workloads. Parity-based RAID levels (5, 6, and so on) perform well for sequential workloads but suffer under random writes. In contrast, RAID 10 is less optimal for sequential workloads, as it effectively writes to only half of the drives, but excels under random writes because it does not need to perform Read-Modify-Write operations. (You can read more about this concept at the beginning of our Performance Guide Pt. 3: Setting Up and Testing RAID.)

Results & Analysis

Initial challenges when working in kernel space

Before each test, we benchmarked the raw NVMe drives to compare their performance with the official specifications and ensure there were no bottlenecks. While measuring the raw drives on the E-core CPUs, we were unable to extract full drive performance under a random read pattern and decided to investigate this further.

Important note: Intel Xeon E-core CPUs are not designed for cutting-edge storage performance. They excel in asynchronous, easily parallelized tasks like CDN hosting and network monitoring. Storage, on the other hand, still depends significantly on single-core performance. It can be brute-forced with high core counts, but much less efficiently. At least, that was the case until xiRAID Opus.

Random read performance problem

We were able to achieve specification performance on the random read pattern only when using Interrupt Coalescing.

Load | Numjobs | E-cores without IC, M IOPS | E-cores with IC, M IOPS
Random Read | 32 | 4 | 24.8
Random Read | 24 | 3.9 | 24.8
Random Read | 8 | 7.6 | 24.8
Random Read | 4 | 7.5 | 14.8
Random Read | 1 | 2.7 | 2.5

Even though random writes on raw drives take a small hit from enabling IC, random read numbers shoot through the roof with IC enabled. It seems obvious that kernel-space with Interrupt Coalescing is the way to go, right? Unfortunately, the trade-off becomes apparent in the next table.

Random I/O completion latency

In this test, we examined the potential consequences of using Interrupt Coalescing and made comparisons with xiRAID Opus, which operates in polling mode inside user space.

This test was run with 1 job (CPU thread) and an I/O depth of 1 to measure combined system latency for synchronous operations.

Latency, μs (Avg / 99th percentile)
Test | E-cores, RAW, without IC | E-cores, RAW, with IC | E-cores, RAID 10, polling | P-cores, RAID 10, polling
Random Write | 8.64 / 9.79 | 97.56 / 100 | 9.69 / 12.09 | 9.56 / 11.46
Random Read | 78.86 / 251 | 187.53 / 318 | 63.23 / 75.26 | 62.78 / 74.24

Here’s the catch: these drives are capable of sub-10μs latencies for random writes, and both the default configuration of the kernel NVMe driver and xiRAID Opus deliver it. The kernel driver with Interrupt Coalescing, however, raises random write latencies by approximately 10x and random read latencies by approximately 2x.

This table lists RAID 10 only, because RAID 6 latencies include the time required for checksum calculations and are therefore not directly comparable with raw drives. RAID 10 performs simple data mirroring without parity calculations, so its latency can be compared fairly.

We see that xiRAID Opus does not have the same penalty here as kernel drivers with IC enabled. But how much performance does it provide with those latencies?

RAID Performance Comparison: P-Cores vs. E-Cores

Next, using the same set of drives and identical platforms, we performed disk subsystem performance measurements on energy-efficient cores (E-cores) and high-performance cores (P-cores). The testing was carried out according to the methodology described in the appendix. The measurements were taken from RAID 6 and RAID 10 arrays running in user space via xiRAID Opus.

Random Pattern Performance

RAID 6 results and analysis

For RAID 6 with small random I/O, each write operation requires additional overhead. A single logical random write typically triggers 3 write operations (one for the new data and two for parity) and 3 read operations (to retrieve the old data and parity). Since reads are generally less limiting for performance than writes, the effective impact is dominated by those three write operations.

Therefore, the effective random write performance can be approximated as one-third of the raw capability—about 33% (100% ÷ 3).

Based on raw drive performance of 3,440K IOPS on these platforms, the effective random write performance of RAID 6 will be 1,135K IOPS.

Theoretical maximum random read performance for RAID 6 is 100% of the performance of 8 raw drives and equals 26.4M IOPS.
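As a back-of-the-envelope check of the limits used in the table below (written here in LaTeX-style notation), the small random I/O ceilings for RAID 6 follow directly from the write penalty:

\text{IOPS}_{\text{write}}^{\text{RAID6}} \approx \frac{\text{IOPS}_{\text{write}}^{\text{raw}}}{3} = \frac{3440\text{K}}{3} \approx 1.1\text{M}, \qquad \text{IOPS}_{\text{read}}^{\text{RAID6}} \approx \text{IOPS}_{\text{read}}^{\text{raw}} = 26.4\text{M}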

Random Writes
CPU cores | Theoretical maximum, M IOPS | E-cores, M IOPS | E-cores efficiency | P-cores, M IOPS | P-cores efficiency
8 | 1.13 | 1.1 | 96% | 0.99 | 86%
4 | 1.13 | 0.74 | 65% | 0.91 | 79%
1 | 1.13 | 0.19 | 17% | 0.28 | 24%

Random Read
CPU cores | Theoretical maximum, M IOPS | E-cores, M IOPS | E-cores efficiency | P-cores, M IOPS | P-cores efficiency
32 | 26.4 | 19 | 71% | 20.1 | 76%
24 | 26.4 | 15.8 | 59% | 62.5 | 66%
8 | 26.4 | 6 | 24% | 6.4 | 25%
4 | 26.4 | 3.2 | 22.7% | 3.5 | 13%
1 | 26.4 | 0.86 | 3.2% | 0.84 | 3.2%

RAID 10 results and analysis

RAID 10 arrays are striped across mirrors of 2 drives, so their performance limits are calculated as follows:

For random writes, each operation is mirrored to a partner drive, so only half of the raw write bandwidth is effectively available. Based on raw performance of 8 drives at 3,440K IOPS, the theoretical maximum random write performance is 3,440K ÷ 2 = 1,720K IOPS.

For random read operations, all drives can be accessed in parallel, thus the full raw read performance of 26.4M IOPS for 8 drives can be utilized.

Random Writes
CPU cores | Theoretical maximum, M IOPS | E-cores, M IOPS | E-cores efficiency | P-cores, M IOPS | P-cores efficiency
8 | 1.72 | 1.79 | 104% | 1.78 | 103%
4 | 1.72 | 1.38 | 80% | 1.83 | 106%
1 | 1.72 | 0.85 | 49.5% | 0.85 | 49.5%

Random Read
CPU cores | Theoretical maximum, M IOPS | E-cores, M IOPS | E-cores efficiency | P-cores, M IOPS | P-cores efficiency
32 | 26.4 | 19.6 | 74% | 20.3 | 77%
24 | 26.4 | 15.8 | 60% | 16.7 | 63%
8 | 26.4 | 6.01 | 23% | 6.65 | 25%
4 | 26.4 | 3.18 | 12% | 3.52 | 13%
1 | 26.4 | 0.85 | 3% | 0.92 | 3.5%

Some write numbers are even higher than the theoretical maximum. This is possible because the drives we used are new and fresh, so they can perform slightly above their specifications. This could be negated with additional preconditioning. However, since this study focuses on CPU performance rather than drives, we have allowed this variance. This demonstrates that for random writes on RAID10, the drives are the bottleneck, not the CPUs.

Read performance, while below the drives' limit, is still very respectable—even more so considering:

  • xiRAID Opus in this testing setup shares each CPU thread with fio, which:
    • generates test data stream
    • sends I/O commands and monitors their execution by Opus
    • gathers performance statistics
    • etc.
  • xiRAID Opus still provides native-level latencies

Sequential Pattern Performance

RAID 6 results and analysis

Maximum performance of 8 drives in this environment is:

  • 60.1 GB/s for sequential write with a 128 KB block size and a queue depth of 64
  • 109 GB/s for sequential read with a 128 KB block size and a queue depth of 64

Considering that for each operation in RAID 6, two drives are used for parity data, the theoretical maximum write performance is calculated excluding these two drives: (60.1 / 8) × 6 = 45 GB/s.

For read operations, no parity calculation is performed; therefore, the theoretical maximum for RAID 6 remains the same as raw drives at 109 GB/s.

Sequential write
CPU cores | Theoretical maximum, GB/s | E-cores, GB/s | E-cores efficiency | P-cores, GB/s | P-cores efficiency
8 | 45 | 42.3 | 94% | 42.3 | 94%
4 | 45 | 35.6 | 81% | 42.2 | 93%
1 | 45 | 13.5 | 30% | 13.9 | 31%

Sequential read
CPU cores | Theoretical maximum, GB/s | E-cores, GB/s | E-cores efficiency | P-cores, GB/s | P-cores efficiency
1 | 109 | 106 | 97% | 105 | 96%

RAID 10 results and analysis

For RAID 10, half of the drives are used for data mirroring, so only 4 out of 8 drives contribute to effective write bandwidth. The theoretical maximum write performance is calculated as: 60.1 / 2 = 30 GB/s.

For read operations, all drives can be used in parallel, since data can be read from any drive in a mirror pair. Based on this, the theoretical maximum read performance for RAID 10 is also 109 GB/s.

Sequential write
CPU cores | Theoretical maximum, GB/s | E-cores, GB/s | E-cores efficiency | P-cores, GB/s | P-cores efficiency
8 | 30 | 25.1 | 83% | 25.0 | 82%
4 | 30 | 25.1 | 83% | 25.1 | 83%
1 | 30 | 28.0 | 93% | 28.0 | 93%

Sequential read
CPU cores | Theoretical maximum, GB/s | E-cores, GB/s | E-cores efficiency | P-cores, GB/s | P-cores efficiency
1 | 109 | 106 | 97% | 105 | 96%

Key Observations & Conclusions

As shown in our results tables, out of the box, E-cores become a bottleneck for sequential writes and random reads. However, as previously mentioned, they are not designed for this scenario to begin with, so this limitation is expected.

Interrupt Coalescing helps push their upper limit higher, but at the cost of very significant latency penalties, to the point where the trade-off is simply not worth it for the majority of use cases.

xiRAID Opus, on the other hand, provides the best of both worlds: high IOPS with native latency levels at the same time, avoiding the compromise of one in favor of the other. Our user-space RAID engine helps Intel E-core Xeon CPUs become a competent, cost- and power-efficient solution for a task that used to be out of their league.

Appendix

What is Interrupt Coalescing?

An interrupt is a signal sent by a peripheral device to a CPU indicating that it has data to process. As the name suggests, it interrupts whatever the CPU is currently doing (in most cases; exceptions are outside the scope of this solution brief) and forces it to switch context to the task the interrupt is related to. Switching context requires unloading the old context to and loading the new context from system memory—an overhead that quickly adds up. Slower CPUs paired with faster Gen 5 NVMe drives can drown in interrupts under load, sinking performance.

Interrupt Coalescing is one of the relatively popular methods used to ease the load on CPUs with fast PCIe devices. By default, NVMe drives send an interrupt for every single operation, which lowers latency but bombards CPUs with expensive context switches. IC groups those interrupts to process together, so a CPU needs to drop everything it's doing and switch context only once per group of, e.g., 10 interrupts, instead of 10 times.

But as usual, there is a catch. The NVMe Interrupt Coalescing feature that the Linux kernel driver relies on only allows the IC timeout (the time to wait for a full interrupt group to assemble) to be set in 100 μs increments, making 100 μs the smallest possible timeout when not enough interrupts arrive in time (which happens quite often with multi-threaded workloads). This can be 10-20x higher than the actual latency of a Gen 5 drive without IC (sub-10 μs), making it a very difficult trade-off. This is where xiRAID Opus comes into the picture.
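For reference, here is a minimal sketch of how interrupt coalescing can be configured with nvme-cli; the threshold and time values below are illustrative, not the settings used in this brief:

# NVMe Feature 0x08 (Interrupt Coalescing): bits 7:0 carry the aggregation
# threshold (a 0-based value) and bits 15:8 the aggregation time in 100 μs
# units. Here: coalesce up to 8 completions or wait at most 100 μs.
nvme set-feature /dev/nvme0 --feature-id=0x08 --value=$(( (1 << 8) | 7 ))

# Read back the current interrupt coalescing setting
nvme get-feature /dev/nvme0 --feature-id=0x08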

fio Job Configurations

Drive Preconditioning

[global]
ioengine=./spdk_bdev
spdk_json_conf=<config.json>
thread=1
numjobs=1
group_reporting=1
direct=1
verify=0
numa_cpu_nodes=0
numa_mem_policy=bind:0
time_based=0
rw=write
blocksize=4k
iodepth=64
size=100%
loops=2

<jobs>

Random I/O

I/O depth and thread count were defined on the command line:

./fio --cpus_allowed=0-47,96-143 --iodepth=${iod} --numjobs=${jobs} <jobfile> -o ../../logs/<logfile>-${jobs}-${iod}.log
[global]
ioengine=./spdk_bdev
spdk_json_conf=<config.json>
thread=1
group_reporting=1
direct=1
verify=0
randrepeat=0
norandommap=1
gtod_reduce=1
gtod_cpu=143
filename=xnraid
cpus_allowed_policy=split
numa_cpu_nodes=0
numa_mem_policy=bind:0

[xnraid]
rw=rand[read|write]
blocksize=4k
time_based=1
ramp_time=2m
runtime=5m

Random Latency

[global]
ioengine=./spdk_bdev
spdk_json_conf=<config.json>
thread=1
group_reporting=1
direct=1
verify=0
filename=xnraid
numa_cpu_nodes=0
numa_mem_policy=bind:0

[xnraid]
rw=rand[read|write]
blocksize=4k
iodepth=1
time_based=1
ramp_time=2m
runtime=5m