xiRAID is one of the key products in Xinnor's portfolio. It's a software RAID module for Linux optimized to work with NVMe devices. In a previous blog we gave some insights into how it works under transactional workloads, which are usually random in nature and consist of small IO operations. However, unlike a few years ago, today NAND storage is very often used for sequential workloads. We've seen many such sequential patterns in the field and optimized xiRAID to handle sequential large-block writes with maximum efficiency. We loosely position this feature for M&E workloads, but as you will see below, there are quite a few other use cases that fit the pattern.
Let's look at some real-world cases as examples:
Our partner builds solutions to gather driving data onboard prototype self-driving cars. The challenge of storing a 100Gbps real-time stream of data is considerable even in a data center environment. Imagine a solution that needs to work inside a trunk of a constantly moving sedan. Imagine that it not only needs the performance, but also high availability and data protection.
Thanks to xiRAID's core functionality and some fine tuning, the capture device reached its target storage performance using space-efficient RAID5 and RAID6 groups.
Another case involves a company making high-performance network packet capture devices. These devices need to write 100 and 200 Gbit/s streams directly from the network to local NVMe without any packet loss. The xiRAID solution on their platform gave them parity RAIDs capable of 24 GB per second.
Yet another example is an AFA (all-flash array) solution for video post-production. The customer wanted an all-flash solution with the key requirement of getting as close to the hardware capability limits as possible. The AFA powered by xiRAID got the best performance rating under the industry-standard frametest and TestIO benchmarking tools.
Another interesting case was when we had to create a backend for a storage system used in a high-performance cluster. The task was to build a huge scratch space with dozens of GB/s of sequential performance, capable of failover in case part of the parallel file system went down. Xinnor created a multinode solution connected to disaggregated NVMe-oF storage with total bandwidth exceeding 100 GB per second.
There are several tools that we are using inside xiRAID to get top performance with large IO sizes:
- Flexible chunk size and array geometry. It allows us to create a stripe of a size optimal both for the host IO size and for the NVMe device page size.
- A tuneable merging mechanism for writes, which helps us reduce the number of RMW (read-modify-write) operations. To ensure consistently high performance, we don't use write-back or read-ahead caching.
- A special scheduler that improves system behavior under workloads with a low number of parallel IO queues, which is common in highly sequential write workloads.
In this blog, we'll see how this allows us to service the Media and Entertainment storage market.
In our experience, there are two types of requests frequently seen in the industry:
- Compact SMB systems with capabilities up to 10GB/s
- Enterprise-grade solutions that can perform on par with storage industry leaders.
In both cases, to get the maximum performance you need a finely tuned backend and a fitting frontend interface.
We’ll start with a backend based on Xinnor xiRAID.
Building a compact SMB storage array (that can handle 10GB/s write to a RAID5)
Smaller systems are built using single-socket servers and 4 NVMe devices. We recommend lower-end AMD EPYC CPUs. We'll need at least 8 free PCIe 4.0 lanes (or 16 PCIe 3.0 lanes, depending on the drive interface) and at least one free PCIe slot for a network card that matches our performance requirements.
To get good RAID5 write performance, we’ll need fast NVMe drives. Something along the lines of Samsung PM1733 or Western Digital Ultrastar® SN840 in a U.2 form-factor.
For more read-intensive workloads we could use SN650 from WD or consumer-grade Samsung 990 Pro or WD Black® SN750/SN850.
In our lab, we have a test bench with 4x Samsung PM1733 drives. They are specced for 7 GB/s reads and 3.8 GB/s writes. Let's put some FIO load on them and see what they can do.
Sequential reads from 4 drives are almost exactly per specification: 27.8 GB/s.
Now we’ll test the writes.
Again, close to the specs. 14.6 GB/s from 4 drives.
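A quick back-of-the-envelope check confirms how close the measured numbers are to the datasheet aggregates (per-drive specs and measured figures are from the text above):

```python
# Compare measured aggregate bandwidth of 4x Samsung PM1733
# against the per-drive datasheet numbers quoted above.
DRIVES = 4
SPEC_READ_GBPS = 7.0    # per-drive sequential read, GB/s
SPEC_WRITE_GBPS = 3.8   # per-drive sequential write, GB/s

spec_read = DRIVES * SPEC_READ_GBPS     # 28.0 GB/s aggregate
spec_write = DRIVES * SPEC_WRITE_GBPS   # 15.2 GB/s aggregate

measured_read, measured_write = 27.8, 14.6  # from the fio runs

read_eff = measured_read / spec_read     # ~99% of spec
write_eff = measured_write / spec_write  # ~96% of spec
print(f"read: {read_eff:.0%} of spec, write: {write_eff:.0%} of spec")
```

In other words, the raw drives leave almost nothing on the table before RAID enters the picture.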
Now let’s RAID them together:
We are using a 128 KiB strip size, as it's optimal for our drives under sequential workloads.
Let’s check the performance.
4 threads give us 24GB/s and at 8 threads we are hitting the maximum 27.8GB/s.
Now let’s try writing to the RAID:
We are hitting close to 8 GB/s, which is very good once you take the RAID5 write penalty into account, but let's see if we can do better.
So, here’s how writes work when you’re using RAID and large block sizes.
One stripe in our case is 128 KiB (strip size) times 4 (the number of drives): 128 KiB × 4 = 512 KiB, but only 384 KiB of that is actual data; the rest is parity. When we write a 1 MB block of data, not only do we overwrite two full stripes, we also partially overwrite a third one. This partial write leads to an unwelcome RMW (read-modify-write) cycle: we have to read the old data and parity, recalculate the parity, and write the new data and new checksums back. This is much less efficient than a full-stripe overwrite and leads to extra IOs, which we can see using iostat.
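The arithmetic from the paragraph above, spelled out as a sketch (strip size and layout come from the text; `divmod` just splits the host write into full and partial stripes):

```python
KIB = 1024
STRIP = 128 * KIB               # strip (chunk) size per drive
DATA_PER_STRIPE = 3 * STRIP     # 4-drive RAID5: 3 data strips + 1 parity

io = 1024 * KIB                 # a 1 MiB host write
full_stripes, tail = divmod(io, DATA_PER_STRIPE)

print(full_stripes, tail // KIB)  # 2 full stripes, 256 KiB partial tail
# The 256 KiB tail covers only part of the third stripe's data, so the
# RAID must read the old data and parity, recompute parity, and write
# both back: the RMW cycle described above.
```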
Now we'll use some of our secret sauce and try merging several IOs to reduce partial stripe overwrites and extra IOs. Merging is controlled via xicli by the following parameters:
- mm — maximum wait time (in microseconds) for stripe accumulation for the merge functions.
- mw — wait time (in microseconds) between requests for the merge functions.
Sometimes, depending on RAID geometry and IO intensity, tuning merges can be a lengthy process of trial and error, but in the end, it’s always worth it.
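To build intuition for why merging helps, here is a toy model (an illustration of the idea only, not xiRAID's actual implementation): if back-to-back sequential writes are coalesced within a short wait window, the partial tail of one IO is absorbed by the next, and far fewer stripes need an RMW cycle.

```python
KIB = 1024
DATA_PER_STRIPE = 3 * 128 * KIB  # 384 KiB of data per 4-drive RAID5 stripe

def partial_stripe_writes(io_sizes, start=0):
    """Count stripes each IO touches only partially (each costs an RMW)."""
    count, off = 0, start
    for size in io_sizes:
        end = off + size
        if off % DATA_PER_STRIPE:      # IO starts mid-stripe
            count += 1
        if end % DATA_PER_STRIPE and end // DATA_PER_STRIPE != off // DATA_PER_STRIPE:
            count += 1                 # IO ends mid-stripe in another stripe
        off = end
    return count

unmerged = partial_stripe_writes([1024 * KIB] * 8)  # eight separate 1 MiB writes
merged = partial_stripe_writes([8 * 1024 * KIB])    # the same data as one merged IO
print(unmerged, merged)  # 11 1
```

In this toy model, merging eight sequential 1 MiB writes into one request cuts the partial-stripe writes from 11 to 1; the real merge logic works on in-flight requests, but the geometry argument is the same.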
Note that we don’t recommend manually tuning RAID geometry as it can negatively affect IO alignment to stripes.
Let’s see how our drives perform after we enabled merges:
As you can see, we got back a whopping 3+ GB/s, which is almost a whole NVMe drive's worth of performance. Now our RAID5 performs at ~75% of the maximum write performance of 4 drives. This is an incredible result! Usually, the RAID5 write penalty is considered to be 4, which means that 4 drives with a combined write performance of 14.6 GB/s should in theory deliver ~3.7 GB/s in a RAID5. xiRAID is doing three times that.
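The write-penalty arithmetic behind that claim, made explicit (all inputs are the figures quoted above):

```python
combined_write = 14.6               # GB/s, 4 drives measured without RAID
naive_raid5 = combined_write / 4    # classic RAID5 small-write penalty of 4

# Full-stripe writes sidestep the penalty entirely: with 3 data strips
# out of 4 drives, the ceiling is 3/4 of raw write bandwidth, which is
# why merged large sequential writes can approach ~75% of raw.
xiraid_write = 0.75 * combined_write   # ~10.95 GB/s, as measured above

print(round(naive_raid5, 2), round(xiraid_write / naive_raid5, 1))
```

So the "3x better than theory" figure is really a sign that merging converts almost every write into a full-stripe write.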
Still, enabling IO merging is not the only tuning that can help us build the system.
Smaller installations often can't generate many IO threads for the storage system to process. We have developed a special scheduler to address this. Let's see it in action.
We'll create a test file system and mount it as /test with the following options:
[root@lustre01 ~]# mkdir /test
To benchmark we’ll use the frametest utility, which is one of the most popular tools for testing frame-based performance.
Here’s what we get with 1 thread and xiRAID special scheduler off.
And here is how we can improve the situation by enabling the scheduler:
Now we have a solid backend for an SMB system that is capable of driving 25+ GB/s reads and 10+ GB/s writes from just 4 NVMe drives in a RAID5. Let’s follow this with an enterprise example and then we’ll get to the frontend interface.
When the system scales to 12 or more drives, we can expect much higher performance, but such a system is harder to build. On the hardware side, we need to balance the number of PCIe lanes and their speed against the number of drives, while leaving enough free PCIe slots for NICs.
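A rough lane-budget sketch for such a 12-drive box (the ~2 GB/s per PCIe 4.0 lane figure is an approximation, and the two free x16 NIC slots are an assumed layout, not a statement about our lab machine):

```python
LANE_GBPS = 2.0                   # approx. usable GB/s per PCIe 4.0 lane

drive_bw = 12 * 4 * LANE_GBPS     # 12 NVMe drives, x4 each -> ~96 GB/s ceiling
nic_bw = 2 * 16 * LANE_GBPS       # two assumed free x16 slots -> ~64 GB/s

print(drive_bw, nic_bw)           # 96.0 64.0
```

The 74 GB/s of raw reads measured below fits comfortably under the ~96 GB/s device-side ceiling, which is what "balanced" means here: no side of the PCIe budget throttles the other.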
Assuming we have a balanced hardware configuration, which is the case in our lab, let's see what we can get from 12 PCIe 4.0 drives, similarly to the previous test.
First we get our baseline performance, measuring what drives can do without RAID.
74 GB/s reads
44 GB/s writes
Next, we'll create a RAID50 volume with a group size of 6, effectively creating two RAID5s and striping across them.
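Geometry-wise this looks as follows (a sketch, assuming the same 128 KiB strip size as in the 4-drive test):

```python
KIB = 1024
STRIP = 128 * KIB                 # assumed strip size, as before
groups, group_size = 2, 6         # RAID50: 2x RAID5 groups of 6 drives

data_strips = groups * (group_size - 1)      # 10 data strips per full stripe
data_per_stripe = data_strips * STRIP        # 1280 KiB of user data per stripe
capacity_eff = data_strips / (groups * group_size)  # 10/12 usable capacity

print(data_per_stripe // KIB, round(capacity_eff, 3))  # 1280 0.833
```

Compared with a single 12-drive RAID5, this trades one extra parity drive for two independent parity domains, which halves the rebuild scope after a failure.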
We’ve already tuned for optimal IO merging and the load now is multithreaded, as is normal for larger systems.
The results are self-explanatory. xiRAID is extremely effective in this setup.
One of the key features of xiRAID is that it handles degraded mode very well. Usually, a failed drive in a RAID makes the whole volume's performance drop drastically until the rebuild is done. Let's see how xiRAID compares to mdraid when a drive failure occurs.
And below is what CPU load looks like during rebuild for both:
Some cores are overloaded and some are idle when mdraid rebuilds a volume.
With 20x performance during rebuild, xiRAID is using a very modest amount of CPU cycles spread evenly across the cores.
Now, as promised, we'll get to the frontend.
Protocols and interfaces
We've seen what the xiRAID backend can do, and we're really proud of it, but the recipe needs one more thing: a frontend. To get all these sweet gigabytes per second to the user, we need an interface that is fast enough. With Ethernet being probably the most common and accessible transport, we usually consider the following protocols for a frontend based on 100/200GbE:
- NVMe-oF over RDMA
- SMB with SMB Direct and SMB Multichannel support.
The only issue here is the lack of macOS support. For our Apple customers, we recommend Fibre Channel.
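To put the 100/200GbE options in context, line rate converts to bytes per second like this (line rate only; protocol overhead takes a few percent more):

```python
def gbe_to_gbs(gigabits: float) -> float:
    """Ethernet line rate in Gbit/s -> GB/s (bits to bytes)."""
    return gigabits / 8

print(gbe_to_gbs(100), gbe_to_gbs(200))  # 12.5 25.0 GB/s per port
# A backend pushing 25+ GB/s therefore needs at least one 200GbE port
# (or several 100GbE ports) before the network stops being the bottleneck.
```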
We’ve done a lot of market research looking for a perfect SMB server and found two commercial solutions for Linux:
After careful consideration, we decided we liked the Tuxera solution the most; its features and advantages will get a blog post of their own.