Reperio case study: high-performance small block synchronous write solution for databases and data applications

June 2, 2025


In this blog post we detail how we helped our partner Reperio optimize high-performance storage solutions for database workloads that rely heavily on small block synchronous writes. Working together, we tested and compared different RAID configurations to address the critical I/O performance challenges that databases like CockroachDB, PostgreSQL, and CouchDB face with their 4k synchronous write operations.

About Reperio

Reperio helps businesses improve technology and collaboration through foundational solutions. They offer flexible, no-commitment services for clients in the US ranging from home offices to large organizations. Their expertise includes cloud virtualization, high-availability system design, databases, horizontal scaling, open source infrastructure (preventing vendor lock-in), and delivering enterprise-grade performance within budget constraints.

Background and Challenges

Many essential software platforms (including relational databases like PostgreSQL, MySQL, and Microsoft SQL Server, as well as distributed databases like CockroachDB and CouchDB) rely heavily on small block synchronous writes to ensure data durability and consistency. These systems depend on the fsync() system call to persist critical data to disk, making sure that each transaction is safely recorded. The size of these writes is often as small as 4k, and production data from CockroachDB environments confirms that such small, random synchronous writes make up a significant portion of total I/O. This I/O pattern is the most demanding and costly test for any storage system, both in terms of raw performance and hardware resource consumption. Without properly optimized storage, these expensive I/O operations can throttle overall application throughput and reliability. Testing and tuning for random 4k synchronous writes is therefore not optional; it is essential for maximizing database performance. For more details, see CockroachDB's guidance on disk stalls, PostgreSQL's WAL configuration, and how CouchDB uses fsync to prevent data corruption.
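For a quick feel of this pattern on your own hardware, here is a minimal sketch (assuming fio is installed and /dev/nvme0n1 is a scratch device whose contents may be destroyed) that reproduces the worst-case 4k synchronous random write load described above:

# Illustrative only: 60 seconds of 4k random writes, synchronous (psync, queue depth 1),
# page cache bypassed. WARNING: this overwrites data on the target device.
fio --name=sync4k --filename=/dev/nvme0n1 --rw=randwrite --blocksize=4k \
    --ioengine=psync --direct=1 --numjobs=1 --runtime=60 --time_based --group_reporting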

We use CockroachDB as the example database here, but the results apply to most of the systems listed above. Reperio measured their CockroachDB deployments and defined the test loads used in this study.

CockroachDB’s write block size distribution looks like the following, ordered from most to least observed:

  • 4k: 55%
  • 8k: 23%
  • 16k: 16.8%
  • 32k+: 5.2%
iostat block size pie chart

iostat block size histogram
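The charts above were built from device-level I/O statistics. As an illustrative sketch (the exact collection commands were not published in this post), the average write request size can be sampled per device like this:

# In recent sysstat versions, the wareq-sz column reports the average write
# request size in KiB (older versions expose a combined avgrq-sz instead).
iostat -xd 1 /dev/nvme0n1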

Double replication

Distributed storage systems such as Ceph or VMware vSAN replicate data to multiple nodes, so every write operation is effectively multiplied at the network layer, compounding I/O load and drastically inflating both performance overhead and infrastructure costs.

However, this is not actually required for many popular modern data applications. Technologies like CockroachDB, CouchDB, and Elasticsearch embed replication and high availability directly into the application layer. These self-replicating, self-balancing platforms are engineered to handle fault tolerance and data distribution on their own. A simple local RAID array is therefore sufficient for node-level redundancy, while cluster-level redundancy is provided far more efficiently by the application itself, eliminating the need for a complicated replicated storage system.

Reperio helps clients avoid this unnecessary duplication by architecting solutions that respect the intelligence built into today's software stacks—ensuring optimal IOPS efficiency, lower latency, and significant savings on both hardware and operational expenses.

Final load definition

Based on all of the above, we tested the worst-case scenario: 4k synchronous I/O. This means the numbers listed in this study are guaranteed minimums for this setup; in real-world scenarios results may well be better.

At the same time, we are making smart architecture choices to avoid unnecessary overhead without sacrificing production reliability.

Testing environment

Platform: Supermicro AS-2125HS-TNR
CPU: 2 × AMD EPYC 9454 48-core processors (96 cores / 192 threads total)
RAM: 24 × SAMSUNG M321R4GA3BB6-CQKET (all 24 memory channels populated)
Drives: 8 * SanDisk DC SN861 SDS6BA138PSP9X3
OS: Rocky Linux 9.4
RAID Engine: xiRAID Classic, MDADM

Some findings:

  • Single thread matters: Reperio first tested the same load on a system with 2.1GHz CPUs and got roughly half of our initial Random Write performance even before any system tuning. The results listed here were obtained on 3.8GHz CPUs, so almost twice the clock speed. After manually capping our CPU frequency (using the cpupower utility; see the sketch after this list), performance dropped to the same level as on the first system. This shows that synchronous Random Writes are often single-thread bound, benefiting from high clock speed and IPC more than from a multitude of cores. Intel vs AMD IPC differences don’t seem to matter much in this comparison (Reperio used an Intel Xeon Gold 6530, we used an AMD EPYC 9454, yet clock-for-clock they performed very similarly), so it mostly comes down to clock speed. Nonetheless, IPC varies from generation to generation, so this last point might not hold forever.
    Also beware of the turbo boost conditions your CPUs require to sustain their high-speed mode of operation. Some CPUs advertise very high boost clocks but in reality provide them only for a limited time and under very specific conditions, e.g. temperature below 80 degrees Celsius, a single-threaded load, and so on. The number that matters is the all-core boost clock.
  • Memory throughput matters: our initial tests were done with only 8 of the 24 memory channels populated (4 of 12 per CPU), which created a significant Random I/O bottleneck (not a 3× performance drop, but still noticeably worse results). Populating all supported channels greatly increased the IOPS numbers and produced the results below.
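For reference, here is a minimal sketch of how the clock-speed cap described in the first finding could be reproduced with the cpupower utility (the exact commands were not published in this post; the frequencies are the ones mentioned above, and driver support for frequency capping is assumed):

# Hypothetical commands: requires a cpufreq driver that honors frequency limits.
sudo cpupower frequency-set --max 2100MHz   # emulate the slower 2.1GHz system
# ... run the fio workload ...
sudo cpupower frequency-set --max 3800MHz   # restore the original frequency cap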

xiRAID configuration

RAID10 has more storage overhead than parity RAID levels (like 5 or 6) because it sacrifices half of the raw capacity for redundancy, but that buys superb reliability: up to half of the drives in the array can fail without data loss or service interruption, as long as no two failed drives belong to the same mirror group. In return, RAID10 carries “just“ a 2× penalty for Random Writes, as there are no checksums to calculate, leaving about 50% of the theoretical maximum write performance of the drives. For reference, the 8 raw drives in this system sustain roughly 3M 4k random write IOPS (see the RAW drives table below), so a 2× penalty puts the theoretical RAID10 ceiling near 1.5M IOPS, close to the ~1.4M IOPS measured later in this post. In RAID6, each small random write typically involves 3 write operations (1 for data and 2 for parity) and 3 read operations (to fetch the old data and parity), so only about 33% of the raw drive performance is achieved for random writes due to read-modify-write operations and the overhead of parity calculations.

Here is an array we tested:

  • 8 drives
  • 2 drives per group (effectively in RAID1)
  • 4 groups (effectively RAID0 on 4 * RAID1)
  • Chunk Size: 16K

Command:

xicli raid create -n media10 -l 10 -ss 16 -d /dev/nvme11n1 /dev/nvme14n1 /dev/nvme15n1 /dev/nvme16n1 /dev/nvme1n1 /dev/nvme3n1 /dev/nvme5n1 /dev/nvme7n1
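After creation, the array state and layout can be verified with the xiRAID CLI (a quick sketch; the raid show subcommand and its output format depend on your xiRAID Classic version):

xicli raid show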

MDADM configuration

  • 8 drives
  • layout=n2 (two near copies), mirroring our xiRAID setup of 4 groups with 2 drives each
  • Chunk Size: 16K

Command:

mdadm --create --verbose --chunk=16 /dev/md0 --level=10 --layout=n2 --raid-devices=8 /dev/nvme11n1 /dev/nvme14n1 /dev/nvme15n1 /dev/nvme16n1 /dev/nvme1n1 /dev/nvme3n1 /dev/nvme5n1 /dev/nvme7n1

The resulting array, as reported by /proc/mdstat:

md0 : active raid10 nvme7n1[7] nvme5n1[6] nvme3n1[5] nvme1n1[4] nvme16n1[3] nvme15n1[2] nvme14n1[1] nvme11n1[0]
      11845584000 blocks super 1.2 16K chunks 2 near-copies [8/8] [UUUUUUUU]
      bitmap: 0/89 pages [0KB], 65536KB chunk
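One practical caveat (a sketch, not part of the original methodology): md arrays perform an initial resync after creation, and benchmarking before it completes will skew write results, so it is worth waiting until the array reports a clean state:

cat /proc/mdstat                       # no resync progress bar should remain
mdadm --detail /dev/md0 | grep State   # should report a clean state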

Benchmark Results

We ran the tests step by step, moving from using a single thread to all available 192 threads.

All runs for Random loads on the RAID arrays are placed in separate tables, along with graphs based on those values, to visually showcase how the solution scales with threads and system resources.

To emphasize again: we are using the psync I/O engine, meaning all operations run synchronously with an I/O depth of 1. Within each thread, FIO does not issue the next read or write until the previous operation has completed on the drive. This matches how real-world applications utilize the storage subsystem.

RAW drives

Before beginning RAID-level testing, we measured the performance of the raw drives to ensure there were no bottlenecks and to establish a performance baseline. This also helps users understand the raw IOPS per dollar for the system, compared with the realized IOPS per dollar once RAID is layered on top.

| Load | Jobs (per drive) | Result | Avg Latency |
|---|---|---|---|
| Random Write | 96 (12) | 3M IOPS | 20.27 μs |
| Random Read | 192 (24) | 2.7M IOPS | 68.08 μs |
| Random Mix 70% Write | 192 (24) | Write: 2.6M IOPS, Read: 1.1M IOPS | Write: 23.36 μs, Read: 117.13 μs |
| Random Mix 70% Read | 192 (24) | Write: 773k IOPS, Read: 1.8M IOPS | Write: 36.15 μs, Read: 90.43 μs |

xiRAID psync random read/write performance measurements

The xiRAID testing was conducted according to the methodology outlined in the appendix, with drive preconditioning performed prior to testing.

In addition to the results shown in the table below, we also collected performance data for mixed random write and read workloads in two proportions: 70% random writes / 30% random reads and 30% random writes / 70% random reads. The results of these mixed workloads are presented in the appendix.

| Jobs | Random Read (IOPS) | Avg Read Latency (µs) | Random Write (IOPS) | Avg Write Latency (µs) |
|---|---|---|---|---|
| 1 | 14.0k | 71.02 | 58.4k | 17.04 |
| 4 | 56.8k | 70.93 | 217k | 17.99 |
| 16 | 216k | 73.2 | 834k | 18.07 |
| 32 | 420k | 75.38 | 1376k | 20.09 |
| 64 | 820k | 77.32 | 1410k | 29.39 |
| 96 | 1221k | 77.85 | 1450k | 36.19 |
| 128 | 1602k | 79.02 | 1398k | 49.73 |
| 160 | 1933k | 80.82 | 1405k | 74.94 |
| 192 | 2355k | 80.04 | 1436k | 124.35 |

Here is the combined chart visualizing both IOPS and latency for random read and write operations:

Key observations

  • As shown in the graphs, xiRAID demonstrates excellent random write performance in psync, while maintaining low latency even under heavy load.
  • Random read performance remains stable across all load levels, with consistently low latency as well.

MDADM/MDRAID psync random read/write performance measurements

mdadm/mdraid is the most commonly used software RAID solution in Linux environments. It is particularly popular for building storage systems for databases.

For this reason, we decided to perform comparative testing of mdadm/mdraid using the default configuration with the --bitmap option enabled. While this option significantly reduces random write performance, it more accurately reflects real-world, production-like configurations.
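For completeness, here is a hedged sketch of how the write-intent bitmap could be disabled if one wanted to trade fast post-failure resyncs for higher random write performance (we did not do this in the tests below):

mdadm --grow /dev/md0 --bitmap=none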

| Jobs | Random Read (IOPS) | Avg Read Latency (µs) | Random Write (IOPS) | Avg Write Latency (µs) |
|---|---|---|---|---|
| 1 | 14.7k | 67.56 | 21.49k | 21.49 |
| 4 | 58.0k | 68.38 | 148k | 26.53 |
| 16 | 231k | 68.55 | 250k | 63.37 |
| 32 | 459k | 68.9 | 285k | 111.3 |
| 64 | 904k | 69.83 | 278k | 226.26 |
| 96 | 1342k | 70.8 | 267k | 357.57 |
| 128 | 1737k | 72.4 | 232k | 528.3 |
| 160 | 2113k | 74.32 | 246k | 619.43 |
| 192 | 2464k | 76.22 | 247k | 755.63 |

Here is the combined chart visualizing both IOPS and latency for random read and write operations:

Key observations

  • Random read scales virtually linearly with increasing job count with relatively small growth in latency.
  • Random write performance is limited to about 250k IOPS, and increasing the job count mainly leads to greater latency.

Optimal performance figures

Here we list the results that make the most sense based on the resource/performance ratio as the optimal ones, together with results that are directly comparable between the tested RAID engines as a reference point.

We consider a result optimal when it delivers the best scaling for the number of jobs used. For example, if 4 jobs deliver X GB/s while 8 jobs (twice as many CPU threads) deliver X+0.1 GB/s, the 4-job run is picked as optimal and listed, while the 8-job run is omitted from the table.

| Load | Jobs | mdraid Result | mdraid Avg Latency | xiRAID Result | xiRAID Avg Latency |
|---|---|---|---|---|---|
| Random Write | 32 | 285k IOPS | 111.3 μs | 1.3M IOPS | 20.09 μs |
| Random Write | 64 | 278k IOPS | 226.26 μs | 1.4M IOPS | 29.39 μs |
| Random Read | 192 | 2.5M IOPS | 76.22 μs | 2.4M IOPS | 94.06 μs |
| Random Mix 70% Write | 32 | Write: 247k IOPS, Read: 106k IOPS | Write: 96.57 μs, Read: 73.53 μs | Write: 494k IOPS, Read: 212k IOPS | Write: 28.77 μs, Read: 81.84 μs |
| Random Mix 70% Write | 128 | Write: 248k IOPS, Read: 106k IOPS | Write: 480.69 μs, Read: 74.66 μs | Write: 1408k IOPS, Read: 603k IOPS | Write: 37.76 μs, Read: 112.2 μs |
| Random Mix 70% Read | 96 | Write: 227k IOPS, Read: 529k IOPS | Write: 246.25 μs, Read: 74.3 μs | Write: 412k IOPS, Read: 962k IOPS | Write: 32.85 μs, Read: 84.46 μs |
| Random Mix 70% Read | 128 | Write: 226k IOPS, Read: 527k IOPS | Write: 388.43 μs, Read: 74.42 μs | Write: 529k IOPS, Read: 1234k IOPS | Write: 34.39 μs, Read: 87.61 μs |

Overall Comparison (mdraid vs. xiRAID)

  • The performance data highlights xiRAID's superior efficiency. In RAID10, we deliver excellent random write performance (1.4M IOPS vs 278k IOPS on mdadm), making it highly suitable for latency-sensitive workloads such as database transactions. Keep in mind that all of these numbers are for synchronous operations, which typically come out far lower than the theoretical asynchronous maximums. Real-world applications are usually much more sensitive to Random Writes, and that is exactly where we excel.
  • Random Write Performance: xiRAID significantly outperforms mdraid in random write operations, both in terms of throughput (KIOPS) and latency. xiRAID's ability to handle high concurrent write loads with relatively low and stable latency is a major advantage.
  • Random Read Performance: Both mdraid and xiRAID show strong random read performance with good KIOPS scaling and relatively low latency. While mdraid appears to scale slightly higher in KIOPS for random reads at the tested job counts, both are highly efficient for this workload. However, xiRAID maintains consistently low latency for reads even at higher job counts.

Conclusion

This collaborative study brings together the expertise of Xinnor and Reperio to analyze modern database workloads. Our research addresses a notable gap in published literature on this widely used scenario, offering valuable insights into optimizing database infrastructure.

Performance testing demonstrated outstanding xiRAID results across various configurations for synchronous read and write operations, under both sequential and random I/O patterns.

RAID10 implementations, in particular, delivered excellent random write performance—reaching 1.4 million IOPS—while maintaining consistently low latency. These characteristics make RAID10 well-suited for latency-sensitive database transactions.

The architectural analysis revealed that eliminating redundant replication can significantly boost performance for workloads that already implement cluster-wide data redundancy. At the same time, RAID continues to provide essential local storage redundancy, preventing service interruptions in the event of a device failure.

Hardware tuning also highlighted that single-threaded CPU performance—often underestimated—is critical for database workloads. Additionally, RAM bandwidth is becoming an increasingly important factor, especially when using PCIe Gen5 NVMe SSDs.

Appendix 1

Mixed workload for mdraid rwmixread=70

| Jobs | Type | IOPS | Avg Latency (µs) |
|---|---|---|---|
| 1 | r | 11.5k | 68.01 |
| 1 | w | 5k | 43.03 |
| 4 | r | 44.8k | 69.92 |
| 4 | w | 19.3k | 43.04 |
| 16 | r | 182k | 70.62 |
| 16 | w | 78.0k | 38.26 |
| 32 | r | 342k | 71.84 |
| 32 | w | 147k | 47.64 |
| 64 | r | 489k | 73.6 |
| 64 | w | 210k | 130.05 |
| 96 | r | 529k | 74.3 |
| 96 | w | 227k | 246.25 |
| 128 | r | 527k | 74.42 |
| 128 | w | 226k | 388.43 |

Mixed workload for mdraid rwmixwrite=70

| Jobs | Type | IOPS | Avg Latency (µs) |
|---|---|---|---|
| 1 | r | 6k | 68.01 |
| 1 | w | 16k | 32.57 |
| 4 | r | 27.4k | 70.67 |
| 4 | w | 64.1k | 31.32 |
| 16 | r | 86.8k | 72.63 |
| 16 | w | 203k | 46.84 |
| 32 | r | 106k | 73.53 |
| 32 | w | 247k | 96.57 |
| 64 | r | 119k | 74.11 |
| 64 | w | 278k | 197.08 |
| 96 | r | 106k | 74.72 |
| 96 | w | 248k | 352.89 |
| 128 | r | 106k | 74.66 |
| 128 | w | 248k | 480.69 |

Mixed workload for xiRAID rwmixread=70

| Jobs | Type | IOPS | Avg Latency (µs) |
|---|---|---|---|
| 1 | r | 12.5k | 70.96 |
| 1 | w | 5k | 19.09 |
| 4 | r | 50.2k | 71 |
| 4 | w | 21.6k | 18.37 |
| 16 | r | 188k | 74.19 |
| 16 | w | 80.6k | 23.21 |
| 32 | r | 344k | 78.25 |
| 32 | w | 148k | 31.49 |
| 64 | r | 664k | 81.29 |
| 64 | w | 284k | 32.67 |
| 96 | r | 962k | 84.46 |
| 96 | w | 412k | 32.85 |
| 128 | r | 1234k | 87.61 |
| 128 | w | 529k | 34.39 |

Mixed workload for xiRAID rwmixwrite=70

| Jobs | Type | IOPS | Avg Latency (µs) |
|---|---|---|---|
| 1 | r | 8k | 70.35 |
| 1 | w | 20.6k | 17.64 |
| 4 | r | 34.4k | 71.85 |
| 4 | w | 80.3k | 18.37 |
| 16 | r | 116k | 77.33 |
| 16 | w | 270k | 25.33 |
| 32 | r | 212k | 81.84 |
| 32 | w | 494k | 28.77 |
| 64 | r | 390k | 88.77 |
| 64 | w | 909k | 31.33 |
| 96 | r | 550k | 95.28 |
| 96 | w | 1282k | 32.88 |
| 128 | r | 603k | 112.2 |
| 128 | w | 1408k | 37.76 |

Benchmark Methodology

Testing Tools and Parameters

We run our tests using the FIO utility. Full job definitions are available in the next section; here we discuss the most important parameters:

  • numjobs is the number of load-generating threads spawned by FIO.
  • iodepth is the number of blocks in flight at the same time; it is only relevant for preconditioning.
  • ioengine=psync is the engine used to apply load. psync performs all operations synchronously, effectively with iodepth=1, so setting any other value has no effect.
  • direct=1 bypasses the Linux kernel page cache/buffer; all operations go straight to the drive.
  • blocksize=4K: we are testing the worst-case scenario here, as this is the bulk of the CockroachDB load described above.
  • numa_cpu_nodes=0 and numa_mem_policy=local restrict CPU cores and RAM respectively to the NUMA nodes the drives are physically connected to. This is the same idea as the xiRAID CPU allow parameter (a sketch of how to check drive NUMA locality follows this list).
  • runtime: each test run is kept relatively short (see the runtime values in the job files below), but thanks to preconditioning the whole block device for hours beforehand we still get actual sustained results; running the tests much longer would not add significant value.
  • exitall=1 terminates all jobs as soon as the first job finishes. Without this, results can become less representative as block device load drops toward the end with fewer and fewer jobs still running. Preconditioning may also become drastically less efficient if the load at the end is unable to saturate the drive controllers' garbage collector.
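As an illustrative sketch (not from the original post), the NUMA node an NVMe controller is attached to can be read from sysfs, and the node layout inspected with numactl, before pinning FIO with numa_cpu_nodes and numa_mem_policy:

cat /sys/class/nvme/nvme0/device/numa_node   # NUMA node this controller is attached to
numactl --hardware                           # CPU and memory layout per node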

Let’s go a bit deeper into that last point about the drive garbage collector. Historically, preconditioning was done by overwriting the target block device (drive/RAID) twice with the target block size, using 1 job, and that was sufficient.

But the controllers inside modern PCIe Gen5 drives, including our SanDisk DC SN861, have gained a lot of sophistication. Not only are 2 passes no longer always enough, but any idle time between preconditioning and the actual tests lets the controller's GC run rampant and inflate performance back toward Fresh Out Of the Box levels. Even running preconditioning with 1 job or in a synchronous manner may leave the GC enough headroom to partially or even completely negate the preconditioning. And while higher performance is great, we are measuring sustained levels, i.e. worst-case, always-guaranteed numbers.

Because of all this, we had to increase the preconditioning numjobs to 4 for Sequential and 8 for Random, set loops to 4, use ioengine=libaio with iodepth=128, and automate our test runs so that the interval between preconditioning and the actual load is minimal.

The scripts we used and their descriptions are available in Appendix 2.

Our goal here is to get as close to raw drive performance as possible, with as few resources as possible.

FIO patterns

RAW

Random
Preconditioning

[global]
ioengine=libaio
numjobs=8
group_reporting=1
direct=1
iodepth=128
rw=write
blocksize=4k
loops=4
offset_increment=12%
numa_cpu_nodes=0
numa_mem_policy=local
exitall=1

[job 0]
filename=/dev/nvme0n1
[job 1]
...

Test

[global]
ioengine=psync
group_reporting=1
direct=1
rw=rand[read/write]
blocksize=4k
exitall=1
runtime=1200

[job 0]
filename=/dev/nvme0n1
[job 1]
...

RAID arrays

Random
Preconditioning

[global]
ioengine=libaio
numjobs=8
group_reporting=1
direct=1
iodepth=128
rw=write
blocksize=4K
offset_increment=12%
loops=4
exitall=1

[job1]
filename=/dev/xi_test

Test

[global]
ioengine=psync
group_reporting=1
direct=1
verify=0
rw=rand[write/read]
blocksize=4K
runtime=600
exitall

[job1]
filename=/dev/xi_test

Appendix 2

Testing scripts

#RAW drives

~/fio/fio jobs/raw-seq-precond.job

for jobs in 16 12 8 4 1; do
        ~/fio/fio --numjobs=$jobs --offset_increment=$(bc <<< "100/$jobs")% jobs/raw-seq-write.job -o ./logs/sustained/raw-seq-write-${jobs}.log
done
for jobs in 16 12 8 4 1; do
        ~/fio/fio --numjobs=$jobs --offset_increment=$(bc <<< "100/$jobs")% jobs/raw-seq-read.job -o ./logs/sustained/raw-seq-read-${jobs}.log
done

~/fio/fio jobs/raw-rand-precond.job

for jobs in 16 12 8 4 1; do
        ~/fio/fio --numjobs=$jobs jobs/raw-rand-write.job -o ./logs/sustained/raw-rand-write-${jobs}.log
done

for jobs in 16 12 8 4 1; do
        ~/fio/fio --numjobs=$jobs jobs/raw-rand-read.job -o ./logs/sustained/raw-rand-read-${jobs}.log
done
#RAID

~/fio/fio jobs/raid-seq-precond.job

for jobs in 96 64 48 32 16 8 4 1; do
        ~/fio/fio --numjobs=$jobs --offset_increment=$(bc <<< "100/$jobs")% jobs/raid-seq-write.job -o ./logs/sustained/raid-seq-write-${jobs}.log
done
for jobs in 96 64 48 32 16 8 4 1; do
        ~/fio/fio --numjobs=$jobs --offset_increment=$(bc <<< "100/$jobs")% jobs/raid-seq-read.job -o ./logs/sustained/raid-seq-read-${jobs}.log
done

~/fio/fio jobs/raid-rand-precond.job

for jobs in 96 64 48 32 16 8 4 1; do
        ~/fio/fio --numjobs=$jobs jobs/raid-rand-write.job -o ./logs/sustained/raid-rand-write-${jobs}.log
done

for jobs in 96 64 48 32 16 8 4 1; do
        ~/fio/fio --numjobs=$jobs jobs/raid-rand-read.job -o ./logs/sustained/raid-rand-read-${jobs}.log
done

Let’s go through them:

  1. First we define the device type, RAW or RAID. This is just to keep the difference between the scripts minimal and defined mostly in one place.
  2. Then we run load preconditioning.
  3. Next we decide on the number of jobs to run based on the device type. If we are testing RAW drives, the number applies to each drive, so a test with jobs=1 on 8 drives will produce 8 jobs; for a RAID device the value applies to the whole array, so jobs=32 produces 32 jobs total. We run the tests in a loop, going from more jobs to fewer.
  4. After that we go through the loads. Write comes first, because it is the load most affected by preconditioning and GC; Read is affected far less.

offset_increment is calculated using GNU bc by dividing 100 (percent of the block device) by the number of jobs, so that no two jobs write to the same sectors simultaneously. By default bc truncates the result of integer division to a whole number, which suits our needs.
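For example, with 12 jobs per drive (one of the values used in the RAW loop above), each job gets a twelfth of the device, rounded down:

bc <<< "100/12"   # prints 8, so job N starts at offset N*8% of the block device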