Performance Guide Pt. 2: Hardware and Software Configuration

September 14, 2023


This is the second part of our Performance Guide blog post series. In the previous part, we covered the fundamentals of system performance, its basic units, and methods for measurement. In this part, we'll discuss the optimal hardware configuration and Linux settings and walk you through a basic calculation of the expected performance.

Hardware

To identify performance problems correctly, the hardware and software must be properly configured before starting the tests. Based on our experience, low performance is often caused by incorrect settings and configurations.

The overall performance of storage devices will always be limited by the throughput of the computer bus that attaches them.

Drive connection methods and topology

PCIe

Most modern NVMe storage devices connected through U.2 or U.3 connectors utilize up to four PCI Express lanes.

To achieve optimal performance, it is important to ensure that the correct number of PCIe lanes is used to connect the drives and that the lanes are not overcommitted.

Ensure that the version of the PCIe protocol matches the one supported by the drives. You can check the connection using the "lspci -vv" command. Pay attention to the "LnkCap" and "LnkSta" sections: the first shows the link capabilities, while the second shows the actual negotiated link status.

Non-Volatile memory controller: KIOXIA Corporation NVMe SSD Controller Cx6 (rev 01) (prog-if 02 [NVM Express])
Subsystem: KIOXIA Corporation Generic NVMe CM6
Physical Slot: 2
Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
Latency: 0, Cache Line Size: 64 bytes
Interrupt: pin A routed to IRQ 140
NUMA node: 0
IOMMU group: 41
Region 0: Memory at da510000 (64-bit, non-prefetchable) [size=32K]
Expansion ROM at da500000 [disabled] [size=64K]
Capabilities: [40] Power Management version 3
		Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
		Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
Capabilities: [70] Express (v2) Endpoint, MSI 00
		DevCap: MaxPayload 512 bytes, PhantFunc 0, Latency L0s unlimited, L1 unlimited
				ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset+ SlotPowerLimit 75.000W
		DevCtl: CorrErr+ NonFatalErr+ FatalErr+ UnsupReq+
				RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+ FLReset-
				MaxPayload 256 bytes, MaxReadReq 512 bytes
		DevSta: CorrErr+ NonFatalErr- FatalErr- UnsupReq+ AuxPwr- TransPend-
		LnkCap: Port #1, Speed 16GT/s, Width x4, ASPM not supported
				ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
		LnkCtl: ASPM Disabled; RCB 64 bytes, Disabled- CommClk+
				ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
		LnkSta: Speed 16GT/s (ok), Width x4 (ok)
				TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
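The negotiated link speed and width can also be read directly from sysfs. A minimal sketch, assuming PCIe-attached controllers and the standard /sys/class/nvme layout (for fabric-attached controllers these attributes are not present):

for ctrl in /sys/class/nvme/nvme*; do
    pcidev="$ctrl/device"    # symlink to the controller's PCI device
    echo "$(basename "$ctrl"): $(cat "$pcidev/current_link_speed") x$(cat "$pcidev/current_link_width") (max: $(cat "$pcidev/max_link_speed") x$(cat "$pcidev/max_link_width"))"
done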

If the protocol version or the number of lanes differ from the expected ones, you should check the BIOS settings, as well as the capabilities of the motherboard. The performance of PCIe connections is influenced by both the PCIe version and number of lanes, as indicated in the table provided below.

PCIe generation   Transfer rate per lane   Max throughput
                                           x1                           x2            x4            x8            x16
3.0               8 GT/s                   1008,246 MB/s = 0,985 GB/s   1,969 GB/s    3,938 GB/s    7,877 GB/s    15,754 GB/s (126 Gbit/s)
4.0               16 GT/s                  1,969 GB/s                   3,938 GB/s    7,877 GB/s    15,754 GB/s   31,508 GB/s (252 Gbit/s)
5.0               32 GT/s                  3,938 GB/s                   7,877 GB/s    15,754 GB/s   31,508 GB/s   63,015 GB/s (504 Gbit/s)
6.0               64 GT/s                  7,563 GB/s                   15,125 GB/s   30,250 GB/s   60,500 GB/s   121,000 GB/s (968 Gbit/s)
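As a quick sanity check of these numbers: for a PCIe 4.0 x4 link, 16 GT/s per lane x 4 lanes x 128/130 (encoding overhead) / 8 bits per byte ≈ 7,877 GB/s, which matches the x4 column of the table.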

Backplane

To identify bottlenecks, it is worth checking the physical connections and studying the specifications of the backplane and NVMe expander cards.

Retimers and SAS-HBA

NVMe drives can be connected through a backplane to PCIe retimers, OCuLink ports on the motherboard, or tri-mode SAS HBA adapters. SAS drives, however, must be connected to a SAS HBA. If you are using a SAS HBA, ensure that the controllers are correctly installed and update the HBA to the latest firmware version. Note that the performance of your storage system will not exceed the performance of the PCIe interfaces used to connect the drives, whether they are connected directly or indirectly.

SAS generation   Number of ports   Max performance
SAS-3            1                 1200 MB/s
SAS-3            4                 4800 MB/s
SAS-4            1                 2400 MB/s
SAS-4            4                 9600 MB/s

Ensure that the wide ports and links are correctly configured using the management utilities provided by the SAS HBA vendor.
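In addition to the vendor utilities, on Linux the negotiated SAS link rates are usually exposed through the SAS transport class in sysfs. A minimal sketch, assuming the HBA driver registers its phys there (names and paths may differ):

for phy in /sys/class/sas_phy/phy-*; do
    echo "$(basename "$phy"): $(cat "$phy/negotiated_linkrate")"
done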

Drives

For tasks that do not involve small block sizes, we highly recommend reformatting the NVMe namespaces with a 4k sector size.
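To check which LBA formats a namespace supports and to reformat it with a 4k sector size, nvme-cli can be used along the following lines. The --lbaf index is drive-specific (1 is only an example), and reformatting destroys all data on the namespace:

nvme id-ns /dev/nvmeXn1 -H | grep "LBA Format"
# pick the index of the 4 KiB (Data Size: 4096 bytes) format reported above
nvme format /dev/nvmeXn1 --lbaf=1 --ses=0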

Memory

Arrays with parity use RAM for temporary data storage when calculating checksums and during recovery. Therefore, it is critically important to install and configure memory correctly.

xiRAID does not require a significant amount of memory. However, it is necessary to:

  1. use error-correcting code (ECC) memory;
  2. populate the memory channels symmetrically (and symmetrically across CPU sockets in multiprocessor systems);
  3. preferably use memory with the highest frequency supported by the platform.

Failing to comply with item 2 may lead to a performance loss of about 30-40%.
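To verify the installed memory configuration (module sizes, speeds, and slot population across channels), you can use dmidecode, for example:

dmidecode -t memory | grep -E "Locator:|Size:|Speed:"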

CPU

  • Core count and instruction set requirements.

xiRAID requires a processor with support for the AVX instruction set, which is available on all modern Intel and AMD x86-64 processors.

To achieve high performance for random workloads, you need 1-2 cores for every 1 million IOPS expected. For sequential workloads, make sure you have at least 4 cores for every 20 GBps.
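A quick way to confirm AVX support and count the available cores (the exact flag names reported, such as avx, avx2 or avx512f, depend on the CPU model):

grep -o 'avx[^ ]*' /proc/cpuinfo | sort -u    # AVX extensions reported by the CPU
nproc                                         # number of logical cores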

BIOS

  • C-States must be turned off (you can check which C-states the OS actually sees, as shown after this list).
  • Power Management. The CPU operation mode needs to be switched to performance (balanced is used by default).
  • CPU Topology (AMD only). The number of cores per die (CCD) configured in the BIOS must match the actual CPU topology. Incorrect configuration can result in a significant loss of performance.
  • HT/SMT. We recommend switching on HT/SMT to achieve better performance.
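To double-check from the OS which C-states are actually exposed after changing the BIOS settings, you can use cpupower (shipped with the kernel tools packages):

cpupower idle-info | head -20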

Linux

Preliminary checks and settings

The main tool for managing NVMe drives in Linux is the nvme-cli utility.

For example:

  • to see all the NVMe drives connected to the system, run the "nvme list" command:
# nvme list
Node                  Generic               SN                   Model                                    Namespace Usage                      Format           FW Rev  
--------------------- --------------------- -------------------- ---------------------------------------- --------- -------------------------- ---------------- --------
/dev/nvme7n1          /dev/ng7n1            BTLJ85110C314P0DGN   INTEL SSDPE2KX040T8                      1           4.00  TB /   4.00  TB      4 KiB +  0 B   VDV10131
/dev/nvme6n1          /dev/ng6n1            BTLJ85110CDH4P0DGN   INTEL SSDPE2KX040T8                      1           4.00  TB /   4.00  TB      4 KiB +  0 B   VDV10131
/dev/nvme5n1          /dev/ng5n1            BTLJ85110C1K4P0DGN   INTEL SSDPE2KX040T8                      1           4.00  TB /   4.00  TB      4 KiB +  0 B   VDV10131
/dev/nvme4n1          /dev/ng4n1            BTLJ85110BU94P0DGN   INTEL SSDPE2KX040T8                      1           4.00  TB /   4.00  TB      4 KiB +  0 B   VDV10170
/dev/nvme3n1          /dev/ng3n1            BTLJ85110CG74P0DGN   INTEL SSDPE2KX040T8                      1           4.00  TB /   4.00  TB      4 KiB +  0 B   VDV10131
/dev/nvme2n1          /dev/ng2n1            BTLJ85110C4N4P0DGN   INTEL SSDPE2KX040T8                      1           4.00  TB /   4.00  TB      4 KiB +  0 B   VDV10170
/dev/nvme1n1          /dev/ng1n1            BTLJ85110C8G4P0DGN   INTEL SSDPE2KX040T8                      1           4.00  TB /   4.00  TB      4 KiB +  0 B   VDV10170
/dev/nvme0n1          /dev/ng0n1            BTLJ85110BV34P0DGN   INTEL SSDPE2KX040T8                      1           4.00  TB /   4.00  TB      4 KiB +  0 B   VDV10131
  • to view information about all NVMe subsystems and verify the drive connections, run the nvme list-subsys command:
# nvme list-subsys
nvme-subsys7 - NQN=nqn.2014.08.org.nvmexpress:80868086BTLJ85110C314P0DGN  INTEL SSDPE2KX040T8
\
 +- nvme7 pcie 0000:e6:00.0 live
nvme-subsys6 - NQN=nqn.2014.08.org.nvmexpress:80868086BTLJ85110CDH4P0DGN  INTEL SSDPE2KX040T8
\
 +- nvme6 pcie 0000:e5:00.0 live
nvme-subsys5 - NQN=nqn.2014.08.org.nvmexpress:80868086BTLJ85110C1K4P0DGN  INTEL SSDPE2KX040T8
\
 +- nvme5 pcie 0000:e4:00.0 live
nvme-subsys4 - NQN=nqn.2014.08.org.nvmexpress:80868086BTLJ85110BU94P0DGN  INTEL SSDPE2KX040T8
\
 +- nvme4 pcie 0000:e3:00.0 live
nvme-subsys3 - NQN=nqn.2014.08.org.nvmexpress:80868086BTLJ85110CG74P0DGN  INTEL SSDPE2KX040T8
\
 +- nvme3 pcie 0000:9b:00.0 live
nvme-subsys2 - NQN=nqn.2014.08.org.nvmexpress:80868086BTLJ85110C4N4P0DGN  INTEL SSDPE2KX040T8
\
 +- nvme2 pcie 0000:9a:00.0 live
nvme-subsys1 - NQN=nqn.2014.08.org.nvmexpress:80868086BTLJ85110C8G4P0DGN  INTEL SSDPE2KX040T8
\
 +- nvme1 pcie 0000:99:00.0 live
nvme-subsys0 - NQN=nqn.2014.08.org.nvmexpress:80868086BTLJ85110BV34P0DGN  INTEL SSDPE2KX040T8
\
 +- nvme0 pcie 0000:98:00.0 live
  • we recommend checking the sector size and firmware version:
nvme id-ns /dev/nvmeXn1 -H | grep "LBA Format"
nvme id-ctrl /dev/nvmeX | grep "^fr "

All drives intended for use in one RAID must have the same model, firmware version (preferably the latest one), and sector size.

  • run the following command to view the SMART status of the drives:
nvme smart-log /dev/nvmeX --output-format=json
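If jq is available, the most relevant SMART fields can be pulled for all drives at once. A sketch assuming the JSON field names used by recent nvme-cli versions (they may vary between versions):

for dev in /dev/nvme[0-9]; do
    echo "== $dev =="
    nvme smart-log "$dev" --output-format=json | jq '{critical_warning, temperature, percent_used, media_errors}'
done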

If you are using an EBOF (Ethernet Bunch of Flash), check the native NVMe multipath parameters:

nvme list-subsys

and configure the round-robin I/O policy if necessary:

nvme connect -t rdma -a <target_address> -s <port> -n <subsystem_nqn>
echo "round-robin" > /sys/class/nvme-subsystem/nvme-subsysX/iopolicy

Kernel parameters

To improve performance, you should configure the kernel boot parameters.

  • intel_idle.max_cstate=0: Sets the maximum C-state for Intel processors to 0. C-states are power-saving states for the CPU, and setting it to 0 ensures that the processor does not enter any idle power-saving state.

Applying the kernel parameters presented below introduces a security risk. While they improve performance, they also reduce system security, and Xinnor never uses them for production implementations or public performance tests. Please use these parameters with care and at your own risk.

  • noibrs: Disables the Indirect Branch Restricted Speculation (IBRS) mitigation. IBRS is a feature that protects against certain Spectre vulnerabilities by preventing speculative execution of indirect branches.
  • noibpb: Disables the Indirect Branch Predictor Barrier (IBPB) mitigation. IBPB is another Spectre mitigation that helps prevent speculative execution of indirect branches.
  • nopti: Disables the Page Table Isolation (PTI) mitigation. PTI is a security feature that isolates kernel and user page tables to mitigate certain variants of the Meltdown vulnerability.
  • nospectre_v2: Disables the Spectre Variant 2 mitigation. Spectre Variant 2, also known as Branch Target Injection, is a vulnerability that can allow unauthorized access to sensitive information.
  • nospectre_v1: Disables the Spectre Variant 1 mitigation. Spectre Variant 1, also known as Bounds Check Bypass, is another vulnerability that can lead to unauthorized access to data.
  • l1tf=off: Turns off the L1 Terminal Fault (L1TF) mitigation. L1TF is a vulnerability that allows unauthorized access to data in the L1 cache.
  • nospec_store_bypass_disable: Disables the Speculative Store Bypass (SSB) mitigation. SSB is a vulnerability that can allow unauthorized access to sensitive data stored in the cache.
  • no_stf_barrier: Disables the Store Forwarding Barrier mitigation, which protects against speculative store-forwarding attacks on some platforms.
  • mds=off: Disables the Microarchitectural Data Sampling (MDS) mitigation. MDS is a vulnerability that can lead to data leakage across various microarchitectural buffers.
  • tsx=on: Enables Intel Transactional Synchronization Extensions (TSX). TSX is an Intel feature that provides support for transactional memory, allowing for efficient and concurrent execution of certain code segments.
  • tsx_async_abort=off: Disables asynchronous aborts in Intel TSX. Asynchronous aborts are a mechanism that can terminate transactions and undo their effects in certain cases.
  • mitigations=off: Turns off all generic mitigations for security vulnerabilities. This parameter disables all known hardware and software mitigations for various vulnerabilities, prioritizing performance over security.

It's important to note that disabling or modifying these security mitigations can potentially expose your system to security vulnerabilities. These parameters are typically used for debugging or performance testing purposes and are not recommended for regular usage, especially in production environments.

Example:

sed -i 's/^GRUB_CMDLINE_LINUX="\(.*\)"$/GRUB_CMDLINE_LINUX="noibrs noibpb nopti nospectre_v2 nospectre_v1 l1tf=off nospec_store_bypass_disable no_stf_barrier mds=off tsx=on tsx_async_abort=off mitigations=off intel_idle.max_cstate=0 \1"/' /etc/default/grub
update-grub or grub2-mkconfig

We recommend enabling polling mode for NVMe devices on Intel-based systems (not on AMD-based systems).

echo "options nvme poll_queues=4" >> /etc/modprobe.d/nvme.conf

dracut -f or update-initramfs -u -k all
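After a reboot, you can confirm that the parameters have been applied:

cat /proc/cmdline                             # should contain the parameters added above
cat /sys/module/nvme/parameters/poll_queues   # should report 4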

Schedulers

Make sure the schedulers for the NVMe devices are set to 'none' (or 'noop' on older kernels):

cat /sys/block/nvme*/queue/scheduler

If they are not, you can run the following Bash script:

for nvme_device in /sys/block/nvme*; do
    echo none > $nvme_device/queue/scheduler
done

This setting is not persistent, so you can add the loop to rc.local or a similar file. However, please ensure that it functions correctly: permissions for the rc.local file may need to be modified to allow it to execute after a system reboot.

Similarly, the following loop increases the I/O queue depth to 512 for NVMe devices by writing "512" to the /sys/block/<nvme_device>/queue/nr_requests file. Increasing the queue depth allows for better utilization of the device and can improve performance:

for nvme_device in /sys/block/nvme*; do
    echo 512 > $nvme_device/queue/nr_requests
done

As with the scheduler setting, this change is not persistent and can be added to rc.local or a similar file in the same way.
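As an alternative to rc.local, a udev rule applies both settings automatically whenever an NVMe namespace appears. A minimal sketch (the rule file name is arbitrary); reload the rules afterwards with "udevadm control --reload-rules && udevadm trigger":

# /etc/udev/rules.d/99-nvme-performance.rules
ACTION=="add|change", SUBSYSTEM=="block", KERNEL=="nvme*n*", ATTR{queue/scheduler}="none", ATTR{queue/nr_requests}="512"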

When using Flash drives connected via HBA, please refer to the HBA instructions for recommendations.

Other settings

Most of these settings are not persistent across reboots, so you can add the corresponding commands to rc.local or a similar file (a consolidated example is given after the list). However, please ensure that it functions correctly: permissions for the rc.local file may need to be modified to allow it to execute after a system reboot.

  1. Disabling Transparent Huge Pages (THP): The script sets the THP to "never" by writing "never" to the /sys/kernel/mm/transparent_hugepage/enabled file. This helps reduce latency.
    echo never > /sys/kernel/mm/transparent_hugepage/enabled
  2. Disabling THP defragmentation: The script sets THP defrag to "never" by writing "never" to the /sys/kernel/mm/transparent_hugepage/defrag file. This is another step to improve performance.
    echo never > /sys/kernel/mm/transparent_hugepage/defrag
  3. Disabling Kernel Same-Page Merging (KSM): The script disables KSM by writing "0" to the /sys/kernel/mm/ksm/run file. Disabling KSM reduces CPU usage by avoiding redundant memory sharing among processes.
    echo 0 > /sys/kernel/mm/ksm/run
  4. (Pattern-dependent) Setting the read-ahead value for NVMe devices: The script uses the blockdev command to set the read-ahead value to 65536 sectors (32 MiB) for NVMe devices. This setting controls how much data the system reads ahead from the storage device, which can help improve performance in some scenarios (mostly sequential reads).
    blockdev --setra 65536 /dev/nvme0n1
  5. Installing necessary packages: The script checks for the availability of package managers (dnf or apt) and installs the cpufrequtils and tuned packages if they are available.
  6. Setting CPU frequency governor: The script uses the cpupower frequency-set command to set the CPU frequency governor to "performance," which keeps the CPU running at the highest frequency. This can improve performance at the cost of increased power consumption and potential higher temperatures.
    cpupower frequency-set -g performance
  7. Displaying current CPU frequency governor: The script displays the current CPU frequency governor by reading the value from the /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor file.
    echo "Current CPU frequency governor: $(cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor)"
  8. Stopping the irqbalance service: The script stops the irqbalance service, which distributes interrupts across CPU cores to balance the load. Stopping irqbalance can be useful when trying to maximize performance in certain scenarios.
    systemctl stop irqbalance
    systemctl status irqbalance
  9. Setting the system profile to throughput-performance: The script uses the tuned-adm command to set the system profile to "throughput-performance", provided by the tuned package. This profile is designed to optimize the system for high throughput and performance.
    tuned-adm profile throughput-performance
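For reference, here is a sketch of how the non-persistent settings from the list above can be collected into a single rc.local-style script (paths and commands as used in the list; adjust to your distribution):

#!/bin/bash
# Transparent Huge Pages and THP defragmentation off
echo never > /sys/kernel/mm/transparent_hugepage/enabled
echo never > /sys/kernel/mm/transparent_hugepage/defrag
# Kernel same-page merging off
echo 0 > /sys/kernel/mm/ksm/run
# Scheduler, queue depth and read-ahead for every NVMe namespace
for nvme_device in /sys/block/nvme*; do
    echo none > $nvme_device/queue/scheduler
    echo 512 > $nvme_device/queue/nr_requests
    blockdev --setra 65536 /dev/$(basename $nvme_device)
done
# CPU governor, interrupt balancing and tuned profile
cpupower frequency-set -g performance
systemctl stop irqbalance
tuned-adm profile throughput-performance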

Drive Performance

After evaluating the system characteristics and determining the test objectives (as discussed in the previous blog post), it is necessary to proceed with the basic calculation of the expected performance. This calculation should be based on the hardware specifications and the conducted tests.

This should be done as follows:

  1. Find the manufacturer's specifications for the drives.
  2. Run tests on 1-3 separate drives.
  3. If the results obtained differ significantly from the characteristics stated in the specification, refer to the troubleshooting guide.
  4. Calculate the total expected performance of all drives intended for use in RAID.
  5. Run tests simultaneously on all drives intended for use in the RAID.
  6. If the results obtained differ significantly from the expected ones, refer to the troubleshooting guide.
  7. Calculate the expected performance of the array based on these results.

The calculations should be based on the manufacturers' specifications, which usually include the performance of 4k random reads and writes as well as mixed workloads (70% read, 30% write). They also provide the sequential read/write performance with a 128k block.

The total queue depth or the ratio of the queue depth to the number of threads is often specified.

To assess performance levels, you can use the fio utility. Here are some examples of fio configuration files:

Prepare the drive for the tests: overwrite the drive with a 128k block twice.

[global]
direct=1
bs=128k
ioengine=libaio
rw=write
iodepth=32
numjobs=1
loops=2
blockalign=128k
[drive]
filename=/dev/nvme1n1
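Save the job to a file and run it with fio; the file name here is just an example:

fio seq-precondition.fio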

Test the sequential workloads (run separately with rw=read and rw=write):

[global]
direct=1
bs=128k
ioengine=libaio
rw=read / write
iodepth=8
numjobs=2
norandommap
time_based=1
runtime=600
group_reporting
offset_increment=50%
gtod_reduce=1
[drive]
filename=/dev/nvme1n1

Prepare the drive for tests: overwrite the drive with a 4k block using a random pattern.

[global]
direct=1
bs=4k
ioengine=libaio
rw=randwrite
iodepth=128
numjobs=2
random_generator=tausworthe64
[drive]
filename=/dev/nvme1n1

Test random workload:

[global]
direct=1
bs=4k
ioengine=libaio
rw=randrw
rwmixread=0 / 50 / 70 / 100
iodepth=128
numjobs=2
norandommap
time_based=1
runtime=600
random_generator=tausworthe64
group_reporting
gtod_reduce=1
[drive]
filename=/dev/nvme1n1

After receiving the results for one drive, you should run tests on all drives simultaneously to calculate the total performance.

Prepare all drives for testing: overwrite each drive with a 128k block twice. Examples of fio configuration files:

[global]
direct=1
bs=128k
ioengine=libaio
rw=write
iodepth=32
numjobs=1
loops=2
blockalign=128k
[drive1]
filename=/dev/nvme1n1
[drive2]
filename=/dev/nvme2n1
[drive3]
filename=/dev/nvme3n1
. . .

Test the sequential workloads (run separately with rw=read and rw=write):

[global]
direct=1
bs=128k
ioengine=libaio
rw=read / write
iodepth=8
numjobs=2
norandommap
time_based=1
runtime=600
group_reporting
offset_increment=50%
gtod_reduce=1
[drive1]
filename=/dev/nvme1n1
[drive2]
filename=/dev/nvme2n1
[drive3]
filename=/dev/nvme3n1
. . .

Prepare all drives for testing: overwrite each drive with a 4k block using a random pattern.

[global]
direct=1
bs=4k
ioengine=libaio
rw=randwrite
iodepth=128
numjobs=2
random_generator=tausworthe64
[drive1]
filename=/dev/nvme1n1
[drive2]
filename=/dev/nvme2n1
[drive3]
filename=/dev/nvme3n1
. . .

Test random workload:

[global]
direct=1
bs=4k
ioengine=libaio
rw=randrw
rwmixread=0 / 50 / 70 / 100
iodepth=128
numjobs=2
norandommap
time_based=1
runtime=600
random_generator=tausworthe64
group_reporting
gtod_reduce=1
[drive1]
filename=/dev/nvme1n1
[drive2]
filename=/dev/nvme2n1
[drive3]
filename=/dev/nvme3n1
. . .

Ideally, the test results for all drives should be equal to the results for one drive multiplied by the number of drives (in terms of GBps for sequential tests and IOPS for random tests). However, due to platform limitations, these results are not always achieved across all patterns. These limitations should be taken into account further on.

To ensure accurate results, it is recommended to repeat sequential tests using a block size that corresponds to the future array strip size (unless it is already equal to the specified block size used in the sequential tests). This is important because drives may show varying performance based on the block size being used.

Additionally, the load level, determined by the number of threads and the queue depth, should correspond to both the manufacturer's specifications and the expected load on the RAID.
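For example, if a drive is specified for 4k random reads at a total queue depth of 256, this load can be generated with numjobs=2 and iodepth=128 (as in the job files above) or with numjobs=4 and iodepth=64; what matters is that the product of the two matches the specified queue depth.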

To calculate the expected RAID performance, use the reference values below (Q, T, and N are defined after the lists).

Expected performance when testing 1 drive (queue depth Q, T threads) and all drives simultaneously (queue depth Q, T*N threads):

  • 4k random read: 1 drive: corresponds to the drive specification (drive_performance); all drives: drive_performance*N.
  • 4k random write: 1 drive: corresponds to the drive specification (drive_performance); all drives: drive_performance*N.
  • 4k random read/write 50/50: 1 drive: close to double the write performance, or corresponds to the specification (if provided); all drives: drive_performance*N.
  • 128k sequential read: 1 drive: corresponds to the drive specification (drive_performance); all drives: drive_performance*N.
  • 128k sequential write: 1 drive: corresponds to the drive specification (drive_performance); all drives: drive_performance*N.

Maximum expected RAID performance (queue depth Q, T*N threads), in the normal state and with a failed drive (as a percentage of the normal performance):

  • 4k random read: normal state: the combined performance of the drives; with a failed drive: 50% for RAID 5, 75% for RAID 50 with 2 groups, 87,5% for RAID 50 with 4 groups.
  • 4k random write: normal state: depends strongly on the characteristics of the drives, in most cases from drive_performance*N/5 to drive_performance*N/3 for RAID 5 and RAID 6; with a failed drive: typically around 80-85% of the normal RAID 5 performance (depends on the number of drives in the array).
  • 4k random read/write 50/50: normal state: drive_performance*N/2 for RAID 5 and RAID 6; with a failed drive: 80-85% for RAID 5 (depends on the number of drives in the array).
  • 128k sequential read: normal state: drive_performance*N; with a failed drive: up to 100% of the normal RAID 5 performance (depends on the hardware configuration and the performance of the processors).
  • 128k sequential write: normal state: drive_performance*N; with a failed drive: up to 100% of the normal RAID 5 performance (depends on the hardware configuration and the performance of the processors).

T is the number of threads required for the test as given in the specification for the drive.

Q is the queue depth required for the test as given in the specification for the drive.

N is the number of drives used in the tests and in RAID creation.

These calculations are theoretical and give an estimate of the maximum achievable performance of the array.
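As an illustration with hypothetical numbers: for N = 10 drives, each specified at 1 million IOPS of 4k random read and 200 thousand IOPS of 4k random write, the expected maximum would be about 10 million IOPS for RAID random read (drive_performance*N) and, for RAID 5 or RAID 6 random write, roughly 400-670 thousand IOPS (drive_performance*N/5 to drive_performance*N/3).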

In reality, system performance can be influenced both by external factors (such as the operation of the internal components of the storage devices and temperature conditions) and by internal factors (like the need to perform calculations and temporarily store data in memory). Therefore, we consider it normal for approximately 90-95% of the calculated performance to be achieved on the array.