General Configuration Recommendations
RAID Creation
The following recommendations are general; the optimal values depend on drive parameters and vendors.
- The appropriate RAID level depends on the required availability level:
  - An availability level as high as 99.999% can be achieved with RAID 6 if the RAID consists of fewer than 20 drives; with 20 or more drives, use RAID 7.3.
  - An availability level as high as 99.999% can be achieved with RAID 50 if the RAID consists of fewer than 16 drives; with more drives, use RAID 60 or RAID 70.
- The recommended stripe size for the xiRAID RAID is 16 KiB (set by default).
Using NVMe-oF Devices to Create a RAID
- Xinnor xiRAID allows using NVMe-oF devices to create a RAID. When using such devices, set the --ctrl-loss-tmo parameter to 0 to prevent commands from hanging after a connection loss. This applies to nvme-cli version 1.4 and later.
# nvme connect -t rdma -n nqn.Xinnor12_1 -a 10.30.0.12 -s 4420 --ctrl-loss-tmo=0
- When creating an NVMe-oF target for a xiRAID RAID, you can enable Merge if the access pattern is expected to be sequential write.
Depending on the Linux kernel or Mellanox driver version, NVMe-oF targets may split large requests into 32 KiB plus the remainder. This behavior leads to constant read-modify-write operations. For an SPDK NVMe-oF target, set the InCapsuleDataSize parameter, which defines the size at which requests are split.
RAID and System Setup Recommendations
--init_prio
Syndrome RAID creation starts the initialization process automatically. During initialization, the RAID is available for read and write operations. The initialization priority is set to 100 by default, so you can either wait until initialization finishes or, if the access pattern is not random write, lower the initialization priority. User I/O will then be processed faster because fewer initialization requests are created. If the initialization priority is set to 0, initialization requests are not created during user I/O.
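For illustration, lowering the priority could look like the following sketch. The --init_prio flag name is taken from this section, but its placement in the modify subcommand is an assumption based on the xicli raid modify syntax shown elsewhere in this guide; verify against your xicli version.

```shell
# Hypothetical sketch: lower initialization priority so user I/O is favored.
# The modify subcommand syntax is assumed; check it against your xicli version.
xicli raid modify -n <raid_name> --init_prio 0
```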
--recon_prio
The reconstruction process starts automatically. By default, the reconstruction priority equals 100, which means reconstruction has the maximum priority among other processes. Setting the priority to 0 allows user I/O to be processed before the reconstruction process.
--restripe_prio
The modify command enables changing the restriping priority. If the priority value is zero, restriping starts and continues only when there is no workload. By default, the priority is set to 100, which stands for the highest possible rate of the restriping process. To improve system workload performance, try decreasing the restriping priority.
--sched_enabled
There are 2 possible ways of handling an incoming request:
- continue execution on the current CPU;
- transfer the request to another CPU core and continue execution there. Note that the transfer itself takes time.
If the access pattern uses less than half of the system CPUs, it is efficient to use the --sched_enabled parameter. When many requests are processed by a single CPU core, enabling scheduling redistributes the workload equally among all system CPUs. On multithreaded access patterns, scheduling is inefficient, because needlessly transferring requests from one CPU core to another wastes time.
Enable Scheduling when the access pattern is low threaded.
--merge_write_enabled
The --merge_write_enabled parameter improves system workload performance when the access pattern is sequential and highly threaded, and the block sizes are small. This parameter sets a waiting time for all incoming requests in sequential areas. During the waiting time, requests to such an area are intentionally not transferred to the drives. Instead of being sent immediately, incoming requests are collected into a tree structure and, at the end of the waiting time, merged together where possible. This reduces the number of read-modify-write operations on syndrome RAIDs, so despite the extra waiting time, the function can improve system workload performance. If the access pattern is mostly random or the queue depth is small, the waiting time will not allow requests to be merged; in that case, enabling --merge_write_enabled will decrease system workload performance.
Enable Merge with the --merge_write_enabled parameter when the access pattern is sequential and highly threaded, and the block sizes are small.
Since the time between incoming I/O requests depends on the workload intensity, size, and other parameters, it may be necessary to change the --merge_wait and --merge_max parameters for better request merging. Usually, large I/O sizes require large values for these parameters.
The function only works when the following condition is met:
data_drives * stripe_size ≤ 1024
where
- “data_drives” is the number of drives in the RAID (for RAID 5, 6, 7.3, or N+M) or in one RAID group (for RAID 50, 60, or 70) that are dedicated to data;
- “stripe_size” is the stripe size selected for the RAID (the stripe_size value) in KiB.
The “data_drives” value depending on the RAID level:

RAID level | Value of data_drives
---|---
RAID 5 | Number of RAID drives minus 1
RAID 6 | Number of RAID drives minus 2
RAID 7.3 | Number of RAID drives minus 3
RAID N+M | Number of RAID drives minus M
RAID 50 | Number of drives in one RAID group minus 1
RAID 60 | Number of drives in one RAID group minus 2
RAID 70 | Number of drives in one RAID group minus 3
Deactivate Merge when the queue depth of the user's workload is not enough to merge a full stripe. Activate Merge if
iodepth * block_size >= data_drives * stripe_size
where “block_size” is the block size of the RAID (the block_size value in the RAID parameters) in KiB.
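As a sketch, the two conditions can be checked with simple shell arithmetic. The drive count, stripe size, and workload figures below are illustrative assumptions, not recommended values:

```shell
#!/bin/sh
# Illustrative values (assumptions): a RAID 6 of 10 drives with 16 KiB stripes,
# and a workload with queue depth 32 and 128 KiB blocks.
drives=10
data_drives=$((drives - 2))    # RAID 6: number of RAID drives minus 2
stripe_size=16                 # KiB
iodepth=32
block_size=128                 # KiB

# Merge only works when data_drives * stripe_size <= 1024
if [ $((data_drives * stripe_size)) -le 1024 ]; then
    echo "Merge condition met: $((data_drives * stripe_size)) <= 1024"
fi

# Activate Merge only if the workload queue depth can fill a full stripe
if [ $((iodepth * block_size)) -ge $((data_drives * stripe_size)) ]; then
    echo "Workload deep enough to merge a full stripe"
fi
```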
--request_limit
This parameter limits the number of incoming requests per RAID; this is useful, for example, when writing files through a file system with buffered (unsynchronized) writes.
To improve system workload performance, we recommend enabling the limit on the number of incoming requests with the --request_limit parameter when you are working with a file system and buffered writing is performed.
--force_online
If a RAID has unrecoverable sections, the RAID becomes unreadable (it goes into the offline, unrecoverable state). To try to read the available data, manually turn on the online mode for the RAID by running the command
# xicli raid modify -n <raid_name> --force_online
While in this mode, I/O operations on unrecoverable sections of the RAID may lead to data corruption.
--resync_enabled
The function starts a RAID re-initialization after an unsafe system shutdown, thereby protecting syndrome RAIDs (all levels except RAID 0, RAID 1, and RAID 10) from data loss caused by a write hole.
To disable resync for all RAIDs, run
# modprobe xiraid resync=0
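To make this setting persistent across reboots, the standard modprobe.d mechanism can be used. The file name below is a conventional choice, not one mandated by xiRAID:

```shell
# Persist the module option via modprobe.d (the file name is a conventional
# choice; any *.conf file in /etc/modprobe.d/ is read by modprobe).
echo "options xiraid resync=0" > /etc/modprobe.d/xiraid.conf
```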
Strip Size
Recommended RAID strip size is 16 KiB.
RAM Limit
Current memory usage is monitored and controlled to stay within the limit. You can modify the --memory-limit parameter at any time. By default, memory usage is unlimited.
Deactivating monitoring of current memory usage and limitation control can improve system workload performance. Set --memory-limit to 0 to deactivate monitoring with the modify command.
If it is necessary to limit RAM usage, we recommend choosing the amount of RAM depending on the strip size selected for the RAID:
Strip size, KiB | Amount of RAM, MiB
---|---
16 | 2048
32 | 2048
64 | 4096
128 | 8192
256 | 16384
NUMA
- Create a RAID out of drives belonging to the same NUMA node if your system is multiprocessor.
To find out the NUMA node of a drive, run:
# cat /sys/block/nvme0n1/device/device/numa_node
or via lspci:
# lspci -vvv
- When creating an NVMe-oF target for a xiRAID RAID, you can use a network adapter on the same NUMA node as the NVMe drives.
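To survey all NVMe drives at once, the per-drive check above can be wrapped in a loop. This is a sketch that assumes the standard sysfs layout and prints nothing on systems without NVMe devices:

```shell
#!/bin/sh
# Print the NUMA node of every NVMe block device (standard sysfs layout assumed).
# A value of -1 means the platform reports no NUMA affinity for the device.
for dev in /sys/block/nvme*; do
    [ -e "$dev" ] || continue
    echo "$(basename "$dev"): NUMA node $(cat "$dev/device/device/numa_node")"
done
```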
System
- xiRAID shows better performance with hyper-threading (HT) enabled.
To find out whether the CPU supports HT, run
# cat /proc/cpuinfo | grep ht
In the flags field, check for the ht flag.
Command output example:
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology cpuid tsc_known_freq pni pclmulqdq vmx ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single pti tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm rdseed adx smap xsaveopt arat umip arch_capabilities
To check if HT is enabled, run
# lscpu
If Thread(s) per core is 1, then HT is off. HT can be enabled in BIOS/UEFI.
Command output example:
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
Address sizes:       40 bits physical, 48 bits virtual
CPU(s):              4
On-line CPU(s) list: 0-3
Thread(s) per core:  1
…
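The check can also be scripted; this sketch parses the lscpu output shown above:

```shell
#!/bin/sh
# Report whether SMT/Hyper-Threading is active by parsing lscpu output.
tpc=$(lscpu | awk -F: '/^Thread\(s\) per core/ { gsub(/ /, "", $2); print $2 }')
if [ "${tpc:-1}" -gt 1 ]; then
    echo "HT enabled: $tpc threads per core"
else
    echo "HT disabled or unsupported; check BIOS/UEFI settings"
fi
```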
- The tuned-adm profile set to throughput-performance provides better performance in most tests:
# tuned-adm profile throughput-performance
Workload
In Xinnor xiRAID 4.0.1, user I/O tends to be executed on the same CPU from which the user sent it. However, for some access patterns, you can transfer I/O commands to other CPUs so that the commands do not idle. You can enable I/O scheduling across all system CPUs using the --sched_enabled parameter (1 – activated, 0 – deactivated).
Recommendations on activating and deactivating the Scheduling function depending on the access pattern are provided in the --sched_enabled section.
Swap File
On high-load servers, we recommend disabling swap file usage to increase server performance.
File System Mounting Aspects
Since the system restores xiRAID RAIDs only after the appropriate Linux kernel is loaded and a RAID restore command is issued, use one of the following approaches to automatically mount file systems located on these RAIDs at system startup.
To set up automatic mounting at system startup, we recommend using systemd.mount.
systemd.mount
When setting up automatic mounting at system startup via systemd, put the following strings in the [Unit] section:
- Requires = xiraid-restore.service
- After = xiraid-restore.service
Example: mounting xfs located on a RAID /dev/xi_raidname into /mnt/raid/ through systemd.mount:
- Set a timeout of 5 minutes for the xiRAID device in the unit file:
  - Run
    # systemctl edit --force --full /dev/xi_raidname
  - Add the following lines:
    [Unit]
    JobRunningTimeoutSec=5m
    Save the changes.
  - Check the changes:
    # systemctl cat /dev/xi_raidname
- Create a file at /etc/systemd/system/ with the mount options.
The file name must match the path of the mount directory with "/" replaced by "-" (for example, for /mnt/raid the file name will be "mnt-raid.mount").
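Rather than constructing the unit name by hand, systemd-escape can generate it; for the /mnt/raid example:

```shell
# systemd-escape converts a mount point path into the unit file name
# systemd expects for a .mount unit.
systemd-escape --path --suffix=mount /mnt/raid
# prints: mnt-raid.mount
```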
The example file /etc/systemd/system/mnt-raid.mount
[Unit]
Description=Mount filesystem on Xinnor xiRAID
Requires=xiraid-restore.service
After=xiraid-restore.service
DefaultDependencies=no
Before=umount.target
Conflicts=umount.target
[Mount]
What=/dev/xi_raidname
Where=/mnt/raid/
Options=defaults
Type=xfs
[Install]
WantedBy=multi-user.target
- Run the command
# systemctl daemon-reload
Enable automatic mounting at system startup:
# systemctl enable mnt-raid.mount
Start the service to mount the file system:
# systemctl start mnt-raid.mount
/etc/fstab
When setting up automatic mounting at system startup via /etc/fstab, specify one of the following sets of options:
- x-systemd.requires=xiraid-restore.service,x-systemd.device-timeout=5m,_netdev
- x-systemd.requires=xiraid-restore.service,x-systemd.device-timeout=5m,nofail
The parameter “x-systemd.requires”
The value "xiraid-restore.service" for this parameter sets a strict dependency of device mounting on service execution.
The parameter “x-systemd.device-timeout”
The parameter x-systemd.device-timeout= configures how long systemd should wait for a device to show up before giving up on an entry from /etc/fstab. Specify a time in seconds or explicitly append a unit such as "s", "min", "h", "ms".
Note that this option can only be used in /etc/fstab, and will be ignored when part of the Options= setting in a unit file.
The value “_netdev”
The value _netdev sets that the filesystem resides on a device that requires network access (used to prevent the system from attempting to mount these filesystems until the network has been enabled on the system).
The value “nofail”
If the device is not permanently connected and may not be present when the system starts, mount it with the value nofail. This prevents errors when mounting such a device.
Example: mounting xfs located on a RAID /dev/xi_raidname into /mnt/raid/ through /etc/fstab with the _netdev option:
The string from the file /etc/fstab
/dev/xi_raidname /mnt/raid/ xfs x-systemd.requires=xiraid-restore.service,x-systemd.device-timeout=5m,_netdev 0 0
Example: mounting xfs located on a RAID /dev/xi_raidname into /mnt/raid/ through /etc/fstab with the nofail option:
The string from the file /etc/fstab
/dev/xi_raidname /mnt/raid/ xfs x-systemd.requires=xiraid-restore.service,x-systemd.device-timeout=5m,nofail 0 0