General Configuration Recommendations

RAID Creation

The following recommendations are general guidelines; the optimal values depend on the drives’ parameters and vendors.

  • The appropriate RAID level depends on the required availability level.

    A level of availability as high as 99.999% can be achieved by using RAID 6 if the RAID consists of fewer than 20 drives. With more drives, use RAID 7.3.

    A level of availability as high as 99.999% can be achieved by using RAID 50 if the RAID consists of fewer than 16 drives. With more drives, use RAID 60 or RAID 70.

  • The recommended stripe size for a xiRAID RAID is 16 KiB (the default).

Using NVMe-oF Devices to Create a RAID

  1. Xinnor xiRAID allows using NVMe-oF devices to create a RAID. When using such devices, set the --ctrl-loss-tmo parameter to 0 to prevent commands from freezing after a connection loss. This applies to nvme-cli version 1.4 and later.

    # nvme connect -t rdma -n nqn.Xinnor12_1 -a 10.30.0.12 -s 4420 --ctrl-loss-tmo=0
  2. When creating an NVMe-oF target for a xiRAID RAID, you can enable Merge if the access pattern is expected to be sequential write.

    Depending on the version of the Linux kernel or the Mellanox drivers, NVMe-oF targets may split large requests into 32 KiB plus the remainder. This behavior leads to constant read-modify-write operations. For an SPDK NVMe-oF target, set the InCapsuleDataSize parameter, which defines the size at which requests are split.
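
    For example, in the legacy INI-style SPDK NVMe-oF target configuration, the parameter is set in the [Transport] section (newer SPDK releases pass the same value when creating the transport via RPC; the value 8192 below is only an illustration, choose it according to your workload):

    [Transport]
      Type RDMA
      InCapsuleDataSize 8192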

RAID and System Setup Recommendations

--init_prio

Syndrome RAID creation starts the initialization process automatically. During initialization, the RAID is available for read and write operations. Since the initialization priority is set to 100 by default, you can either wait until the initialization finishes or, if the access pattern is not random write, lower the initialization priority. User I/O will then be processed faster because fewer initialization requests are generated. If the initialization priority is set to 0, initialization requests are not created while user I/O is being processed.
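
For example, assuming the initialization priority can be changed with the modify command (the exact option syntax may differ between xiRAID versions), initialization requests can be suppressed during user I/O with:

# xicli raid modify -n <raid_name> --init_prio 0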

--recon_prio

The reconstruction process starts automatically. By default, the reconstruction priority is 100, which means reconstruction has the maximum priority among other processes. Setting the priority to 0 allows user I/O to be processed before the reconstruction process.
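
A similar example for the reconstruction priority (again, the option syntax is an assumption):

# xicli raid modify -n <raid_name> --recon_prio 0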

--restripe_prio

The modify command allows changing the restriping priority. If the priority is set to 0, restriping starts and continues only when there is no workload. By default, the priority is set to 100, which corresponds to the highest possible rate of the restriping process. To improve system workload performance, try decreasing the restriping priority.
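
For example, to run restriping only when the RAID is idle (option syntax as assumed above):

# xicli raid modify -n <raid_name> --restripe_prio 0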

--sched_enabled

There are 2 possible ways of handling an incoming request:

  • continue execution on the current CPU;
  • transfer the request to another CPU core and continue execution there. Note that the transfer itself takes time.

If the access pattern uses less than half of the system CPUs, it is efficient to enable the --sched_enabled parameter. When many requests are processed by a single CPU core, enabling scheduling redistributes the workload evenly among all system CPUs. For multithreaded access patterns, scheduling is inefficient because the unnecessary transfer of requests from one CPU core to another wastes time.

Tip:

Enable Scheduling when the access pattern is low threaded.
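
For example, to enable Scheduling on an existing RAID, using the 1/0 values described in the Workload section below (the exact option form is an assumption):

# xicli raid modify -n <raid_name> --sched_enabled 1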

--merge_write_enabled

The --merge_write_enabled parameter improves system workload performance when the access pattern is sequential and high threaded and the block sizes are small. This parameter sets a waiting time for all incoming requests in sequential areas. During the waiting time, requests to such an area are intentionally not transferred to the drives. Instead of immediate data transfer, incoming requests are collected into a tree structure. At the end of the waiting time, the requests are merged together where possible. This function reduces the number of read-modify-write operations on syndrome RAIDs, so despite the extra waiting time it can improve system workload performance. If the access pattern is mostly random, or the queue depth is small, the waiting time will not allow requests to be merged; in this case, enabling --merge_write_enabled will decrease system workload performance.

Tip:

Enable Merge with the --merge_write_enabled parameter when the access pattern is sequential and high threaded and the block sizes are small.

Since the time between incoming I/O requests depends on the workload intensity, request size, and other parameters, it may be necessary to adjust the --merge_wait and --merge_max parameters for better request consolidation. Usually, large I/O sizes require larger values for these parameters.

The function only works when the condition is met:

data_drives * stripe_size ≤ 1024

where

  • “data_drives” is the number of drives in the RAID (for RAIDs 5, 6, 7.3, or N+M) or in one RAID group (for RAIDs 50, 60, or 70) that are dedicated to data;

  • “stripe_size” is the selected stripe size for the RAID (the stripe_size value) in KiB.

The “data_drives” value depending on a RAID level:

RAID level    Value of data_drives
RAID 5        Number of RAID drives minus 1
RAID 6        Number of RAID drives minus 2
RAID 7.3      Number of RAID drives minus 3
RAID N+M      Number of RAID drives minus M
RAID 50       Number of drives in one RAID group minus 1
RAID 60       Number of drives in one RAID group minus 2
RAID 70       Number of drives in one RAID group minus 3

Deactivate Merge when the queue depth of the user's workload is not enough to merge a full stripe. Activate Merge if

iodepth * block_size ≥ data_drives * stripe_size

where “block_size” is the block size of the RAID (the block_size value in the RAID parameters) in KiB.
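
For example, for a RAID 6 of 10 drives with the default 16 KiB stripe size, data_drives = 10 − 2 = 8 and 8 * 16 = 128 ≤ 1024, so Merge can be used. With a 4 KiB block size, Merge is worth activating once iodepth * 4 ≥ 128, that is, at a queue depth of 32 or more. In that case, Merge could be enabled with a command like the following (the exact option syntax may differ between xiRAID versions):

# xicli raid modify -n <raid_name> --merge_write_enabled 1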

--request_limit

This parameter limits the number of incoming requests to the RAID. It is useful, for example, when writing files through a file system without synchronization (buffered writes).

Tip:

To improve system workload performance, we recommend enabling the limit on the number of incoming requests with the --request_limit parameter when you are working with a file system and buffered writing is performed.
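
For example (both the option form and whether it takes a numeric limit are assumptions; consult the xicli documentation for your version):

# xicli raid modify -n <raid_name> --request_limit <value>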

--force_online

If a RAID has unrecoverable sections, the RAID becomes unreadable (it goes into the offline, unrecoverable state). To try to read the available data, manually turn on the online mode for the RAID by running the command

# xicli raid modify -n <raid_name> --force_online

While in this mode, I/O operations on unrecoverable sections of the RAID may lead to data corruption.

--resync_enabled

The function starts RAID re-initialization after an unsafe system shutdown, thereby protecting syndrome RAIDs (all levels except RAID 0, RAID 1, and RAID 10) from data loss caused by a write hole.

To disable resync for all RAIDs, run

# modprobe xiraid resync=0

Strip Size

Recommended RAID strip size is 16 KiB.

RAM Limit

Current memory usage is monitored and kept within the limit. You can modify the --memory-limit parameter at any time. By default, memory usage is unlimited.

Deactivating the monitoring of current memory usage and the limit control can improve system workload performance. To deactivate monitoring, set --memory-limit to 0 with the modify command.

If it is necessary to limit the use of RAM, we recommend choosing the amount of RAM depending on the selected strip size for the RAID (see the example after the table):

Strip size, in KiB    Amount of RAM, in MiB
16                    2048
32                    2048
64                    4096
128                   8192
256                   16384
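
For example, to limit memory usage for a RAID created with a 64 KiB strip size, or to disable the limit entirely (the value is assumed to be in MiB, as in the table above, and the option is assumed to be accepted by the modify command):

# xicli raid modify -n <raid_name> --memory-limit 4096
# xicli raid modify -n <raid_name> --memory-limit 0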

NUMA

  1. On multiprocessor systems, create a RAID out of drives belonging to the same NUMA node.

    To find out which NUMA node a drive belongs to, run the following (see also the loop example after this list):

    # cat /sys/block/nvme0n1/device/device/numa_node

    or via lspci:

    # lspci -vvv
  2. When creating an NVMe-oF target for a xiRAID RAID, you can use a network adapter belonging to the same NUMA node as the NVMe drives.
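
To list the NUMA node of every NVMe drive at once, you can wrap the sysfs query from step 1 in a shell loop (adjust the glob if your namespaces are not numbered n1):

# for d in /sys/block/nvme*n1; do echo "$d: $(cat $d/device/device/numa_node)"; done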

System

  1. xiRAID shows better performance with hyper-threading (HT) enabled.

    To find out whether the CPU supports HT, run

    # cat /proc/cpuinfo | grep ht

    In the flags field, check for the ht flag.

    Command output example:

    flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology cpuid tsc_known_freq pni pclmulqdq vmx ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single pti tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm rdseed adx smap xsaveopt arat umip arch_capabilities

    To check if HT is enabled, run

    # lscpu

    If Thread(s) per core is 1, then HT is off. HT can be enabled in BIOS/UEFI.

    Command output example:

    Architecture:                    x86_64
    CPU op-mode(s):                  32-bit, 64-bit
    Byte Order:                      Little Endian
    Address sizes:                   40 bits physical, 48 bits virtual
    CPU(s):                          4
    On-line CPU(s) list:             0-3
    Thread(s) per core:              1
    …
  2. Setting the tuned-adm profile to throughput-performance provides better performance in most tests:

    # tuned-adm profile throughput-performance

Workload

In Xinnor xiRAID 4.0.1, user I/O tends to be executed on the same CPU from which the user issued it. However, for some access patterns, you can transfer I/O commands to other CPUs so that the commands do not idle. You can enable I/O Scheduling across all system CPUs using the --sched_enabled parameter (1 – activated, 0 – deactivated).

Recommendations on activating and deactivating the Scheduling function depending on the access pattern are provided in the --sched_enabled section.

Swap File

On highly loaded servers, we recommend disabling swap usage to increase server performance.
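
For example, to disable swap until the next reboot (remove or comment out the swap entries in /etc/fstab to make the change persistent):

# swapoff -a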

File System Mounting Aspects

Since xiRAID RAIDs are restored only after the appropriate kernel module is loaded and the RAID restore command is executed, use one of the following methods to automatically mount file systems located on these RAIDs at system startup.

Tip:

To set up automatic mounting at system startup, we recommend using systemd.mount.

systemd.mount

When setting up automatic mounting at system startup via systemd, add the following lines to the [Unit] section:

  • Requires = xiraid-restore.service
  • After = xiraid-restore.service

Example: mounting xfs located on a RAID /dev/xi_raidname into /mnt/raid/ through systemd.mount:

  1. Set a timeout of 5 minutes for the xiRAID device in the unit file:

    1. Run

      # systemctl edit --force --full /dev/xi_raidname
    2. Add the following lines:

      [Unit]
      JobRunningTimeoutSec=5m

      Save the changes.

    3. Check the changes:

      # systemctl cat /dev/xi_raidname
  2. Create a file at /etc/systemd/system/ with the mount options.

    The file name must match the path of the mount directory with "/" replaced by "-" (for example, for /mnt/raid the file name will be "mnt-raid.mount").

    The example file /etc/systemd/system/mnt-raid.mount

    [Unit]
    Description=Mount filesystem on Xinnor xiRAID
    Requires=xiraid-restore.service
    After=xiraid-restore.service
    DefaultDependencies=no
    Before=umount.target
    Conflicts=umount.target

    [Mount]
    What=/dev/xi_raidname
    Where=/mnt/raid/
    Options=defaults
    Type=xfs

    [Install]
    WantedBy=multi-user.target

  3. Run the command

    # systemctl daemon-reload

    Enable automatic mounting at system startup:

    # systemctl enable mnt-raid.mount

    Start the service to mount the file system:

    # systemctl start mnt-raid.mount

/etc/fstab

When setting up automatic mounting at system startup via /etc/fstab, specify one of the following sets of options:

  • x-systemd.requires=xiraid-restore.service,x-systemd.device-timeout=5m,_netdev
  • x-systemd.requires=xiraid-restore.service,x-systemd.device-timeout=5m,nofail

The parameter “x-systemd.requires”

The value "xiraid-restore.service" for this parameter sets a strict dependency of device mounting on service execution.

The parameter “x-systemd.device-timeout”

The parameter x-systemd.device-timeout= configures how long systemd should wait for a device to show up before giving up on an entry from /etc/fstab. Specify a time in seconds or explicitly append a unit such as "s", "min", "h", "ms".

Note that this option can only be used in /etc/fstab, and will be ignored when part of the Options= setting in a unit file.

The value “_netdev”

The value _netdev indicates that the file system resides on a device that requires network access (it is used to prevent the system from attempting to mount the file system until the network has been enabled).

The value “nofail”

If the device is not permanently connected and may not be present when the system starts, mount it with the nofail value. This prevents errors when mounting such a device.

Example: mounting xfs located on a RAID /dev/xi_raidname into /mnt/raid/ through /etc/fstab with the _netdev option:

The line from the file /etc/fstab

/dev/xi_raidname    /mnt/raid/   xfs    x-systemd.requires=xiraid-restore.service,x-systemd.device-timeout=5m,_netdev  0   0

Example: mounting xfs located on a RAID /dev/xi_raidname into /mnt/raid/ through /etc/fstab with the nofail option:

The line from the file /etc/fstab

/dev/xi_raidname    /mnt/raid/   xfs    x-systemd.requires=xiraid-restore.service,x-systemd.device-timeout=5m,nofail  0   0