Building a High-Performance, Highly Available Lustre Solution with xiRAID Classic 4.1 on a Dual-Node System with Shared NVMe Drives

September 18, 2024


This comprehensive guide demonstrates how to create a robust, high-performance Lustre file system using xiRAID Classic 4.1 and Pacemaker on an SBB platform. We'll walk through the entire process, from system layout and hardware configuration to software installation, cluster setup, and performance tuning. By leveraging dual-ported NVMe drives and advanced clustering techniques, we'll achieve a highly available storage solution capable of delivering impressive read and write speeds. Whether you're building a new Lustre installation or looking to expand an existing one, this article provides a detailed roadmap for creating a cutting-edge, fault-tolerant parallel file system suitable for demanding high-performance computing environments.


System layout

xiRAID Classic 4.1 supports integrating its RAIDs into Pacemaker-based HA clusters. This allows users who need to cluster their services to benefit from xiRAID Classic's performance and reliability.

This article describes using an NVMe SBB system (a single box with two x86-64 servers and a set of shared NVMe drives) as a basic Lustre parallel filesystem HA-cluster with data placed on clustered RAIDs based on xiRAID Classic 4.1.

This article will familiarize you with how to deploy xiRAID Classic for a real-life task.

Lustre server SBB Platform

We will use Viking VDS2249R as the SBB platform. The configuration details are presented in the table below.

Viking VDS2249R

                        node0                                        node1
Hostname                node26                                       node27
CPU                     AMD EPYC 7713P 64-Core                       AMD EPYC 7713P 64-Core
Memory                  256GB                                        256GB
OS drives               2 x Samsung SSD 970 EVO Plus 250GB mirrored  2 x Samsung SSD 970 EVO Plus 250GB mirrored
OS                      Rocky Linux 8.9                              Rocky Linux 8.9
IPMI address            192.168.64.106                               192.168.67.23
IPMI login              admin                                        admin
IPMI password           admin                                        admin
Management NIC          enp194s0f0: 192.168.65.26/24                 enp194s0f0: 192.168.65.27
Cluster Heartbeat NIC   enp194s0f1: 10.10.10.1                       enp194s0f1: 10.10.10.2
InfiniBand LNET HDR     ib0: 100.100.100.26                          ib0: 100.100.100.27
                        ib3: 100.100.100.126                         ib3: 100.100.100.127
NVMes                   24 x Kioxia CM6-R 3.84TB KCM61RUL3T84

System configuration and tuning

Before software installation and configuration, we need to prepare the platform to provide optimal performance.

Performance tuning

Apply the accelerator-performance tuned profile on both nodes:

tuned-adm profile accelerator-performance
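
To confirm that the profile is active and its settings are applied, you can run the standard tuned-adm checks on both nodes (an optional sanity check):

# tuned-adm active
# tuned-adm verify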

Network configuration

Check that all IP addresses are resolvable from both hosts. In our case, we resolve names via the hosts file, so we have the following content in /etc/hosts on both nodes:

127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
::1         localhost localhost.localdomain localhost6 localhost6.localdomain6
192.168.65.26 node26
192.168.65.27 node27
10.10.10.1 node26-ic
10.10.10.2 node27-ic
192.168.64.50 node26-ipmi
192.168.64.76 node27-ipmi
100.100.100.26 node26-ib
100.100.100.27 node27-ib

Policy-based routing setup

We use a multirail configuration on the servers: two InfiniBand interfaces on each server are configured in the same IPv4 network. To make the Linux IP stack work properly in this configuration, we need to set up policy-based routing for these interfaces on both servers.

node26 setup:

node26# nmcli connection modify ib0 ipv4.route-metric 100
node26# nmcli connection modify ib3 ipv4.route-metric 101
node26# nmcli connection modify ib0 ipv4.routes "100.100.100.0/24 src=100.100.100.26 table=100"
node26# nmcli connection modify ib0 ipv4.routing-rules "priority 101 from 100.100.100.26 table 100"
node26# nmcli connection modify ib3 ipv4.routes "100.100.100.0/24 src=100.100.100.126 table=200"
node26# nmcli connection modify ib3 ipv4.routing-rules "priority 102 from 100.100.100.126 table 200"
node26# nmcli connection up ib0
Connection successfully activated (D-Bus active path: /org/freedesktop/NetworkManager/ActiveConnection/6)
node26# nmcli connection up ib3
Connection successfully activated (D-Bus active path: /org/freedesktop/NetworkManager/ActiveConnection/7)

node27 setup:

node27# nmcli connection modify ib0 ipv4.route-metric 100
node27# nmcli connection modify ib3 ipv4.route-metric 101
node27# nmcli connection modify ib0 ipv4.routes "100.100.100.0/24 src=100.100.100.27 table=100"
node27# nmcli connection modify ib0 ipv4.routing-rules "priority 101 from 100.100.100.27 table 100"
node27# nmcli connection modify ib3 ipv4.routes "100.100.100.0/24 src=100.100.100.127 table=200"
node27# nmcli connection modify ib3 ipv4.routing-rules "priority 102 from 100.100.100.127 table 200"
node27# nmcli connection up ib0
Connection successfully activated (D-Bus active path: /org/freedesktop/NetworkManager/ActiveConnection/6)
node27# nmcli connection up ib3
Connection successfully activated (D-Bus active path: /org/freedesktop/NetworkManager/ActiveConnection/7)
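
To verify the policy-based routing setup, inspect the routing rules and the per-table routes that NetworkManager created (shown for node26; the table numbers match the ones configured above):

node26# ip rule show
node26# ip route show table 100
node26# ip route show table 200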

NVMe drives setup

In the SBB system, we have 24 Kioxia CM6-R 3.84TB KCM61RUL3T84 drives. They are PCIe 4.0, dual-ported, read-intensive drives with 1DWPD endurance. A single drive's performance can theoretically reach up to 6.9GB/s for sequential read and 4.2GB/s for sequential write (according to the vendor specification).

In our setup, we plan to create a simple Lustre installation with sufficient performance. However, since each NVMe in the SBB system is connected to each server with only 2 PCIe lanes, the NVMe drives' performance will be limited. To overcome this limitation, we will create 2 namespaces on each NVMe drive, which will be used for the Lustre OST RAIDs, and create separate RAIDs from the first NVMe namespaces and the second NVMe namespaces. By configuring our cluster software to use the RAIDs made from the first namespaces (and their Lustre servers) on Lustre node #0 and the RAIDs created from the second namespaces on node #1, we will be able to utilize all four PCIe lanes for each NVMe used to store OST data, as Lustre itself will distribute the workload among all OSTs.
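
As a quick sanity check of the per-node link width, the standard PCIe sysfs attributes can be read for any of the shared drives (shown for nvme4 as an example; the path assumes the default sysfs layout):

# cat /sys/class/nvme/nvme4/device/current_link_width
# cat /sys/class/nvme/nvme4/device/current_link_speed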

Since we are deploying a simple Lustre installation, we will use a simple filesystem scheme with just one metadata server. As we will have only one metadata server, we will need only one RAID for the metadata. Because of this, we will not create two namespaces on the drives used for the MDT RAID.

Here is how the NVMe drive configuration looks initially:

# nvme list
Node                  SN                   Model                                    Namespace Usage                      Format           FW Rev
--------------------- -------------------- ---------------------------------------- --------- -------------------------- ---------------- --------
/dev/nvme0n1          21G0A046T2G8         KCM61RUL3T84                             1           0.00   B /   3.84  TB      4 KiB +  0 B   0106
/dev/nvme1n1          21G0A04BT2G8         KCM61RUL3T84                             1           0.00   B /   3.84  TB      4 KiB +  0 B   0106
/dev/nvme10n1         21G0A04ET2G8         KCM61RUL3T84                             1           0.00   B /   3.84  TB      4 KiB +  0 B   0106
/dev/nvme11n1         21G0A045T2G8         KCM61RUL3T84                             1           0.00   B /   3.84  TB      4 KiB +  0 B   0106
/dev/nvme12n1         S59BNM0R702322Z      Samsung SSD 970 EVO Plus 250GB           1           8.67  GB / 250.06  GB    512   B +  0 B   2B2QEXM7
/dev/nvme13n1         21G0A04KT2G8         KCM61RUL3T84                             1           0.00   B /   3.84  TB      4 KiB +  0 B   0106
/dev/nvme14n1         21G0A047T2G8         KCM61RUL3T84                             1           0.00   B /   3.84  TB      4 KiB +  0 B   0106
/dev/nvme15n1         21G0A04CT2G8         KCM61RUL3T84                             1           0.00   B /   3.84  TB      4 KiB +  0 B   0106
/dev/nvme16n1         11U0A00KT2G8         KCM61RUL3T84                             1           0.00   B /   3.84  TB      4 KiB +  0 B   0106
/dev/nvme17n1         21G0A04JT2G8         KCM61RUL3T84                             1           0.00   B /   3.84  TB      4 KiB +  0 B   0106
/dev/nvme18n1         21G0A048T2G8         KCM61RUL3T84                             1           0.00   B /   3.84  TB      4 KiB +  0 B   0106
/dev/nvme19n1         S59BNM0R702439A      Samsung SSD 970 EVO Plus 250GB           1         208.90  kB / 250.06  GB    512   B +  0 B   2B2QEXM7
/dev/nvme2n1          21G0A041T2G8         KCM61RUL3T84                             1           0.00   B /   3.84  TB      4 KiB +  0 B   0106
/dev/nvme20n1         21G0A03TT2G8         KCM61RUL3T84                             1           0.00   B /   3.84  TB      4 KiB +  0 B   0106
/dev/nvme21n1         21G0A04FT2G8         KCM61RUL3T84                             1           0.00   B /   3.84  TB      4 KiB +  0 B   0106
/dev/nvme22n1         21G0A03ZT2G8         KCM61RUL3T84                             1           0.00   B /   3.84  TB      4 KiB +  0 B   0106
/dev/nvme23n1         21G0A04DT2G8         KCM61RUL3T84                             1           0.00   B /   3.84  TB      4 KiB +  0 B   0106
/dev/nvme24n1         21G0A03VT2G8         KCM61RUL3T84                             1           0.00   B /   3.84  TB      4 KiB +  0 B   0106
/dev/nvme25n1         21G0A044T2G8         KCM61RUL3T84                             1           0.00   B /   3.84  TB      4 KiB +  0 B   0106
/dev/nvme3n1          21G0A04GT2G8         KCM61RUL3T84                             1           0.00   B /   3.84  TB      4 KiB +  0 B   0106
/dev/nvme4n1          21G0A042T2G8         KCM61RUL3T84                             1           0.00   B /   3.84  TB      4 KiB +  0 B   0106
/dev/nvme5n1          21G0A04HT2G8         KCM61RUL3T84                             1           0.00   B /   3.84  TB      4 KiB +  0 B   0106
/dev/nvme6n1          21G0A049T2G8         KCM61RUL3T84                             1           0.00   B /   3.84  TB      4 KiB +  0 B   0106
/dev/nvme7n1          21G0A043T2G8         KCM61RUL3T84                             1           0.00   B /   3.84  TB      4 KiB +  0 B   0106
/dev/nvme8n1          21G0A04AT2G8         KCM61RUL3T84                             1           0.00   B /   3.84  TB      4 KiB +  0 B   0106
/dev/nvme9n1          21G0A03XT2G8         KCM61RUL3T84                             1           0.00   B /   3.84  TB      4 KiB +  0 B   0106

The Samsung drives are used for the operating system installation.

Let's reserve the /dev/nvme0 and /dev/nvme1 drives for the metadata RAID1. Currently, xiRAID does not support spare pools in a cluster configuration, but having a spare drive is very useful for quick manual drive replacement. So, let's also reserve /dev/nvme3 as a spare for the RAID1 and split all other KCM61RUL3T84 drives into 2 namespaces.

Let's take /dev/nvme4 as an example. All other drives will be split in exactly the same way.

Check the total NVM capacity of the drive:

# nvme id-ctrl /dev/nvme4 | grep -i tnvmcap
tnvmcap : 3840755982336

Check the maximum number of namespaces supported by the drive:

# nvme id-ctrl /dev/nvme4 | grep ^nn
nn : 64

Check the controller ID used for the drive connection on each server (they will differ):

node27# nvme id-ctrl /dev/nvme4 | grep ^cntlid
cntlid : 0x1

node26# nvme id-ctrl /dev/nvme4 | grep ^cntlid
cntlid : 0x2

We need to calculate the size of the namespaces we are going to create. The real size of the drive in 4K blocks is:

3840755982336/4096=937684566

So, each namespace size in 4K blocks will be:

937684566/2=468842283

In fact, it is not possible to create 2 namespaces of exactly this size because of the NVMe internal architecture. So, we will create namespaces of 468700000 blocks.
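
If you prefer, the same arithmetic can be done directly in the shell (a convenience one-liner, not required for the setup):

# echo $(( $(nvme id-ctrl /dev/nvme4 | awk '/tnvmcap/ {print $3}') / 4096 / 2 ))
468842283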

If you are building a system for write-intensive tasks, we recommend using write-intensive drives with 3DWPD endurance. If that is not possible and you have to use read-optimized drives, consider leaving some of the NVMe capacity (10-25%) unallocated by namespaces. In many cases, this brings the drive's write performance degradation behavior closer to that of write-intensive drives.

As a first step, remove the existing namespace on one of the nodes:

node26# nvme delete-ns /dev/nvme4 -n 1

After that, create namespaces on the same node:

node26# nvme create-ns /dev/nvme4 --nsze=468700000 --ncap=468700000 --block-size=4096 --dps=0 --nmic=1
create-ns: Success, created nsid:1
node26# nvme create-ns /dev/nvme4 --nsze=468700000 --ncap=468700000 --block-size=4096 --dps=0 --nmic=1
create-ns: Success, created nsid:2
node26# nvme attach-ns /dev/nvme4 --namespace-id=1 --controllers=0x2
attach-ns: Success, nsid:1
node26# nvme attach-ns /dev/nvme4 --namespace-id=2 --controllers=0x2
attach-ns: Success, nsid:2

Attach the namespaces on the second node with the proper controller:

node27# nvme attach-ns /dev/nvme4 --namespace-id=1 --controllers=0x1
attach-ns: Success, nsid:1
node27# nvme attach-ns /dev/nvme4 --namespace-id=2 --controllers=0x1
attach-ns: Success, nsid:2

It looks like this on both nodes:

# nvme list |grep nvme4
/dev/nvme4n1          21G0A042T2G8         KCM61RUL3T84                             1           0.00   B /   1.92  TB      4 KiB +  0 B   0106
/dev/nvme4n2          21G0A042T2G8         KCM61RUL3T84                             2           0.00   B /   1.92  TB      4 KiB +  0 B   0106
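
Splitting the remaining data drives one by one is repetitive, so a small shell loop can help. The following is only a sketch for node26 (the drive list and the controller ID 0x2 are examples and must be adjusted to your system; the attach-ns commands then have to be repeated on node27 with its own controller ID, 0x1 in our case):

node26# for d in /dev/nvme5 /dev/nvme6 /dev/nvme7 /dev/nvme8 /dev/nvme9 /dev/nvme10 /dev/nvme11 /dev/nvme13 /dev/nvme14 /dev/nvme15 /dev/nvme16 /dev/nvme17 /dev/nvme18 /dev/nvme20 /dev/nvme21 /dev/nvme22 /dev/nvme23 /dev/nvme24 /dev/nvme25; do
    nvme delete-ns $d -n 1                                                                      # remove the factory namespace
    nvme create-ns $d --nsze=468700000 --ncap=468700000 --block-size=4096 --dps=0 --nmic=1      # namespace 1
    nvme create-ns $d --nsze=468700000 --ncap=468700000 --block-size=4096 --dps=0 --nmic=1      # namespace 2
    nvme attach-ns $d --namespace-id=1 --controllers=0x2                                        # attach both namespaces to this node's controller
    nvme attach-ns $d --namespace-id=2 --controllers=0x2
done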

All other drives were split in the same way. Here is the resulting configuration:

# nvme list
Node                  SN                   Model                                    Namespace Usage                      Format           FW Rev
--------------------- -------------------- ---------------------------------------- --------- -------------------------- ---------------- --------
/dev/nvme0n1          21G0A046T2G8         KCM61RUL3T84                             1           0.00   B /   3.84  TB      4 KiB +  0 B   0106
/dev/nvme1n1          21G0A04BT2G8         KCM61RUL3T84                             1           0.00   B /   3.84  TB      4 KiB +  0 B   0106
/dev/nvme10n1         21G0A04ET2G8         KCM61RUL3T84                             1           0.00   B /   1.92  TB      4 KiB +  0 B   0106
/dev/nvme10n2         21G0A04ET2G8         KCM61RUL3T84                             2           0.00   B /   1.92  TB      4 KiB +  0 B   0106
/dev/nvme11n1         21G0A045T2G8         KCM61RUL3T84                             1           0.00   B /   1.92  TB      4 KiB +  0 B   0106
/dev/nvme11n2         21G0A045T2G8         KCM61RUL3T84                             2           0.00   B /   1.92  TB      4 KiB +  0 B   0106
/dev/nvme12n1         S59BNM0R702322Z      Samsung SSD 970 EVO Plus 250GB           1           8.67  GB / 250.06  GB    512   B +  0 B   2B2QEXM7
/dev/nvme13n1         21G0A04KT2G8         KCM61RUL3T84                             1           0.00   B /   1.92  TB      4 KiB +  0 B   0106
/dev/nvme13n2         21G0A04KT2G8         KCM61RUL3T84                             2           0.00   B /   1.92  TB      4 KiB +  0 B   0106
/dev/nvme14n1         21G0A047T2G8         KCM61RUL3T84                             1           0.00   B /   1.92  TB      4 KiB +  0 B   0106
/dev/nvme14n2         21G0A047T2G8         KCM61RUL3T84                             2           0.00   B /   1.92  TB      4 KiB +  0 B   0106
/dev/nvme15n1         21G0A04CT2G8         KCM61RUL3T84                             1           0.00   B /   1.92  TB      4 KiB +  0 B   0106
/dev/nvme15n2         21G0A04CT2G8         KCM61RUL3T84                             2           0.00   B /   1.92  TB      4 KiB +  0 B   0106
/dev/nvme16n1         11U0A00KT2G8         KCM61RUL3T84                             1           0.00   B /   1.92  TB      4 KiB +  0 B   0106
/dev/nvme16n2         11U0A00KT2G8         KCM61RUL3T84                             2           0.00   B /   1.92  TB      4 KiB +  0 B   0106
/dev/nvme17n1         21G0A04JT2G8         KCM61RUL3T84                             1           0.00   B /   1.92  TB      4 KiB +  0 B   0106
/dev/nvme17n2         21G0A04JT2G8         KCM61RUL3T84                             2           0.00   B /   1.92  TB      4 KiB +  0 B   0106
/dev/nvme18n1         21G0A048T2G8         KCM61RUL3T84                             1           0.00   B /   1.92  TB      4 KiB +  0 B   0106
/dev/nvme18n2         21G0A048T2G8         KCM61RUL3T84                             2           0.00   B /   1.92  TB      4 KiB +  0 B   0106
/dev/nvme19n1         S59BNM0R702439A      Samsung SSD 970 EVO Plus 250GB           1         208.90  kB / 250.06  GB    512   B +  0 B   2B2QEXM7
/dev/nvme2n1          21G0A041T2G8         KCM61RUL3T84                             1           0.00   B /   3.84  TB      4 KiB +  0 B   0106
/dev/nvme20n1         21G0A03TT2G8         KCM61RUL3T84                             1           0.00   B /   1.92  TB      4 KiB +  0 B   0106
/dev/nvme20n2         21G0A03TT2G8         KCM61RUL3T84                             2           0.00   B /   1.92  TB      4 KiB +  0 B   0106
/dev/nvme21n1         21G0A04FT2G8         KCM61RUL3T84                             1           0.00   B /   1.92  TB      4 KiB +  0 B   0106
/dev/nvme21n2         21G0A04FT2G8         KCM61RUL3T84                             2           0.00   B /   1.92  TB      4 KiB +  0 B   0106
/dev/nvme22n1         21G0A03ZT2G8         KCM61RUL3T84                             1           0.00   B /   1.92  TB      4 KiB +  0 B   0106
/dev/nvme22n2         21G0A03ZT2G8         KCM61RUL3T84                             2           0.00   B /   1.92  TB      4 KiB +  0 B   0106
/dev/nvme23n1         21G0A04DT2G8         KCM61RUL3T84                             1           0.00   B /   1.92  TB      4 KiB +  0 B   0106
/dev/nvme23n2         21G0A04DT2G8         KCM61RUL3T84                             2           0.00   B /   1.92  TB      4 KiB +  0 B   0106
/dev/nvme24n1         21G0A03VT2G8         KCM61RUL3T84                             1           0.00   B /   1.92  TB      4 KiB +  0 B   0106
/dev/nvme24n2         21G0A03VT2G8         KCM61RUL3T84                             2           0.00   B /   1.92  TB      4 KiB +  0 B   0106
/dev/nvme25n1         21G0A044T2G8         KCM61RUL3T84                             1           0.00   B /   1.92  TB      4 KiB +  0 B   0106
/dev/nvme25n2         21G0A044T2G8         KCM61RUL3T84                             2           0.00   B /   1.92  TB      4 KiB +  0 B   0106
/dev/nvme3n1          21G0A04GT2G8         KCM61RUL3T84                             1           0.00   B /   3.84  TB      4 KiB +  0 B   0106
/dev/nvme4n1          21G0A042T2G8         KCM61RUL3T84                             1           0.00   B /   1.92  TB      4 KiB +  0 B   0106
/dev/nvme4n2          21G0A042T2G8         KCM61RUL3T84                             2           0.00   B /   1.92  TB      4 KiB +  0 B   0106
/dev/nvme5n1          21G0A04HT2G8         KCM61RUL3T84                             1           0.00   B /   1.92  TB      4 KiB +  0 B   0106
/dev/nvme5n2          21G0A04HT2G8         KCM61RUL3T84                             2           0.00   B /   1.92  TB      4 KiB +  0 B   0106
/dev/nvme6n1          21G0A049T2G8         KCM61RUL3T84                             1           0.00   B /   1.92  TB      4 KiB +  0 B   0106
/dev/nvme6n2          21G0A049T2G8         KCM61RUL3T84                             2           0.00   B /   1.92  TB      4 KiB +  0 B   0106
/dev/nvme7n1          21G0A043T2G8         KCM61RUL3T84                             1           0.00   B /   1.92  TB      4 KiB +  0 B   0106
/dev/nvme7n2          21G0A043T2G8         KCM61RUL3T84                             2           0.00   B /   1.92  TB      4 KiB +  0 B   0106
/dev/nvme8n1          21G0A04AT2G8         KCM61RUL3T84                             1           0.00   B /   1.92  TB      4 KiB +  0 B   0106
/dev/nvme8n2          21G0A04AT2G8         KCM61RUL3T84                             2           0.00   B /   1.92  TB      4 KiB +  0 B   0106
/dev/nvme9n1          21G0A03XT2G8         KCM61RUL3T84                             1           0.00   B /   1.92  TB      4 KiB +  0 B   0106
/dev/nvme9n2          21G0A03XT2G8         KCM61RUL3T84                             2           0.00   B /   1.92  TB      4 KiB +  0 B   0106

Software components installation

Lustre installation

Create the Lustre repo file /etc/yum.repos.d/lustre-repo.repo:

[lustre-server]
name=lustre-server
baseurl=https://downloads.whamcloud.com/public/lustre/latest-release/el8.9/server
# exclude=*debuginfo*
gpgcheck=0

[lustre-client]
name=lustre-client
baseurl=https://downloads.whamcloud.com/public/lustre/latest-release/el8.9/client
# exclude=*debuginfo*
gpgcheck=0

[e2fsprogs-wc]
name=e2fsprogs-wc
baseurl=https://downloads.whamcloud.com/public/e2fsprogs/latest/el8
# exclude=*debuginfo*
gpgcheck=0

Installing e2fs tools:

yum --nogpgcheck --disablerepo=* --enablerepo=e2fsprogs-wc install e2fsprogs

Installing Lustre kernel:

yum --nogpgcheck --disablerepo=baseos,extras,updates --enablerepo=lustre-server install kernel kernel-devel kernel-headers

Reboot to the new kernel:

reboot

Check the kernel version after reboot:

node26# uname -a
Linux node26 4.18.0-513.9.1.el8_lustre.x86_64 #1 SMP Sat Dec 23 05:23:32 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

Installing Lustre server components:

yum --nogpgcheck --enablerepo=lustre-server,ha install kmod-lustre kmod-lustre-osd-ldiskfs lustre-osd-ldiskfs-mount lustre lustre-resource-agents

Check Lustre module load:

[root@node26 ~]# modprobe -v lustre
insmod /lib/modules/4.18.0-513.9.1.el8_lustre.x86_64/extra/lustre/net/libcfs.ko
insmod /lib/modules/4.18.0-513.9.1.el8_lustre.x86_64/extra/lustre/net/lnet.ko
insmod /lib/modules/4.18.0-513.9.1.el8_lustre.x86_64/extra/lustre/fs/obdclass.ko
insmod /lib/modules/4.18.0-513.9.1.el8_lustre.x86_64/extra/lustre/fs/ptlrpc.ko
insmod /lib/modules/4.18.0-513.9.1.el8_lustre.x86_64/extra/lustre/fs/fld.ko
insmod /lib/modules/4.18.0-513.9.1.el8_lustre.x86_64/extra/lustre/fs/fid.ko
insmod /lib/modules/4.18.0-513.9.1.el8_lustre.x86_64/extra/lustre/fs/osc.ko
insmod /lib/modules/4.18.0-513.9.1.el8_lustre.x86_64/extra/lustre/fs/lov.ko
insmod /lib/modules/4.18.0-513.9.1.el8_lustre.x86_64/extra/lustre/fs/mdc.ko
insmod /lib/modules/4.18.0-513.9.1.el8_lustre.x86_64/extra/lustre/fs/lmv.ko
insmod /lib/modules/4.18.0-513.9.1.el8_lustre.x86_64/extra/lustre/fs/lustre.ko

Unload modules:

# lustre_rmmod

Installing xiRAID Classic 4.1

Installing xiRAID Classic 4.1 on both nodes from the repositories, following the Xinnor xiRAID 4.1.0 Installation Guide:

# yum install -y epel-release
# yum install https://pkg.xinnor.io/repository/Repository/xiraid/el/8/kver-4.18/xiraid-repo-1.1.0-446.kver.4.18.noarch.rpm
# yum install xiraid-release

Pacemaker installation

Run the following steps on both nodes.

Enable the HA and AppStream repositories:

# yum config-manager --set-enabled ha appstream

Installing the cluster packages:

# yum install pcs pacemaker psmisc policycoreutils-python3

Csync2 installation

Since we are installing the system on Rocky Linux 8, there is no need to compile Csync2 from sources ourselves. Just install the Csync2 package from the Xinnor repository on both nodes:

# yum install csync2

NTP server installation

# yum install chrony

HA cluster setup

Time synchronisation setup

Modify the /etc/chrony.conf file if needed to point to the proper NTP servers. In this setup, we will use the default settings.

# systemctl enable --now chronyd.service

Verify that time synchronisation works properly by running chronyc tracking.

Pacemaker cluster creation

In this chapter, the cluster configuration is described. In our cluster, we use a dedicated network as the cluster interconnect. This network is physically a single direct connection (a dedicated Ethernet cable without any switch) between the enp194s0f1 interfaces of the servers. The cluster interconnect is a very important component of any HA cluster, and it should be highly reliable. A Pacemaker-based cluster can be configured with two cluster interconnect networks for improved reliability through redundancy. While our configuration uses a single interconnect network, please consider using a dual-network interconnect for your own projects if needed.

Configure the firewall to allow the Pacemaker software to communicate (on both nodes):

# firewall-cmd --add-service=high-availability
# firewall-cmd --permanent --add-service=high-availability

Set the same password for the hacluster user at both nodes:

# passwd hacluster

Start the cluster software at both nodes:

# systemctl start pcsd.service
# systemctl enable pcsd.service

Authenticate the cluster nodes from one node by their interconnect interfaces:

node26# pcs host auth node26-ic node27-ic -u hacluster
Password:
node26-ic: Authorized
node27-ic: Authorized

Create and start the cluster (start at one node):

node26# pcs cluster setup lustrebox0 node26-ic node27-ic
No addresses specified for host 'node26-ic', using 'node26-ic'
No addresses specified for host 'node27-ic', using 'node27-ic'
Destroying cluster on hosts: 'node26-ic', 'node27-ic'...
node26-ic: Successfully destroyed cluster
node27-ic: Successfully destroyed cluster
Requesting remove 'pcsd settings' from 'node26-ic', 'node27-ic'
node26-ic: successful removal of the file 'pcsd settings'
node27-ic: successful removal of the file 'pcsd settings'
Sending 'corosync authkey', 'pacemaker authkey' to 'node26-ic', 'node27-ic'
node26-ic: successful distribution of the file 'corosync authkey'
node26-ic: successful distribution of the file 'pacemaker authkey'
node27-ic: successful distribution of the file 'corosync authkey'
node27-ic: successful distribution of the file 'pacemaker authkey'
Sending 'corosync.conf' to 'node26-ic', 'node27-ic'
node26-ic: successful distribution of the file 'corosync.conf'
node27-ic: successful distribution of the file 'corosync.conf'
Cluster has been successfully set up.

node26# pcs cluster start --all
node26-ic: Starting Cluster...
node27-ic: Starting Cluster...

Check the current cluster status:

node26# pcs status
Cluster name: lustrebox0

WARNINGS:
No stonith devices and stonith-enabled is not false

Cluster Summary:
  * Stack: corosync (Pacemaker is running)
  * Current DC: node27-ic (version 2.1.7-5.el8_10-0f7f88312) - partition with quorum
  * Last updated: Fri Jul 12 20:55:53 2024 on node26-ic
  * Last change:  Fri Jul 12 20:55:12 2024 by hacluster via hacluster on node27-ic
  * 2 nodes configured
  * 0 resource instances configured

Node List:
  * Online: [ node26-ic node27-ic ]

Full List of Resources:
  * No resources

Daemon Status:
  corosync: active/disabled
  pacemaker: active/disabled
  pcsd: active/enabled

Fencing setup

It's very important to have properly configured and working fencing (STONITH) in any HA cluster that works with shared storage devices. In our case, the shared devices are all the NVMe namespaces we created earlier. The fencing (STONITH) design should be developed and implemented by the cluster administrator with the system's capabilities and architecture in mind. In this system, we will use fencing via IPMI. In any case, when designing and deploying your own cluster, choose the fencing configuration yourself, considering all the possibilities, limitations, and risks.

First of all, let's check the list of installed fencing agents in our system:

node26# pcs stonith list
fence_watchdog - Dummy watchdog fence agent

So, we don’t have the IPMI fencing agent installed at our cluster nodes. To install it, run the following command (at both nodes):

# yum install fence-agents-ipmilan

You can check the description of the IPMI fencing agent's options by running the following command:

pcs stonith describe fence_ipmilan

Adding the fencing resources:

node26# pcs stonith create node27.stonith fence_ipmilan ip="192.168.67.23" auth=password password="admin" username="admin" method="onoff" lanplus=true pcmk_host_list="node27-ic" pcmk_host_check=static-list op monitor interval=10s
node26# pcs stonith create node26.stonith fence_ipmilan ip="192.168.64.106" auth=password password="admin" username="admin" method="onoff" lanplus=true pcmk_host_list="node26-ic" pcmk_host_check=static-list op monitor interval=10s

Prevent each STONITH resource from starting on the node it is supposed to fence:

node26# pcs constraint location node27.stonith avoids node27-ic=INFINITY
node26# pcs constraint location node26.stonith avoids node26-ic=INFINITY
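
Before relying on fencing, it is worth checking that the STONITH resources have started and that the IPMI interfaces actually answer. A quick check could look like this (the fence_ipmilan call is an optional manual status query using the same IPMI credentials as above):

node26# pcs stonith status
node26# fence_ipmilan --ip=192.168.67.23 --username=admin --password=admin --lanplus --action=status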

Csync2 configuration

Configure firewall to allow Csync2 to work (run at both nodes):

# firewall-cmd --add-port=30865/tcp
# firewall-cmd --permanent --add-port=30865/tcp

Create the Csync2 configuration file /usr/local/etc/csync2.cfg with the following content at node26 only:

nossl * *;
group csxiha {
host node26;
host node27;
key /usr/local/etc/csync2.key_ha;
include /etc/xiraid/raids; }

Generate the key:

node26# csync2 -k /usr/local/etc/csync2.key_ha

Copy the config and the key file to the second node:

node26# scp /usr/local/etc/csync2.cfg /usr/local/etc/csync2.key_ha node27:/usr/local/etc/

To schedule Csync2 synchronisation once per minute, run crontab -e at both nodes and add the following record:

* * * * * /usr/local/sbin/csync2 -x

Also, for asynchronous (event-driven) synchronisation, run the following command to create a synchronisation script (repeat the script creation procedure at both nodes):

# vi /etc/xiraid/config_update_handler.sh

Fill the created script with the following content:

#!/usr/bin/bash
/usr/local/sbin/csync2 -xv

Save the file.

After that run the following command to set correct permissions for the script file:

# chmod +x /etc/xiraid/config_update_handler.sh

xiRAID Configuration for cluster setup

Disable RAID autostart to prevent RAIDs from being activated by xiRAID itself during a node boot. In a cluster configuration, RAIDs have to be activated by Pacemaker via cluster resources. Run the following command on both nodes:

# xicli settings cluster modify --raid_autostart 0

Make the xiRAID Classic 4.1 resource agent visible to Pacemaker (run this command sequence at both nodes):

# mkdir -p /usr/lib/ocf/resource.d/xraid
# ln -s /etc/xraid/agents/raid /usr/lib/ocf/resource.d/xraid/raid
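
To confirm that Pacemaker can now see the agent, list the agents of the ocf:xraid provider and review the agent metadata (standard pcs commands; the output should include the raid agent):

# pcs resource agents ocf:xraid
# pcs resource describe ocf:xraid:raid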

xiRAID RAIDs creation

To be able to create RAIDs, we need to install licenses for xiRAID Classic 4.1 on both hosts first. The licenses should be received from Xinnor. To generate the licenses, Xinnor requires the output of the xicli license show command (from both nodes).

node26# xicli license show
Kernel version: 4.18.0-513.9.1.el8_lustre.x86_64

hwkey: B8828A09E09E8F48
license_key: null
version: 0
crypto_version: 0
created: 0-0-0
expired: 0-0-0
disks: 4
levels: 0
type: nvme
disks_in_use: 2
status: trial

The license files received from Xinnor need to be installed with the xicli license update -p <filename> command (once again, at both nodes). After installation, the license status looks like this:

node26# xicli license show
Kernel version: 4.18.0-513.9.1.el8_lustre.x86_64

hwkey: B8828A09E09E8F48
license_key: 0F5A4B87A0FC6DB7544EA446B1B4AF5F34A08169C44E5FD119CE6D2352E202677768ECC78F56B583DABE11698BBC800EC96E556AA63E576DAB838010247678E7E3B95C7C4E3F592672D06C597045EAAD8A42CDE38C363C533E98411078967C38224C9274B862D45D4E6DED70B7E34602C80B60CBA7FDE93316438AFDCD7CBD23
version: 1
crypto_version: 1
created: 2024-7-16
expired: 2024-9-30
disks: 600
levels: 70
type: nvme
disks_in_use: 2
status: valid

Since we plan to deploy a small Lustre installation, combining MGT and MDT on the same target device is absolutely OK. But for medium or large Lustre installations, it's better to use a separate target (and RAID) for MGT.

Here is the list of the RAIDs we need to create.

RAID name: r_mdt0
  RAID level: 1, number of devices: 2, strip size: 16
  Drive list: /dev/nvme0n1, /dev/nvme1n1
  Lustre target: MGT + MDT (index=0)

RAID name: r_ost0
  RAID level: 6, number of devices: 10, strip size: 128
  Drive list: /dev/nvme4n1, /dev/nvme5n1, /dev/nvme6n1, /dev/nvme7n1, /dev/nvme8n1, /dev/nvme9n1, /dev/nvme10n1, /dev/nvme11n1, /dev/nvme13n1, /dev/nvme14n1
  Lustre target: OST (index=0)

RAID name: r_ost1
  RAID level: 6, number of devices: 10, strip size: 128
  Drive list: /dev/nvme4n2, /dev/nvme5n2, /dev/nvme6n2, /dev/nvme7n2, /dev/nvme8n2, /dev/nvme9n2, /dev/nvme10n2, /dev/nvme11n2, /dev/nvme13n2, /dev/nvme14n2
  Lustre target: OST (index=1)

RAID name: r_ost2
  RAID level: 6, number of devices: 10, strip size: 128
  Drive list: /dev/nvme15n1, /dev/nvme16n1, /dev/nvme17n1, /dev/nvme18n1, /dev/nvme20n1, /dev/nvme21n1, /dev/nvme22n1, /dev/nvme23n1, /dev/nvme24n1, /dev/nvme25n1
  Lustre target: OST (index=2)

RAID name: r_ost3
  RAID level: 6, number of devices: 10, strip size: 128
  Drive list: /dev/nvme15n2, /dev/nvme16n2, /dev/nvme17n2, /dev/nvme18n2, /dev/nvme20n2, /dev/nvme21n2, /dev/nvme22n2, /dev/nvme23n2, /dev/nvme24n2, /dev/nvme25n2
  Lustre target: OST (index=3)

Creating all the RAIDs at the first node:

node26# xicli raid create -n r_mdt0 -l 1 -d /dev/nvme0n1 /dev/nvme1n1
node26# xicli raid create -n r_ost0 -l 6 -ss 128 -d /dev/nvme4n1 /dev/nvme5n1 /dev/nvme6n1 /dev/nvme7n1 /dev/nvme8n1 /dev/nvme9n1 /dev/nvme10n1 /dev/nvme11n1 /dev/nvme13n1 /dev/nvme14n1
node26# xicli raid create -n r_ost1 -l 6 -ss 128 -d /dev/nvme4n2 /dev/nvme5n2 /dev/nvme6n2 /dev/nvme7n2 /dev/nvme8n2 /dev/nvme9n2 /dev/nvme10n2 /dev/nvme11n2 /dev/nvme13n2 /dev/nvme14n2
node26# xicli raid create -n r_ost2 -l 6 -ss 128 -d /dev/nvme15n1 /dev/nvme16n1 /dev/nvme17n1 /dev/nvme18n1 /dev/nvme20n1 /dev/nvme21n1 /dev/nvme22n1 /dev/nvme23n1 /dev/nvme24n1 /dev/nvme25n1
node26# xicli raid create -n r_ost3 -l 6 -ss 128 -d /dev/nvme15n2 /dev/nvme16n2 /dev/nvme17n2 /dev/nvme18n2 /dev/nvme20n2 /dev/nvme21n2 /dev/nvme22n2 /dev/nvme23n2 /dev/nvme24n2 /dev/nvme25n2

At this stage, there is no need to wait for RAID initialization to finish - it can safely be left to run in the background.

Checking the RAID statuses at the first node:

node26# xicli raid show
╔RAIDs═══╦══════════════════╦═════════════╦════════════════════════╦═══════════════════╗
║ name   ║ static           ║ state       ║ devices                ║ info              ║
╠════════╬══════════════════╬═════════════╬════════════════════════╬═══════════════════╣
║ r_mdt0 ║ size: 3576 GiB   ║ online      ║ 0 /dev/nvme0n1 online  ║                   ║
║        ║ level: 1         ║ initialized ║ 1 /dev/nvme1n1 online  ║                   ║
║        ║ strip_size: 16   ║             ║                        ║                   ║
║        ║ block_size: 4096 ║             ║                        ║                   ║
║        ║ sparepool: -     ║             ║                        ║                   ║
║        ║ active: True     ║             ║                        ║                   ║
║        ║ config: True     ║             ║                        ║                   ║
╠════════╬══════════════════╬═════════════╬════════════════════════╬═══════════════════╣
║ r_ost0 ║ size: 14302 GiB  ║ online      ║ 0 /dev/nvme4n1 online  ║ init_progress: 11 ║
║        ║ level: 6         ║ initing     ║ 1 /dev/nvme5n1 online  ║                   ║
║        ║ strip_size: 128  ║             ║ 2 /dev/nvme6n1 online  ║                   ║
║        ║ block_size: 4096 ║             ║ 3 /dev/nvme7n1 online  ║                   ║
║        ║ sparepool: -     ║             ║ 4 /dev/nvme8n1 online  ║                   ║
║        ║ active: True     ║             ║ 5 /dev/nvme9n1 online  ║                   ║
║        ║ config: True     ║             ║ 6 /dev/nvme10n1 online ║                   ║
║        ║                  ║             ║ 7 /dev/nvme11n1 online ║                   ║
║        ║                  ║             ║ 8 /dev/nvme13n1 online ║                   ║
║        ║                  ║             ║ 9 /dev/nvme14n1 online ║                   ║
╠════════╬══════════════════╬═════════════╬════════════════════════╬═══════════════════╣
║ r_ost1 ║ size: 14302 GiB  ║ online      ║ 0 /dev/nvme4n2 online  ║ init_progress: 7  ║
║        ║ level: 6         ║ initing     ║ 1 /dev/nvme5n2 online  ║                   ║
║        ║ strip_size: 128  ║             ║ 2 /dev/nvme6n2 online  ║                   ║
║        ║ block_size: 4096 ║             ║ 3 /dev/nvme7n2 online  ║                   ║
║        ║ sparepool: -     ║             ║ 4 /dev/nvme8n2 online  ║                   ║
║        ║ active: True     ║             ║ 5 /dev/nvme9n2 online  ║                   ║
║        ║ config: True     ║             ║ 6 /dev/nvme10n2 online ║                   ║
║        ║                  ║             ║ 7 /dev/nvme11n2 online ║                   ║
║        ║                  ║             ║ 8 /dev/nvme13n2 online ║                   ║
║        ║                  ║             ║ 9 /dev/nvme14n2 online ║                   ║
╠════════╬══════════════════╬═════════════╬════════════════════════╬═══════════════════╣
║ r_ost2 ║ size: 14302 GiB  ║ online      ║ 0 /dev/nvme15n1 online ║ init_progress: 5  ║
║        ║ level: 6         ║ initing     ║ 1 /dev/nvme16n1 online ║                   ║
║        ║ strip_size: 128  ║             ║ 2 /dev/nvme17n1 online ║                   ║
║        ║ block_size: 4096 ║             ║ 3 /dev/nvme18n1 online ║                   ║
║        ║ sparepool: -     ║             ║ 4 /dev/nvme20n1 online ║                   ║
║        ║ active: True     ║             ║ 5 /dev/nvme21n1 online ║                   ║
║        ║ config: True     ║             ║ 6 /dev/nvme22n1 online ║                   ║
║        ║                  ║             ║ 7 /dev/nvme23n1 online ║                   ║
║        ║                  ║             ║ 8 /dev/nvme24n1 online ║                   ║
║        ║                  ║             ║ 9 /dev/nvme25n1 online ║                   ║
╠════════╬══════════════════╬═════════════╬════════════════════════╬═══════════════════╣
║ r_ost3 ║ size: 14302 GiB  ║ online      ║ 0 /dev/nvme15n2 online ║ init_progress: 2  ║
║        ║ level: 6         ║ initing     ║ 1 /dev/nvme16n2 online ║                   ║
║        ║ strip_size: 128  ║             ║ 2 /dev/nvme17n2 online ║                   ║
║        ║ block_size: 4096 ║             ║ 3 /dev/nvme18n2 online ║                   ║
║        ║ sparepool: -     ║             ║ 4 /dev/nvme20n2 online ║                   ║
║        ║ active: True     ║             ║ 5 /dev/nvme21n2 online ║                   ║
║        ║ config: True     ║             ║ 6 /dev/nvme22n2 online ║                   ║
║        ║                  ║             ║ 7 /dev/nvme23n2 online ║                   ║
║        ║                  ║             ║ 8 /dev/nvme24n2 online ║                   ║
║        ║                  ║             ║ 9 /dev/nvme25n2 online ║                   ║
╚════════╩══════════════════╩═════════════╩════════════════════════╩═══════════════════╝

Checking that the RAID configs were successfully replicated to the second node (please note that on the second node, the RAID status is None, which is expected in this case):

node27# xicli raid show
╔RAIDs═══╦══════════════════╦═══════╦═════════╦══════╗
║ name   ║ static           ║ state ║ devices ║ info ║
╠════════╬══════════════════╬═══════╬═════════╬══════╣
║ r_mdt0 ║ size: 3576 GiB   ║ None  ║         ║      ║
║        ║ level: 1         ║       ║         ║      ║
║        ║ strip_size: 16   ║       ║         ║      ║
║        ║ block_size: 4096 ║       ║         ║      ║
║        ║ sparepool: -     ║       ║         ║      ║
║        ║ active: False    ║       ║         ║      ║
║        ║ config: True     ║       ║         ║      ║
╠════════╬══════════════════╬═══════╬═════════╬══════╣
║ r_ost0 ║ size: 14302 GiB  ║ None  ║         ║      ║
║        ║ level: 6         ║       ║         ║      ║
║        ║ strip_size: 128  ║       ║         ║      ║
║        ║ block_size: 4096 ║       ║         ║      ║
║        ║ sparepool: -     ║       ║         ║      ║
║        ║ active: False    ║       ║         ║      ║
║        ║ config: True     ║       ║         ║      ║
╠════════╬══════════════════╬═══════╬═════════╬══════╣
║ r_ost1 ║ size: 14302 GiB  ║ None  ║         ║      ║
║        ║ level: 6         ║       ║         ║      ║
║        ║ strip_size: 128  ║       ║         ║      ║
║        ║ block_size: 4096 ║       ║         ║      ║
║        ║ sparepool: -     ║       ║         ║      ║
║        ║ active: False    ║       ║         ║      ║
║        ║ config: True     ║       ║         ║      ║
╠════════╬══════════════════╬═══════╬═════════╬══════╣
║ r_ost2 ║ size: 14302 GiB  ║ None  ║         ║      ║
║        ║ level: 6         ║       ║         ║      ║
║        ║ strip_size: 128  ║       ║         ║      ║
║        ║ block_size: 4096 ║       ║         ║      ║
║        ║ sparepool: -     ║       ║         ║      ║
║        ║ active: False    ║       ║         ║      ║
║        ║ config: True     ║       ║         ║      ║
╠════════╬══════════════════╬═══════╬═════════╬══════╣
║ r_ost3 ║ size: 14302 GiB  ║ None  ║         ║      ║
║        ║ level: 6         ║       ║         ║      ║
║        ║ strip_size: 128  ║       ║         ║      ║
║        ║ block_size: 4096 ║       ║         ║      ║
║        ║ sparepool: -     ║       ║         ║      ║
║        ║ active: False    ║       ║         ║      ║
║        ║ config: True     ║       ║         ║      ║
╚════════╩══════════════════╩═══════╩═════════╩══════╝

After RAID creation, there's no need to wait for RAID initialization to finish. The RAIDs are available for use immediately after creation, albeit with slightly reduced performance.

For optimal performance, it's better to dedicate disjoint CPU core sets to each RAID. Currently, all RAIDs are active on node26, so the core sets overlap, but once the RAIDs are spread between node26 and node27 as planned, they will not.

node26# xicli raid modify -n r_mdt0 -ca 0-7 -se 1
node26# xicli raid modify -n r_ost0 -ca 8-67 -se 1
node26# xicli raid modify -n r_ost1 -ca 8-67 -se 1 # will be running at node27
node26# xicli raid modify -n r_ost2 -ca 68-127 -se 1
node26# xicli raid modify -n r_ost3 -ca 68-127 -se 1 # will be running at node27

Lustre setup

LNET configuration

To make Lustre work, we need to configure the Lustre network stack (LNET).

Run at both nodes:

# systemctl start lnet
# systemctl enable lnet
# lnetctl net add --net o2ib0 --if ib0
# lnetctl net add --net o2ib0 --if ib3

Check the configuration:

# lnetctl net show -v
net:
    - net type: lo
      local NI(s):
        - nid: 0@lo
          status: up
          statistics:
              send_count: 289478
              recv_count: 289474
              drop_count: 4
          tunables:
              peer_timeout: 0
              peer_credits: 0
              peer_buffer_credits: 0
              credits: 0
          lnd tunables:
          dev cpt: 0
          CPT: "[0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31]"
    - net type: o2ib
      local NI(s):
        - nid: 100.100.100.26@o2ib
          status: down
          interfaces:
              0: ib0
          statistics:
              send_count: 213607
              recv_count: 213604
              drop_count: 7
          tunables:
              peer_timeout: 180
              peer_credits: 8
              peer_buffer_credits: 0
              credits: 256
          lnd tunables:
              peercredits_hiw: 4
              map_on_demand: 1
              concurrent_sends: 8
              fmr_pool_size: 512
              fmr_flush_trigger: 384
              fmr_cache: 1
              ntx: 512
              conns_per_peer: 1
          dev cpt: -1
          CPT: "[0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31]"
        - nid: 100.100.100.126@o2ib
          status: up
          interfaces:
              0: ib3
          statistics:
              send_count: 4
              recv_count: 4
              drop_count: 0
          tunables:
              peer_timeout: 180
              peer_credits: 8
              peer_buffer_credits: 0
              credits: 256
          lnd tunables:
              peercredits_hiw: 4
              map_on_demand: 1
              concurrent_sends: 8
              fmr_pool_size: 512
              fmr_flush_trigger: 384
              fmr_cache: 1
              ntx: 512
              conns_per_peer: 1
          dev cpt: -1
          CPT: "[0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31]"

Please pay attention to the LNET NIDs of the hosts. We will use 100.100.100.26@o2ib for node26 and 100.100.100.27@o2ib for node27 as the primary NIDs.

Save the LNET configuration:

# lnetctl export -b > /etc/lnet.conf
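
Once the LNET configuration is saved on both nodes, it is worth verifying connectivity between the NIDs, for example by pinging the peer node's NID from node26:

node26# lnetctl ping 100.100.100.27@o2ib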

LDISKFS filesystems creation

At this step, we format the RAIDs into LDISKFS filesystem format. During formatting, we specify the target type (--mgs/--mdt/--ost), unique number of the specific target type (--index), Lustre filesystem name (--fsname), NIDs where each target filesystem could be mounted and where the corresponding servers will get started automatically (--servicenode), and NIDs where MGS could be found (--mgsnode).

Since our RAIDs will work within a cluster, we specify NIDs of both server nodes as the NIDs where the target filesystem could be mounted and where the corresponding servers will get started automatically for each target filesystem. For the same reason, we specify two NIDs where other servers should look for the MGS service.

node26# mkfs.lustre --mgs --mdt --fsname=lustre0 --index=0 --servicenode=100.100.100.26@o2ib --servicenode=100.100.100.27@o2ib --mgsnode=100.100.100.26@o2ib --mgsnode=100.100.100.27@o2ib /dev/xi_r_mdt0
node26# mkfs.lustre --ost --fsname=lustre0 --index=0 --servicenode=100.100.100.26@o2ib --servicenode=100.100.100.27@o2ib --mgsnode=100.100.100.26@o2ib --mgsnode=100.100.100.27@o2ib /dev/xi_r_ost0
node26# mkfs.lustre --ost --fsname=lustre0 --index=1 --servicenode=100.100.100.26@o2ib --servicenode=100.100.100.27@o2ib --mgsnode=100.100.100.26@o2ib --mgsnode=100.100.100.27@o2ib /dev/xi_r_ost1
node26# mkfs.lustre --ost --fsname=lustre0 --index=2 --servicenode=100.100.100.26@o2ib --servicenode=100.100.100.27@o2ib --mgsnode=100.100.100.26@o2ib --mgsnode=100.100.100.27@o2ib /dev/xi_r_ost2
node26# mkfs.lustre --ost --fsname=lustre0 --index=3 --servicenode=100.100.100.26@o2ib --servicenode=100.100.100.27@o2ib --mgsnode=100.100.100.26@o2ib --mgsnode=100.100.100.27@o2ib /dev/xi_r_ost3

More details can be found in the Lustre documentation.
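
If you want to double-check the parameters written to a target without modifying it, tunefs.lustre can be run in dry-run mode, for example against the combined MGT/MDT device:

node26# tunefs.lustre --dryrun /dev/xi_r_mdt0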

Cluster resources creation

The resource configuration we are going to create is described in the table below.

RAID name  HA cluster RAID resource name  Lustre target        Mountpoint      HA cluster filesystem resource name  Preferred cluster node
r_mdt0     rr_mdt0                        MGT + MDT (index=0)  /lustre_t/mdt0  fsr_mdt0                             node26
r_ost0     rr_ost0                        OST (index=0)        /lustre_t/ost0  fsr_ost0                             node26
r_ost1     rr_ost1                        OST (index=1)        /lustre_t/ost1  fsr_ost1                             node27
r_ost2     rr_ost2                        OST (index=2)        /lustre_t/ost2  fsr_ost2                             node26
r_ost3     rr_ost3                        OST (index=3)        /lustre_t/ost3  fsr_ost3                             node27

To create Pacemaker resources for xiRAID Classic RAIDs, we will use the xiRAID resource agent, which was installed with xiRAID Classic and made available to Pacemaker in one of the previous steps.

To cluster Lustre services, there are two options, as currently two resource agents are capable of managing Lustre OSDs:

  1. ocf:heartbeat:Filesystem: Distributed by ClusterLabs in the resource-agents package, the Filesystem RA is a very mature and stable application and has been part of the Pacemaker project for many years. Filesystem provides generic support for mounting and unmounting storage devices, which indirectly includes Lustre.
  2. ocf:lustre:Lustre: Developed specifically for Lustre OSDs, this RA is distributed by the Lustre project and is available in Lustre releases from version 2.10.0 onwards. As a result of its narrower scope, it is less complex than ocf:heartbeat:Filesystem and better suited for managing Lustre storage resources.

For simplicity, we will use ocf:heartbeat:Filesystem in our case. However, ocf:lustre:Lustre can also be easily used in conjunction with xiRAID Classic in a Pacemaker cluster configuration. For more details on Lustre clustering, please check this page of Lustre documentation.

First of all, create mountpoints for all the RAIDs formatted in LDISKFS at both nodes:

# mkdir -p /lustre_t/ost3
# mkdir -p /lustre_t/ost2
# mkdir -p /lustre_t/ost1
# mkdir -p /lustre_t/ost0
# mkdir -p /lustre_t/mdt0

Unload all the RAIDs at the node where they are active:

node26# xicli raid show
╔RAIDs═══╦══════════════════╦═════════════╦════════════════════════╦══════╗
║ name   ║ static           ║ state       ║ devices                ║ info ║
╠════════╬══════════════════╬═════════════╬════════════════════════╬══════╣
║ r_mdt0 ║ size: 3576 GiB   ║ online      ║ 0 /dev/nvme0n1 online  ║      ║
║        ║ level: 1         ║ initialized ║ 1 /dev/nvme1n1 online  ║      ║
║        ║ strip_size: 16   ║             ║                        ║      ║
║        ║ block_size: 4096 ║             ║                        ║      ║
║        ║ sparepool: -     ║             ║                        ║      ║
║        ║ active: True     ║             ║                        ║      ║
║        ║ config: True     ║             ║                        ║      ║
╠════════╬══════════════════╬═════════════╬════════════════════════╬══════╣
║ r_ost0 ║ size: 14302 GiB  ║ online      ║ 0 /dev/nvme4n1 online  ║      ║
║        ║ level: 6         ║ initialized ║ 1 /dev/nvme5n1 online  ║      ║
║        ║ strip_size: 128  ║             ║ 2 /dev/nvme6n1 online  ║      ║
║        ║ block_size: 4096 ║             ║ 3 /dev/nvme7n1 online  ║      ║
║        ║ sparepool: -     ║             ║ 4 /dev/nvme8n1 online  ║      ║
║        ║ active: True     ║             ║ 5 /dev/nvme9n1 online  ║      ║
║        ║ config: True     ║             ║ 6 /dev/nvme10n1 online ║      ║
║        ║                  ║             ║ 7 /dev/nvme11n1 online ║      ║
║        ║                  ║             ║ 8 /dev/nvme13n1 online ║      ║
║        ║                  ║             ║ 9 /dev/nvme14n1 online ║      ║
╠════════╬══════════════════╬═════════════╬════════════════════════╬══════╣
║ r_ost1 ║ size: 14302 GiB  ║ online      ║ 0 /dev/nvme4n2 online  ║      ║
║        ║ level: 6         ║ initialized ║ 1 /dev/nvme5n2 online  ║      ║
║        ║ strip_size: 128  ║             ║ 2 /dev/nvme6n2 online  ║      ║
║        ║ block_size: 4096 ║             ║ 3 /dev/nvme7n2 online  ║      ║
║        ║ sparepool: -     ║             ║ 4 /dev/nvme8n2 online  ║      ║
║        ║ active: True     ║             ║ 5 /dev/nvme9n2 online  ║      ║
║        ║ config: True     ║             ║ 6 /dev/nvme10n2 online ║      ║
║        ║                  ║             ║ 7 /dev/nvme11n2 online ║      ║
║        ║                  ║             ║ 8 /dev/nvme13n2 online ║      ║
║        ║                  ║             ║ 9 /dev/nvme14n2 online ║      ║
╠════════╬══════════════════╬═════════════╬════════════════════════╬══════╣
║ r_ost2 ║ size: 14302 GiB  ║ online      ║ 0 /dev/nvme15n1 online ║      ║
║        ║ level: 6         ║ initialized ║ 1 /dev/nvme16n1 online ║      ║
║        ║ strip_size: 128  ║             ║ 2 /dev/nvme17n1 online ║      ║
║        ║ block_size: 4096 ║             ║ 3 /dev/nvme18n1 online ║      ║
║        ║ sparepool: -     ║             ║ 4 /dev/nvme20n1 online ║      ║
║        ║ active: True     ║             ║ 5 /dev/nvme21n1 online ║      ║
║        ║ config: True     ║             ║ 6 /dev/nvme22n1 online ║      ║
║        ║                  ║             ║ 7 /dev/nvme23n1 online ║      ║
║        ║                  ║             ║ 8 /dev/nvme24n1 online ║      ║
║        ║                  ║             ║ 9 /dev/nvme25n1 online ║      ║
╠════════╬══════════════════╬═════════════╬════════════════════════╬══════╣
║ r_ost3 ║ size: 14302 GiB  ║ online      ║ 0 /dev/nvme15n2 online ║      ║
║        ║ level: 6         ║ initialized ║ 1 /dev/nvme16n2 online ║      ║
║        ║ strip_size: 128  ║             ║ 2 /dev/nvme17n2 online ║      ║
║        ║ block_size: 4096 ║             ║ 3 /dev/nvme18n2 online ║      ║
║        ║ sparepool: -     ║             ║ 4 /dev/nvme20n2 online ║      ║
║        ║ active: True     ║             ║ 5 /dev/nvme21n2 online ║      ║
║        ║ config: True     ║             ║ 6 /dev/nvme22n2 online ║      ║
║        ║                  ║             ║ 7 /dev/nvme23n2 online ║      ║
║        ║                  ║             ║ 8 /dev/nvme24n2 online ║      ║
║        ║                  ║             ║ 9 /dev/nvme25n2 online ║      ║
╚════════╩══════════════════╩═════════════╩════════════════════════╩══════╝

node26# xicli raid unload -n r_mdt0
node26# xicli raid unload -n r_ost0
node26# xicli raid unload -n r_ost1
node26# xicli raid unload -n r_ost2
node26# xicli raid unload -n r_ost3

node26# xicli raid show
╔RAIDs═══╦══════════════════╦═══════╦═════════╦══════╗
║ name   ║ static           ║ state ║ devices ║ info ║
╠════════╬══════════════════╬═══════╬═════════╬══════╣
║ r_mdt0 ║ size: 3576 GiB   ║ None  ║         ║      ║
║        ║ level: 1         ║       ║         ║      ║
║        ║ strip_size: 16   ║       ║         ║      ║
║        ║ block_size: 4096 ║       ║         ║      ║
║        ║ sparepool: -     ║       ║         ║      ║
║        ║ active: False    ║       ║         ║      ║
║        ║ config: True     ║       ║         ║      ║
╠════════╬══════════════════╬═══════╬═════════╬══════╣
║ r_ost0 ║ size: 14302 GiB  ║ None  ║         ║      ║
║        ║ level: 6         ║       ║         ║      ║
║        ║ strip_size: 128  ║       ║         ║      ║
║        ║ block_size: 4096 ║       ║         ║      ║
║        ║ sparepool: -     ║       ║         ║      ║
║        ║ active: False    ║       ║         ║      ║
║        ║ config: True     ║       ║         ║      ║
╠════════╬══════════════════╬═══════╬═════════╬══════╣
║ r_ost1 ║ size: 14302 GiB  ║ None  ║         ║      ║
║        ║ level: 6         ║       ║         ║      ║
║        ║ strip_size: 128  ║       ║         ║      ║
║        ║ block_size: 4096 ║       ║         ║      ║
║        ║ sparepool: -     ║       ║         ║      ║
║        ║ active: False    ║       ║         ║      ║
║        ║ config: True     ║       ║         ║      ║
╠════════╬══════════════════╬═══════╬═════════╬══════╣
║ r_ost2 ║ size: 14302 GiB  ║ None  ║         ║      ║
║        ║ level: 6         ║       ║         ║      ║
║        ║ strip_size: 128  ║       ║         ║      ║
║        ║ block_size: 4096 ║       ║         ║      ║
║        ║ sparepool: -     ║       ║         ║      ║
║        ║ active: False    ║       ║         ║      ║
║        ║ config: True     ║       ║         ║      ║
╠════════╬══════════════════╬═══════╬═════════╬══════╣
║ r_ost3 ║ size: 14302 GiB  ║ None  ║         ║      ║
║        ║ level: 6         ║       ║         ║      ║
║        ║ strip_size: 128  ║       ║         ║      ║
║        ║ block_size: 4096 ║       ║         ║      ║
║        ║ sparepool: -     ║       ║         ║      ║
║        ║ active: False    ║       ║         ║      ║
║        ║ config: True     ║       ║         ║      ║
╚════════╩══════════════════╩═══════╩═════════╩══════╝

At the first node, create a copy of the cluster information base to stage our changes in:

node26# pcs cluster cib fs_cfg
node26# ls -l fs_cfg
-rw-r--r--. 1 root root 8614 Jul 20 02:04 fs_cfg

Getting the RAID UUIDs:

node26# grep uuid /etc/xiraid/raids/*.conf
/etc/xiraid/raids/r_mdt0.conf:    "uuid": "75E2CAA5-3E5B-4ED0-89E9-4BF3850FD542",
/etc/xiraid/raids/r_ost0.conf:    "uuid": "AB341442-20AC-43B1-8FE6-F9ED99D1D6C0",
/etc/xiraid/raids/r_ost1.conf:    "uuid": "1441D09C-0073-4555-A398-71984E847F9E",
/etc/xiraid/raids/r_ost2.conf:    "uuid": "0E225812-6877-4344-A552-B6A408EC7351",
/etc/xiraid/raids/r_ost3.conf:    "uuid": "F749B8A7-3CC4-45A9-A61E-E75EDBB3A53E",

Creating resource rr_mdt0 for the r_mdt0 RAID:

node26# pcs -f fs_cfg resource create rr_mdt0 ocf:xraid:raid name=r_mdt0 uuid=75E2CAA5-3E5B-4ED0-89E9-4BF3850FD542 op monitor interval=5s meta migration-threshold=1

Setting a constraint to make the first node preferred for the rr_mdt0 resource:

node26# pcs -f fs_cfg constraint location rr_mdt0 prefers node26-ic=50

Creating a resource for the r_mdt0 RAID mountpoint at /lustre_t/mdt0:

node26# pcs -f fs_cfg resource create fsr_mdt0 Filesystem device="/dev/xi_r_mdt0" directory="/lustre_t/mdt0" fstype="lustre"

Configure the cluster to start rr_mdt0 and fsr_mdt0 at the same node ONLY:

node26# pcs -f fs_cfg constraint colocation add rr_mdt0 with fsr_mdt0 INFINITY

Configure the cluster to start fsr_mdt0 only after rr_mdt0:

node26# pcs -f fs_cfg constraint order rr_mdt0 then fsr_mdt0

Configure other resources in the same way:

node26# pcs -f fs_cfg resource create rr_ost0 ocf:xraid:raid name=r_ost0 uuid=AB341442-20AC-43B1-8FE6-F9ED99D1D6C0 op monitor interval=5s meta migration-threshold=1
node26# pcs -f fs_cfg constraint location rr_ost0 prefers node26-ic=50
node26# pcs -f fs_cfg resource create fsr_ost0 Filesystem device="/dev/xi_r_ost0" directory="/lustre_t/ost0" fstype="lustre"
node26# pcs -f fs_cfg constraint colocation add rr_ost0 with fsr_ost0 INFINITY
node26# pcs -f fs_cfg constraint order rr_ost0 then fsr_ost0

node26# pcs -f fs_cfg resource create rr_ost1 ocf:xraid:raid name=r_ost1 uuid=1441D09C-0073-4555-A398-71984E847F9E op monitor interval=5s meta migration-threshold=1
node26# pcs -f fs_cfg constraint location rr_ost1 prefers node27-ic=50
node26# pcs -f fs_cfg resource create fsr_ost1 Filesystem device="/dev/xi_r_ost1" directory="/lustre_t/ost1" fstype="lustre"
node26# pcs -f fs_cfg constraint colocation add rr_ost1 with fsr_ost1 INFINITY
node26# pcs -f fs_cfg constraint order rr_ost1 then fsr_ost1

node26# pcs -f fs_cfg resource create rr_ost2 ocf:xraid:raid name=r_ost2 uuid=0E225812-6877-4344-A552-B6A408EC7351 op monitor interval=5s meta migration-threshold=1
node26# pcs -f fs_cfg constraint location rr_ost2 prefers node26-ic=50
node26# pcs -f fs_cfg resource create fsr_ost2 Filesystem device="/dev/xi_r_ost2" directory="/lustre_t/ost2" fstype="lustre"
node26# pcs -f fs_cfg constraint colocation add rr_ost2 with fsr_ost2 INFINITY
node26# pcs -f fs_cfg constraint order rr_ost2 then fsr_ost2

node26# pcs -f fs_cfg resource create rr_ost3 ocf:xraid:raid name=r_ost3 uuid=F749B8A7-3CC4-45A9-A61E-E75EDBB3A53E op monitor interval=5s meta migration-threshold=1
node26# pcs -f fs_cfg constraint location rr_ost3 prefers node27-ic=50
node26# pcs -f fs_cfg resource create fsr_ost3 Filesystem device="/dev/xi_r_ost3" directory="/lustre_t/ost3" fstype="lustre"
node26# pcs -f fs_cfg constraint colocation add rr_ost3 with fsr_ost3 INFINITY
node26# pcs -f fs_cfg constraint order rr_ost3 then fsr_ost3

xiRAID Classic 4.1 requires that only one RAID be starting at any given time, so we define the following Serialize constraints. This limitation is planned for removal in xiRAID Classic 4.2.

node26# pcs -f fs_cfg constraint order start rr_mdt0 then start rr_ost0 kind=Serialize
node26# pcs -f fs_cfg constraint order start rr_mdt0 then start rr_ost1 kind=Serialize
node26# pcs -f fs_cfg constraint order start rr_mdt0 then start rr_ost2 kind=Serialize
node26# pcs -f fs_cfg constraint order start rr_mdt0 then start rr_ost3 kind=Serialize
node26# pcs -f fs_cfg constraint order start rr_ost0 then start rr_ost1 kind=Serialize
node26# pcs -f fs_cfg constraint order start rr_ost0 then start rr_ost2 kind=Serialize
node26# pcs -f fs_cfg constraint order start rr_ost0 then start rr_ost3 kind=Serialize
node26# pcs -f fs_cfg constraint order start rr_ost1 then start rr_ost2 kind=Serialize
node26# pcs -f fs_cfg constraint order start rr_ost1 then start rr_ost3 kind=Serialize
node26# pcs -f fs_cfg constraint order start rr_ost2 then start rr_ost3 kind=Serialize
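
If more RAIDs are added later, the number of pairwise Serialize constraints grows quickly, so it can be convenient to generate them with a small shell loop instead of typing each one by hand. The following sketch is not part of the original setup; it simply reproduces the ten constraints above, assuming the rr_* resource names used in this article:

RAIDS="rr_mdt0 rr_ost0 rr_ost1 rr_ost2 rr_ost3"
set -- $RAIDS
while [ $# -gt 1 ]; do
  first=$1; shift
  # serialize the start of every remaining RAID after the current one
  for other in "$@"; do
    pcs -f fs_cfg constraint order start "$first" then start "$other" kind=Serialize
  done
done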

To ensure the Lustre servers start in the proper order, we configure the cluster to start the MDS before all the OSSes. Since the kernel starts the MDS and OSS services automatically when the corresponding LDISKFS targets are mounted, we only need to set the proper start order for the fsr_* resources:

node26# pcs -f fs_cfg constraint order fsr_mdt0 then fsr_ost0
node26# pcs -f fs_cfg constraint order fsr_mdt0 then fsr_ost1
node26# pcs -f fs_cfg constraint order fsr_mdt0 then fsr_ost2
node26# pcs -f fs_cfg constraint order fsr_mdt0 then fsr_ost3
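
Before pushing the changes, the batched configuration can be reviewed directly in the offline CIB copy; for example (pcs 0.10 subcommands, as shipped with Rocky Linux 8):

node26# pcs -f fs_cfg constraint
node26# pcs -f fs_cfg resource config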

Applying the batched cluster information base changes:

node26# pcs cluster cib-push fs_cfg --config

Checking the resulting cluster configuration:

node26# pcs status
Cluster name: lustrebox0
Cluster Summary:
  * Stack: corosync (Pacemaker is running)
  * Current DC: node26-ic (version 2.1.7-5.el8_10-0f7f88312) - partition with quorum
  * Last updated: Tue Jul 23 02:14:54 2024 on node26-ic
  * Last change:  Tue Jul 23 02:14:50 2024 by root via root on node26-ic
  * 2 nodes configured
  * 12 resource instances configured

Node List:
  * Online: [ node26-ic node27-ic ]

Full List of Resources:
  * node27.stonith      (stonith:fence_ipmilan):         Started node26-ic
  * node26.stonith      (stonith:fence_ipmilan):         Started node27-ic
  * rr_mdt0             (ocf::xraid:raid):               Started node26-ic
  * fsr_mdt0            (ocf::heartbeat:Filesystem):     Started node26-ic
  * rr_ost0             (ocf::xraid:raid):               Started node26-ic
  * fsr_ost0            (ocf::heartbeat:Filesystem):     Started node26-ic
  * rr_ost1             (ocf::xraid:raid):               Started node27-ic
  * fsr_ost1            (ocf::heartbeat:Filesystem):     Started node27-ic
  * rr_ost2             (ocf::xraid:raid):               Started node26-ic
  * fsr_ost2            (ocf::heartbeat:Filesystem):     Started node26-ic
  * rr_ost3             (ocf::xraid:raid):               Started node27-ic
  * fsr_ost3            (ocf::heartbeat:Filesystem):     Started node27-ic

Daemon Status:
  corosync: active/disabled
  pacemaker: active/disabled
  pcsd: active/enabled

Double-check on both nodes that the RAIDs are active and the filesystems are mounted properly. Please note that we have all the OST RAIDs based on /dev/nvme*n1 active on the first node (node26) and all the OST RAIDs based on /dev/nvme*n2 on the second one (node27), which will help us utilize the full NVMe throughput as planned.

node26:

node26# xicli raid show
╔RAIDs═══╦══════════════════╦═════════════╦════════════════════════╦══════╗
║ name   ║ static           ║ state       ║ devices                ║ info ║
╠════════╬══════════════════╬═════════════╬════════════════════════╬══════╣
║ r_mdt0 ║ size: 3576 GiB   ║ online      ║ 0 /dev/nvme0n1 online  ║      ║
║        ║ level: 1         ║ initialized ║ 1 /dev/nvme1n1 online  ║      ║
║        ║ strip_size: 16   ║             ║                        ║      ║
║        ║ block_size: 4096 ║             ║                        ║      ║
║        ║ sparepool: -     ║             ║                        ║      ║
║        ║ active: True     ║             ║                        ║      ║
║        ║ config: True     ║             ║                        ║      ║
╠════════╬══════════════════╬═════════════╬════════════════════════╬══════╣
║ r_ost0 ║ size: 14302 GiB  ║ online      ║ 0 /dev/nvme4n1 online  ║      ║
║        ║ level: 6         ║ initialized ║ 1 /dev/nvme5n1 online  ║      ║
║        ║ strip_size: 128  ║             ║ 2 /dev/nvme6n1 online  ║      ║
║        ║ block_size: 4096 ║             ║ 3 /dev/nvme7n1 online  ║      ║
║        ║ sparepool: -     ║             ║ 4 /dev/nvme8n1 online  ║      ║
║        ║ active: True     ║             ║ 5 /dev/nvme9n1 online  ║      ║
║        ║ config: True     ║             ║ 6 /dev/nvme10n1 online ║      ║
║        ║                  ║             ║ 7 /dev/nvme11n1 online ║      ║
║        ║                  ║             ║ 8 /dev/nvme13n1 online ║      ║
║        ║                  ║             ║ 9 /dev/nvme14n1 online ║      ║
╠════════╬══════════════════╬═════════════╬════════════════════════╬══════╣
║ r_ost1 ║ size: 14302 GiB  ║ None        ║                        ║      ║
║        ║ level: 6         ║             ║                        ║      ║
║        ║ strip_size: 128  ║             ║                        ║      ║
║        ║ block_size: 4096 ║             ║                        ║      ║
║        ║ sparepool: -     ║             ║                        ║      ║
║        ║ active: False    ║             ║                        ║      ║
║        ║ config: True     ║             ║                        ║      ║
╠════════╬══════════════════╬═════════════╬════════════════════════╬══════╣
║ r_ost2 ║ size: 14302 GiB  ║ online      ║ 0 /dev/nvme15n1 online ║      ║
║        ║ level: 6         ║ initialized ║ 1 /dev/nvme16n1 online ║      ║
║        ║ strip_size: 128  ║             ║ 2 /dev/nvme17n1 online ║      ║
║        ║ block_size: 4096 ║             ║ 3 /dev/nvme18n1 online ║      ║
║        ║ sparepool: -     ║             ║ 4 /dev/nvme20n1 online ║      ║
║        ║ active: True     ║             ║ 5 /dev/nvme21n1 online ║      ║
║        ║ config: True     ║             ║ 6 /dev/nvme22n1 online ║      ║
║        ║                  ║             ║ 7 /dev/nvme23n1 online ║      ║
║        ║                  ║             ║ 8 /dev/nvme24n1 online ║      ║
║        ║                  ║             ║ 9 /dev/nvme25n1 online ║      ║
╠════════╬══════════════════╬═════════════╬════════════════════════╬══════╣
║ r_ost3 ║ size: 14302 GiB  ║ None        ║                        ║      ║
║        ║ level: 6         ║             ║                        ║      ║
║        ║ strip_size: 128  ║             ║                        ║      ║
║        ║ block_size: 4096 ║             ║                        ║      ║
║        ║ sparepool: -     ║             ║                        ║      ║
║        ║ active: False    ║             ║                        ║      ║
║        ║ config: True     ║             ║                        ║      ║
╚════════╩══════════════════╩═════════════╩════════════════════════╩══════╝

node26# df -h|grep xi
/dev/xi_r_mdt0       2.1T  5.7M  2.0T   1% /lustre_t/mdt0
/dev/xi_r_ost0        14T  1.3M   14T   1% /lustre_t/ost0
/dev/xi_r_ost2        14T  1.3M   14T   1% /lustre_t/ost2

node27:

node27# xicli raid show
╔RAIDs═══╦══════════════════╦═════════════╦════════════════════════╦══════╗
║ name   ║ static           ║ state       ║ devices                ║ info ║
╠════════╬══════════════════╬═════════════╬════════════════════════╬══════╣
║ r_mdt0 ║ size: 3576 GiB   ║ None        ║                        ║      ║
║        ║ level: 1         ║             ║                        ║      ║
║        ║ strip_size: 16   ║             ║                        ║      ║
║        ║ block_size: 4096 ║             ║                        ║      ║
║        ║ sparepool: -     ║             ║                        ║      ║
║        ║ active: False    ║             ║                        ║      ║
║        ║ config: True     ║             ║                        ║      ║
╠════════╬══════════════════╬═════════════╬════════════════════════╬══════╣
║ r_ost0 ║ size: 14302 GiB  ║ None        ║                        ║      ║
║        ║ level: 6         ║             ║                        ║      ║
║        ║ strip_size: 128  ║             ║                        ║      ║
║        ║ block_size: 4096 ║             ║                        ║      ║
║        ║ sparepool: -     ║             ║                        ║      ║
║        ║ active: False    ║             ║                        ║      ║
║        ║ config: True     ║             ║                        ║      ║
╠════════╬══════════════════╬═════════════╬════════════════════════╬══════╣
║ r_ost1 ║ size: 14302 GiB  ║ online      ║ 0 /dev/nvme4n2 online  ║      ║
║        ║ level: 6         ║ initialized ║ 1 /dev/nvme5n2 online  ║      ║
║        ║ strip_size: 128  ║             ║ 2 /dev/nvme6n2 online  ║      ║
║        ║ block_size: 4096 ║             ║ 3 /dev/nvme7n2 online  ║      ║
║        ║ sparepool: -     ║             ║ 4 /dev/nvme8n2 online  ║      ║
║        ║ active: True     ║             ║ 5 /dev/nvme9n2 online  ║      ║
║        ║ config: True     ║             ║ 6 /dev/nvme10n2 online ║      ║
║        ║                  ║             ║ 7 /dev/nvme11n2 online ║      ║
║        ║                  ║             ║ 8 /dev/nvme13n2 online ║      ║
║        ║                  ║             ║ 9 /dev/nvme14n2 online ║      ║
╠════════╬══════════════════╬═════════════╬════════════════════════╬══════╣
║ r_ost2 ║ size: 14302 GiB  ║ None        ║                        ║      ║
║        ║ level: 6         ║             ║                        ║      ║
║        ║ strip_size: 128  ║             ║                        ║      ║
║        ║ block_size: 4096 ║             ║                        ║      ║
║        ║ sparepool: -     ║             ║                        ║      ║
║        ║ active: False    ║             ║                        ║      ║
║        ║ config: True     ║             ║                        ║      ║
╠════════╬══════════════════╬═════════════╬════════════════════════╬══════╣
║ r_ost3 ║ size: 14302 GiB  ║ online      ║ 0 /dev/nvme15n2 online ║      ║
║        ║ level: 6         ║ initialized ║ 1 /dev/nvme16n2 online ║      ║
║        ║ strip_size: 128  ║             ║ 2 /dev/nvme17n2 online ║      ║
║        ║ block_size: 4096 ║             ║ 3 /dev/nvme18n2 online ║      ║
║        ║ sparepool: -     ║             ║ 4 /dev/nvme20n2 online ║      ║
║        ║ active: True     ║             ║ 5 /dev/nvme21n2 online ║      ║
║        ║ config: True     ║             ║ 6 /dev/nvme22n2 online ║      ║
║        ║                  ║             ║ 7 /dev/nvme23n2 online ║      ║
║        ║                  ║             ║ 8 /dev/nvme24n2 online ║      ║
║        ║                  ║             ║ 9 /dev/nvme25n2 online ║      ║
╚════════╩══════════════════╩═════════════╩════════════════════════╩══════╝

node27# df -h|grep xi
/dev/xi_r_ost1        14T  1.3M   14T   1% /lustre_t/ost1
/dev/xi_r_ost3        14T  1.3M   14T   1% /lustre_t/ost3

Lustre performance tuning

Here we set several parameters for performance optimisation. All the commands have to be run on the host where the MDS is currently running.

Server-side parameters:

# OSTs: 16MB bulk RPCs
node26# lctl set_param -P obdfilter.*.brw_size=16
node26# lctl set_param -P obdfilter.*.precreate_batch=1024
# Clients: 16MB RPCs
node26# lctl set_param -P obdfilter.*.osc.max_pages_per_rpc=4096
node26# lctl set_param -P osc.*.max_pages_per_rpc=4096
# Clients: 128 RPCs in flight
node26# lctl set_param -P mdc.*.max_rpcs_in_flight=128
node26# lctl set_param -P osc.*.max_rpcs_in_flight=128
node26# lctl set_param -P mdc.*.max_mod_rpcs_in_flight=127
# Clients: Disable memory and wire checksums (checksumming costs ~20% of performance)
node26# lctl set_param -P llite.*.checksum_pages=0
node26# lctl set_param -P llite.*.checksums=0
node26# lctl set_param -P osc.*.checksums=0
node26# lctl set_param -P mdc.*.checksums=0

These parameters are optimised for the best performance on this system. They are not universal and may not be optimal in other environments or for other workloads.
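
To confirm that the persistent settings were applied, the values can be read back with lctl get_param; for example, on the OSS side (the osc.*, mdc.*, and llite.* client-side values can be checked the same way on a mounted client):

node26# lctl get_param obdfilter.*.brw_size obdfilter.*.precreate_batch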

Tests

Testbed description

Lustre client systems:

The Lustre client systems are 4 servers in identical configurations connected to the same InfiniBand switch (Mellanox Quantum HDR Edge Switch QM8700) as the SBB system nodes (the cluster nodes). The Lustre client parameters are tuned for the best performance; these parameter changes are commonly accepted in the Lustre community for modern high-performance benchmarks. More details are provided in the table below:

Hostname lclient00 lclient01 lclient02 lclient03
CPU AMD EPYC 7502 32-Core AMD EPYC 7502 32-Core AMD EPYC 7502 32-Core AMD EPYC 7502 32-Core
Memory 256GB 256GB 256GB 256GB
OS drives INTEL SSDPEKKW256G8 INTEL SSDPEKKW256G8 INTEL SSDPEKKW256G8 INTEL SSDPEKKW256G8
OS Rocky Linux 8.7 Rocky Linux 8.7 Rocky Linux 8.7 Rocky Linux 8.7
Management NIC 192.168.65.50 192.168.65.52 192.168.65.54 192.168.65.56
Infiniband LNET HDR 100.100.100.50 100.100.100.52 100.100.100.54 100.100.100.56

The Lustre clients are combined into a simple OpenMPI cluster, and the standard parallel filesystem benchmark, IOR, is used to run the tests. The test files are created in the stripe4M subfolder, which was created on the Lustre filesystem with the following striping parameters:

lclient01# mount -t lustre 100.100.100.26@o2ib:100.100.100.27@o2ib:/lustre0 /mnt.l
lclient01# mkdir /mnt.l/stripe4M
lclient01# lfs setstripe -c -1 -S 4M /mnt.l/stripe4M/
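
The resulting default layout of the directory can be verified with lfs getstripe (output omitted here):

lclient01# lfs getstripe -d /mnt.l/stripe4M/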

Test results

We used the standard parallel filesystem benchmark IOR to measure the performance of the installation. We ran 4 tests, each started with 128 threads spread across the 4 clients. The tests differ in transfer size (1M and 128M) and in whether directIO is used.
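
The mpirun commands below reference a hostfile ./hfile whose contents are not shown here; a minimal version consistent with 128 tasks spread across the 4 clients (32 per node, matching the tPN=32 column in the IOR summaries) could look like this:

# ./hfile - illustrative OpenMPI hostfile, not taken from the original setup
lclient00 slots=32
lclient01 slots=32
lclient02 slots=32
lclient03 slots=32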

Normal state cluster performance

Tests with directIO enabled

The following listing shows the test command and results for the directIO-enabled test with a transfer size of 1MB.

lclient01# /usr/lib64/openmpi/bin/mpirun --allow-run-as-root --hostfile ./hfile -np 128 --map-by node /usr/bin/ior -F -t 1M -b 8G  -k -r -w -o /mnt.l/stripe4M/testfile --posix.odirect

. . . 

access    bw(MiB/s)  IOPS       Latency(s)  block(KiB) xfer(KiB)  open(s)    wr/rd(s)   close(s)   total(s)   iter
------    ---------  ----       ----------  ---------- ---------  --------   --------   --------   --------   ----
write     19005      19005      0.006691    8388608    1024.00    0.008597   55.17      3.92       55.17      0
read      82075      82077      0.001545    8388608    1024.00    0.002592   12.78      0.213460   12.78      0
Max Write: 19005.04 MiB/sec (19928.23 MB/sec)
Max Read:  82075.33 MiB/sec (86062.22 MB/sec)

Summary of all tests:
Operation   Max(MiB)   Min(MiB)  Mean(MiB)     StdDev   Max(OPs)   Min(OPs)  Mean(OPs)     StdDev    Mean(s) Stonewall(s) Stonewall(MiB) Test# #Tasks tPN reps fPP reord reordoff reordrand seed segcnt   blksiz    xsize aggs(MiB)   API RefNum
write       19005.04   19005.04   19005.04       0.00   19005.04   19005.04   19005.04       0.00   55.17357         NA            NA     0    128  32    1   1     0        1         0    0      1 8589934592  1048576 1048576.0 POSIX      0
read        82075.33   82075.33   82075.33       0.00   82075.33   82075.33   82075.33       0.00   12.77578         NA            NA     0    128  32    1   1     0        1         0    0      1 8589934592  1048576 1048576.0 POSIX      0

The following listing shows the test command and results for the directIO-enabled test with a transfer size of 128MB.

lclient01# /usr/lib64/openmpi/bin/mpirun --allow-run-as-root --hostfile ./hfile -np 128 --map-by node /usr/bin/ior -F -t 128M -b 8G  -k -r -w -o /mnt.l/stripe4M/testfile --posix.odirect

. . .

access    bw(MiB/s)  IOPS       Latency(s)  block(KiB) xfer(KiB)  open(s)    wr/rd(s)   close(s)   total(s)   iter
------    ---------  ----       ----------  ---------- ---------  --------   --------   --------   --------   ----
write     52892      413.23     0.306686    8388608    131072     0.096920   19.82      0.521081   19.82      0
read      70588      551.50     0.229853    8388608    131072     0.002983   14.85      0.723477   14.85      0
Max Write: 52892.27 MiB/sec (55461.56 MB/sec)
Max Read:  70588.32 MiB/sec (74017.22 MB/sec)

Summary of all tests:
Operation   Max(MiB)   Min(MiB)  Mean(MiB)     StdDev   Max(OPs)   Min(OPs)  Mean(OPs)     StdDev    Mean(s) Stonewall(s) Stonewall(MiB) Test# #Tasks tPN reps fPP reord reordoff reordrand seed segcnt   blksiz    xsize aggs(MiB)   API RefNum
write       52892.27   52892.27   52892.27       0.00     413.22     413.22     413.22       0.00   19.82475         NA            NA     0    128  32    1   1     0        1         0    0      1 8589934592 134217728 1048576.0 POSIX      0
read        70588.32   70588.32   70588.32       0.00     551.47     551.47     551.47       0.00   14.85481         NA            NA     0    128  32    1   1     0        1         0    0      1 8589934592 134217728 1048576.0 POSIX      0

Tests with directIO disabled

The following listing shows the test command and results for the buffered IO (directIO disabled) test with a transfer size of 1MB.

lclient01#  /usr/lib64/openmpi/bin/mpirun --allow-run-as-root --hostfile ./hfile -np 128 --map-by node /usr/bin/ior -F -t 1M -b 8G  -k -r -w -o /mnt.l/stripe4M/testfile

. . .

access    bw(MiB/s)  IOPS       Latency(s)  block(KiB) xfer(KiB)  open(s)    wr/rd(s)   close(s)   total(s)   iter
------    ---------  ----       ----------  ---------- ---------  --------   --------   --------   --------   ----
write     48202      48204      0.002587    8388608    1024.00    0.008528   21.75      1.75       21.75      0
read      40960      40960      0.002901    8388608    1024.00    0.002573   25.60      2.39       25.60      0
Max Write: 48202.43 MiB/sec (50543.91 MB/sec)
Max Read:  40959.57 MiB/sec (42949.22 MB/sec)

Summary of all tests:
Operation   Max(MiB)   Min(MiB)  Mean(MiB)     StdDev   Max(OPs)   Min(OPs)  Mean(OPs)     StdDev    Mean(s) Stonewall(s) Stonewall(MiB) Test# #Tasks tPN reps fPP reord reordoff reordrand seed segcnt   blksiz    xsize aggs(MiB)   API RefNum
write       48202.43   48202.43   48202.43       0.00   48202.43   48202.43   48202.43       0.00   21.75359         NA            NA     0    128  32    1   1     0        1         0    0      1 8589934592  1048576 1048576.0 POSIX      0
read        40959.57   40959.57   40959.57       0.00   40959.57   40959.57   40959.57       0.00   25.60027         NA            NA     0    128  32    1   1     0        1         0    0      1 8589934592  1048576 1048576.0 POSIX      0

The following listing shows the test command and results for the buffered IO (directIO disabled) test with a transfer size of 128MB.

lclient01#  /usr/lib64/openmpi/bin/mpirun --allow-run-as-root --hostfile ./hfile -np 128 --map-by node /usr/bin/ior -F -t 128M -b 8G  -k -r -w -o /mnt.l/stripe4M/testfile

. . .

access    bw(MiB/s)  IOPS       Latency(s)  block(KiB) xfer(KiB)  open(s)    wr/rd(s)   close(s)   total(s)   iter
------    ---------  ----       ----------  ---------- ---------  --------   --------   --------   --------   ----
write     46315      361.84     0.349582    8388608    131072     0.009255   22.64      2.70       22.64      0
read      39435      308.09     0.368192    8388608    131072     0.002689   26.59      7.65       26.59      0
Max Write: 46314.67 MiB/sec (48564.45 MB/sec)
Max Read:  39434.54 MiB/sec (41350.12 MB/sec)

Summary of all tests:
Operation   Max(MiB)   Min(MiB)  Mean(MiB)     StdDev   Max(OPs)   Min(OPs)  Mean(OPs)     StdDev    Mean(s) Stonewall(s) Stonewall(MiB) Test# #Tasks tPN reps fPP reord reordoff reordrand seed segcnt   blksiz    xsize aggs(MiB)   API RefNum
write       46314.67   46314.67   46314.67       0.00     361.83     361.83     361.83       0.00   22.64026         NA            NA     0    128  32    1   1     0        1         0    0      1 8589934592 134217728 1048576.0 POSIX      0
read        39434.54   39434.54   39434.54       0.00     308.08     308.08     308.08       0.00   26.59029         NA            NA     0    128  32    1   1     0        1         0    0      1 8589934592 134217728 1048576.0 POSIX      0

Failover behavior

To check the cluster behavior in case of a node failure, we will crash a node to simulate such a failure. Before the failure simulation, let's check the normal cluster state:

# pcs status
Cluster name: lustrebox0
Cluster Summary:
  * Stack: corosync (Pacemaker is running)
  * Current DC: node27-ic (version 2.1.7-5.el8_10-0f7f88312) - partition with quorum
  * Last updated: Tue Aug 13 19:13:23 2024 on node26-ic
  * Last change:  Tue Aug 13 19:13:18 2024 by hacluster via hacluster on node27-ic
  * 2 nodes configured
  * 12 resource instances configured

Node List:
  * Online: [ node26-ic node27-ic ]

Full List of Resources:
  * rr_mdt0             (ocf::xraid:raid):               Started node26-ic
  * fsr_mdt0            (ocf::heartbeat:Filesystem):     Started node26-ic
  * rr_ost0             (ocf::xraid:raid):               Started node26-ic
  * fsr_ost0            (ocf::heartbeat:Filesystem):     Started node26-ic
  * rr_ost1             (ocf::xraid:raid):               Started node27-ic
  * fsr_ost1            (ocf::heartbeat:Filesystem):     Started node27-ic
  * rr_ost2             (ocf::xraid:raid):               Started node26-ic
  * fsr_ost2            (ocf::heartbeat:Filesystem):     Started node26-ic
  * rr_ost3             (ocf::xraid:raid):               Started node27-ic
  * fsr_ost3            (ocf::heartbeat:Filesystem):     Started node27-ic
  * node27.stonith      (stonith:fence_ipmilan):         Started node26-ic
  * node26.stonith      (stonith:fence_ipmilan):         Started node27-ic

Daemon Status:
  corosync: active/disabled
  pacemaker: active/disabled
  pcsd: active/enabled

Now let’s crash node26:

node26# echo c > /proc/sysrq-trigger

Here node27 detects that node26 is not responding and prepares to fence it.

node27# pcs status
Cluster name: lustrebox0
Cluster Summary:
  * Stack: corosync (Pacemaker is running)
  * Current DC: node27-ic (version 2.1.7-5.el8_10-0f7f88312) - partition with quorum
  * Last updated: Fri Aug 30 00:55:04 2024 on node27-ic
  * Last change:  Thu Aug 29 01:26:09 2024 by root via root on node26-ic
  * 2 nodes configured
  * 12 resource instances configured

Node List:
  * Node node26-ic: UNCLEAN (offline)
  * Online: [ node27-ic ]

Full List of Resources:
  * rr_mdt0             (ocf::xraid:raid):               Started node26-ic (UNCLEAN)
  * fsr_mdt0            (ocf::heartbeat:Filesystem):     Started node26-ic (UNCLEAN)
  * rr_ost0             (ocf::xraid:raid):               Started node26-ic (UNCLEAN)
  * fsr_ost0            (ocf::heartbeat:Filesystem):     Started node26-ic (UNCLEAN)
  * rr_ost1             (ocf::xraid:raid):               Started node27-ic
  * fsr_ost1            (ocf::heartbeat:Filesystem):     Stopped
  * rr_ost2             (ocf::xraid:raid):               Started node26-ic (UNCLEAN)
  * fsr_ost2            (ocf::heartbeat:Filesystem):     Started node26-ic (UNCLEAN)
  * rr_ost3             (ocf::xraid:raid):               Started node27-ic
  * fsr_ost3            (ocf::heartbeat:Filesystem):     Stopping node27-ic
  * node27.stonith      (stonith:fence_ipmilan):         Started node26-ic (UNCLEAN)
  * node26.stonith      (stonith:fence_ipmilan):         Started node27-ic

Pending Fencing Actions:
  * reboot of node26-ic pending: client=pacemaker-controld.286449, origin=node27-ic

Daemon Status:
  corosync: active/disabled
  pacemaker: active/disabled
  pcsd: active/enabled

Once node26 has been successfully fenced, all the cluster resources are brought up on node27.

During the experiment, the cluster needed about 1 minute 50 seconds to detect node26's absence, fence it, and start all the services in the required sequence on the surviving node27.

node27# pcs status
Cluster name: lustrebox0
Cluster Summary:
  * Stack: corosync (Pacemaker is running)
  * Current DC: node27-ic (version 2.1.7-5.el8_10-0f7f88312) - partition with quorum
  * Last updated: Fri Aug 30 00:56:30 2024 on node27-ic
  * Last change:  Thu Aug 29 01:26:09 2024 by root via root on node26-ic
  * 2 nodes configured
  * 12 resource instances configured

Node List:
  * Online: [ node27-ic ]
  * OFFLINE: [ node26-ic ]

Full List of Resources:
  * rr_mdt0             (ocf::xraid:raid):               Started node27-ic
  * fsr_mdt0            (ocf::heartbeat:Filesystem):     Started node27-ic
  * rr_ost0             (ocf::xraid:raid):               Started node27-ic
  * fsr_ost0            (ocf::heartbeat:Filesystem):     Starting node27-ic
  * rr_ost1             (ocf::xraid:raid):               Started node27-ic
  * fsr_ost1            (ocf::heartbeat:Filesystem):     Started node27-ic
  * rr_ost2             (ocf::xraid:raid):               Started node27-ic
  * fsr_ost2            (ocf::heartbeat:Filesystem):     Starting node27-ic
  * rr_ost3             (ocf::xraid:raid):               Started node27-ic
  * fsr_ost3            (ocf::heartbeat:Filesystem):     Started node27-ic
  * node27.stonith      (stonith:fence_ipmilan):         Stopped
  * node26.stonith      (stonith:fence_ipmilan):         Started node27-ic

Daemon Status:
  corosync: active/disabled
  pacemaker: active/disabled
  pcsd: active/enabled

Since node26 was not shut down cleanly, the RAIDs that migrated to node27 are being reinitialized to protect against the write hole. This is the expected behaviour:

node27# xicli raid show
╔RAIDs═══╦══════════════════╦═════════════╦════════════════════════╦═══════════════════╗
║ name   ║ static           ║ state       ║ devices                ║ info              ║
╠════════╬══════════════════╬═════════════╬════════════════════════╬═══════════════════╣
║ r_mdt0 ║ size: 3576 GiB   ║ online      ║ 0 /dev/nvme0n1 online  ║                   ║
║        ║ level: 1         ║ initialized ║ 1 /dev/nvme1n1 online  ║                   ║
║        ║ strip_size: 16   ║             ║                        ║                   ║
║        ║ block_size: 4096 ║             ║                        ║                   ║
║        ║ sparepool: -     ║             ║                        ║                   ║
║        ║ active: True     ║             ║                        ║                   ║
║        ║ config: True     ║             ║                        ║                   ║
╠════════╬══════════════════╬═════════════╬════════════════════════╬═══════════════════╣
║ r_ost0 ║ size: 14302 GiB  ║ online      ║ 0 /dev/nvme4n1 online  ║ init_progress: 31 ║
║        ║ level: 6         ║ initing     ║ 1 /dev/nvme5n1 online  ║                   ║
║        ║ strip_size: 128  ║             ║ 2 /dev/nvme6n1 online  ║                   ║
║        ║ block_size: 4096 ║             ║ 3 /dev/nvme7n1 online  ║                   ║
║        ║ sparepool: -     ║             ║ 4 /dev/nvme8n1 online  ║                   ║
║        ║ active: True     ║             ║ 5 /dev/nvme9n1 online  ║                   ║
║        ║ config: True     ║             ║ 6 /dev/nvme10n1 online ║                   ║
║        ║                  ║             ║ 7 /dev/nvme11n1 online ║                   ║
║        ║                  ║             ║ 8 /dev/nvme13n1 online ║                   ║
║        ║                  ║             ║ 9 /dev/nvme14n1 online ║                   ║
╠════════╬══════════════════╬═════════════╬════════════════════════╬═══════════════════╣
║ r_ost1 ║ size: 14302 GiB  ║ online      ║ 0 /dev/nvme4n2 online  ║                   ║
║        ║ level: 6         ║ initialized ║ 1 /dev/nvme5n2 online  ║                   ║
║        ║ strip_size: 128  ║             ║ 2 /dev/nvme6n2 online  ║                   ║
║        ║ block_size: 4096 ║             ║ 3 /dev/nvme7n2 online  ║                   ║
║        ║ sparepool: -     ║             ║ 4 /dev/nvme8n2 online  ║                   ║
║        ║ active: True     ║             ║ 5 /dev/nvme9n2 online  ║                   ║
║        ║ config: True     ║             ║ 6 /dev/nvme10n2 online ║                   ║
║        ║                  ║             ║ 7 /dev/nvme11n2 online ║                   ║
║        ║                  ║             ║ 8 /dev/nvme13n2 online ║                   ║
║        ║                  ║             ║ 9 /dev/nvme14n2 online ║                   ║
╠════════╬══════════════════╬═════════════╬════════════════════════╬═══════════════════╣
║ r_ost2 ║ size: 14302 GiB  ║ online      ║ 0 /dev/nvme15n1 online ║ init_progress: 29 ║
║        ║ level: 6         ║ initing     ║ 1 /dev/nvme16n1 online ║                   ║
║        ║ strip_size: 128  ║             ║ 2 /dev/nvme17n1 online ║                   ║
║        ║ block_size: 4096 ║             ║ 3 /dev/nvme18n1 online ║                   ║
║        ║ sparepool: -     ║             ║ 4 /dev/nvme20n1 online ║                   ║
║        ║ active: True     ║             ║ 5 /dev/nvme21n1 online ║                   ║
║        ║ config: True     ║             ║ 6 /dev/nvme22n1 online ║                   ║
║        ║                  ║             ║ 7 /dev/nvme23n1 online ║                   ║
║        ║                  ║             ║ 8 /dev/nvme24n1 online ║                   ║
║        ║                  ║             ║ 9 /dev/nvme25n1 online ║                   ║
╠════════╬══════════════════╬═════════════╬════════════════════════╬═══════════════════╣
║ r_ost3 ║ size: 14302 GiB  ║ online      ║ 0 /dev/nvme15n2 online ║                   ║
║        ║ level: 6         ║ initialized ║ 1 /dev/nvme16n2 online ║                   ║
║        ║ strip_size: 128  ║             ║ 2 /dev/nvme17n2 online ║                   ║
║        ║ block_size: 4096 ║             ║ 3 /dev/nvme18n2 online ║                   ║
║        ║ sparepool: -     ║             ║ 4 /dev/nvme20n2 online ║                   ║
║        ║ active: True     ║             ║ 5 /dev/nvme21n2 online ║                   ║
║        ║ config: True     ║             ║ 6 /dev/nvme22n2 online ║                   ║
║        ║                  ║             ║ 7 /dev/nvme23n2 online ║                   ║
║        ║                  ║             ║ 8 /dev/nvme24n2 online ║                   ║
║        ║                  ║             ║ 9 /dev/nvme25n2 online ║                   ║
╚════════╩══════════════════╩═════════════╩════════════════════════╩═══════════════════╝

Failover state cluster performance

Now all the Lustre filesystem servers are running on the surviving node. In this configuration, we expect the performance to be roughly halved because all communication now goes through a single server. Other bottlenecks in this situation are:

  • Decreased NVMe performance: with only one server running, all the workload reaches each dual-ported NVMe drive through just 2 PCIe lanes (one port; see the quick check after this list);
  • Lack of CPU;
  • Lack of RAM.
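
As a quick check of the first bullet, the negotiated PCIe link width of a drive port can be read from sysfs. The controller name nvme4 below is only an example taken from the listings above; in this dual-ported configuration each port is expected to report a x2 link:

node27# cat /sys/class/nvme/nvme4/device/current_link_width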

Tests with directIO enabled

The following listing shows the test command and results for the directIO-enabled test with a transfer size of 1MB on the system with only one node working.

lclient01# /usr/lib64/openmpi/bin/mpirun --allow-run-as-root --hostfile ./hfile -np 128 --map-by node /usr/bin/ior -F -t 1M -b 8G  -k -r -w -o /mnt.l/stripe4M/testfile --posix.odirect

. . .

access    bw(MiB/s)  IOPS       Latency(s)  block(KiB) xfer(KiB)  open(s)    wr/rd(s)   close(s)   total(s)   iter
------    ---------  ----       ----------  ---------- ---------  --------   --------   --------   --------   ----
write     17185      17185      0.007389    8388608    1024.00    0.012074   61.02      2.86       61.02      0
read      45619      45620      0.002803    8388608    1024.00    0.003000   22.99      0.590771   22.99      0
Max Write: 17185.06 MiB/sec (18019.84 MB/sec)
Max Read:  45619.10 MiB/sec (47835.10 MB/sec)

Summary of all tests:
Operation   Max(MiB)   Min(MiB)  Mean(MiB)     StdDev   Max(OPs)   Min(OPs)  Mean(OPs)     StdDev    Mean(s) Stonewall(s) Stonewall(MiB) Test# #Tasks tPN reps fPP reord reordoff reordrand seed segcnt   blksiz    xsize aggs(MiB)   API RefNum
write       17185.06   17185.06   17185.06       0.00   17185.06   17185.06   17185.06       0.00   61.01671         NA            NA     0    128  32    1   1     0        1         0    0      1 8589934592  1048576 1048576.0 POSIX      0
read        45619.10   45619.10   45619.10       0.00   45619.10   45619.10   45619.10       0.00   22.98546         NA            NA     0    128  32    1   1     0        1         0    0      1 8589934592  1048576 1048576.0 POSIX      0

The following listing shows the test command and results for the directIO-enabled test with a transfer size of 128MB on the system with only one node working.

lclient01# /usr/lib64/openmpi/bin/mpirun --allow-run-as-root --hostfile ./hfile -np 128 --map-by node /usr/bin/ior -F -t 128M -b 8G  -k -r -w -o /mnt.l/stripe4M/testfile --posix.odirect

. . .

access    bw(MiB/s)  IOPS       Latency(s)  block(KiB) xfer(KiB)  open(s)    wr/rd(s)   close(s)   total(s)   iter
------    ---------  ----       ----------  ---------- ---------  --------   --------   --------   --------   ----
write     30129      235.39     0.524655    8388608    131072     0.798392   34.80      1.64       34.80      0
read      35731      279.15     0.455215    8388608    131072     0.002234   29.35      2.37       29.35      0
Max Write: 30129.26 MiB/sec (31592.82 MB/sec)
Max Read:  35730.91 MiB/sec (37466.57 MB/sec)

Summary of all tests:
Operation   Max(MiB)   Min(MiB)  Mean(MiB)     StdDev   Max(OPs)   Min(OPs)  Mean(OPs)     StdDev    Mean(s) Stonewall(s) Stonewall(MiB) Test# #Tasks tPN reps fPP reord reordoff reordrand seed segcnt   blksiz    xsize aggs(MiB)   API RefNum
write       30129.26   30129.26   30129.26       0.00     235.38     235.38     235.38       0.00   34.80258         NA            NA     0    128  32    1   1     0        1         0    0      1 8589934592 134217728 1048576.0 POSIX      0
read        35730.91   35730.91   35730.91       0.00     279.15     279.15     279.15       0.00   29.34647         NA            NA     0    128  32    1   1     0        1         0    0      1 8589934592 134217728 1048576.0 POSIX      0

Tests with directIO disabled

The following listing shows the test command and results for the buffered IO (directIO disabled) test with a transfer size of 1MB on the system with only one node working.

lclient01#  /usr/lib64/openmpi/bin/mpirun --allow-run-as-root --hostfile ./hfile -np 128 --map-by node /usr/bin/ior -F -t 1M -b 8G  -k -r -w -o /mnt.l/stripe4M/testfile

. . .

access    bw(MiB/s)  IOPS       Latency(s)  block(KiB) xfer(KiB)  open(s)    wr/rd(s)   close(s)   total(s)   iter
------    ---------  ----       ----------  ---------- ---------  --------   --------   --------   --------   ----
write     30967      31042      0.004072    8388608    1024.00    0.008509   33.78      7.55       33.86      0
read      38440      38441      0.003291    8388608    1024.00    0.282087   27.28      8.22       27.28      0
Max Write: 30966.96 MiB/sec (32471.21 MB/sec)
Max Read:  38440.06 MiB/sec (40307.32 MB/sec)

Summary of all tests:
Operation   Max(MiB)   Min(MiB)  Mean(MiB)     StdDev   Max(OPs)   Min(OPs)  Mean(OPs)     StdDev    Mean(s) Stonewall(s) Stonewall(MiB) Test# #Tasks tPN reps fPP reord reordoff reordrand seed segcnt   blksiz    xsize aggs(MiB)   API RefNum
write       30966.96   30966.96   30966.96       0.00   30966.96   30966.96   30966.96       0.00   33.86112         NA            NA     0    128  32    1   1     0        1         0    0      1 8589934592  1048576 1048576.0 POSIX      0
read        38440.06   38440.06   38440.06       0.00   38440.06   38440.06   38440.06       0.00   27.27821         NA            NA     0    128  32    1   1     0        1         0    0      1 8589934592  1048576 1048576.0 POSIX      0
Finished            : Thu Sep 12 03:18:41 2024

The following listing shows the test command and results for the buffered IO (directIO disabled) test with a transfer size of 128MB on the system with only one node working.

lclient01#  /usr/lib64/openmpi/bin/mpirun --allow-run-as-root --hostfile ./hfile -np 128 --map-by node /usr/bin/ior -F -t 128M -b 8G  -k -r -w -o /mnt.l/stripe4M/testfile

. . .

access    bw(MiB/s)  IOPS       Latency(s)  block(KiB) xfer(KiB)  open(s)    wr/rd(s)   close(s)   total(s)   iter
------    ---------  ----       ----------  ---------- ---------  --------   --------   --------   --------   ----
write     30728      240.72     0.515679    8388608    131072     0.010178   34.03      8.70       34.12      0
read      35974      281.05     0.386365    8388608    131072     0.067996   29.15      10.73      29.15      0
Max Write: 30727.85 MiB/sec (32220.49 MB/sec)
Max Read:  35974.24 MiB/sec (37721.72 MB/sec)

Summary of all tests:
Operation   Max(MiB)   Min(MiB)  Mean(MiB)     StdDev   Max(OPs)   Min(OPs)  Mean(OPs)     StdDev    Mean(s) Stonewall(s) Stonewall(MiB) Test# #Tasks tPN reps fPP reord reordoff reordrand seed segcnt   blksiz    xsize aggs(MiB)   API RefNum
write       30727.85   30727.85   30727.85       0.00     240.06     240.06     240.06       0.00   34.12461         NA            NA     0    128  32    1   1     0        1         0    0      1 8589934592 134217728 1048576.0 POSIX      0
read        35974.24   35974.24   35974.24       0.00     281.05     281.05     281.05       0.00   29.14797         NA            NA     0    128  32    1   1     0        1         0    0      1 8589934592 134217728 1048576.0 POSIX      0

Failback

Meanwhile, node26 has booted after the crash. In our configuration, the cluster software does not start automatically.

node26# pcs status
Error: error running crm_mon, is pacemaker running?
crm_mon: Connection to cluster failed: Connection refused

This behaviour can be useful in real life: before returning a node to the cluster, the administrator should identify, localize, and fix the problem to prevent it from recurring.
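
If automatic startup were preferred instead, the corosync and pacemaker services could be enabled on both nodes with a single pcs command (we deliberately keep them disabled in this setup):

node26# pcs cluster enable --all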

The cluster software keeps working properly on node27:

node27# pcs status
Cluster name: lustrebox0
Cluster Summary:
  * Stack: corosync (Pacemaker is running)
  * Current DC: node27-ic (version 2.1.7-5.el8_10-0f7f88312) - partition with quorum
  * Last updated: Sat Aug 31 01:13:57 2024 on node27-ic
  * Last change:  Thu Aug 29 01:26:09 2024 by root via root on node26-ic
  * 2 nodes configured
  * 12 resource instances configured

Node List:
  * Online: [ node27-ic ]
  * OFFLINE: [ node26-ic ]

Full List of Resources:
  * rr_mdt0             (ocf::xraid:raid):               Started node27-ic
  * fsr_mdt0            (ocf::heartbeat:Filesystem):     Started node27-ic
  * rr_ost0             (ocf::xraid:raid):               Started node27-ic
  * fsr_ost0            (ocf::heartbeat:Filesystem):     Started node27-ic
  * rr_ost1             (ocf::xraid:raid):               Started node27-ic
  * fsr_ost1            (ocf::heartbeat:Filesystem):     Started node27-ic
  * rr_ost2             (ocf::xraid:raid):               Started node27-ic
  * fsr_ost2            (ocf::heartbeat:Filesystem):     Started node27-ic
  * rr_ost3             (ocf::xraid:raid):               Started node27-ic
  * fsr_ost3            (ocf::heartbeat:Filesystem):     Started node27-ic
  * node27.stonith      (stonith:fence_ipmilan):         Stopped
  * node26.stonith      (stonith:fence_ipmilan):         Started node27-ic

Daemon Status:
  corosync: active/disabled
  pacemaker: active/disabled
  pcsd: active/enabled

Since we know the reason for the node26 crash, we start the cluster software there again:

node26# pcs cluster start
Starting Cluster...

After a short time, the cluster software starts and the resources that should run on node26 are moved back from node27. The failback process took about 30 seconds.
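
The automatic failback is driven by the location preferences (score 50) configured earlier, which outweigh resource stickiness in this configuration. If you would rather keep the resources on the surviving node until they are moved manually, a higher stickiness default could be set instead, for example:

node26# pcs resource defaults update resource-stickiness=100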

node26# pcs status
Cluster name: lustrebox0
Cluster Summary:
  * Stack: corosync (Pacemaker is running)
  * Current DC: node27-ic (version 2.1.7-5.el8_10-0f7f88312) - partition with quorum
  * Last updated: Sat Aug 31 01:15:03 2024 on node26-ic
  * Last change:  Thu Aug 29 01:26:09 2024 by root via root on node26-ic
  * 2 nodes configured
  * 12 resource instances configured

Node List:
  * Online: [ node26-ic node27-ic ]

Full List of Resources:
  * rr_mdt0             (ocf::xraid:raid):               Started node26-ic
  * fsr_mdt0            (ocf::heartbeat:Filesystem):     Started node26-ic
  * rr_ost0             (ocf::xraid:raid):               Started node26-ic
  * fsr_ost0            (ocf::heartbeat:Filesystem):     Started node26-ic
  * rr_ost1             (ocf::xraid:raid):               Started node27-ic
  * fsr_ost1            (ocf::heartbeat:Filesystem):     Started node27-ic
  * rr_ost2             (ocf::xraid:raid):               Started node26-ic
  * fsr_ost2            (ocf::heartbeat:Filesystem):     Started node26-ic
  * rr_ost3             (ocf::xraid:raid):               Started node27-ic
  * fsr_ost3            (ocf::heartbeat:Filesystem):     Started node27-ic
  * node27.stonith      (stonith:fence_ipmilan):         Started node26-ic
  * node26.stonith      (stonith:fence_ipmilan):         Started node27-ic

Daemon Status:
  corosync: active/disabled
  pacemaker: active/disabled
  pcsd: active/enabled

The same picture is visible from node27:

node27# pcs status
Cluster name: lustrebox0
Cluster Summary:
  * Stack: corosync (Pacemaker is running)
  * Current DC: node27-ic (version 2.1.7-5.el8_10-0f7f88312) - partition with quorum
  * Last updated: Sat Aug 31 01:15:40 2024 on node27-ic
  * Last change:  Thu Aug 29 01:26:09 2024 by root via root on node26-ic
  * 2 nodes configured
  * 12 resource instances configured

Node List:
  * Online: [ node26-ic node27-ic ]

Full List of Resources:
  * rr_mdt0             (ocf::xraid:raid):               Started node26-ic
  * fsr_mdt0            (ocf::heartbeat:Filesystem):     Started node26-ic
  * rr_ost0             (ocf::xraid:raid):               Started node26-ic
  * fsr_ost0            (ocf::heartbeat:Filesystem):     Started node26-ic
  * rr_ost1             (ocf::xraid:raid):               Started node27-ic
  * fsr_ost1            (ocf::heartbeat:Filesystem):     Started node27-ic
  * rr_ost2             (ocf::xraid:raid):               Started node26-ic
  * fsr_ost2            (ocf::heartbeat:Filesystem):     Started node26-ic
  * rr_ost3             (ocf::xraid:raid):               Started node27-ic
  * fsr_ost3            (ocf::heartbeat:Filesystem):     Started node27-ic
  * node27.stonith      (stonith:fence_ipmilan):         Started node26-ic
  * node26.stonith      (stonith:fence_ipmilan):         Started node27-ic

Daemon Status:
  corosync: active/disabled
  pacemaker: active/disabled
  pcsd: active/enabled

Conclusion

This article shows how to create a small, highly available, and high-performance Lustre installation based on an SBB system with dual-ported NVMe drives and the xiRAID Classic 4.1 RAID engine. It also demonstrates how easily xiRAID Classic integrates with Pacemaker clusters and how well it fits the classical approach to Lustre clustering.

The configuration is straightforward and requires the following software components to be installed and properly configured:

  • xiRAID Classic 4.1 and Csync2
  • Lustre software
  • Pacemaker software

The resulting system, based on the Viking VDS2249R SBB platform equipped with two single-CPU servers and 24 PCIe 4.0 NVMe drives, showed performance of up to 55GB/s on writes and up to 86GB/s on reads from the Lustre clients, using the standard parallel filesystem benchmark IOR.

This article, with minimal changes, can also be used to set up additional systems to expand existing Lustre installations.