This comprehensive guide demonstrates how to create a robust, high-performance Lustre file system using xiRAID Classic 4.1 and Pacemaker on an SBB platform. We'll walk through the entire process, from system layout and hardware configuration to software installation, cluster setup, and performance tuning. By leveraging dual-ported NVMe drives and advanced clustering techniques, we'll achieve a highly available storage solution capable of delivering impressive read and write speeds. Whether you're building a new Lustre installation or looking to expand an existing one, this article provides a detailed roadmap for creating a cutting-edge, fault-tolerant parallel file system suitable for demanding high-performance computing environments.
Contents
- System layout
- System configuration and tuning
- Software components installation
- HA cluster setup
- Csync2 configuration
- xiRAID Configuration for cluster setup
- xiRAID RAIDs creation
- Lustre setup
- Tests
System layout
xiRAID Classic 4.1 supports integration of RAIDs into Pacemaker-based HA clusters. This allows users who need to cluster their services to benefit from xiRAID Classic's performance and reliability.
This article describes using an NVMe SBB system (a single box with two x86-64 servers and a set of shared NVMe drives) as a basic Lustre parallel filesystem HA-cluster with data placed on clustered RAIDs based on xiRAID Classic 4.1.
This article will familiarize you with how to deploy xiRAID Classic for a real-life task.
Lustre server SBB Platform
We will use Viking VDS2249R as the SBB platform. The configuration details are presented in the table below.
| | node0 | node1 |
|---|---|---|
| Hostname | node26 | node27 |
| CPU | AMD EPYC 7713P 64-Core | AMD EPYC 7713P 64-Core |
| Memory | 256GB | 256GB |
| OS drives | 2 x Samsung SSD 970 EVO Plus 250GB mirrored | 2 x Samsung SSD 970 EVO Plus 250GB mirrored |
| OS | Rocky Linux 8.9 | Rocky Linux 8.9 |
| IPMI address | 192.168.64.106 | 192.168.67.23 |
| IPMI login | admin | admin |
| IPMI password | admin | admin |
| Management NIC | enp194s0f0: 192.168.65.26/24 | enp194s0f0: 192.168.65.27 |
| Cluster Heartbeat NIC | enp194s0f1: 10.10.10.1 | enp194s0f1: 10.10.10.2 |
| Infiniband LNET HDR | ib0: 100.100.100.26, ib3: 100.100.100.126 | ib0: 100.100.100.27, ib3: 100.100.100.127 |
| NVMes (shared by both nodes) | 24 x Kioxia CM6-R 3.84TB KCM61RUL3T84 | |
System configuration and tuning
Before software installation and configuration, we need to prepare the platform to provide optimal performance.
Performance tuning
Network configuration
Check that all IP addresses are resolvable from both hosts. In our case, we will use resolving via the hosts file, so we have the following content in /etc/hosts on both nodes:
127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
::1         localhost localhost.localdomain localhost6 localhost6.localdomain6
192.168.65.26   node26
192.168.65.27   node27
10.10.10.1      node26-ic
10.10.10.2      node27-ic
192.168.64.50   node26-ipmi
192.168.64.76   node27-ipmi
100.100.100.26  node26-ib
100.100.100.27  node27-ib
Policy-based routing setup
We use a multirail configuration on the servers: the two IB interfaces on each server are configured to work in the same IPv4 network. To make the Linux IP stack work properly in this configuration, we need to set up policy-based routing for these interfaces on both servers.
node26 setup:
node26# nmcli connection modify ib0 ipv4.route-metric 100
node26# nmcli connection modify ib3 ipv4.route-metric 101
node26# nmcli connection modify ib0 ipv4.routes "100.100.100.0/24 src=100.100.100.26 table=100"
node26# nmcli connection modify ib0 ipv4.routing-rules "priority 101 from 100.100.100.26 table 100"
node26# nmcli connection modify ib3 ipv4.routes "100.100.100.0/24 src=100.100.100.126 table=200"
node26# nmcli connection modify ib3 ipv4.routing-rules "priority 102 from 100.100.100.126 table 200"
node26# nmcli connection up ib0
Connection successfully activated (D-Bus active path: /org/freedesktop/NetworkManager/ActiveConnection/6)
node26# nmcli connection up ib3
Connection successfully activated (D-Bus active path: /org/freedesktop/NetworkManager/ActiveConnection/7)
node27 setup:
node27# nmcli connection modify ib0 ipv4.route-metric 100
node27# nmcli connection modify ib3 ipv4.route-metric 101
node27# nmcli connection modify ib0 ipv4.routes "100.100.100.0/24 src=100.100.100.27 table=100"
node27# nmcli connection modify ib0 ipv4.routing-rules "priority 101 from 100.100.100.27 table 100"
node27# nmcli connection modify ib3 ipv4.routes "100.100.100.0/24 src=100.100.100.127 table=200"
node27# nmcli connection modify ib3 ipv4.routing-rules "priority 102 from 100.100.100.127 table 200"
node27# nmcli connection up ib0
Connection successfully activated (D-Bus active path: /org/freedesktop/NetworkManager/ActiveConnection/6)
node27# nmcli connection up ib3
Connection successfully activated (D-Bus active path: /org/freedesktop/NetworkManager/ActiveConnection/7)
NVMe drives setup
In the SBB system, we have 24 Kioxia CM6-R 3.84TB KCM61RUL3T84 drives. They are PCIe 4.0, dual-ported, read-intensive drives with 1DWPD endurance. A single drive's performance can theoretically reach up to 6.9GB/s for sequential read and 4.2GB/s for sequential write (according to the vendor specification).
In our setup, we plan to create a simple Lustre installation with sufficient performance. However, since each NVMe in the SBB system is connected to each server with only 2 PCIe lanes, the NVMe drives' performance will be limited. To overcome this limitation, we will create 2 namespaces on each NVMe drive, which will be used for the Lustre OST RAIDs, and create separate RAIDs from the first NVMe namespaces and the second NVMe namespaces. By configuring our cluster software to use the RAIDs made from the first namespaces (and their Lustre servers) on Lustre node #0 and the RAIDs created from the second namespaces on node #1, we will be able to utilize all four PCIe lanes for each NVMe used to store OST data, as Lustre itself will distribute the workload among all OSTs.
Since we are deploying a simple Lustre installation, we will use a simple filesystem scheme with just one metadata server. As we will have only one metadata server, we will need only one RAID for the metadata. Because of this, we will not create two namespaces on the drives used for the MDT RAID.
Here is how the NVMe drive configuration looks initially:
# nvme list
Node SN Model Namespace Usage Format FW Rev
--------------------- -------------------- ---------------------------------------- --------- -------------------------- ---------------- --------
/dev/nvme0n1 21G0A046T2G8 KCM61RUL3T84 1 0.00 B / 3.84 TB 4 KiB + 0 B 0106
/dev/nvme1n1 21G0A04BT2G8 KCM61RUL3T84 1 0.00 B / 3.84 TB 4 KiB + 0 B 0106
/dev/nvme10n1 21G0A04ET2G8 KCM61RUL3T84 1 0.00 B / 3.84 TB 4 KiB + 0 B 0106
/dev/nvme11n1 21G0A045T2G8 KCM61RUL3T84 1 0.00 B / 3.84 TB 4 KiB + 0 B 0106
/dev/nvme12n1 S59BNM0R702322Z Samsung SSD 970 EVO Plus 250GB 1 8.67 GB / 250.06 GB 512 B + 0 B 2B2QEXM7
/dev/nvme13n1 21G0A04KT2G8 KCM61RUL3T84 1 0.00 B / 3.84 TB 4 KiB + 0 B 0106
/dev/nvme14n1 21G0A047T2G8 KCM61RUL3T84 1 0.00 B / 3.84 TB 4 KiB + 0 B 0106
/dev/nvme15n1 21G0A04CT2G8 KCM61RUL3T84 1 0.00 B / 3.84 TB 4 KiB + 0 B 0106
/dev/nvme16n1 11U0A00KT2G8 KCM61RUL3T84 1 0.00 B / 3.84 TB 4 KiB + 0 B 0106
/dev/nvme17n1 21G0A04JT2G8 KCM61RUL3T84 1 0.00 B / 3.84 TB 4 KiB + 0 B 0106
/dev/nvme18n1 21G0A048T2G8 KCM61RUL3T84 1 0.00 B / 3.84 TB 4 KiB + 0 B 0106
/dev/nvme19n1 S59BNM0R702439A Samsung SSD 970 EVO Plus 250GB 1 208.90 kB / 250.06 GB 512 B + 0 B 2B2QEXM7
/dev/nvme2n1 21G0A041T2G8 KCM61RUL3T84 1 0.00 B / 3.84 TB 4 KiB + 0 B 0106
/dev/nvme20n1 21G0A03TT2G8 KCM61RUL3T84 1 0.00 B / 3.84 TB 4 KiB + 0 B 0106
/dev/nvme21n1 21G0A04FT2G8 KCM61RUL3T84 1 0.00 B / 3.84 TB 4 KiB + 0 B 0106
/dev/nvme22n1 21G0A03ZT2G8 KCM61RUL3T84 1 0.00 B / 3.84 TB 4 KiB + 0 B 0106
/dev/nvme23n1 21G0A04DT2G8 KCM61RUL3T84 1 0.00 B / 3.84 TB 4 KiB + 0 B 0106
/dev/nvme24n1 21G0A03VT2G8 KCM61RUL3T84 1 0.00 B / 3.84 TB 4 KiB + 0 B 0106
/dev/nvme25n1 21G0A044T2G8 KCM61RUL3T84 1 0.00 B / 3.84 TB 4 KiB + 0 B 0106
/dev/nvme3n1 21G0A04GT2G8 KCM61RUL3T84 1 0.00 B / 3.84 TB 4 KiB + 0 B 0106
/dev/nvme4n1 21G0A042T2G8 KCM61RUL3T84 1 0.00 B / 3.84 TB 4 KiB + 0 B 0106
/dev/nvme5n1 21G0A04HT2G8 KCM61RUL3T84 1 0.00 B / 3.84 TB 4 KiB + 0 B 0106
/dev/nvme6n1 21G0A049T2G8 KCM61RUL3T84 1 0.00 B / 3.84 TB 4 KiB + 0 B 0106
/dev/nvme7n1 21G0A043T2G8 KCM61RUL3T84 1 0.00 B / 3.84 TB 4 KiB + 0 B 0106
/dev/nvme8n1 21G0A04AT2G8 KCM61RUL3T84 1 0.00 B / 3.84 TB 4 KiB + 0 B 0106
/dev/nvme9n1 21G0A03XT2G8 KCM61RUL3T84 1 0.00 B / 3.84 TB 4 KiB + 0 B 0106
The Samsung drives are used for the operating system installation.
Let's reserve /dev/nvme0 and /dev/nvme1 drives for the metadata RAID1. Currently, xiRAID does not support spare pools in a cluster configuration, but having a spare drive is really useful for quick manual drive replacement. So, let's also reserve /dev/nvme3 to be a spare for the RAID1 drive and split all other KCM61RUL3T84 drives into 2 namespaces.
Let's take /dev/nvme4 as an example. All other drives will be split in exactly the same way.
Check the maximum possible size of the drive to be sure:

node26# nvme id-ctrl /dev/nvme4 | grep tnvmcap
tnvmcap : 3840755982336

Check the maximum number of namespaces supported by the drive:

node26# nvme id-ctrl /dev/nvme4 | grep ^nn
nn : 64

Check the controller ID used for the drive connection at each server (they will differ):

node27# nvme id-ctrl /dev/nvme4 | grep ^cntlid
cntlid : 0x1
node26# nvme id-ctrl /dev/nvme4 | grep ^cntlid
cntlid : 0x2
We need to calculate the size of the namespaces we are going to create. The real size of the drive in 4K blocks is:

3840755982336 / 4096 = 937684566

So, each namespace size in 4K blocks will be:

937684566 / 2 = 468842283
In fact, it is not possible to create 2 namespaces of exactly this size because of the NVMe internal architecture. So, we will create namespaces of 468700000 blocks.
If you are building a system for write-intensive tasks, we recommend using write-intensive drives with 3DWPD endurance. If that is not possible and you have to use read-optimized drives, consider leaving some of the NVMe capacity (10-25%) unallocated by namespaces. In many cases, this brings the drive's write performance degradation behavior closer to that of write-intensive drives.
As a first step, remove the existing namespace on one of the nodes:
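For example, assuming the factory namespace has ID 1 (as shown in the nvme list output above):

```bash
# Delete the existing (factory) namespace with ID 1 on /dev/nvme4
node26# nvme delete-ns /dev/nvme4 --namespace-id=1
```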
After that, create namespaces on the same node:
node26# nvme create-ns /dev/nvme4 --nsze=468700000 --ncap=468700000 -b=4096 --dps=0 -m 1
create-ns: Success, created nsid:1
node26# nvme create-ns /dev/nvme4 --nsze=468700000 --ncap=468700000 -b=4096 --dps=0 -m 1
create-ns: Success, created nsid:2
node26# nvme attach-ns /dev/nvme4 --namespace-id=1 -controllers=0x2
attach-ns: Success, nsid:1
node26# nvme attach-ns /dev/nvme4 --namespace-id=2 -controllers=0x2
attach-ns: Success, nsid:2
Attach the namespaces on the second node with the proper controller:
node27# nvme attach-ns /dev/nvme4 --namespace-id=1 -controllers=0x1
attach-ns: Success, nsid:1
node27# nvme attach-ns /dev/nvme4 --namespace-id=2 -controllers=0x1
attach-ns: Success, nsid:2
It looks like this on both nodes:
# nvme list | grep nvme4
/dev/nvme4n1 21G0A042T2G8 KCM61RUL3T84 1 0.00 B / 1.92 TB 4 KiB + 0 B 0106
/dev/nvme4n2 21G0A042T2G8 KCM61RUL3T84 2 0.00 B / 1.92 TB 4 KiB + 0 B 0106
All other drives were split in the same way. Here is the resulting configuration:
# nvme list
Node SN Model Namespace Usage Format FW Rev
--------------------- -------------------- ---------------------------------------- --------- -------------------------- ---------------- --------
/dev/nvme0n1 21G0A046T2G8 KCM61RUL3T84 1 0.00 B / 3.84 TB 4 KiB + 0 B 0106
/dev/nvme1n1 21G0A04BT2G8 KCM61RUL3T84 1 0.00 B / 3.84 TB 4 KiB + 0 B 0106
/dev/nvme10n1 21G0A04ET2G8 KCM61RUL3T84 1 0.00 B / 1.92 TB 4 KiB + 0 B 0106
/dev/nvme10n2 21G0A04ET2G8 KCM61RUL3T84 2 0.00 B / 1.92 TB 4 KiB + 0 B 0106
/dev/nvme11n1 21G0A045T2G8 KCM61RUL3T84 1 0.00 B / 1.92 TB 4 KiB + 0 B 0106
/dev/nvme11n2 21G0A045T2G8 KCM61RUL3T84 2 0.00 B / 1.92 TB 4 KiB + 0 B 0106
/dev/nvme12n1 S59BNM0R702322Z Samsung SSD 970 EVO Plus 250GB 1 8.67 GB / 250.06 GB 512 B + 0 B 2B2QEXM7
/dev/nvme13n1 21G0A04KT2G8 KCM61RUL3T84 1 0.00 B / 1.92 TB 4 KiB + 0 B 0106
/dev/nvme13n2 21G0A04KT2G8 KCM61RUL3T84 2 0.00 B / 1.92 TB 4 KiB + 0 B 0106
/dev/nvme14n1 21G0A047T2G8 KCM61RUL3T84 1 0.00 B / 1.92 TB 4 KiB + 0 B 0106
/dev/nvme14n2 21G0A047T2G8 KCM61RUL3T84 2 0.00 B / 1.92 TB 4 KiB + 0 B 0106
/dev/nvme15n1 21G0A04CT2G8 KCM61RUL3T84 1 0.00 B / 1.92 TB 4 KiB + 0 B 0106
/dev/nvme15n2 21G0A04CT2G8 KCM61RUL3T84 2 0.00 B / 1.92 TB 4 KiB + 0 B 0106
/dev/nvme16n1 11U0A00KT2G8 KCM61RUL3T84 1 0.00 B / 1.92 TB 4 KiB + 0 B 0106
/dev/nvme16n2 11U0A00KT2G8 KCM61RUL3T84 2 0.00 B / 1.92 TB 4 KiB + 0 B 0106
/dev/nvme17n1 21G0A04JT2G8 KCM61RUL3T84 1 0.00 B / 1.92 TB 4 KiB + 0 B 0106
/dev/nvme17n2 21G0A04JT2G8 KCM61RUL3T84 2 0.00 B / 1.92 TB 4 KiB + 0 B 0106
/dev/nvme18n1 21G0A048T2G8 KCM61RUL3T84 1 0.00 B / 1.92 TB 4 KiB + 0 B 0106
/dev/nvme18n2 21G0A048T2G8 KCM61RUL3T84 2 0.00 B / 1.92 TB 4 KiB + 0 B 0106
/dev/nvme19n1 S59BNM0R702439A Samsung SSD 970 EVO Plus 250GB 1 208.90 kB / 250.06 GB 512 B + 0 B 2B2QEXM7
/dev/nvme2n1 21G0A041T2G8 KCM61RUL3T84 1 0.00 B / 3.84 TB 4 KiB + 0 B 0106
/dev/nvme20n1 21G0A03TT2G8 KCM61RUL3T84 1 0.00 B / 1.92 TB 4 KiB + 0 B 0106
/dev/nvme20n2 21G0A03TT2G8 KCM61RUL3T84 2 0.00 B / 1.92 TB 4 KiB + 0 B 0106
/dev/nvme21n1 21G0A04FT2G8 KCM61RUL3T84 1 0.00 B / 1.92 TB 4 KiB + 0 B 0106
/dev/nvme21n2 21G0A04FT2G8 KCM61RUL3T84 2 0.00 B / 1.92 TB 4 KiB + 0 B 0106
/dev/nvme22n1 21G0A03ZT2G8 KCM61RUL3T84 1 0.00 B / 1.92 TB 4 KiB + 0 B 0106
/dev/nvme22n2 21G0A03ZT2G8 KCM61RUL3T84 2 0.00 B / 1.92 TB 4 KiB + 0 B 0106
/dev/nvme23n1 21G0A04DT2G8 KCM61RUL3T84 1 0.00 B / 1.92 TB 4 KiB + 0 B 0106
/dev/nvme23n2 21G0A04DT2G8 KCM61RUL3T84 2 0.00 B / 1.92 TB 4 KiB + 0 B 0106
/dev/nvme24n1 21G0A03VT2G8 KCM61RUL3T84 1 0.00 B / 1.92 TB 4 KiB + 0 B 0106
/dev/nvme24n2 21G0A03VT2G8 KCM61RUL3T84 2 0.00 B / 1.92 TB 4 KiB + 0 B 0106
/dev/nvme25n1 21G0A044T2G8 KCM61RUL3T84 1 0.00 B / 1.92 TB 4 KiB + 0 B 0106
/dev/nvme25n2 21G0A044T2G8 KCM61RUL3T84 2 0.00 B / 1.92 TB 4 KiB + 0 B 0106
/dev/nvme3n1 21G0A04GT2G8 KCM61RUL3T84 1 0.00 B / 3.84 TB 4 KiB + 0 B 0106
/dev/nvme4n1 21G0A042T2G8 KCM61RUL3T84 1 0.00 B / 1.92 TB 4 KiB + 0 B 0106
/dev/nvme4n2 21G0A042T2G8 KCM61RUL3T84 2 0.00 B / 1.92 TB 4 KiB + 0 B 0106
/dev/nvme5n1 21G0A04HT2G8 KCM61RUL3T84 1 0.00 B / 1.92 TB 4 KiB + 0 B 0106
/dev/nvme5n2 21G0A04HT2G8 KCM61RUL3T84 2 0.00 B / 1.92 TB 4 KiB + 0 B 0106
/dev/nvme6n1 21G0A049T2G8 KCM61RUL3T84 1 0.00 B / 1.92 TB 4 KiB + 0 B 0106
/dev/nvme6n2 21G0A049T2G8 KCM61RUL3T84 2 0.00 B / 1.92 TB 4 KiB + 0 B 0106
/dev/nvme7n1 21G0A043T2G8 KCM61RUL3T84 1 0.00 B / 1.92 TB 4 KiB + 0 B 0106
/dev/nvme7n2 21G0A043T2G8 KCM61RUL3T84 2 0.00 B / 1.92 TB 4 KiB + 0 B 0106
/dev/nvme8n1 21G0A04AT2G8 KCM61RUL3T84 1 0.00 B / 1.92 TB 4 KiB + 0 B 0106
/dev/nvme8n2 21G0A04AT2G8 KCM61RUL3T84 2 0.00 B / 1.92 TB 4 KiB + 0 B 0106
/dev/nvme9n1 21G0A03XT2G8 KCM61RUL3T84 1 0.00 B / 1.92 TB 4 KiB + 0 B 0106
/dev/nvme9n2 21G0A03XT2G8 KCM61RUL3T84 2 0.00 B / 1.92 TB 4 KiB + 0 B 0106
Software components installation
Lustre installation
Create the Lustre repo file /etc/yum.repos.d/lustre-repo.repo:
[lustre-server]
name=lustre-server
baseurl=https://downloads.
# exclude=*debuginfo*
gpgcheck=0
[lustre-client]
name=lustre-client
baseurl=https://downloads.
# exclude=*debuginfo*
gpgcheck=0
[e2fsprogs-wc]
name=e2fsprogs-wc
baseurl=https://downloads.
# exclude=*debuginfo*
gpgcheck=0
Installing e2fs tools:
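For example, using the e2fsprogs-wc repository defined above (the exact invocation is an assumption; adjust repository and package selection to your environment):

```bash
# Pull the Lustre-patched e2fsprogs from the e2fsprogs-wc repo on both nodes;
# disable the distro repos so the Whamcloud build wins the version comparison
dnf --disablerepo=* --enablerepo=e2fsprogs-wc install e2fsprogs
```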
Installing Lustre kernel:
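For example, installing the Lustre-patched kernel version used later in this article (package naming follows the usual Whamcloud el8 server repository layout; verify against your repo):

```bash
# Install the Lustre-patched kernel from the lustre-server repo on both nodes
dnf --enablerepo=lustre-server install kernel-4.18.0-513.9.1.el8_lustre
```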
Reboot to the new kernel:

# reboot
Check the kernel version after reboot:

node26# uname -a
Linux node26 4.18.0-513.9.1.el8_lustre.x86_64 #1 SMP Sat Dec 23 05:23:32 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
Installing lustre server components:
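For example (the usual Whamcloud server package set with ldiskfs support; the exact list is an assumption and should be checked against the repository):

```bash
# Install the Lustre server userspace, kernel modules, and ldiskfs support on both nodes
dnf --enablerepo=lustre-server install lustre lustre-osd-ldiskfs-mount kmod-lustre kmod-lustre-osd-ldiskfs lustre-resource-agents
```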
Check Lustre module load:
[root@node26 ~]# modprobe -v lustre
insmod /lib/modules/4.18.0-513.9.1.el8_lustre.x86_64/extra/lustre/net/libcfs.ko
insmod /lib/modules/4.18.0-513.9.1.el8_lustre.x86_64/extra/lustre/net/lnet.ko
insmod /lib/modules/4.18.0-513.9.1.el8_lustre.x86_64/extra/lustre/fs/obdclass.ko
insmod /lib/modules/4.18.0-513.9.1.el8_lustre.x86_64/extra/lustre/fs/ptlrpc.ko
insmod /lib/modules/4.18.0-513.9.1.el8_lustre.x86_64/extra/lustre/fs/fld.ko
insmod /lib/modules/4.18.0-513.9.1.el8_lustre.x86_64/extra/lustre/fs/fid.ko
insmod /lib/modules/4.18.0-513.9.1.el8_lustre.x86_64/extra/lustre/fs/osc.ko
insmod /lib/modules/4.18.0-513.9.1.el8_lustre.x86_64/extra/lustre/fs/lov.ko
insmod /lib/modules/4.18.0-513.9.1.el8_lustre.x86_64/extra/lustre/fs/mdc.ko
insmod /lib/modules/4.18.0-513.9.1.el8_lustre.x86_64/extra/lustre/fs/lmv.ko
insmod /lib/modules/4.18.0-513.9.1.el8_lustre.x86_64/extra/lustre/fs/lustre.ko
Unload modules:
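For example, using the helper shipped with Lustre:

```bash
# Unload all Lustre and LNet kernel modules
node26# lustre_rmmod
```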
Installing xiRAID Classic 4.1
Installing xiRAID Classic 4.1 at both nodes from the repositories following the Xinnor xiRAID 4.1.0 Installation Guide:
Pacemaker installation
Running the following steps at both nodes:
Enable cluster repo
Installing cluster:
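For example, on Rocky Linux 8 (the HA repository ID and the package selection are assumptions; verify them for your distribution):

```bash
# Enable the HighAvailability repository and install the cluster stack on both nodes
dnf config-manager --set-enabled ha
dnf install pcs pacemaker corosync
```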
Csync2 installation
Since we are installing the system on Rocky Linux 8, there is no need to compile Csync2 from sources ourselves. Just install the Csync2 package from the Xinnor repository on both nodes:
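For example (assuming the package in the Xinnor repository is simply named csync2):

```bash
# Install Csync2 from the Xinnor repository on both nodes
dnf install csync2
```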
NTP server installation
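We use chrony for time synchronisation; for example:

```bash
# Install and enable chrony on both nodes
dnf install chrony
systemctl enable --now chronyd
```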
HA cluster setup
Time synchronisation setup
Modify the /etc/chrony.conf file if needed to point chrony at the proper NTP servers. In this setup, we will work with the default settings.
Verify that time sync works properly by running chronyc tracking.
Pacemaker cluster creation
In this chapter, the cluster configuration is described. In our cluster, we use a dedicated network to create a cluster interconnect. This network is physically created as a single direct connection (by dedicated Ethernet cable without any switch) between enp194s0f1 interfaces on the servers. The cluster interconnect is a very important component of any HA-cluster, and its reliability should be high. A Pacemaker-based cluster can be configured with two cluster interconnect networks for improved reliability through redundancy. While in our configuration we will use a single network configuration, please consider using a dual network interconnect for your projects if needed.
Set the firewall to allow pacemaker software to work (on both nodes):
# firewall-cmd --permanent --add-service=high-availability
Set the same password for the hacluster user at both nodes:
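For example:

```bash
# Set the same hacluster password on node26 and node27;
# it will be used by "pcs host auth" below
passwd hacluster
```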
Enable and start the cluster software at both nodes:

# systemctl enable pcsd.service
# systemctl start pcsd.service
Authenticate the cluster nodes from one node by their interconnect interfaces:

node26# pcs host auth node26-ic node27-ic -u hacluster
Password:
node26-ic: Authorized
node27-ic: Authorized
Create and start the cluster (start at one node):
node26# pcs cluster setup lustrebox0 node26-ic node27-ic
No addresses specified for host 'node26-ic', using 'node26-ic'
No addresses specified for host 'node27-ic', using 'node27-ic'
Destroying cluster on hosts: 'node26-ic', 'node27-ic'...
node26-ic: Successfully destroyed cluster
node27-ic: Successfully destroyed cluster
Requesting remove 'pcsd settings' from 'node26-ic', 'node27-ic'
node26-ic: successful removal of the file 'pcsd settings'
node27-ic: successful removal of the file 'pcsd settings'
Sending 'corosync authkey', 'pacemaker authkey' to 'node26-ic', 'node27-ic'
node26-ic: successful distribution of the file 'corosync authkey'
node26-ic: successful distribution of the file 'pacemaker authkey'
node27-ic: successful distribution of the file 'corosync authkey'
node27-ic: successful distribution of the file 'pacemaker authkey'
Sending 'corosync.conf' to 'node26-ic', 'node27-ic'
node26-ic: successful distribution of the file 'corosync.conf'
node27-ic: successful distribution of the file 'corosync.conf'
Cluster has been successfully set up.
node26# pcs cluster start --all
node26-ic: Starting Cluster...
node27-ic: Starting Cluster...
Check the current cluster status:
node26# pcs status
Cluster name: lustrebox0

WARNINGS:
No stonith devices and stonith-enabled is not false

Cluster Summary:
  * Stack: corosync (Pacemaker is running)
  * Current DC: node27-ic (version 2.1.7-5.el8_10-0f7f88312) - partition with quorum
  * Last updated: Fri Jul 12 20:55:53 2024 on node26-ic
  * Last change: Fri Jul 12 20:55:12 2024 by hacluster via hacluster on node27-ic
  * 2 nodes configured
  * 0 resource instances configured

Node List:
  * Online: [ node26-ic node27-ic ]

Full List of Resources:
  * No resources

Daemon Status:
  corosync: active/disabled
  pacemaker: active/disabled
  pcsd: active/enabled
Fencing setup
It's very important to have properly configured and working fencing (STONITH) in any HA cluster that works with shared storage devices. In our case, the shared devices are all the NVMe namespaces we created earlier. The fencing (STONITH) design should be developed and implemented by the cluster administrator in consideration of the system's abilities and architecture. In this system, we will use fencing via IPMI. In any case, when designing and deploying your own cluster, please choose the fencing configuration on your own, considering all the possibilities, limitations, and risks.
First of all, let's check the list of fencing agents installed in our system:

# pcs stonith list
fence_watchdog - Dummy watchdog fence agent
So, we don’t have the IPMI fencing agent installed at our cluster nodes. To install it, run the following command (at both nodes):
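For example (the IPMI fence agent is packaged as fence-agents-ipmilan in the HA repository):

```bash
# Install the IPMI LAN fence agent on both nodes
dnf install fence-agents-ipmilan
```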
You may check the IPMI fencing agent options description by running the following command:
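For example:

```bash
# Show all parameters supported by the fence_ipmilan agent
pcs stonith describe fence_ipmilan
```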
Adding the fencing resources:
node26# pcs stonith create node27.stonith fence_ipmilan ip="192.168.67.23" auth=password password="admin" username="admin" method="onoff" lanplus=true pcmk_host_list="node27-ic" pcmk_host_check=static-list op monitor interval=10s
node26# pcs stonith create node26.stonith fence_ipmilan ip="192.168.64.106" auth=password password="admin" username="admin" method="onoff" lanplus=true pcmk_host_list="node26-ic" pcmk_host_check=static-list op monitor interval=10s
Preventing the STONITH resources from start on the node it should kill:
node26# pcs constraint location node27.stonith avoids node27-ic=INFINITY
node26# pcs constraint location node26.stonith avoids node26-ic=INFINITY
Csync2 configuration
Configure firewall to allow Csync2 to work (run at both nodes):
# firewall-cmd --permanent --add-port=30865/tcp
Create the Csync2 configuration file /usr/local/etc/csync2.cfg with the following content at node26 only:
group csxiha {
    host node26;
    host node27;
    key /usr/local/etc/csync2.key_ha;
    include /etc/xiraid/raids;
}
Generate the key:
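For example, using Csync2's built-in key generator:

```bash
# Generate the pre-shared key referenced in csync2.cfg
node26# csync2 -k /usr/local/etc/csync2.key_ha
```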
Copy the config and the key file to the second node:
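For example:

```bash
# Copy the Csync2 configuration and key to node27
node26# scp /usr/local/etc/csync2.cfg /usr/local/etc/csync2.key_ha node27:/usr/local/etc/
```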
For scheduled Csync2 synchronisation once per minute, run crontab -e at both nodes and add the following record.
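A record along these lines runs the synchronisation every minute:

```bash
* * * * * /usr/local/sbin/csync2 -x
```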
Also for asynchronous synchronisation run the following command to create a synchronisation script (repeat the script creation procedure at both nodes):
Fill the created script with the following content:
/usr/local/sbin/csync2 -xv
Save the file.
After that run the following command to set correct permissions for the script file:
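For example, assuming the script was created as /usr/local/sbin/csync2_sync.sh (the file name here is purely illustrative):

```bash
# Make the synchronisation script executable
chmod +x /usr/local/sbin/csync2_sync.sh
```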
xiRAID Configuration for cluster setup
Disable RAID autostart to prevent RAIDs from being activated by xiRAID itself during a node boot. In a cluster configuration, RAIDs have to be activated by Pacemaker via cluster resources. Run the following command on both nodes:
Make the xiRAID Classic 4.1 resource agent visible to Pacemaker (run this command sequence at both nodes):

# mkdir -p /usr/lib/ocf/resource.d/xraid
# ln -s /etc/xraid/agents/raid /usr/lib/ocf/resource.d/xraid/raid
xiRAID RAIDs creation
To be able to create RAIDs, we need to install licenses for xiRAID Classic 4.1 on both hosts first. The licenses should be received from Xinnor. To generate the licenses, Xinnor requires the output of the xicli license show command (from both nodes).
node26# xicli license show
Kernel version: 4.18.0-513.9.1.el8_lustre.x86_64
hwkey: B8828A09E09E8F48
license_key: null
version: 0
crypto_version: 0
created: 0-0-0
expired: 0-0-0
disks: 4
levels: 0
type: nvme
disks_in_use: 2
status: trial
The license files received from Xinnor need to be installed with the xicli license update -p <filename> command (once again, at both nodes):
node26# xicli license show
Kernel version: 4.18.0-513.9.1.el8_lustre.x86_64
hwkey: B8828A09E09E8F48
license_key: 0F5A4B87A0FC6DB7544EA446B1B4AF5F34A08169C44E5FD119CE6D2352E202677768ECC78F56B583DABE11698BBC800EC96E556AA63E576DAB838010247678E7E3B95C7C4E3F592672D06C597045EAAD8A42CDE38C363C533E98411078967C38224C9274B862D45D4E6DED70B7E34602C80B60CBA7FDE93316438AFDCD7CBD23
version: 1
crypto_version: 1
created: 2024-7-16
expired: 2024-9-30
disks: 600
levels: 70
type: nvme
disks_in_use: 2
status: valid
Since we plan to deploy a small Lustre installation, combining MGT and MDT on the same target device is absolutely OK. But for medium or large Lustre installations, it's better to use a separate target (and RAID) for MGT.
Here is the list of the RAIDs we need to create.
| RAID Name | RAID Level | Number of devices | Strip size | Drive list | Lustre target |
|---|---|---|---|---|---|
| r_mdt0 | 1 | 2 | 16 | /dev/nvme0n1 /dev/nvme1n1 | MGT + MDT index=0 |
| r_ost0 | 6 | 10 | 128 | /dev/nvme4n1 /dev/nvme5n1 /dev/nvme6n1 /dev/nvme7n1 /dev/nvme8n1 /dev/nvme9n1 /dev/nvme10n1 /dev/nvme11n1 /dev/nvme13n1 /dev/nvme14n1 | OST index=0 |
| r_ost1 | 6 | 10 | 128 | /dev/nvme4n2 /dev/nvme5n2 /dev/nvme6n2 /dev/nvme7n2 /dev/nvme8n2 /dev/nvme9n2 /dev/nvme10n2 /dev/nvme11n2 /dev/nvme13n2 /dev/nvme14n2 | OST index=1 |
| r_ost2 | 6 | 10 | 128 | /dev/nvme15n1 /dev/nvme16n1 /dev/nvme17n1 /dev/nvme18n1 /dev/nvme20n1 /dev/nvme21n1 /dev/nvme22n1 /dev/nvme23n1 /dev/nvme24n1 /dev/nvme25n1 | OST index=2 |
| r_ost3 | 6 | 10 | 128 | /dev/nvme15n2 /dev/nvme16n2 /dev/nvme17n2 /dev/nvme18n2 /dev/nvme20n2 /dev/nvme21n2 /dev/nvme22n2 /dev/nvme23n2 /dev/nvme24n2 /dev/nvme25n2 | OST index=3 |
Creating all the RAIDs at the first node:
node26# xicli raid create -n r_mdt0 -l 1 -d /dev/nvme0n1 /dev/nvme1n1
node26# xicli raid create -n r_ost0 -l 6 -ss 128 -d /dev/nvme4n1 /dev/nvme5n1 /dev/nvme6n1 /dev/nvme7n1 /dev/nvme8n1 /dev/nvme9n1 /dev/nvme10n1 /dev/nvme11n1 /dev/nvme13n1 /dev/nvme14n1
node26# xicli raid create -n r_ost1 -l 6 -ss 128 -d /dev/nvme4n2 /dev/nvme5n2 /dev/nvme6n2 /dev/nvme7n2 /dev/nvme8n2 /dev/nvme9n2 /dev/nvme10n2 /dev/nvme11n2 /dev/nvme13n2 /dev/nvme14n2
node26# xicli raid create -n r_ost2 -l 6 -ss 128 -d /dev/nvme15n1 /dev/nvme16n1 /dev/nvme17n1 /dev/nvme18n1 /dev/nvme20n1 /dev/nvme21n1 /dev/nvme22n1 /dev/nvme23n1 /dev/nvme24n1 /dev/nvme25n1
node26# xicli raid create -n r_ost3 -l 6 -ss 128 -d /dev/nvme15n2 /dev/nvme16n2 /dev/nvme17n2 /dev/nvme18n2 /dev/nvme20n2 /dev/nvme21n2 /dev/nvme22n2 /dev/nvme23n2 /dev/nvme24n2 /dev/nvme25n2
At this stage, there is no need to wait for the RAIDs initialization to finish - it can be safely left to run in the background.
Checking the RAID statuses at the first node:
node26# xicli raid show ╔RAIDs═══╦══════════════════╦═════════════╦════════════════════════╦═══════════════════╗ ║ name ║ static ║ state ║ devices ║ info ║ ╠════════╬══════════════════╬═════════════╬════════════════════════╬═══════════════════╣ ║ r_mdt0 ║ size: 3576 GiB ║ online ║ 0 /dev/nvme0n1 online ║ ║ ║ ║ level: 1 ║ initialized ║ 1 /dev/nvme1n1 online ║ ║ ║ ║ strip_size: 16 ║ ║ ║ ║ ║ ║ block_size: 4096 ║ ║ ║ ║ ║ ║ sparepool: - ║ ║ ║ ║ ║ ║ active: True ║ ║ ║ ║ ║ ║ config: True ║ ║ ║ ║ ╠════════╬══════════════════╬═════════════╬════════════════════════╬═══════════════════╣ ║ r_ost0 ║ size: 14302 GiB ║ online ║ 0 /dev/nvme4n1 online ║ init_progress: 11 ║ ║ ║ level: 6 ║ initing ║ 1 /dev/nvme5n1 online ║ ║ ║ ║ strip_size: 128 ║ ║ 2 /dev/nvme6n1 online ║ ║ ║ ║ block_size: 4096 ║ ║ 3 /dev/nvme7n1 online ║ ║ ║ ║ sparepool: - ║ ║ 4 /dev/nvme8n1 online ║ ║ ║ ║ active: True ║ ║ 5 /dev/nvme9n1 online ║ ║ ║ ║ config: True ║ ║ 6 /dev/nvme10n1 online ║ ║ ║ ║ ║ ║ 7 /dev/nvme11n1 online ║ ║ ║ ║ ║ ║ 8 /dev/nvme13n1 online ║ ║ ║ ║ ║ ║ 9 /dev/nvme14n1 online ║ ║ ╠════════╬══════════════════╬═════════════╬════════════════════════╬═══════════════════╣ ║ r_ost1 ║ size: 14302 GiB ║ online ║ 0 /dev/nvme4n2 online ║ init_progress: 7 ║ ║ ║ level: 6 ║ initing ║ 1 /dev/nvme5n2 online ║ ║ ║ ║ strip_size: 128 ║ ║ 2 /dev/nvme6n2 online ║ ║ ║ ║ block_size: 4096 ║ ║ 3 /dev/nvme7n2 online ║ ║ ║ ║ sparepool: - ║ ║ 4 /dev/nvme8n2 online ║ ║ ║ ║ active: True ║ ║ 5 /dev/nvme9n2 online ║ ║ ║ ║ config: True ║ ║ 6 /dev/nvme10n2 online ║ ║ ║ ║ ║ ║ 7 /dev/nvme11n2 online ║ ║ ║ ║ ║ ║ 8 /dev/nvme13n2 online ║ ║ ║ ║ ║ ║ 9 /dev/nvme14n2 online ║ ║ ╠════════╬══════════════════╬═════════════╬════════════════════════╬═══════════════════╣ ║ r_ost2 ║ size: 14302 GiB ║ online ║ 0 /dev/nvme15n1 online ║ init_progress: 5 ║ ║ ║ level: 6 ║ initing ║ 1 /dev/nvme16n1 online ║ ║ ║ ║ strip_size: 128 ║ ║ 2 /dev/nvme17n1 online ║ ║ ║ ║ block_size: 4096 ║ ║ 3 /dev/nvme18n1 online ║ ║ ║ ║ sparepool: - ║ ║ 4 /dev/nvme20n1 online ║ ║ ║ ║ active: True ║ ║ 5 /dev/nvme21n1 online ║ ║ ║ ║ config: True ║ ║ 6 /dev/nvme22n1 online ║ ║ ║ ║ ║ ║ 7 /dev/nvme23n1 online ║ ║ ║ ║ ║ ║ 8 /dev/nvme24n1 online ║ ║ ║ ║ ║ ║ 9 /dev/nvme25n1 online ║ ║ ╠════════╬══════════════════╬═════════════╬════════════════════════╬═══════════════════╣ ║ r_ost3 ║ size: 14302 GiB ║ online ║ 0 /dev/nvme15n2 online ║ init_progress: 2 ║ ║ ║ level: 6 ║ initing ║ 1 /dev/nvme16n2 online ║ ║ ║ ║ strip_size: 128 ║ ║ 2 /dev/nvme17n2 online ║ ║ ║ ║ block_size: 4096 ║ ║ 3 /dev/nvme18n2 online ║ ║ ║ ║ sparepool: - ║ ║ 4 /dev/nvme20n2 online ║ ║ ║ ║ active: True ║ ║ 5 /dev/nvme21n2 online ║ ║ ║ ║ config: True ║ ║ 6 /dev/nvme22n2 online ║ ║ ║ ║ ║ ║ 7 /dev/nvme23n2 online ║ ║ ║ ║ ║ ║ 8 /dev/nvme24n2 online ║ ║ ║ ║ ║ ║ 9 /dev/nvme25n2 online ║ ║ ╚════════╩══════════════════╩═════════════╩════════════════════════╩═══════════════════╝
Checking that the RAID configs were successfully replicated to the second node (please note that on the second node, the RAID status is None, which is expected in this case):
node27# xicli raid show ╔RAIDs═══╦══════════════════╦═══════╦═════════╦══════╗ ║ name ║ static ║ state ║ devices ║ info ║ ╠════════╬══════════════════╬═══════╬═════════╬══════╣ ║ r_mdt0 ║ size: 3576 GiB ║ None ║ ║ ║ ║ ║ level: 1 ║ ║ ║ ║ ║ ║ strip_size: 16 ║ ║ ║ ║ ║ ║ block_size: 4096 ║ ║ ║ ║ ║ ║ sparepool: - ║ ║ ║ ║ ║ ║ active: False ║ ║ ║ ║ ║ ║ config: True ║ ║ ║ ║ ╠════════╬══════════════════╬═══════╬═════════╬══════╣ ║ r_ost0 ║ size: 14302 GiB ║ None ║ ║ ║ ║ ║ level: 6 ║ ║ ║ ║ ║ ║ strip_size: 128 ║ ║ ║ ║ ║ ║ block_size: 4096 ║ ║ ║ ║ ║ ║ sparepool: - ║ ║ ║ ║ ║ ║ active: False ║ ║ ║ ║ ║ ║ config: True ║ ║ ║ ║ ╠════════╬══════════════════╬═══════╬═════════╬══════╣ ║ r_ost1 ║ size: 14302 GiB ║ None ║ ║ ║ ║ ║ level: 6 ║ ║ ║ ║ ║ ║ strip_size: 128 ║ ║ ║ ║ ║ ║ block_size: 4096 ║ ║ ║ ║ ║ ║ sparepool: - ║ ║ ║ ║ ║ ║ active: False ║ ║ ║ ║ ║ ║ config: True ║ ║ ║ ║ ╠════════╬══════════════════╬═══════╬═════════╬══════╣ ║ r_ost2 ║ size: 14302 GiB ║ None ║ ║ ║ ║ ║ level: 6 ║ ║ ║ ║ ║ ║ strip_size: 128 ║ ║ ║ ║ ║ ║ block_size: 4096 ║ ║ ║ ║ ║ ║ sparepool: - ║ ║ ║ ║ ║ ║ active: False ║ ║ ║ ║ ║ ║ config: True ║ ║ ║ ║ ╠════════╬══════════════════╬═══════╬═════════╬══════╣ ║ r_ost3 ║ size: 14302 GiB ║ None ║ ║ ║ ║ ║ level: 6 ║ ║ ║ ║ ║ ║ strip_size: 128 ║ ║ ║ ║ ║ ║ block_size: 4096 ║ ║ ║ ║ ║ ║ sparepool: - ║ ║ ║ ║ ║ ║ active: False ║ ║ ║ ║ ║ ║ config: True ║ ║ ║ ║ ╚════════╩══════════════════╩═══════╩═════════╩══════╝
After RAID creation, there's no need to wait for RAID initialization to finish. The RAIDs are available for use immediately after creation, albeit with slightly reduced performance.
For optimal performance, it's better to dedicate disjoint sets of CPU cores to each RAID. Currently, all RAIDs are active on node26, so the sets overlap, but once the RAIDs are spread between node26 and node27 as planned, the sets used on each node will not overlap.
node26# xicli raid modify -n r_mdt0 -ca 0-7 -se 1
node26# xicli raid modify -n r_ost0 -ca 8-67 -se 1
node26# xicli raid modify -n r_ost1 -ca 8-67 -se 1 # will be running at node27
node26# xicli raid modify -n r_ost2 -ca 68-127 -se 1
node26# xicli raid modify -n r_ost3 -ca 68-127 -se 1 # will be running at node27
Lustre setup
LNET configuration
To make Lustre work, we need to configure the Lustre network stack (LNET).
Run at both nodes
# systemctl enable lnet
# lnetctl net add --net o2ib0 --if ib0
# lnetctl net add --net o2ib0 --if ib3
Check the configuration
# lnetctl net show -v net: - net type: lo local NI(s): - nid: 0@lo status: up statistics: send_count: 289478 recv_count: 289474 drop_count: 4 tunables: peer_timeout: 0 peer_credits: 0 peer_buffer_credits: 0 credits: 0 lnd tunables: dev cpt: 0 CPT: "[0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31]" - net type: o2ib local NI(s): - nid: 100.100.100.26@o2ib status: down interfaces: 0: ib0 statistics: send_count: 213607 recv_count: 213604 drop_count: 7 tunables: peer_timeout: 180 peer_credits: 8 peer_buffer_credits: 0 credits: 256 lnd tunables: peercredits_hiw: 4 map_on_demand: 1 concurrent_sends: 8 fmr_pool_size: 512 fmr_flush_trigger: 384 fmr_cache: 1 ntx: 512 conns_per_peer: 1 dev cpt: -1 CPT: "[0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31]" - nid: 100.100.100.126@o2ib status: up interfaces: 0: ib3 statistics: send_count: 4 recv_count: 4 drop_count: 0 tunables: peer_timeout: 180 peer_credits: 8 peer_buffer_credits: 0 credits: 256 lnd tunables: peercredits_hiw: 4 map_on_demand: 1 concurrent_sends: 8 fmr_pool_size: 512 fmr_flush_trigger: 384 fmr_cache: 1 ntx: 512 conns_per_peer: 1 dev cpt: -1 CPT: "[0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31]"
Please note the LNET NIDs of the hosts. We will use 100.100.100.26@o2ib for node26 and 100.100.100.27@o2ib for node27 as primary NIDs.
Save the LNET configuration:
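For example, by dumping the running configuration into the file that the lnet service imports at boot (path per the Lustre documentation; verify for your release):

```bash
# Persist the current LNET configuration
lnetctl export > /etc/lnet.conf
```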
LDISKFS filesystems creation
At this step, we format the RAIDs into LDISKFS filesystem format. During formatting, we specify the target type (--mgs/--mdt/--ost), unique number of the specific target type (--index), Lustre filesystem name (--fsname), NIDs where each target filesystem could be mounted and where the corresponding servers will get started automatically (--servicenode), and NIDs where MGS could be found (--mgsnode).
Since our RAIDs will work within a cluster, we specify NIDs of both server nodes as the NIDs where the target filesystem could be mounted and where the corresponding servers will get started automatically for each target filesystem. For the same reason, we specify two NIDs where other servers should look for the MGS service.
node26# mkfs.lustre --mgs --mdt --fsname=lustre0 --index=0 --servicenode=100.100.100.26@o2ib --servicenode=100.100.100.27@o2ib --mgsnode=100.100.100.26@o2ib --mgsnode=100.100.100.27@o2ib /dev/xi_r_mdt0
node26# mkfs.lustre --ost --fsname=lustre0 --index=0 --servicenode=100.100.100.26@o2ib --servicenode=100.100.100.27@o2ib --mgsnode=100.100.100.26@o2ib --mgsnode=100.100.100.27@o2ib /dev/xi_r_ost0
node26# mkfs.lustre --ost --fsname=lustre0 --index=1 --servicenode=100.100.100.26@o2ib --servicenode=100.100.100.27@o2ib --mgsnode=100.100.100.26@o2ib --mgsnode=100.100.100.27@o2ib /dev/xi_r_ost1
node26# mkfs.lustre --ost --fsname=lustre0 --index=2 --servicenode=100.100.100.26@o2ib --servicenode=100.100.100.27@o2ib --mgsnode=100.100.100.26@o2ib --mgsnode=100.100.100.27@o2ib /dev/xi_r_ost2
node26# mkfs.lustre --ost --fsname=lustre0 --index=3 --servicenode=100.100.100.26@o2ib --servicenode=100.100.100.27@o2ib --mgsnode=100.100.100.26@o2ib --mgsnode=100.100.100.27@o2ib /dev/xi_r_ost3
More details could be found in the Lustre documentation.
Cluster resources creation
The resources we are going to configure are described in the table below.
| RAID name | HA cluster RAID resource name | Lustre target | Mountpoint | HA cluster filesystem resource name | Preferred cluster node |
|---|---|---|---|---|---|
| r_mdt0 | rr_mdt0 | MGT + MDT index=0 | /lustre_t/mdt0 | fsr_mdt0 | node26 |
| r_ost0 | rr_ost0 | OST index=0 | /lustre_t/ost0 | fsr_ost0 | node26 |
| r_ost1 | rr_ost1 | OST index=1 | /lustre_t/ost1 | fsr_ost1 | node27 |
| r_ost2 | rr_ost2 | OST index=2 | /lustre_t/ost2 | fsr_ost2 | node26 |
| r_ost3 | rr_ost3 | OST index=3 | /lustre_t/ost3 | fsr_ost3 | node27 |
To create Pacemaker resources for xiRAID Classic RAIDs, we will use the xiRAID resource agent, which was installed with xiRAID Classic and made available to Pacemaker in one of the previous steps.
To cluster Lustre services, there are two options, as currently two resource agents are capable of managing Lustre OSDs:
- ocf:heartbeat:Filesystem: Distributed by ClusterLabs in the resource-agents package, the Filesystem RA is a very mature and stable application and has been part of the Pacemaker project for many years. Filesystem provides generic support for mounting and unmounting storage devices, which indirectly includes Lustre.
- ocf:lustre:Lustre: Developed specifically for Lustre OSDs, this RA is distributed by the Lustre project and is available in Lustre releases from version 2.10.0 onwards. As a result of its narrower scope, it is less complex than ocf:heartbeat:Filesystem and better suited for managing Lustre storage resources.
For simplicity, we will use ocf:heartbeat:Filesystem in our case. However, ocf:lustre:Lustre can also be easily used in conjunction with xiRAID Classic in a Pacemaker cluster configuration. For more details on Lustre clustering, please check this page of Lustre documentation.
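For reference, an equivalent OST resource using the Lustre-specific agent might look like the following (the ocf:lustre:Lustre agent takes target and mountpoint parameters; this is only an illustrative alternative to the Filesystem resources created below):

```bash
# Alternative to the Filesystem RA: manage an OST with ocf:lustre:Lustre
pcs resource create fsr_ost0 ocf:lustre:Lustre target=/dev/xi_r_ost0 mountpoint=/lustre_t/ost0
```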
First of all, create mountpoints for all the RAIDs formatted in LDISKFS at both nodes:
# mkdir -p /lustre_t/mdt0
# mkdir -p /lustre_t/ost0
# mkdir -p /lustre_t/ost1
# mkdir -p /lustre_t/ost2
# mkdir -p /lustre_t/ost3
Unload all the RAIDs at the node where they are active:
node26# xicli raid show ╔RAIDs═══╦══════════════════╦═════════════╦════════════════════════╦══════╗ ║ name ║ static ║ state ║ devices ║ info ║ ╠════════╬══════════════════╬═════════════╬════════════════════════╬══════╣ ║ r_mdt0 ║ size: 3576 GiB ║ online ║ 0 /dev/nvme0n1 online ║ ║ ║ ║ level: 1 ║ initialized ║ 1 /dev/nvme1n1 online ║ ║ ║ ║ strip_size: 16 ║ ║ ║ ║ ║ ║ block_size: 4096 ║ ║ ║ ║ ║ ║ sparepool: - ║ ║ ║ ║ ║ ║ active: True ║ ║ ║ ║ ║ ║ config: True ║ ║ ║ ║ ╠════════╬══════════════════╬═════════════╬════════════════════════╬══════╣ ║ r_ost0 ║ size: 14302 GiB ║ online ║ 0 /dev/nvme4n1 online ║ ║ ║ ║ level: 6 ║ initialized ║ 1 /dev/nvme5n1 online ║ ║ ║ ║ strip_size: 128 ║ ║ 2 /dev/nvme6n1 online ║ ║ ║ ║ block_size: 4096 ║ ║ 3 /dev/nvme7n1 online ║ ║ ║ ║ sparepool: - ║ ║ 4 /dev/nvme8n1 online ║ ║ ║ ║ active: True ║ ║ 5 /dev/nvme9n1 online ║ ║ ║ ║ config: True ║ ║ 6 /dev/nvme10n1 online ║ ║ ║ ║ ║ ║ 7 /dev/nvme11n1 online ║ ║ ║ ║ ║ ║ 8 /dev/nvme13n1 online ║ ║ ║ ║ ║ ║ 9 /dev/nvme14n1 online ║ ║ ╠════════╬══════════════════╬═════════════╬════════════════════════╬══════╣ ║ r_ost1 ║ size: 14302 GiB ║ online ║ 0 /dev/nvme4n2 online ║ ║ ║ ║ level: 6 ║ initialized ║ 1 /dev/nvme5n2 online ║ ║ ║ ║ strip_size: 128 ║ ║ 2 /dev/nvme6n2 online ║ ║ ║ ║ block_size: 4096 ║ ║ 3 /dev/nvme7n2 online ║ ║ ║ ║ sparepool: - ║ ║ 4 /dev/nvme8n2 online ║ ║ ║ ║ active: True ║ ║ 5 /dev/nvme9n2 online ║ ║ ║ ║ config: True ║ ║ 6 /dev/nvme10n2 online ║ ║ ║ ║ ║ ║ 7 /dev/nvme11n2 online ║ ║ ║ ║ ║ ║ 8 /dev/nvme13n2 online ║ ║ ║ ║ ║ ║ 9 /dev/nvme14n2 online ║ ║ ╠════════╬══════════════════╬═════════════╬════════════════════════╬══════╣ ║ r_ost2 ║ size: 14302 GiB ║ online ║ 0 /dev/nvme15n1 online ║ ║ ║ ║ level: 6 ║ initialized ║ 1 /dev/nvme16n1 online ║ ║ ║ ║ strip_size: 128 ║ ║ 2 /dev/nvme17n1 online ║ ║ ║ ║ block_size: 4096 ║ ║ 3 /dev/nvme18n1 online ║ ║ ║ ║ sparepool: - ║ ║ 4 /dev/nvme20n1 online ║ ║ ║ ║ active: True ║ ║ 5 /dev/nvme21n1 online ║ ║ ║ ║ config: True ║ ║ 6 /dev/nvme22n1 online ║ ║ ║ ║ ║ ║ 7 /dev/nvme23n1 online ║ ║ ║ ║ ║ ║ 8 /dev/nvme24n1 online ║ ║ ║ ║ ║ ║ 9 /dev/nvme25n1 online ║ ║ ╠════════╬══════════════════╬═════════════╬════════════════════════╬══════╣ ║ r_ost3 ║ size: 14302 GiB ║ online ║ 0 /dev/nvme15n2 online ║ ║ ║ ║ level: 6 ║ initialized ║ 1 /dev/nvme16n2 online ║ ║ ║ ║ strip_size: 128 ║ ║ 2 /dev/nvme17n2 online ║ ║ ║ ║ block_size: 4096 ║ ║ 3 /dev/nvme18n2 online ║ ║ ║ ║ sparepool: - ║ ║ 4 /dev/nvme20n2 online ║ ║ ║ ║ active: True ║ ║ 5 /dev/nvme21n2 online ║ ║ ║ ║ config: True ║ ║ 6 /dev/nvme22n2 online ║ ║ ║ ║ ║ ║ 7 /dev/nvme23n2 online ║ ║ ║ ║ ║ ║ 8 /dev/nvme24n2 online ║ ║ ║ ║ ║ ║ 9 /dev/nvme25n2 online ║ ║ ╚════════╩══════════════════╩═════════════╩════════════════════════╩══════╝ node26# xicli raid unload -n r_mdt0 node26# xicli raid unload -n r_ost0 node26# xicli raid unload -n r_ost1 node26# xicli raid unload -n r_ost2 node26# xicli raid unload -n r_ost3 node26# xicli raid show ╔RAIDs═══╦══════════════════╦═══════╦═════════╦══════╗ ║ name ║ static ║ state ║ devices ║ info ║ ╠════════╬══════════════════╬═══════╬═════════╬══════╣ ║ r_mdt0 ║ size: 3576 GiB ║ None ║ ║ ║ ║ ║ level: 1 ║ ║ ║ ║ ║ ║ strip_size: 16 ║ ║ ║ ║ ║ ║ block_size: 4096 ║ ║ ║ ║ ║ ║ sparepool: - ║ ║ ║ ║ ║ ║ active: False ║ ║ ║ ║ ║ ║ config: True ║ ║ ║ ║ ╠════════╬══════════════════╬═══════╬═════════╬══════╣ ║ r_ost0 ║ size: 14302 GiB ║ None ║ ║ ║ ║ ║ level: 6 ║ ║ ║ ║ ║ ║ strip_size: 128 ║ ║ ║ ║ ║ ║ block_size: 4096 ║ ║ ║ ║ ║ ║ sparepool: - ║ ║ ║ ║ ║ ║ active: False ║ ║ ║ ║ ║ ║ config: True ║ ║ ║ ║ 
╠════════╬══════════════════╬═══════╬═════════╬══════╣ ║ r_ost1 ║ size: 14302 GiB ║ None ║ ║ ║ ║ ║ level: 6 ║ ║ ║ ║ ║ ║ strip_size: 128 ║ ║ ║ ║ ║ ║ block_size: 4096 ║ ║ ║ ║ ║ ║ sparepool: - ║ ║ ║ ║ ║ ║ active: False ║ ║ ║ ║ ║ ║ config: True ║ ║ ║ ║ ╠════════╬══════════════════╬═══════╬═════════╬══════╣ ║ r_ost2 ║ size: 14302 GiB ║ None ║ ║ ║ ║ ║ level: 6 ║ ║ ║ ║ ║ ║ strip_size: 128 ║ ║ ║ ║ ║ ║ block_size: 4096 ║ ║ ║ ║ ║ ║ sparepool: - ║ ║ ║ ║ ║ ║ active: False ║ ║ ║ ║ ║ ║ config: True ║ ║ ║ ║ ╠════════╬══════════════════╬═══════╬═════════╬══════╣ ║ r_ost3 ║ size: 14302 GiB ║ None ║ ║ ║ ║ ║ level: 6 ║ ║ ║ ║ ║ ║ strip_size: 128 ║ ║ ║ ║ ║ ║ block_size: 4096 ║ ║ ║ ║ ║ ║ sparepool: - ║ ║ ║ ║ ║ ║ active: False ║ ║ ║ ║ ║ ║ config: True ║ ║ ║ ║ ╚════════╩══════════════════╩═══════╩═════════╩══════╝
Create a copy of the cluster information base to make changes to at the first node:
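For example:

```bash
# Dump the current CIB into a local file that we will edit with "pcs -f"
node26# pcs cluster cib fs_cfg
```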
node26# ls -l fs_cfg
-rw-r--r--. 1 root root 8614 Jul 20 02:04 fs_cfg
Getting the RAIDs UUIDs:
node26# grep uuid /etc/xiraid/raids/*.conf
/etc/xiraid/raids/r_mdt0.conf:    "uuid": "75E2CAA5-3E5B-4ED0-89E9-4BF3850FD542",
/etc/xiraid/raids/r_ost0.conf:    "uuid": "AB341442-20AC-43B1-8FE6-F9ED99D1D6C0",
/etc/xiraid/raids/r_ost1.conf:    "uuid": "1441D09C-0073-4555-A398-71984E847F9E",
/etc/xiraid/raids/r_ost2.conf:    "uuid": "0E225812-6877-4344-A552-B6A408EC7351",
/etc/xiraid/raids/r_ost3.conf:    "uuid": "F749B8A7-3CC4-45A9-A61E-E75EDBB3A53E",
Creating resource rr_mdt0 for the r_mdt0 RAID:

node26# pcs -f fs_cfg resource create rr_mdt0 ocf:xraid:raid name=r_mdt0 uuid=75E2CAA5-3E5B-4ED0-89E9-4BF3850FD542 op monitor interval=5s meta migration-threshold=1

Setting a constraint to make the first node preferable for the rr_mdt0 resource:

node26# pcs -f fs_cfg constraint location rr_mdt0 prefers node26-ic=50

Creating a resource for the r_mdt0 RAID mountpoint at /lustre_t/mdt0:

node26# pcs -f fs_cfg resource create fsr_mdt0 Filesystem device="/dev/xi_r_mdt0" directory="/lustre_t/mdt0" fstype="lustre"

Configure the cluster to start rr_mdt0 and fsr_mdt0 at the same node ONLY:

node26# pcs -f fs_cfg constraint colocation add rr_mdt0 with fsr_mdt0 INFINITY

Configure the cluster to start fsr_mdt0 only after rr_mdt0:

node26# pcs -f fs_cfg constraint order rr_mdt0 then fsr_mdt0
Configure other resources in the same way:
node26# pcs -f fs_cfg resource create rr_ost0 ocf:xraid:raid name=r_ost0 uuid=AB341442-20AC-43B1-8FE6-F9ED99D1D6C0 op monitor interval=5s meta migration-threshold=1
node26# pcs -f fs_cfg constraint location rr_ost0 prefers node26-ic=50
node26# pcs -f fs_cfg resource create fsr_ost0 Filesystem device="/dev/xi_r_ost0" directory="/lustre_t/ost0" fstype="lustre"
node26# pcs -f fs_cfg constraint colocation add rr_ost0 with fsr_ost0 INFINITY
node26# pcs -f fs_cfg constraint order rr_ost0 then fsr_ost0
node26# pcs -f fs_cfg resource create rr_ost1 ocf:xraid:raid name=r_ost1 uuid=1441D09C-0073-4555-A398-71984E847F9E op monitor interval=5s meta migration-threshold=1
node26# pcs -f fs_cfg constraint location rr_ost1 prefers node27-ic=50
node26# pcs -f fs_cfg resource create fsr_ost1 Filesystem device="/dev/xi_r_ost1" directory="/lustre_t/ost1" fstype="lustre"
node26# pcs -f fs_cfg constraint colocation add rr_ost1 with fsr_ost1 INFINITY
node26# pcs -f fs_cfg constraint order rr_ost1 then fsr_ost1
node26# pcs -f fs_cfg resource create rr_ost2 ocf:xraid:raid name=r_ost2 uuid=0E225812-6877-4344-A552-B6A408EC7351 op monitor interval=5s meta migration-threshold=1
node26# pcs -f fs_cfg constraint location rr_ost2 prefers node26-ic=50
node26# pcs -f fs_cfg resource create fsr_ost2 Filesystem device="/dev/xi_r_ost2" directory="/lustre_t/ost2" fstype="lustre"
node26# pcs -f fs_cfg constraint colocation add rr_ost2 with fsr_ost2 INFINITY
node26# pcs -f fs_cfg constraint order rr_ost2 then fsr_ost2
node26# pcs -f fs_cfg resource create rr_ost3 ocf:xraid:raid name=r_ost3 uuid=F749B8A7-3CC4-45A9-A61E-E75EDBB3A53E op monitor interval=5s meta migration-threshold=1
node26# pcs -f fs_cfg constraint location rr_ost3 prefers node27-ic=50
node26# pcs -f fs_cfg resource create fsr_ost3 Filesystem device="/dev/xi_r_ost3" directory="/lustre_t/ost3" fstype="lustre"
node26# pcs -f fs_cfg constraint colocation add rr_ost3 with fsr_ost3 INFINITY
node26# pcs -f fs_cfg constraint order rr_ost3 then fsr_ost3
In xiRAID Classic 4.1, it is required to guarantee that only one RAID is starting at a time. To do so, we define the following Serialize ordering constraints. This limitation is planned for removal in xiRAID Classic 4.2.
node26# pcs -f fs_cfg constraint order start rr_mdt0 then start rr_ost0 kind=Serialize
node26# pcs -f fs_cfg constraint order start rr_mdt0 then start rr_ost1 kind=Serialize
node26# pcs -f fs_cfg constraint order start rr_mdt0 then start rr_ost2 kind=Serialize
node26# pcs -f fs_cfg constraint order start rr_mdt0 then start rr_ost3 kind=Serialize
node26# pcs -f fs_cfg constraint order start rr_ost0 then start rr_ost1 kind=Serialize
node26# pcs -f fs_cfg constraint order start rr_ost0 then start rr_ost2 kind=Serialize
node26# pcs -f fs_cfg constraint order start rr_ost0 then start rr_ost3 kind=Serialize
node26# pcs -f fs_cfg constraint order start rr_ost1 then start rr_ost2 kind=Serialize
node26# pcs -f fs_cfg constraint order start rr_ost1 then start rr_ost3 kind=Serialize
node26# pcs -f fs_cfg constraint order start rr_ost2 then start rr_ost3 kind=Serialize
To ensure the Lustre servers start in the proper order, we need to configure the cluster to start the MDS before all the OSSs. Since the MDS and OSS services are started automatically when the corresponding LDISKFS target is mounted, we just need to set the proper start order for the fsr_* resources:
node26# pcs -f fs_cfg constraint order fsr_mdt0 then fsr_ost0
node26# pcs -f fs_cfg constraint order fsr_mdt0 then fsr_ost1
node26# pcs -f fs_cfg constraint order fsr_mdt0 then fsr_ost2
node26# pcs -f fs_cfg constraint order fsr_mdt0 then fsr_ost3
Applying the batch cluster information base changes:
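For example:

```bash
# Push the prepared configuration from the fs_cfg file into the live cluster
node26# pcs cluster cib-push fs_cfg
```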
Checking the resulting cluster configuration:
node26# pcs status
Cluster name: lustrebox0
Cluster Summary:
  * Stack: corosync (Pacemaker is running)
  * Current DC: node26-ic (version 2.1.7-5.el8_10-0f7f88312) - partition with quorum
  * Last updated: Tue Jul 23 02:14:54 2024 on node26-ic
  * Last change: Tue Jul 23 02:14:50 2024 by root via root on node26-ic
  * 2 nodes configured
  * 12 resource instances configured

Node List:
  * Online: [ node26-ic node27-ic ]

Full List of Resources:
  * node27.stonith (stonith:fence_ipmilan): Started node26-ic
  * node26.stonith (stonith:fence_ipmilan): Started node27-ic
  * rr_mdt0 (ocf::xraid:raid): Started node26-ic
  * fsr_mdt0 (ocf::heartbeat:Filesystem): Started node26-ic
  * rr_ost0 (ocf::xraid:raid): Started node26-ic
  * fsr_ost0 (ocf::heartbeat:Filesystem): Started node26-ic
  * rr_ost1 (ocf::xraid:raid): Started node27-ic
  * fsr_ost1 (ocf::heartbeat:Filesystem): Started node27-ic
  * rr_ost2 (ocf::xraid:raid): Started node26-ic
  * fsr_ost2 (ocf::heartbeat:Filesystem): Started node26-ic
  * rr_ost3 (ocf::xraid:raid): Started node27-ic
  * fsr_ost3 (ocf::heartbeat:Filesystem): Started node27-ic

Daemon Status:
  corosync: active/disabled
  pacemaker: active/disabled
  pcsd: active/enabled
Double-check on both nodes that the RAIDs are active and the filesystems are mounted properly. Please note that we have all the OST RAIDs based on /dev/nvme*n1 active on the first node (node26) and all the OST RAIDs based on /dev/nvme*n2 on the second one (node27), which will help us utilize the full NVMe throughput as planned.
node26:
node26# xicli raid show ╔RAIDs═══╦══════════════════╦═════════════╦════════════════════════╦══════╗ ║ name ║ static ║ state ║ devices ║ info ║ ╠════════╬══════════════════╬═════════════╬════════════════════════╬══════╣ ║ r_mdt0 ║ size: 3576 GiB ║ online ║ 0 /dev/nvme0n1 online ║ ║ ║ ║ level: 1 ║ initialized ║ 1 /dev/nvme1n1 online ║ ║ ║ ║ strip_size: 16 ║ ║ ║ ║ ║ ║ block_size: 4096 ║ ║ ║ ║ ║ ║ sparepool: - ║ ║ ║ ║ ║ ║ active: True ║ ║ ║ ║ ║ ║ config: True ║ ║ ║ ║ ╠════════╬══════════════════╬═════════════╬════════════════════════╬══════╣ ║ r_ost0 ║ size: 14302 GiB ║ online ║ 0 /dev/nvme4n1 online ║ ║ ║ ║ level: 6 ║ initialized ║ 1 /dev/nvme5n1 online ║ ║ ║ ║ strip_size: 128 ║ ║ 2 /dev/nvme6n1 online ║ ║ ║ ║ block_size: 4096 ║ ║ 3 /dev/nvme7n1 online ║ ║ ║ ║ sparepool: - ║ ║ 4 /dev/nvme8n1 online ║ ║ ║ ║ active: True ║ ║ 5 /dev/nvme9n1 online ║ ║ ║ ║ config: True ║ ║ 6 /dev/nvme10n1 online ║ ║ ║ ║ ║ ║ 7 /dev/nvme11n1 online ║ ║ ║ ║ ║ ║ 8 /dev/nvme13n1 online ║ ║ ║ ║ ║ ║ 9 /dev/nvme14n1 online ║ ║ ╠════════╬══════════════════╬═════════════╬════════════════════════╬══════╣ ║ r_ost1 ║ size: 14302 GiB ║ None ║ ║ ║ ║ ║ level: 6 ║ ║ ║ ║ ║ ║ strip_size: 128 ║ ║ ║ ║ ║ ║ block_size: 4096 ║ ║ ║ ║ ║ ║ sparepool: - ║ ║ ║ ║ ║ ║ active: False ║ ║ ║ ║ ║ ║ config: True ║ ║ ║ ║ ╠════════╬══════════════════╬═════════════╬════════════════════════╬══════╣ ║ r_ost2 ║ size: 14302 GiB ║ online ║ 0 /dev/nvme15n1 online ║ ║ ║ ║ level: 6 ║ initialized ║ 1 /dev/nvme16n1 online ║ ║ ║ ║ strip_size: 128 ║ ║ 2 /dev/nvme17n1 online ║ ║ ║ ║ block_size: 4096 ║ ║ 3 /dev/nvme18n1 online ║ ║ ║ ║ sparepool: - ║ ║ 4 /dev/nvme20n1 online ║ ║ ║ ║ active: True ║ ║ 5 /dev/nvme21n1 online ║ ║ ║ ║ config: True ║ ║ 6 /dev/nvme22n1 online ║ ║ ║ ║ ║ ║ 7 /dev/nvme23n1 online ║ ║ ║ ║ ║ ║ 8 /dev/nvme24n1 online ║ ║ ║ ║ ║ ║ 9 /dev/nvme25n1 online ║ ║ ╠════════╬══════════════════╬═════════════╬════════════════════════╬══════╣ ║ r_ost3 ║ size: 14302 GiB ║ None ║ ║ ║ ║ ║ level: 6 ║ ║ ║ ║ ║ ║ strip_size: 128 ║ ║ ║ ║ ║ ║ block_size: 4096 ║ ║ ║ ║ ║ ║ sparepool: - ║ ║ ║ ║ ║ ║ active: False ║ ║ ║ ║ ║ ║ config: True ║ ║ ║ ║ ╚════════╩══════════════════╩═════════════╩════════════════════════╩══════╝ node26# df -h|grep xi /dev/xi_r_mdt0 2.1T 5.7M 2.0T 1% /lustre_t/mdt0 /dev/xi_r_ost0 14T 1.3M 14T 1% /lustre_t/ost0 /dev/xi_r_ost2 14T 1.3M 14T 1% /lustre_t/ost2
node27:
node27# xicli raid show ╔RAIDs═══╦══════════════════╦═════════════╦════════════════════════╦══════╗ ║ name ║ static ║ state ║ devices ║ info ║ ╠════════╬══════════════════╬═════════════╬════════════════════════╬══════╣ ║ r_mdt0 ║ size: 3576 GiB ║ None ║ ║ ║ ║ ║ level: 1 ║ ║ ║ ║ ║ ║ strip_size: 16 ║ ║ ║ ║ ║ ║ block_size: 4096 ║ ║ ║ ║ ║ ║ sparepool: - ║ ║ ║ ║ ║ ║ active: False ║ ║ ║ ║ ║ ║ config: True ║ ║ ║ ║ ╠════════╬══════════════════╬═════════════╬════════════════════════╬══════╣ ║ r_ost0 ║ size: 14302 GiB ║ None ║ ║ ║ ║ ║ level: 6 ║ ║ ║ ║ ║ ║ strip_size: 128 ║ ║ ║ ║ ║ ║ block_size: 4096 ║ ║ ║ ║ ║ ║ sparepool: - ║ ║ ║ ║ ║ ║ active: False ║ ║ ║ ║ ║ ║ config: True ║ ║ ║ ║ ╠════════╬══════════════════╬═════════════╬════════════════════════╬══════╣ ║ r_ost1 ║ size: 14302 GiB ║ online ║ 0 /dev/nvme4n2 online ║ ║ ║ ║ level: 6 ║ initialized ║ 1 /dev/nvme5n2 online ║ ║ ║ ║ strip_size: 128 ║ ║ 2 /dev/nvme6n2 online ║ ║ ║ ║ block_size: 4096 ║ ║ 3 /dev/nvme7n2 online ║ ║ ║ ║ sparepool: - ║ ║ 4 /dev/nvme8n2 online ║ ║ ║ ║ active: True ║ ║ 5 /dev/nvme9n2 online ║ ║ ║ ║ config: True ║ ║ 6 /dev/nvme10n2 online ║ ║ ║ ║ ║ ║ 7 /dev/nvme11n2 online ║ ║ ║ ║ ║ ║ 8 /dev/nvme13n2 online ║ ║ ║ ║ ║ ║ 9 /dev/nvme14n2 online ║ ║ ╠════════╬══════════════════╬═════════════╬════════════════════════╬══════╣ ║ r_ost2 ║ size: 14302 GiB ║ None ║ ║ ║ ║ ║ level: 6 ║ ║ ║ ║ ║ ║ strip_size: 128 ║ ║ ║ ║ ║ ║ block_size: 4096 ║ ║ ║ ║ ║ ║ sparepool: - ║ ║ ║ ║ ║ ║ active: False ║ ║ ║ ║ ║ ║ config: True ║ ║ ║ ║ ╠════════╬══════════════════╬═════════════╬════════════════════════╬══════╣ ║ r_ost3 ║ size: 14302 GiB ║ online ║ 0 /dev/nvme15n2 online ║ ║ ║ ║ level: 6 ║ initialized ║ 1 /dev/nvme16n2 online ║ ║ ║ ║ strip_size: 128 ║ ║ 2 /dev/nvme17n2 online ║ ║ ║ ║ block_size: 4096 ║ ║ 3 /dev/nvme18n2 online ║ ║ ║ ║ sparepool: - ║ ║ 4 /dev/nvme20n2 online ║ ║ ║ ║ active: True ║ ║ 5 /dev/nvme21n2 online ║ ║ ║ ║ config: True ║ ║ 6 /dev/nvme22n2 online ║ ║ ║ ║ ║ ║ 7 /dev/nvme23n2 online ║ ║ ║ ║ ║ ║ 8 /dev/nvme24n2 online ║ ║ ║ ║ ║ ║ 9 /dev/nvme25n2 online ║ ║ ╚════════╩══════════════════╩═════════════╩════════════════════════╩══════╝ node27# df -h|grep xi /dev/xi_r_ost1 14T 1.3M 14T 1% /lustre_t/ost1 /dev/xi_r_ost3 14T 1.3M 14T 1% /lustre_t/ost3
Lustre performance tuning
Here we set some parameters for performance optimisation. All the commands have to be run on the host where the MDS is running.
Server-side parameters:
# OSTs: 16MB bulk RPCs
node26# lctl set_param -P obdfilter.*.brw_size=16
node26# lctl set_param -P obdfilter.*.precreate_batch=1024
# Clients: 16MB RPCs
node26# lctl set_param -P obdfilter.*.osc.max_pages_per_rpc=4096
node26# lctl set_param -P osc.*.max_pages_per_rpc=4096
# Clients: 128 RPCs in flight
node26# lctl set_param -P mdc.*.max_rpcs_in_flight=128
node26# lctl set_param -P osc.*.max_rpcs_in_flight=128
node26# lctl set_param -P mdc.*.max_mod_rpcs_in_flight=127
# Clients: Disable memory and wire checksums (~20% performance hit)
node26# lctl set_param -P llite.*.checksum_pages=0
node26# lctl set_param -P llite.*.checksums=0
node26# lctl set_param -P osc.*.checksums=0
node26# lctl set_param -P mdc.*.checksums=0
These parameters are optimised for the best performance. They are not universal and may not be optimal in some cases.
Tests
Testbed description
Lustre client systems:
The Lustre client systems are 4 identically configured servers connected to the same InfiniBand switch, a Mellanox Quantum HDR Edge Switch QM8700. The SBB system nodes (the cluster nodes) are connected to the same switch. The Lustre client parameters are modified to get the best performance; these parameter changes are commonly used by the Lustre community for modern high-performance benchmarks. More details are provided in the table below:
| Hostname | lclient00 | lclient01 | lclient02 | lclient03 |
|---|---|---|---|---|
| CPU | AMD EPYC 7502 32-Core | AMD EPYC 7502 32-Core | AMD EPYC 7502 32-Core | AMD EPYC 7502 32-Core |
| Memory | 256GB | 256GB | 256GB | 256GB |
| OS drives | INTEL SSDPEKKW256G8 | INTEL SSDPEKKW256G8 | INTEL SSDPEKKW256G8 | INTEL SSDPEKKW256G8 |
| OS | Rocky Linux 8.7 | Rocky Linux 8.7 | Rocky Linux 8.7 | Rocky Linux 8.7 |
| Management NIC | 192.168.65.50 | 192.168.65.52 | 192.168.65.54 | 192.168.65.56 |
| Infiniband LNET HDR | 100.100.100.50 | 100.100.100.52 | 100.100.100.54 | 100.100.100.56 |
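The clients mount the lustre0 filesystem at /mnt.l, listing both MGS NIDs so that either cluster node can be contacted; a typical client mount command for this setup would be:

```bash
# Mount the lustre0 filesystem on a client (both MGS NIDs listed for failover)
mount -t lustre 100.100.100.26@o2ib:100.100.100.27@o2ib:/lustre0 /mnt.l
```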
The Lustre clients are combined into a simple OpenMPI cluster, and the standard parallel filesystem benchmark, IOR, is used to run the tests. The test files are created in the stripe4M subfolder, which was created on the Lustre filesystem with the following striping parameters:
lclient01# mkdir /mnt.l/stripe4M
lclient01# lfs setstripe -c -1 -S 4M /mnt.l/stripe4M/
Test results
We used the standard parallel filesystem benchmark, IOR, to measure the performance of the installation. In our example, we ran 4 tests. Each test is started with 128 threads spread among the 4 clients. The tests differ by transfer size (1M and 128M) and the use of directIO.
Normal state cluster performance
Tests with directIO enabled
The following listing shows the test command and results for the directIO-enabled test with a transfer size of 1MB.

lclient01# /usr/lib64/openmpi/bin/mpirun --allow-run-as-root --hostfile ./hfile -np 128 --map-by node /usr/bin/ior -F -t 1M -b 8G -k -r -w -o /mnt.l/stripe4M/testfile --posix.odirect
. . .
access    bw(MiB/s)  IOPS   Latency(s)  block(KiB)  xfer(KiB)  open(s)   wr/rd(s)  close(s)  total(s)  iter
------    ---------  ----   ----------  ----------  ---------  --------  --------  --------  --------  ----
write     19005      19005  0.006691    8388608     1024.00    0.008597  55.17     3.92      55.17     0
read      82075      82077  0.001545    8388608     1024.00    0.002592  12.78     0.213460  12.78     0

Max Write: 19005.04 MiB/sec (19928.23 MB/sec)
Max Read:  82075.33 MiB/sec (86062.22 MB/sec)

Summary of all tests:
Operation  Max(MiB)  Min(MiB)  Mean(MiB)  StdDev  Max(OPs)  Min(OPs)  Mean(OPs)  StdDev  Mean(s)  Stonewall(s)  Stonewall(MiB)  Test#  #Tasks  tPN  reps  fPP  reord  reordoff  reordrand  seed  segcnt  blksiz  xsize  aggs(MiB)  API  RefNum
write      19005.04  19005.04  19005.04   0.00    19005.04  19005.04  19005.04   0.00    55.17357  NA  NA  0  128  32  1  1  0  1  0  0  1  8589934592  1048576  1048576.0  POSIX  0
read       82075.33  82075.33  82075.33   0.00    82075.33  82075.33  82075.33   0.00    12.77578  NA  NA  0  128  32  1  1  0  1  0  0  1  8589934592  1048576  1048576.0  POSIX  0
The following listing shows the test command and results for the directIO-enabled test with a transfer size of 128MB.

lclient01# /usr/lib64/openmpi/bin/mpirun --allow-run-as-root --hostfile ./hfile -np 128 --map-by node /usr/bin/ior -F -t 128M -b 8G -k -r -w -o /mnt.l/stripe4M/testfile --posix.odirect
. . .
access    bw(MiB/s)  IOPS    Latency(s)  block(KiB)  xfer(KiB)  open(s)   wr/rd(s)  close(s)  total(s)  iter
------    ---------  ----    ----------  ----------  ---------  --------  --------  --------  --------  ----
write     52892      413.23  0.306686    8388608     131072     0.096920  19.82     0.521081  19.82     0
read      70588      551.50  0.229853    8388608     131072     0.002983  14.85     0.723477  14.85     0

Max Write: 52892.27 MiB/sec (55461.56 MB/sec)
Max Read:  70588.32 MiB/sec (74017.22 MB/sec)

Summary of all tests:
Operation  Max(MiB)  Min(MiB)  Mean(MiB)  StdDev  Max(OPs)  Min(OPs)  Mean(OPs)  StdDev  Mean(s)  Stonewall(s)  Stonewall(MiB)  Test#  #Tasks  tPN  reps  fPP  reord  reordoff  reordrand  seed  segcnt  blksiz  xsize  aggs(MiB)  API  RefNum
write      52892.27  52892.27  52892.27   0.00    413.22    413.22    413.22     0.00    19.82475  NA  NA  0  128  32  1  1  0  1  0  0  1  8589934592  134217728  1048576.0  POSIX  0
read       70588.32  70588.32  70588.32   0.00    551.47    551.47    551.47     0.00    14.85481  NA  NA  0  128  32  1  1  0  1  0  0  1  8589934592  134217728  1048576.0  POSIX  0
Tests with directIO disabled
The following listing shows the test command and results for the buffered I/O (directIO disabled) test with a transfer size of 1MB.
lclient01# /usr/lib64/openmpi/bin/mpirun --allow-run-as-root --hostfile ./hfile -np 128 --map-by node /usr/bin/ior -F -t 1M -b 8G -k -r -w -o /mnt.l/stripe4M/testfile . . . access bw(MiB/s) IOPS Latency(s) block(KiB) xfer(KiB) open(s) wr/rd(s) close(s) total(s) iter ------ --------- ---- ---------- ---------- --------- -------- -------- -------- -------- ---- write 48202 48204 0.002587 8388608 1024.00 0.008528 21.75 1.75 21.75 0 read 40960 40960 0.002901 8388608 1024.00 0.002573 25.60 2.39 25.60 0 Max Write: 48202.43 MiB/sec (50543.91 MB/sec) Max Read: 40959.57 MiB/sec (42949.22 MB/sec) Summary of all tests: Operation Max(MiB) Min(MiB) Mean(MiB) StdDev Max(OPs) Min(OPs) Mean(OPs) StdDev Mean(s) Stonewall(s) Stonewall(MiB) Test# #Tasks tPN reps fPP reord reordoff reordrand seed segcnt blksiz xsize aggs(MiB) API RefNum write 48202.43 48202.43 48202.43 0.00 48202.43 48202.43 48202.43 0.00 21.75359 NA NA 0 128 32 1 1 0 1 0 0 1 8589934592 1048576 1048576.0 POSIX 0 read 40959.57 40959.57 40959.57 0.00 40959.57 40959.57 40959.57 0.00 25.60027 NA NA 0 128 32 1 1 0 1 0 0 1 8589934592 1048576 1048576.0 POSIX 0
The following listing shows the test command and results for the buffered I/O (directIO disabled) test with a transfer size of 128MB.
lclient01# /usr/lib64/openmpi/bin/mpirun --allow-run-as-root --hostfile ./hfile -np 128 --map-by node /usr/bin/ior -F -t 128M -b 8G -k -r -w -o /mnt.l/stripe4M/testfile . . . access bw(MiB/s) IOPS Latency(s) block(KiB) xfer(KiB) open(s) wr/rd(s) close(s) total(s) iter ------ --------- ---- ---------- ---------- --------- -------- -------- -------- -------- ---- write 46315 361.84 0.349582 8388608 131072 0.009255 22.64 2.70 22.64 0 read 39435 308.09 0.368192 8388608 131072 0.002689 26.59 7.65 26.59 0 Max Write: 46314.67 MiB/sec (48564.45 MB/sec) Max Read: 39434.54 MiB/sec (41350.12 MB/sec) Summary of all tests: Operation Max(MiB) Min(MiB) Mean(MiB) StdDev Max(OPs) Min(OPs) Mean(OPs) StdDev Mean(s) Stonewall(s) Stonewall(MiB) Test# #Tasks tPN reps fPP reord reordoff reordrand seed segcnt blksiz xsize aggs(MiB) API RefNum write 46314.67 46314.67 46314.67 0.00 361.83 361.83 361.83 0.00 22.64026 NA NA 0 128 32 1 1 0 1 0 0 1 8589934592 134217728 1048576.0 POSIX 0 read 39434.54 39434.54 39434.54 0.00 308.08 308.08 308.08 0.00 26.59029 NA NA 0 128 32 1 1 0 1 0 0 1 8589934592 134217728 1048576.0 POSIX 0
Failover behavior
To check the cluster behavior in case of a node failure, we will crash one of the nodes to simulate such a failure. Before the simulation, let's check the normal cluster state:
# pcs status Cluster name: lustrebox0 Cluster Summary: * Stack: corosync (Pacemaker is running) * Current DC: node27-ic (version 2.1.7-5.el8_10-0f7f88312) - partition with quorum * Last updated: Tue Aug 13 19:13:23 2024 on node26-ic * Last change: Tue Aug 13 19:13:18 2024 by hacluster via hacluster on node27-ic * 2 nodes configured * 12 resource instances configured Node List: * Online: [ node26-ic node27-ic ] Full List of Resources: * rr_mdt0 (ocf::xraid:raid): Started node26-ic * fsr_mdt0 (ocf::heartbeat:Filesystem): Started node26-ic * rr_ost0 (ocf::xraid:raid): Started node26-ic * fsr_ost0 (ocf::heartbeat:Filesystem): Started node26-ic * rr_ost1 (ocf::xraid:raid): Started node27-ic * fsr_ost1 (ocf::heartbeat:Filesystem): Started node27-ic * rr_ost2 (ocf::xraid:raid): Started node26-ic * fsr_ost2 (ocf::heartbeat:Filesystem): Started node26-ic * rr_ost3 (ocf::xraid:raid): Started node27-ic * fsr_ost3 (ocf::heartbeat:Filesystem): Started node27-ic * node27.stonith (stonith:fence_ipmilan): Started node26-ic * node26.stonith (stonith:fence_ipmilan): Started node27-ic Daemon Status: corosync: active/disabled pacemaker: active/disabled pcsd: active/enabled
Now let's simulate a crash of node26:
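A common way to simulate a hard node failure (a sketch, assuming the magic SysRq interface is enabled via the kernel.sysrq sysctl) is to trigger a kernel panic directly, after which the node stops responding immediately:
node26# echo c > /proc/sysrq-trigger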
Here node27 detects that node26 is not responding and prepares to fence it:
node27# pcs status Cluster name: lustrebox0 Cluster Summary: * Stack: corosync (Pacemaker is running) * Current DC: node27-ic (version 2.1.7-5.el8_10-0f7f88312) - partition with quorum * Last updated: Fri Aug 30 00:55:04 2024 on node27-ic * Last change: Thu Aug 29 01:26:09 2024 by root via root on node26-ic * 2 nodes configured * 12 resource instances configured Node List: * Node node26-ic: UNCLEAN (offline) * Online: [ node27-ic ] Full List of Resources: * rr_mdt0 (ocf::xraid:raid): Started node26-ic (UNCLEAN) * fsr_mdt0 (ocf::heartbeat:Filesystem): Started node26-ic (UNCLEAN) * rr_ost0 (ocf::xraid:raid): Started node26-ic (UNCLEAN) * fsr_ost0 (ocf::heartbeat:Filesystem): Started node26-ic (UNCLEAN) * rr_ost1 (ocf::xraid:raid): Started node27-ic * fsr_ost1 (ocf::heartbeat:Filesystem): Stopped * rr_ost2 (ocf::xraid:raid): Started node26-ic (UNCLEAN) * fsr_ost2 (ocf::heartbeat:Filesystem): Started node26-ic (UNCLEAN) * rr_ost3 (ocf::xraid:raid): Started node27-ic * fsr_ost3 (ocf::heartbeat:Filesystem): Stopping node27-ic * node27.stonith (stonith:fence_ipmilan): Started node26-ic (UNCLEAN) * node26.stonith (stonith:fence_ipmilan): Started node27-ic Pending Fencing Actions: * reboot of node26-ic pending: client=pacemaker-controld.286449, origin=node27-ic Daemon Status: corosync: active/disabled pacemaker: active/disabled pcsd: active/enabled
After successful fencing of node26, all cluster resources come online on node27. During the experiment, the cluster required about 1 minute 50 seconds to detect node26's absence, fence it, and start all the services in the required sequence on the surviving node27:
node27# pcs status Cluster name: lustrebox0 Cluster Summary: * Stack: corosync (Pacemaker is running) * Current DC: node27-ic (version 2.1.7-5.el8_10-0f7f88312) - partition with quorum * Last updated: Fri Aug 30 00:56:30 2024 on node27-ic * Last change: Thu Aug 29 01:26:09 2024 by root via root on node26-ic * 2 nodes configured * 12 resource instances configured Node List: * Online: [ node27-ic ] * OFFLINE: [ node26-ic ] Full List of Resources: * rr_mdt0 (ocf::xraid:raid): Started node27-ic * fsr_mdt0 (ocf::heartbeat:Filesystem): Started node27-ic * rr_ost0 (ocf::xraid:raid): Started node27-ic * fsr_ost0 (ocf::heartbeat:Filesystem): Starting node27-ic * rr_ost1 (ocf::xraid:raid): Started node27-ic * fsr_ost1 (ocf::heartbeat:Filesystem): Started node27-ic * rr_ost2 (ocf::xraid:raid): Started node27-ic * fsr_ost2 (ocf::heartbeat:Filesystem): Starting node27-ic * rr_ost3 (ocf::xraid:raid): Started node27-ic * fsr_ost3 (ocf::heartbeat:Filesystem): Started node27-ic * node27.stonith (stonith:fence_ipmilan): Stopped * node26.stonith (stonith:fence_ipmilan): Started node27-ic Daemon Status: corosync: active/disabled pacemaker: active/disabled pcsd: active/enabled
Since node26 was not shut down properly, the RAIDs that migrated to node27 are being initialized to prevent a write hole. This is the expected behavior:
node27# xicli raid show ╔RAIDs═══╦══════════════════╦═════════════╦════════════════════════╦═══════════════════╗ ║ name ║ static ║ state ║ devices ║ info ║ ╠════════╬══════════════════╬═════════════╬════════════════════════╬═══════════════════╣ ║ r_mdt0 ║ size: 3576 GiB ║ online ║ 0 /dev/nvme0n1 online ║ ║ ║ ║ level: 1 ║ initialized ║ 1 /dev/nvme1n1 online ║ ║ ║ ║ strip_size: 16 ║ ║ ║ ║ ║ ║ block_size: 4096 ║ ║ ║ ║ ║ ║ sparepool: - ║ ║ ║ ║ ║ ║ active: True ║ ║ ║ ║ ║ ║ config: True ║ ║ ║ ║ ╠════════╬══════════════════╬═════════════╬════════════════════════╬═══════════════════╣ ║ r_ost0 ║ size: 14302 GiB ║ online ║ 0 /dev/nvme4n1 online ║ init_progress: 31 ║ ║ ║ level: 6 ║ initing ║ 1 /dev/nvme5n1 online ║ ║ ║ ║ strip_size: 128 ║ ║ 2 /dev/nvme6n1 online ║ ║ ║ ║ block_size: 4096 ║ ║ 3 /dev/nvme7n1 online ║ ║ ║ ║ sparepool: - ║ ║ 4 /dev/nvme8n1 online ║ ║ ║ ║ active: True ║ ║ 5 /dev/nvme9n1 online ║ ║ ║ ║ config: True ║ ║ 6 /dev/nvme10n1 online ║ ║ ║ ║ ║ ║ 7 /dev/nvme11n1 online ║ ║ ║ ║ ║ ║ 8 /dev/nvme13n1 online ║ ║ ║ ║ ║ ║ 9 /dev/nvme14n1 online ║ ║ ╠════════╬══════════════════╬═════════════╬════════════════════════╬═══════════════════╣ ║ r_ost1 ║ size: 14302 GiB ║ online ║ 0 /dev/nvme4n2 online ║ ║ ║ ║ level: 6 ║ initialized ║ 1 /dev/nvme5n2 online ║ ║ ║ ║ strip_size: 128 ║ ║ 2 /dev/nvme6n2 online ║ ║ ║ ║ block_size: 4096 ║ ║ 3 /dev/nvme7n2 online ║ ║ ║ ║ sparepool: - ║ ║ 4 /dev/nvme8n2 online ║ ║ ║ ║ active: True ║ ║ 5 /dev/nvme9n2 online ║ ║ ║ ║ config: True ║ ║ 6 /dev/nvme10n2 online ║ ║ ║ ║ ║ ║ 7 /dev/nvme11n2 online ║ ║ ║ ║ ║ ║ 8 /dev/nvme13n2 online ║ ║ ║ ║ ║ ║ 9 /dev/nvme14n2 online ║ ║ ╠════════╬══════════════════╬═════════════╬════════════════════════╬═══════════════════╣ ║ r_ost2 ║ size: 14302 GiB ║ online ║ 0 /dev/nvme15n1 online ║ init_progress: 29 ║ ║ ║ level: 6 ║ initing ║ 1 /dev/nvme16n1 online ║ ║ ║ ║ strip_size: 128 ║ ║ 2 /dev/nvme17n1 online ║ ║ ║ ║ block_size: 4096 ║ ║ 3 /dev/nvme18n1 online ║ ║ ║ ║ sparepool: - ║ ║ 4 /dev/nvme20n1 online ║ ║ ║ ║ active: True ║ ║ 5 /dev/nvme21n1 online ║ ║ ║ ║ config: True ║ ║ 6 /dev/nvme22n1 online ║ ║ ║ ║ ║ ║ 7 /dev/nvme23n1 online ║ ║ ║ ║ ║ ║ 8 /dev/nvme24n1 online ║ ║ ║ ║ ║ ║ 9 /dev/nvme25n1 online ║ ║ ╠════════╬══════════════════╬═════════════╬════════════════════════╬═══════════════════╣ ║ r_ost3 ║ size: 14302 GiB ║ online ║ 0 /dev/nvme15n2 online ║ ║ ║ ║ level: 6 ║ initialized ║ 1 /dev/nvme16n2 online ║ ║ ║ ║ strip_size: 128 ║ ║ 2 /dev/nvme17n2 online ║ ║ ║ ║ block_size: 4096 ║ ║ 3 /dev/nvme18n2 online ║ ║ ║ ║ sparepool: - ║ ║ 4 /dev/nvme20n2 online ║ ║ ║ ║ active: True ║ ║ 5 /dev/nvme21n2 online ║ ║ ║ ║ config: True ║ ║ 6 /dev/nvme22n2 online ║ ║ ║ ║ ║ ║ 7 /dev/nvme23n2 online ║ ║ ║ ║ ║ ║ 8 /dev/nvme24n2 online ║ ║ ║ ║ ║ ║ 9 /dev/nvme25n2 online ║ ║ ╚════════╩══════════════════╩═════════════╩════════════════════════╩═══════════════════╝
Failover state cluster performance
Now all the Lustre filesystem servers are running on the surviving node. In this configuration, we expect the performance to be roughly halved, because all client communication now goes through a single server. Other bottlenecks in this situation are:
- Decreased NVMe performance: with only one server running, all the workload reaches each dual-ported NVMe drive through only 2 PCIe lanes (a quick check is sketched after this list);
- Lack of CPU;
- Lack of RAM.
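As a rough way to confirm the reduced per-port link width, the PCIe link status of one of the drives can be inspected from the surviving node; the PCI address below is hypothetical and should be taken from the actual lspci output (a Width x2 value would confirm the single-port path):
node27# lspci -s c1:00.0 -vv | grep -i lnksta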
Tests with directIO enabled
The following listing shows the test command and results for the directIO-enabled test with a transfer size of 1MB on the system with only one node running.
lclient01# /usr/lib64/openmpi/bin/mpirun --allow-run-as-root --hostfile ./hfile -np 128 --map-by node /usr/bin/ior -F -t 1M -b 8G -k -r -w -o /mnt.l/stripe4M/testfile --posix.odirect . . . access bw(MiB/s) IOPS Latency(s) block(KiB) xfer(KiB) open(s) wr/rd(s) close(s) total(s) iter ------ --------- ---- ---------- ---------- --------- -------- -------- -------- -------- ---- write 17185 17185 0.007389 8388608 1024.00 0.012074 61.02 2.86 61.02 0 read 45619 45620 0.002803 8388608 1024.00 0.003000 22.99 0.590771 22.99 0 Max Write: 17185.06 MiB/sec (18019.84 MB/sec) Max Read: 45619.10 MiB/sec (47835.10 MB/sec) Summary of all tests: Operation Max(MiB) Min(MiB) Mean(MiB) StdDev Max(OPs) Min(OPs) Mean(OPs) StdDev Mean(s) Stonewall(s) Stonewall(MiB) Test# #Tasks tPN reps fPP reord reordoff reordrand seed segcnt blksiz xsize aggs(MiB) API RefNum write 17185.06 17185.06 17185.06 0.00 17185.06 17185.06 17185.06 0.00 61.01671 NA NA 0 128 32 1 1 0 1 0 0 1 8589934592 1048576 1048576.0 POSIX 0 read 45619.10 45619.10 45619.10 0.00 45619.10 45619.10 45619.10 0.00 22.98546 NA NA 0 128 32 1 1 0 1 0 0 1 8589934592 1048576 1048576.0 POSIX 0
The following listing shows the test command and results for the directIO-enabled test with a transfer size of 128MB on the system with only one node running.
lclient01# /usr/lib64/openmpi/bin/mpirun --allow-run-as-root --hostfile ./hfile -np 128 --map-by node /usr/bin/ior -F -t 128M -b 8G -k -r -w -o /mnt.l/stripe4M/testfile --posix.odirect . . . access bw(MiB/s) IOPS Latency(s) block(KiB) xfer(KiB) open(s) wr/rd(s) close(s) total(s) iter ------ --------- ---- ---------- ---------- --------- -------- -------- -------- -------- ---- write 30129 235.39 0.524655 8388608 131072 0.798392 34.80 1.64 34.80 0 read 35731 279.15 0.455215 8388608 131072 0.002234 29.35 2.37 29.35 0 Max Write: 30129.26 MiB/sec (31592.82 MB/sec) Max Read: 35730.91 MiB/sec (37466.57 MB/sec) Summary of all tests: Operation Max(MiB) Min(MiB) Mean(MiB) StdDev Max(OPs) Min(OPs) Mean(OPs) StdDev Mean(s) Stonewall(s) Stonewall(MiB) Test# #Tasks tPN reps fPP reord reordoff reordrand seed segcnt blksiz xsize aggs(MiB) API RefNum write 30129.26 30129.26 30129.26 0.00 235.38 235.38 235.38 0.00 34.80258 NA NA 0 128 32 1 1 0 1 0 0 1 8589934592 134217728 1048576.0 POSIX 0 read 35730.91 35730.91 35730.91 0.00 279.15 279.15 279.15 0.00 29.34647 NA NA 0 128 32 1 1 0 1 0 0 1 8589934592 134217728 1048576.0 POSIX 0
Tests with directIO disabled
The following listing shows the test command and results for the buffered I/O (directIO disabled) test with a transfer size of 1MB on the system with only one node running.
lclient01# /usr/lib64/openmpi/bin/mpirun --allow-run-as-root --hostfile ./hfile -np 128 --map-by node /usr/bin/ior -F -t 1M -b 8G -k -r -w -o /mnt.l/stripe4M/testfile . . . access bw(MiB/s) IOPS Latency(s) block(KiB) xfer(KiB) open(s) wr/rd(s) close(s) total(s) iter ------ --------- ---- ---------- ---------- --------- -------- -------- -------- -------- ---- write 30967 31042 0.004072 8388608 1024.00 0.008509 33.78 7.55 33.86 0 read 38440 38441 0.003291 8388608 1024.00 0.282087 27.28 8.22 27.28 0 Max Write: 30966.96 MiB/sec (32471.21 MB/sec) Max Read: 38440.06 MiB/sec (40307.32 MB/sec) Summary of all tests: Operation Max(MiB) Min(MiB) Mean(MiB) StdDev Max(OPs) Min(OPs) Mean(OPs) StdDev Mean(s) Stonewall(s) Stonewall(MiB) Test# #Tasks tPN reps fPP reord reordoff reordrand seed segcnt blksiz xsize aggs(MiB) API RefNum write 30966.96 30966.96 30966.96 0.00 30966.96 30966.96 30966.96 0.00 33.86112 NA NA 0 128 32 1 1 0 1 0 0 1 8589934592 1048576 1048576.0 POSIX 0 read 38440.06 38440.06 38440.06 0.00 38440.06 38440.06 38440.06 0.00 27.27821 NA NA 0 128 32 1 1 0 1 0 0 1 8589934592 1048576 1048576.0 POSIX 0 Finished : Thu Sep 12 03:18:41 2024
The following listing shows the test command and results for the buffered I/O (directIO disabled) test with a transfer size of 128MB on the system with only one node running.
lclient01# /usr/lib64/openmpi/bin/mpirun --allow-run-as-root --hostfile ./hfile -np 128 --map-by node /usr/bin/ior -F -t 128M -b 8G -k -r -w -o /mnt.l/stripe4M/testfile . . . access bw(MiB/s) IOPS Latency(s) block(KiB) xfer(KiB) open(s) wr/rd(s) close(s) total(s) iter ------ --------- ---- ---------- ---------- --------- -------- -------- -------- -------- ---- write 30728 240.72 0.515679 8388608 131072 0.010178 34.03 8.70 34.12 0 read 35974 281.05 0.386365 8388608 131072 0.067996 29.15 10.73 29.15 0 Max Write: 30727.85 MiB/sec (32220.49 MB/sec) Max Read: 35974.24 MiB/sec (37721.72 MB/sec) Summary of all tests: Operation Max(MiB) Min(MiB) Mean(MiB) StdDev Max(OPs) Min(OPs) Mean(OPs) StdDev Mean(s) Stonewall(s) Stonewall(MiB) Test# #Tasks tPN reps fPP reord reordoff reordrand seed segcnt blksiz xsize aggs(MiB) API RefNum write 30727.85 30727.85 30727.85 0.00 240.06 240.06 240.06 0.00 34.12461 NA NA 0 128 32 1 1 0 1 0 0 1 8589934592 134217728 1048576.0 POSIX 0 read 35974.24 35974.24 35974.24 0.00 281.05 281.05 281.05 0.00 29.14797 NA NA 0 128 32 1 1 0 1 0 0 1 8589934592 134217728 1048576.0 POSIX 0
Failback
Meanwhile, node26 has booted after the crash. In our configuration, the cluster software does not start automatically on boot, so a cluster status check on node26 fails:
Error: error running crm_mon, is pacemaker running?
crm_mon: Connection to cluster failed: Connection refused
This can be useful in real life: before returning a node to the cluster, the administrator should identify, localize, and fix the problem to prevent it from recurring.
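If automatic startup were preferred instead, the cluster services could be configured to start on boot; this is shown only as an option and was intentionally not used in this setup:
node26# pcs cluster enable --all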
The cluster software keeps working properly on node27:
node27# pcs status Cluster name: lustrebox0 Cluster Summary: * Stack: corosync (Pacemaker is running) * Current DC: node27-ic (version 2.1.7-5.el8_10-0f7f88312) - partition with quorum * Last updated: Sat Aug 31 01:13:57 2024 on node27-ic * Last change: Thu Aug 29 01:26:09 2024 by root via root on node26-ic * 2 nodes configured * 12 resource instances configured Node List: * Online: [ node27-ic ] * OFFLINE: [ node26-ic ] Full List of Resources: * rr_mdt0 (ocf::xraid:raid): Started node27-ic * fsr_mdt0 (ocf::heartbeat:Filesystem): Started node27-ic * rr_ost0 (ocf::xraid:raid): Started node27-ic * fsr_ost0 (ocf::heartbeat:Filesystem): Started node27-ic * rr_ost1 (ocf::xraid:raid): Started node27-ic * fsr_ost1 (ocf::heartbeat:Filesystem): Started node27-ic * rr_ost2 (ocf::xraid:raid): Started node27-ic * fsr_ost2 (ocf::heartbeat:Filesystem): Started node27-ic * rr_ost3 (ocf::xraid:raid): Started node27-ic * fsr_ost3 (ocf::heartbeat:Filesystem): Started node27-ic * node27.stonith (stonith:fence_ipmilan): Stopped * node26.stonith (stonith:fence_ipmilan): Started node27-ic Daemon Status: corosync: active/disabled pacemaker: active/disabled pcsd: active/enabled
Since we know the reason for the node26 crash, we start the cluster software on it:
node26# pcs cluster start
Starting Cluster...
After some time, the cluster software starts, and the resources that should run on node26 are properly moved back from node27 to node26. The failback process took about 30 seconds.
node26# pcs status Cluster name: lustrebox0 Cluster Summary: * Stack: corosync (Pacemaker is running) * Current DC: node27-ic (version 2.1.7-5.el8_10-0f7f88312) - partition with quorum * Last updated: Sat Aug 31 01:15:03 2024 on node26-ic * Last change: Thu Aug 29 01:26:09 2024 by root via root on node26-ic * 2 nodes configured * 12 resource instances configured Node List: * Online: [ node26-ic node27-ic ] Full List of Resources: * rr_mdt0 (ocf::xraid:raid): Started node26-ic * fsr_mdt0 (ocf::heartbeat:Filesystem): Started node26-ic * rr_ost0 (ocf::xraid:raid): Started node26-ic * fsr_ost0 (ocf::heartbeat:Filesystem): Started node26-ic * rr_ost1 (ocf::xraid:raid): Started node27-ic * fsr_ost1 (ocf::heartbeat:Filesystem): Started node27-ic * rr_ost2 (ocf::xraid:raid): Started node26-ic * fsr_ost2 (ocf::heartbeat:Filesystem): Started node26-ic * rr_ost3 (ocf::xraid:raid): Started node27-ic * fsr_ost3 (ocf::heartbeat:Filesystem): Started node27-ic * node27.stonith (stonith:fence_ipmilan): Started node26-ic * node26.stonith (stonith:fence_ipmilan): Started node27-ic Daemon Status: corosync: active/disabled pacemaker: active/disabled pcsd: active/enabled
node27# pcs status Cluster name: lustrebox0 Cluster Summary: * Stack: corosync (Pacemaker is running) * Current DC: node27-ic (version 2.1.7-5.el8_10-0f7f88312) - partition with quorum * Last updated: Sat Aug 31 01:15:40 2024 on node27-ic * Last change: Thu Aug 29 01:26:09 2024 by root via root on node26-ic * 2 nodes configured * 12 resource instances configured Node List: * Online: [ node26-ic node27-ic ] Full List of Resources: * rr_mdt0 (ocf::xraid:raid): Started node26-ic * fsr_mdt0 (ocf::heartbeat:Filesystem): Started node26-ic * rr_ost0 (ocf::xraid:raid): Started node26-ic * fsr_ost0 (ocf::heartbeat:Filesystem): Started node26-ic * rr_ost1 (ocf::xraid:raid): Started node27-ic * fsr_ost1 (ocf::heartbeat:Filesystem): Started node27-ic * rr_ost2 (ocf::xraid:raid): Started node26-ic * fsr_ost2 (ocf::heartbeat:Filesystem): Started node26-ic * rr_ost3 (ocf::xraid:raid): Started node27-ic * fsr_ost3 (ocf::heartbeat:Filesystem): Started node27-ic * node27.stonith (stonith:fence_ipmilan): Started node26-ic * node26.stonith (stonith:fence_ipmilan): Started node27-ic Daemon Status: corosync: active/disabled pacemaker: active/disabled pcsd: active/enabled
Conclusion
This article shows that a small, highly available, and high-performance Lustre installation can be built on an SBB system with dual-ported NVMe drives and the xiRAID Classic 4.1 RAID engine. It also demonstrates how easily xiRAID Classic integrates with Pacemaker clusters and how well it fits the classical approach to Lustre clustering.
The configuration is straightforward and requires the following software components to be installed and properly configured:
- xiRAID Classic 4.1 and Csync2
- Lustre software
- Pacemaker software
The resulting system, based on the Viking VDS2249R SBB platform equipped with two single-CPU servers and 24 PCIe 4.0 NVMe drives, delivered up to 55 GB/s on writes and up to 86 GB/s on reads from the Lustre clients, as measured with the standard IOR parallel filesystem benchmark.
This article, with minimal changes, can also be used to set up additional systems to expand existing Lustre installations.