Building a Highly Available BeeGFS Solution with xiRAID Classic 4.1 on a Dual-Node System with Shared NVMe Drives


October 3, 2024


This document provides a comprehensive guide to building a highly available BeeGFS solution using xiRAID Classic 4.1 on a dual-node system with shared NVMe drives. The focus is on optimizing the deployment of a BeeGFS parallel file system with data placed on clustered RAIDs powered by xiRAID, ensuring both performance and reliability. By integrating xiRAID into a Pacemaker-based high-availability cluster, this setup is designed to meet the needs of users who require efficient data storage and access across multiple nodes. Throughout the guide, system configuration, network setup, and performance tuning are addressed to create an optimal and resilient file system environment.


System layout

xiRAID Classic 4.1 supports the integration of RAIDs into Pacemaker-based HA clusters. This allows users who need to cluster their services to benefit from xiRAID Classic's performance and reliability.

This article describes using an NVMe SBB system (a single box with two x86-64 servers and a set of shared NVMe drives) as a basic BeeGFS parallel filesystem HA-cluster with data placed on clustered RAIDs based on xiRAID Classic 4.1.

This article will familiarize you with how to deploy xiRAID Classic for a real-life task.

The configuration details are presented in the table below.

                           node #1              node #2
Hostname                   node-01              node-02
OS                         Rocky Linux 9.4      Rocky Linux 9.4
IPMI address               192.168.0.11         192.168.0.12
IPMI login                 admin                admin
IPMI password              admin                admin
Management address         172.16.133.193/24    172.16.133.214/24
Cluster Heartbeat address  10.10.0.11           10.10.0.12
BeeGFS Network address     100.100.100.11       100.100.100.12
Virtual IP                 100.100.100.10 (shared)

System configuration and tuning

Before software installation and configuration, we need to prepare the platform to provide optimal performance.

Performance tuning

It is recommended to disable C-states in the system BIOS and set the following performance profile on both nodes:

# tuned-adm profile accelerator-performance
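
To confirm that the profile is active on each node, you can check the currently selected profile:

# tuned-adm active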

Network configuration

In our cluster, we use a dedicated network as the cluster interconnect. This network is physically created as a single direct connection (a dedicated Ethernet cable without any switch) between the servers. The cluster interconnect is a critical component of any HA cluster, so it must be highly reliable.

A Pacemaker-based cluster can be configured with two cluster interconnect networks for improved reliability through redundancy. While in our configuration we will use a single network configuration, please consider using a dual network interconnect for your projects if needed.
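
For reference, a dual-link setup can be created by passing two addresses per node to pcs cluster setup. The command below is only a sketch of the syntax: the 10.11.0.x addresses are hypothetical, and in this guide the cluster is created later with a single link.

# pcs cluster setup beegfs-ha node-01-ic addr=10.10.0.11 addr=10.11.0.11 node-02-ic addr=10.10.0.12 addr=10.11.0.12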

Check that all IP addresses are resolvable from both hosts. In our case, we will use resolving via the hosts file, so we have the following content in /etc/hosts on both nodes:

127.0.0.1    localhost localhost.localdomain localhost4 localhost4.localdomain4
::1          localhost localhost.localdomain localhost6 localhost6.localdomain6
# Management
172.16.133.193 node-01
172.16.133.214 node-02
# IPMI
192.168.0.11   node-01-ipmi
192.168.0.12   node-02-ipmi
# Interconnect
10.10.0.11     node-01-ic
10.10.0.12     node-02-ic
# IB
100.100.100.11 node-01-ib
100.100.100.12 node-02-ib
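
A quick way to confirm that the names resolve correctly on both nodes:

# getent hosts node-01-ic node-02-ic node-01-ib node-02-ib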

Configure firewalld

The following ports need to be allowed.

Service              firewalld service name   Port          Protocol
Cluster Components   high-availability        2224          TCP
                                              3121          TCP
                                              5403          TCP
                                              5404-5405     UDP
                                              21064         TCP
                                              9929          TCP / UDP
Csync2               -                        30865         TCP
BeeGFS Management    -                        8008          TCP / UDP
BeeGFS Metadata      -                        8011-8012     TCP / UDP
BeeGFS Storage       -                        8021-8024     TCP / UDP
BeeGFS Client,       -                        8004          TCP / UDP
beegfs-ctl                                    33000-65000   UDP
BeeGFS Helper        -                        8006          TCP

Run the following commands on both nodes to allow the specified ports.

# firewall-cmd --permanent --add-service=high-availability
# firewall-cmd --permanent --add-port={{8004,8008,8011,8012,8021,8022,8023,8024}/{tcp,udp},{8006,30865}/tcp,33000-65000/udp}
# firewall-cmd --reload
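
You can verify the resulting configuration on each node:

# firewall-cmd --list-services
# firewall-cmd --list-ports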

NVMe drives setup

In our setup, we plan to create a simple BeeGFS installation with sufficient performance. However, since each NVMe in the SBB system is connected to each server with only 2 PCIe lanes, the NVMe drives' performance will be limited. To overcome this limitation, we will create 2 namespaces on each NVMe drive, which will be used for the BeeGFS RAIDs, and create separate RAIDs from the NVMe namespaces. By configuring our cluster software to use the RAIDs for BeeGFS storage made from the first namespaces on node #1 and the RAIDs created from the second namespaces on node #2, we will be able to utilize all four PCIe lanes for each NVMe used for BeeGFS Storage data.

Let’s take /dev/nvme4 as an example. All other drives will be split in absolutely the same way.

Check the total NVM capacity of the drive.

# nvme id-ctrl /dev/nvme4 | grep -i tnvmcap
tnvmcap : 1602022801408

Check the maximum number of namespaces supported by the drive.

# nvme id-ctrl /dev/nvme4 | grep ^nn
nn : 128

Check the controller ID used for the drive connection on each server (the IDs will differ).

node-01# nvme id-ctrl /dev/nvme4 | grep ^cntlid
cntlid : 0x23
node-02# nvme id-ctrl /dev/nvme4 | grep ^cntlid
cntlid : 0x24

We need to calculate the size of the namespaces we are going to create. The real size of the drive in 4K blocks is

1602022801408/4096=391118848

So, each namespace size in 4K blocks will be

391118848/2=195559424

In fact, it is not possible to create 2 namespaces of exactly this size because of the NVMe internal architecture. So, we will create slightly smaller namespaces of 195500000 blocks each.

If you are building a system for write-intensive tasks, we recommend using write-intensive drives with 3 DWPD endurance. If that is not possible and you have to use read-optimized drives, consider leaving some of the NVMe capacity (10-25%) unallocated by namespaces. In many cases, this brings the drive's write performance degradation behavior closer to that of write-intensive drives.
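
For example, if you chose to leave 20% of this drive unallocated, the per-namespace size in 4K blocks could be estimated as follows (a rough sketch; the result still has to be rounded to a value the drive accepts):

# echo $((391118848 * 80 / 100 / 2))
156447539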

As a first step, remove the existing namespace on one of the nodes.

node-01# nvme delete-ns /dev/nvme4 -n 1

After that, create namespaces on the same node.

node-01# nvme create-ns /dev/nvme4 --nsze=195500000 --ncap=195500000 --block-size=4096 --dps=0 -m 1
create-ns: Success, created nsid:1

node-01# nvme create-ns /dev/nvme4 --nsze=195500000 --ncap=195500000 --block-size=4096 --dps=0 -m 1
create-ns: Success, created nsid:2

node-01# nvme attach-ns /dev/nvme4 --namespace-id=1 --controllers=0x23
attach-ns: Success, nsid:1

node-01# nvme attach-ns /dev/nvme4 --namespace-id=2 --controllers=0x23
attach-ns: Success, nsid:2

Attach the namespaces on the second node with the proper controller.

node-02# nvme attach-ns /dev/nvme4 --namespace-id=1 --controllers=0x24
attach-ns: Success, nsid:1

node-02# nvme attach-ns /dev/nvme4 --namespace-id=2 --controllers=0x24
attach-ns: Success, nsid:2

It looks like this on both nodes.

# nvme list | grep nvme4
/dev/nvme4n1	SDM00000D7AC	HUSMR7616BDP301	0x1	800.77	GB / 800.77	GB	4 KiB +	0 B	KNGND100
/dev/nvme4n2	SDM00000D7AC	HUSMR7616BDP301	0x2	800.77	GB / 800.77	GB	4 KiB +	0 B	KNGND100

All other drives were split in the same way.
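
You can double-check the namespaces known to a controller on each node, for example:

# nvme list-ns /dev/nvme4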

Software components installation

xiRAID Classic 4.1 installation

Perform the following steps to install xiRAID Classic 4.1.0 on both nodes.

Install EPEL.

# yum install -y epel-release

Install the kernel development package for the currently running kernel.

# yum install kernel-devel-$(uname -r)

Install the xiraid-repo package for the current OS.

# yum install https://pkg.xinnor.io/repository/Repository/xiraid/el/9/kver-5.14/xiraid-repo-1.1.0-446.kver.5.14.noarch.rpm

Install xiraid-release.

# yum install xiraid-release

BeeGFS installation

Installation will be performed from the BeeGFS repositories. The supported Linux distributions and repository links can be found here: Release Notes v7.4.4 — BeeGFS Documentation 7.4.4

Download the BeeGFS repository file.

# wget -O /etc/yum.repos.d/beegfs_rhel9.repo https://www.beegfs.io/release/beegfs_7.4.4/dists/beegfs-rhel9.repo

Add the public BeeGFS GPG key to package manager.

# rpm --import https://www.beegfs.io/release/beegfs_7.4.4/gpg/GPG-KEY-beegfs

Install the packages from the repository.

# yum install beegfs-mgmtd beegfs-meta beegfs-storage libbeegfs-ib

beegfs-mgmtd - BeeGFS management service
beegfs-meta - BeeGFS metadata service
beegfs-storage - BeeGFS storage service
libbeegfs-ib - support for remote direct memory access (RDMA) based on the OFED ibverbs API

It is recommended to disable SELinux on all BeeGFS hosts.

# setenforce 0
# vi /etc/sysconfig/selinux
...
SELINUX=disabled
...

Pacemaker installation

Activate the High Availability and CRB repos.

# yum config-manager --set-enabled highavailability crb

Install pacemaker, pcs, and the resource and fence agents.

# yum install pcs pacemaker resource-agents fence-agents-all

Start and enable the pcsd daemon by issuing the following commands on each node.

# systemctl start pcsd.service
# systemctl enable pcsd.service

Csync2 installation

Csync2 is a utility for asynchronous file synchronization in clusters. It is used here to synchronize the xiRAID configuration files between the cluster nodes.

The Csync2 package is available in the Xinnor repository; install it on both nodes.

# yum install csync2

HA cluster setup

Csync2 configuration

Configure Csync2 to synchronize the RAID configuration files between the cluster nodes.

Create configuration file on the first node.

# vi /usr/local/etc/csync2.cfg

With the following content.

nossl * *;
group csxiha {
	host node-01;
	host node-02;
	key /usr/local/etc/csync2.key_ha;
	include /etc/xiraid/raids;
}

Generate a key for the first connection.

# csync2 -k /usr/local/etc/csync2.key_ha

Copy config file and generated key to the second node.

# scp /usr/local/etc/csync2.cfg /usr/local/etc/csync2.key_ha node-02:/usr/local/etc/

For scheduled Csync2 synchronization, run crontab -e on both nodes and add the following record.

* * * * * /usr/local/sbin/csync2 -x

Also, to trigger synchronization whenever the xiRAID configuration changes, run the following command to create a synchronization script (create this script on both nodes).

# vi /etc/xiraid/config_update_handler.sh

Fill the created script with the following content.

#!/usr/bin/bash
/usr/local/sbin/csync2 -xv

Save the file.

After that run the following command to set correct permissions for the script file.

# chmod +x /etc/xiraid/config_update_handler.sh

Pacemaker cluster creation

Set the same password for the hacluster user on both nodes.

# passwd hacluster

Authenticate the cluster nodes from one node by their interconnect interfaces.

# pcs host auth node-01-ic node-02-ic -u hacluster
Password:
node-01-ic: Authorized
node-02-ic: Authorized

Create the cluster (run the command on one node only).

# pcs cluster setup beegfs-ha node-01-ic node-02-ic
No addresses specified for host 'node-01-ic', using 'node-01-ic'
No addresses specified for host 'node-02-ic', using 'node-02-ic'
Destroying cluster on hosts: 'node-01-ic', 'node-02-ic'...
node-02-ic: Successfully destroyed cluster
node-01-ic: Successfully destroyed cluster
Requesting remove 'pcsd settings' from 'node-01-ic', 'node-02-ic'
node-01-ic: successful removal of the file 'pcsd settings'
node-02-ic: successful removal of the file 'pcsd settings'
Sending 'corosync authkey', 'pacemaker authkey' to 'node-01-ic', 'node-02-ic'
node-01-ic: successful distribution of the file 'corosync authkey'
node-01-ic: successful distribution of the file 'pacemaker authkey'
node-02-ic: successful distribution of the file 'corosync authkey'
node-02-ic: successful distribution of the file 'pacemaker authkey'
Sending 'corosync.conf' to 'node-01-ic', 'node-02-ic'
node-01-ic: successful distribution of the file 'corosync.conf'
node-02-ic: successful distribution of the file 'corosync.conf'
Cluster has been successfully set up.

Start the cluster.

# pcs cluster start --all

Check the current cluster status.

# pcs status
Cluster name: beegfs-ha
WARNINGS:
No stonith devices and stonith-enabled is not false
Cluster Summary:
  * Stack: corosync (Pacemaker is running)
  * Current DC: node-01-ic (version 2.1.7-5.2.el9_4-0f7f88312) - partition with quorum
  * Last updated: Mon Sep 30 10:40:48 2024 on node-01-ic
  * Last change:  Mon Sep 30 10:40:35 2024 by hacluster via hacluster on node-01-ic
  * 2 nodes configured
  * 0 resource instances configured
Node List:
  * Online: [ node-01-ic node-02-ic ]
Full List of Resources:
  * No resources
Daemon Status:
  corosync: active/disabled
  pacemaker: active/disabled
  pcsd: active/enabled

Fencing setup

The fencing (STONITH) design should be developed and implemented by the cluster administrator in consideration of the system's abilities and architecture. In this system, we will use fencing via IPMI. In any case, when designing and deploying your own cluster, choose the fencing configuration yourself, considering all the possibilities, limitations, and risks.

You can view the descriptions of the IPMI fencing agent options by running the following command.

# pcs stonith describe fence_ipmilan

Add the fencing resources.

# pcs stonith create node-01.stonith fence_ipmilan ip="192.168.0.11" auth=password password="admin" username="admin" method="onoff" lanplus=true pcmk_host_list="node-01-ic" pcmk_host_check=static-list op monitor interval=10s
# pcs stonith create node-02.stonith fence_ipmilan ip="192.168.0.12" auth=password password="admin" username="admin" method="onoff" lanplus=true pcmk_host_list="node-02-ic" pcmk_host_check=static-list op monitor interval=10s

Prevent each STONITH resource from starting on the node it is meant to fence.

# pcs constraint location node-01.stonith avoids node-01-ic=INFINITY
# pcs constraint location node-02.stonith avoids node-02-ic=INFINITY
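
Before relying on the cluster in production, it is worth testing fencing manually. The following command will power-cycle node-02, so run it only while the system carries no workload:

# pcs stonith fence node-02-ic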

xiRAID Configuration for cluster setup

Since we plan to deploy a small BeeGFS installation, combining Management and Meta services on the same target device is absolutely OK. But for medium or large BeeGFS installations, it's better to use a separate target (and RAID) for Management service.

Also, it is important to consider the distribution of drives across NUMA nodes and avoid including drives located on different NUMA nodes in one RAID.

The topology of the system can be viewed with the following command.

# lstopo
Machine (249GB total)
  Package L#0
    NUMANode L#0 (P#0 124GB)
...
      PCIBridge
        PCI 17:00.0 (NVMExp)
          Block(Disk) "nvme14c14n1"
          Block(Disk) "nvme14c14n2"
      PCIBridge
        PCI 1a:00.0 (NVMExp)
          Block(Disk) "nvme15c15n2"
          Block(Disk) "nvme15c15n1"
      PCIBridge
        PCI 1d:00.0 (NVMExp)
          Block(Disk) "nvme16c16n1"
          Block(Disk) "nvme16c16n2"
      PCIBridge
        PCI 20:00.0 (NVMExp)
          Block(Disk) "nvme17c17n2"
          Block(Disk) "nvme17c17n1"
      PCIBridge
        PCI 23:00.0 (NVMExp)
          Block(Disk) "nvme10c10n1"
          Block(Disk) "nvme10c10n2"
      PCIBridge
        PCI 26:00.0 (NVMExp)
          Block(Disk) "nvme11c11n2"
          Block(Disk) "nvme11c11n1"
      PCIBridge
        PCI 29:00.0 (NVMExp)
          Block(Disk) "nvme12c12n1"
          Block(Disk) "nvme12c12n2"
      PCIBridge
        PCI 2c:00.0 (NVMExp)
          Block(Disk) "nvme13c13n2"
          Block(Disk) "nvme13c13n1"
      PCIBridge
        PCI 2f:00.0 (NVMExp)
          Block(Disk) "nvme18c18n1"
          Block(Disk) "nvme18c18n2"
      PCIBridge
        PCI 32:00.0 (NVMExp)
          Block(Disk) "nvme19c19n1"
          Block(Disk) "nvme19c19n2"
...
  Package L#1
    NUMANode L#1 (P#1 125GB)
...
      PCIBridge
        PCI 85:00.0 (NVMExp)
          Block(Disk) "nvme8c8n2"
          Block(Disk) "nvme8c8n1"
      PCIBridge
        PCI 88:00.0 (NVMExp)
          Block(Disk) "nvme9c9n2"
          Block(Disk) "nvme9c9n1"
      PCIBridge
        PCI 91:00.0 (NVMExp)
          Block(Disk) "nvme4c4n2"
          Block(Disk) "nvme4c4n1"
      PCIBridge
        PCI 94:00.0 (NVMExp)
          Block(Disk) "nvme5c5n2"
          Block(Disk) "nvme5c5n1"
      PCIBridge
        PCI 97:00.0 (NVMExp)
          Block(Disk) "nvme6c6n2"
          Block(Disk) "nvme6c6n1"
      PCIBridge
        PCI 9a:00.0 (NVMExp)
          Block(Disk) "nvme7c7n2"
          Block(Disk) "nvme7c7n1"
      PCIBridge
        PCI 9d:00.0 (NVMExp)
          Block(Disk) "nvme0c0n1"
          Block(Disk) "nvme0c0n2"
      PCIBridge
        PCI a0:00.0 (NVMExp)
          Block(Disk) "nvme1c1n1"
          Block(Disk) "nvme1c1n2"
      PCIBridge
        PCI a3:00.0 (NVMExp)
          Block(Disk) "nvme2c2n1"
          Block(Disk) "nvme2c2n2"
      PCIBridge
        PCI a6:00.0 (NVMExp)
          Block(Disk) "nvme3c3n2"
          Block(Disk) "nvme3c3n1"
...

Thus, the distribution of drives between NUMA nodes looks like this.

NUMA node 1          NUMA node 0
/dev/nvme0c0n1       /dev/nvme10c10n1
/dev/nvme0c0n2       /dev/nvme10c10n2
/dev/nvme1c1n1       /dev/nvme11c11n1
/dev/nvme1c1n2       /dev/nvme11c11n2
/dev/nvme2c2n1       /dev/nvme12c12n1
/dev/nvme2c2n2       /dev/nvme12c12n2
/dev/nvme3c3n1       /dev/nvme13c13n1
/dev/nvme3c3n2       /dev/nvme13c13n2
/dev/nvme4c4n1       /dev/nvme14c14n1
/dev/nvme4c4n2       /dev/nvme14c14n2
/dev/nvme5c5n1       /dev/nvme15c15n1
/dev/nvme5c5n2       /dev/nvme15c15n2
/dev/nvme6c6n1       /dev/nvme16c16n1
/dev/nvme6c6n2       /dev/nvme16c16n2
/dev/nvme7c7n1       /dev/nvme17c17n1
/dev/nvme7c7n2       /dev/nvme17c17n2
/dev/nvme8c8n1       /dev/nvme18c18n1
/dev/nvme8c8n2       /dev/nvme18c18n2
/dev/nvme9c9n1       /dev/nvme19c19n1
/dev/nvme9c9n2       /dev/nvme19c19n2

Here is the list of RAID arrays we need to create, taking into account the drives' affiliation to NUMA nodes.

RAID name: rd_mt01   Level: 1   Devices: 2   Strip size: 16    NUMA node: 0
  Drives: /dev/nvme10n1 /dev/nvme19n2
  BeeGFS service: Management, Meta

RAID name: rd_mt02   Level: 1   Devices: 2   Strip size: 16    NUMA node: 1
  Drives: /dev/nvme0n1 /dev/nvme9n2
  BeeGFS service: Meta

RAID name: rd_st01   Level: 5   Devices: 9   Strip size: 128   NUMA node: 0
  Drives: /dev/nvme11n1 /dev/nvme12n1 /dev/nvme13n1 /dev/nvme14n1 /dev/nvme15n1 /dev/nvme16n1 /dev/nvme17n1 /dev/nvme18n1 /dev/nvme19n1
  BeeGFS service: Storage

RAID name: rd_st02   Level: 5   Devices: 9   Strip size: 128   NUMA node: 0
  Drives: /dev/nvme10n2 /dev/nvme11n2 /dev/nvme12n2 /dev/nvme13n2 /dev/nvme14n2 /dev/nvme15n2 /dev/nvme16n2 /dev/nvme17n2 /dev/nvme18n2
  BeeGFS service: Storage

RAID name: rd_st11   Level: 5   Devices: 9   Strip size: 128   NUMA node: 1
  Drives: /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1 /dev/nvme4n1 /dev/nvme5n1 /dev/nvme6n1 /dev/nvme7n1 /dev/nvme8n1 /dev/nvme9n1
  BeeGFS service: Storage

RAID name: rd_st12   Level: 5   Devices: 9   Strip size: 128   NUMA node: 1
  Drives: /dev/nvme0n2 /dev/nvme1n2 /dev/nvme2n2 /dev/nvme3n2 /dev/nvme4n2 /dev/nvme5n2 /dev/nvme6n2 /dev/nvme7n2 /dev/nvme8n2
  BeeGFS service: Storage

Create all the RAIDs on the first node.

# xicli raid create -n rd_mt01 -l 1 -d /dev/nvme10n1 /dev/nvme19n2
# xicli raid create -n rd_mt02 -l 1 -d /dev/nvme0n1 /dev/nvme9n2
# xicli raid create -n rd_st01 -l 5 -ss 128 -d /dev/nvme11n1 /dev/nvme12n1 /dev/nvme13n1 /dev/nvme14n1 /dev/nvme15n1 /dev/nvme16n1 /dev/nvme17n1 /dev/nvme18n1 /dev/nvme19n1
# xicli raid create -n rd_st02 -l 5 -ss 128 -d /dev/nvme10n2 /dev/nvme11n2 /dev/nvme12n2 /dev/nvme13n2 /dev/nvme14n2 /dev/nvme15n2 /dev/nvme16n2 /dev/nvme17n2 /dev/nvme18n2
# xicli raid create -n rd_st11 -l 5 -ss 128 -d /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1 /dev/nvme4n1 /dev/nvme5n1 /dev/nvme6n1 /dev/nvme7n1 /dev/nvme8n1 /dev/nvme9n1
# xicli raid create -n rd_st12 -l 5 -ss 128 -d /dev/nvme0n2 /dev/nvme1n2 /dev/nvme2n2 /dev/nvme3n2 /dev/nvme4n2 /dev/nvme5n2 /dev/nvme6n2 /dev/nvme7n2 /dev/nvme8n2

At this stage, there is no need to wait for the RAID initialization to finish - it can be safely left to run in the background.

Check the RAID statuses on the first node.

# xicli raid show
╔RAIDs════╦══════════════════╦═════════╦═════════════════════════╦═══════════════════╗
║ name    ║ static           ║ state   ║ devices                 ║ info              ║
╠═════════╬══════════════════╬═════════╬═════════════════════════╬═══════════════════╣
║ rd_mt01 ║ size: 800 GiB    ║ online  ║ 0 /dev/nvme10n1 online  ║ init_progress: 13 ║
║         ║ level: 1         ║ initing ║ 1 /dev/nvme19n2 online  ║                   ║
║         ║ strip_size: 16   ║         ║                         ║                   ║
║         ║ block_size: 4096 ║         ║                         ║                   ║
║         ║ sparepool: -     ║         ║                         ║                   ║
║         ║ active: True     ║         ║                         ║                   ║
║         ║ config: True     ║         ║                         ║                   ║
╠═════════╬══════════════════╬═════════╬═════════════════════════╬═══════════════════╣
║ rd_mt02 ║ size: 800 GiB    ║ online  ║ 0 /dev/nvme0n1 online   ║ init_progress: 11 ║
║         ║ level: 1         ║ initing ║ 1 /dev/nvme9n2 online   ║                   ║
║         ║ strip_size: 16   ║         ║                         ║                   ║
║         ║ block_size: 4096 ║         ║                         ║                   ║
║         ║ sparepool: -     ║         ║                         ║                   ║
║         ║ active: True     ║         ║                         ║                   ║
║         ║ config: True     ║         ║                         ║                   ║
╠═════════╬══════════════════╬═════════╬═════════════════════════╬═══════════════════╣
║ rd_st01 ║ size: 6406 GiB   ║ online  ║ 0 /dev/nvme11n1 online  ║ init_progress: 46 ║
║         ║ level: 5         ║ initing ║ 1 /dev/nvme12n1 online  ║                   ║
║         ║ strip_size: 128  ║         ║ 2 /dev/nvme13n1 online  ║                   ║
║         ║ block_size: 4096 ║         ║ 3 /dev/nvme14n1 online  ║                   ║
║         ║ sparepool: -     ║         ║ 4 /dev/nvme15n1 online  ║                   ║
║         ║ active: True     ║         ║ 5 /dev/nvme16n1 online  ║                   ║
║         ║ config: True     ║         ║ 6 /dev/nvme17n1 online  ║                   ║
║         ║                  ║         ║ 7 /dev/nvme18n1 online  ║                   ║
║         ║                  ║         ║ 8 /dev/nvme19n1 online  ║                   ║
╠═════════╬══════════════════╬═════════╬═════════════════════════╬═══════════════════╣
║ rd_st02 ║ size: 6406 GiB   ║ online  ║ 0 /dev/nvme10n2 online  ║ init_progress: 15 ║
║         ║ level: 5         ║ initing ║ 1 /dev/nvme11n2 online  ║                   ║
║         ║ strip_size: 128  ║         ║ 2 /dev/nvme12n2 online  ║                   ║
║         ║ block_size: 4096 ║         ║ 3 /dev/nvme13n2 online  ║                   ║
║         ║ sparepool: -     ║         ║ 4 /dev/nvme14n2 online  ║                   ║
║         ║ active: True     ║         ║ 5 /dev/nvme15n2 online  ║                   ║
║         ║ config: True     ║         ║ 6 /dev/nvme16n2 online  ║                   ║
║         ║                  ║         ║ 7 /dev/nvme17n2 online  ║                   ║
║         ║                  ║         ║ 8 /dev/nvme18n2 online  ║                   ║
╠═════════╬══════════════════╬═════════╬═════════════════════════╬═══════════════════╣
║ rd_st11 ║ size: 6406 GiB   ║ online  ║ 0 /dev/nvme1n1 online   ║ init_progress: 11 ║
║         ║ level: 5         ║ initing ║ 1 /dev/nvme2n1 online   ║                   ║
║         ║ strip_size: 128  ║         ║ 2 /dev/nvme3n1 online   ║                   ║
║         ║ block_size: 4096 ║         ║ 3 /dev/nvme4n1 online   ║                   ║
║         ║ sparepool: -     ║         ║ 4 /dev/nvme5n1 online   ║                   ║
║         ║ active: True     ║         ║ 5 /dev/nvme6n1 online   ║                   ║
║         ║ config: True     ║         ║ 6 /dev/nvme7n1 online   ║                   ║
║         ║                  ║         ║ 7 /dev/nvme8n1 online   ║                   ║
║         ║                  ║         ║ 8 /dev/nvme9n1 online   ║                   ║
╠═════════╬══════════════════╬═════════╬═════════════════════════╬═══════════════════╣
║ rd_st12 ║ size: 6406 GiB   ║ online  ║ 0 /dev/nvme0n2 online   ║ init_progress: 4  ║
║         ║ level: 5         ║ initing ║ 1 /dev/nvme1n2 online   ║                   ║
║         ║ strip_size: 128  ║         ║ 2 /dev/nvme2n2 online   ║                   ║
║         ║ block_size: 4096 ║         ║ 3 /dev/nvme3n2 online   ║                   ║
║         ║ sparepool: -     ║         ║ 4 /dev/nvme4n2 online   ║                   ║
║         ║ active: True     ║         ║ 5 /dev/nvme5n2 online   ║                   ║
║         ║ config: True     ║         ║ 6 /dev/nvme6n2 online   ║                   ║
║         ║                  ║         ║ 7 /dev/nvme7n2 online   ║                   ║
║         ║                  ║         ║ 8 /dev/nvme8n2 online   ║                   ║
╚═════════╩══════════════════╩═════════╩═════════════════════════╩═══════════════════╝

Check that the RAID configs were successfully replicated to the second node (please note that on the second node the RAID state is None, which is expected in this case).

# xicli raid show
╔RAIDs════╦══════════════════╦═══════╦═════════╦══════╗
║ name    ║ static           ║ state ║ devices ║ info ║
╠═════════╬══════════════════╬═══════╬═════════╬══════╣
║ rd_mt01 ║ size: 800 GiB    ║ None  ║         ║      ║
║         ║ level: 1         ║       ║         ║      ║
║         ║ strip_size: 16   ║       ║         ║      ║
║         ║ block_size: 4096 ║       ║         ║      ║
║         ║ sparepool: -     ║       ║         ║      ║
║         ║ active: False    ║       ║         ║      ║
║         ║ config: True     ║       ║         ║      ║
╠═════════╬══════════════════╬═══════╬═════════╬══════╣
║ rd_mt02 ║ size: 800 GiB    ║ None  ║         ║      ║
║         ║ level: 1         ║       ║         ║      ║
║         ║ strip_size: 16   ║       ║         ║      ║
║         ║ block_size: 4096 ║       ║         ║      ║
║         ║ sparepool: -     ║       ║         ║      ║
║         ║ active: False    ║       ║         ║      ║
║         ║ config: True     ║       ║         ║      ║
╠═════════╬══════════════════╬═══════╬═════════╬══════╣
║ rd_st01 ║ size: 6406 GiB   ║ None  ║         ║      ║
║         ║ level: 5         ║       ║         ║      ║
║         ║ strip_size: 128  ║       ║         ║      ║
║         ║ block_size: 4096 ║       ║         ║      ║
║         ║ sparepool: -     ║       ║         ║      ║
║         ║ active: False    ║       ║         ║      ║
║         ║ config: True     ║       ║         ║      ║
╠═════════╬══════════════════╬═══════╬═════════╬══════╣
║ rd_st02 ║ size: 6406 GiB   ║ None  ║         ║      ║
║         ║ level: 5         ║       ║         ║      ║
║         ║ strip_size: 128  ║       ║         ║      ║
║         ║ block_size: 4096 ║       ║         ║      ║
║         ║ sparepool: -     ║       ║         ║      ║
║         ║ active: False    ║       ║         ║      ║
║         ║ config: True     ║       ║         ║      ║
╠═════════╬══════════════════╬═══════╬═════════╬══════╣
║ rd_st11 ║ size: 6406 GiB   ║ None  ║         ║      ║
║         ║ level: 5         ║       ║         ║      ║
║         ║ strip_size: 128  ║       ║         ║      ║
║         ║ block_size: 4096 ║       ║         ║      ║
║         ║ sparepool: -     ║       ║         ║      ║
║         ║ active: False    ║       ║         ║      ║
║         ║ config: True     ║       ║         ║      ║
╠═════════╬══════════════════╬═══════╬═════════╬══════╣
║ rd_st12 ║ size: 6406 GiB   ║ None  ║         ║      ║
║         ║ level: 5         ║       ║         ║      ║
║         ║ strip_size: 128  ║       ║         ║      ║
║         ║ block_size: 4096 ║       ║         ║      ║
║         ║ sparepool: -     ║       ║         ║      ║
║         ║ active: False    ║       ║         ║      ║
║         ║ config: True     ║       ║         ║      ║
╚═════════╩══════════════════╩═══════╩═════════╩══════╝

For optimal performance, it's better to dedicate a specific, disjoint set of CPU cores to each RAID.

Currently, all RAIDs are active on node-01, so the core sets overlap; once the RAIDs are spread between node-01 and node-02, the sets will no longer overlap.

# xicli raid modify -n rd_mt01 -ca 0-7 -se 1
# xicli raid modify -n rd_mt02 -ca 0-7 -se 1      # will be running at node-02
# xicli raid modify -n rd_st01 -ca 8-67 -se 1
# xicli raid modify -n rd_st02 -ca 8-67 -se 1     # will be running at node-02
# xicli raid modify -n rd_st11 -ca 68-127 -se 1
# xicli raid modify -n rd_st12 -ca 68-127 -se 1   # will be running at node-02
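
The core ranges above assume a system with 128 logical CPUs split evenly between the two NUMA nodes; you can check which cores belong to each NUMA node on your platform with:

# lscpu | grep "NUMA node"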

RAID-optimized File System creation

Metadata RAID arrays

When formatting a RAID device for the Metadata service, it is important to include options that minimize access times for large directories (-Odir_index), create large inodes that allow storing BeeGFS metadata as extended attributes directly inside the inodes for maximum performance (-I 512), reserve a sufficient number of inodes (-i 2048), and use a large journal (-J size=400).

# mkfs.ext4 -i 2048 -I 512 -J size=400 -Odir_index,filetype /dev/xi_rd_mt01
# mkfs.ext4 -i 2048 -I 512 -J size=400 -Odir_index,filetype /dev/xi_rd_mt02

Create mount points on both nodes.

# mkdir -p /mnt/bgfscluster/rd_mt01
# mkdir -p /mnt/bgfscluster/rd_mt02

Storage RAID arrays

When formatting a RAID device for the Storage service, you should specify the RAID geometry settings. This enables the file system to optimize its read and write accesses for RAID alignment, e.g. by committing data as complete stripe sets for maximum throughput. While BeeGFS uses dedicated metadata services to manage global metadata, the metadata performance of the underlying file system on storage servers still matters for operations like file creates, deletes, and small reads/writes. Recent versions of XFS allow inlining of data into inodes to avoid the need for additional blocks. In order to use this efficiently, the inode size should be increased to 512 bytes or larger.

# mkfs.xfs -d su=128k,sw=8 -l version=2,su=128k -isize=512 /dev/xi_rd_st01
# mkfs.xfs -d su=128k,sw=8 -l version=2,su=128k -isize=512 /dev/xi_rd_st02
# mkfs.xfs -d su=128k,sw=8 -l version=2,su=128k -isize=512 /dev/xi_rd_st11
# mkfs.xfs -d su=128k,sw=8 -l version=2,su=128k -isize=512 /dev/xi_rd_st12
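
Optionally, you can verify the resulting stripe geometry (sunit/swidth) reported by XFS; depending on the xfsprogs version, xfs_info accepts an unmounted device or a mount point:

# xfs_info /dev/xi_rd_st01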

Create mount points on both nodes.

# mkdir -p /mnt/bgfscluster/rd_st01
# mkdir -p /mnt/bgfscluster/rd_st02
# mkdir -p /mnt/bgfscluster/rd_st11
# mkdir -p /mnt/bgfscluster/rd_st12

BeeGFS setup

BeeGFS will be configured in multi mode - a way to run multiple instances of one type of service on one server. For multi mode, separate configuration files need to be created for the service instances, using different network ports and different storage directories.

Temporarily mount the RAID arrays on the node where they are active to perform the BeeGFS services configuration. Later, the mount will be configured using cluster resources.

# mount /dev/xi_rd_mt01 /mnt/bgfscluster/rd_mt01
# mount /dev/xi_rd_mt02 /mnt/bgfscluster/rd_mt02
# mount /dev/xi_rd_st01 /mnt/bgfscluster/rd_st01
# mount /dev/xi_rd_st02 /mnt/bgfscluster/rd_st02
# mount /dev/xi_rd_st11 /mnt/bgfscluster/rd_st11
# mount /dev/xi_rd_st12 /mnt/bgfscluster/rd_st12

Create directories for BeeGFS services.

# mkdir -p /mnt/bgfscluster/rd_mt01/mgm
# mkdir -p /mnt/bgfscluster/rd_mt01/mt01
# mkdir -p /mnt/bgfscluster/rd_mt02/mt02
# mkdir -p /mnt/bgfscluster/rd_st01/st01
# mkdir -p /mnt/bgfscluster/rd_st02/st02
# mkdir -p /mnt/bgfscluster/rd_st11/st11
# mkdir -p /mnt/bgfscluster/rd_st12/st12

Temporarily add the virtual IP to perform the BeeGFS service configuration. Later, it will also be managed by a cluster resource.

# ifconfig ens20:ip-mgm 100.100.100.10

Configure BeeGFS management service.

# /opt/beegfs/sbin/beegfs-setup-mgmtd -p /mnt/bgfscluster/rd_mt01/mgm -S mgm-srv

Here mgm-srv is used as the nodeID. By default, the nodeID is derived from the hostname, but since the management service can move between nodes, it must be set to a value that does not depend on the hostname.

Create directories for the instance config files.

# mkdir /etc/beegfs/mt01.d
# mkdir /etc/beegfs/mt02.d
# mkdir /etc/beegfs/st01.d
# mkdir /etc/beegfs/st02.d
# mkdir /etc/beegfs/st11.d
# mkdir /etc/beegfs/st12.d

Copy default config files.

# cp /etc/beegfs/beegfs-meta.conf /etc/beegfs/mt01.d/beegfs-meta.conf
# cp /etc/beegfs/beegfs-meta.conf /etc/beegfs/mt02.d/beegfs-meta.conf
# cp /etc/beegfs/beegfs-storage.conf /etc/beegfs/st01.d/beegfs-storage.conf
# cp /etc/beegfs/beegfs-storage.conf /etc/beegfs/st02.d/beegfs-storage.conf
# cp /etc/beegfs/beegfs-storage.conf /etc/beegfs/st11.d/beegfs-storage.conf
# cp /etc/beegfs/beegfs-storage.conf /etc/beegfs/st12.d/beegfs-storage.conf

Configure BeeGFS meta service instances.

# /opt/beegfs/sbin/beegfs-setup-meta -c /etc/beegfs/mt01.d/beegfs-meta.conf -p /mnt/bgfscluster/rd_mt01/mt01 -s 01 -S sr_meta_01 -m 100.100.100.10
# /opt/beegfs/sbin/beegfs-setup-meta -c /etc/beegfs/mt02.d/beegfs-meta.conf -p /mnt/bgfscluster/rd_mt02/mt02 -s 02 -S sr_meta_02 -m 100.100.100.10

Here, the virtual IP 100.100.100.10 is set as the BeeGFS management host.

Configure BeeGFS storage service instances.

# /opt/beegfs/sbin/beegfs-setup-storage -c /etc/beegfs/st01.d/beegfs-storage.conf -p /mnt/bgfscluster/rd_st01/st01 -s 101 -S sr_stor_01 -i 201 -m 100.100.100.10
# /opt/beegfs/sbin/beegfs-setup-storage -c /etc/beegfs/st02.d/beegfs-storage.conf -p /mnt/bgfscluster/rd_st02/st02 -s 102 -S sr_stor_02 -i 202 -m 100.100.100.10
# /opt/beegfs/sbin/beegfs-setup-storage -c /etc/beegfs/st11.d/beegfs-storage.conf -p /mnt/bgfscluster/rd_st11/st11 -s 111 -S sr_stor_11 -i 211 -m 100.100.100.10
# /opt/beegfs/sbin/beegfs-setup-storage -c /etc/beegfs/st12.d/beegfs-storage.conf -p /mnt/bgfscluster/rd_st12/st12 -s 112 -S sr_stor_12 -i 212 -m 100.100.100.10

Next, configure connection authentication with a connAuthFile. Create a file which contains a shared secret.

# dd if=/dev/random of=/etc/beegfs/connauthfile bs=128 count=1

Ensure the file is only readable by the root user.

# chown root:root /etc/beegfs/connauthfile
# chmod 400 /etc/beegfs/connauthfile

Create the file /etc/beegfs/connInterfacesFile1.conf to specify the management service interface.

# vi /etc/beegfs/connInterfacesFile1.conf

And specify Virtual IP interface.

ens20:ip-mgm

Create the file /etc/beegfs/connInterfacesFile2.conf to specify the service instances' interface.

# vi /etc/beegfs/connInterfacesFile2.conf

And specify IP interface.

ens20

Edit the BeeGFS Management configuration file, specifying the authentication file (connAuthFile=/etc/beegfs/connauthfile) and the Management service interface file (connInterfacesFile=/etc/beegfs/connInterfacesFile1.conf) with absolute paths.

# vi /etc/beegfs/beegfs-mgmtd.conf
...
connAuthFile        = /etc/beegfs/connauthfile
...
connInterfacesFile  = /etc/beegfs/connInterfacesFile1.conf
...

Edit the BeeGFS service instance configuration files, specifying the authentication and interface files, and the ports defined for each service instance.

Service instance configuration file        Defined port
/etc/beegfs/mt01.d/beegfs-meta.conf        8011
/etc/beegfs/mt02.d/beegfs-meta.conf        8012
/etc/beegfs/st01.d/beegfs-storage.conf     8021
/etc/beegfs/st02.d/beegfs-storage.conf     8022
/etc/beegfs/st11.d/beegfs-storage.conf     8023
/etc/beegfs/st12.d/beegfs-storage.conf     8024

# vi /etc/beegfs/mt01.d/beegfs-meta.conf
...
connAuthFile                 = /etc/beegfs/connauthfile
...
connInterfacesFile           = /etc/beegfs/connInterfacesFile2.conf
...
connMetaPortTCP              = 8011
connMetaPortUDP              = 8011

Repeat the same for the other instances, using the corresponding defined ports (for the storage instances, the parameters are connStoragePortTCP and connStoragePortUDP).

# vi /etc/beegfs/mt02.d/beegfs-meta.conf
# vi /etc/beegfs/st01.d/beegfs-storage.conf
# vi /etc/beegfs/st02.d/beegfs-storage.conf
# vi /etc/beegfs/st11.d/beegfs-storage.conf
# vi /etc/beegfs/st12.d/beegfs-storage.conf

Copy the BeeGFS configuration files and the shared secret to the second host in the cluster.

# rsync -av /etc/beegfs/ node-02:/etc/beegfs/

Cluster resources setup

To start the cluster resources in the correct order and to keep each resource on the same node as its corresponding RAID array, the resources will be organized into groups.

Virtual IP resource

The virtual IP will be used by BeeGFS services and clients to reach the management service, allowing the service to move between cluster nodes.

Remove the temporarily created virtual IP.

# ifconfig ens20:ip-mgm down

Create the virtual IP resource for the management service, assigning it to the gr_beegfs_mt01 group.

# pcs resource create ip_mgm ocf:heartbeat:IPaddr2 ip=100.100.100.10 nic=ens20 iflabel=ip-mgm cidr_netmask=24 op monitor interval=30s --group=gr_beegfs_mt01

xiRAID resources

To create Pacemaker resources for xiRAID Classic RAIDs, we will use the xiRAID resource agent, which was installed with xiRAID Classic and made available to Pacemaker in one of the previous steps.

Unmount and unload all the RAIDs at the node where they are active.

# umount /mnt/bgfscluster/*
# xicli raid unload --all

Make sure that all RAIDs are unloaded.

# xicli raid show
╔RAIDs════╦══════════════════╦═══════╦═════════╦══════╗
║ name    ║ static           ║ state ║ devices ║ info ║
╠═════════╬══════════════════╬═══════╬═════════╬══════╣
║ rd_mt01 ║ size: 800 GiB    ║ None  ║         ║      ║
║         ║ level: 1         ║       ║         ║      ║
║         ║ strip_size: 16   ║       ║         ║      ║
║         ║ block_size: 4096 ║       ║         ║      ║
║         ║ sparepool: -     ║       ║         ║      ║
║         ║ active: False    ║       ║         ║      ║
║         ║ config: True     ║       ║         ║      ║
╠═════════╬══════════════════╬═══════╬═════════╬══════╣
║ rd_mt02 ║ size: 800 GiB    ║ None  ║         ║      ║
║         ║ level: 1         ║       ║         ║      ║
║         ║ strip_size: 16   ║       ║         ║      ║
║         ║ block_size: 4096 ║       ║         ║      ║
║         ║ sparepool: -     ║       ║         ║      ║
║         ║ active: False    ║       ║         ║      ║
║         ║ config: True     ║       ║         ║      ║
╠═════════╬══════════════════╬═══════╬═════════╬══════╣
║ rd_st01 ║ size: 6406 GiB   ║ None  ║         ║      ║
║         ║ level: 5         ║       ║         ║      ║
║         ║ strip_size: 128  ║       ║         ║      ║
║         ║ block_size: 4096 ║       ║         ║      ║
║         ║ sparepool: -     ║       ║         ║      ║
║         ║ active: False    ║       ║         ║      ║
║         ║ config: True     ║       ║         ║      ║
╠═════════╬══════════════════╬═══════╬═════════╬══════╣
║ rd_st02 ║ size: 6406 GiB   ║ None  ║         ║      ║
║         ║ level: 5         ║       ║         ║      ║
║         ║ strip_size: 128  ║       ║         ║      ║
║         ║ block_size: 4096 ║       ║         ║      ║
║         ║ sparepool: -     ║       ║         ║      ║
║         ║ active: False    ║       ║         ║      ║
║         ║ config: True     ║       ║         ║      ║
╠═════════╬══════════════════╬═══════╬═════════╬══════╣
║ rd_st11 ║ size: 6406 GiB   ║ None  ║         ║      ║
║         ║ level: 5         ║       ║         ║      ║
║         ║ strip_size: 128  ║       ║         ║      ║
║         ║ block_size: 4096 ║       ║         ║      ║
║         ║ sparepool: -     ║       ║         ║      ║
║         ║ active: False    ║       ║         ║      ║
║         ║ config: True     ║       ║         ║      ║
╠═════════╬══════════════════╬═══════╬═════════╬══════╣
║ rd_st12 ║ size: 6406 GiB   ║ None  ║         ║      ║
║         ║ level: 5         ║       ║         ║      ║
║         ║ strip_size: 128  ║       ║         ║      ║
║         ║ block_size: 4096 ║       ║         ║      ║
║         ║ sparepool: -     ║       ║         ║      ║
║         ║ active: False    ║       ║         ║      ║
║         ║ config: True     ║       ║         ║      ║
╚═════════╩══════════════════╩═══════╩═════════╩══════╝

Get the RAID UUIDs.

# grep uuid /etc/xiraid/raids/*.conf
/etc/xiraid/raids/rd_mt01.conf:    "uuid": "B86CF976-5BC3-47D2-A707-DDD2066BFDB6",
/etc/xiraid/raids/rd_mt02.conf:    "uuid": "68920316-B4A3-4B68-B825-BD62FD5985DE",
/etc/xiraid/raids/rd_st01.conf:    "uuid": "0422190E-6D47-43E3-A8E4-48AA4366B07B",
/etc/xiraid/raids/rd_st02.conf:    "uuid": "7EB24482-5178-457E-903E-3C13E201DAB6",
/etc/xiraid/raids/rd_st11.conf:    "uuid": "0467E185-DF3B-4BB7-8ACD-2AF04DDBCA05",
/etc/xiraid/raids/rd_st12.conf:    "uuid": "2244777A-C286-41B8-9B67-18723074E38A",

Create pcs resources for the RAIDs using the UUIDs from the previous step, assigning each resource to its group (for example, --group=gr_beegfs_mt01).

# pcs resource create xiraid_rd_mt01 ocf:xraid:raid name=rd_mt01 uuid=B86CF976-5BC3-47D2-A707-DDD2066BFDB6 op monitor interval=5s meta migration-threshold=1 --group=gr_beegfs_mt01
# pcs resource create xiraid_rd_mt02 ocf:xraid:raid name=rd_mt02 uuid=68920316-B4A3-4B68-B825-BD62FD5985DE op monitor interval=5s meta migration-threshold=1 --group=gr_beegfs_mt02
# pcs resource create xiraid_rd_st01 ocf:xraid:raid name=rd_st01 uuid=0422190E-6D47-43E3-A8E4-48AA4366B07B op monitor interval=5s meta migration-threshold=1 --group=gr_beegfs_st01
# pcs resource create xiraid_rd_st02 ocf:xraid:raid name=rd_st02 uuid=7EB24482-5178-457E-903E-3C13E201DAB6 op monitor interval=5s meta migration-threshold=1 --group=gr_beegfs_st02
# pcs resource create xiraid_rd_st11 ocf:xraid:raid name=rd_st11 uuid=0467E185-DF3B-4BB7-8ACD-2AF04DDBCA05 op monitor interval=5s meta migration-threshold=1 --group=gr_beegfs_st11
# pcs resource create xiraid_rd_st12 ocf:xraid:raid name=rd_st12 uuid=2244777A-C286-41B8-9B67-18723074E38A op monitor interval=5s meta migration-threshold=1 --group=gr_beegfs_st12

Filesystem resources

File system resources must be distributed among the appropriate groups so that they never run on a different node than the corresponding RAID array resource.

To avoid the overhead of updating last-access file timestamps, the metadata filesystems will be mounted with the noatime option; this has no influence on the last access timestamps that users see in a BeeGFS mount.

Create filesystem resource for metadata services.

# pcs resource create fs_mt01 Filesystem device="/dev/xi_rd_mt01" directory="/mnt/bgfscluster/rd_mt01" fstype="ext4" options="noatime,nodiratime" --group=gr_beegfs_mt01
# pcs resource create fs_mt02 Filesystem device="/dev/xi_rd_mt02" directory="/mnt/bgfscluster/rd_mt02" fstype="ext4" options="noatime,nodiratime" --group=gr_beegfs_mt02

The storage filesystems will be mounted with an increased number and size of log buffers by adding the logbufs and logbsize mount options, which allows XFS to handle and enqueue pending file and directory operations more efficiently. There are also several XFS mount options intended to further optimize streaming performance on RAID storage, such as largeio, inode64, and swalloc. For optimal streaming write throughput, the mount option allocsize=131072k is added to reduce the risk of fragmentation for large files.

Create filesystem resource for storage services.

# pcs resource create fs_st01 Filesystem device="/dev/xi_rd_st01" directory="/mnt/bgfscluster/rd_st01" fstype="xfs" options="noatime,nodiratime,logbufs=8,logbsize=256k,largeio,inode64,swalloc,allocsize=131072k" --group=gr_beegfs_st01
# pcs resource create fs_st02 Filesystem device="/dev/xi_rd_st02" directory="/mnt/bgfscluster/rd_st02" fstype="xfs" options="noatime,nodiratime,logbufs=8,logbsize=256k,largeio,inode64,swalloc,allocsize=131072k" --group=gr_beegfs_st02
# pcs resource create fs_st11 Filesystem device="/dev/xi_rd_st11" directory="/mnt/bgfscluster/rd_st11" fstype="xfs" options="noatime,nodiratime,logbufs=8,logbsize=256k,largeio,inode64,swalloc,allocsize=131072k" --group=gr_beegfs_st11
# pcs resource create fs_st12 Filesystem device="/dev/xi_rd_st12" directory="/mnt/bgfscluster/rd_st12" fstype="xfs" options="noatime,nodiratime,logbufs=8,logbsize=256k,largeio,inode64,swalloc,allocsize=131072k" --group=gr_beegfs_st12

BeeGFS resources

As with the filesystem resources, the BeeGFS service resources must be distributed among the appropriate groups so that they do not run on a different node than the corresponding RAID array and filesystem resources.

Please note that the BeeGFS Management service resource must be created first.

# pcs resource create beegfs_mgm systemd:beegfs-mgmtd        op monitor interval=10s  --group=gr_beegfs_mt01
# pcs resource create bgfs_mt01  systemd:beegfs-meta@mt01    op monitor interval=10s  --group=gr_beegfs_mt01
# pcs resource create bgfs_mt02  systemd:beegfs-meta@mt02    op monitor interval=10s  --group=gr_beegfs_mt02
# pcs resource create bgfs_st01  systemd:beegfs-storage@st01 op monitor interval=10s  --group=gr_beegfs_st01
# pcs resource create bgfs_st02  systemd:beegfs-storage@st02 op monitor interval=10s  --group=gr_beegfs_st02
# pcs resource create bgfs_st11  systemd:beegfs-storage@st11 op monitor interval=10s  --group=gr_beegfs_st11
# pcs resource create bgfs_st12  systemd:beegfs-storage@st12 op monitor interval=10s  --group=gr_beegfs_st12

Location preferences and start order

In xiRAID Classic 4.1, it is required to guarantee that only one RAID starts at a time. To do so, we define the following constraints. This limitation is planned to be removed in xiRAID Classic 4.2.

# pcs constraint order start xiraid_rd_mt01 then start xiraid_rd_mt02 kind=Serialize
# pcs constraint order start xiraid_rd_mt01 then start xiraid_rd_st01 kind=Serialize
# pcs constraint order start xiraid_rd_mt01 then start xiraid_rd_st02 kind=Serialize
# pcs constraint order start xiraid_rd_mt01 then start xiraid_rd_st11 kind=Serialize
# pcs constraint order start xiraid_rd_mt01 then start xiraid_rd_st12 kind=Serialize
# pcs constraint order start xiraid_rd_mt02 then start xiraid_rd_st01 kind=Serialize
# pcs constraint order start xiraid_rd_mt02 then start xiraid_rd_st02 kind=Serialize
# pcs constraint order start xiraid_rd_mt02 then start xiraid_rd_st11 kind=Serialize
# pcs constraint order start xiraid_rd_mt02 then start xiraid_rd_st12 kind=Serialize
# pcs constraint order start xiraid_rd_st01 then start xiraid_rd_st02 kind=Serialize
# pcs constraint order start xiraid_rd_st01 then start xiraid_rd_st11 kind=Serialize
# pcs constraint order start xiraid_rd_st01 then start xiraid_rd_st12 kind=Serialize
# pcs constraint order start xiraid_rd_st02 then start xiraid_rd_st11 kind=Serialize
# pcs constraint order start xiraid_rd_st02 then start xiraid_rd_st12 kind=Serialize
# pcs constraint order start xiraid_rd_st11 then start xiraid_rd_st12 kind=Serialize
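
Since these pairwise constraints follow a simple pattern, they can also be generated with a small shell loop instead of being typed one by one (equivalent to the fifteen commands above):

RAIDS=(xiraid_rd_mt01 xiraid_rd_mt02 xiraid_rd_st01 xiraid_rd_st02 xiraid_rd_st11 xiraid_rd_st12)
for ((i = 0; i < ${#RAIDS[@]}; i++)); do
    for ((j = i + 1; j < ${#RAIDS[@]}; j++)); do
        # serialize the start of every pair of xiRAID resources
        pcs constraint order start "${RAIDS[i]}" then start "${RAIDS[j]}" kind=Serialize
    done
done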

To ensure BeeGFS services start in the proper order, we need to configure the cluster to start Management service before all other BeeGFS services.

# pcs constraint order beegfs_mgm then bgfs_mt02
# pcs constraint order beegfs_mgm then bgfs_st01
# pcs constraint order beegfs_mgm then bgfs_st02
# pcs constraint order beegfs_mgm then bgfs_st11
# pcs constraint order beegfs_mgm then bgfs_st12

Set location constraints to define the node preferences for the resource groups.

# pcs constraint location gr_beegfs_mt01 prefers node-01-ic=50
# pcs constraint location gr_beegfs_mt02 prefers node-02-ic=50
# pcs constraint location gr_beegfs_st01 prefers node-01-ic=50
# pcs constraint location gr_beegfs_st02 prefers node-02-ic=50
# pcs constraint location gr_beegfs_st11 prefers node-01-ic=50
# pcs constraint location gr_beegfs_st12 prefers node-02-ic=50

Verify cluster

Use pcs status to verify the cluster and its resources.

# pcs status
Cluster name: beegfs-ha
Cluster Summary:
 * Stack: corosync (Pacemaker is running)
 * Current DC: node-02-ic (version 2.1.7-5.2.el9_4-0f7f88312) - partition with quorum
 * Last updated: Mon Sep 30 18:34:03 2024 on node-01-ic
 * Last change:  Mon Sep 30 18:33:43 2024 by root via root on node-01-ic
 * 2 nodes configured
 * 22 resource instances configured
Node List:
 * Online: [ node-01-ic node-02-ic ]
Full List of Resources:
 * node-01.stonith      (stonith:fence_ipmilan):         Started node-02-ic
 * node-02.stonith      (stonith:fence_ipmilan):         Started node-01-ic
 * Resource Group: gr_beegfs_mt01:
    * ip_mgm            (ocf:heartbeat:IPaddr2):         Started node-01-ic
    * xiraid_rd_mt01    (ocf:xraid:raid):                Started node-01-ic
    * fs_mt01           (ocf:heartbeat:Filesystem):      Started node-01-ic
    * beegfs_mgm        (systemd:beegfs-mgmtd):          Started node-01-ic
    * bgfs_mt01         (systemd:beegfs-meta@mt01):      Started node-01-ic
 * Resource Group: gr_beegfs_mt02:
    * xiraid_rd_mt02    (ocf:xraid:raid):                Started node-02-ic
    * fs_mt02           (ocf:heartbeat:Filesystem):      Started node-02-ic
    * bgfs_mt02         (systemd:beegfs-meta@mt02):      Started node-02-ic
 * Resource Group: gr_beegfs_st01:
    * xiraid_rd_st01    (ocf:xraid:raid):                Started node-01-ic
    * fs_st01           (ocf:heartbeat:Filesystem):      Started node-01-ic
    * bgfs_st01         (systemd:beegfs-storage@st01):   Started node-01-ic
 * Resource Group: gr_beegfs_st02:
    * xiraid_rd_st02    (ocf:xraid:raid):                Started node-02-ic
    * fs_st02           (ocf:heartbeat:Filesystem):      Started node-02-ic
    * bgfs_st02         (systemd:beegfs-storage@st02):   Started node-02-ic
 * Resource Group: gr_beegfs_st11:
    * xiraid_rd_st11    (ocf:xraid:raid):                Started node-01-ic
    * fs_st11           (ocf:heartbeat:Filesystem):      Started node-01-ic
    * bgfs_st11         (systemd:beegfs-storage@st11):   Started node-01-ic
 * Resource Group: gr_beegfs_st12:
    * xiraid_rd_st12    (ocf:xraid:raid):                Started node-02-ic
    * fs_st12           (ocf:heartbeat:Filesystem):      Started node-02-ic
    * bgfs_st12         (systemd:beegfs-storage@st12):   Started node-02-ic
Daemon Status:
 corosync: active/disabled
 pacemaker: active/disabled
 pcsd: active/enabled
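
To verify failover behavior, you can temporarily put one node into standby, confirm that its resource groups move to the other node, and then bring it back online (a basic sanity check; a full failover test, including fencing, should be planned before production use):

# pcs node standby node-01-ic
# pcs status
# pcs node unstandby node-01-ic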

Client systems setup

Install the kernel development package for the currently running kernel.

# yum install kernel-devel-$(uname -r)

Download the BeeGFS repository file.

# wget -O /etc/yum.repos.d/beegfs_rhel9.repo https://www.beegfs.io/release/beegfs_7.4.4/dists/beegfs-rhel9.repo

Add the public BeeGFS GPG key to package manager.

# rpm --import https://www.beegfs.io/release/beegfs_7.4.4/gpg/GPG-KEY-beegfs

Install the client packages from the repository.

# yum install beegfs-client beegfs-helperd beegfs-utils

Configure firewalld.

# firewall-cmd --permanent --add-port={{8004,8008,8011,8012,8021,8022,8023,8024}/{tcp,udp},{8006,30865}/tcp,33000-65000/udp}
# firewall-cmd --reload

And it is recommended to disable SELinux.

# setenforce 0
# vi /etc/sysconfig/selinux
...
SELINUX=disabled
...

Set up the client service, specifying the management service:

# /opt/beegfs/sbin/beegfs-setup-client -m 100.100.100.10

The client mount directory is defined in a separate configuration file.

By default, BeeGFS will be mounted to /mnt/beegfs.

If you want to mount the file system to a different location, you should edit the following configuration file:

# vi /etc/beegfs/beegfs-mounts.conf
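
The file contains one mount per line: the mount point followed by the path to the client configuration file. The default entry typically looks like this:

/mnt/beegfs /etc/beegfs/beegfs-client.conf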

If the client system has multiple network interfaces, create the file /etc/beegfs/connInterfacesFile2.conf to specify the client service interface:

# vi /etc/beegfs/connInterfacesFile2.conf

And specify IP interface:

ens19

Copy the connauthfile from the management node to the client system and edit the BeeGFS client configuration file, specifying the authentication file (connAuthFile = /etc/beegfs/connauthfile) and the interfaces file (connInterfacesFile = /etc/beegfs/connInterfacesFile2.conf).

# vi /etc/beegfs/beegfs-client.conf

Also edit the BeeGFS helperd configuration file, specifying the authentication file (connAuthFile = /etc/beegfs/connauthfile).

# vi /etc/beegfs/beegfs-helperd.conf

Start the client services.

# systemctl start beegfs-helperd
# systemctl start beegfs-client

Check the BeeGFS connection status.

# beegfs-check-servers
Management
==========
mgm-srv [ID: 1]: reachable at 100.100.100.10:8008 (protocol: TCP)
Metadata
==========
sr_meta_01 [ID: 1]: reachable at 100.100.100.10:8011 (protocol: TCP)
sr_meta_02 [ID: 2]: reachable at 100.100.100.10:8012 (protocol: TCP)
Storage
==========
sr_stor_01 [ID: 101]: reachable at 100.100.100.10:8021 (protocol: TCP)
sr_stor_02 [ID: 102]: reachable at 100.100.100.10:8022 (protocol: TCP)
sr_stor_11 [ID: 111]: reachable at 100.100.100.10:8023 (protocol: TCP)
sr_stor_12 [ID: 112]: reachable at 100.100.100.10:8024 (protocol: TCP)

# beegfs-net
mgmt_nodes
=============
mgm-srv [ID: 1]
   Connections: TCP: 1 (100.100.100.10:8008);
meta_nodes
=============
sr_meta_01 [ID: 1]
   Connections: TCP: 1 (100.100.100.11:8011);
sr_meta_02 [ID: 2]
   Connections: <none>
storage_nodes
=============
sr_stor_01 [ID: 101]
   Connections: TCP: 1 (100.100.100.10:8021);
sr_stor_02 [ID: 102]
   Connections: TCP: 1 (100.100.100.10:8022);
sr_stor_11 [ID: 111]
   Connections: TCP: 1 (100.100.100.10:8023);
sr_stor_12 [ID: 112]
   Connections: TCP: 1 (100.100.100.10:8024);

Setting up additional clients can be done in a similar manner.
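
From any client, you can also run a quick functional check of the file system, for example by listing the free space reported by the metadata and storage targets and the state of the storage targets:

# beegfs-df
# beegfs-ctl --listtargets --nodetype=storage --state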

Conclusion

In conclusion, the deployment of a highly available BeeGFS solution using xiRAID Classic 4.1 showcases the capability of combining high-performance RAID storage with advanced cluster management. This setup ensures reliable data access, even in the event of node failures, by leveraging Pacemaker’s cluster resources and the robustness of xiRAID’s RAID configurations. By following the outlined steps for configuration, installation, and tuning, users can achieve a resilient and efficient storage solution, making it well-suited for demanding parallel file system environments.