Monitoring xiRAID and Lustre in a High-Availability Cluster with Prometheus and Grafana

May 26, 2026

High-availability storage clusters need monitoring that understands how resources move between nodes. In a dual-node environment with xiRAID arrays and Lustre, the same configuration may be visible on both nodes, while a given RAID array or Lustre target is active on only one node at a time.

This creates a common monitoring challenge. Prometheus exporters running on both nodes may discover inactive resources and report them as offline or failed, even though the cluster is operating normally. Without additional filtering in PromQL and Grafana, this can lead to duplicated metrics and false alerts.

In this guide, we show how to configure monitoring for a high-availability cluster with xiRAID and Lustre using Prometheus exporters and Grafana dashboards. RAID arrays and Lustre OSTs are evenly distributed across the cluster nodes, while a dedicated monitoring node is used for metric collection, storage, querying, and visualization.

The installation and configuration of the cluster, xiRAID, and Lustre are beyond the scope of this guide. For detailed instructions on setting up the cluster environment, please refer to the corresponding article in our blog.

The setup uses three Prometheus exporters on each cluster node to collect storage, filesystem, and system-level metrics:

xiRAID exporter

GitHub: https://github.com/E4-Computer-Engineering/xiraid-exporter

The xiraid_exporter is a Prometheus exporter designed by E4 Computer Engineering to collect detailed metrics from xiRAID software RAID systems. It retrieves data via the native xicli interface and exposes it in Prometheus format, enabling integration into modern monitoring stacks.

Features

  • Modular collector architecture with the ability to enable or disable specific collectors based on monitoring needs
  • Monitoring of RAID array state, configuration, and activity (including active/inactive status and device states)
  • Collection of detailed performance and health metrics, such as memory usage, device health, and wear levels
  • Tracking of drive-level issues, including faulty sector counts per device
  • License monitoring, including validity, expiration, and disk usage limits
  • Optional collection of advanced data such as pool configuration, mail alerts, and system settings
  • Exposure of all metrics in Prometheus-compatible format for visualization and alerting in Grafana and similar tools

Lustre exporter

GitHub: https://github.com/GSI-HPC/lustre_exporter

The lustre_exporter is a Prometheus exporter designed to collect detailed metrics from the Lustre filesystem. It gathers performance, health, and operational data from various Lustre components by reading procfs and sysfs interfaces on each node.

Features

  • Collection of metrics from all major Lustre components: OST, MDT, MDS, MGS, clients, LNet, and general system metrics
  • Flexible metric granularity via configurable collectors (disabled, core, extended) to control metric volume and overhead
  • Export of detailed performance and health data, including I/O statistics, metadata operations, and network activity
  • Ability to selectively enable/disable problematic or unnecessary metrics for stability and compatibility

Node exporter

GitHub: https://github.com/prometheus/node_exporter

The node_exporter is a Prometheus exporter designed to collect hardware and operating system metrics from *nix-based systems.

Features

  • Collection of core system metrics, including CPU, memory, disk I/O, filesystem usage, and network statistics
  • Wide range of built-in collectors for hardware, kernel, and subsystem metrics (e.g., diskstats, meminfo, loadavg, filesystem)
  • Modular and extensible architecture with optional collectors that can be enabled or disabled based on performance and monitoring requirements

Prometheus and Grafana

Prometheus is deployed on a dedicated monitoring node external to the cluster and handles metric collection and storage. Grafana is used for visualization and analysis.

Prometheus is responsible for scraping metrics from exporters deployed on each cluster node, storing them as time-series data, and enabling flexible querying via PromQL. Grafana, in turn, provides a powerful interface for building dashboards, correlating metrics from multiple sources, and configuring alerting, allowing operators to gain clear and actionable insights into the state of the system.

The main challenge is that Prometheus sees the two cluster nodes as independent exporter targets. In reality, some resources are active on only one node at a time. If queries do not account for this, metrics may be duplicated and inactive resources may trigger false alerts. For accurate visualization and alerting, PromQL queries should treat a resource as failed only when it is unavailable on both nodes.

Example (a single query, explained part by part):

max by (raid_name, instance) (xiraid_raid_active == 1)

This part keeps only the xiraid_raid_active series whose value is 1 (active) and aggregates them by raid_name and instance.

The result is one series per active RAID array, labeled with the exporter instance (node) on which the array is active.

or on(raid_name)

The or operator merges the two result sets using raid_name as the matching key.

It keeps every series from the left-hand side and adds a series from the right-hand side only when no left-hand series has the same raid_name.

max by (raid_name) (xiraid_raid_active)

This part computes the maximum state of each RAID array across the entire cluster, grouped only by raid_name.

It serves as a cluster-level fallback: an array that is not active on any node still appears in the result, marked as inactive, instead of disappearing from it.
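To try the expression outside Grafana, the three fragments can be joined and sent to the Prometheus HTTP API. The following is a minimal sketch, not part of the original setup: the helper name prom_query is hypothetical, the URL assumes Prometheus on its default port on the monitoring node, and the second expression assumes that an inactive array reports the value 0.

```shell
# Assumption: Prometheus is reachable on its default port on the monitoring node.
PROM_URL="${PROM_URL:-http://localhost:9090}"

# The three fragments above joined into one expression.
RAID_STATE_QUERY='max by (raid_name, instance) (xiraid_raid_active == 1) or on(raid_name) max by (raid_name) (xiraid_raid_active)'

# Arrays that are not active on any node (assumes inactive is reported as 0).
RAID_DOWN_QUERY='max by (raid_name) (xiraid_raid_active) == 0'

# Hypothetical helper: run an instant query through the Prometheus HTTP API.
prom_query() {
  curl -fsS --get "${PROM_URL}/api/v1/query" --data-urlencode "query=$1"
}

# Usage (requires a running Prometheus):
#   prom_query "$RAID_STATE_QUERY"
#   prom_query "$RAID_DOWN_QUERY"
```

The second expression is the kind of condition that suits an alert: it returns a series only for arrays that no node reports as active, so a normal failover does not trigger it.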

xiRAID dashboard example

Lustre dashboard example

The following section shows one possible implementation of this monitoring setup, with exporters installed on both cluster nodes and Prometheus/Grafana deployed on a dedicated monitoring node.

Installation guide

1. node_exporter installation

Install on both cluster nodes

1.1 Packages installation

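cd /opt
# Example version; check the releases page for the current one
VERSION=1.9.1
wget https://github.com/prometheus/node_exporter/releases/download/v${VERSION}/node_exporter-${VERSION}.linux-amd64.tar.gz
tar -xvf node_exporter-${VERSION}.linux-amd64.tar.gz
mv node_exporter-${VERSION}.linux-amd64 node_exporter
cp node_exporter/node_exporter /usr/local/bin/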
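Because the binary is installed system-wide, it can be worth verifying the download against the sha256sums.txt file published with each node_exporter release. The following is a minimal sketch; the function names are hypothetical and the version number is only an example.

```shell
# Example version only; check the node_exporter releases page for the current one.
NODE_EXPORTER_VERSION="${NODE_EXPORTER_VERSION:-1.9.1}"
RELEASE_BASE="https://github.com/prometheus/node_exporter/releases/download"

# Print the tarball file name for a given version.
tarball_name() {
  printf 'node_exporter-%s.linux-amd64.tar.gz\n' "$1"
}

# Print the download URL of a file attached to a given release.
release_url() {
  printf '%s/v%s/%s\n' "$RELEASE_BASE" "$1" "$2"
}

# Download the tarball and the checksum list into the current directory,
# then unpack only if the checksum matches. The body runs in a subshell,
# so its variables do not leak into the calling shell.
# --ignore-missing skips checksum entries for files that were not downloaded.
fetch_node_exporter() (
  version="$1"
  tarball="$(tarball_name "$version")"
  wget -q "$(release_url "$version" "$tarball")" &&
    wget -q -O sha256sums.txt "$(release_url "$version" sha256sums.txt)" &&
    sha256sum --check --ignore-missing sha256sums.txt &&
    tar -xf "$tarball"
)

# Usage:
#   cd /opt && fetch_node_exporter "$NODE_EXPORTER_VERSION"
```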

1.2 systemd unit

nano /etc/systemd/system/node_exporter.service

[Unit]
Description=Node Exporter

[Service]
ExecStart=/usr/local/bin/node_exporter

[Install]
WantedBy=multi-user.target

Enable systemd service:

systemctl daemon-reload
systemctl enable --now node_exporter

Verify:

http://NODE_IP:9100/metrics

2. xiraid-exporter installation

Install on both cluster nodes (where xiRAID is installed)

2.1 Prerequisite

xiRAID Classic should be installed.

2.2 Packages installation

dnf install -y git golang

cd /opt
git clone https://github.com/E4-Computer-Engineering/xiraid-exporter.git
cd xiraid-exporter/cmd/xiraid-exporter
go build -o xiraid-exporter .
cp xiraid-exporter /usr/local/bin/

2.3 Test run

xiraid-exporter

Verify:

http://NODE_IP:9827/metrics

2.4 systemd unit

nano /etc/systemd/system/xiraid_exporter.service

[Unit]
Description=xiRAID Exporter

[Service]  
ExecStart=/usr/local/bin/xiraid-exporter \
  --collector.raid \
  --collector.raid-extended \
  --collector.drive-faulty

Restart=always

[Install]
WantedBy=multi-user.target

Enable systemd service:

systemctl daemon-reload
systemctl enable --now xiraid_exporter

3. lustre_exporter installation

Install on both cluster nodes (MDS + OSS)

3.1 Packages installation

dnf install -y make

mkdir -p /opt/go
export GOPATH=/opt/go

cd $GOPATH
mkdir -p src/github.com/GSI-HPC
cd src/github.com/GSI-HPC

git clone https://github.com/GSI-HPC/lustre_exporter.git
cd lustre_exporter

GOTOOLCHAIN=auto make build
cp lustre_exporter /usr/local/bin/

3.2 Test run

lustre_exporter

Verify:

http://NODE_IP:9169/metrics

3.3 systemd unit

nano /etc/systemd/system/lustre_exporter.service

[Unit]
Description=Lustre Exporter

[Service]  
ExecStart=/usr/local/bin/lustre_exporter \
  --collector.ost=extended \
  --collector.mdt=extended \
  --collector.mds=extended \
  --collector.lnet=extended

Restart=always

[Install]
WantedBy=multi-user.target

Enable systemd service:

systemctl daemon-reload
systemctl enable --now lustre_exporter
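With all three units enabled, a quick status check on each cluster node confirms that the exporters are running. The following is a minimal sketch using the unit names created in this guide; the function name is hypothetical.

```shell
# Unit names as created in this guide.
EXPORTER_UNITS="node_exporter xiraid_exporter lustre_exporter"

# Print the state of each exporter unit; return non-zero if any is not active.
check_exporter_units() {
  failed=0
  for unit in $EXPORTER_UNITS; do
    if systemctl is-active --quiet "$unit"; then
      echo "active   $unit"
    else
      echo "INACTIVE $unit"
      failed=1
    fi
  done
  return "$failed"
}

# Usage (run on each cluster node):
#   check_exporter_units || journalctl -u xiraid_exporter -n 20
```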

Open the exporter ports in firewalld (on both cluster nodes):

firewall-cmd --permanent --add-port={9100/tcp,9827/tcp,9169/tcp}
firewall-cmd --reload
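Once the ports are open, the three metrics endpoints on both nodes can be checked from the monitoring node in one pass instead of opening each URL by hand. The following is a minimal sketch; the host names node1 and node2 are assumed to match the names used in the Prometheus configuration, and the function names are hypothetical.

```shell
# Assumed host names; adjust to the DNS names of the cluster nodes.
NODES="${NODES:-node1 node2}"
# node_exporter, xiraid-exporter, lustre_exporter
PORTS="9100 9827 9169"

# Print the metrics URL for a host and port.
metrics_url() {
  printf 'http://%s:%s/metrics\n' "$1" "$2"
}

# Request every endpoint and report whether it answered successfully.
check_exporters() {
  for node in $NODES; do
    for port in $PORTS; do
      url="$(metrics_url "$node" "$port")"
      if curl -fsS --max-time 5 -o /dev/null "$url"; then
        echo "OK   $url"
      else
        echo "FAIL $url"
      fi
    done
  done
}

# Usage (from the monitoring node):
#   check_exporters
```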

4. Prometheus

4.1 Packages installation

cd /opt
wget https://github.com/prometheus/prometheus/releases/download/v3.11.2/prometheus-3.11.2.linux-amd64.tar.gz
tar -xvf prometheus-*.tar.gz
mv prometheus-3.11.2.linux-amd64 prometheus

4.2 systemd unit

Create a new user

useradd --no-create-home --shell /bin/false prometheus
chown -R prometheus:prometheus /opt/prometheus

Create a systemd unit

nano /etc/systemd/system/prometheus.service

[Unit]
Description=Prometheus
After=network.target

[Service]
User=prometheus
ExecStart=/opt/prometheus/prometheus \
  --config.file=/opt/prometheus/prometheus.yml \
  --storage.tsdb.path=/opt/prometheus/data

[Install]
WantedBy=multi-user.target

4.3 Configuration

Add the exporters to the Prometheus configuration file (please verify that the cluster nodes' DNS names are resolvable)

nano /opt/prometheus/prometheus.yml

...
  - job_name: 'node'
    static_configs:
      - targets:
          - node1:9100
          - node2:9100

  - job_name: 'lustre'
    static_configs:
      - targets:
          - node1:9169
          - node2:9169

  - job_name: 'xiraid'
    static_configs:
      - targets:
          - node1:9827
          - node2:9827
        labels:
          cluster: xiraid-cluster

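Reload systemd so that it picks up the new unit, enable the service, and restart Prometheus to apply the configuration:

systemctl daemon-reload
systemctl enable prometheus
systemctl restart prometheus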
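An indentation mistake in prometheus.yml prevents Prometheus from starting, so it can help to validate the file before each restart. The following is a minimal sketch using promtool, which ships in the same tarball as the prometheus binary; the function name is hypothetical and the directory matches the path used in this guide.

```shell
# Directory where the Prometheus tarball was unpacked in this guide.
PROM_DIR="${PROM_DIR:-/opt/prometheus}"

# Validate prometheus.yml with promtool and restart only if the check passes.
reload_prometheus() {
  if "${PROM_DIR}/promtool" check config "${PROM_DIR}/prometheus.yml"; then
    systemctl restart prometheus
  else
    echo "Configuration check failed; Prometheus was not restarted." >&2
    return 1
  fi
}

# Usage (as root on the monitoring node):
#   reload_prometheus
# Afterwards, the scrape targets and their health can be listed with:
#   curl -fsS http://localhost:9090/api/v1/targets
```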

5. Grafana

5.1 Packages installation

dnf install -y https://dl.grafana.com/grafana-enterprise/release/13.0.1/grafana-enterprise_13.0.1_24542347077_linux_amd64.rpm
systemctl enable --now grafana-server

Open the Prometheus and Grafana ports in firewalld:

firewall-cmd --permanent --add-port={9090/tcp,3000/tcp}
firewall-cmd --reload

5.2 Prometheus connection

In the Grafana web interface, navigate to “Connections > Data sources” and add a Prometheus data source. Set the Prometheus server URL to: http://localhost:9090.
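As an alternative to the web interface, the data source can be defined in a Grafana provisioning file so that the setup is reproducible. The following is a minimal sketch; the file name prometheus.yaml and the function name are hypothetical, and the directory is the default provisioning path of the Grafana RPM package.

```shell
# Default data source provisioning directory of the Grafana RPM package.
PROVISIONING_DIR="${PROVISIONING_DIR:-/etc/grafana/provisioning/datasources}"

# Write a provisioning file that registers Prometheus as the default data source.
write_prometheus_datasource() {
  cat > "${PROVISIONING_DIR}/prometheus.yaml" <<'EOF'
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://localhost:9090
    isDefault: true
EOF
}

# Usage (as root on the monitoring node):
#   write_prometheus_datasource && systemctl restart grafana-server
```

Grafana reads provisioning files at startup, which is why the usage line restarts grafana-server after writing the file.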

5.3 Dashboards

All components are now installed, configured, and connected. You can proceed to customize your dashboards according to your preferences and requirements.

As a reference, you may use the following sample dashboards:

To import a dashboard JSON file, navigate in the Grafana web interface to “Dashboards > New > Import”, and select the JSON file.

Conclusion

Monitoring xiRAID and Lustre in a high-availability cluster requires more than simply deploying exporters on each node. Because resources can be configured on both nodes while active on only one, Prometheus queries and Grafana dashboards must account for cluster state to avoid duplicated metrics and false alerts.

By combining the xiRAID exporter, Lustre exporter, Node exporter, Prometheus, and Grafana, administrators can build a unified view of RAID health, Lustre activity, and node-level performance. With the right PromQL logic, the monitoring system reflects the actual state of the HA cluster and provides a reliable foundation for alerting, troubleshooting, and day-to-day operations.