High-availability storage clusters need monitoring that understands how resources move between nodes. In a dual-node environment with xiRAID arrays and Lustre, the same configuration may be visible on both nodes, while a given RAID array or Lustre target is active on only one node at a time.
This creates a common monitoring challenge. Prometheus exporters running on both nodes may discover inactive resources and report them as offline or failed, even though the cluster is operating normally. Without additional filtering in PromQL and Grafana, this can lead to duplicated metrics and false alerts.
In this guide, we show how to configure monitoring for a high-availability cluster with xiRAID and Lustre using Prometheus exporters and Grafana dashboards. RAID arrays and Lustre OSTs are evenly distributed across the cluster nodes, while a dedicated monitoring node is used for metric collection, storage, querying, and visualization.
The installation and configuration of the cluster, xiRAID, and Lustre are beyond the scope of this guide. For detailed instructions on setting up the cluster environment, please refer to the corresponding article in our blog.
The setup uses three Prometheus exporters on each cluster node to collect storage, filesystem, and system-level metrics:
- xiRAID exporter, developed by our partner E4 Computer Engineering
- Lustre exporter, a fork of HewlettPackard/lustre_exporter maintained by GSI IT HPC
- Node exporter, the standard hardware and OS metrics exporter from the Prometheus project
xiRAID exporter
GitHub: https://github.com/E4-Computer-Engineering/xiraid-exporter
The xiraid_exporter is a Prometheus exporter designed by E4 Computer Engineering to collect detailed metrics from xiRAID software RAID systems. It retrieves data via the native xicli interface and exposes it in Prometheus format, enabling integration into modern monitoring stacks.
Features
- Modular collector architecture with the ability to enable or disable specific collectors based on monitoring needs
- Monitoring of RAID array state, configuration, and activity (including active/inactive status and device states)
- Collection of detailed performance and health metrics, such as memory usage, device health, and wear levels
- Tracking of drive-level issues, including faulty sector counts per device
- License monitoring, including validity, expiration, and disk usage limits
- Optional collection of advanced data such as pool configuration, mail alerts, and system settings
- Exposure of all metrics in Prometheus-compatible format for visualization and alerting in Grafana and similar tools
Lustre exporter
GitHub: https://github.com/GSI-HPC/lustre_exporter
The lustre_exporter is a Prometheus exporter designed to collect detailed metrics from the Lustre filesystem. It gathers performance, health, and operational data from various Lustre components by reading procfs and sysfs interfaces on each node.
Features
- Collection of metrics from all major Lustre components: OST, MDT, MDS, MGS, clients, LNet, and general system metrics
- Flexible metric granularity via configurable collectors (disabled, core, extended) to control metric volume and overhead
- Export of detailed performance and health data, including I/O statistics, metadata operations, and network activity
- Ability to selectively enable/disable problematic or unnecessary metrics for stability and compatibility
Node exporter
GitHub: https://github.com/prometheus/
The node_exporter is a Prometheus exporter designed to collect hardware and operating system metrics from *nix-based systems.
Features
- Collection of core system metrics, including CPU, memory, disk I/O, filesystem usage, and network statistics
- Wide range of built-in collectors for hardware, kernel, and subsystem metrics (e.g., diskstats, meminfo, loadavg, filesystem)
- Modular and extensible architecture with optional collectors that can be enabled or disabled based on performance and monitoring requirements
Prometheus and Grafana
For metric collection and storage, Prometheus is deployed on a dedicated monitoring node external to the cluster. For visualization and analysis, Grafana is used.
Prometheus is responsible for scraping metrics from exporters deployed on each cluster node, storing them as time-series data, and enabling flexible querying via PromQL. Grafana, in turn, provides a powerful interface for building dashboards, correlating metrics from multiple sources, and configuring alerting, allowing operators to gain clear and actionable insights into the state of the system.
The main challenge is that Prometheus sees the two cluster nodes as independent exporter targets. In reality, some resources are active on only one node at a time. If queries do not account for this, metrics may be duplicated and inactive resources may trigger false alerts. For accurate visualization and alerting, PromQL queries should treat a resource as failed only when it is unavailable on both nodes.
Example:
A query for xiRAID array state can be built from three parts.
The first part selects only xiRAID arrays that are in the active state (== 1) in the xiraid_raid_active metric and aggregates them by raid_name and instance. It ensures metrics are deduplicated per RAID per exporter instance, so only a single active signal per node is considered.
The second part is a logical merge of the two independent result sets, using raid_name as the matching key. It ensures that the per-node active-state evaluation is combined with the cluster-wide evaluation of the same RAID object.
The third part computes the maximum state of each RAID array across the entire cluster, grouped only by raid_name. It serves as a fallback cluster-level view, indicating whether the RAID is active on any node.
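The three parts described above can be assembled into a single expression. The sketch below is a reconstruction from that description and is not necessarily the exact query used in the sample dashboards. It assumes the metric xiraid_raid_active and the label raid_name as named in the text, that the exporter reports 0 for an inactive array, and that Prometheus is reachable at http://localhost:9090. The function names and the PROM_URL variable are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: build an HA-aware PromQL expression for xiRAID array state.

# Hypothetical helper: prints the PromQL expression (three parts, one per line).
ha_raid_query() {
  cat <<'EOF'
max by (raid_name, instance) (xiraid_raid_active == 1)
or on (raid_name)
max by (raid_name) (xiraid_raid_active)
EOF
}

# Hypothetical helper: sends the expression to the Prometheus HTTP API.
# PROM_URL is an assumption; point it at your monitoring node.
run_ha_raid_query() {
  local prom_url="${PROM_URL:-http://localhost:9090}"
  curl -fsS --get "${prom_url}/api/v1/query" \
    --data-urlencode "query=$(ha_raid_query)"
}
```

Under these assumptions, each array yields one series: value 1 carrying the instance label of the node where it is active, or value 0 without an instance label when it is active on neither node. An alert condition can then compare the result with 0 instead of alerting on every inactive copy.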
xiRAID dashboard example
Lustre dashboard example
The following section shows one possible implementation of this monitoring setup, with exporters installed on both cluster nodes and Prometheus/Grafana deployed on a dedicated monitoring node.
Installation guide
1. node_exporter installation
Install on both cluster nodes
1.1 Packages installation
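# Set VERSION to the node_exporter release you want to install (1.8.2 is only an example)
VERSION=1.8.2
cd /opt
wget https://github.com/prometheus/node_exporter/releases/download/v${VERSION}/node_exporter-${VERSION}.linux-amd64.tar.gz
tar -xvf node_exporter-${VERSION}.linux-amd64.tar.gz
mv node_exporter-${VERSION}.linux-amd64 node_exporter
cp node_exporter/node_exporter /usr/local/bin/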
1.2 systemd unit
nano /etc/systemd/system/node_exporter.service
[Unit]
Description=Node Exporter
[Service]
ExecStart=/usr/local/bin/node_exporter
[Install]
WantedBy=multi-user.target
Enable systemd service:
systemctl daemon-reload
systemctl enable --now node_exporter
Verify:
http://NODE_IP:9100/metrics
2. xiraid-exporter installation
Install on both cluster nodes (where xiRAID is installed)
2.1 Prerequisite
xiRAID Classic should be installed.
2.2 Packages installation
dnf install -y git golang
cd /opt
git clone https://github.com/E4-Computer-Engineering/xiraid-exporter.git
cd xiraid-exporter/cmd/xiraid-exporter
go build -o xiraid-exporter .
cp xiraid-exporter /usr/local/bin/
2.3 Test run
Start the exporter manually from the command line, then verify that the metrics endpoint responds:
http://NODE_IP:9827/metrics
2.4 systemd unit
nano /etc/systemd/system/xiraid_exporter.service
[Unit]
Description=XiRAID Exporter
[Service]
ExecStart=/usr/local/bin/xiraid-exporter \
  --collector.raid \
  --collector.raid-extended \
  --collector.drive-faulty
Restart=always
[Install]
WantedBy=multi-user.target
Enable systemd service:
systemctl daemon-reload
systemctl enable --now xiraid_exporter
3. lustre_exporter installation
Install on both cluster nodes (MDS + OSS)
3.1 Packages installation
dnf install -y make
mkdir -p /opt/go
export GOPATH=/opt/go
cd $GOPATH
mkdir -p src/github.com/GSI-HPC
cd src/github.com/GSI-HPC
git clone https://github.com/GSI-HPC/lustre_exporter.git
cd lustre_exporter
GOTOOLCHAIN=auto make build
cp lustre_exporter /usr/local/bin/
3.2 Test run
Start the exporter manually from the command line, then verify that the metrics endpoint responds:
http://NODE_IP:9169/metrics
3.3 systemd unit
nano /etc/systemd/system/lustre_exporter.service
[Unit]
Description=Lustre Exporter
[Service]
ExecStart=/usr/local/bin/lustre_exporter \
  --collector.ost=extended \
  --collector.mdt=extended \
  --collector.mds=extended \
  --collector.lnet=extended
Restart=always
[Install]
WantedBy=multi-user.target
Enable systemd service:
systemctl daemon-reload
systemctl enable --now lustre_exporter
Add a rule to firewalld:
firewall-cmd --permanent --add-port={9100/tcp,9827/tcp,9169/tcp}
firewall-cmd --reload
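Once the ports are open, it can help to confirm from the monitoring node that all three exporters answer on every cluster node. The following is a minimal sketch rather than part of the original procedure. The port numbers are the ones used in this guide, and the helper names metrics_url and check_exporters are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: probe the exporter endpoints on each cluster node.

# Hypothetical helper: build the metrics URL for a host and port.
metrics_url() {
  printf 'http://%s:%s/metrics\n' "$1" "$2"
}

# Hypothetical helper: report whether each exporter responds on each host.
# Ports: 9100 node_exporter, 9827 xiraid-exporter, 9169 lustre_exporter.
check_exporters() {
  local host port
  for host in "$@"; do
    for port in 9100 9827 9169; do
      if curl -fsS --max-time 5 -o /dev/null "$(metrics_url "$host" "$port")"; then
        echo "OK   ${host}:${port}"
      else
        echo "FAIL ${host}:${port}"
      fi
    done
  done
}
```

It could be called as check_exporters node1 node2, using whatever host names or IP addresses resolve to your cluster nodes.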
4. Prometheus
4.1 Packages installation
cd /opt
wget https://github.com/prometheus/prometheus/releases/download/v3.11.2/prometheus-3.11.2.linux-amd64.tar.gz
tar -xvf prometheus-3.11.2.linux-amd64.tar.gz
mv prometheus-3.11.2.linux-amd64 prometheus
4.2 systemd unit
Create a new user
useradd --no-create-home --shell /bin/false prometheus
chown -R prometheus:prometheus /opt/prometheus
Create a systemd unit
nano /etc/systemd/system/prometheus.service
[Unit]
Description=Prometheus
After=network.target
[Service]
User=prometheus
ExecStart=/opt/prometheus/prometheus \
  --config.file=/opt/prometheus/prometheus.yml \
  --storage.tsdb.path=/opt/prometheus/data
[Install]
WantedBy=multi-user.target
4.3 Configuration
Add the exporters to the Prometheus configuration file (please verify that the cluster nodes' DNS names are resolvable)
nano /opt/prometheus/prometheus.yml
...
  - job_name: 'node'
    static_configs:
      - targets:
          - node1:9100
          - node2:9100
  - job_name: 'lustre'
    static_configs:
      - targets:
          - node1:9169
          - node2:9169
  - job_name: 'xiraid'
    static_configs:
      - targets:
          - node1:9827
          - node2:9827
        labels:
          cluster: xiraid-cluster
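Because YAML is sensitive to indentation, it can be worth validating the edited file before Prometheus is restarted. The promtool binary ships in the same release archive as prometheus. The sketch below assumes the /opt/prometheus layout used in this guide; the function name and the PROM_DIR variable are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: validate prometheus.yml before restarting the service.

# Hypothetical helper; PROM_DIR is an assumption matching the install path in this guide.
validate_prometheus_config() {
  local prom_dir="${PROM_DIR:-/opt/prometheus}"
  if [ ! -x "${prom_dir}/promtool" ]; then
    echo "promtool not found in ${prom_dir}" >&2
    return 2
  fi
  # promtool exits non-zero if the configuration file is invalid.
  "${prom_dir}/promtool" check config "${prom_dir}/prometheus.yml"
}
```

Running validate_prometheus_config and restarting only when it succeeds avoids leaving the monitoring node with a service that fails to start.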
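Reload systemd, enable the service at boot, and restart Prometheus to apply the configuration:
systemctl daemon-reload
systemctl enable prometheus
systemctl restart prometheus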
5. Grafana
5.1 Packages installation
yum install -y https://dl.grafana.com/grafana-enterprise/release/13.0.1/grafana-enterprise_13.0.1_24542347077_linux_amd64.rpm
systemctl enable --now grafana-server
Add a rule to firewalld:
firewall-cmd --permanent --add-port={9090/tcp,3000/tcp}
firewall-cmd --reload
5.2 Prometheus connection
In the Grafana web interface, navigate to “Connections > Data sources” and add a Prometheus data source. Set the Prometheus server URL to: http://localhost:9090.
5.3 Dashboards
All components are now installed, configured, and connected. You can proceed to customize your dashboards according to your preferences and requirements.
As a reference, you may use the following sample dashboards:
- xiRAID dashboard: xiraid_ha.json
- Lustre dashboard: lustre.json
To import a dashboard JSON file, navigate in the Grafana web interface to “Dashboards > New > Import”, and select the JSON file.
Conclusion
Monitoring xiRAID and Lustre in a high-availability cluster requires more than simply deploying exporters on each node. Because resources can be configured on both nodes while active on only one, Prometheus queries and Grafana dashboards must account for cluster state to avoid duplicated metrics and false alerts.
By combining the xiRAID exporter, Lustre exporter, Node exporter, Prometheus, and Grafana, administrators can build a unified view of RAID health, Lustre activity, and node-level performance. With the right PromQL logic, the monitoring system reflects the actual state of the HA cluster and provides a reliable foundation for alerting, troubleshooting, and day-to-day operations.