All-flash Multinode High Availability for Lustre Disaggregated Implementations

May 21, 2025


In today’s data-driven world, high-performance storage systems are the backbone of modern computing environments. Whether it’s for scientific research, AI/ML workloads, or enterprise applications, organizations require storage solutions that deliver both speed and reliability. Lustre, one of the most widely used parallel file systems, has long been a go-to choice for high-performance computing environments. However, as workloads grow more demanding, ensuring high availability (HA) and performance in Lustre deployments becomes increasingly critical.

In this blog post, we’ll explore how all-flash multinode HA can be achieved in Lustre disaggregated implementations. We’ll dive into key concepts like Lustre clustering, NVMe shared access options, Storage Bridge Bay (SBB) NVMe systems, dual-node clusters with EBOFs, and multinode cluster configurations. By the end of this post, you’ll have a clear understanding of the trade-offs and best practices for building robust Lustre storage architectures that meet the demands of modern workloads.

Lustre Clustering

High availability is not just a feature; it’s a necessity in mission-critical environments. When designing Lustre clusters, two primary objectives must guide your architecture: service continuity in case of a server failure (a must-have) and minimal performance degradation after a failover (a should-have). These expectations form the foundation of a reliable Lustre HA setup.

Lustre clusters traditionally rely on failover pairs, either Active-Passive (A-P) or Active-Active (A-A), to achieve these goals. A common approach is to use two-node Pacemaker HA clusters as building blocks. Shared storage plays a crucial role here, whether it’s through shared LUNs from a disk array accessed via SAN or a shared set of drives combined into virtual devices using local data protection engines on Lustre servers. However, achieving seamless failover without compromising performance is not easy. While this approach works well for many use cases, it comes with challenges, particularly when it comes to performance during failover scenarios.

The failover pairs are designed to handle server failures smoothly, but there’s a trade-off: performance degradation. When one server in a failover pair fails, the remaining server must handle the increased workload. This can lead to performance drops of up to 50%, which is far from ideal for latency-sensitive applications. To mitigate this risk, each server in a failover pair must be oversized to handle twice the expected workload. Specifically:

  • up to 2x the LNET connections throughput
  • up to 2x the drive connections throughput
  • up to 2x the amount of RAM
  • up to 2x the amount of CPU

This over-provisioning ensures that the surviving node can maintain acceptable performance during failover, but it comes at a steep cost. Oversizing hardware significantly increases the overall system price, making this approach less attractive for budget-conscious organizations.
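
As a rough illustration of this sizing rule, the sketch below (plain Python with hypothetical per-node figures, not measured values) doubles each resource so that a surviving node can absorb its peer’s workload.

```python
# Rough sizing sketch for one server in an Active-Active failover pair.
# The per-node workload numbers are hypothetical placeholders, not measurements.

normal_per_node = {
    "lnet_throughput_GBps": 40,   # LNET traffic served by this node
    "drive_throughput_GBps": 40,  # backend NVMe traffic of this node
    "ram_GB": 256,
    "cpu_cores": 32,
}

# After a failover the surviving node carries both nodes' workloads,
# so each server must be provisioned for roughly 2x its normal share.
required_per_node = {key: 2 * value for key, value in normal_per_node.items()}

for key, value in required_per_node.items():
    print(f"{key}: provision {value} (normal share: {normal_per_node[key]})")
```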

Moreover, performance degradation in a single failover pair rarely stays isolated: depending on how data is striped and pooled, the impact can spread to specific pools or to the entire Lustre filesystem. For example:

  • If the filesystem is divided into multiple pools and files are striped within a pool, only the affected pool may suffer degraded performance.
  • Conversely, if striping spans the entire filesystem, the entire system could feel the effects of a single node’s failure.

This highlights the need for robust failover mechanisms and intelligent striping strategies to minimize the radius of performance degradation during failover events.
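
To make that failure radius concrete, here is an illustrative sketch with a hypothetical pool and failover-pair layout (hard-coded rather than queried from Lustre) that shows which pools, and therefore which files, are slowed when a given pair degrades.

```python
# Hypothetical layout: which OSTs belong to which pool, and which OSTs are
# served by each failover pair. A real deployment would query Lustre for its
# pool definitions instead of hard-coding them.

pools = {
    "flash_scratch": {"OST0000", "OST0001", "OST0002", "OST0003"},
    "flash_project": {"OST0004", "OST0005", "OST0006", "OST0007"},
}

failover_pairs = {
    "pair-A": {"OST0000", "OST0001"},
    "pair-B": {"OST0004", "OST0005"},
}

def impacted_pools(degraded_pair: str) -> list[str]:
    """Pools whose OSTs overlap the degraded pair will feel the slowdown."""
    degraded_osts = failover_pairs[degraded_pair]
    return [name for name, osts in pools.items() if osts & degraded_osts]

# Files striped only within "flash_project" are unaffected by pair-A's failover;
# files striped across every OST in the filesystem would be slowed by either pair.
print(impacted_pools("pair-A"))  # ['flash_scratch']
```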

NVMe Shared Access Options

As Lustre HA setups increasingly adopt NVMe drives for their unparalleled performance, understanding how to effectively share these drives becomes a critical design consideration. The architecture of shared NVMe storage can significantly impact both the reliability and performance of the system. To address this, there are two options: dual-ported NVMe drives and EBOF systems.

Dual-Ported Drives

Dual-ported NVMe drives provide a flexible solution by enabling connections to two hosts simultaneously. These drives can operate in two modes: 1x4 (a single port using all 4 PCIe lanes) or 2x2 (two ports with 2 PCIe lanes each). In single-port mode, the drive delivers full performance to one host, while dual-port mode allows simultaneous access from two hosts at the cost of reduced throughput per port. Unlike SAS drives, dual-ported NVMe drives cannot achieve maximum performance from a single port when configured in dual-port mode. This limitation must be carefully considered during system design, as it directly impacts failover performance and overall system efficiency.
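
The per-port penalty is easy to quantify. The sketch below assumes a nominal PCIe Gen4 figure of roughly 2 GB/s per lane; exact numbers vary by drive and platform.

```python
# Approximate per-port bandwidth of a dual-ported NVMe drive in its two modes.
# ~2 GB/s per PCIe Gen4 lane is a rough nominal figure, not a specification.
PCIE_GEN4_LANE_GBPS = 2.0

modes = {
    "1x4 (single port, 4 lanes)": {"ports": 1, "lanes_per_port": 4},
    "2x2 (two ports, 2 lanes each)": {"ports": 2, "lanes_per_port": 2},
}

for name, cfg in modes.items():
    per_port = cfg["lanes_per_port"] * PCIE_GEN4_LANE_GBPS
    total = cfg["ports"] * per_port
    print(f"{name}: ~{per_port:.0f} GB/s per port, ~{total:.0f} GB/s across all ports")

# In 2x2 mode each host sees roughly half the link bandwidth it would get in
# 1x4 mode, even though the combined bandwidth of both ports is the same.
```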

Ethernet Bunch of Flash (EBOF)

For environments seeking an alternative to direct-attached NVMe drives, EBOF systems offer a compelling solution. EBOFs leverage the NVMe-oF protocol, supporting both RDMA and TCP/IP for efficient Ethernet-based NVMe access. Most EBOFs are designed with redundancy in mind, featuring two IO modules to ensure high availability. However, they typically require dual-ported drives to fully utilize their redundant architecture. While EBOFs eliminate some of the complexities associated with direct-attached storage, they introduce additional configuration requirements on the initiator side, such as connection management and multipath settings.
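
As a sketch of that initiator-side work, the snippet below prints NVMe-oF connection commands for both IO modules of a hypothetical EBOF; the transport, addresses, and subsystem NQN are placeholders, and real values would come from the EBOF’s discovery service or vendor documentation.

```python
# Print nvme-cli connect commands for both IO modules of a hypothetical EBOF.
# The transport, IP addresses, and subsystem NQN below are placeholders; real
# values come from the EBOF's discovery service or vendor documentation.

subsystem_nqn = "nqn.2024-01.io.example:ebof1"        # hypothetical NQN
io_module_addrs = ["192.168.10.11", "192.168.20.11"]  # one address per IO module

for addr in io_module_addrs:
    print(
        "nvme connect "
        f"--transport=rdma --traddr={addr} --trsvcid=4420 --nqn={subsystem_nqn}"
    )

# With both paths connected, Linux native NVMe multipath can present a single
# block device per namespace and fail over between IO modules transparently.
```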

Both dual-ported drives and EBOFs present viable pathways to achieving shared NVMe access in Lustre HA setups. The choice between them ultimately depends on factors like performance demands, budgetary considerations, and the existing infrastructure. By carefully evaluating these options, organizations can build robust Lustre clusters that deliver both speed and reliability, even in the face of hardware failures.

Storage Bridge Bay NVMe Systems

As we’ve explored the challenges of shared NVMe access through dual-ported drives and EBOFs, it’s clear that achieving HA in Lustre deployments requires careful consideration of both performance and redundancy. Another compelling option for organizations seeking a balance between simplicity and efficiency is Storage Bridge Bay (SBB) NVMe systems. These systems integrate servers and storage into a single chassis, offering a streamlined approach to sharing dual-ported drives while addressing some of the limitations of traditional architectures.

Some SBB models go a step further by incorporating internal network connections between the servers, facilitating seamless communication for failover and synchronization. Additionally, both servers are equipped with IPMI (Intelligent Platform Management Interface) for effective fencing, ensuring that failed nodes are isolated to prevent data corruption during failovers.

SBB systems are widely available from multiple vendors, including industry leaders like Celestica, Supermicro, and Viking Enterprise Solutions, making them an accessible choice for a variety of deployment scenarios.

During failover scenarios, SBB systems exhibit significant performance limitations that administrators must anticipate. When one server fails, the surviving node must handle both nodes’ I/O operations through the limited PCIe lanes it has to each NVMe drive, creating a substantial performance bottleneck.

This bottleneck stems from the internal topology: in a typical SBB, each NVMe drive is connected to each server by only two PCIe lanes, which limits the ability to achieve full NVMe performance from a single server. To work around this limitation, you can split each NVMe drive into two namespaces. The first namespace is used in RAIDs managed by the first server, while the second namespace is allocated to RAIDs on the second server. However, this approach doubles the number of Lustre OSDs, increasing management complexity.
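
A minimal sketch of that split, assuming a hypothetical SBB chassis with eight shared drives: namespace 1 of every drive feeds the first server’s RAIDs and namespace 2 feeds the second server’s.

```python
# Illustrative namespace-splitting plan for an SBB chassis (all names and the
# drive capacity are hypothetical). Each dual-ported drive is split into two
# equal namespaces; every server builds its RAIDs only from "its" namespace on
# each drive, so both servers use their two PCIe lanes per drive concurrently.

drives = [f"nvme{i}" for i in range(8)]   # shared drives in the chassis
drive_capacity_tb = 15.36                 # assumed per-drive capacity

plan = {
    "server-1": [f"{d}n1" for d in drives],  # namespace 1 of every drive
    "server-2": [f"{d}n2" for d in drives],  # namespace 2 of every drive
}

for server, namespaces in plan.items():
    usable_tb = len(namespaces) * drive_capacity_tb / 2
    print(f"{server}: RAID members {namespaces} (~{usable_tb:.1f} TB raw)")

# The price of this layout is twice as many block devices, and therefore twice
# as many Lustre OSDs, to manage.
```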

Like any solution, SBB systems have their strengths and weaknesses.

Advantages:

  • Simple solution in a single box.
  • Small datacenter footprint.
  • Direct drive-to-server connections, minimizing latency and eliminating intermediate bottlenecks.

Disadvantages:

  • Specialized hardware that requires additional NVMe configuration (such as namespace splitting) for maximum performance.
  • Storage layer performance degradation during failover is unavoidable.
  • Servers are co-located and share some components.

In the next section, we’ll explore how dual-node clusters with EBOF systems offer an alternative approach to achieving high availability in Lustre deployments, providing greater flexibility and scalability for modern workloads.

Dual-Node Clusters with EBOFs

While SBB NVMe systems provide a streamlined and efficient solution for shared storage, their limitations, particularly in performance during failover scenarios, may call for alternative approaches. For organizations aiming to build Lustre HA setups that balance flexibility and scalability, dual-node clusters leveraging EBOF systems are one such alternative.

EBOFs are essentially enclosures of NVMe drives connected to Ethernet via the NVMe-oF protocol. They support both RDMA (typically RoCE) and TCP transports, providing efficient, low-latency access to NVMe storage. Most EBOFs hold up to 24 drives and typically require dual-ported drives to ensure redundancy. Each of the two IO modules in a modern EBOF usually has 3-6 network interfaces, with port speeds ranging from 100 GbE to 200 GbE.

In a dual-node cluster configuration, EBOFs can be connected to the servers directly or via switches. With direct connections, it’s crucial to ensure that NIC speeds match on both sides to avoid bottlenecks. With switches, a single switch introduces a potential single point of failure, so redundant networking is recommended for mission-critical environments.

However, while EBOFs provide significant advantages, careful attention must be paid to the network cards used for EBOF connections, as they can become bottlenecks if not properly sized. Ensuring proper network sizing is key to maintaining performance, especially during failover scenarios.
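
A back-of-the-envelope check along these lines (all figures below are hypothetical) helps confirm that server NICs will not become the bottleneck, including in the failover case where one node also carries its peer’s EBOF traffic.

```python
# Hypothetical sizing check: will each server's NICs carry the NVMe-oF traffic
# it sees during failover, when it also serves its peer's drives?

ebof_ports_gbe = 2 * 6 * 100       # EBOF: 2 IO modules x 6 ports x 100 GbE
server_nic_gbe = 2 * 200           # per server: 2 x 200 GbE toward the EBOF

normal_share_gbe = 400             # NVMe-oF traffic per server in normal operation
failover_share_gbe = 2 * normal_share_gbe  # surviving node carries both shares

def verdict(capacity_gbe: int, demand_gbe: int) -> str:
    return "OK" if capacity_gbe >= demand_gbe else "BOTTLENECK"

print("EBOF uplinks vs total demand:", verdict(ebof_ports_gbe, 2 * normal_share_gbe))
print("server NICs, normal operation:", verdict(server_nic_gbe, normal_share_gbe))
print("server NICs, failover:", verdict(server_nic_gbe, failover_share_gbe))

# If the failover check fails, add NIC ports or accept degraded performance
# until the failed node returns.
```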

Dual-node clusters with EBOFs strike a balance between performance, flexibility, and practicality, but they also come with trade-offs:

Advantages:

  • EBOFs are widely available hardware (unlike SBB systems).
  • No additional NVMe configuration needed to get the full performance from the drives.
  • No hard limit on the number of NVMe drives that can be used in the configuration, allowing for future expansion.
  • With proper network sizing, performance degradation during failover can be minimized or even eliminated.

Disadvantages:

  • A redundant network between the EBOFs and the servers is required.
  • Managing EBOFs and their associated networking infrastructure demands more expertise.
  • EBOFs and their accompanying switches consume more rack space, which could be a concern in space-constrained environments.
  • To prevent performance degradation during failover, each node must have excess CPU power, memory, and network throughput resources that remain underutilized most of the time.

In the next section, we’ll delve into multinode clusters, exploring how they expand on the dual-node concept to provide even greater scalability and fault tolerance.

Using EBOFs: EBOF Redundancy

As mentioned above, most modern EBOFs are designed with two IO modules to provide redundancy, ensuring that even if one module fails, the other can maintain access to the drives. However, a common challenge arises from the fact that many EBOFs share a common NVMe backplane. While this design simplifies internal connectivity, it introduces a potential point of failure. In projects where multiple EBOFs are deployed, the probability of an entire EBOF unit failing increases proportionally with the number of units in use.

To mitigate this risk, careful RAID configuration strategies must be employed to ensure that the system remains operational even in a degraded state should an EBOF fail.

The key to safeguarding against the impact of an EBOF failure lies in how drives are distributed across RAID groups. By limiting the number of drives used from each EBOF within a RAID array, you can ensure that the system continues functioning, even in a degraded mode, should an EBOF go offline. Here’s how this strategy works:

  • RAID1 and RAID5: Use no more than one drive from each EBOF.
  • RAID6: Use no more than two drives from each EBOF.
  • RAID7.3: Use no more than three drives from each EBOF.

This approach ensures that if an EBOF fails, the RAID array will enter a degraded state but will remain operational, preserving data availability and minimizing disruption.
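
The sketch below applies these per-EBOF limits when forming a RAID group from a hypothetical inventory of three EBOFs, so that losing any single EBOF removes no more members from the group than its parity can tolerate.

```python
# Pick RAID group members so that no group uses more drives from a single EBOF
# than its parity can lose: 1 for RAID1/RAID5, 2 for RAID6, 3 for RAID7.3
# (triple parity). The three-EBOF inventory below is hypothetical.

from itertools import cycle

MAX_PER_EBOF = {"raid1": 1, "raid5": 1, "raid6": 2, "raid7.3": 3}

inventory = {f"ebof{e}": [f"ebof{e}-nvme{i}" for i in range(8)] for e in range(3)}

def build_group(level: str, group_size: int) -> list[str]:
    """Pick drives round-robin across EBOFs, honouring the per-EBOF limit."""
    limit = MAX_PER_EBOF[level]
    used = {name: 0 for name in inventory}
    group: list[str] = []
    for ebof in cycle(inventory):
        if len(group) == group_size:
            return group
        if used[ebof] < limit and inventory[ebof]:
            group.append(inventory[ebof].pop(0))
            used[ebof] += 1
        elif all(used[e] >= limit or not inventory[e] for e in inventory):
            raise ValueError("not enough EBOFs or drives for this level and size")

print(build_group("raid6", 6))  # at most 2 members come from any one EBOF
```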

Multinode Clusters

While dual-node clusters have long been the standard for Lustre high-availability setups, they are not the only option. As workloads grow more complex and demanding, multinode clusters offer a scalable and flexible alternative that can better meet the needs of modern HPC environments.

Traditional Lustre clustering relies on failover pairs: two-node configurations where one node takes over in the event of a failure. However, the Pacemaker and Corosync framework allows the creation of larger clusters, supporting up to 16 or even 32 nodes (depending on the version). This scalability opens the door to more sophisticated HA strategies, particularly when combined with EBOFs.

EBOFs enable NVMe drives to be shared across multiple nodes, creating a distributed storage architecture that enhances both performance and fault tolerance. Lustre has long supported configurations with multiple service nodes, a capability we utilized in our tests with version 2.15.6, making it easier to distribute workloads and ensure continuity in case of hardware failures.

There are two primary approaches to placing cluster resources on the nodes of a multinode Lustre HA cluster: non-dedicated redundant nodes and dedicated redundant nodes. Each offers distinct advantages depending on your specific requirements.

Multinode Clusters with Non-Dedicated Redundant Nodes

In this configuration, all nodes in the cluster can participate in normal operations while simultaneously providing failover capabilities. This approach maximizes resource utilization during normal operations, at the cost of some performance degradation when a surviving node picks up a failed peer’s resources.

For a practical example of a multinode cluster environment, we invite you to explore our solution brief created in collaboration with Ingrasys. This document provides an in-depth look at a real-world implementation leveraging the Ingrasys ES2000 EBOF and xiRAID technology: https://xinnor.io/files/white-papers/Ingrasys_xiRAID_multi-node_solution.pdf. This resource serves as a good reference for understanding how multinode clusters can be configured to deliver high performance, fault tolerance, and seamless scalability in Lustre deployments.

Multinode Clusters with Dedicated Redundant Nodes

This architecture addresses the limitations of the other approaches by adding a standby node that guarantees continuity even in the event of a server failure.

In this setup, an extra node is reserved exclusively for failover purposes. When a failure occurs, whether due to hardware issues or network disruptions, the dedicated redundant node takes over the responsibilities of the failed node. This ensures uninterrupted service and eliminates performance degradation during failovers, making it an ideal choice for mission-critical environments.
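
One way to see the appeal is to compare redundancy overhead. The sketch below (with hypothetical node counts and unit costs) contrasts classic 2x-oversized failover pairs with right-sized nodes plus one identical standby.

```python
# Compare redundancy overhead: classic 2x-oversized failover pairs versus a
# multinode cluster with one dedicated standby node. Figures are illustrative.

active_nodes = 8          # nodes needed to serve the workload
unit_cost = 1.0           # cost of one node sized exactly for its workload

# Failover pairs: every node must be sized for roughly 2x its share.
pair_cost = active_nodes * 2 * unit_cost

# Dedicated standby: N right-sized nodes plus one identical spare.
standby_cost = (active_nodes + 1) * unit_cost

overhead_pairs = pair_cost / active_nodes - 1       # 100% extra capacity
overhead_standby = standby_cost / active_nodes - 1  # 12.5% extra for 8 nodes

print(f"failover pairs:    {pair_cost:.1f} node-units ({overhead_pairs:.1%} overhead)")
print(f"dedicated standby: {standby_cost:.1f} node-units ({overhead_standby:.1%} overhead)")
```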

Advantages:

  • The components required for this setup are standard and easily accessible, reducing procurement challenges.
  • Full drive performance is achieved without the need for complex tuning or adjustments, simplifying deployment.
  • Good scalability: there’s no hard cap on the number of NVMe drives that can be incorporated into the configuration, allowing for future growth.
  • Each server is sized exactly to handle its workload, ensuring efficient resource utilization without unnecessary over-provisioning.
  • Low redundancy overhead: only one dedicated redundant node is required for failover, minimizing resource waste.
  • In the event of a server failure, the dedicated redundant node ensures uninterrupted performance, maintaining system stability.

Disadvantages:

  • A robust and redundant network infrastructure is essential to maintain connectivity between EBOFs and servers, adding complexity to the design.
  • Increased operational complexity: managing dedicated redundant nodes introduces additional layers of operational overhead, requiring skilled personnel for maintenance.
  • The inclusion of a redundant node consumes extra rack space, which may be a concern in space-constrained environments.

Conclusion

Building high-availability Lustre storage systems requires careful consideration of performance, redundancy, and scalability. From dual-node clusters to multinode setups with EBOFs, each architecture offers unique trade-offs to meet the demands of modern workloads. Solutions like xiRAID further enhance these architectures by providing advanced RAID management and robust data protection, ensuring both speed and reliability even during hardware failures. By choosing the right configuration and tools, organizations can achieve a resilient storage infrastructure that supports their most critical applications.