Case Studies

About GWDG and the Emmy Cluster

The Gesellschaft für wissenschaftliche Datenverarbeitung mbH Göttingen (GWDG) is the joint IT service and competence center for the Max Planck Society and the University of Göttingen.

At GWDG, researchers rely on the Emmy HPC cluster, comprising over 1,500 compute nodes with 150,000 CPU cores, to run data-intensive simulations and analytics at scale. To serve this complex compute cluster and prepare for its future upgrade, GWDG introduced a new storage cluster in 2025: Workspace Lustre MDC, designed to deliver high throughput and low latency for modern HPC and AI workflows.

Storage Challenge

GWDG’s goal was straightforward yet demanding: to implement a multi-petabyte, high-performance storage solution that could:

  • Eliminate I/O bottlenecks for mixed HPC/AI workloads.
  • Provide high availability without single points of failure.
  • Transition to the next-generation network speed with minimal disruption and investment.
  • Scale predictably in both capacity and performance.

Solution

After evaluating multiple options, GWDG selected MEGWARE’s storage architecture based on xiRAID and Lustre, running on Celestica SC6100 dual-node storage controllers. The key differentiator was xiRAID, purpose-built for high-performance flash workloads.

In this deployment, xiRAID delivers:

  • High sequential data throughput and random IOPS in normal operations.
  • Improved data availability, supported by faster rebuilds in case of drive failures.
  • Ease of deployment: seamless integration with Pacemaker and Corosync to enable high availability (HA) functionality in a dual-node server deployment.
  • Scalability, with efficient protection schemes supporting linear growth in capacity and performance.

Paired with Lustre’s proven parallel filesystem, GWDG gained a storage platform that’s both ultra-fast and resilient, aligned with the operational requirements of a production research environment.

By fully exploiting the 2x100Gb/s Omni-Path (OPA) network connection per server, the new storage system surpasses the previous solution in all performance areas by a factor of more than 4.

Architecture

To implement this solution, MEGWARE selected the Celestica SC6100, a Storage Bridge Bay (SBB) server consisting of 2 independent server nodes that share 24 NVMe drives in a single chassis.

Each NVMe drive is connected to one node with 2 PCIe lanes and to the other node with the remaining 2 lanes. To optimize performance, each drive is split into 2 namespaces, allowing both SBB nodes to use every drive concurrently. As a result, each system exposes 48 virtual drives (namespaces) to each server node.
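
For illustration only, splitting one dual-port drive into two namespaces could look roughly like the following nvme-cli sequence; the device name, block counts, and controller IDs are placeholder assumptions, not the exact procedure used in this deployment.

    # List the controller IDs and total capacity of the dual-port drive
    nvme id-ctrl /dev/nvme0 | grep -Ei "cntlid|tnvmcap"

    # Create 2 equally sized namespaces (block counts are placeholders;
    # 15000000000 x 512B blocks = 7.68TB, i.e. half of a 15.36TB drive)
    nvme create-ns /dev/nvme0 --nsze=15000000000 --ncap=15000000000 --flbas=0
    nvme create-ns /dev/nvme0 --nsze=15000000000 --ncap=15000000000 --flbas=0

    # Attach both namespaces to both controllers (IDs 1 and 2 are examples)
    # so that either server node can take over the other's namespaces on failover
    nvme attach-ns /dev/nvme0 --namespace-id=1 --controllers=1,2
    nvme attach-ns /dev/nvme0 --namespace-id=2 --controllers=1,2

    # Rescan so the namespaces appear as block devices (e.g. /dev/nvme0n1, /dev/nvme0n2)
    nvme ns-rescan /dev/nvme0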

On each Celestica SC6100 system, the following RAID topology has been created:

  • 4 drives (8 namespaces) arranged as 2 RAID10 (2+2) groups for metadata, using write-intensive 6.4TB drives.
  • 20 drives (40 namespaces) arranged as 4 RAID6 (8+2) groups for object storage, using read-intensive 15.36TB drives.

Each SC6100 system is connected via 4 Omni-Path 100Gb/s links (2 per server node).

Overall usable capacity per SC6100 system: 245TB
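
A rough sketch of how such a layout could be created with xiRAID's xicli tool is shown below; exact option names vary between xiRAID versions, and the array names and device paths are illustrative assumptions rather than the production configuration. The usable capacity follows from the data arrays: 4 RAID6 (8+2) groups x 8 data namespaces x 7.68TB ≈ 245TB.

    # One of the 2 RAID10 (2+2) metadata arrays, built from four 3.2TB namespaces
    # of the write-intensive 6.4TB drives (names and devices are examples)
    xicli raid create -n md0 -l 10 \
        -d /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1

    # One of the 4 RAID6 (8+2) object-storage arrays, built from ten 7.68TB
    # namespaces of the read-intensive 15.36TB drives
    xicli raid create -n ost0 -l 6 \
        -d /dev/nvme4n1 /dev/nvme5n1 /dev/nvme6n1 /dev/nvme7n1 /dev/nvme8n1 \
           /dev/nvme9n1 /dev/nvme10n1 /dev/nvme11n1 /dev/nvme12n1 /dev/nvme13n1

    # Verify state; usable data capacity per system: 4 x 8 x 7.68TB ≈ 245TB
    xicli raid show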

By integrating xiRAID with Pacemaker and Corosync, the cluster is protected not only against multiple drive failures, but also against a complete server-node failure. If one of the 2 nodes fails, the active RAID groups on that server node automatically fail over to the other node. When the connection with the failed node is re-established, the original RAID groups are failed back to that node.
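
A minimal sketch of such a failover configuration with the pcs tool is shown below, assuming a hypothetical xiRAID OCF resource agent (ocf:xraid:raid) and the standard ocf:heartbeat:Filesystem agent for the Lustre target; the resource names, device path, and node name are assumptions, not the GWDG production setup.

    # RAID array managed as a cluster resource (the xraid provider/agent name
    # is an assumption for illustration)
    pcs resource create ost0_raid ocf:xraid:raid name=ost0 op monitor interval=10s

    # Lustre OST mounted on top of the RAID block device
    pcs resource create ost0_fs ocf:heartbeat:Filesystem \
        device=/dev/xi_ost0 directory=/lustre/ost0 fstype=lustre \
        op monitor interval=30s

    # Keep the filesystem with its RAID, start the RAID first, and prefer
    # node A so the group fails back automatically when that node returns
    pcs constraint colocation add ost0_fs with ost0_raid INFINITY
    pcs constraint order start ost0_raid then start ost0_fs
    pcs constraint location ost0_raid prefers emmy-oss-a=100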

Figure: GWDG storage architecture

Conclusion

The architecture developed by MEGWARE is designed to make a future network-speed upgrade straightforward, unlocking the full potential of the all-NVMe storage layer. This flexibility and adaptability was a key factor in GWDG’s decision process. With the new all-NVMe solution from MEGWARE, GWDG has implemented a storage cluster that:

  • Eliminates I/O bottlenecks for AI and HPC workloads.
  • Provides true High Availability, removing any single point of failure.
  • Delivers multi-PB NVMe capacity at the cost of a traditional hybrid system.
  • Enables a smooth path to the 2026 cluster refresh, including upgrades to faster interconnect networks.

By standardizing on MEGWARE’s all-NVMe Lustre architecture protected by xiRAID, GWDG gained a storage foundation that combines performance, resilience, and upgradeability, supporting today’s production research needs while staying ready for the next generation of compute and networking.

Looking ahead, the switch to the CN5000 400G interconnect generation will deliver a further step-change in storage performance. GWDG is also preparing for an upcoming HPC system based on MEGWARE's Eureka DLC warm-water cooled nodes, featuring more than 300 nodes with dual-socket AMD EPYC Turin 9745 (128-core) and Venice (128-core) CPUs, a next-generation platform built to meet the growing demands of AI and research computing. In addition, GWDG plans to deploy a dedicated GPU partition with at least 10 nodes equipped with 8× NVIDIA B200 GPUs each, further accelerating large-scale AI workloads and data-intensive simulations.