Training large language models requires enormous amounts of GPU memory: not just for the model parameters themselves, but for optimizer states, gradients, and activation tensors that can balloon memory requirements to 18 bytes per parameter or more. When your model outgrows available VRAM, the conventional wisdom is to either buy more GPUs or accept that training is impossible. But what if you could offload some of that memory burden to your CPU's RAM or even to NVMe storage? DeepSpeed and similar frameworks promise exactly that, but the critical question remains: does offloading actually work in practice, or does it create more problems than it solves?
This post presents empirical results from training experiments with Llama-style models ranging from 1B to 16B parameters on a workstation with an NVIDIA RTX 6000 Pro GPU. We explore three offloading strategies: moving optimizer states to CPU RAM, leveraging NVMe storage (including RAID configurations), and using ZenFlow, a gradient-priority technique that keeps the most impactful gradients on GPU while deferring others. The findings reveal when offloading makes sense, when it becomes a bottleneck, and how techniques like gradient accumulation and selective gradient prioritization can dramatically change the performance equation. If you've ever wondered whether that NVMe SSD could actually serve as viable training memory, read on.
Offloading strategies in DeepSpeed
DeepSpeed’s framework supports offloading different training states from GPU VRAM to CPU RAM or NVMe. The goal is to fit larger models or longer contexts on limited GPU memory, at the cost of additional data movement over PCIe and more complex scheduling.
The central tradeoff is memory capacity versus training speed, and understanding when this exchange makes sense is critical.
Parameter offload
Model parameters are stored primarily in CPU memory (or NVMe) and streamed into GPU memory on demand. However, this forces frequent PCIe transfers for layer weights, and the savings can be modest because the dominant memory consumer in Adam-based training is usually optimizer state rather than the parameters themselves. We therefore test parameter offload only in combination with the other scenarios.
Offloading optimizer states + CPU computation
Optimizer state (e.g., gradients, Adam moments, and FP32 master weights) is stored in CPU RAM (or on NVMe with RAM buffers) instead of GPU VRAM. The CPU then computes the optimizer update.
But while the GPU finishes a backward pass quickly, the optimizer step can become CPU-bound. This creates a bottleneck: the GPU sits idle waiting for the CPU to finish parameter updates. Two factors can mitigate this:
- A gradient accumulation count high enough to overlap CPU updates with subsequent GPU forward/backward work.
- Offload scheduling tuned to overlap CPU work with GPU computation.
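As a concrete illustration, a DeepSpeed configuration for this mode might look like the sketch below. The field names follow DeepSpeed's ZeRO JSON schema, but the numeric values (micro-batch size, accumulation steps) are illustrative placeholders, not the exact values used in these experiments:

```python
# Sketch of a DeepSpeed config for ZeRO optimizer-state offload to CPU.
# Field names follow DeepSpeed's JSON schema; numeric values are
# illustrative placeholders.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 4,   # lever for overlapping CPU updates
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {
            "device": "cpu",      # optimizer state lives in RAM,
            "pin_memory": True,   # update runs on the CPU; pinned
        },                        # buffers speed up PCIe copies
    },
}

def effective_batch_size(cfg, world_size=1):
    """Effective batch = micro-batch x accumulation x data-parallel degree."""
    return (cfg["train_micro_batch_size_per_gpu"]
            * cfg["gradient_accumulation_steps"]
            * world_size)

print(effective_batch_size(ds_config))  # 4 on a single GPU
```

Raising `gradient_accumulation_steps` increases the effective batch without extra VRAM, and gives the CPU optimizer more GPU compute to hide behind.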
Offloading “unimportant” gradients (ZenFlow)
ZenFlow is a recent DeepSpeed extension that exploits the observation that a small subset of gradients contributes most of the update signal. It:
- Keeps “important” gradients (those with large impact, measured by gradient norm) on the GPU and updates them immediately.
- Treats the remaining “less important” gradients as lower priority work that can be offloaded to CPU, and updated asynchronously, over multiple steps, in the background.
Empirical findings motivating this design show that a small fraction of gradients can contribute a very large portion of the total gradient norm. Therefore, deferring the remaining gradients often has limited impact on convergence. In our experiments, the framework is configured to treat 10% of gradients as important (selection and update scheduling are handled automatically). This approach can increase the total number of optimization steps required to reach the same quality, but the increase is generally expected to be small.
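The selection idea can be illustrated with a small NumPy sketch. This is not ZenFlow's actual implementation (selection and scheduling happen inside DeepSpeed); it only shows that for a heavy-tailed gradient, the top 10% of entries by magnitude carry most of the total norm:

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy gradient with a heavy-tailed distribution: a few entries dominate.
grad = rng.standard_t(df=2, size=100_000)

k = int(0.10 * grad.size)                      # treat top 10% as "important"
idx = np.argpartition(np.abs(grad), -k)[-k:]   # indices of largest-|g| entries

important = np.zeros_like(grad)
important[idx] = grad[idx]        # updated on GPU immediately
deferred = grad - important       # offloaded, updated asynchronously

share = np.linalg.norm(important) / np.linalg.norm(grad)
print(f"top-10% entries carry {share:.0%} of the gradient L2 norm")
```

Because the L2 norm weights entries quadratically, deferring the long tail of small gradients removes only a small fraction of the update signal per step.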
AI model training basics
Understanding memory consumption requires looking at what happens during a training step and what data structures persist in memory.
A large language model (LLM) is typically a Transformer-based neural network trained to predict the next token in a sequence given a context. The network consists of parameters (weights) that are iteratively updated to minimize a loss function.
A single training step usually consists of:
- Forward pass: Inputs are embedded and passed through all layers. A loss value is computed. Parameters are read but not modified.
- Backward pass: Automatic differentiation computes gradients (partial derivatives of the loss function with respect to each parameter). These gradients indicate how parameters should change to reduce the loss and are later consumed by the optimizer.
- Optimizer step: An optimizer updates parameters using gradients. Gradients are then cleared or overwritten in the next iteration.
Modern adaptive optimizers (e.g., Adam) maintain additional per-parameter state. For each parameter vector, Adam tracks the first and second moment estimates of the gradient. These additional tensors are the core reason training requires significantly more memory than inference.
In mixed precision training, the model typically runs forward/backward in FP16/BF16 for speed and bandwidth, but the update math is kept in FP32 for numerical stability: a FP32 “master” copy of the weights is updated by the optimizer so small parameter updates are not rounded to zero (FP16 has limited mantissa and dynamic range), and the FP16/BF16 weights used in the model are refreshed by casting from this FP32 master.
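The rounding problem is easy to reproduce with NumPy. Near a weight of 1.0, FP16 spacing is roughly 9.8e-4, so a 1e-4 update rounds away entirely, while an FP32 master copy accumulates it (a toy sketch, not the actual optimizer code):

```python
import numpy as np

lr = np.float32(1e-4)
steps = 20

# Naive FP16 weight: each 1e-4 step near 1.0 is below FP16 resolution,
# so every update rounds back to 1.0 and the weight never moves.
w16 = np.float16(1.0)
for _ in range(steps):
    w16 = np.float16(w16 - np.float16(lr))

# FP32 master weight: updates accumulate; the FP16 copy used by the
# model is refreshed by casting down from the master.
master = np.float32(1.0)
for _ in range(steps):
    master = master - lr
w16_from_master = np.float16(master)

print(w16, w16_from_master)   # the naive FP16 weight is still 1.0
```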
This gives the following per-parameter memory estimate during training:
- weights: BF16/FP16 -> 2 B
- gradient buffers: FP32 -> 4 B
- Adam moments: FP32 -> 2 × 4 B = 8 B
- master weights: FP32 -> 4 B
Total: 2 + 4 + 8 + 4 = 18 bytes per parameter (excluding activations and temporary buffers).
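The 18 B/parameter estimate translates directly into the memory budgets for the model sizes tested here:

```python
# Back-of-envelope training-state memory from the 18 B/parameter estimate
# (weights 2 B + gradients 4 B + Adam moments 8 B + FP32 master 4 B),
# excluding activations and temporary buffers.
BYTES_PER_PARAM = 2 + 4 + 8 + 4   # = 18

def training_state_gb(n_params):
    return n_params * BYTES_PER_PARAM / 1e9

for billions in (1, 3, 8, 16):
    print(f"{billions:>2}B params -> {training_state_gb(billions * 1e9):.0f} GB")
```

Already at 8B parameters the training state (~144 GB) exceeds the 96 GB of VRAM on this GPU, and 16B needs ~288 GB, which is why offload is required at all.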
The 18 B/parameter figure is only a lower bound. Activation memory (intermediate tensors from the forward pass) is an additional component that scales with sequence length, batch size, and model depth, and the training process also requires further persistent and temporary buffers.
Several techniques are used to fit available memory:
- Gradient checkpointing: instead of storing all intermediate activations from the forward pass, the model stores only a subset (checkpoints). During the backward pass, missing intermediate activations are recomputed on the fly from these checkpoints. This reduces memory usage and enables larger models or longer sequences to fit in GPU VRAM, at the cost of additional compute. In our experiments, this option is enabled.
- Gradient accumulation: multiple forward/backward passes are run; gradients are accumulated across these micro-steps, and only then a single optimizer step is performed. This emulates a larger batch size when VRAM only allows small micro-batches. In our experiments, we use 1, 4, or 8 accumulation steps.
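Gradient accumulation is mathematically equivalent to one large-batch step, which a small NumPy sketch (using a linear model's MSE gradient as a stand-in for backpropagation) makes explicit:

```python
import numpy as np

rng = np.random.default_rng(0)
X, y = rng.normal(size=(32, 4)), rng.normal(size=32)
w = np.zeros(4)

def grad(Xb, yb, w):
    """Mean-squared-error gradient for a linear model."""
    return 2 * Xb.T @ (Xb @ w - yb) / len(yb)

# One optimizer step over the full batch of 32...
g_full = grad(X, y, w)

# ...equals the average of gradients over 4 micro-batches of 8, which
# is what gradient accumulation computes before the single optimizer step.
micro = [grad(X[i:i + 8], y[i:i + 8], w) for i in range(0, 32, 8)]
g_accum = np.mean(micro, axis=0)

print(np.allclose(g_full, g_accum))  # True
```

With offload, the extra benefit is scheduling: one expensive CPU optimizer step is amortized over several GPU forward/backward passes.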
Model and dataset
A decoder-only Llama 3.1-style Transformer is built using the LlamaConfig class from Hugging Face Transformers and is based on the Llama 3.1-8B architecture. The model is initialized from scratch (no pre-trained weights) for different sizes: 1B, 2B, 3B, 4B, 8B, and 16B. Adam is used as the optimizer.
For data, WikiText-2 is used. Extracted from Wikipedia “Good” and “Featured” articles, it is designed as a language modeling benchmark with natural text (no heavy preprocessing). Text is tokenized and then packed into fixed-length blocks (4096 tokens). Because the goal of these experiments is systems behavior, we do not target model quality, only trying to stress the hardware.
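The packing step can be sketched as follows. This is a generic concatenate-and-slice scheme (with the ragged tail dropped), shown with tiny toy documents; the actual pipeline's handling of document boundaries and special tokens may differ:

```python
def pack_into_blocks(token_stream, block_size=4096):
    """Concatenate tokenized documents and slice into fixed-length blocks,
    dropping the ragged tail (a common packing scheme)."""
    flat = [tok for doc in token_stream for tok in doc]
    n_blocks = len(flat) // block_size
    return [flat[i * block_size:(i + 1) * block_size] for i in range(n_blocks)]

# Toy example with tiny "documents" and a tiny block size:
docs = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
print(pack_into_blocks(docs, block_size=4))  # [[1, 2, 3, 4], [5, 6, 7, 8]]
```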
Hardware & software setup
CPU: Intel Core i5-14600K (6 P-cores, 8 E-cores)
Motherboard: Intel Z790 chipset (PCIe 5.0 ×16 slot)
Memory: 96 GB DDR5 6400 (2 × 48 GB)
GPU: NVIDIA RTX 6000 Pro (Blackwell), Workstation Edition, 96 GB VRAM
SSDs: 2× DRAM-less NVMe SSDs (2 TB each, PCIe 4.0 ×4) + 1× Samsung 980 PRO 2 TB
OS: Ubuntu 24.04.3 LTS (Linux 6.14.0-33-generic)
NVIDIA driver: 580.95.05
CUDA: 12.8
DeepSpeed: 0.18.3
Pre-test preparation
Two practical constraints shaped the experimental design.
First, even when optimizer state is offloaded from the GPU, gradients still need a landing place during backpropagation. DeepSpeed maintains gradient partitions in CPU RAM until the optimizer step. For our experiments up to 16B parameters, the available 96 GB of system RAM is sufficient, but for larger models gradient buffering in RAM becomes a limiting factor.
Second, optimizer calculations are memory-bandwidth-bound. On this hardware, 6 P-cores saturate the dual-channel DDR5 bandwidth. The table below shows optimizer step time for Llama 3.1 2B training as a function of CPU threads (lower is better):
| Number of CPU threads | Optimizer step time, ms |
|---|---|
| 1 | 1833.75 |
| 2 | 1322.14 |
| 4 | 1172.36 |
| 6 | 1096.52 |
Therefore, OpenMP parallelism is configured to use 6 P-cores for CPU optimizer computation. For NVMe I/O, 4 E-cores are reserved.
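One way to apply this split before launching training is via OpenMP environment variables. `OMP_NUM_THREADS` is the standard OpenMP control; the affinity variable and the core IDs below are GNU-OpenMP-specific and assume this machine's topology (P-cores enumerated first), so treat them as illustrative:

```python
import os

# Limit CPU optimizer math to 6 OpenMP threads and pin them to the
# P-cores. GOMP_CPU_AFFINITY is GNU OpenMP only; the "0-5" core list
# is an assumption about this machine's CPU enumeration.
os.environ["OMP_NUM_THREADS"] = "6"
os.environ["GOMP_CPU_AFFINITY"] = "0-5"

print(os.environ["OMP_NUM_THREADS"])  # 6
```

This leaves the E-cores free for the NVMe I/O threads.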
Results
Offload to RAM
Here, three scenarios that keep offload within system memory are compared:
- GPU baseline (no offload).
- Full optimizer-state and gradient offload to RAM with CPU optimizer computation.
- RAM offload combined with ZenFlow (top 10% of gradients updated on GPU).
GPU-baseline, no offload
Two offloading scenarios: left, full optimizer offload to RAM; right, RAM offload combined with ZenFlow
Performance for 1B model size
Performance for 3B model size
We can see that full optimizer offload to RAM increases step time, particularly at larger model sizes, because the optimizer update becomes CPU/memory-bandwidth bound. ZenFlow reduces the penalty substantially and, in these measurements, is competitive with (and even slightly faster than) the GPU baseline at the same gradient accumulation.
Offload to RAM + NVMe (or RAID0 over NVMe)
Performance for 1B model size
Performance for 3B model size
Offloading optimizer state to NVMe incurs a large latency/bandwidth penalty, especially for the ~3B model. RAID0 improves step time relative to a single drive, but without ZenFlow it remains substantially slower than RAM-only offload. With ZenFlow enabled, RAID0 NVMe offload moves into the same ballpark as RAM + ZenFlow and can approach the GPU baseline at higher accumulation.
Optimizer offloading vs parameters + optimizer offloading for 16B model
The table below shows the resulting total step time, optimizer step time (including time for copying data between RAM and NVMe), allocated space on the NVMe drives, I/O size during a step, and I/O pattern for four cases:
- Parameters on GPU, offload all optimizer states to single NVMe, optimizer calculations use CPU.
- Parameters on GPU, offload all optimizer states to RAID0 over 3 NVMes, optimizer calculations use CPU.
- Offload parameters and all optimizer states to single NVMe, optimizer calculations use CPU.
- Offload parameters and all optimizer states to RAID0 over 3 NVMes, optimizer calculations use CPU.
| Case | Space on NVMes, GB | Read, GB | Write, GB | Total step time, s | Optimizer step time, s | R 512 KB, IOPS | W 512 KB, IOPS | R other, IOPS | W other, IOPS |
|---|---|---|---|---|---|---|---|---|---|
| 1. Opt-state to NVMe | 211.9 | 158.9 | 317.9 | 107.71 | 97.66 | 325488 | 650993 | 141 | 421 |
| 2. Opt-state to RAID0 | | | | 49.66 | 42.78 | | | | |
| 3. Params + Opt to NVMe | 238.4 | 290.9 | 397.4 | 138.98 | 114.78 | 595728 | 813741 | 28 | 502 |
| 4. Params + Opt to RAID0 | | | | 64.41 | 50.60 | | | | |
Across both offload modes, the key takeaway is that once optimizer (and especially parameters) are placed on NVMe, the training step becomes dominated by storage movement rather than compute. This is reflected in how closely the total step time tracks the optimizer-step time: the optimizer phase becomes the critical path, so improvements in the storage data path translate almost directly into faster iterations.
RAID0 over multiple NVMes consistently delivers roughly a 2× reduction in step time versus a single drive, indicating the workload is primarily throughput-bound and benefits from device-level parallelism. Practically, this means NVMe offload should be treated as a bandwidth-engineering problem: to keep iteration time under control as the offloaded volume increases (e.g., when parameters are also offloaded), you need enough parallel NVMe throughput to prevent storage from becoming the limiting factor.
NVMe traces collected during optimizer-state and parameter offloading show that the dominant I/O pattern is large, sequential reads and writes that align with the configured DeepSpeed AIO block size (e.g., 512 KB). A smaller fraction of requests uses sub-block sizes. These are expected and are mainly caused by filesystem metadata/journaling updates, file creation/extension, etc. Therefore, even with a 512 KB block size configured, the observed request stream is not strictly composed of 512 KB operations.
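For reference, the ZeRO-3 config fields for the parameters-plus-optimizer NVMe case might look like the sketch below. Field names follow DeepSpeed's JSON schema; the mount path is a placeholder and the queue depth is illustrative:

```python
# Sketch of ZeRO-3 NVMe offload config (parameters + optimizer state on a
# RAID0 volume). Field names follow DeepSpeed's JSON schema; the path is
# a placeholder and queue_depth is illustrative.
ds_config = {
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {
            "device": "nvme",
            "nvme_path": "/mnt/raid0/ds_offload",  # placeholder mount point
            "pin_memory": True,
        },
        "offload_param": {
            "device": "nvme",
            "nvme_path": "/mnt/raid0/ds_offload",
        },
    },
    "aio": {
        "block_size": 524288,   # 512 KB, the dominant request size observed
        "queue_depth": 32,
        "thread_count": 4,      # matches the 4 E-cores reserved for I/O
    },
}

print(ds_config["aio"]["block_size"])  # 524288
```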
Conclusion
These experiments demonstrate that the offloading strategy and hardware configuration determine whether training larger models on limited GPU memory is practical.
Full optimizer-state offload to CPU RAM increases step time as model size grows, because the optimizer update becomes CPU and memory-bandwidth bound on this platform. Gradient accumulation is an effective lever when CPU optimizer computation is used: it amortizes optimizer overhead and can improve overlap between CPU update and GPU forward/backward work.
ZenFlow substantially reduces the offload penalty by prioritizing updates for the most important gradients on GPU and deferring the rest; in these measurements, RAM offload + ZenFlow approaches the GPU baseline for small and medium model sizes at the same accumulation setting.
NVMe optimizer-state offload introduces a large latency/bandwidth penalty relative to RAM-only offload; RAID0 improves throughput, but the key inflection point is combining RAID0 with ZenFlow. For the 1B and 3B runs in this post, RAID0 NVMe + ZenFlow stays within ~1.6-1.8x of RAM + ZenFlow and is close to the GPU baseline at higher accumulation (a common production knob). This indicates that, when gradient-priority techniques like ZenFlow are available, NVMe can be a meaningful offload tier (capacity extension), not only a last-resort spillover behind DRAM.
When VRAM is sufficient to keep model parameters resident on GPU, parameter offload to NVMe should be avoided: it increases I/O volume and adds an additional PCIe/storage bottleneck on top of optimizer-state traffic. Parameter offload becomes relevant mainly as a capacity-last resort for models that otherwise do not fit.
The Economics of Scaling
Large models, such as those reaching 1 trillion parameters, naturally demand colossal amounts of memory: at 18 bytes per parameter, a 1-trillion-parameter model needs on the order of 18 TB of training state per replica, and across the replicas and checkpoints of a full training cluster the aggregate footprint can reach petabytes.
Yet, we see that the performance of a single storage device is insufficient to bring ZenFlow-based offload close to baseline performance, requiring us to aggregate roughly four storage devices per GPU to fully utilize PCI Express bandwidth.
On the other hand, rising costs of flash storage and memory prevent us from using isolated “islands” of flash dedicated to different tasks. During training, we must also perform regular, high-performance checkpoints, which adds a further challenge. Thus, we need a storage subsystem that provides volumes both for offloading (e.g., via ZenFlow) and for these checkpoints.
These volumes may be unprotected for offload aggregation (e.g., RAID0) or protected for checkpoint safety (e.g., RAID5/6+).
We cannot afford underutilized “dark flash.” This is where xiRAID Opus by Xinnor comes into play: aggregating multiple networked volumes into high-performance virtual volumes, whether unprotected (RAID0) or protected (RAID5/6+). With NVMe-over-RDMA and GPU-Direct Storage, these volumes ensure maximum resource utilization. xiRAID Opus thus allows us to scale storage efficiently, balancing both performance and reliability as we grow.
For large-scale modern projects where we lack sufficient HBM, VRAM, and where flash becomes increasingly costly, using solutions like DeepSpeed ZenFlow and disaggregated storage based on xiRAID Opus allows us to dramatically reduce infrastructure costs while remaining relatively safe due to fault protection, and deliver approximately the same speed across all stages of model training.
Future perspectives
- Evaluate DeepNVMe and GPUDirect Storage paths (when supported) to reduce CPU involvement and improve NVMe offload throughput and latency.
- Scale the study to larger model sizes on another hardware platform.
- Extend storage experiments to network-backed storage (NVMe-oF) to quantify when remote bandwidth/latency becomes the limiting factor.
- Perform systematic AIO tuning (block size, queue depth, thread count, filesystem options).