# Resilient LLM Training with NeMo Framework

This notebook demonstrates how to use NeMo's resiliency features for robust LLM training. It covers:

1. **Crash Recovery**: Using in-job restart capabilities to automatically recover from failures during training
2. **Straggler Detection**: Identifying and handling slow/stuck processes in distributed training
3. **Checkpointing**: Implementing asynchronous checkpointing for efficient model saving

The demo uses a small LLaMA model and simulated crashes to showcase these features in action. We'll walk through:
- Setting up a local executor with fault tolerance enabled
- Configuring the straggler detection callbacks
- Launching distributed training with resiliency features
- Monitoring training progress and recovery from failures
- Analyzing logs and checkpoints

This demonstrates how NeMo makes LLM training more robust and production-ready by handling common failure modes automatically.

NeMo Framework integrates resiliency features from the [NVIDIA Resiliency Extension](https://github.com/NVIDIA/nvidia-resiliency-ext) to minimize training disruptions and handle failures gracefully.

The key features include:
- Fault Tolerance: Automatically resumes training from the last checkpoint in case of interruptions.
- Straggler Detection: Identifies and mitigates slow-performing nodes to ensure efficient training.

For detailed documentation on these resiliency features, see the [NeMo Framework Resiliency Guide](https://docs.nvidia.com/nemo-framework/user-guide/latest/resiliency.html).

In [1]:
%%bash
# delete old checkpoints and prepare for a fresh run
rm -rf /tmp/nemo_run/checkpoints/

# 1. Set up a simple training job and demonstrate successful training

In [2]:
# Copyright (c) 2025, NVIDIA CORPORATION.  All rights reserved.

# Required Libraries
import argparse
import copy
import math
import os
from functools import partial
from typing import Any
import torch

import nemo_run as run
from lightning.pytorch.callbacks import Callback

from nemo.collections import llm
from nemo.collections.llm.recipes.callbacks.common import straggler_det_callback
from nemo.lightning.run import plugins

from crash_simulator import CrashSimulationCallback
from preemption_simulator import PreemptionSimulationCallback

print("Required libraries loaded.")

Required libraries loaded.


## 1.1 Define the executor

Define and initialize a local executor, which is used to manage distributed computing tasks. The executor encapsulates configurations for launching jobs (e.g. number of devices, environment variables, task distribution).

In [3]:
def local_executor(devices: int = 8) -> run.LocalExecutor:
    """
    Factory method for creating a LocalExecutor instance. 
    This sets up environment variables and configures the number of devices.

    Args:
        devices (int): Number of devices to be used per node.

    Returns:
        run.LocalExecutor: Configured local executor object.
    """
    env_vars = {
        "TRANSFORMERS_OFFLINE": "1",   # Keep Hugging Face Transformers from making network calls
        "TORCH_NCCL_AVOID_RECORD_STREAMS": "1",  # Avoid recordStream calls in PyTorch NCCL collectives
        "NCCL_NVLS_ENABLE": "0",      # Disable NVLink SHARP (NVLS) in NCCL
        "NVTE_DP_AMAX_REDUCE_INTERVAL": "0",  # Transformer Engine FP8 amax-reduction settings
        "NVTE_ASYNC_AMAX_REDUCTION": "1",
    }
    # Create a LocalExecutor that uses the `torchrun` launcher
    executor = run.LocalExecutor(ntasks_per_node=devices, launcher="torchrun", env_vars=env_vars)
    return executor

# Initialize the executor based on the arguments
executor = local_executor(devices=8)

print("Executor setup complete.")

Executor setup complete.


## 1.2 Model setup
Load and configure a Llama pretrain recipe. We choose a small, 54M-parameter model based on Llama 3 for faster execution. The model is derived from the original Llama 3 8B configuration, defined in the [Llama3Config8B class](https://github.com/NVIDIA/NeMo/blob/main/nemo/collections/llm/gpt/model/llama.py), by reducing the sequence length, number of layers, hidden size, and number of attention heads.

In [4]:
# Create a small Llama 3 model configuration
def small_llama_cfg() -> run.Config[llm.GPTConfig]:
    """Return a config for a small (54M-parameter) Llama 3 style model."""
    return run.Config(
        llm.Llama3Config8B,
        rotary_base=500_000,
        seq_length=128,
        num_layers=4,
        hidden_size=768,
        ffn_hidden_size=2688,
        num_attention_heads=16,
        init_method_std=0.023,
    )


## 1.3 Modify the training recipe
`pretrain` is built from a partial function that takes the experiment name and checkpoint directory and returns a pretrain recipe. It is set up to use `num_nodes=1` and `num_gpus_per_node=8`; change these two arguments to use a different layout. This demo uses the 8B pretrain recipe `llm.llama31_8b.pretrain_recipe`; see the recipe [module](https://github.com/NVIDIA/NeMo/blob/main/nemo/collections/llm/recipes/llama3_8b.py). The recipe defaults to a mock dataset, [MockDataModule](https://github.com/NVIDIA/NeMo/blob/main/nemo/collections/llm/gpt/data/mock.py); refer to the [Llama3_8b recipe](https://github.com/NVIDIA/NeMo/blob/main/nemo/collections/llm/recipes/llama3_8b.py) for instructions on how to use a custom dataset. Because the data is mock data, we set `max_steps` to 20 so the experiment runs in a reasonable time.

We also disable validation sanity checks to reduce startup time, and set tensor model parallel size to 2 and context parallel size to 1.
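
The effect of these parallelism settings on the 8-GPU layout can be checked with a few lines of arithmetic. The sketch below is not a NeMo API; `data_parallel_size` is a hypothetical helper, and it assumes the usual Megatron-style rule that the data-parallel size is the world size divided by the product of the model-parallel dimensions, with pipeline parallelism left at 1.

```python
def data_parallel_size(world_size: int, tensor_parallel: int,
                       pipeline_parallel: int = 1, context_parallel: int = 1) -> int:
    """Return the number of data-parallel replicas for a given GPU layout."""
    model_parallel = tensor_parallel * pipeline_parallel * context_parallel
    if world_size % model_parallel != 0:
        raise ValueError(
            f"world size {world_size} is not divisible by "
            f"TP*PP*CP = {model_parallel}"
        )
    return world_size // model_parallel


# Layout used in this demo: 1 node x 8 GPUs, TP=2, CP=1 (PP=1 is an assumption).
dp = data_parallel_size(world_size=8, tensor_parallel=2, context_parallel=1)
print(f"data-parallel replicas: {dp}")
```

With these settings each model replica spans 2 GPUs, so the 8 GPUs hold 4 replicas.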

In [5]:
# Experiment name
exp_name = "resiliency-in-pretraining-demo"

# Preliminary setup for the LLAMA pretrain recipe
pretrain = partial(llm.llama31_8b.pretrain_recipe, num_nodes=1, num_gpus_per_node=8)(
    name=exp_name, dir="/tmp/nemo_run/checkpoints"
)
pretrain.model = run.Config(llm.LlamaModel, small_llama_cfg())
pretrain.trainer.strategy.tensor_model_parallel_size = 2
pretrain.trainer.strategy.context_parallel_size = 1
pretrain.trainer.num_sanity_val_steps = 0
pretrain.broadcast(max_steps=20)
pretrain.trainer.limit_val_batches = 2
pretrain.trainer.log_every_n_steps = 1
pretrain.trainer.val_check_interval = 10
print("Model recipe setup complete.")

Model recipe setup complete.


## 1.4 Running the Experiment
Run the entire pretraining experiment. Depending on the arguments passed:
- If `dryrun` is True, it performs a dry run (to validate configurations).
- Otherwise, it launches the actual training run locally.

In [6]:
def run_experiment(exp_name, pretrain, executor, run_plugins, dryrun=False):
    """
    Run the pretraining experiment either as a dry run or actual training.
    
    Args:
        exp_name: Name of the experiment
        pretrain: Pretrain configuration object
        executor: Executor to run the experiment
        run_plugins: List of runtime plugins
        dryrun: Boolean flag to perform a dry run
    """
    with run.Experiment(f"{exp_name}") as exp:
        # Add the pretrain job to the experiment
        exp.add(
            pretrain,
            executor=executor,
            name=exp_name,
            plugins=run_plugins,
            tail_logs=True,
        )

        # Execute the experiment based on the dryrun flag
        if dryrun:
            print("Performing dry run ...")
            exp.dryrun()
        else:
            print("Launching training run ...")
            exp.run(sequential=True, detach=True)
            print("Experiment executed successfully.")

Note: This run generally fails the first time because the mock dataset cannot find the tokenizer files. The usual error is `FileNotFoundError: [Errno 2] No such file or directory: 'gpt2-merges.txt'`.

To avoid this, manually download the following files before launching a run:

In [7]:
%%bash
mkdir -p /root/.cache/torch/megatron
wget https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-vocab.json && mv gpt2-vocab.json /root/.cache/torch/megatron/megatron-gpt-345m_vocab
wget https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-merges.txt && mv gpt2-merges.txt /root/.cache/torch/megatron/megatron-gpt-345m_merges

--2025-03-07 22:33:43--  https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-vocab.json
Resolving s3.amazonaws.com (s3.amazonaws.com)... 52.216.209.88, 54.231.195.168, 52.217.224.64, ...
Connecting to s3.amazonaws.com (s3.amazonaws.com)|52.216.209.88|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1042301 (1018K) [application/json]
Saving to: ‘gpt2-vocab.json’

     0K .......... .......... .......... .......... ..........  4%  824K 1s
    50K .......... .......... .......... .......... ..........  9%  814K 1s
   100K .......... .......... .......... .......... .......... 14%  810K 1s
   150K .......... .......... .......... .......... .......... 19%  311M 1s
   200K .......... .......... .......... .......... .......... 24%  110M 1s
   250K .......... .......... .......... .......... .......... 29%  268M 0s
   300K .......... .......... .......... .......... .......... 34%  821K 0s
   350K .......... .......... .......... .......... .......... 39%  104M 

In [8]:
# run the experiment
run_plugins = []
run_experiment(exp_name, pretrain, executor, run_plugins, dryrun=False)

Launching training run ...


Log directory is: /root/.nemo_run/experiments/resiliency-in-pretraining-demo/resiliency-in-pretraining-demo_1741386824/resiliency-in-pretraining-demo


Log directory is: /root/.nemo_run/experiments/resiliency-in-pretraining-demo/resiliency-in-pretraining-demo_1741386824/resiliency-in-pretraining-demo
Launched app: local_persistent://nemo_run/resiliency-in-pretraining-demo-xb3fnk3npq9wn
AppStatus:
    State: RUNNING
    Num Restarts: 0
    Roles: 
    Msg: <NONE>
    Structured Error Msg: <NONE>
    UI URL: file:///root/.nemo_run/experiments/resiliency-in-pretraining-demo/resiliency-in-pretraining-demo_1741386824/resiliency-in-pretraining-demo/nemo_run/resiliency-in-pretraining-demo-xb3fnk3npq9wn
    


Experiment executed successfully.


Waiting for job resiliency-in-pretraining-demo-xb3fnk3npq9wn to finish [log=True]...


ining-demo/0 W0307 22:33:45.902000 170829 torch/distributed/run.py:793] 
ining-demo/0 W0307 22:33:45.902000 170829 torch/distributed/run.py:793] *****************************************
ining-demo/0 W0307 22:33:45.902000 170829 torch/distributed/run.py:793] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
ining-demo/0 W0307 22:33:45.902000 170829 torch/distributed/run.py:793] *****************************************
ining-demo/0 I0307 22:33:45.902000 170829 torch/distributed/launcher/api.py:194] Starting elastic_operator with launch configs:
ining-demo/0 I0307 22:33:45.902000 170829 torch/distributed/launcher/api.py:194]   entrypoint       : nemo_run.core.runners.fdl_runner
ining-demo/0 I0307 22:33:45.902000 170829 torch/distributed/launcher/api.py:194]   min_nodes        : 1
ining-demo/0 I0307 22:33:45.902000 170829 torch/di

Job resiliency-in-pretraining-demo-xb3fnk3npq9wn finished: SUCCEEDED


## 1.5 Cleanup and save clean states

In [9]:
%%bash
# delete old checkpoints
rm -rf /tmp/nemo_run/checkpoints/

In [10]:
# Keep a clean copy of the trainer callbacks so later sections can restore it
pretrain_trainer_callbacks = copy.deepcopy(pretrain.trainer.callbacks)
pretrain_trainer_callbacks
run_plugins = []

# 2. Demonstrate fault tolerance with crash detection and in-job restart
The [Fault Tolerance plugin](https://github.com/NVIDIA/NeMo/blob/main/nemo/lightning/run/plugins.py):
- Detects hangs and crashes during training and relaunches the training job without manual intervention
- Uses NVIDIA Resiliency Extension's `ft_launcher`, which has been integrated into [NeMo-Run](https://github.com/NVIDIA/NeMo-Run) as [FaultTolerance](https://github.com/NVIDIA/NeMo-Run/blob/main/nemo_run/core/execution/launcher.py)
- Uses the `FaultToleranceCallback` from NVIDIA Resiliency Extension, which sets up the heartbeats
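
To make the heartbeat idea concrete, here is a minimal sketch of heartbeat-based hang detection. It is not the NVIDIA Resiliency Extension implementation: `HeartbeatMonitor`, its methods, and the 30-second timeout are hypothetical, and timestamps are passed in explicitly so the example is deterministic.

```python
from dataclasses import dataclass, field


@dataclass
class HeartbeatMonitor:
    """Track the last heartbeat per rank and report ranks that went silent."""
    timeout_s: float
    last_seen: dict[int, float] = field(default_factory=dict)

    def beat(self, rank: int, now: float) -> None:
        # Each worker calls this periodically to signal that it is still alive.
        self.last_seen[rank] = now

    def stale_ranks(self, now: float) -> list[int]:
        # A rank is considered hung when its last heartbeat is older than the timeout.
        return sorted(
            rank for rank, seen in self.last_seen.items()
            if now - seen > self.timeout_s
        )


monitor = HeartbeatMonitor(timeout_s=30.0)
for rank in range(4):
    monitor.beat(rank, now=0.0)
# Ranks 0, 1 and 3 keep reporting; rank 2 hangs after its first heartbeat.
for rank in (0, 1, 3):
    monitor.beat(rank, now=25.0)
print(monitor.stale_ranks(now=40.0))
```

A launcher that sees a non-empty list of stale ranks can treat the job as hung and restart the workers, which is the role the fault tolerance launcher plays in this section.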

## 2.1 Setup FaultTolerancePlugin
The following environment variables must also be set:
- `FAULT_TOL_CFG_PATH` is the path to the fault tolerance config file. If it is empty, the default configuration is used.
- `FAULT_TOL_FINISHED_FLAG_FILE` is the path of the flag file that the fault tolerance package writes when a run completes successfully, so that a finished run does not trigger a relaunch.
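
The role of the finished-flag file can be illustrated with a small relaunch loop. This is a sketch under stated assumptions, not how `ft_launcher` is implemented: `supervise`, `fake_job`, and the restart budget are hypothetical names introduced only for illustration.

```python
import tempfile
from pathlib import Path
from typing import Callable


def supervise(run_once: Callable[[int], None], finished_flag: Path,
              max_restarts: int = 3) -> int:
    """Relaunch run_once until finished_flag exists; return the launch count."""
    for attempt in range(max_restarts + 1):
        try:
            run_once(attempt)
        except RuntimeError as err:
            print(f"attempt {attempt} failed: {err}")
        # The flag is the only signal that the job really completed.
        if finished_flag.exists():
            return attempt + 1
    raise RuntimeError("restart budget exhausted before the job finished")


with tempfile.TemporaryDirectory() as tmp:
    flag = Path(tmp) / "finished_flag"

    def fake_job(attempt: int) -> None:
        # Hypothetical job: crashes on the first launch, succeeds on the second.
        if attempt == 0:
            raise RuntimeError("simulated crash")
        flag.touch()  # written only on successful completion

    launches = supervise(fake_job, flag)
    print(f"job finished after {launches} launches")
```

Checking a flag file rather than the exit status keeps the decision simple: a run that ended for any reason other than successful completion leaves no flag and is relaunched.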

In [11]:
# Add FaultTolerancePlugin plugin and setup required env vars
run_plugins = [plugins.FaultTolerancePlugin()]

os.environ["FAULT_TOL_CFG_PATH"] = "/tmp/sample_job_ft_cfg.yml"
os.environ["FAULT_TOL_FINISHED_FLAG_FILE"] = "/tmp/sample_job_finished_flag"

## 2.2 Setup the crash simulator and run the experiment
We use the `CrashSimulationCallback` to simulate a crash during training. This callback is configured to crash the process at step 17 if a crash has not already occurred.

Expected workflow:
- Start training: Trainer Step counter = 0
- After 10 trainer steps: Trainer Step counter = 10 -> save checkpoint
- After 17 trainer steps: Trainer Step counter = 17 -> crash simulated, set `has_simulated_crash_happened` to `True`
- Automatic in-job restart from checkpoint at step 10: Trainer step counter = 10
- After 17 trainer steps: Trainer Step counter = 17 -> no crash simulated as `has_simulated_crash_happened == True`
- After 20 trainer steps: Trainer Step counter = 20 -> successfully completes training
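
The expected workflow can be traced with a toy simulation. This is a plain-Python sketch, not the `CrashSimulationCallback` itself: `run_with_restart` is a hypothetical function, the checkpoint interval of 10 mirrors the workflow above, and the final checkpoint at step 20 is an assumption of the sketch.

```python
def run_with_restart(max_steps: int = 20, ckpt_interval: int = 10,
                     crash_step: int = 17) -> list[str]:
    """Simulate one crash and an automatic restart from the latest checkpoint."""
    events = []
    last_ckpt = 0
    crashed_once = False  # plays the role of has_simulated_crash_happened
    step = 0
    while step < max_steps:
        step += 1
        if step == crash_step and not crashed_once:
            crashed_once = True
            events.append(f"crash at step {step}")
            step = last_ckpt  # in-job restart: resume from the checkpoint
            events.append(f"restart from step {step}")
            continue
        if step % ckpt_interval == 0:
            last_ckpt = step
            events.append(f"checkpoint at step {step}")
    events.append(f"finished at step {step}")
    return events


events = run_with_restart()
for line in events:
    print(line)
```

The trace shows that steps 11 to 16 are executed twice: once before the crash and once after the restart from the step-10 checkpoint.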

In [12]:
# Enable a crash simulation callback
pretrain.trainer.callbacks.append(run.Config(CrashSimulationCallback, crash_step=17))

# run the experiment
run_experiment(exp_name, pretrain, executor, run_plugins, dryrun=False)

Launching training run ...


Log directory is: /root/.nemo_run/experiments/resiliency-in-pretraining-demo/resiliency-in-pretraining-demo_1741387007/resiliency-in-pretraining-demo


Log directory is: /root/.nemo_run/experiments/resiliency-in-pretraining-demo/resiliency-in-pretraining-demo_1741387007/resiliency-in-pretraining-demo
Launched app: local_persistent://nemo_run/resiliency-in-pretraining-demo-ghpmrpzqnhtb0
AppStatus:
    State: RUNNING
    Num Restarts: 0
    Roles: 
    Msg: <NONE>
    Structured Error Msg: <NONE>
    UI URL: file:///root/.nemo_run/experiments/resiliency-in-pretraining-demo/resiliency-in-pretraining-demo_1741387007/resiliency-in-pretraining-demo/nemo_run/resiliency-in-pretraining-demo-ghpmrpzqnhtb0
    


Experiment executed successfully.


Waiting for job resiliency-in-pretraining-demo-ghpmrpzqnhtb0 to finish [log=True]...


ining-demo/0 *****************************************
ining-demo/0 Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
ining-demo/0 *****************************************
ining-demo/0 [2025-03-07 22:36:48,816] [INFO] [ft_launcher@4809917ea058] [default] starting workers for entrypoint: python
ining-demo/0 [2025-03-07 22:36:48,817] [INFO] [ft_launcher@4809917ea058] [default] Rendezvous'ing worker group
ining-demo/0 [2025-03-07 22:36:49,126] [INFO] [ft_launcher@4809917ea058] [default] Rendezvous complete for workers. Result:
ining-demo/0   restart_count=0
ining-demo/0   master_addr=4809917ea058
ining-demo/0   master_port=47865
ining-demo/0   group_rank=0
ining-demo/0   group_world_size=1
ining-demo/0   local_ranks=[0, 1, 2, 3, 4, 5, 6, 7]
ining-demo/0   role_ranks=[0, 1, 2, 3, 4, 5, 6, 7]
ining-demo/0   global_ranks=[0, 1, 2, 3,

Job resiliency-in-pretraining-demo-ghpmrpzqnhtb0 finished: SUCCEEDED


## 2.3 Cleanup

In [13]:
%%bash
# delete old checkpoints
rm -rf /tmp/nemo_run/checkpoints/

In [14]:
# restore pretrain.trainer.callbacks and drop Crash Simulation
pretrain.trainer.callbacks = copy.deepcopy(pretrain_trainer_callbacks)
run_plugins = []
pretrain_trainer_callbacks

[<Config[TimingCallback()]>]

# 3. Demonstrate Straggler Detection
The [Straggler Detection Callback](https://github.com/NVIDIA/NeMo/blob/main/nemo/collections/llm/recipes/callbacks/common.py):
- Monitors training performance across nodes
- Identifies ranks that are running slower than others ("stragglers")
- Wraps NVIDIA Resiliency Extension's straggler detection functionality in a NeMo-compatible interface


## 3.1 Setup and run an experiment
To simulate straggler GPUs in a distributed training run, there are two options:
1. Increase detection sensitivity: raise the straggler detection thresholds (`gpu_relative_perf_threshold` and `gpu_individual_perf_threshold`) from the default 0.7 to 0.99. The detector then reacts to small performance variations, so ordinary jitter is reported as stragglers without any change to hardware settings.
2. Manually reduce the performance of specific GPUs using the `nvidia-smi` utility by lowering the clock speeds of both the GPU core and memory.

#### Steps to manually slow down GPUs for experimentation
1. First, check the current clock speeds:
`!nvidia-smi --query-gpu=index,clocks.current.sm,clocks.current.memory --format=csv`   
2. Lock the GPU core clock to a lower frequency:
`!nvidia-smi -i <gpu_indices> --lock-gpu-clocks=<lower_frequency>`
3. Lock the GPU memory clock to a lower frequency:
`!nvidia-smi -i <gpu_indices> --lock-memory-clocks=<lower_frequency>`

Replace `<gpu_indices>` with the indices of the GPUs to slow down and `<lower_frequency>` with a value lower than the maximum clock speed, for both commands.
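
When choosing which GPUs to slow down, it can help to read the clock query from step 1 programmatically. The sketch below parses text in the CSV format that query prints; `parse_clocks` is a hypothetical helper and the sample rows are illustrative values, not a measurement.

```python
import csv
import io


def parse_clocks(csv_text: str) -> dict[int, tuple[int, int]]:
    """Map GPU index -> (SM clock in MHz, memory clock in MHz)."""
    reader = csv.reader(io.StringIO(csv_text), skipinitialspace=True)
    next(reader)  # skip the header row
    clocks = {}
    for row in reader:
        if not row:
            continue  # ignore blank lines
        index, sm, mem = row
        # Values look like "450 MHz"; keep only the number.
        clocks[int(index)] = (int(sm.split()[0]), int(mem.split()[0]))
    return clocks


# Illustrative sample in the format printed by the query in step 1.
sample = """index, clocks.current.sm [MHz], clocks.current.memory [MHz]
0, 450 MHz, 9000 MHz
4, 285 MHz, 405 MHz
7, 2400 MHz, 9000 MHz
"""

clocks = parse_clocks(sample)
slowest = min(clocks, key=lambda idx: clocks[idx][0])
print(clocks)
print(f"GPU with the lowest SM clock: {slowest}")
```

On a real machine the same function can be fed the standard output of the `nvidia-smi` query, for example captured with `subprocess.run(..., capture_output=True, text=True)`.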

#### Resetting GPU Clocks
After your experiment, make sure to reset the GPU and memory clocks to their default values:

`!nvidia-smi --reset-gpu-clocks`<br>
`!nvidia-smi --reset-memory-clocks`

These commands restore the default clock settings for both the GPU core and memory.

## 3.2 Increase Detection Sensitivity
We can force a mock straggler to be detected by making the straggler detection thresholds extremely sensitive.
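
Why a threshold of 0.99 flags healthy GPUs can be seen with a simplified scoring model. The sketch below is an assumption for illustration, not the scoring used by NVIDIA Resiliency Extension: `relative_scores` and `stragglers` are hypothetical helpers that score each rank as the fastest step time divided by its own step time, and the timings are made up.

```python
def relative_scores(step_times: dict[int, float]) -> dict[int, float]:
    """Score each rank against the fastest rank: 1.0 is best, lower is slower."""
    fastest = min(step_times.values())
    return {rank: fastest / t for rank, t in step_times.items()}


def stragglers(step_times: dict[int, float], threshold: float) -> list[int]:
    """Return ranks whose relative score falls below the threshold."""
    scores = relative_scores(step_times)
    return sorted(rank for rank, score in scores.items() if score < threshold)


# Hypothetical per-rank step times in seconds; rank 3 is about 3% slower,
# which is ordinary jitter rather than a real hardware problem.
times = {0: 1.00, 1: 1.005, 2: 1.00, 3: 1.03}

flagged_default = stragglers(times, threshold=0.7)
flagged_sensitive = stragglers(times, threshold=0.99)
print(flagged_default)
print(flagged_sensitive)
```

Under this model a rank must be roughly 30% slower than the fastest rank to fall below 0.7, while at 0.99 a slowdown of little more than 1% is enough.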

In [15]:
# Automatically detect and mitigate mock stragglers during training
pretrain.trainer.callbacks.append(straggler_det_callback(straggler_report_time_interval=1, gpu_relative_perf_threshold=0.99, gpu_individual_perf_threshold=0.99))

# run the experiment
run_experiment(exp_name, pretrain, executor, run_plugins, dryrun=False)

Launching training run ...


Log directory is: /root/.nemo_run/experiments/resiliency-in-pretraining-demo/resiliency-in-pretraining-demo_1741387286/resiliency-in-pretraining-demo


Log directory is: /root/.nemo_run/experiments/resiliency-in-pretraining-demo/resiliency-in-pretraining-demo_1741387286/resiliency-in-pretraining-demo
Launched app: local_persistent://nemo_run/resiliency-in-pretraining-demo-nr166h9790700
AppStatus:
    State: RUNNING
    Num Restarts: 0
    Roles: 
    Msg: <NONE>
    Structured Error Msg: <NONE>
    UI URL: file:///root/.nemo_run/experiments/resiliency-in-pretraining-demo/resiliency-in-pretraining-demo_1741387286/resiliency-in-pretraining-demo/nemo_run/resiliency-in-pretraining-demo-nr166h9790700
    


Experiment executed successfully.


Waiting for job resiliency-in-pretraining-demo-nr166h9790700 to finish [log=True]...


ining-demo/0 W0307 22:41:27.659000 192252 torch/distributed/run.py:793] 
ining-demo/0 W0307 22:41:27.659000 192252 torch/distributed/run.py:793] *****************************************
ining-demo/0 W0307 22:41:27.659000 192252 torch/distributed/run.py:793] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
ining-demo/0 W0307 22:41:27.659000 192252 torch/distributed/run.py:793] *****************************************
ining-demo/0 I0307 22:41:27.659000 192252 torch/distributed/launcher/api.py:194] Starting elastic_operator with launch configs:
ining-demo/0 I0307 22:41:27.659000 192252 torch/distributed/launcher/api.py:194]   entrypoint       : nemo_run.core.runners.fdl_runner
ining-demo/0 I0307 22:41:27.659000 192252 torch/distributed/launcher/api.py:194]   min_nodes        : 1
ining-demo/0 I0307 22:41:27.659000 192252 torch/di

Job resiliency-in-pretraining-demo-nr166h9790700 finished: FAILED


## 3.3 Cleanup

In [16]:
%%bash
# delete old checkpoints
rm -rf /tmp/nemo_run/checkpoints/

In [17]:
# restore pretrain.trainer.callbacks and drop Straggler Detection callback
pretrain.trainer.callbacks = copy.deepcopy(pretrain_trainer_callbacks)
run_plugins = []
pretrain.trainer.callbacks

[<Config[TimingCallback()]>]

## 3.4 Manually reduce the performance of specific GPUs using the nvidia-smi utility

In [18]:
### Simulating the Straggling GPUs
!nvidia-smi --query-gpu=index,clocks.current.sm,clocks.current.memory --format=csv

index, clocks.current.sm [MHz], clocks.current.memory [MHz]
0, 450 MHz, 9000 MHz
1, 675 MHz, 9000 MHz
2, 630 MHz, 9000 MHz
3, 465 MHz, 9000 MHz
4, 285 MHz, 405 MHz
5, 2370 MHz, 9000 MHz
6, 2130 MHz, 9000 MHz
7, 2400 MHz, 9000 MHz


In [19]:
!nvidia-smi -i 0,2,4,6 --lock-gpu-clocks=150

The current user does not have permission to change clocks for GPU 00000000:01:00.0.
Terminating early due to previous errors.


In [20]:
# Automatically detect and mitigate mock stragglers during training
# gpu_relative_perf_threshold and gpu_individual_perf_threshold default to 0.7 if not set explicitly
pretrain.trainer.callbacks.append(straggler_det_callback(straggler_report_time_interval=1))

# run the experiment
run_experiment(exp_name, pretrain, executor, run_plugins, dryrun=False)

Launching training run ...


Log directory is: /root/.nemo_run/experiments/resiliency-in-pretraining-demo/resiliency-in-pretraining-demo_1741387447/resiliency-in-pretraining-demo


Log directory is: /root/.nemo_run/experiments/resiliency-in-pretraining-demo/resiliency-in-pretraining-demo_1741387447/resiliency-in-pretraining-demo
Launched app: local_persistent://nemo_run/resiliency-in-pretraining-demo-n9hk3pk4hcz23c
AppStatus:
    State: RUNNING
    Num Restarts: 0
    Roles: 
    Msg: <NONE>
    Structured Error Msg: <NONE>
    UI URL: file:///root/.nemo_run/experiments/resiliency-in-pretraining-demo/resiliency-in-pretraining-demo_1741387447/resiliency-in-pretraining-demo/nemo_run/resiliency-in-pretraining-demo-n9hk3pk4hcz23c
    


Experiment executed successfully.


Waiting for job resiliency-in-pretraining-demo-n9hk3pk4hcz23c to finish [log=True]...


ining-demo/0 W0307 22:44:09.270000 199266 torch/distributed/run.py:793] 
ining-demo/0 W0307 22:44:09.270000 199266 torch/distributed/run.py:793] *****************************************
ining-demo/0 W0307 22:44:09.270000 199266 torch/distributed/run.py:793] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
ining-demo/0 W0307 22:44:09.270000 199266 torch/distributed/run.py:793] *****************************************
ining-demo/0 I0307 22:44:09.270000 199266 torch/distributed/launcher/api.py:194] Starting elastic_operator with launch configs:
ining-demo/0 I0307 22:44:09.270000 199266 torch/distributed/launcher/api.py:194]   entrypoint       : nemo_run.core.runners.fdl_runner
ining-demo/0 I0307 22:44:09.270000 199266 torch/distributed/launcher/api.py:194]   min_nodes        : 1
ining-demo/0 I0307 22:44:09.270000 199266 torch/di

Job resiliency-in-pretraining-demo-n9hk3pk4hcz23c finished: SUCCEEDED


The straggler detection callback monitors the GPU performance of each rank, identifies ranks that lag behind, and reports which GPUs are slow. It can also halt the job so that a slow rank does not hold back the whole run, which enables targeted fixes in distributed training setups. In the run shown above, `nvidia-smi` rejected the clock change for lack of permissions, so no GPU was actually slowed and the job completed without being stopped.

In [21]:
### !!!! IMPORTANT !!!! ###
### Reset the GPU clocks
!nvidia-smi --reset-gpu-clocks
!nvidia-smi --reset-memory-clocks

The current user does not have permission to change clocks for GPU 00000000:01:00.0.
Terminating early due to previous errors.
The current user does not have permission to change clocks for GPU 00000000:01:00.0.
Terminating early due to previous errors.


## 3.5 Cleanup

In [22]:
%%bash
# delete old checkpoints
rm -rf /tmp/nemo_run/checkpoints/

In [23]:
# restore pretrain.trainer.callbacks
pretrain.trainer.callbacks = copy.deepcopy(pretrain_trainer_callbacks)
run_plugins = []
pretrain.trainer.callbacks

[<Config[TimingCallback()]>]

# 4. Demonstrate Preemption
The [Preemption Plugin](https://github.com/NVIDIA/NeMo/blob/main/nemo/lightning/pytorch/callbacks/preemption.py) provides graceful shutdown capabilities:
- Monitors for shutdown signals (default: `signal.SIGTERM`)
- Saves a checkpoint when a shutdown signal is received
- Ensures training progress is preserved before termination

## 4.1 Setup the preemption simulator
We use the `PreemptionSimulationCallback` to simulate a `signal.SIGTERM` during training. This callback is configured to raise a `signal.SIGTERM` at step 4.

Expected workflow:
- Start training: Trainer Step counter = 0
- After 4 trainer steps: Trainer Step counter = 4 -> raise `signal.SIGTERM` -> Preemption callback saves an async checkpoint before gracefully exiting
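
The same pattern can be sketched in plain Python: a signal handler only records the shutdown request, and the training loop reacts at the next step boundary. This is not the NeMo preemption callback; `train_with_preemption` is a hypothetical function, the checkpoint is represented by a log entry, and the code assumes it runs in the main thread, which `signal.signal` requires.

```python
import signal


def train_with_preemption(max_steps: int = 20, preempt_step: int = 4) -> list[str]:
    """Run a toy loop that checkpoints and exits when SIGTERM arrives."""
    state = {"stop": False}
    events = []

    def on_sigterm(signum, frame):
        # Only record the request; the loop handles it at a safe point.
        state["stop"] = True

    def train_step(step: int) -> None:
        events.append(f"step {step}")

    previous = signal.signal(signal.SIGTERM, on_sigterm)
    try:
        for step in range(1, max_steps + 1):
            if step == preempt_step:
                # Stand-in for the signal a cluster scheduler would send.
                signal.raise_signal(signal.SIGTERM)
            train_step(step)
            if state["stop"]:
                events.append(f"checkpoint saved at step {step}")
                break
    finally:
        # Restore whatever handler was installed before the demo.
        signal.signal(signal.SIGTERM,
                      previous if previous is not None else signal.SIG_DFL)
    return events


events = train_with_preemption()
for line in events:
    print(line)
```

Deferring the work to the step boundary keeps the handler trivial and avoids saving a checkpoint in the middle of an optimizer update.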

In [24]:
# Add Preemption plugin
run_plugins = [plugins.PreemptionPlugin()]

# Enable a preemption simulation callback
pretrain.trainer.callbacks.append(run.Config(PreemptionSimulationCallback, preemption_step=4))

## 4.2 Run the experiment

In [25]:
# run the experiment
run_experiment(exp_name, pretrain, executor, run_plugins, dryrun=False)

Launching training run ...


Log directory is: /root/.nemo_run/experiments/resiliency-in-pretraining-demo/resiliency-in-pretraining-demo_1741387633/resiliency-in-pretraining-demo


Log directory is: /root/.nemo_run/experiments/resiliency-in-pretraining-demo/resiliency-in-pretraining-demo_1741387633/resiliency-in-pretraining-demo
Launched app: local_persistent://nemo_run/resiliency-in-pretraining-demo-bngwzzcstc0p3
AppStatus:
    State: RUNNING
    Num Restarts: 0
    Roles: 
    Msg: <NONE>
    Structured Error Msg: <NONE>
    UI URL: file:///root/.nemo_run/experiments/resiliency-in-pretraining-demo/resiliency-in-pretraining-demo_1741387633/resiliency-in-pretraining-demo/nemo_run/resiliency-in-pretraining-demo-bngwzzcstc0p3
    


Experiment executed successfully.


Waiting for job resiliency-in-pretraining-demo-bngwzzcstc0p3 to finish [log=True]...


ining-demo/0 W0307 22:47:14.814000 207080 torch/distributed/run.py:793] 
ining-demo/0 W0307 22:47:14.814000 207080 torch/distributed/run.py:793] *****************************************
ining-demo/0 W0307 22:47:14.814000 207080 torch/distributed/run.py:793] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
ining-demo/0 W0307 22:47:14.814000 207080 torch/distributed/run.py:793] *****************************************
ining-demo/0 I0307 22:47:14.814000 207080 torch/distributed/launcher/api.py:194] Starting elastic_operator with launch configs:
ining-demo/0 I0307 22:47:14.814000 207080 torch/distributed/launcher/api.py:194]   entrypoint       : nemo_run.core.runners.fdl_runner
ining-demo/0 I0307 22:47:14.814000 207080 torch/distributed/launcher/api.py:194]   min_nodes        : 1
ining-demo/0 I0307 22:47:14.814000 207080 torch/di

Job resiliency-in-pretraining-demo-bngwzzcstc0p3 finished: SUCCEEDED


## 4.3 Cleanup

In [26]:
%%bash
# delete old checkpoints
rm -rf /tmp/nemo_run/checkpoints/

In [27]:
# restore pretrain.trainer.callbacks
pretrain.trainer.callbacks = copy.deepcopy(pretrain_trainer_callbacks)
run_plugins = []
pretrain.trainer.callbacks

[<Config[TimingCallback()]>]

# 5. Discuss asynchronous distributed checkpointing
Checkpointing is important for recovering from failures, but traditional checkpointing has drawbacks:

1. Training pauses while saving checkpoints
2. To minimize these pauses, checkpoints are usually only saved once per epoch
3. If training fails between checkpoints, work must be redone from the last checkpoint

For example, with:
- 500 steps per epoch
- 10 seconds per step
- 3 epochs total

Best case (no failures):
- Training time = 15,000 seconds (500 steps × 10 seconds × 3 epochs)

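Worst case (a single failure at step 999, just before the second end-of-epoch checkpoint):
- Must redo nearly one full epoch (the 499 steps since the checkpoint at step 500)
- Training time ≈ 20,000 seconds (nearly 5,000 seconds wasted)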

Asynchronous checkpointing solves these problems by:
- Saving checkpoints without pausing training
- Using fast distributed checkpointing via Megatron-Core
- Allowing frequent checkpoints with minimal overhead

This means you can checkpoint often to minimize lost work, without slowing down training.
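
The cost of lost work can be estimated with a small helper. This is a back-of-the-envelope sketch: `training_time_with_failure` is a hypothetical function, it assumes a single failure, and it ignores checkpoint save time and restart overhead. The checkpoint interval of 50 steps is an illustrative choice.

```python
def training_time_with_failure(total_steps: int, seconds_per_step: float,
                               ckpt_interval: int, failure_step: int) -> float:
    """Total seconds when one failure occurs after `failure_step` completed steps."""
    # Work since the most recent checkpoint is lost and must be redone.
    last_ckpt = (failure_step // ckpt_interval) * ckpt_interval
    redone_steps = failure_step - last_ckpt
    return (total_steps + redone_steps) * seconds_per_step


total_steps = 500 * 3  # 500 steps per epoch, 3 epochs

# Failure after 999 completed steps, just before the second end-of-epoch checkpoint.
per_epoch = training_time_with_failure(total_steps, 10, ckpt_interval=500, failure_step=999)
frequent = training_time_with_failure(total_steps, 10, ckpt_interval=50, failure_step=999)

print(f"checkpoint every 500 steps: {per_epoch:,.0f} s")
print(f"checkpoint every 50 steps:  {frequent:,.0f} s")
```

With one checkpoint per epoch the failure costs 499 redone steps; with a checkpoint every 50 steps it costs 49, which is only practical when saving does not pause training.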

For more details, see:
- [Megatron-Core distributed checkpointing](https://docs.nvidia.com/megatron-core/developer-guide/latest/api-guide/dist_checkpointing.html)
- [NeMo documentation](https://github.com/NVIDIA/NeMo/blob/main/docs/source/checkpoints/dist_ckpt.rst)

Note: NeMo enables asynchronous and parallel checkpointing by default through MegatronStrategy's `ckpt_async_save` and `ckpt_parallel_save` options, so users get these benefits automatically, without any additional configuration.
