{ "cells": [ { "cell_type": "markdown", "id": "d63d934c-f709-4e6d-aa44-03f6b1926180", "metadata": {}, "source": [ "# Resilient LLM Training with NeMo Framework\n", "\n", "This notebook demonstrates how to use NeMo's resiliency features for robust LLM training. It covers:\n", "\n", "1. **Crash Recovery**: Using in-job restart capabilities to automatically recover from failures during training\n", "2. **Straggler Detection**: Identifying and handling slow/stuck processes in distributed training\n", "3. **Checkpointing**: Implementing asynchronous checkpointing for efficient model saving\n", "\n", "The demo uses a small LLaMA model and simulated crashes to showcase these features in action. We'll walk through:\n", "- Setting up a local executor with fault tolerance enabled\n", "- Configuring the straggler detection callbacks\n", "- Launching distributed training with resiliency features\n", "- Monitoring training progress and recovery from failures\n", "- Analyzing logs and checkpoints\n", "\n", "This demonstrates how NeMo makes LLM training more robust and production-ready by handling common failure modes automatically.\n", "\n", "NeMo Framework integrates resiliency features from the [NVIDIA Resiliency Extension](https://github.com/NVIDIA/nvidia-resiliency-ext) to minimize training disruptions and handle failures gracefully.\n", "\n", "The key features include\n", "- Fault Tolerance: Automatically resumes training from the last checkpoint in case of interruptions.\n", "- Straggler Detection: Identifies and mitigates slow-performing nodes to ensure efficient training.\n", "\n", "For detailed documentation on these resiliency features, see the [NeMo Framework Resiliency Guide](https://docs.nvidia.com/nemo-framework/user-guide/latest/resiliency.html)" ] }, { "cell_type": "code", "execution_count": 1, "id": "f3f9cea8-a917-4c81-b80e-4fc52ce3359c", "metadata": {}, "outputs": [], "source": [ "%%bash\n", "# delete old checkpoints and prepare for a fresh run\n", "rm -rf /tmp/nemo_run/checkpoints/" ] }, { "cell_type": "markdown", "id": "e03a6466-45a4-4987-a5ac-10cdd4dbaf86", "metadata": {}, "source": [ "# 1. Setup a simple training job and demostrate successful training" ] }, { "cell_type": "code", "execution_count": 2, "id": "2dfb9b8b-3359-4d00-88fa-abf0e24f7850", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/usr/local/lib/python3.10/dist-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html\n", " from .autonotebook import tqdm as notebook_tqdm\n", "[NeMo W 2025-03-07 22:33:41 nemo_logging:361] /usr/local/lib/python3.10/dist-packages/pyannote/core/notebook.py:134: MatplotlibDeprecationWarning: The get_cmap function was deprecated in Matplotlib 3.7 and will be removed in 3.11. Use ``matplotlib.colormaps[name]`` or ``matplotlib.colormaps.get_cmap()`` or ``pyplot.get_cmap()`` instead.\n", " cm = get_cmap(\"Set1\")\n", " \n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Required libraries loaded.\n" ] } ], "source": [ "# Copyright (c) 2025, NVIDIA CORPORATION. 
All rights reserved.\n", "\n", "# Required Libraries\n", "import argparse\n", "import copy\n", "import math\n", "import os\n", "from functools import partial\n", "from typing import Any\n", "import torch\n", "\n", "import nemo_run as run\n", "from lightning.pytorch.callbacks import Callback\n", "\n", "from nemo.collections import llm\n", "from nemo.collections.llm.recipes.callbacks.common import straggler_det_callback\n", "from nemo.lightning.run import plugins\n", "\n", "from crash_simulator import CrashSimulationCallback\n", "from preemption_simulator import PreemptionSimulationCallback\n", "\n", "print(\"Required libraries loaded.\")" ] }, { "cell_type": "markdown", "id": "e2a19a6d-8df8-4930-bb50-d622b6b72af7", "metadata": {}, "source": [ "## 1.1 Define the executor\n", "\n", "Define and initialize a local executor, which is used to manage distributed computing tasks. The executor encapsulates configurations for launching jobs (e.g., number of devices, environment variables, task distribution)." ] }, { "cell_type": "code", "execution_count": 3, "id": "8740b1b8-0f89-40a9-a361-a88a6371d073", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Executor setup complete.\n" ] } ], "source": [ "def local_executor(devices: int = 8) -> run.LocalExecutor:\n", " \"\"\"\n", " Factory method for creating a LocalExecutor instance. \n", " This sets up environment variables and configures the number of devices.\n", "\n", " Args:\n", " devices (int): Number of devices to be used per node.\n", "\n", " Returns:\n", " run.LocalExecutor: Configured local executor object.\n", " \"\"\"\n", " env_vars = {\n", " \"TRANSFORMERS_OFFLINE\": \"1\", # Run Transformer models offline\n", " \"TORCH_NCCL_AVOID_RECORD_STREAMS\": \"1\", # Optimize PyTorch NCCL\n", " \"NCCL_NVLS_ENABLE\": \"0\", # Experimental NCCL environment variable\n", " \"NVTE_DP_AMAX_REDUCE_INTERVAL\": \"0\", \n", " \"NVTE_ASYNC_AMAX_REDUCTION\": \"1\",\n", " }\n", " # Create a LocalExecutor that uses the `torchrun` launcher (Section 2 switches to the fault-tolerant `ft` launcher)\n", " executor = run.LocalExecutor(ntasks_per_node=devices, launcher=\"torchrun\", env_vars=env_vars)\n", " return executor\n", "\n", "# Initialize the executor based on the arguments\n", "executor = local_executor(devices=8)\n", "\n", "print(\"Executor setup complete.\")" ] }, { "cell_type": "markdown", "id": "994ffd38-5f97-4001-ad86-46b686edb0e8", "metadata": {}, "source": [ "## 1.2 Model setup\n", "Load and configure a Llama pretrain recipe. We choose a small 54M-parameter Llama 3-based model for faster execution. This model is obtained by reducing the sequence length, number of layers, hidden size, and number of attention heads of the original Llama 3 8B model configuration as defined in the [Llama3Config8B class](https://github.com/NVIDIA/NeMo/blob/main/nemo/collections/llm/gpt/model/llama.py)." 
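] }, { "cell_type": "markdown", "id": "1a2b3c4d", "metadata": {}, "source": [ "For context, the `run.Config` wrapper used in the next cell records the target class and keyword overrides without instantiating them; the object is only built when the experiment is launched. Below is a minimal sketch of this deferred-configuration idea in plain Python, using a `functools.partial` analogy and a hypothetical `ToyModelConfig` stand-in (this illustrates the pattern only, not NeMo-Run's internal implementation):\n", "\n", "```python\n", "from dataclasses import dataclass\n", "from functools import partial\n", "\n", "\n", "@dataclass\n", "class ToyModelConfig:  # hypothetical stand-in for llm.Llama3Config8B\n", "    num_layers: int = 32\n", "    hidden_size: int = 4096\n", "\n", "\n", "# Record the class and overrides now; instantiate only at launch time.\n", "deferred_cfg = partial(ToyModelConfig, num_layers=4, hidden_size=768)\n", "cfg = deferred_cfg()\n", "print(cfg)  # ToyModelConfig(num_layers=4, hidden_size=768)\n", "```\n", "\n", "Unlike a plain `partial`, `run.Config` also keeps the configuration serializable, which lets NeMo-Run ship it to the launched worker processes."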
] }, { "cell_type": "code", "execution_count": 4, "id": "52245a88", "metadata": {}, "outputs": [], "source": [ "# Create a small LLAMA3 model configuration\n", "def small_llama_cfg() -> llm.GPTConfig:\n", " \"\"\"Small 54M parameter model\"\"\"\n", " return run.Config(\n", " llm.Llama3Config8B,\n", " rotary_base=500_000,\n", " seq_length=128,\n", " num_layers=4,\n", " hidden_size=768,\n", " ffn_hidden_size=2688,\n", " num_attention_heads=16,\n", " init_method_std=0.023,\n", " )\n" ] }, { "cell_type": "markdown", "id": "b6988ce3", "metadata": {}, "source": [ "## 1.3 Modify the training recipe\n", "`pretrain` is a partial function that takes in the experiment name and checkpoint directory, and returns a pretrain recipe. It is setup to use `num_nodes=1` and `num_gpus_per_node=8` by default but this can be changed by modifying the `num_nodes` and `num_gpus_per_node` arguments. This demo uses the llama3 8b pretrain recipe as defined in the `llama31_8b.pretrain_recipe` [module](https://github.com/NVIDIA/NeMo/blob/main/nemo/collections/llm/recipes/llama3_8b.py). This defaults to using a mock dataset: [MockDataModule](https://github.com/NVIDIA/NeMo/blob/main/nemo/collections/llm/gpt/data/mock.py) but please refer to the [Llama3_8b recipe](https://github.com/NVIDIA/NeMo/blob/main/nemo/collections/llm/recipes/llama3_8b.py) for instructions on how to use a custom dataset. Since we are using a mock dataset, we set the `max_steps` to 20 so we can run the experiment in a reasonable time.\n", "\n", "We also disable validation sanity checks to reduce startup time, and set tensor model parallel size to 2 and context parallel size to 1." ] }, { "cell_type": "code", "execution_count": 5, "id": "f5c99ffc-3718-4383-b77a-161f387ce302", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Model recipe setup complete.\n" ] } ], "source": [ "# Experiment name\n", "exp_name = \"resiliency-in-pretraining-demo\"\n", "\n", "# Preliminary setup for the LLAMA pretrain recipe\n", "pretrain = partial(llm.llama31_8b.pretrain_recipe, num_nodes=1, num_gpus_per_node=8)(\n", " name=exp_name, dir=\"/tmp/nemo_run/checkpoints\"\n", ")\n", "pretrain.model = run.Config(llm.LlamaModel, small_llama_cfg())\n", "pretrain.trainer.strategy.tensor_model_parallel_size = 2\n", "pretrain.trainer.strategy.context_parallel_size = 1\n", "pretrain.trainer.num_sanity_val_steps = 0\n", "pretrain.broadcast(max_steps=20)\n", "pretrain.trainer.limit_val_batches = 2\n", "pretrain.trainer.log_every_n_steps = 1\n", "pretrain.trainer.val_check_interval = 10\n", "print(\"Model recipe setup complete.\")" ] }, { "cell_type": "markdown", "id": "46ae75f7-1a91-4429-bfbf-3ebe62bce123", "metadata": {}, "source": [ "## 1.4 Running the Experiment\n", "Run the entire pretraining experiment. Depending on the arguments passed:\n", "- If `dryrun` is True, it performs a dry run (to validate configurations).\n", "- Otherwise, it launches the actual training run locally." 
] }, { "cell_type": "code", "execution_count": 6, "id": "03887dd7-a23b-44c9-825a-311849729531", "metadata": {}, "outputs": [], "source": [ "def run_experiment(exp_name, pretrain, executor, run_plugins, dryrun=False):\n", " \"\"\"\n", " Run the pretraining experiment either as a dry run or actual training.\n", " \n", " Args:\n", " exp_name: Name of the experiment\n", " pretrain: Pretrain configuration object\n", " executor: Executor to run the experiment\n", " run_plugins: List of runtime plugins\n", " dryrun: Boolean flag to perform a dry run\n", " \"\"\"\n", " with run.Experiment(f\"{exp_name}\") as exp:\n", " # Add the pretrain job to the experiment\n", " exp.add(\n", " pretrain,\n", " executor=executor,\n", " name=exp_name,\n", " plugins=run_plugins,\n", " tail_logs=True,\n", " )\n", "\n", " # Execute the experiment based on the dryrun flag\n", " if dryrun:\n", " print(\"Performing dry run ...\")\n", " exp.dryrun()\n", " else:\n", " print(\"Launching training run ...\")\n", " exp.run(sequential=True, detach=True)\n", " print(\"Experiment executed successfully.\")" ] }, { "cell_type": "markdown", "id": "6998265d-628a-4a68-bfd2-c40c107c2a43", "metadata": {}, "source": [ "Note: This run genrally fails the first time around since we are using a Mock dataset and it cannot find the tokenizer files. So the error is usually `FileNotFoundError: [Errno 2] No such file or directory: 'gpt2-merges.txt'`.\n", "\n", "To avoid this, you can manually download the following files before launching a run" ] }, { "cell_type": "code", "execution_count": 7, "id": "27a817a6-61dc-4122-8a03-dade66d0cb03", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "--2025-03-07 22:33:43-- https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-vocab.json\n", "Resolving s3.amazonaws.com (s3.amazonaws.com)... 52.216.209.88, 54.231.195.168, 52.217.224.64, ...\n", "Connecting to s3.amazonaws.com (s3.amazonaws.com)|52.216.209.88|:443... connected.\n", "HTTP request sent, awaiting response... 200 OK\n", "Length: 1042301 (1018K) [application/json]\n", "Saving to: ‘gpt2-vocab.json’\n", "\n", " 0K .......... .......... .......... .......... .......... 4% 824K 1s\n", " 50K .......... .......... .......... .......... .......... 9% 814K 1s\n", " 100K .......... .......... .......... .......... .......... 14% 810K 1s\n", " 150K .......... .......... .......... .......... .......... 19% 311M 1s\n", " 200K .......... .......... .......... .......... .......... 24% 110M 1s\n", " 250K .......... .......... .......... .......... .......... 29% 268M 0s\n", " 300K .......... .......... .......... .......... .......... 34% 821K 0s\n", " 350K .......... .......... .......... .......... .......... 39% 104M 0s\n", " 400K .......... .......... .......... .......... .......... 44% 331M 0s\n", " 450K .......... .......... .......... .......... .......... 49% 602M 0s\n", " 500K .......... .......... .......... .......... .......... 54% 518M 0s\n", " 550K .......... .......... .......... .......... .......... 58% 545M 0s\n", " 600K .......... .......... .......... .......... .......... 63% 829K 0s\n", " 650K .......... .......... .......... .......... .......... 68% 111M 0s\n", " 700K .......... .......... .......... .......... .......... 73% 328M 0s\n", " 750K .......... .......... .......... .......... .......... 78% 325M 0s\n", " 800K .......... .......... .......... .......... .......... 83% 277M 0s\n", " 850K .......... .......... .......... .......... .......... 88% 333M 0s\n", " 900K .......... .......... 
.......... .......... .......... 93% 320M 0s\n", " 950K .......... .......... .......... .......... .......... 98% 487M 0s\n", " 1000K .......... ....... 100% 616M=0.3s\n", "\n", "2025-03-07 22:33:43 (3.23 MB/s) - ‘gpt2-vocab.json’ saved [1042301/1042301]\n", "\n", "--2025-03-07 22:33:43-- https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-merges.txt\n", "Resolving s3.amazonaws.com (s3.amazonaws.com)... 52.217.224.64, 52.217.163.184, 52.216.207.197, ...\n", "Connecting to s3.amazonaws.com (s3.amazonaws.com)|52.217.224.64|:443... connected.\n", "HTTP request sent, awaiting response... 200 OK\n", "Length: 456318 (446K) [text/plain]\n", "Saving to: ‘gpt2-merges.txt’\n", "\n", " 0K .......... .......... .......... .......... .......... 11% 801K 0s\n", " 50K .......... .......... .......... .......... .......... 22% 786K 0s\n", " 100K .......... .......... .......... .......... .......... 33% 786K 0s\n", " 150K .......... .......... .......... .......... .......... 44% 272M 0s\n", " 200K .......... .......... .......... .......... .......... 56% 454M 0s\n", " 250K .......... .......... .......... .......... .......... 67% 797K 0s\n", " 300K .......... .......... .......... .......... .......... 78% 106M 0s\n", " 350K .......... .......... .......... .......... .......... 89% 296M 0s\n", " 400K .......... .......... .......... .......... ..... 100% 334M=0.3s\n", "\n", "2025-03-07 22:33:44 (1.72 MB/s) - ‘gpt2-merges.txt’ saved [456318/456318]\n", "\n" ] } ], "source": [ "%%bash\n", "mkdir -p /root/.cache/torch/megatron\n", "wget https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-vocab.json && mv gpt2-vocab.json /root/.cache/torch/megatron/megatron-gpt-345m_vocab\n", "wget https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-merges.txt && mv gpt2-merges.txt /root/.cache/torch/megatron/megatron-gpt-345m_merges" ] }, { "cell_type": "code", "execution_count": 8, "id": "2836df0e-3a43-4dfd-9534-4633f5aa2441", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
────── Entering Experiment resiliency-in-pretraining-demo with id: resiliency-in-pretraining-demo_1741386824 ──────\n",
       "
\n" ], "text/plain": [ "\u001b[92m────── \u001b[0m\u001b[1;35mEntering Experiment resiliency-in-pretraining-demo with id: resiliency-in-pretraining-demo_1741386824\u001b[0m\u001b[92m ──────\u001b[0m\n" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "Launching training run ...\n" ] }, { "data": { "text/html": [ "
[22:33:44]  Cannot detach from this experiment. Please keep it running until completion.          experiment.py:651\n",
       "
\n" ], "text/plain": [ "\u001b[2;36m[22:33:44]\u001b[0m\u001b[2;36m \u001b[0m\u001b[1;31m Cannot detach from this experiment. Please keep it running until completion.\u001b[0m \u001b]8;id=506192;file:///opt/NeMo-Run/src/nemo_run/run/experiment.py\u001b\\\u001b[2mexperiment.py\u001b[0m\u001b]8;;\u001b\\\u001b[2m:\u001b[0m\u001b]8;id=636450;file:///opt/NeMo-Run/src/nemo_run/run/experiment.py#651\u001b\\\u001b[2m651\u001b[0m\u001b]8;;\u001b\\\n" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stderr", "output_type": "stream", "text": [ "Log directory is: /root/.nemo_run/experiments/resiliency-in-pretraining-demo/resiliency-in-pretraining-demo_1741386824/resiliency-in-pretraining-demo\n" ] }, { "data": { "text/html": [ "
           Launching job resiliency-in-pretraining-demo for experiment                            experiment.py:724\n",
       "           resiliency-in-pretraining-demo                                                                          \n",
       "
\n" ], "text/plain": [ "\u001b[2;36m \u001b[0m\u001b[2;36m \u001b[0m\u001b[1;36mLaunching job resiliency-in-pretraining-demo for experiment \u001b[0m \u001b]8;id=291970;file:///opt/NeMo-Run/src/nemo_run/run/experiment.py\u001b\\\u001b[2mexperiment.py\u001b[0m\u001b]8;;\u001b\\\u001b[2m:\u001b[0m\u001b]8;id=992860;file:///opt/NeMo-Run/src/nemo_run/run/experiment.py#724\u001b\\\u001b[2m724\u001b[0m\u001b]8;;\u001b\\\n", "\u001b[2;36m \u001b[0m\u001b[1;36mresiliency-in-pretraining-demo\u001b[0m \u001b[2m \u001b[0m\n" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stderr", "output_type": "stream", "text": [ "Log directory is: /root/.nemo_run/experiments/resiliency-in-pretraining-demo/resiliency-in-pretraining-demo_1741386824/resiliency-in-pretraining-demo\n", "Launched app: local_persistent://nemo_run/resiliency-in-pretraining-demo-xb3fnk3npq9wn\n", "AppStatus:\n", " State: RUNNING\n", " Num Restarts: 0\n", " Roles: \n", " Msg: \n", " Structured Error Msg: \n", " UI URL: file:///root/.nemo_run/experiments/resiliency-in-pretraining-demo/resiliency-in-pretraining-demo_1741386824/resiliency-in-pretraining-demo/nemo_run/resiliency-in-pretraining-demo-xb3fnk3npq9wn\n", " \n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Experiment executed successfully.\n" ] }, { "data": { "text/html": [ "
─────────────────── Waiting for Experiment resiliency-in-pretraining-demo_1741386824 to finish ────────────────────\n",
       "
\n" ], "text/plain": [ "\u001b[92m─────────────────── \u001b[0m\u001b[1;35mWaiting for Experiment resiliency-in-pretraining-demo_1741386824 to finish\u001b[0m\u001b[92m ────────────────────\u001b[0m\n" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
\n",
       "
\n" ], "text/plain": [ "\n" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
Experiment Status for resiliency-in-pretraining-demo_1741386824\n",
       "
\n" ], "text/plain": [ "\u001b[1;32mExperiment Status for\u001b[0m \u001b[1;38;5;214mresiliency-in-pretraining-demo_1741386824\u001b[0m\n" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
\n",
       "Task 0: resiliency-in-pretraining-demo\n",
       "- Status: RUNNING\n",
       "- Executor: LocalExecutor\n",
       "- Job id: resiliency-in-pretraining-demo-xb3fnk3npq9wn\n",
       "- Local Directory: /root/.nemo_run/experiments/resiliency-in-pretraining-demo/resiliency-in-pretraining-demo_1741386824/resiliency-in-pretraining-demo\n",
       "
\n" ], "text/plain": [ "\n", "\u001b[1;32mTask 0\u001b[0m: \u001b[1;38;5;214mresiliency-in-pretraining-demo\u001b[0m\n", "- \u001b[1;32mStatus\u001b[0m: RUNNING\n", "- \u001b[1;32mExecutor\u001b[0m: LocalExecutor\n", "- \u001b[1;32mJob id\u001b[0m: resiliency-in-pretraining-demo-xb3fnk3npq9wn\n", "- \u001b[1;32mLocal Directory\u001b[0m: /root/.nemo_run/experiments/resiliency-in-pretraining-demo/resiliency-in-pretraining-demo_1741386824/resiliency-in-pretraining-demo\n" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
\n",
       "
\n" ], "text/plain": [ "\n" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stderr", "output_type": "stream", "text": [ "Waiting for job resiliency-in-pretraining-demo-xb3fnk3npq9wn to finish [log=True]...\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "ining-demo/0 W0307 22:33:45.902000 170829 torch/distributed/run.py:793] \n", "ining-demo/0 W0307 22:33:45.902000 170829 torch/distributed/run.py:793] *****************************************\n", "ining-demo/0 W0307 22:33:45.902000 170829 torch/distributed/run.py:793] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. \n", "ining-demo/0 W0307 22:33:45.902000 170829 torch/distributed/run.py:793] *****************************************\n", "ining-demo/0 I0307 22:33:45.902000 170829 torch/distributed/launcher/api.py:194] Starting elastic_operator with launch configs:\n", "ining-demo/0 I0307 22:33:45.902000 170829 torch/distributed/launcher/api.py:194] entrypoint : nemo_run.core.runners.fdl_runner\n", "ining-demo/0 I0307 22:33:45.902000 170829 torch/distributed/launcher/api.py:194] min_nodes : 1\n", "ining-demo/0 I0307 22:33:45.902000 170829 torch/distributed/launcher/api.py:194] max_nodes : 1\n", "ining-demo/0 I0307 22:33:45.902000 170829 torch/distributed/launcher/api.py:194] nproc_per_node : 8\n", "ining-demo/0 I0307 22:33:45.902000 170829 torch/distributed/launcher/api.py:194] run_id : 8931\n", "ining-demo/0 I0307 22:33:45.902000 170829 torch/distributed/launcher/api.py:194] rdzv_backend : c10d\n", "ining-demo/0 I0307 22:33:45.902000 170829 torch/distributed/launcher/api.py:194] rdzv_endpoint : localhost:0\n", "ining-demo/0 I0307 22:33:45.902000 170829 torch/distributed/launcher/api.py:194] rdzv_configs : {'timeout': 900}\n", "ining-demo/0 I0307 22:33:45.902000 170829 torch/distributed/launcher/api.py:194] max_restarts : 0\n", "ining-demo/0 I0307 22:33:45.902000 170829 torch/distributed/launcher/api.py:194] monitor_interval : 0.1\n", "ining-demo/0 I0307 22:33:45.902000 170829 torch/distributed/launcher/api.py:194] log_dir : /root/.nemo_run/experiments/resiliency-in-pretraining-demo/resiliency-in-pretraining-demo_1741386824/resiliency-in-pretraining-demo/nemo_run/resiliency-in-pretraining-demo-xb3fnk3npq9wn/torchelastic/resiliency-in-pretraining-demo\n", "ining-demo/0 I0307 22:33:45.902000 170829 torch/distributed/launcher/api.py:194] metrics_cfg : {}\n", "ining-demo/0 I0307 22:33:45.902000 170829 torch/distributed/launcher/api.py:194] \n", "ining-demo/0 I0307 22:33:45.906000 170829 torch/distributed/elastic/agent/server/api.py:845] [default] starting workers for entrypoint: python\n", "ining-demo/0 I0307 22:33:45.906000 170829 torch/distributed/elastic/agent/server/api.py:662] [default] Rendezvous'ing worker group\n", "ining-demo/0 I0307 22:33:45.964000 170829 torch/distributed/elastic/agent/server/api.py:525] [default] Rendezvous complete for workers. 
Result:\n", "ining-demo/0 I0307 22:33:45.964000 170829 torch/distributed/elastic/agent/server/api.py:525] restart_count=0\n", "ining-demo/0 I0307 22:33:45.964000 170829 torch/distributed/elastic/agent/server/api.py:525] master_addr=localhost\n", "ining-demo/0 I0307 22:33:45.964000 170829 torch/distributed/elastic/agent/server/api.py:525] master_port=42531\n", "ining-demo/0 I0307 22:33:45.964000 170829 torch/distributed/elastic/agent/server/api.py:525] group_rank=0\n", "ining-demo/0 I0307 22:33:45.964000 170829 torch/distributed/elastic/agent/server/api.py:525] group_world_size=1\n", "ining-demo/0 I0307 22:33:45.964000 170829 torch/distributed/elastic/agent/server/api.py:525] local_ranks=[0, 1, 2, 3, 4, 5, 6, 7]\n", "ining-demo/0 I0307 22:33:45.964000 170829 torch/distributed/elastic/agent/server/api.py:525] role_ranks=[0, 1, 2, 3, 4, 5, 6, 7]\n", "ining-demo/0 I0307 22:33:45.964000 170829 torch/distributed/elastic/agent/server/api.py:525] global_ranks=[0, 1, 2, 3, 4, 5, 6, 7]\n", "ining-demo/0 I0307 22:33:45.964000 170829 torch/distributed/elastic/agent/server/api.py:525] role_world_sizes=[8, 8, 8, 8, 8, 8, 8, 8]\n", "ining-demo/0 I0307 22:33:45.964000 170829 torch/distributed/elastic/agent/server/api.py:525] global_world_sizes=[8, 8, 8, 8, 8, 8, 8, 8]\n", "ining-demo/0 I0307 22:33:45.964000 170829 torch/distributed/elastic/agent/server/api.py:525] \n", "ining-demo/0 I0307 22:33:45.964000 170829 torch/distributed/elastic/agent/server/api.py:670] [default] Starting worker group\n", "ining-demo/0 I0307 22:33:45.964000 170829 torch/distributed/elastic/agent/server/local_elastic_agent.py:291] use_agent_store: True\n", "ining-demo/0 I0307 22:33:45.964000 170829 torch/distributed/elastic/agent/server/local_elastic_agent.py:192] Environment variable 'TORCHELASTIC_ENABLE_FILE_TIMER' not found. Do not start FileTimerServer.\n", "ining-demo/0 I0307 22:33:45.964000 170829 torch/distributed/elastic/agent/server/local_elastic_agent.py:229] Environment variable 'TORCHELASTIC_HEALTH_CHECK_PORT' not found. Do not start health check.\n", "ining-demo/0 [default0]:[NeMo W 2025-03-07 22:33:55 nemo_logging:361] /usr/local/lib/python3.10/dist-packages/pyannote/core/notebook.py:134: MatplotlibDeprecationWarning: The get_cmap function was deprecated in Matplotlib 3.7 and will be removed in 3.11. 
Use ``matplotlib.colormaps[name]`` or ``matplotlib.colormaps.get_cmap()`` or ``pyplot.get_cmap()`` instead.\n", "ining-demo/0 [default0]: cm = get_cmap(\"Set1\")\n", "ining-demo/0 [default0]: \n", "ining-demo/0 [default0]:[NeMo I 2025-03-07 22:33:56 tokenizer_utils:224] Getting Megatron tokenizer for pretrained model name: megatron-gpt-345m, custom vocab file: None, and merges file: None\n", "ining-demo/0 [default0]:[NeMo I 2025-03-07 22:33:56 tokenizer_utils:130] Getting HuggingFace AutoTokenizer with pretrained_model_name: gpt2, vocab_file: /root/.cache/torch/megatron/megatron-gpt-345m_vocab, merges_files: /root/.cache/torch/megatron/megatron-gpt-345m_merges, special_tokens_dict: {}, and use_fast: False\n", "ining-demo/0 [default0]:[NeMo I 2025-03-07 22:33:56 nemo_logger:145] Experiments will be logged at /tmp/nemo_run/checkpoints/resiliency-in-pretraining-demo\n", "ining-demo/0 [default0]:[NeMo I 2025-03-07 22:33:56 megatron_strategy:315] Fixing mis-match between ddp-config & mcore-optimizer config\n", "ining-demo/0 [default0]:GPU available: True (cuda), used: True\n", "ining-demo/0 [default0]:TPU available: False, using: 0 TPU cores\n", "ining-demo/0 [default0]:HPU available: False, using: 0 HPUs\n", "ining-demo/0 [default0]:[NeMo W 2025-03-07 22:33:56 nemo_logger:123] No version folders would be created under the log folder as 'resume_if_exists' is enabled.\n", "ining-demo/0 [default0]:[NeMo W 2025-03-07 22:33:56 nemo_logger:173] \"update_logger_directory\" is True. Overwriting tensorboard logger \"save_dir\" to /tmp/nemo_run/checkpoints/tb_logs\n", "ining-demo/0 [default0]:[NeMo W 2025-03-07 22:33:56 nemo_logger:189] The Trainer already contains a ModelCheckpoint callback. This will be overwritten.\n", "ining-demo/0 [default0]:[NeMo W 2025-03-07 22:33:56 nemo_logger:212] The checkpoint callback was told to monitor a validation value and trainer's max_steps was set to 20. Please ensure that max_steps will run for at least 1 epochs to ensure that checkpointing will not error out.\n", "ining-demo/0 [default0]:[NeMo W 2025-03-07 22:33:56 resume:228] There were no checkpoints found in checkpoint_dir or no checkpoint folder at checkpoint_dir :/tmp/nemo_run/checkpoints/resiliency-in-pretraining-demo/checkpoints. 
Training from scratch.\n", "ining-demo/0 [default3]:Initializing distributed: GLOBAL_RANK: 3, MEMBER: 4/8\n", "ining-demo/0 [default5]:Initializing distributed: GLOBAL_RANK: 5, MEMBER: 6/8\n", "ining-demo/0 [default0]:[NeMo I 2025-03-07 22:33:57 megatron_init:426] Rank 0 has data parallel group : [0, 2, 4, 6]\n", "ining-demo/0 [default0]:[NeMo I 2025-03-07 22:33:57 megatron_init:432] Rank 0 has combined group of data parallel and context parallel : [0, 2, 4, 6]\n", "ining-demo/0 [default0]:[NeMo I 2025-03-07 22:33:57 megatron_init:437] All data parallel group ranks with context parallel combined: [[0, 2, 4, 6], [1, 3, 5, 7]]\n", "ining-demo/0 [default0]:[NeMo I 2025-03-07 22:33:57 megatron_init:440] Ranks 0 has data parallel rank: 0\n", "ining-demo/0 [default0]:[NeMo I 2025-03-07 22:33:57 megatron_init:448] Rank 0 has context parallel group: [0]\n", "ining-demo/0 [default0]:[NeMo I 2025-03-07 22:33:57 megatron_init:451] All context parallel group ranks: [[0], [1], [2], [3], [4], [5], [6], [7]]\n", "ining-demo/0 [default0]:[NeMo I 2025-03-07 22:33:57 megatron_init:452] Ranks 0 has context parallel rank: 0\n", "ining-demo/0 [default0]:[NeMo I 2025-03-07 22:33:57 megatron_init:459] Rank 0 has model parallel group: [0, 1]\n", "ining-demo/0 [default0]:[NeMo I 2025-03-07 22:33:57 megatron_init:460] All model parallel group ranks: [[0, 1], [2, 3], [4, 5], [6, 7]]\n", "ining-demo/0 [default0]:[NeMo I 2025-03-07 22:33:57 megatron_init:469] Rank 0 has tensor model parallel group: [0, 1]\n", "ining-demo/0 [default0]:[NeMo I 2025-03-07 22:33:57 megatron_init:473] All tensor model parallel group ranks: [[0, 1], [2, 3], [4, 5], [6, 7]]\n", "ining-demo/0 [default0]:[NeMo I 2025-03-07 22:33:57 megatron_init:474] Rank 0 has tensor model parallel rank: 0\n", "ining-demo/0 [default0]:[NeMo I 2025-03-07 22:33:57 megatron_init:494] Rank 0 has pipeline model parallel group: [0]\n", "ining-demo/0 [default0]:[NeMo I 2025-03-07 22:33:57 megatron_init:506] Rank 0 has embedding group: [0]\n", "ining-demo/0 [default0]:[NeMo I 2025-03-07 22:33:57 megatron_init:512] All pipeline model parallel group ranks: [[0], [1], [2], [3], [4], [5], [6], [7]]\n", "ining-demo/0 [default0]:[NeMo I 2025-03-07 22:33:57 megatron_init:513] Rank 0 has pipeline model parallel rank 0\n", "ining-demo/0 [default0]:[NeMo I 2025-03-07 22:33:57 megatron_init:514] All embedding group ranks: [[0], [1], [2], [3], [4], [5], [6], [7]]\n", "ining-demo/0 [default0]:[NeMo I 2025-03-07 22:33:57 megatron_init:515] Rank 0 has embedding rank: 0\n", "ining-demo/0 [default0]:Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/8\n", "ining-demo/0 [default0]:----------------------------------------------------------------------------------------------------\n", "ining-demo/0 [default0]:distributed_backend=nccl\n", "ining-demo/0 [default0]:All distributed processes registered. 
Starting with 8 processes\n", "ining-demo/0 [default0]:----------------------------------------------------------------------------------------------------\n", "ining-demo/0 [default0]:\n", "ining-demo/0 [default4]:Initializing distributed: GLOBAL_RANK: 4, MEMBER: 5/8\n", "ining-demo/0 [default2]:Initializing distributed: GLOBAL_RANK: 2, MEMBER: 3/8\n", "ining-demo/0 [default1]:Initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/8\n", "ining-demo/0 [default7]:Initializing distributed: GLOBAL_RANK: 7, MEMBER: 8/8\n", "ining-demo/0 [default6]:Initializing distributed: GLOBAL_RANK: 6, MEMBER: 7/8\n", "ining-demo/0 [default0]:[NeMo I 2025-03-07 22:33:59 base:44] Padded vocab_size: 50432, original vocab_size: 50257, dummy tokens: 175.\n", "ining-demo/0 [default0]:[NeMo I 2025-03-07 22:33:59 megatron_strategy:327] Copying Trainer's 'max_steps' (20) to LR scheduler's 'max_steps'.\n", "ining-demo/0 [default0]:[NeMo I 2025-03-07 22:33:59 num_microbatches_calculator:228] setting number of microbatches to constant 128\n", "ining-demo/0 [default0]:[NeMo I 2025-03-07 22:33:59 megatron_parallel:549] > number of parameters on (tensor, pipeline) model parallel rank (0, 0): 54663936\n", "ining-demo/0 [default0]:[NeMo I 2025-03-07 22:33:59 utils:302] Setting up DistributedDataParallel with config DistributedDataParallelConfig(grad_reduce_in_fp32=True, overlap_grad_reduce=True, overlap_param_gather=True, align_param_gather=False, use_distributed_optimizer=True, num_distributed_optimizer_instances=1, check_for_nan_in_grad=True, bucket_size=40000000, average_in_collective=True, fp8_param_gather=False)\n", "ining-demo/0 [default0]:[NeMo I 2025-03-07 22:33:59 utils:323] Number of buckets for gradient all-reduce / reduce-scatter: 1\n", "ining-demo/0 [default0]: Params for bucket 1 (54663936 elements):\n", "ining-demo/0 [default0]: \tmodule.decoder.layers.3.self_attention.linear_proj.weight\n", "ining-demo/0 [default0]: \tmodule.decoder.layers.1.mlp.linear_fc1.layer_norm_weight\n", "ining-demo/0 [default0]: \tmodule.decoder.layers.2.self_attention.linear_proj.weight\n", "ining-demo/0 [default0]: \tmodule.decoder.layers.3.mlp.linear_fc1.layer_norm_weight\n", "ining-demo/0 [default0]: \tmodule.decoder.layers.2.mlp.linear_fc2.weight\n", "ining-demo/0 [default0]: \tmodule.decoder.layers.1.self_attention.linear_proj.weight\n", "ining-demo/0 [default0]: \tmodule.decoder.layers.0.self_attention.linear_qkv.weight\n", "ining-demo/0 [default0]: \tmodule.decoder.layers.1.mlp.linear_fc2.weight\n", "ining-demo/0 [default0]: \tmodule.decoder.layers.1.mlp.linear_fc1.weight\n", "ining-demo/0 [default0]: \tmodule.decoder.layers.3.mlp.linear_fc2.weight\n", "ining-demo/0 [default0]: \tmodule.decoder.layers.2.mlp.linear_fc1.weight\n", "ining-demo/0 [default0]: \tmodule.decoder.layers.0.mlp.linear_fc1.weight\n", "ining-demo/0 [default0]: \tmodule.decoder.layers.0.mlp.linear_fc1.layer_norm_weight\n", "ining-demo/0 [default0]: \tmodule.decoder.layers.2.self_attention.linear_qkv.weight\n", "ining-demo/0 [default0]: \tmodule.decoder.layers.0.mlp.linear_fc2.weight\n", "ining-demo/0 [default0]: \tmodule.output_layer.weight\n", "ining-demo/0 [default0]: \tmodule.decoder.layers.3.mlp.linear_fc1.weight\n", "ining-demo/0 [default0]: \tmodule.decoder.final_layernorm.weight\n", "ining-demo/0 [default0]: \tmodule.decoder.layers.2.self_attention.linear_qkv.layer_norm_weight\n", "ining-demo/0 [default0]: \tmodule.decoder.layers.0.self_attention.linear_qkv.layer_norm_weight\n", "ining-demo/0 [default0]: 
\tmodule.decoder.layers.0.self_attention.linear_proj.weight\n", "ining-demo/0 [default0]: \tmodule.decoder.layers.3.self_attention.linear_qkv.weight\n", "ining-demo/0 [default0]: \tmodule.decoder.layers.1.self_attention.linear_qkv.weight\n", "ining-demo/0 [default0]: \tmodule.embedding.word_embeddings.weight\n", "ining-demo/0 [default0]: \tmodule.decoder.layers.1.self_attention.linear_qkv.layer_norm_weight\n", "ining-demo/0 [default0]: \tmodule.decoder.layers.3.self_attention.linear_qkv.layer_norm_weight\n", "ining-demo/0 [default0]: \tmodule.decoder.layers.2.mlp.linear_fc1.layer_norm_weight\n", "ining-demo/0 [default0]:[NeMo I 2025-03-07 22:33:59 utils:302] Setting up optimizer with config OptimizerConfig(optimizer='adam', lr=0.0003, min_lr=None, decoupled_lr=None, decoupled_min_lr=None, weight_decay=0.1, fp16=False, bf16=True, params_dtype=torch.bfloat16, loss_scale=None, initial_loss_scale=4294967296, min_loss_scale=1.0, loss_scale_window=1000, hysteresis=2, adam_beta1=0.9, adam_beta2=0.95, adam_eps=1e-05, sgd_momentum=0.9, use_distributed_optimizer=True, overlap_param_gather_with_optimizer_step=False, clip_grad=1.0, log_num_zeros_in_grad=False, barrier_with_L1_time=False, timers=None, config_logger_dir='')\n", "ining-demo/0 [default5]:LOCAL_RANK: 5 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]\n", "ining-demo/0 [default3]:LOCAL_RANK: 3 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]\n", "ining-demo/0 [default4]:LOCAL_RANK: 4 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]\n", "ining-demo/0 [default2]:LOCAL_RANK: 2 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]\n", "ining-demo/0 [default1]:LOCAL_RANK: 1 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]\n", "ining-demo/0 [default0]:LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]\n", "ining-demo/0 [default0]:\n", "ining-demo/0 [default7]:LOCAL_RANK: 7 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]\n", "ining-demo/0 [default0]: | Name | Type | Params | Mode \n", "ining-demo/0 [default0]:----------------------------------------\n", "ining-demo/0 [default0]:0 | module | DDP | 54.7 M | train\n", "ining-demo/0 [default0]:----------------------------------------\n", "ining-demo/0 [default0]:54.7 M Trainable params\n", "ining-demo/0 [default0]:0 Non-trainable params\n", "ining-demo/0 [default0]:54.7 M Total params\n", "ining-demo/0 [default0]:218.656 Total estimated model params size (MB)\n", "ining-demo/0 [default0]:91 Modules in train mode\n", "ining-demo/0 [default0]:0 Modules in eval mode\n", "ining-demo/0 [default6]:LOCAL_RANK: 6 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]\n", "ining-demo/0 [default0]:[NeMo W 2025-03-07 22:34:10 rerun_state_machine:1088] Implicit initialization of Rerun State Machine!\n", "ining-demo/0 [default0]:[NeMo W 2025-03-07 22:34:10 rerun_state_machine:211] RerunStateMachine initialized in mode disabled\n", "ining-demo/0 [default0]:Training epoch 0, iteration 0/19 | lr: 1.499e-07 | global_batch_size: 512 | global_step: 0 | reduced_train_loss: 11.03 | train_step_timing in s: 10.31\n", "ining-demo/0 [default0]:Training epoch 0, iteration 1/19 | lr: 2.999e-07 | global_batch_size: 512 | global_step: 1 | reduced_train_loss: 11.03 | train_step_timing in s: 7.549 | consumed_samples: 1024\n", "ining-demo/0 [default0]:Training epoch 0, iteration 2/19 | lr: 4.498e-07 | global_batch_size: 512 | global_step: 2 | reduced_train_loss: 11.03 | train_step_timing in s: 7.556 | consumed_samples: 1536\n", "ining-demo/0 [default0]:Training epoch 0, iteration 3/19 | lr: 5.997e-07 | global_batch_size: 512 | global_step: 3 | reduced_train_loss: 11.03 | 
train_step_timing in s: 7.561 | consumed_samples: 2048\n", "ining-demo/0 [default0]:Training epoch 0, iteration 4/19 | lr: 7.496e-07 | global_batch_size: 512 | global_step: 4 | reduced_train_loss: 11.03 | train_step_timing in s: 7.566 | consumed_samples: 2560\n", "ining-demo/0 [default0]:Training epoch 0, iteration 5/19 | lr: 8.996e-07 | global_batch_size: 512 | global_step: 5 | reduced_train_loss: 11.03 | train_step_timing in s: 7.559 | consumed_samples: 3072\n", "ining-demo/0 [default0]:Training epoch 0, iteration 6/19 | lr: 1.049e-06 | global_batch_size: 512 | global_step: 6 | reduced_train_loss: 11.03 | train_step_timing in s: 7.568 | consumed_samples: 3584\n", "ining-demo/0 [default0]:Training epoch 0, iteration 7/19 | lr: 1.199e-06 | global_batch_size: 512 | global_step: 7 | reduced_train_loss: 11.03 | train_step_timing in s: 7.565 | consumed_samples: 4096\n", "ining-demo/0 [default0]:Training epoch 0, iteration 8/19 | lr: 1.349e-06 | global_batch_size: 512 | global_step: 8 | reduced_train_loss: 11.03 | train_step_timing in s: 7.572 | consumed_samples: 4608\n", "ining-demo/0 [default0]:Training epoch 0, iteration 9/19 | lr: 1.499e-06 | global_batch_size: 512 | global_step: 9 | reduced_train_loss: 11.03 | train_step_timing in s: 7.572 | consumed_samples: 5120\n", "ining-demo/0 [default0]:[NeMo W 2025-03-07 22:35:19 validation:389] There is difference in the common state dict in different ranks. The differences are {2: ([], [('optimizer', 0, 'optimizer', 'param_groups', 1, 'step')], []), 3: ([], [('optimizer', 0, 'optimizer', 'param_groups', 1, 'step')], []), 4: ([], [('optimizer', 0, 'optimizer', 'param_groups', 1, 'step')], []), 5: ([], [('optimizer', 0, 'optimizer', 'param_groups', 1, 'step')], [])}\n", "ining-demo/0 [default0]:[NeMo I 2025-03-07 22:35:22 model_checkpoint:497] Scheduled async checkpoint save for /tmp/nemo_run/checkpoints/resiliency-in-pretraining-demo/checkpoints/model_name=0--val_loss=0.00-step=9-consumed_samples=5120.0-last.ckpt\n", "ining-demo/0 [default0]:Training epoch 0, iteration 10/19 | lr: 1.649e-06 | global_batch_size: 512 | global_step: 10 | reduced_train_loss: 11.03 | train_step_timing in s: 7.568 | consumed_samples: 5632\n", "ining-demo/0 [default0]:[NeMo I 2025-03-07 22:35:29 model_checkpoint:522] Async checkpoint save for step 10 (/tmp/nemo_run/checkpoints/resiliency-in-pretraining-demo/checkpoints/model_name=0--val_loss=0.00-step=9-consumed_samples=5120.0-last.ckpt) finalized successfully.\n", "ining-demo/0 [default0]:Training epoch 0, iteration 11/19 | lr: 1.799e-06 | global_batch_size: 512 | global_step: 11 | reduced_train_loss: 11.03 | train_step_timing in s: 7.578 | consumed_samples: 6144\n", "ining-demo/0 [default0]:Training epoch 0, iteration 12/19 | lr: 1.949e-06 | global_batch_size: 512 | global_step: 12 | reduced_train_loss: 11.03 | train_step_timing in s: 7.577 | consumed_samples: 6656\n", "ining-demo/0 [default0]:Training epoch 0, iteration 13/19 | lr: 2.099e-06 | global_batch_size: 512 | global_step: 13 | reduced_train_loss: 11.03 | train_step_timing in s: 7.575 | consumed_samples: 7168\n", "ining-demo/0 [default0]:Training epoch 0, iteration 14/19 | lr: 2.249e-06 | global_batch_size: 512 | global_step: 14 | reduced_train_loss: 11.03 | train_step_timing in s: 7.58 | consumed_samples: 7680\n", "ining-demo/0 [default0]:Training epoch 0, iteration 15/19 | lr: 2.399e-06 | global_batch_size: 512 | global_step: 15 | reduced_train_loss: 11.03 | train_step_timing in s: 7.578 | consumed_samples: 8192\n", "ining-demo/0 [default0]:Training epoch 0, 
iteration 16/19 | lr: 2.549e-06 | global_batch_size: 512 | global_step: 16 | reduced_train_loss: 11.03 | train_step_timing in s: 7.578 | consumed_samples: 8704\n", "ining-demo/0 [default0]:Training epoch 0, iteration 17/19 | lr: 2.699e-06 | global_batch_size: 512 | global_step: 17 | reduced_train_loss: 11.03 | train_step_timing in s: 7.584 | consumed_samples: 9216\n", "ining-demo/0 [default0]:Training epoch 0, iteration 18/19 | lr: 2.849e-06 | global_batch_size: 512 | global_step: 18 | reduced_train_loss: 11.03 | train_step_timing in s: 7.579 | consumed_samples: 9728\n", "ining-demo/0 [default0]:[NeMo W 2025-03-07 22:36:30 validation:389] There is difference in the common state dict in different ranks. The differences are {2: ([], [('optimizer', 0, 'optimizer', 'param_groups', 1, 'step')], []), 3: ([], [('optimizer', 0, 'optimizer', 'param_groups', 1, 'step')], []), 4: ([], [('optimizer', 0, 'optimizer', 'param_groups', 1, 'step')], []), 5: ([], [('optimizer', 0, 'optimizer', 'param_groups', 1, 'step')], [])}\n", "ining-demo/0 [default0]:[NeMo I 2025-03-07 22:36:31 model_checkpoint:497] Scheduled async checkpoint save for /tmp/nemo_run/checkpoints/resiliency-in-pretraining-demo/checkpoints/model_name=0--val_loss=0.00-step=18-consumed_samples=9728.0-last.ckpt\n", "ining-demo/0 [default0]:Training epoch 1, iteration 0/19 | lr: 2.999e-06 | global_batch_size: 512 | global_step: 19 | reduced_train_loss: 11.03 | train_step_timing in s: 7.592 | consumed_samples: 10240\n", "ining-demo/0 [default0]:[NeMo I 2025-03-07 22:36:39 model_checkpoint:522] Async checkpoint save for step 19 (/tmp/nemo_run/checkpoints/resiliency-in-pretraining-demo/checkpoints/model_name=0--val_loss=0.00-step=18-consumed_samples=9728.0-last.ckpt) finalized successfully.\n", "ining-demo/0 [default0]:`Trainer.fit` stopped: `max_steps=20` reached.\n", "ining-demo/0 I0307 22:36:44.958000 170829 torch/distributed/elastic/agent/server/api.py:864] [default] worker group successfully finished. Waiting 300 seconds for other agents to finish.\n", "ining-demo/0 I0307 22:36:44.958000 170829 torch/distributed/elastic/agent/server/api.py:917] Local worker group finished (WorkerState.SUCCEEDED). Waiting 300 seconds for other agents to finish\n", "ining-demo/0 I0307 22:36:44.959000 170829 torch/distributed/elastic/agent/server/api.py:931] Done waiting for other agents. Elapsed: 0.0002548694610595703 seconds\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "Job resiliency-in-pretraining-demo-xb3fnk3npq9wn finished: SUCCEEDED\n" ] }, { "data": { "text/html": [ "
                                                                                                                   \n",
       "# The experiment was run with the following tasks: ['resiliency-in-pretraining-demo']                              \n",
       "# You can inspect and reconstruct this experiment at a later point in time using:                                  \n",
       "experiment = run.Experiment.from_id(\"resiliency-in-pretraining-demo_1741386824\")                                   \n",
       "experiment.status() # Gets the overall status                                                                      \n",
       "experiment.logs(\"resiliency-in-pretraining-demo\") # Gets the log for the provided task                             \n",
       "experiment.cancel(\"resiliency-in-pretraining-demo\") # Cancels the provided task if still running                   \n",
       "                                                                                                                   \n",
       "
\n" ], "text/plain": [ "\u001b[48;2;39;40;34m \u001b[0m\n", "\u001b[38;2;149;144;119;48;2;39;40;34m# The experiment was run with the following tasks: ['resiliency-in-pretraining-demo']\u001b[0m\u001b[48;2;39;40;34m \u001b[0m\n", "\u001b[38;2;149;144;119;48;2;39;40;34m# You can inspect and reconstruct this experiment at a later point in time using:\u001b[0m\u001b[48;2;39;40;34m \u001b[0m\n", "\u001b[38;2;248;248;242;48;2;39;40;34mexperiment\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;255;70;137;48;2;39;40;34m=\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mrun\u001b[0m\u001b[38;2;255;70;137;48;2;39;40;34m.\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mExperiment\u001b[0m\u001b[38;2;255;70;137;48;2;39;40;34m.\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mfrom_id\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m(\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m\"\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34mresiliency-in-pretraining-demo_1741386824\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m\"\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m)\u001b[0m\u001b[48;2;39;40;34m \u001b[0m\n", "\u001b[38;2;248;248;242;48;2;39;40;34mexperiment\u001b[0m\u001b[38;2;255;70;137;48;2;39;40;34m.\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mstatus\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m(\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m)\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;149;144;119;48;2;39;40;34m# Gets the overall status\u001b[0m\u001b[48;2;39;40;34m \u001b[0m\n", "\u001b[38;2;248;248;242;48;2;39;40;34mexperiment\u001b[0m\u001b[38;2;255;70;137;48;2;39;40;34m.\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mlogs\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m(\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m\"\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34mresiliency-in-pretraining-demo\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m\"\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m)\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;149;144;119;48;2;39;40;34m# Gets the log for the provided task\u001b[0m\u001b[48;2;39;40;34m \u001b[0m\n", "\u001b[38;2;248;248;242;48;2;39;40;34mexperiment\u001b[0m\u001b[38;2;255;70;137;48;2;39;40;34m.\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mcancel\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m(\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m\"\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34mresiliency-in-pretraining-demo\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m\"\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m)\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;149;144;119;48;2;39;40;34m# Cancels the provided task if still running\u001b[0m\u001b[48;2;39;40;34m \u001b[0m\n", "\u001b[48;2;39;40;34m \u001b[0m\n" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
                                                                                                                   \n",
       "# You can inspect this experiment at a later point in time using the CLI as well:                                  \n",
       "nemo experiment status resiliency-in-pretraining-demo_1741386824                                                   \n",
       "nemo experiment logs resiliency-in-pretraining-demo_1741386824 0                                                   \n",
       "nemo experiment cancel resiliency-in-pretraining-demo_1741386824 0                                                 \n",
       "                                                                                                                   \n",
       "
\n" ], "text/plain": [ "\u001b[48;2;39;40;34m \u001b[0m\n", "\u001b[38;2;149;144;119;48;2;39;40;34m# You can inspect this experiment at a later point in time using the CLI as well:\u001b[0m\u001b[48;2;39;40;34m \u001b[0m\n", "\u001b[38;2;248;248;242;48;2;39;40;34mnemo\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mexperiment\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mstatus\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mresiliency-in-pretraining-demo_1741386824\u001b[0m\u001b[48;2;39;40;34m \u001b[0m\n", "\u001b[38;2;248;248;242;48;2;39;40;34mnemo\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mexperiment\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mlogs\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mresiliency-in-pretraining-demo_1741386824\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;174;129;255;48;2;39;40;34m0\u001b[0m\u001b[48;2;39;40;34m \u001b[0m\n", "\u001b[38;2;248;248;242;48;2;39;40;34mnemo\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mexperiment\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mcancel\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mresiliency-in-pretraining-demo_1741386824\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;174;129;255;48;2;39;40;34m0\u001b[0m\u001b[48;2;39;40;34m \u001b[0m\n", "\u001b[48;2;39;40;34m \u001b[0m\n" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# run the experiment\n", "run_plugins = []\n", "run_experiment(exp_name, pretrain, executor, run_plugins, dryrun=False)" ] }, { "cell_type": "markdown", "id": "029deebc-505f-4609-a274-38825bb91971", "metadata": {}, "source": [ "## 1.5 Cleanup and save clean states" ] }, { "cell_type": "code", "execution_count": 9, "id": "0bf0f185-b943-4050-b74d-6d8a4c0333bc", "metadata": {}, "outputs": [], "source": [ "%%bash\n", "# delete old checkpoints\n", "rm -rf /tmp/nemo_run/checkpoints/" ] }, { "cell_type": "code", "execution_count": 10, "id": "57feef98-da24-4ac5-b218-9dad6249d7c1", "metadata": {}, "outputs": [], "source": [ "pretrain_trainer_callbacks = copy.deepcopy(pretrain.trainer.callbacks)\n", "pretrain_trainer_callbacks\n", "run_plugins = []" ] }, { "cell_type": "markdown", "id": "71a72081-6f2e-46e7-9bec-2122b1d45acf", "metadata": {}, "source": [ "# 2. 
Demonstrate fault tolerance with crash detection and in-job restart\n", "The [Fault Tolerance plugin](https://github.com/NVIDIA/NeMo/blob/main/nemo/lightning/run/plugins.py):\n", "- Detects hangs and crashes during training and relaunches the training job without manual intervention.\n", "- Uses NVIDIA Resiliency Extension's `ft_launcher`, which has been integrated into [NeMo-Run](https://github.com/NVIDIA/NeMo-Run) as [FaultTolerance](https://github.com/NVIDIA/NeMo-Run/blob/main/nemo_run/core/execution/launcher.py).\n", "- Also uses the `FaultToleranceCallback` from the NVIDIA Resiliency Extension, which sets up the heartbeats." ] }, { "cell_type": "markdown", "id": "5d4c0e6b-551b-46d9-8c3e-73626c5ab147", "metadata": {}, "source": [ "## 2.1 Set up the FaultTolerancePlugin\n", "The following environment variables also need to be set:\n", "- `FAULT_TOL_CFG_PATH` is the path to the fault tolerance config file. If it is empty, the default configuration is used.\n", "- `FAULT_TOL_FINISHED_FLAG_FILE` is the path of the flag file the fault tolerance package writes when a run completes successfully, so that a relaunch is not triggered." ] }, { "cell_type": "code", "execution_count": 11, "id": "99a1e083", "metadata": {}, "outputs": [], "source": [ "# Add the FaultTolerancePlugin and set up the required env vars\n", "run_plugins = [plugins.FaultTolerancePlugin()]\n", "\n", "os.environ[\"FAULT_TOL_CFG_PATH\"] = \"/tmp/sample_job_ft_cfg.yml\"\n", "os.environ[\"FAULT_TOL_FINISHED_FLAG_FILE\"] = \"/tmp/sample_job_finished_flag\"" ] }, { "cell_type": "markdown", "id": "bd88a1a5", "metadata": {}, "source": [ "## 2.2 Set up the crash simulator and run the experiment\n", "We use the `CrashSimulationCallback` to simulate a crash during training. This callback is configured to crash the process at step 17 if a crash has not already occurred (a minimal sketch of such a callback is shown below).\n", "\n", "Expected workflow:\n", "- Start training: Trainer step counter = 0\n", "- After 10 trainer steps: Trainer step counter = 10 -> save checkpoint\n", "- After 17 trainer steps: Trainer step counter = 17 -> crash simulated, set `has_simulated_crash_happened` to `True`\n", "- Automatic in-job restart from checkpoint at step 10: Trainer step counter = 10\n", "- After 17 trainer steps: Trainer step counter = 17 -> no crash simulated as `has_simulated_crash_happened == True`\n", "- After 20 trainer steps: Trainer step counter = 20 -> successfully completes training" ] }, 
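{ "cell_type": "markdown", "id": "c0ffee42", "metadata": {}, "source": [ "For reference, here is a minimal sketch of what such a callback could look like. The actual implementation lives in the local `crash_simulator.py`; the class name and the marker-file mechanism used here to remember, across the in-job restart, that the crash already happened are illustrative assumptions:\n", "\n", "```python\n", "import os\n", "\n", "from lightning.pytorch.callbacks import Callback\n", "\n", "\n", "class CrashSimulationCallbackSketch(Callback):\n", "    \"\"\"Illustrative only: raise once at a chosen step, then never again.\"\"\"\n", "\n", "    def __init__(self, crash_step: int = 17, marker_file: str = \"/tmp/crash_marker\"):\n", "        self.crash_step = crash_step\n", "        # The marker file survives the in-job restart, so the relaunched\n", "        # process knows the simulated crash has already happened (assumed mechanism).\n", "        self.marker_file = marker_file\n", "\n", "    def on_train_batch_end(self, trainer, pl_module, outputs, batch, batch_idx):\n", "        has_simulated_crash_happened = os.path.exists(self.marker_file)\n", "        if trainer.global_step == self.crash_step and not has_simulated_crash_happened:\n", "            with open(self.marker_file, \"w\") as f:\n", "                f.write(\"simulated crash recorded\")\n", "            raise RuntimeError(f\"Simulated crash at step {self.crash_step}\")\n", "```" ] }, { "cell_type": "code", "execution_count": 12, "id": "dd2943a2", "metadata": {}, "outputs": [ { "data": { "text/html": [ "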
────── Entering Experiment resiliency-in-pretraining-demo with id: resiliency-in-pretraining-demo_1741387007 ──────\n",
       "
\n" ], "text/plain": [ "\u001b[92m────── \u001b[0m\u001b[1;35mEntering Experiment resiliency-in-pretraining-demo with id: resiliency-in-pretraining-demo_1741387007\u001b[0m\u001b[92m ──────\u001b[0m\n" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "Launching training run ...\n" ] }, { "data": { "text/html": [ "
[22:36:47]  Cannot detach from this experiment. Please keep it running until completion.          experiment.py:651\n",
       "
\n" ], "text/plain": [ "\u001b[2;36m[22:36:47]\u001b[0m\u001b[2;36m \u001b[0m\u001b[1;31m Cannot detach from this experiment. Please keep it running until completion.\u001b[0m \u001b]8;id=246338;file:///opt/NeMo-Run/src/nemo_run/run/experiment.py\u001b\\\u001b[2mexperiment.py\u001b[0m\u001b]8;;\u001b\\\u001b[2m:\u001b[0m\u001b]8;id=43525;file:///opt/NeMo-Run/src/nemo_run/run/experiment.py#651\u001b\\\u001b[2m651\u001b[0m\u001b]8;;\u001b\\\n" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stderr", "output_type": "stream", "text": [ "Log directory is: /root/.nemo_run/experiments/resiliency-in-pretraining-demo/resiliency-in-pretraining-demo_1741387007/resiliency-in-pretraining-demo\n" ] }, { "data": { "text/html": [ "
           Launching job resiliency-in-pretraining-demo for experiment                            experiment.py:724\n",
       "           resiliency-in-pretraining-demo                                                                          \n",
       "
\n" ], "text/plain": [ "\u001b[2;36m \u001b[0m\u001b[2;36m \u001b[0m\u001b[1;36mLaunching job resiliency-in-pretraining-demo for experiment \u001b[0m \u001b]8;id=900784;file:///opt/NeMo-Run/src/nemo_run/run/experiment.py\u001b\\\u001b[2mexperiment.py\u001b[0m\u001b]8;;\u001b\\\u001b[2m:\u001b[0m\u001b]8;id=467950;file:///opt/NeMo-Run/src/nemo_run/run/experiment.py#724\u001b\\\u001b[2m724\u001b[0m\u001b]8;;\u001b\\\n", "\u001b[2;36m \u001b[0m\u001b[1;36mresiliency-in-pretraining-demo\u001b[0m \u001b[2m \u001b[0m\n" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stderr", "output_type": "stream", "text": [ "Log directory is: /root/.nemo_run/experiments/resiliency-in-pretraining-demo/resiliency-in-pretraining-demo_1741387007/resiliency-in-pretraining-demo\n", "Launched app: local_persistent://nemo_run/resiliency-in-pretraining-demo-ghpmrpzqnhtb0\n", "AppStatus:\n", " State: RUNNING\n", " Num Restarts: 0\n", " Roles: \n", " Msg: \n", " Structured Error Msg: \n", " UI URL: file:///root/.nemo_run/experiments/resiliency-in-pretraining-demo/resiliency-in-pretraining-demo_1741387007/resiliency-in-pretraining-demo/nemo_run/resiliency-in-pretraining-demo-ghpmrpzqnhtb0\n", " \n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Experiment executed successfully.\n" ] }, { "data": { "text/html": [ "
─────────────────── Waiting for Experiment resiliency-in-pretraining-demo_1741387007 to finish ────────────────────\n",
       "
\n" ], "text/plain": [ "\u001b[92m─────────────────── \u001b[0m\u001b[1;35mWaiting for Experiment resiliency-in-pretraining-demo_1741387007 to finish\u001b[0m\u001b[92m ────────────────────\u001b[0m\n" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
\n",
       "
\n" ], "text/plain": [ "\n" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
Experiment Status for resiliency-in-pretraining-demo_1741387007\n",
       "
\n" ], "text/plain": [ "\u001b[1;32mExperiment Status for\u001b[0m \u001b[1;38;5;214mresiliency-in-pretraining-demo_1741387007\u001b[0m\n" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
\n",
       "Task 0: resiliency-in-pretraining-demo\n",
       "- Status: RUNNING\n",
       "- Executor: LocalExecutor\n",
       "- Job id: resiliency-in-pretraining-demo-ghpmrpzqnhtb0\n",
       "- Local Directory: /root/.nemo_run/experiments/resiliency-in-pretraining-demo/resiliency-in-pretraining-demo_1741387007/resiliency-in-pretraining-demo\n",
       "
\n" ], "text/plain": [ "\n", "\u001b[1;32mTask 0\u001b[0m: \u001b[1;38;5;214mresiliency-in-pretraining-demo\u001b[0m\n", "- \u001b[1;32mStatus\u001b[0m: RUNNING\n", "- \u001b[1;32mExecutor\u001b[0m: LocalExecutor\n", "- \u001b[1;32mJob id\u001b[0m: resiliency-in-pretraining-demo-ghpmrpzqnhtb0\n", "- \u001b[1;32mLocal Directory\u001b[0m: /root/.nemo_run/experiments/resiliency-in-pretraining-demo/resiliency-in-pretraining-demo_1741387007/resiliency-in-pretraining-demo\n" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
\n",
       "
\n" ], "text/plain": [ "\n" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stderr", "output_type": "stream", "text": [ "Waiting for job resiliency-in-pretraining-demo-ghpmrpzqnhtb0 to finish [log=True]...\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "ining-demo/0 [2025-03-07 22:36:48,812] [WARNING] [ft_launcher@4809917ea058] master_addr is only used for static rdzv_backend and when rdzv_endpoint is not specified.\n", "ining-demo/0 [2025-03-07 22:36:48,812] [WARNING] [ft_launcher@4809917ea058] \n", "ining-demo/0 *****************************************\n", "ining-demo/0 Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. \n", "ining-demo/0 *****************************************\n", "ining-demo/0 [2025-03-07 22:36:48,816] [INFO] [ft_launcher@4809917ea058] [default] starting workers for entrypoint: python\n", "ining-demo/0 [2025-03-07 22:36:48,817] [INFO] [ft_launcher@4809917ea058] [default] Rendezvous'ing worker group\n", "ining-demo/0 [2025-03-07 22:36:49,126] [INFO] [ft_launcher@4809917ea058] [default] Rendezvous complete for workers. Result:\n", "ining-demo/0 restart_count=0\n", "ining-demo/0 master_addr=4809917ea058\n", "ining-demo/0 master_port=47865\n", "ining-demo/0 group_rank=0\n", "ining-demo/0 group_world_size=1\n", "ining-demo/0 local_ranks=[0, 1, 2, 3, 4, 5, 6, 7]\n", "ining-demo/0 role_ranks=[0, 1, 2, 3, 4, 5, 6, 7]\n", "ining-demo/0 global_ranks=[0, 1, 2, 3, 4, 5, 6, 7]\n", "ining-demo/0 role_world_sizes=[8, 8, 8, 8, 8, 8, 8, 8]\n", "ining-demo/0 global_world_sizes=[8, 8, 8, 8, 8, 8, 8, 8]\n", "ining-demo/0 \n", "ining-demo/0 [2025-03-07 22:36:49,126] [INFO] [ft_launcher@4809917ea058] [default] Starting worker group\n", "ining-demo/0 [2025-03-07 22:37:00,008] [INFO] [ft_launcher@4809917ea058] Setting worker0 reply file to: /root/.nemo_run/experiments/resiliency-in-pretraining-demo/resiliency-in-pretraining-demo_1741387007/resiliency-in-pretraining-demo/nemo_run/resiliency-in-pretraining-demo-ghpmrpzqnhtb0/torchelastic/resiliency-in-pretraining-demo/6224_gnxzn9y9/attempt_0/0/error.json\n", "ining-demo/0 [2025-03-07 22:37:00,008] [INFO] [ft_launcher@4809917ea058] Setting worker1 reply file to: /root/.nemo_run/experiments/resiliency-in-pretraining-demo/resiliency-in-pretraining-demo_1741387007/resiliency-in-pretraining-demo/nemo_run/resiliency-in-pretraining-demo-ghpmrpzqnhtb0/torchelastic/resiliency-in-pretraining-demo/6224_gnxzn9y9/attempt_0/1/error.json\n", "ining-demo/0 [2025-03-07 22:37:00,008] [INFO] [ft_launcher@4809917ea058] Setting worker2 reply file to: /root/.nemo_run/experiments/resiliency-in-pretraining-demo/resiliency-in-pretraining-demo_1741387007/resiliency-in-pretraining-demo/nemo_run/resiliency-in-pretraining-demo-ghpmrpzqnhtb0/torchelastic/resiliency-in-pretraining-demo/6224_gnxzn9y9/attempt_0/2/error.json\n", "ining-demo/0 [2025-03-07 22:37:00,008] [INFO] [ft_launcher@4809917ea058] Setting worker3 reply file to: /root/.nemo_run/experiments/resiliency-in-pretraining-demo/resiliency-in-pretraining-demo_1741387007/resiliency-in-pretraining-demo/nemo_run/resiliency-in-pretraining-demo-ghpmrpzqnhtb0/torchelastic/resiliency-in-pretraining-demo/6224_gnxzn9y9/attempt_0/3/error.json\n", "ining-demo/0 [2025-03-07 22:37:00,008] [INFO] [ft_launcher@4809917ea058] Setting worker4 reply file to: 
/root/.nemo_run/experiments/resiliency-in-pretraining-demo/resiliency-in-pretraining-demo_1741387007/resiliency-in-pretraining-demo/nemo_run/resiliency-in-pretraining-demo-ghpmrpzqnhtb0/torchelastic/resiliency-in-pretraining-demo/6224_gnxzn9y9/attempt_0/4/error.json\n", "ining-demo/0 [2025-03-07 22:37:00,008] [INFO] [ft_launcher@4809917ea058] Setting worker5 reply file to: /root/.nemo_run/experiments/resiliency-in-pretraining-demo/resiliency-in-pretraining-demo_1741387007/resiliency-in-pretraining-demo/nemo_run/resiliency-in-pretraining-demo-ghpmrpzqnhtb0/torchelastic/resiliency-in-pretraining-demo/6224_gnxzn9y9/attempt_0/5/error.json\n", "ining-demo/0 [2025-03-07 22:37:00,009] [INFO] [ft_launcher@4809917ea058] Setting worker6 reply file to: /root/.nemo_run/experiments/resiliency-in-pretraining-demo/resiliency-in-pretraining-demo_1741387007/resiliency-in-pretraining-demo/nemo_run/resiliency-in-pretraining-demo-ghpmrpzqnhtb0/torchelastic/resiliency-in-pretraining-demo/6224_gnxzn9y9/attempt_0/6/error.json\n", "ining-demo/0 [2025-03-07 22:37:00,009] [INFO] [ft_launcher@4809917ea058] Setting worker7 reply file to: /root/.nemo_run/experiments/resiliency-in-pretraining-demo/resiliency-in-pretraining-demo_1741387007/resiliency-in-pretraining-demo/nemo_run/resiliency-in-pretraining-demo-ghpmrpzqnhtb0/torchelastic/resiliency-in-pretraining-demo/6224_gnxzn9y9/attempt_0/7/error.json\n", "ining-demo/0 [default0]:[NeMo W 2025-03-07 22:37:09 nemo_logging:361] /usr/local/lib/python3.10/dist-packages/pyannote/core/notebook.py:134: MatplotlibDeprecationWarning: The get_cmap function was deprecated in Matplotlib 3.7 and will be removed in 3.11. Use ``matplotlib.colormaps[name]`` or ``matplotlib.colormaps.get_cmap()`` or ``pyplot.get_cmap()`` instead.\n", "ining-demo/0 [default0]: cm = get_cmap(\"Set1\")\n", "ining-demo/0 [default0]: \n", "ining-demo/0 [default2]:Setup to simulate a crash if step == 17 and a crash hasn't been simulated before\n", "ining-demo/0 [default2]:Initializing distributed: GLOBAL_RANK: 2, MEMBER: 3/8\n", "ining-demo/0 [default6]:Setup to simulate a crash if step == 17 and a crash hasn't been simulated before\n", "ining-demo/0 [default0]:[NeMo I 2025-03-07 22:37:10 tokenizer_utils:224] Getting Megatron tokenizer for pretrained model name: megatron-gpt-345m, custom vocab file: None, and merges file: None\n", "ining-demo/0 [default0]:[NeMo I 2025-03-07 22:37:10 tokenizer_utils:130] Getting HuggingFace AutoTokenizer with pretrained_model_name: gpt2, vocab_file: /root/.cache/torch/megatron/megatron-gpt-345m_vocab, merges_files: /root/.cache/torch/megatron/megatron-gpt-345m_merges, special_tokens_dict: {}, and use_fast: False\n", "ining-demo/0 [default4]:Setup to simulate a crash if step == 17 and a crash hasn't been simulated before\n", "ining-demo/0 [default1]:Setup to simulate a crash if step == 17 and a crash hasn't been simulated before\n", "ining-demo/0 [default0]:Setup to simulate a crash if step == 17 and a crash hasn't been simulated before\n", "ining-demo/0 [default0]:[NeMo I 2025-03-07 22:37:11 nemo_logger:145] Experiments will be logged at /tmp/nemo_run/checkpoints/resiliency-in-pretraining-demo\n", "ining-demo/0 [default0]:[NeMo I 2025-03-07 22:37:11 megatron_strategy:315] Fixing mis-match between ddp-config & mcore-optimizer config\n", "ining-demo/0 [default7]:Setup to simulate a crash if step == 17 and a crash hasn't been simulated before\n", "ining-demo/0 [default0]:GPU available: True (cuda), used: True\n", "ining-demo/0 [default0]:TPU available: False, using: 0 TPU 
cores\n", "ining-demo/0 [default0]:HPU available: False, using: 0 HPUs\n", "ining-demo/0 [default0]:[NeMo W 2025-03-07 22:37:11 nemo_logger:123] No version folders would be created under the log folder as 'resume_if_exists' is enabled.\n", "ining-demo/0 [default0]:[NeMo W 2025-03-07 22:37:11 nemo_logger:173] \"update_logger_directory\" is True. Overwriting tensorboard logger \"save_dir\" to /tmp/nemo_run/checkpoints/tb_logs\n", "ining-demo/0 [default0]:[NeMo W 2025-03-07 22:37:11 nemo_logger:189] The Trainer already contains a ModelCheckpoint callback. This will be overwritten.\n", "ining-demo/0 [default0]:[NeMo W 2025-03-07 22:37:11 nemo_logger:212] The checkpoint callback was told to monitor a validation value and trainer's max_steps was set to 20. Please ensure that max_steps will run for at least 1 epochs to ensure that checkpointing will not error out.\n", "ining-demo/0 [default0]:[NeMo W 2025-03-07 22:37:11 resume:228] There were no checkpoints found in checkpoint_dir or no checkpoint folder at checkpoint_dir :/tmp/nemo_run/checkpoints/resiliency-in-pretraining-demo/checkpoints. Training from scratch.\n", "ining-demo/0 [default3]:Setup to simulate a crash if step == 17 and a crash hasn't been simulated before\n", "ining-demo/0 [default5]:Setup to simulate a crash if step == 17 and a crash hasn't been simulated before\n", "ining-demo/0 [default6]:Initializing distributed: GLOBAL_RANK: 6, MEMBER: 7/8\n", "ining-demo/0 [default0]:[NeMo I 2025-03-07 22:37:12 megatron_init:426] Rank 0 has data parallel group : [0, 2, 4, 6]\n", "ining-demo/0 [default0]:[NeMo I 2025-03-07 22:37:12 megatron_init:432] Rank 0 has combined group of data parallel and context parallel : [0, 2, 4, 6]\n", "ining-demo/0 [default0]:[NeMo I 2025-03-07 22:37:12 megatron_init:437] All data parallel group ranks with context parallel combined: [[0, 2, 4, 6], [1, 3, 5, 7]]\n", "ining-demo/0 [default0]:[NeMo I 2025-03-07 22:37:12 megatron_init:440] Ranks 0 has data parallel rank: 0\n", "ining-demo/0 [default0]:[NeMo I 2025-03-07 22:37:12 megatron_init:448] Rank 0 has context parallel group: [0]\n", "ining-demo/0 [default0]:[NeMo I 2025-03-07 22:37:12 megatron_init:451] All context parallel group ranks: [[0], [1], [2], [3], [4], [5], [6], [7]]\n", "ining-demo/0 [default0]:[NeMo I 2025-03-07 22:37:12 megatron_init:452] Ranks 0 has context parallel rank: 0\n", "ining-demo/0 [default0]:[NeMo I 2025-03-07 22:37:12 megatron_init:459] Rank 0 has model parallel group: [0, 1]\n", "ining-demo/0 [default0]:[NeMo I 2025-03-07 22:37:12 megatron_init:460] All model parallel group ranks: [[0, 1], [2, 3], [4, 5], [6, 7]]\n", "ining-demo/0 [default0]:[NeMo I 2025-03-07 22:37:12 megatron_init:469] Rank 0 has tensor model parallel group: [0, 1]\n", "ining-demo/0 [default0]:[NeMo I 2025-03-07 22:37:12 megatron_init:473] All tensor model parallel group ranks: [[0, 1], [2, 3], [4, 5], [6, 7]]\n", "ining-demo/0 [default0]:[NeMo I 2025-03-07 22:37:12 megatron_init:474] Rank 0 has tensor model parallel rank: 0\n", "ining-demo/0 [default0]:[NeMo I 2025-03-07 22:37:12 megatron_init:494] Rank 0 has pipeline model parallel group: [0]\n", "ining-demo/0 [default0]:[NeMo I 2025-03-07 22:37:12 megatron_init:506] Rank 0 has embedding group: [0]\n", "ining-demo/0 [default0]:[NeMo I 2025-03-07 22:37:12 megatron_init:512] All pipeline model parallel group ranks: [[0], [1], [2], [3], [4], [5], [6], [7]]\n", "ining-demo/0 [default0]:[NeMo I 2025-03-07 22:37:12 megatron_init:513] Rank 0 has pipeline model parallel rank 0\n", "ining-demo/0 [default0]:[NeMo I 
2025-03-07 22:37:12 megatron_init:514] All embedding group ranks: [[0], [1], [2], [3], [4], [5], [6], [7]]\n", "ining-demo/0 [default0]:[NeMo I 2025-03-07 22:37:12 megatron_init:515] Rank 0 has embedding rank: 0\n", "ining-demo/0 [default0]:Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/8\n", "ining-demo/0 [default7]:Initializing distributed: GLOBAL_RANK: 7, MEMBER: 8/8\n", "ining-demo/0 [default4]:Initializing distributed: GLOBAL_RANK: 4, MEMBER: 5/8\n", "ining-demo/0 [default1]:Initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/8\n", "ining-demo/0 [default3]:Initializing distributed: GLOBAL_RANK: 3, MEMBER: 4/8\n", "ining-demo/0 [default0]:----------------------------------------------------------------------------------------------------\n", "ining-demo/0 [default0]:distributed_backend=nccl\n", "ining-demo/0 [default0]:All distributed processes registered. Starting with 8 processes\n", "ining-demo/0 [default0]:----------------------------------------------------------------------------------------------------\n", "ining-demo/0 [default0]:\n", "ining-demo/0 [default5]:Initializing distributed: GLOBAL_RANK: 5, MEMBER: 6/8\n", "ining-demo/0 [default0]:[NeMo I 2025-03-07 22:37:13 fault_tolerance_callback:311] [FaultToleranceCallback@rank0] Fault tolerance dir: /tmp/nemo_run/checkpoints\n", "ining-demo/0 [default0]:[NeMo I 2025-03-07 22:37:13 fault_tolerance_callback:311] [FaultToleranceCallback@rank0] Fault tolerance client initialized. Timeouts: HeartbeatTimeouts(initial=1800.00, subsequent=300.00, were_calculated=False)\n", "ining-demo/0 [default0]:[NeMo I 2025-03-07 22:37:13 base:44] Padded vocab_size: 50432, original vocab_size: 50257, dummy tokens: 175.\n", "ining-demo/0 [default0]:[NeMo I 2025-03-07 22:37:13 megatron_strategy:327] Copying Trainer's 'max_steps' (20) to LR scheduler's 'max_steps'.\n", "ining-demo/0 [default0]:[NeMo I 2025-03-07 22:37:13 num_microbatches_calculator:228] setting number of microbatches to constant 128\n", "ining-demo/0 [default0]:[NeMo I 2025-03-07 22:37:13 megatron_parallel:549] > number of parameters on (tensor, pipeline) model parallel rank (0, 0): 54663936\n", "ining-demo/0 [default0]:[NeMo I 2025-03-07 22:37:13 utils:302] Setting up DistributedDataParallel with config DistributedDataParallelConfig(grad_reduce_in_fp32=True, overlap_grad_reduce=True, overlap_param_gather=True, align_param_gather=False, use_distributed_optimizer=True, num_distributed_optimizer_instances=1, check_for_nan_in_grad=True, bucket_size=40000000, average_in_collective=True, fp8_param_gather=False)\n", "ining-demo/0 [default0]:[NeMo I 2025-03-07 22:37:13 utils:323] Number of buckets for gradient all-reduce / reduce-scatter: 1\n", "ining-demo/0 [default0]: Params for bucket 1 (54663936 elements):\n", "ining-demo/0 [default0]: \tmodule.output_layer.weight\n", "ining-demo/0 [default0]: \tmodule.decoder.layers.1.self_attention.linear_qkv.weight\n", "ining-demo/0 [default0]: \tmodule.decoder.layers.1.self_attention.linear_qkv.layer_norm_weight\n", "ining-demo/0 [default0]: \tmodule.decoder.layers.3.mlp.linear_fc2.weight\n", "ining-demo/0 [default0]: \tmodule.decoder.layers.2.mlp.linear_fc1.weight\n", "ining-demo/0 [default0]: \tmodule.decoder.layers.2.self_attention.linear_qkv.weight\n", "ining-demo/0 [default0]: \tmodule.decoder.layers.0.mlp.linear_fc2.weight\n", "ining-demo/0 [default0]: \tmodule.decoder.layers.0.mlp.linear_fc1.weight\n", "ining-demo/0 [default0]: \tmodule.decoder.layers.1.mlp.linear_fc1.layer_norm_weight\n", "ining-demo/0 [default0]: 
\tmodule.decoder.layers.3.mlp.linear_fc1.weight\n", "ining-demo/0 [default0]: \tmodule.decoder.layers.2.self_attention.linear_qkv.layer_norm_weight\n", "ining-demo/0 [default0]: \tmodule.decoder.layers.0.self_attention.linear_qkv.weight\n", "ining-demo/0 [default0]: \tmodule.decoder.final_layernorm.weight\n", "ining-demo/0 [default0]: \tmodule.decoder.layers.3.self_attention.linear_qkv.weight\n", "ining-demo/0 [default0]: \tmodule.decoder.layers.2.self_attention.linear_proj.weight\n", "ining-demo/0 [default0]: \tmodule.decoder.layers.3.self_attention.linear_qkv.layer_norm_weight\n", "ining-demo/0 [default0]: \tmodule.decoder.layers.2.mlp.linear_fc1.layer_norm_weight\n", "ining-demo/0 [default0]: \tmodule.decoder.layers.1.mlp.linear_fc2.weight\n", "ining-demo/0 [default0]: \tmodule.decoder.layers.3.self_attention.linear_proj.weight\n", "ining-demo/0 [default0]: \tmodule.decoder.layers.0.mlp.linear_fc1.layer_norm_weight\n", "ining-demo/0 [default0]: \tmodule.decoder.layers.0.self_attention.linear_proj.weight\n", "ining-demo/0 [default0]: \tmodule.embedding.word_embeddings.weight\n", "ining-demo/0 [default0]: \tmodule.decoder.layers.3.mlp.linear_fc1.layer_norm_weight\n", "ining-demo/0 [default0]: \tmodule.decoder.layers.2.mlp.linear_fc2.weight\n", "ining-demo/0 [default0]: \tmodule.decoder.layers.1.mlp.linear_fc1.weight\n", "ining-demo/0 [default0]: \tmodule.decoder.layers.1.self_attention.linear_proj.weight\n", "ining-demo/0 [default0]: \tmodule.decoder.layers.0.self_attention.linear_qkv.layer_norm_weight\n", "ining-demo/0 [default0]:[NeMo I 2025-03-07 22:37:13 utils:302] Setting up optimizer with config OptimizerConfig(optimizer='adam', lr=0.0003, min_lr=None, decoupled_lr=None, decoupled_min_lr=None, weight_decay=0.1, fp16=False, bf16=True, params_dtype=torch.bfloat16, loss_scale=None, initial_loss_scale=4294967296, min_loss_scale=1.0, loss_scale_window=1000, hysteresis=2, adam_beta1=0.9, adam_beta2=0.95, adam_eps=1e-05, sgd_momentum=0.9, use_distributed_optimizer=True, overlap_param_gather_with_optimizer_step=False, clip_grad=1.0, log_num_zeros_in_grad=False, barrier_with_L1_time=False, timers=None, config_logger_dir='')\n", "ining-demo/0 [default7]:LOCAL_RANK: 7 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]\n", "ining-demo/0 [default6]:LOCAL_RANK: 6 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]\n", "ining-demo/0 [default3]:LOCAL_RANK: 3 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]\n", "ining-demo/0 [default0]:LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]\n", "ining-demo/0 [default0]:\n", "ining-demo/0 [default0]: | Name | Type | Params | Mode \n", "ining-demo/0 [default0]:----------------------------------------\n", "ining-demo/0 [default0]:0 | module | DDP | 54.7 M | train\n", "ining-demo/0 [default0]:----------------------------------------\n", "ining-demo/0 [default0]:54.7 M Trainable params\n", "ining-demo/0 [default0]:0 Non-trainable params\n", "ining-demo/0 [default0]:54.7 M Total params\n", "ining-demo/0 [default0]:218.656 Total estimated model params size (MB)\n", "ining-demo/0 [default0]:91 Modules in train mode\n", "ining-demo/0 [default0]:0 Modules in eval mode\n", "ining-demo/0 [default1]:LOCAL_RANK: 1 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]\n", "ining-demo/0 [default2]:LOCAL_RANK: 2 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]\n", "ining-demo/0 [default5]:LOCAL_RANK: 5 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]\n", "ining-demo/0 [default4]:LOCAL_RANK: 4 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]\n", "ining-demo/0 [default0]:[NeMo W 2025-03-07 22:37:24 rerun_state_machine:1088] 
Implicit initialization of Rerun State Machine!\n", "ining-demo/0 [default0]:[NeMo W 2025-03-07 22:37:24 rerun_state_machine:211] RerunStateMachine initialized in mode disabled\n", "ining-demo/0 [default0]:Training epoch 0, iteration 0/19 | lr: 1.499e-07 | global_batch_size: 512 | global_step: 0 | reduced_train_loss: 11.03 | train_step_timing in s: 10.51\n", "ining-demo/0 [default0]:Training epoch 0, iteration 1/19 | lr: 2.999e-07 | global_batch_size: 512 | global_step: 1 | reduced_train_loss: 11.03 | train_step_timing in s: 7.667 | consumed_samples: 1024\n", "ining-demo/0 [default0]:Training epoch 0, iteration 2/19 | lr: 4.498e-07 | global_batch_size: 512 | global_step: 2 | reduced_train_loss: 11.03 | train_step_timing in s: 7.582 | consumed_samples: 1536\n", "ining-demo/0 [default0]:Training epoch 0, iteration 3/19 | lr: 5.997e-07 | global_batch_size: 512 | global_step: 3 | reduced_train_loss: 11.03 | train_step_timing in s: 7.585 | consumed_samples: 2048\n", "ining-demo/0 [default0]:Training epoch 0, iteration 4/19 | lr: 7.496e-07 | global_batch_size: 512 | global_step: 4 | reduced_train_loss: 11.03 | train_step_timing in s: 7.588 | consumed_samples: 2560\n", "ining-demo/0 [default0]:Training epoch 0, iteration 5/19 | lr: 8.996e-07 | global_batch_size: 512 | global_step: 5 | reduced_train_loss: 11.03 | train_step_timing in s: 7.587 | consumed_samples: 3072\n", "ining-demo/0 [default0]:Training epoch 0, iteration 6/19 | lr: 1.049e-06 | global_batch_size: 512 | global_step: 6 | reduced_train_loss: 11.03 | train_step_timing in s: 7.588 | consumed_samples: 3584\n", "ining-demo/0 [default0]:Training epoch 0, iteration 7/19 | lr: 1.199e-06 | global_batch_size: 512 | global_step: 7 | reduced_train_loss: 11.03 | train_step_timing in s: 7.584 | consumed_samples: 4096\n", "ining-demo/0 [default0]:Training epoch 0, iteration 8/19 | lr: 1.349e-06 | global_batch_size: 512 | global_step: 8 | reduced_train_loss: 11.03 | train_step_timing in s: 7.589 | consumed_samples: 4608\n", "ining-demo/0 [default0]:Training epoch 0, iteration 9/19 | lr: 1.499e-06 | global_batch_size: 512 | global_step: 9 | reduced_train_loss: 11.03 | train_step_timing in s: 7.588 | consumed_samples: 5120\n", "ining-demo/0 [default0]:[NeMo W 2025-03-07 22:38:34 validation:389] There is difference in the common state dict in different ranks. 
The differences are {2: ([], [('optimizer', 0, 'optimizer', 'param_groups', 1, 'step')], []), 3: ([], [('optimizer', 0, 'optimizer', 'param_groups', 1, 'step')], []), 4: ([], [('optimizer', 0, 'optimizer', 'param_groups', 1, 'step')], []), 5: ([], [('optimizer', 0, 'optimizer', 'param_groups', 1, 'step')], [])}\n", "ining-demo/0 [default0]:[NeMo I 2025-03-07 22:38:36 model_checkpoint:497] Scheduled async checkpoint save for /tmp/nemo_run/checkpoints/resiliency-in-pretraining-demo/checkpoints/model_name=0--val_loss=0.00-step=9-consumed_samples=5120.0-last.ckpt\n", "ining-demo/0 [default0]:Training epoch 0, iteration 10/19 | lr: 1.649e-06 | global_batch_size: 512 | global_step: 10 | reduced_train_loss: 11.03 | train_step_timing in s: 7.589 | consumed_samples: 5632\n", "ining-demo/0 [default0]:[NeMo I 2025-03-07 22:38:44 model_checkpoint:522] Async checkpoint save for step 10 (/tmp/nemo_run/checkpoints/resiliency-in-pretraining-demo/checkpoints/model_name=0--val_loss=0.00-step=9-consumed_samples=5120.0-last.ckpt) finalized successfully.\n", "ining-demo/0 [default0]:Training epoch 0, iteration 11/19 | lr: 1.799e-06 | global_batch_size: 512 | global_step: 11 | reduced_train_loss: 11.03 | train_step_timing in s: 7.59 | consumed_samples: 6144\n", "ining-demo/0 [default0]:Training epoch 0, iteration 12/19 | lr: 1.949e-06 | global_batch_size: 512 | global_step: 12 | reduced_train_loss: 11.03 | train_step_timing in s: 7.59 | consumed_samples: 6656\n", "ining-demo/0 [default0]:Training epoch 0, iteration 13/19 | lr: 2.099e-06 | global_batch_size: 512 | global_step: 13 | reduced_train_loss: 11.03 | train_step_timing in s: 7.63 | consumed_samples: 7168\n", "ining-demo/0 [default0]:Training epoch 0, iteration 14/19 | lr: 2.249e-06 | global_batch_size: 512 | global_step: 14 | reduced_train_loss: 11.03 | train_step_timing in s: 7.587 | consumed_samples: 7680\n", "ining-demo/0 [default0]:Training epoch 0, iteration 15/19 | lr: 2.399e-06 | global_batch_size: 512 | global_step: 15 | reduced_train_loss: 11.03 | train_step_timing in s: 7.589 | consumed_samples: 8192\n", "ining-demo/0 [default2]:[rank2]: Traceback (most recent call last):\n", "ining-demo/0 [default2]:[rank2]: File \"/usr/lib/python3.10/runpy.py\", line 196, in _run_module_as_main\n", "ining-demo/0 [default2]:[rank2]: return _run_code(code, main_globals, None,\n", "ining-demo/0 [default2]:[rank2]: File \"/usr/lib/python3.10/runpy.py\", line 86, in _run_code\n", "ining-demo/0 [default2]:[rank2]: exec(code, run_globals)\n", "ining-demo/0 [default2]:[rank2]: File \"/opt/NeMo-Run/src/nemo_run/core/runners/fdl_runner.py\", line 66, in \n", "ining-demo/0 [default2]:[rank2]: fdl_runner_app()\n", "ining-demo/0 [default2]:[rank2]: File \"/usr/local/lib/python3.10/dist-packages/typer/main.py\", line 338, in __call__\n", "ining-demo/0 [default2]:[rank2]: raise e\n", "ining-demo/0 [default2]:[rank2]: File \"/usr/local/lib/python3.10/dist-packages/typer/main.py\", line 321, in __call__\n", "ining-demo/0 [default2]:[rank2]: return get_command(self)(*args, **kwargs)\n", "ining-demo/0 [default2]:[rank2]: File \"/usr/local/lib/python3.10/dist-packages/click/core.py\", line 1157, in __call__\n", "ining-demo/0 [default2]:[rank2]: return self.main(*args, **kwargs)\n", "ining-demo/0 [default2]:[rank2]: File \"/usr/local/lib/python3.10/dist-packages/typer/core.py\", line 665, in main\n", "ining-demo/0 [default2]:[rank2]: return _main(\n", "ining-demo/0 [default2]:[rank2]: File \"/usr/local/lib/python3.10/dist-packages/typer/core.py\", line 197, in _main\n", 
"ining-demo/0 [default2]:[rank2]: rv = self.invoke(ctx)\n", "ining-demo/0 [default2]:[rank2]: File \"/usr/local/lib/python3.10/dist-packages/click/core.py\", line 1434, in invoke\n", "ining-demo/0 [default2]:[rank2]: return ctx.invoke(self.callback, **ctx.params)\n", "ining-demo/0 [default2]:[rank2]: File \"/usr/local/lib/python3.10/dist-packages/click/core.py\", line 783, in invoke\n", "ining-demo/0 [default2]:[rank2]: return __callback(*args, **kwargs)\n", "ining-demo/0 [default2]:[rank2]: File \"/usr/local/lib/python3.10/dist-packages/typer/main.py\", line 703, in wrapper\n", "ining-demo/0 [default2]:[rank2]: return callback(**use_params)\n", "ining-demo/0 [default2]:[rank2]: File \"/opt/NeMo-Run/src/nemo_run/core/runners/fdl_runner.py\", line 62, in fdl_direct_run\n", "ining-demo/0 [default2]:[rank2]: fdl_fn()\n", "ining-demo/0 [default2]:[rank2]: File \"/opt/NeMo/nemo/collections/llm/api.py\", line 150, in pretrain\n", "ining-demo/0 [default2]:[rank2]: return train(\n", "ining-demo/0 [default2]:[rank2]: File \"/opt/NeMo/nemo/collections/llm/api.py\", line 107, in train\n", "ining-demo/0 [default2]:[rank2]: trainer.fit(model, data)\n", "ining-demo/0 [default2]:[rank2]: File \"/usr/local/lib/python3.10/dist-packages/lightning/pytorch/trainer/trainer.py\", line 538, in fit\n", "ining-demo/0 [default2]:[rank2]: call._call_and_handle_interrupt(\n", "ining-demo/0 [default2]:[rank2]: File \"/usr/local/lib/python3.10/dist-packages/lightning/pytorch/trainer/call.py\", line 46, in _call_and_handle_interrupt\n", "ining-demo/0 [default2]:[rank2]: return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)\n", "ining-demo/0 [default2]:[rank2]: File \"/usr/local/lib/python3.10/dist-packages/lightning/pytorch/strategies/launchers/subprocess_script.py\", line 105, in launch\n", "ining-demo/0 [default2]:[rank2]: return function(*args, **kwargs)\n", "ining-demo/0 [default2]:[rank2]: File \"/usr/local/lib/python3.10/dist-packages/lightning/pytorch/trainer/trainer.py\", line 574, in _fit_impl\n", "ining-demo/0 [default2]:[rank2]: self._run(model, ckpt_path=ckpt_path)\n", "ining-demo/0 [default2]:[rank2]: File \"/usr/local/lib/python3.10/dist-packages/lightning/pytorch/trainer/trainer.py\", line 981, in _run\n", "ining-demo/0 [default2]:[rank2]: results = self._run_stage()\n", "ining-demo/0 [default2]:[rank2]: File \"/usr/local/lib/python3.10/dist-packages/lightning/pytorch/trainer/trainer.py\", line 1025, in _run_stage\n", "ining-demo/0 [default2]:[rank2]: self.fit_loop.run()\n", "ining-demo/0 [default2]:[rank2]: File \"/usr/local/lib/python3.10/dist-packages/lightning/pytorch/loops/fit_loop.py\", line 205, in run\n", "ining-demo/0 [default2]:[rank2]: self.advance()\n", "ining-demo/0 [default2]:[rank2]: File \"/usr/local/lib/python3.10/dist-packages/lightning/pytorch/loops/fit_loop.py\", line 363, in advance\n", "ining-demo/0 [default2]:[rank2]: self.epoch_loop.run(self._data_fetcher)\n", "ining-demo/0 [default2]:[rank2]: File \"/usr/local/lib/python3.10/dist-packages/lightning/pytorch/loops/training_epoch_loop.py\", line 140, in run\n", "ining-demo/0 [default2]:[rank2]: self.advance(data_fetcher)\n", "ining-demo/0 [default2]:[rank2]: File \"/opt/NeMo/nemo/lightning/pytorch/trainer.py\", line 47, in advance\n", "ining-demo/0 [default2]:[rank2]: super().advance(data_fetcher)\n", "ining-demo/0 [default2]:[rank2]: File \"/usr/local/lib/python3.10/dist-packages/lightning/pytorch/loops/training_epoch_loop.py\", line 269, in advance\n", "ining-demo/0 [default2]:[rank2]: 
call._call_callback_hooks(trainer, \"on_train_batch_end\", batch_output, batch, batch_idx)\n", "ining-demo/0 [default2]:[rank2]: File \"/usr/local/lib/python3.10/dist-packages/lightning/pytorch/trainer/call.py\", line 218, in _call_callback_hooks\n", "ining-demo/0 [default2]:[rank2]: fn(trainer, trainer.lightning_module, *args, **kwargs)\n", "ining-demo/0 [default2]:[rank2]: File \"/gtc/NeMo/examples/llm/resiliency/crash_simulator.py\", line 26, in on_train_batch_end\n", "ining-demo/0 [default2]:[rank2]: raise Exception(f\"Simulating a crash at step {self.crash_step}!\")\n", "ining-demo/0 [default2]:[rank2]: Exception: Simulating a crash at step 17!\n",
"ining-demo/0 [default5]:[rank5]: Traceback (most recent call last):\n", "ining-demo/0 [default5]:[rank5]: File \"/usr/lib/python3.10/runpy.py\", line 196, in _run_module_as_main\n", "ining-demo/0 [default5]:[rank5]: return _run_code(code, main_globals, None,\n", "ining-demo/0 [default5]:[rank5]: File \"/usr/lib/python3.10/runpy.py\", line 86, in _run_code\n", "ining-demo/0 [default5]:[rank5]: exec(code, run_globals)\n", "ining-demo/0 [default5]:[rank5]: File \"/opt/NeMo-Run/src/nemo_run/core/runners/fdl_runner.py\", line 66, in \n", "ining-demo/0 [default5]:[rank5]: fdl_runner_app()\n", "ining-demo/0 [default5]:[rank5]: File \"/usr/local/lib/python3.10/dist-packages/typer/main.py\", line 338, in __call__\n", "ining-demo/0 [default5]:[rank5]: raise e\n", "ining-demo/0 [default5]:[rank5]: File \"/usr/local/lib/python3.10/dist-packages/typer/main.py\", line 321, in 
__call__\n", "ining-demo/0 [default5]:[rank5]: return get_command(self)(*args, **kwargs)\n", "ining-demo/0 [default5]:[rank5]: File \"/usr/local/lib/python3.10/dist-packages/click/core.py\", line 1157, in __call__\n", "ining-demo/0 [default5]:[rank5]: return self.main(*args, **kwargs)\n", "ining-demo/0 [default5]:[rank5]: File \"/usr/local/lib/python3.10/dist-packages/typer/core.py\", line 665, in main\n", "ining-demo/0 [default5]:[rank5]: return _main(\n", "ining-demo/0 [default5]:[rank5]: File \"/usr/local/lib/python3.10/dist-packages/typer/core.py\", line 197, in _main\n", "ining-demo/0 [default5]:[rank5]: rv = self.invoke(ctx)\n", "ining-demo/0 [default5]:[rank5]: File \"/usr/local/lib/python3.10/dist-packages/click/core.py\", line 1434, in invoke\n", "ining-demo/0 [default5]:[rank5]: return ctx.invoke(self.callback, **ctx.params)\n", "ining-demo/0 [default5]:[rank5]: File \"/usr/local/lib/python3.10/dist-packages/click/core.py\", line 783, in invoke\n", "ining-demo/0 [default5]:[rank5]: return __callback(*args, **kwargs)\n", "ining-demo/0 [default5]:[rank5]: File \"/usr/local/lib/python3.10/dist-packages/typer/main.py\", line 703, in wrapper\n", "ining-demo/0 [default5]:[rank5]: return callback(**use_params)\n", "ining-demo/0 [default5]:[rank5]: File \"/opt/NeMo-Run/src/nemo_run/core/runners/fdl_runner.py\", line 62, in fdl_direct_run\n", "ining-demo/0 [default5]:[rank5]: fdl_fn()\n", "ining-demo/0 [default5]:[rank5]: File \"/opt/NeMo/nemo/collections/llm/api.py\", line 150, in pretrain\n", "ining-demo/0 [default5]:[rank5]: return train(\n", "ining-demo/0 [default5]:[rank5]: File \"/opt/NeMo/nemo/collections/llm/api.py\", line 107, in train\n", "ining-demo/0 [default5]:[rank5]: trainer.fit(model, data)\n", "ining-demo/0 [default5]:[rank5]: File \"/usr/local/lib/python3.10/dist-packages/lightning/pytorch/trainer/trainer.py\", line 538, in fit\n", "ining-demo/0 [default5]:[rank5]: call._call_and_handle_interrupt(\n", "ining-demo/0 [default5]:[rank5]: File \"/usr/local/lib/python3.10/dist-packages/lightning/pytorch/trainer/call.py\", line 46, in _call_and_handle_interrupt\n", "ining-demo/0 [default5]:[rank5]: return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)\n", "ining-demo/0 [default5]:[rank5]: File \"/usr/local/lib/python3.10/dist-packages/lightning/pytorch/strategies/launchers/subprocess_script.py\", line 105, in launch\n", "ining-demo/0 [default5]:[rank5]: return function(*args, **kwargs)\n", "ining-demo/0 [default5]:[rank5]: File \"/usr/local/lib/python3.10/dist-packages/lightning/pytorch/trainer/trainer.py\", line 574, in _fit_impl\n", "ining-demo/0 [default5]:[rank5]: self._run(model, ckpt_path=ckpt_path)\n", "ining-demo/0 [default5]:[rank5]: File \"/usr/local/lib/python3.10/dist-packages/lightning/pytorch/trainer/trainer.py\", line 981, in _run\n", "ining-demo/0 [default5]:[rank5]: results = self._run_stage()\n", "ining-demo/0 [default5]:[rank5]: File \"/usr/local/lib/python3.10/dist-packages/lightning/pytorch/trainer/trainer.py\", line 1025, in _run_stage\n", "ining-demo/0 [default5]:[rank5]: self.fit_loop.run()\n", "ining-demo/0 [default5]:[rank5]: File \"/usr/local/lib/python3.10/dist-packages/lightning/pytorch/loops/fit_loop.py\", line 205, in run\n", "ining-demo/0 [default5]:[rank5]: self.advance()\n", "ining-demo/0 [default5]:[rank5]: File \"/usr/local/lib/python3.10/dist-packages/lightning/pytorch/loops/fit_loop.py\", line 363, in advance\n", "ining-demo/0 [default5]:[rank5]: self.epoch_loop.run(self._data_fetcher)\n", "ining-demo/0 
[default5]:[rank5]: File \"/usr/local/lib/python3.10/dist-packages/lightning/pytorch/loops/training_epoch_loop.py\", line 140, in run\n", "ining-demo/0 [default5]:[rank5]: self.advance(data_fetcher)\n", "ining-demo/0 [default5]:[rank5]: File \"/opt/NeMo/nemo/lightning/pytorch/trainer.py\", line 47, in advance\n", "ining-demo/0 [default5]:[rank5]: super().advance(data_fetcher)\n", "ining-demo/0 [default5]:[rank5]: File \"/usr/local/lib/python3.10/dist-packages/lightning/pytorch/loops/training_epoch_loop.py\", line 269, in advance\n", "ining-demo/0 [default5]:[rank5]: call._call_callback_hooks(trainer, \"on_train_batch_end\", batch_output, batch, batch_idx)\n", "ining-demo/0 [default5]:[rank5]: File \"/usr/local/lib/python3.10/dist-packages/lightning/pytorch/trainer/call.py\", line 218, in _call_callback_hooks\n", "ining-demo/0 [default5]:[rank5]: fn(trainer, trainer.lightning_module, *args, **kwargs)\n", "ining-demo/0 [default5]:[rank5]: File \"/gtc/NeMo/examples/llm/resiliency/crash_simulator.py\", line 26, in on_train_batch_end\n", "ining-demo/0 [default5]:[rank5]: raise Exception(f\"Simulating a crash at step {self.crash_step}!\")\n", "ining-demo/0 [default5]:[rank5]: Exception: Simulating a crash at step 17!\n", "ining-demo/0 [2025-03-07 22:39:35,153] [WARNING] [ft_launcher@4809917ea058] Sending process 178686 closing signal SIGTERM\n", "ining-demo/0 [2025-03-07 22:39:35,153] [WARNING] [ft_launcher@4809917ea058] Sending process 178687 closing signal SIGTERM\n", "ining-demo/0 [2025-03-07 22:39:35,153] [WARNING] [ft_launcher@4809917ea058] Sending process 178689 closing signal SIGTERM\n", "ining-demo/0 [2025-03-07 22:39:35,153] [WARNING] [ft_launcher@4809917ea058] Sending process 178690 closing signal SIGTERM\n", "ining-demo/0 [2025-03-07 22:39:35,153] [WARNING] [ft_launcher@4809917ea058] Sending process 178691 closing signal SIGTERM\n", "ining-demo/0 [2025-03-07 22:39:35,153] [WARNING] [ft_launcher@4809917ea058] Sending process 178692 closing signal SIGTERM\n", "ining-demo/0 [2025-03-07 22:39:36,176] [ERROR] [ft_launcher@4809917ea058] failed (exitcode: 1) local_rank: 2 (pid: 178688) of binary: /usr/bin/python\n", "ining-demo/0 [2025-03-07 22:39:36,177] [INFO] [ft_launcher@4809917ea058] [default] Worker group FAILED. 3/3 attempts left; will restart worker group\n", "ining-demo/0 [2025-03-07 22:39:36,177] [INFO] [ft_launcher@4809917ea058] [default] Stopping worker group\n", "ining-demo/0 [2025-03-07 22:39:36,185] [INFO] [ft_launcher@4809917ea058] [default] Rendezvous'ing worker group\n", "ining-demo/0 [2025-03-07 22:39:36,214] [INFO] [ft_launcher@4809917ea058] [default] Rendezvous complete for workers. 
Result:\n", "ining-demo/0 restart_count=1\n", "ining-demo/0 master_addr=4809917ea058\n", "ining-demo/0 master_port=54851\n", "ining-demo/0 group_rank=0\n", "ining-demo/0 group_world_size=1\n", "ining-demo/0 local_ranks=[0, 1, 2, 3, 4, 5, 6, 7]\n", "ining-demo/0 role_ranks=[0, 1, 2, 3, 4, 5, 6, 7]\n", "ining-demo/0 global_ranks=[0, 1, 2, 3, 4, 5, 6, 7]\n", "ining-demo/0 role_world_sizes=[8, 8, 8, 8, 8, 8, 8, 8]\n", "ining-demo/0 global_world_sizes=[8, 8, 8, 8, 8, 8, 8, 8]\n", "ining-demo/0 \n", "ining-demo/0 [2025-03-07 22:39:36,215] [INFO] [ft_launcher@4809917ea058] [default] Starting worker group\n", "ining-demo/0 [2025-03-07 22:39:36,215] [INFO] [ft_launcher@4809917ea058] Setting worker0 reply file to: /root/.nemo_run/experiments/resiliency-in-pretraining-demo/resiliency-in-pretraining-demo_1741387007/resiliency-in-pretraining-demo/nemo_run/resiliency-in-pretraining-demo-ghpmrpzqnhtb0/torchelastic/resiliency-in-pretraining-demo/6224_gnxzn9y9/attempt_1/0/error.json\n", "ining-demo/0 [2025-03-07 22:39:36,215] [INFO] [ft_launcher@4809917ea058] Setting worker1 reply file to: /root/.nemo_run/experiments/resiliency-in-pretraining-demo/resiliency-in-pretraining-demo_1741387007/resiliency-in-pretraining-demo/nemo_run/resiliency-in-pretraining-demo-ghpmrpzqnhtb0/torchelastic/resiliency-in-pretraining-demo/6224_gnxzn9y9/attempt_1/1/error.json\n", "ining-demo/0 [2025-03-07 22:39:36,215] [INFO] [ft_launcher@4809917ea058] Setting worker2 reply file to: /root/.nemo_run/experiments/resiliency-in-pretraining-demo/resiliency-in-pretraining-demo_1741387007/resiliency-in-pretraining-demo/nemo_run/resiliency-in-pretraining-demo-ghpmrpzqnhtb0/torchelastic/resiliency-in-pretraining-demo/6224_gnxzn9y9/attempt_1/2/error.json\n", "ining-demo/0 [2025-03-07 22:39:36,215] [INFO] [ft_launcher@4809917ea058] Setting worker3 reply file to: /root/.nemo_run/experiments/resiliency-in-pretraining-demo/resiliency-in-pretraining-demo_1741387007/resiliency-in-pretraining-demo/nemo_run/resiliency-in-pretraining-demo-ghpmrpzqnhtb0/torchelastic/resiliency-in-pretraining-demo/6224_gnxzn9y9/attempt_1/3/error.json\n", "ining-demo/0 [2025-03-07 22:39:36,215] [INFO] [ft_launcher@4809917ea058] Setting worker4 reply file to: /root/.nemo_run/experiments/resiliency-in-pretraining-demo/resiliency-in-pretraining-demo_1741387007/resiliency-in-pretraining-demo/nemo_run/resiliency-in-pretraining-demo-ghpmrpzqnhtb0/torchelastic/resiliency-in-pretraining-demo/6224_gnxzn9y9/attempt_1/4/error.json\n", "ining-demo/0 [2025-03-07 22:39:36,215] [INFO] [ft_launcher@4809917ea058] Setting worker5 reply file to: /root/.nemo_run/experiments/resiliency-in-pretraining-demo/resiliency-in-pretraining-demo_1741387007/resiliency-in-pretraining-demo/nemo_run/resiliency-in-pretraining-demo-ghpmrpzqnhtb0/torchelastic/resiliency-in-pretraining-demo/6224_gnxzn9y9/attempt_1/5/error.json\n", "ining-demo/0 [2025-03-07 22:39:36,216] [INFO] [ft_launcher@4809917ea058] Setting worker6 reply file to: /root/.nemo_run/experiments/resiliency-in-pretraining-demo/resiliency-in-pretraining-demo_1741387007/resiliency-in-pretraining-demo/nemo_run/resiliency-in-pretraining-demo-ghpmrpzqnhtb0/torchelastic/resiliency-in-pretraining-demo/6224_gnxzn9y9/attempt_1/6/error.json\n", "ining-demo/0 [2025-03-07 22:39:36,216] [INFO] [ft_launcher@4809917ea058] Setting worker7 reply file to: 
/root/.nemo_run/experiments/resiliency-in-pretraining-demo/resiliency-in-pretraining-demo_1741387007/resiliency-in-pretraining-demo/nemo_run/resiliency-in-pretraining-demo-ghpmrpzqnhtb0/torchelastic/resiliency-in-pretraining-demo/6224_gnxzn9y9/attempt_1/7/error.json\n", "ining-demo/0 [default0]:[NeMo W 2025-03-07 22:39:45 nemo_logging:361] /usr/local/lib/python3.10/dist-packages/pyannote/core/notebook.py:134: MatplotlibDeprecationWarning: The get_cmap function was deprecated in Matplotlib 3.7 and will be removed in 3.11. Use ``matplotlib.colormaps[name]`` or ``matplotlib.colormaps.get_cmap()`` or ``pyplot.get_cmap()`` instead.\n", "ining-demo/0 [default0]: cm = get_cmap(\"Set1\")\n", "ining-demo/0 [default0]: \n", "ining-demo/0 [default7]:Setup to simulate a crash if step == 17 and a crash hasn't been simulated before\n", "ining-demo/0 [default5]:Setup to simulate a crash if step == 17 and a crash hasn't been simulated before\n", "ining-demo/0 [default3]:Setup to simulate a crash if step == 17 and a crash hasn't been simulated before\n", "ining-demo/0 [default1]:Setup to simulate a crash if step == 17 and a crash hasn't been simulated before\n", "ining-demo/0 [default2]:Setup to simulate a crash if step == 17 and a crash hasn't been simulated before\n", "ining-demo/0 [default0]:[NeMo I 2025-03-07 22:39:46 tokenizer_utils:224] Getting Megatron tokenizer for pretrained model name: megatron-gpt-345m, custom vocab file: None, and merges file: None\n", "ining-demo/0 [default0]:[NeMo I 2025-03-07 22:39:46 tokenizer_utils:130] Getting HuggingFace AutoTokenizer with pretrained_model_name: gpt2, vocab_file: /root/.cache/torch/megatron/megatron-gpt-345m_vocab, merges_files: /root/.cache/torch/megatron/megatron-gpt-345m_merges, special_tokens_dict: {}, and use_fast: False\n", "ining-demo/0 [default7]:Initializing distributed: GLOBAL_RANK: 7, MEMBER: 8/8\n", "ining-demo/0 [default0]:Setup to simulate a crash if step == 17 and a crash hasn't been simulated before\n", "ining-demo/0 [default0]:[NeMo I 2025-03-07 22:39:46 nemo_logger:145] Experiments will be logged at /tmp/nemo_run/checkpoints/resiliency-in-pretraining-demo\n", "ining-demo/0 [default0]:[NeMo I 2025-03-07 22:39:46 megatron_strategy:315] Fixing mis-match between ddp-config & mcore-optimizer config\n", "ining-demo/0 [default0]:GPU available: True (cuda), used: True\n", "ining-demo/0 [default0]:TPU available: False, using: 0 TPU cores\n", "ining-demo/0 [default0]:HPU available: False, using: 0 HPUs\n", "ining-demo/0 [default0]:[NeMo W 2025-03-07 22:39:46 nemo_logger:123] No version folders would be created under the log folder as 'resume_if_exists' is enabled.\n", "ining-demo/0 [default0]:[NeMo W 2025-03-07 22:39:46 nemo_logger:173] \"update_logger_directory\" is True. Overwriting tensorboard logger \"save_dir\" to /tmp/nemo_run/checkpoints/tb_logs\n", "ining-demo/0 [default0]:[NeMo W 2025-03-07 22:39:46 nemo_logger:189] The Trainer already contains a ModelCheckpoint callback. This will be overwritten.\n", "ining-demo/0 [default0]:[NeMo W 2025-03-07 22:39:46 nemo_logger:212] The checkpoint callback was told to monitor a validation value and trainer's max_steps was set to 20. 
Please ensure that max_steps will run for at least 1 epochs to ensure that checkpointing will not error out.\n", "ining-demo/0 [default6]:Setup to simulate a crash if step == 17 and a crash hasn't been simulated before\n", "ining-demo/0 [default4]:Setup to simulate a crash if step == 17 and a crash hasn't been simulated before\n", "ining-demo/0 [default5]:Initializing distributed: GLOBAL_RANK: 5, MEMBER: 6/8\n", "ining-demo/0 [default0]:[NeMo I 2025-03-07 22:39:48 megatron_init:426] Rank 0 has data parallel group : [0, 2, 4, 6]\n", "ining-demo/0 [default0]:[NeMo I 2025-03-07 22:39:48 megatron_init:432] Rank 0 has combined group of data parallel and context parallel : [0, 2, 4, 6]\n", "ining-demo/0 [default0]:[NeMo I 2025-03-07 22:39:48 megatron_init:437] All data parallel group ranks with context parallel combined: [[0, 2, 4, 6], [1, 3, 5, 7]]\n", "ining-demo/0 [default0]:[NeMo I 2025-03-07 22:39:48 megatron_init:440] Ranks 0 has data parallel rank: 0\n", "ining-demo/0 [default0]:[NeMo I 2025-03-07 22:39:48 megatron_init:448] Rank 0 has context parallel group: [0]\n", "ining-demo/0 [default0]:[NeMo I 2025-03-07 22:39:48 megatron_init:451] All context parallel group ranks: [[0], [1], [2], [3], [4], [5], [6], [7]]\n", "ining-demo/0 [default0]:[NeMo I 2025-03-07 22:39:48 megatron_init:452] Ranks 0 has context parallel rank: 0\n", "ining-demo/0 [default0]:[NeMo I 2025-03-07 22:39:48 megatron_init:459] Rank 0 has model parallel group: [0, 1]\n", "ining-demo/0 [default0]:[NeMo I 2025-03-07 22:39:48 megatron_init:460] All model parallel group ranks: [[0, 1], [2, 3], [4, 5], [6, 7]]\n", "ining-demo/0 [default0]:[NeMo I 2025-03-07 22:39:48 megatron_init:469] Rank 0 has tensor model parallel group: [0, 1]\n", "ining-demo/0 [default0]:[NeMo I 2025-03-07 22:39:48 megatron_init:473] All tensor model parallel group ranks: [[0, 1], [2, 3], [4, 5], [6, 7]]\n", "ining-demo/0 [default0]:[NeMo I 2025-03-07 22:39:48 megatron_init:474] Rank 0 has tensor model parallel rank: 0\n", "ining-demo/0 [default0]:[NeMo I 2025-03-07 22:39:48 megatron_init:494] Rank 0 has pipeline model parallel group: [0]\n", "ining-demo/0 [default0]:[NeMo I 2025-03-07 22:39:48 megatron_init:506] Rank 0 has embedding group: [0]\n", "ining-demo/0 [default0]:[NeMo I 2025-03-07 22:39:48 megatron_init:512] All pipeline model parallel group ranks: [[0], [1], [2], [3], [4], [5], [6], [7]]\n", "ining-demo/0 [default0]:[NeMo I 2025-03-07 22:39:48 megatron_init:513] Rank 0 has pipeline model parallel rank 0\n", "ining-demo/0 [default0]:[NeMo I 2025-03-07 22:39:48 megatron_init:514] All embedding group ranks: [[0], [1], [2], [3], [4], [5], [6], [7]]\n", "ining-demo/0 [default0]:[NeMo I 2025-03-07 22:39:48 megatron_init:515] Rank 0 has embedding rank: 0\n", "ining-demo/0 [default0]:Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/8\n", "ining-demo/0 [default3]:Initializing distributed: GLOBAL_RANK: 3, MEMBER: 4/8\n", "ining-demo/0 [default2]:Initializing distributed: GLOBAL_RANK: 2, MEMBER: 3/8\n", "ining-demo/0 [default1]:Initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/8\n", "ining-demo/0 [default6]:Initializing distributed: GLOBAL_RANK: 6, MEMBER: 7/8\n", "ining-demo/0 [default4]:Initializing distributed: GLOBAL_RANK: 4, MEMBER: 5/8\n", "ining-demo/0 [default0]:----------------------------------------------------------------------------------------------------\n", "ining-demo/0 [default0]:distributed_backend=nccl\n", "ining-demo/0 [default0]:All distributed processes registered. 
Starting with 8 processes\n", "ining-demo/0 [default0]:----------------------------------------------------------------------------------------------------\n", "ining-demo/0 [default0]:\n", "ining-demo/0 [default0]:[NeMo I 2025-03-07 22:39:49 fault_tolerance_callback:311] [FaultToleranceCallback@rank0] Fault tolerance dir: /tmp/nemo_run/checkpoints\n", "ining-demo/0 [default0]:[NeMo I 2025-03-07 22:39:49 fault_tolerance_callback:311] [FaultToleranceCallback@rank0] Fault tolerance client initialized. Timeouts: HeartbeatTimeouts(initial=1800.00, subsequent=300.00, were_calculated=False)\n", "ining-demo/0 [default0]:[NeMo I 2025-03-07 22:39:49 base:44] Padded vocab_size: 50432, original vocab_size: 50257, dummy tokens: 175.\n", "ining-demo/0 [default0]:[NeMo I 2025-03-07 22:39:49 megatron_strategy:327] Copying Trainer's 'max_steps' (20) to LR scheduler's 'max_steps'.\n", "ining-demo/0 [default0]:[NeMo I 2025-03-07 22:39:49 num_microbatches_calculator:228] setting number of microbatches to constant 128\n", "ining-demo/0 [default0]:[NeMo I 2025-03-07 22:39:49 megatron_parallel:549] > number of parameters on (tensor, pipeline) model parallel rank (0, 0): 54663936\n", "ining-demo/0 [default0]:[NeMo I 2025-03-07 22:39:49 utils:302] Setting up DistributedDataParallel with config DistributedDataParallelConfig(grad_reduce_in_fp32=True, overlap_grad_reduce=True, overlap_param_gather=True, align_param_gather=False, use_distributed_optimizer=True, num_distributed_optimizer_instances=1, check_for_nan_in_grad=True, bucket_size=40000000, average_in_collective=True, fp8_param_gather=False)\n", "ining-demo/0 [default0]:[NeMo I 2025-03-07 22:39:49 utils:323] Number of buckets for gradient all-reduce / reduce-scatter: 1\n", "ining-demo/0 [default0]: Params for bucket 1 (54663936 elements):\n", "ining-demo/0 [default0]: \tmodule.decoder.layers.3.self_attention.linear_qkv.layer_norm_weight\n", "ining-demo/0 [default0]: \tmodule.decoder.layers.2.mlp.linear_fc1.layer_norm_weight\n", "ining-demo/0 [default0]: \tmodule.decoder.layers.1.mlp.linear_fc2.weight\n", "ining-demo/0 [default0]: \tmodule.decoder.layers.0.mlp.linear_fc2.weight\n", "ining-demo/0 [default0]: \tmodule.output_layer.weight\n", "ining-demo/0 [default0]: \tmodule.decoder.layers.3.self_attention.linear_proj.weight\n", "ining-demo/0 [default0]: \tmodule.decoder.layers.0.mlp.linear_fc1.layer_norm_weight\n", "ining-demo/0 [default0]: \tmodule.decoder.layers.3.mlp.linear_fc1.layer_norm_weight\n", "ining-demo/0 [default0]: \tmodule.decoder.layers.2.mlp.linear_fc2.weight\n", "ining-demo/0 [default0]: \tmodule.decoder.layers.1.mlp.linear_fc1.weight\n", "ining-demo/0 [default0]: \tmodule.decoder.layers.1.self_attention.linear_proj.weight\n", "ining-demo/0 [default0]: \tmodule.decoder.layers.1.self_attention.linear_qkv.weight\n", "ining-demo/0 [default0]: \tmodule.decoder.layers.3.mlp.linear_fc2.weight\n", "ining-demo/0 [default0]: \tmodule.decoder.layers.2.mlp.linear_fc1.weight\n", "ining-demo/0 [default0]: \tmodule.decoder.layers.1.self_attention.linear_qkv.layer_norm_weight\n", "ining-demo/0 [default0]: \tmodule.decoder.layers.2.self_attention.linear_qkv.weight\n", "ining-demo/0 [default0]: \tmodule.decoder.layers.0.mlp.linear_fc1.weight\n", "ining-demo/0 [default0]: \tmodule.decoder.layers.0.self_attention.linear_qkv.layer_norm_weight\n", "ining-demo/0 [default0]: \tmodule.decoder.layers.3.mlp.linear_fc1.weight\n", "ining-demo/0 [default0]: \tmodule.decoder.layers.2.self_attention.linear_qkv.layer_norm_weight\n", "ining-demo/0 [default0]: 
\tmodule.decoder.layers.1.mlp.linear_fc1.layer_norm_weight\n", "ining-demo/0 [default0]: \tmodule.decoder.layers.0.self_attention.linear_qkv.weight\n", "ining-demo/0 [default0]: \tmodule.decoder.layers.0.self_attention.linear_proj.weight\n", "ining-demo/0 [default0]: \tmodule.decoder.final_layernorm.weight\n", "ining-demo/0 [default0]: \tmodule.decoder.layers.3.self_attention.linear_qkv.weight\n", "ining-demo/0 [default0]: \tmodule.decoder.layers.2.self_attention.linear_proj.weight\n", "ining-demo/0 [default0]: \tmodule.embedding.word_embeddings.weight\n", "ining-demo/0 [default0]:[NeMo I 2025-03-07 22:39:49 utils:302] Setting up optimizer with config OptimizerConfig(optimizer='adam', lr=0.0003, min_lr=None, decoupled_lr=None, decoupled_min_lr=None, weight_decay=0.1, fp16=False, bf16=True, params_dtype=torch.bfloat16, loss_scale=None, initial_loss_scale=4294967296, min_loss_scale=1.0, loss_scale_window=1000, hysteresis=2, adam_beta1=0.9, adam_beta2=0.95, adam_eps=1e-05, sgd_momentum=0.9, use_distributed_optimizer=True, overlap_param_gather_with_optimizer_step=False, clip_grad=1.0, log_num_zeros_in_grad=False, barrier_with_L1_time=False, timers=None, config_logger_dir='')\n", "ining-demo/0 [default4]:LOCAL_RANK: 4 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]\n", "ining-demo/0 [default7]:LOCAL_RANK: 7 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]\n", "ining-demo/0 [default5]:LOCAL_RANK: 5 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]\n", "ining-demo/0 [default3]:LOCAL_RANK: 3 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]\n", "ining-demo/0 [default0]:LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]\n", "ining-demo/0 [default6]:LOCAL_RANK: 6 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]\n", "ining-demo/0 [default2]:LOCAL_RANK: 2 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]\n", "ining-demo/0 [default1]:LOCAL_RANK: 1 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]\n", "ining-demo/0 [default0]:\n", "ining-demo/0 [default0]: | Name | Type | Params | Mode \n", "ining-demo/0 [default0]:----------------------------------------\n", "ining-demo/0 [default0]:0 | module | DDP | 54.7 M | train\n", "ining-demo/0 [default0]:----------------------------------------\n", "ining-demo/0 [default0]:54.7 M Trainable params\n", "ining-demo/0 [default0]:0 Non-trainable params\n", "ining-demo/0 [default0]:54.7 M Total params\n", "ining-demo/0 [default0]:218.656 Total estimated model params size (MB)\n", "ining-demo/0 [default0]:91 Modules in train mode\n", "ining-demo/0 [default0]:0 Modules in eval mode\n", "ining-demo/0 [default0]:Restoring states from the checkpoint path at /tmp/nemo_run/checkpoints/resiliency-in-pretraining-demo/checkpoints/model_name=0--val_loss=0.00-step=9-consumed_samples=5120.0-last/weights\n", "ining-demo/0 [default5]:Resuming from checkpoint, setting has_simulated_crash_happened to True!\n", "ining-demo/0 [default7]:Resuming from checkpoint, setting has_simulated_crash_happened to True!\n", "ining-demo/0 [default3]:Resuming from checkpoint, setting has_simulated_crash_happened to True!\n", "ining-demo/0 [default1]:Resuming from checkpoint, setting has_simulated_crash_happened to True!\n", "ining-demo/0 [default6]:Resuming from checkpoint, setting has_simulated_crash_happened to True!\n", "ining-demo/0 [default2]:Resuming from checkpoint, setting has_simulated_crash_happened to True!\n", "ining-demo/0 [default4]:Resuming from checkpoint, setting has_simulated_crash_happened to True!\n", "ining-demo/0 [default0]:Resuming from checkpoint, setting has_simulated_crash_happened to True!\n", "ining-demo/0 
[default0]:[NeMo I 2025-03-07 22:39:50 distrib_optimizer:705] Loading distributed optimizer sharded state of type fully_sharded_model_space\n", "ining-demo/0 [default0]:Restored all states from the checkpoint at /tmp/nemo_run/checkpoints/resiliency-in-pretraining-demo/checkpoints/model_name=0--val_loss=0.00-step=9-consumed_samples=5120.0-last/weights\n", "ining-demo/0 [default0]:[NeMo W 2025-03-07 22:39:50 nemo_logging:361] /usr/local/lib/python3.10/dist-packages/lightning/pytorch/loops/training_epoch_loop.py:161: You're resuming from a checkpoint that ended before the epoch ended and your dataloader is not resumable. This can cause unreliable results if further training is done. Consider using an end-of-epoch checkpoint or make your dataloader resumable by implementing the `state_dict` / `load_state_dict` interface.\n", "ining-demo/0 [default0]: \n", "ining-demo/0 [default0]:[NeMo W 2025-03-07 22:40:00 rerun_state_machine:1088] Implicit initialization of Rerun State Machine!\n", "ining-demo/0 [default0]:[NeMo W 2025-03-07 22:40:00 rerun_state_machine:211] RerunStateMachine initialized in mode disabled\n", "ining-demo/0 [default0]:Training epoch 0, iteration 10/19 | lr: 1.649e-06 | consumed_samples: 5632 | global_batch_size: 512 | global_step: 10 | reduced_train_loss: 11.03 | train_step_timing in s: 10.26\n", "ining-demo/0 [default0]:Training epoch 0, iteration 11/19 | lr: 1.799e-06 | consumed_samples: 6144 | global_batch_size: 512 | global_step: 11 | reduced_train_loss: 11.03 | train_step_timing in s: 7.581\n", "ining-demo/0 [default0]:Training epoch 0, iteration 12/19 | lr: 1.949e-06 | consumed_samples: 6656 | global_batch_size: 512 | global_step: 12 | reduced_train_loss: 11.03 | train_step_timing in s: 7.593\n", "ining-demo/0 [default0]:Training epoch 0, iteration 13/19 | lr: 2.099e-06 | consumed_samples: 7168 | global_batch_size: 512 | global_step: 13 | reduced_train_loss: 11.03 | train_step_timing in s: 7.592\n", "ining-demo/0 [default0]:Training epoch 0, iteration 14/19 | lr: 2.249e-06 | consumed_samples: 7680 | global_batch_size: 512 | global_step: 14 | reduced_train_loss: 11.03 | train_step_timing in s: 7.591\n", "ining-demo/0 [default0]:Training epoch 0, iteration 15/19 | lr: 2.399e-06 | consumed_samples: 8192 | global_batch_size: 512 | global_step: 15 | reduced_train_loss: 11.03 | train_step_timing in s: 7.593\n", "ining-demo/0 [default0]:Training epoch 0, iteration 16/19 | lr: 2.549e-06 | consumed_samples: 8704 | global_batch_size: 512 | global_step: 16 | reduced_train_loss: 11.03 | train_step_timing in s: 7.603\n", "ining-demo/0 [default0]:Training epoch 0, iteration 17/19 | lr: 2.699e-06 | consumed_samples: 9216 | global_batch_size: 512 | global_step: 17 | reduced_train_loss: 11.03 | train_step_timing in s: 7.59\n", "ining-demo/0 [default0]:Training epoch 0, iteration 18/19 | lr: 2.849e-06 | consumed_samples: 9728 | global_batch_size: 512 | global_step: 18 | reduced_train_loss: 11.03 | train_step_timing in s: 7.591\n", "ining-demo/0 [default0]:Training epoch 0, iteration 19/19 | lr: 2.999e-06 | consumed_samples: 10240 | global_batch_size: 512 | global_step: 19 | reduced_train_loss: 11.03 | train_step_timing in s: 7.592\n", "ining-demo/0 [default0]:[NeMo I 2025-03-07 22:41:12 model_checkpoint:497] Scheduled async checkpoint save for /tmp/nemo_run/checkpoints/resiliency-in-pretraining-demo/checkpoints/model_name=0--val_loss=0.00-step=19-consumed_samples=10240.0-last.ckpt\n", "ining-demo/0 [default0]:[NeMo I 2025-03-07 22:41:12 dist_ckpt_io:174] Pending async checkpoint saves. 
Finalizing them synchronously now\n", "ining-demo/0 [default0]:`Trainer.fit` stopped: `max_steps=20` reached.\n", "ining-demo/0 [default0]:[NeMo I 2025-03-07 22:41:13 model_checkpoint:522] Async checkpoint save for step 20 (/tmp/nemo_run/checkpoints/resiliency-in-pretraining-demo/checkpoints/model_name=0--val_loss=0.00-step=19-consumed_samples=10240.0-last.ckpt) finalized successfully.\n", "ining-demo/0 [2025-03-07 22:41:21,513] [INFO] [ft_launcher@4809917ea058] [default] worker group successfully finished. Waiting 300 seconds for other agents to finish.\n", "ining-demo/0 [2025-03-07 22:41:21,513] [INFO] [ft_launcher@4809917ea058] Local worker group finished (WorkerState.SUCCEEDED). Waiting 300 seconds for other agents to finish\n", "ining-demo/0 [2025-03-07 22:41:21,513] [INFO] [ft_launcher@4809917ea058] Done waiting for other agents. Elapsed: 0.00013136863708496094 seconds\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "Job resiliency-in-pretraining-demo-ghpmrpzqnhtb0 finished: SUCCEEDED\n" ] }, { "data": { "text/html": [ "
                                                                                                                   \n",
       "# The experiment was run with the following tasks: ['resiliency-in-pretraining-demo']                              \n",
       "# You can inspect and reconstruct this experiment at a later point in time using:                                  \n",
       "experiment = run.Experiment.from_id(\"resiliency-in-pretraining-demo_1741387007\")                                   \n",
       "experiment.status() # Gets the overall status                                                                      \n",
       "experiment.logs(\"resiliency-in-pretraining-demo\") # Gets the log for the provided task                             \n",
       "experiment.cancel(\"resiliency-in-pretraining-demo\") # Cancels the provided task if still running                   \n",
       "                                                                                                                   \n",
       "
\n" ], "text/plain": [ "\u001b[48;2;39;40;34m \u001b[0m\n", "\u001b[38;2;149;144;119;48;2;39;40;34m# The experiment was run with the following tasks: ['resiliency-in-pretraining-demo']\u001b[0m\u001b[48;2;39;40;34m \u001b[0m\n", "\u001b[38;2;149;144;119;48;2;39;40;34m# You can inspect and reconstruct this experiment at a later point in time using:\u001b[0m\u001b[48;2;39;40;34m \u001b[0m\n", "\u001b[38;2;248;248;242;48;2;39;40;34mexperiment\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;255;70;137;48;2;39;40;34m=\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mrun\u001b[0m\u001b[38;2;255;70;137;48;2;39;40;34m.\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mExperiment\u001b[0m\u001b[38;2;255;70;137;48;2;39;40;34m.\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mfrom_id\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m(\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m\"\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34mresiliency-in-pretraining-demo_1741387007\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m\"\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m)\u001b[0m\u001b[48;2;39;40;34m \u001b[0m\n", "\u001b[38;2;248;248;242;48;2;39;40;34mexperiment\u001b[0m\u001b[38;2;255;70;137;48;2;39;40;34m.\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mstatus\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m(\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m)\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;149;144;119;48;2;39;40;34m# Gets the overall status\u001b[0m\u001b[48;2;39;40;34m \u001b[0m\n", "\u001b[38;2;248;248;242;48;2;39;40;34mexperiment\u001b[0m\u001b[38;2;255;70;137;48;2;39;40;34m.\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mlogs\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m(\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m\"\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34mresiliency-in-pretraining-demo\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m\"\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m)\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;149;144;119;48;2;39;40;34m# Gets the log for the provided task\u001b[0m\u001b[48;2;39;40;34m \u001b[0m\n", "\u001b[38;2;248;248;242;48;2;39;40;34mexperiment\u001b[0m\u001b[38;2;255;70;137;48;2;39;40;34m.\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mcancel\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m(\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m\"\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34mresiliency-in-pretraining-demo\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m\"\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m)\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;149;144;119;48;2;39;40;34m# Cancels the provided task if still running\u001b[0m\u001b[48;2;39;40;34m \u001b[0m\n", "\u001b[48;2;39;40;34m \u001b[0m\n" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
                                                                                                                   \n",
       "# You can inspect this experiment at a later point in time using the CLI as well:                                  \n",
       "nemo experiment status resiliency-in-pretraining-demo_1741387007                                                   \n",
       "nemo experiment logs resiliency-in-pretraining-demo_1741387007 0                                                   \n",
       "nemo experiment cancel resiliency-in-pretraining-demo_1741387007 0                                                 \n",
       "                                                                                                                   \n",
       "
\n" ], "text/plain": [ "\u001b[48;2;39;40;34m \u001b[0m\n", "\u001b[38;2;149;144;119;48;2;39;40;34m# You can inspect this experiment at a later point in time using the CLI as well:\u001b[0m\u001b[48;2;39;40;34m \u001b[0m\n", "\u001b[38;2;248;248;242;48;2;39;40;34mnemo\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mexperiment\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mstatus\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mresiliency-in-pretraining-demo_1741387007\u001b[0m\u001b[48;2;39;40;34m \u001b[0m\n", "\u001b[38;2;248;248;242;48;2;39;40;34mnemo\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mexperiment\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mlogs\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mresiliency-in-pretraining-demo_1741387007\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;174;129;255;48;2;39;40;34m0\u001b[0m\u001b[48;2;39;40;34m \u001b[0m\n", "\u001b[38;2;248;248;242;48;2;39;40;34mnemo\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mexperiment\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mcancel\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mresiliency-in-pretraining-demo_1741387007\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;174;129;255;48;2;39;40;34m0\u001b[0m\u001b[48;2;39;40;34m \u001b[0m\n", "\u001b[48;2;39;40;34m \u001b[0m\n" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Enable a crash simulation callback\n", "pretrain.trainer.callbacks.append(run.Config(CrashSimulationCallback, crash_step=17))\n", "\n", "# run the experiment\n", "run_experiment(exp_name, pretrain, executor, run_plugins, dryrun=False)" ] }, { "cell_type": "markdown", "id": "947346fd-d62f-42e6-848b-b0e1ea2bf0e8", "metadata": {}, "source": [ "## 2.3 Cleanup" ] }, { "cell_type": "code", "execution_count": 13, "id": "452b9478-25bc-491d-a9a6-71fd698cac29", "metadata": {}, "outputs": [], "source": [ "%%bash\n", "# delete old checkpoints\n", "rm -rf /tmp/nemo_run/checkpoints/" ] }, { "cell_type": "code", "execution_count": 14, "id": "af2f22b1-8cc7-4281-a2f5-e3afb603e44b", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[]" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# restore pretrain.trainer.callbacks and drop Crash Simulation\n", "pretrain.trainer.callbacks = copy.deepcopy(pretrain_trainer_callbacks)\n", "run_plugins = []\n", "pretrain_trainer_callbacks" ] }, { "cell_type": "markdown", "id": "5aef835d-dc96-416b-9b68-f3fb94037e71", "metadata": {}, "source": [ "# 3. 
Demonstrate Straggler Detection\n", "The [Straggler Detection Callback](https://github.com/NVIDIA/NeMo/blob/main/nemo/collections/llm/recipes/callbacks/common.py):\n", "- Monitors training performance across nodes\n", "- Identifies ranks that are running slower than others (\"stragglers\")\n", "- Wraps NVIDIA Resiliency Extension's straggler detection functionality in a NeMo-compatible interface\n" ] }, { "cell_type": "markdown", "id": "0b6efeba-5f66-4fb8-82ff-94361881a970", "metadata": {}, "source": [ "## 3.1 Setup and run an experiment\n", "To simulate straggler nodes in a distributed computing environment, we can use two different approaches:\n", "1. Increase detection sensitivity: Adjust the straggler detection thresholds (e.g., `gpu_relative_perf_threshold` and `gpu_individual_perf_threshold`) from 0.7 to 0.99. This makes the system more sensitive to performance variations, effectively simulating a higher occurrence of stragglers without modifying hardware settings.\n", "2. Throttle the hardware: Manually reduce the performance of specific GPUs using the `nvidia-smi` utility by lowering the clock speeds of both the GPU core and memory.\n", "\n", "#### Steps to manually throttle GPUs for experimentation\n", "1. First, check the current clock speeds:\n", "`!nvidia-smi --query-gpu=index,clocks.current.sm,clocks.current.memory --format=csv`\n", "2. Lock the GPU core clock to a lower frequency:\n", "`!nvidia-smi -i <gpu_id> --lock-gpu-clocks=<clock_speed>`\n", "3. Lock the GPU memory clock to a lower frequency:\n", "`!nvidia-smi -i <gpu_id> --lock-memory-clocks=<clock_speed>`\n", "\n", "Replace `<gpu_id>` with the index of the GPU to throttle, and `<clock_speed>` with a value lower than the maximum clock speed, in both commands.\n", "\n", "#### Resetting GPU Clocks\n", "After your experiment, make sure to reset the GPU and memory clocks to their default values:\n", "\n", "`!nvidia-smi --reset-gpu-clocks`
\n", "`!nvidia-smi --reset-memory-clocks`\n", "\n", "These commands will restore the default clock settings for both the GPU core and memory" ] }, { "cell_type": "markdown", "id": "0ef2f8be-3c5f-4552-bf4b-b9504cb7c308", "metadata": {}, "source": [ "## 3.2 Increase Detection Sensitivity\n", "We can force a mock straggler to be detected by adjusting stragggler detection thresholds to be extremely senstive. " ] }, { "cell_type": "code", "execution_count": 15, "id": "840d26c5-7ce0-4ba0-a3c4-e1ec7c8ace90", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
────── Entering Experiment resiliency-in-pretraining-demo with id: resiliency-in-pretraining-demo_1741387286 ──────\n",
       "
\n" ], "text/plain": [ "\u001b[92m────── \u001b[0m\u001b[1;35mEntering Experiment resiliency-in-pretraining-demo with id: resiliency-in-pretraining-demo_1741387286\u001b[0m\u001b[92m ──────\u001b[0m\n" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "Launching training run ...\n" ] }, { "data": { "text/html": [ "
[22:41:26]  Cannot detach from this experiment. Please keep it running until completion.          experiment.py:651\n",
       "
\n" ], "text/plain": [ "\u001b[2;36m[22:41:26]\u001b[0m\u001b[2;36m \u001b[0m\u001b[1;31m Cannot detach from this experiment. Please keep it running until completion.\u001b[0m \u001b]8;id=101587;file:///opt/NeMo-Run/src/nemo_run/run/experiment.py\u001b\\\u001b[2mexperiment.py\u001b[0m\u001b]8;;\u001b\\\u001b[2m:\u001b[0m\u001b]8;id=506023;file:///opt/NeMo-Run/src/nemo_run/run/experiment.py#651\u001b\\\u001b[2m651\u001b[0m\u001b]8;;\u001b\\\n" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stderr", "output_type": "stream", "text": [ "Log directory is: /root/.nemo_run/experiments/resiliency-in-pretraining-demo/resiliency-in-pretraining-demo_1741387286/resiliency-in-pretraining-demo\n" ] }, { "data": { "text/html": [ "
           Launching job resiliency-in-pretraining-demo for experiment                            experiment.py:724\n",
       "           resiliency-in-pretraining-demo                                                                          \n",
       "
\n" ], "text/plain": [ "\u001b[2;36m \u001b[0m\u001b[2;36m \u001b[0m\u001b[1;36mLaunching job resiliency-in-pretraining-demo for experiment \u001b[0m \u001b]8;id=230211;file:///opt/NeMo-Run/src/nemo_run/run/experiment.py\u001b\\\u001b[2mexperiment.py\u001b[0m\u001b]8;;\u001b\\\u001b[2m:\u001b[0m\u001b]8;id=581800;file:///opt/NeMo-Run/src/nemo_run/run/experiment.py#724\u001b\\\u001b[2m724\u001b[0m\u001b]8;;\u001b\\\n", "\u001b[2;36m \u001b[0m\u001b[1;36mresiliency-in-pretraining-demo\u001b[0m \u001b[2m \u001b[0m\n" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stderr", "output_type": "stream", "text": [ "Log directory is: /root/.nemo_run/experiments/resiliency-in-pretraining-demo/resiliency-in-pretraining-demo_1741387286/resiliency-in-pretraining-demo\n", "Launched app: local_persistent://nemo_run/resiliency-in-pretraining-demo-nr166h9790700\n", "AppStatus:\n", " State: RUNNING\n", " Num Restarts: 0\n", " Roles: \n", " Msg: \n", " Structured Error Msg: \n", " UI URL: file:///root/.nemo_run/experiments/resiliency-in-pretraining-demo/resiliency-in-pretraining-demo_1741387286/resiliency-in-pretraining-demo/nemo_run/resiliency-in-pretraining-demo-nr166h9790700\n", " \n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Experiment executed successfully.\n" ] }, { "data": { "text/html": [ "
─────────────────── Waiting for Experiment resiliency-in-pretraining-demo_1741387286 to finish ────────────────────\n",
       "
\n" ], "text/plain": [ "\u001b[92m─────────────────── \u001b[0m\u001b[1;35mWaiting for Experiment resiliency-in-pretraining-demo_1741387286 to finish\u001b[0m\u001b[92m ────────────────────\u001b[0m\n" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
\n",
       "
\n" ], "text/plain": [ "\n" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
Experiment Status for resiliency-in-pretraining-demo_1741387286\n",
       "
\n" ], "text/plain": [ "\u001b[1;32mExperiment Status for\u001b[0m \u001b[1;38;5;214mresiliency-in-pretraining-demo_1741387286\u001b[0m\n" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
\n",
       "Task 0: resiliency-in-pretraining-demo\n",
       "- Status: RUNNING\n",
       "- Executor: LocalExecutor\n",
       "- Job id: resiliency-in-pretraining-demo-nr166h9790700\n",
       "- Local Directory: /root/.nemo_run/experiments/resiliency-in-pretraining-demo/resiliency-in-pretraining-demo_1741387286/resiliency-in-pretraining-demo\n",
       "
\n" ], "text/plain": [ "\n", "\u001b[1;32mTask 0\u001b[0m: \u001b[1;38;5;214mresiliency-in-pretraining-demo\u001b[0m\n", "- \u001b[1;32mStatus\u001b[0m: RUNNING\n", "- \u001b[1;32mExecutor\u001b[0m: LocalExecutor\n", "- \u001b[1;32mJob id\u001b[0m: resiliency-in-pretraining-demo-nr166h9790700\n", "- \u001b[1;32mLocal Directory\u001b[0m: /root/.nemo_run/experiments/resiliency-in-pretraining-demo/resiliency-in-pretraining-demo_1741387286/resiliency-in-pretraining-demo\n" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
\n",
       "
\n" ], "text/plain": [ "\n" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stderr", "output_type": "stream", "text": [ "Waiting for job resiliency-in-pretraining-demo-nr166h9790700 to finish [log=True]...\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "ining-demo/0 W0307 22:41:27.659000 192252 torch/distributed/run.py:793] \n", "ining-demo/0 W0307 22:41:27.659000 192252 torch/distributed/run.py:793] *****************************************\n", "ining-demo/0 W0307 22:41:27.659000 192252 torch/distributed/run.py:793] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. \n", "ining-demo/0 W0307 22:41:27.659000 192252 torch/distributed/run.py:793] *****************************************\n", "ining-demo/0 I0307 22:41:27.659000 192252 torch/distributed/launcher/api.py:194] Starting elastic_operator with launch configs:\n", "ining-demo/0 I0307 22:41:27.659000 192252 torch/distributed/launcher/api.py:194] entrypoint : nemo_run.core.runners.fdl_runner\n", "ining-demo/0 I0307 22:41:27.659000 192252 torch/distributed/launcher/api.py:194] min_nodes : 1\n", "ining-demo/0 I0307 22:41:27.659000 192252 torch/distributed/launcher/api.py:194] max_nodes : 1\n", "ining-demo/0 I0307 22:41:27.659000 192252 torch/distributed/launcher/api.py:194] nproc_per_node : 8\n", "ining-demo/0 I0307 22:41:27.659000 192252 torch/distributed/launcher/api.py:194] run_id : 2769\n", "ining-demo/0 I0307 22:41:27.659000 192252 torch/distributed/launcher/api.py:194] rdzv_backend : c10d\n", "ining-demo/0 I0307 22:41:27.659000 192252 torch/distributed/launcher/api.py:194] rdzv_endpoint : localhost:0\n", "ining-demo/0 I0307 22:41:27.659000 192252 torch/distributed/launcher/api.py:194] rdzv_configs : {'timeout': 900}\n", "ining-demo/0 I0307 22:41:27.659000 192252 torch/distributed/launcher/api.py:194] max_restarts : 0\n", "ining-demo/0 I0307 22:41:27.659000 192252 torch/distributed/launcher/api.py:194] monitor_interval : 0.1\n", "ining-demo/0 I0307 22:41:27.659000 192252 torch/distributed/launcher/api.py:194] log_dir : /root/.nemo_run/experiments/resiliency-in-pretraining-demo/resiliency-in-pretraining-demo_1741387286/resiliency-in-pretraining-demo/nemo_run/resiliency-in-pretraining-demo-nr166h9790700/torchelastic/resiliency-in-pretraining-demo\n", "ining-demo/0 I0307 22:41:27.659000 192252 torch/distributed/launcher/api.py:194] metrics_cfg : {}\n", "ining-demo/0 I0307 22:41:27.659000 192252 torch/distributed/launcher/api.py:194] \n", "ining-demo/0 I0307 22:41:27.663000 192252 torch/distributed/elastic/agent/server/api.py:845] [default] starting workers for entrypoint: python\n", "ining-demo/0 I0307 22:41:27.663000 192252 torch/distributed/elastic/agent/server/api.py:662] [default] Rendezvous'ing worker group\n", "ining-demo/0 I0307 22:41:27.733000 192252 torch/distributed/elastic/agent/server/api.py:525] [default] Rendezvous complete for workers. 
Result:\n", "ining-demo/0 I0307 22:41:27.733000 192252 torch/distributed/elastic/agent/server/api.py:525] restart_count=0\n", "ining-demo/0 I0307 22:41:27.733000 192252 torch/distributed/elastic/agent/server/api.py:525] master_addr=localhost\n", "ining-demo/0 I0307 22:41:27.733000 192252 torch/distributed/elastic/agent/server/api.py:525] master_port=46031\n", "ining-demo/0 I0307 22:41:27.733000 192252 torch/distributed/elastic/agent/server/api.py:525] group_rank=0\n", "ining-demo/0 I0307 22:41:27.733000 192252 torch/distributed/elastic/agent/server/api.py:525] group_world_size=1\n", "ining-demo/0 I0307 22:41:27.733000 192252 torch/distributed/elastic/agent/server/api.py:525] local_ranks=[0, 1, 2, 3, 4, 5, 6, 7]\n", "ining-demo/0 I0307 22:41:27.733000 192252 torch/distributed/elastic/agent/server/api.py:525] role_ranks=[0, 1, 2, 3, 4, 5, 6, 7]\n", "ining-demo/0 I0307 22:41:27.733000 192252 torch/distributed/elastic/agent/server/api.py:525] global_ranks=[0, 1, 2, 3, 4, 5, 6, 7]\n", "ining-demo/0 I0307 22:41:27.733000 192252 torch/distributed/elastic/agent/server/api.py:525] role_world_sizes=[8, 8, 8, 8, 8, 8, 8, 8]\n", "ining-demo/0 I0307 22:41:27.733000 192252 torch/distributed/elastic/agent/server/api.py:525] global_world_sizes=[8, 8, 8, 8, 8, 8, 8, 8]\n", "ining-demo/0 I0307 22:41:27.733000 192252 torch/distributed/elastic/agent/server/api.py:525] \n", "ining-demo/0 I0307 22:41:27.733000 192252 torch/distributed/elastic/agent/server/api.py:670] [default] Starting worker group\n", "ining-demo/0 I0307 22:41:27.733000 192252 torch/distributed/elastic/agent/server/local_elastic_agent.py:291] use_agent_store: True\n", "ining-demo/0 I0307 22:41:27.733000 192252 torch/distributed/elastic/agent/server/local_elastic_agent.py:192] Environment variable 'TORCHELASTIC_ENABLE_FILE_TIMER' not found. Do not start FileTimerServer.\n", "ining-demo/0 I0307 22:41:27.733000 192252 torch/distributed/elastic/agent/server/local_elastic_agent.py:229] Environment variable 'TORCHELASTIC_HEALTH_CHECK_PORT' not found. Do not start health check.\n", "ining-demo/0 [default0]:[NeMo W 2025-03-07 22:41:37 nemo_logging:361] /usr/local/lib/python3.10/dist-packages/pyannote/core/notebook.py:134: MatplotlibDeprecationWarning: The get_cmap function was deprecated in Matplotlib 3.7 and will be removed in 3.11. 
Use ``matplotlib.colormaps[name]`` or ``matplotlib.colormaps.get_cmap()`` or ``pyplot.get_cmap()`` instead.\n", "ining-demo/0 [default0]: cm = get_cmap(\"Set1\")\n", "ining-demo/0 [default0]: \n", "ining-demo/0 [default0]:[NeMo I 2025-03-07 22:41:38 tokenizer_utils:224] Getting Megatron tokenizer for pretrained model name: megatron-gpt-345m, custom vocab file: None, and merges file: None\n", "ining-demo/0 [default0]:[NeMo I 2025-03-07 22:41:38 tokenizer_utils:130] Getting HuggingFace AutoTokenizer with pretrained_model_name: gpt2, vocab_file: /root/.cache/torch/megatron/megatron-gpt-345m_vocab, merges_files: /root/.cache/torch/megatron/megatron-gpt-345m_merges, special_tokens_dict: {}, and use_fast: False\n", "ining-demo/0 [default1]:Initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/8\n", "ining-demo/0 [default0]:[NeMo I 2025-03-07 22:41:38 nemo_logger:145] Experiments will be logged at /tmp/nemo_run/checkpoints/resiliency-in-pretraining-demo\n", "ining-demo/0 [default0]:[NeMo I 2025-03-07 22:41:38 megatron_strategy:315] Fixing mis-match between ddp-config & mcore-optimizer config\n", "ining-demo/0 [default0]:GPU available: True (cuda), used: True\n", "ining-demo/0 [default0]:TPU available: False, using: 0 TPU cores\n", "ining-demo/0 [default0]:HPU available: False, using: 0 HPUs\n", "ining-demo/0 [default0]:[NeMo W 2025-03-07 22:41:38 nemo_logger:123] No version folders would be created under the log folder as 'resume_if_exists' is enabled.\n", "ining-demo/0 [default0]:[NeMo W 2025-03-07 22:41:38 nemo_logger:173] \"update_logger_directory\" is True. Overwriting tensorboard logger \"save_dir\" to /tmp/nemo_run/checkpoints/tb_logs\n", "ining-demo/0 [default0]:[NeMo W 2025-03-07 22:41:38 nemo_logger:189] The Trainer already contains a ModelCheckpoint callback. This will be overwritten.\n", "ining-demo/0 [default0]:[NeMo W 2025-03-07 22:41:38 nemo_logger:212] The checkpoint callback was told to monitor a validation value and trainer's max_steps was set to 20. Please ensure that max_steps will run for at least 1 epochs to ensure that checkpointing will not error out.\n", "ining-demo/0 [default0]:[NeMo W 2025-03-07 22:41:38 resume:228] There were no checkpoints found in checkpoint_dir or no checkpoint folder at checkpoint_dir :/tmp/nemo_run/checkpoints/resiliency-in-pretraining-demo/checkpoints. 
Training from scratch.\n", "ining-demo/0 [default0]:[NeMo I 2025-03-07 22:41:39 megatron_init:426] Rank 0 has data parallel group : [0, 2, 4, 6]\n", "ining-demo/0 [default0]:[NeMo I 2025-03-07 22:41:39 megatron_init:432] Rank 0 has combined group of data parallel and context parallel : [0, 2, 4, 6]\n", "ining-demo/0 [default0]:[NeMo I 2025-03-07 22:41:39 megatron_init:437] All data parallel group ranks with context parallel combined: [[0, 2, 4, 6], [1, 3, 5, 7]]\n", "ining-demo/0 [default0]:[NeMo I 2025-03-07 22:41:39 megatron_init:440] Ranks 0 has data parallel rank: 0\n", "ining-demo/0 [default0]:[NeMo I 2025-03-07 22:41:39 megatron_init:448] Rank 0 has context parallel group: [0]\n", "ining-demo/0 [default0]:[NeMo I 2025-03-07 22:41:39 megatron_init:451] All context parallel group ranks: [[0], [1], [2], [3], [4], [5], [6], [7]]\n", "ining-demo/0 [default0]:[NeMo I 2025-03-07 22:41:39 megatron_init:452] Ranks 0 has context parallel rank: 0\n", "ining-demo/0 [default0]:[NeMo I 2025-03-07 22:41:39 megatron_init:459] Rank 0 has model parallel group: [0, 1]\n", "ining-demo/0 [default0]:[NeMo I 2025-03-07 22:41:39 megatron_init:460] All model parallel group ranks: [[0, 1], [2, 3], [4, 5], [6, 7]]\n", "ining-demo/0 [default0]:[NeMo I 2025-03-07 22:41:39 megatron_init:469] Rank 0 has tensor model parallel group: [0, 1]\n", "ining-demo/0 [default0]:[NeMo I 2025-03-07 22:41:39 megatron_init:473] All tensor model parallel group ranks: [[0, 1], [2, 3], [4, 5], [6, 7]]\n", "ining-demo/0 [default0]:[NeMo I 2025-03-07 22:41:39 megatron_init:474] Rank 0 has tensor model parallel rank: 0\n", "ining-demo/0 [default0]:[NeMo I 2025-03-07 22:41:39 megatron_init:494] Rank 0 has pipeline model parallel group: [0]\n", "ining-demo/0 [default0]:[NeMo I 2025-03-07 22:41:39 megatron_init:506] Rank 0 has embedding group: [0]\n", "ining-demo/0 [default0]:[NeMo I 2025-03-07 22:41:39 megatron_init:512] All pipeline model parallel group ranks: [[0], [1], [2], [3], [4], [5], [6], [7]]\n", "ining-demo/0 [default0]:[NeMo I 2025-03-07 22:41:39 megatron_init:513] Rank 0 has pipeline model parallel rank 0\n", "ining-demo/0 [default0]:[NeMo I 2025-03-07 22:41:39 megatron_init:514] All embedding group ranks: [[0], [1], [2], [3], [4], [5], [6], [7]]\n", "ining-demo/0 [default0]:[NeMo I 2025-03-07 22:41:39 megatron_init:515] Rank 0 has embedding rank: 0\n", "ining-demo/0 [default0]:Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/8\n", "ining-demo/0 [default0]:----------------------------------------------------------------------------------------------------\n", "ining-demo/0 [default0]:distributed_backend=nccl\n", "ining-demo/0 [default0]:All distributed processes registered. 
Starting with 8 processes\n", "ining-demo/0 [default0]:----------------------------------------------------------------------------------------------------\n", "ining-demo/0 [default0]:\n", "ining-demo/0 [default2]:Initializing distributed: GLOBAL_RANK: 2, MEMBER: 3/8\n", "ining-demo/0 [default3]:Initializing distributed: GLOBAL_RANK: 3, MEMBER: 4/8\n", "ining-demo/0 [default4]:Initializing distributed: GLOBAL_RANK: 4, MEMBER: 5/8\n", "ining-demo/0 [default5]:Initializing distributed: GLOBAL_RANK: 5, MEMBER: 6/8\n", "ining-demo/0 [default7]:Initializing distributed: GLOBAL_RANK: 7, MEMBER: 8/8\n", "ining-demo/0 [default6]:Initializing distributed: GLOBAL_RANK: 6, MEMBER: 7/8\n", "ining-demo/0 [default0]:[NeMo I 2025-03-07 22:41:41 base:44] Padded vocab_size: 50432, original vocab_size: 50257, dummy tokens: 175.\n", "ining-demo/0 [default0]:[NeMo I 2025-03-07 22:41:41 megatron_strategy:327] Copying Trainer's 'max_steps' (20) to LR scheduler's 'max_steps'.\n", "ining-demo/0 [default0]:[NeMo I 2025-03-07 22:41:41 num_microbatches_calculator:228] setting number of microbatches to constant 128\n", "ining-demo/0 [default0]:[NeMo I 2025-03-07 22:41:41 megatron_parallel:549] > number of parameters on (tensor, pipeline) model parallel rank (0, 0): 54663936\n", "ining-demo/0 [default0]:[NeMo I 2025-03-07 22:41:41 utils:302] Setting up DistributedDataParallel with config DistributedDataParallelConfig(grad_reduce_in_fp32=True, overlap_grad_reduce=True, overlap_param_gather=True, align_param_gather=False, use_distributed_optimizer=True, num_distributed_optimizer_instances=1, check_for_nan_in_grad=True, bucket_size=40000000, average_in_collective=True, fp8_param_gather=False)\n", "ining-demo/0 [default0]:[NeMo I 2025-03-07 22:41:41 utils:323] Number of buckets for gradient all-reduce / reduce-scatter: 1\n", "ining-demo/0 [default0]: Params for bucket 1 (54663936 elements):\n", "ining-demo/0 [default0]: \tmodule.decoder.layers.1.self_attention.linear_qkv.weight\n", "ining-demo/0 [default0]: \tmodule.decoder.layers.1.self_attention.linear_qkv.layer_norm_weight\n", "ining-demo/0 [default0]: \tmodule.decoder.layers.3.mlp.linear_fc2.weight\n", "ining-demo/0 [default0]: \tmodule.decoder.layers.2.mlp.linear_fc1.weight\n", "ining-demo/0 [default0]: \tmodule.decoder.layers.1.self_attention.linear_proj.weight\n", "ining-demo/0 [default0]: \tmodule.decoder.layers.0.mlp.linear_fc1.layer_norm_weight\n", "ining-demo/0 [default0]: \tmodule.decoder.layers.0.self_attention.linear_proj.weight\n", "ining-demo/0 [default0]: \tmodule.decoder.layers.2.self_attention.linear_qkv.weight\n", "ining-demo/0 [default0]: \tmodule.decoder.layers.0.mlp.linear_fc2.weight\n", "ining-demo/0 [default0]: \tmodule.decoder.layers.3.mlp.linear_fc1.weight\n", "ining-demo/0 [default0]: \tmodule.decoder.layers.2.self_attention.linear_qkv.layer_norm_weight\n", "ining-demo/0 [default0]: \tmodule.decoder.layers.1.mlp.linear_fc1.layer_norm_weight\n", "ining-demo/0 [default0]: \tmodule.decoder.final_layernorm.weight\n", "ining-demo/0 [default0]: \tmodule.decoder.layers.3.self_attention.linear_qkv.weight\n", "ining-demo/0 [default0]: \tmodule.decoder.layers.2.self_attention.linear_proj.weight\n", "ining-demo/0 [default0]: \tmodule.embedding.word_embeddings.weight\n", "ining-demo/0 [default0]: \tmodule.decoder.layers.3.self_attention.linear_qkv.layer_norm_weight\n", "ining-demo/0 [default0]: \tmodule.decoder.layers.2.mlp.linear_fc1.layer_norm_weight\n", "ining-demo/0 [default0]: \tmodule.decoder.layers.1.mlp.linear_fc2.weight\n", "ining-demo/0 
[default0]: \tmodule.decoder.layers.0.mlp.linear_fc1.weight\n", "ining-demo/0 [default0]: \tmodule.decoder.layers.3.self_attention.linear_proj.weight\n", "ining-demo/0 [default0]: \tmodule.decoder.layers.0.self_attention.linear_qkv.weight\n", "ining-demo/0 [default0]: \tmodule.decoder.layers.0.self_attention.linear_qkv.layer_norm_weight\n", "ining-demo/0 [default0]: \tmodule.decoder.layers.1.mlp.linear_fc1.weight\n", "ining-demo/0 [default0]: \tmodule.output_layer.weight\n", "ining-demo/0 [default0]: \tmodule.decoder.layers.3.mlp.linear_fc1.layer_norm_weight\n", "ining-demo/0 [default0]: \tmodule.decoder.layers.2.mlp.linear_fc2.weight\n", "ining-demo/0 [default0]:[NeMo I 2025-03-07 22:41:41 utils:302] Setting up optimizer with config OptimizerConfig(optimizer='adam', lr=0.0003, min_lr=None, decoupled_lr=None, decoupled_min_lr=None, weight_decay=0.1, fp16=False, bf16=True, params_dtype=torch.bfloat16, loss_scale=None, initial_loss_scale=4294967296, min_loss_scale=1.0, loss_scale_window=1000, hysteresis=2, adam_beta1=0.9, adam_beta2=0.95, adam_eps=1e-05, sgd_momentum=0.9, use_distributed_optimizer=True, overlap_param_gather_with_optimizer_step=False, clip_grad=1.0, log_num_zeros_in_grad=False, barrier_with_L1_time=False, timers=None, config_logger_dir='')\n", "ining-demo/0 [default2]:LOCAL_RANK: 2 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]\n", "ining-demo/0 [default3]:LOCAL_RANK: 3 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]\n", "ining-demo/0 [default1]:LOCAL_RANK: 1 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]\n", "ining-demo/0 [default5]:LOCAL_RANK: 5 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]\n", "ining-demo/0 [default7]:LOCAL_RANK: 7 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]\n", "ining-demo/0 [default6]:LOCAL_RANK: 6 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]\n", "ining-demo/0 [default4]:LOCAL_RANK: 4 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]\n", "ining-demo/0 [default0]:LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]\n", "ining-demo/0 [default0]:\n", "ining-demo/0 [default0]: | Name | Type | Params | Mode \n", "ining-demo/0 [default0]:----------------------------------------\n", "ining-demo/0 [default0]:0 | module | DDP | 54.7 M | train\n", "ining-demo/0 [default0]:----------------------------------------\n", "ining-demo/0 [default0]:54.7 M Trainable params\n", "ining-demo/0 [default0]:0 Non-trainable params\n", "ining-demo/0 [default0]:54.7 M Total params\n", "ining-demo/0 [default0]:218.656 Total estimated model params size (MB)\n", "ining-demo/0 [default0]:91 Modules in train mode\n", "ining-demo/0 [default0]:0 Modules in eval mode\n", "ining-demo/0 [default0]:[NeMo W 2025-03-07 22:41:52 rerun_state_machine:1088] Implicit initialization of Rerun State Machine!\n", "ining-demo/0 [default0]:[NeMo W 2025-03-07 22:41:52 rerun_state_machine:211] RerunStateMachine initialized in mode disabled\n", "ining-demo/0 [default0]:Training epoch 0, iteration 0/19 | lr: 1.499e-07 | global_batch_size: 512 | global_step: 0 | reduced_train_loss: 11.03 | train_step_timing in s: 10.4\n", "ining-demo/0 [default0]:Training epoch 0, iteration 1/19 | lr: 2.999e-07 | global_batch_size: 512 | global_step: 1 | reduced_train_loss: 11.03 | train_step_timing in s: 7.591 | consumed_samples: 1024\n", "ining-demo/0 [default0]:Training epoch 0, iteration 2/19 | lr: 4.498e-07 | global_batch_size: 512 | global_step: 2 | reduced_train_loss: 11.03 | train_step_timing in s: 7.593 | consumed_samples: 1536\n", "ining-demo/0 [default0]:Training epoch 0, iteration 3/19 | lr: 5.997e-07 | global_batch_size: 512 | global_step: 3 | 
reduced_train_loss: 11.03 | train_step_timing in s: 7.59 | consumed_samples: 2048\n", "ining-demo/0 [default0]:Training epoch 0, iteration 4/19 | lr: 7.496e-07 | global_batch_size: 512 | global_step: 4 | reduced_train_loss: 11.03 | train_step_timing in s: 7.595 | consumed_samples: 2560\n", "ining-demo/0 [default0]:Training epoch 0, iteration 5/19 | lr: 8.996e-07 | global_batch_size: 512 | global_step: 5 | reduced_train_loss: 11.03 | train_step_timing in s: 7.599 | consumed_samples: 3072\n", "ining-demo/0 [default0]:Training epoch 0, iteration 6/19 | lr: 1.049e-06 | global_batch_size: 512 | global_step: 6 | reduced_train_loss: 11.03 | train_step_timing in s: 7.605 | consumed_samples: 3584\n", "ining-demo/0 [default0]:Training epoch 0, iteration 7/19 | lr: 1.199e-06 | global_batch_size: 512 | global_step: 7 | reduced_train_loss: 11.03 | train_step_timing in s: 7.61 | consumed_samples: 4096\n", "ining-demo/0 [default0]:Training epoch 0, iteration 8/19 | lr: 1.349e-06 | global_batch_size: 512 | global_step: 8 | reduced_train_loss: 11.03 | train_step_timing in s: 7.608 | consumed_samples: 4608\n", "ining-demo/0 [default0]:Training epoch 0, iteration 9/19 | lr: 1.499e-06 | global_batch_size: 512 | global_step: 9 | reduced_train_loss: 11.03 | train_step_timing in s: 7.6 | consumed_samples: 5120\n", "ining-demo/0 [default0]:[NeMo W 2025-03-07 22:43:01 validation:389] There is difference in the common state dict in different ranks. The differences are {2: ([], [('optimizer', 0, 'optimizer', 'param_groups', 1, 'step')], []), 3: ([], [('optimizer', 0, 'optimizer', 'param_groups', 1, 'step')], []), 4: ([], [('optimizer', 0, 'optimizer', 'param_groups', 1, 'step')], []), 5: ([], [('optimizer', 0, 'optimizer', 'param_groups', 1, 'step')], [])}\n", "ining-demo/0 [default0]:[NeMo I 2025-03-07 22:43:04 model_checkpoint:497] Scheduled async checkpoint save for /tmp/nemo_run/checkpoints/resiliency-in-pretraining-demo/checkpoints/model_name=0--val_loss=0.00-step=9-consumed_samples=5120.0-last.ckpt\n", "ining-demo/0 [default0]:Training epoch 0, iteration 10/19 | lr: 1.649e-06 | global_batch_size: 512 | global_step: 10 | reduced_train_loss: 11.03 | train_step_timing in s: 7.599 | consumed_samples: 5632\n", "ining-demo/0 [default0]:[NeMo I 2025-03-07 22:43:12 model_checkpoint:522] Async checkpoint save for step 10 (/tmp/nemo_run/checkpoints/resiliency-in-pretraining-demo/checkpoints/model_name=0--val_loss=0.00-step=9-consumed_samples=5120.0-last.ckpt) finalized successfully.\n", "ining-demo/0 [default0]:Training epoch 0, iteration 11/19 | lr: 1.799e-06 | global_batch_size: 512 | global_step: 11 | reduced_train_loss: 11.03 | train_step_timing in s: 7.6 | consumed_samples: 6144\n", "ining-demo/0 [default0]:Training epoch 0, iteration 12/19 | lr: 1.949e-06 | global_batch_size: 512 | global_step: 12 | reduced_train_loss: 11.03 | train_step_timing in s: 7.6 | consumed_samples: 6656\n", "ining-demo/0 [default0]:Training epoch 0, iteration 13/19 | lr: 2.099e-06 | global_batch_size: 512 | global_step: 13 | reduced_train_loss: 11.03 | train_step_timing in s: 7.604 | consumed_samples: 7168\n", "ining-demo/0 [default0]:Training epoch 0, iteration 14/19 | lr: 2.249e-06 | global_batch_size: 512 | global_step: 14 | reduced_train_loss: 11.03 | train_step_timing in s: 7.605 | consumed_samples: 7680\n", "ining-demo/0 [default0]:Training epoch 0, iteration 15/19 | lr: 2.399e-06 | global_batch_size: 512 | global_step: 15 | reduced_train_loss: 11.03 | train_step_timing in s: 7.611 | consumed_samples: 8192\n", "ining-demo/0 
[default0]:[NeMo I 2025-03-07 22:43:58 straggler_det_callback:144] \n", "ining-demo/0 [default0]: GPU relative performance:\n", "ining-demo/0 [default0]: Rank=5 Node=4809917ea058 Score=0.97\n", "ining-demo/0 [default0]: Rank=4 Node=4809917ea058 Score=0.97\n", "ining-demo/0 [default0]: Rank=1 Node=4809917ea058 Score=0.98\n", "ining-demo/0 [default0]: Rank=0 Node=4809917ea058 Score=0.98\n", "ining-demo/0 [default0]: Rank=6 Node=4809917ea058 Score=0.98\n", "ining-demo/0 [default0]: Rank=2 Node=4809917ea058 Score=0.99\n", "ining-demo/0 [default0]: Rank=7 Node=4809917ea058 Score=0.99\n", "ining-demo/0 [default0]: Rank=3 Node=4809917ea058 Score=0.99\n", "ining-demo/0 [default0]: \n", "ining-demo/0 [default0]:[NeMo I 2025-03-07 22:43:58 straggler_det_callback:153] \n", "ining-demo/0 [default0]: GPU individual performance:\n", "ining-demo/0 [default0]: Rank=0 Node=4809917ea058 Score=1.00\n", "ining-demo/0 [default0]: Rank=1 Node=4809917ea058 Score=1.00\n", "ining-demo/0 [default0]: Rank=2 Node=4809917ea058 Score=1.00\n", "ining-demo/0 [default0]: Rank=3 Node=4809917ea058 Score=1.00\n", "ining-demo/0 [default0]: Rank=4 Node=4809917ea058 Score=1.00\n", "ining-demo/0 [default0]: Rank=5 Node=4809917ea058 Score=1.00\n", "ining-demo/0 [default0]: Rank=6 Node=4809917ea058 Score=1.00\n", "ining-demo/0 [default0]: Rank=7 Node=4809917ea058 Score=1.00\n", "ining-demo/0 [default0]: \n", "ining-demo/0 [default0]:[NeMo W 2025-03-07 22:43:58 straggler_det_callback:106] STRAGGLER DETECTION WARNING: Some GPUs have worse relative performance. Affected ranks: {StragglerId(rank=5, node='4809917ea058'), StragglerId(rank=6, node='4809917ea058'), StragglerId(rank=1, node='4809917ea058'), StragglerId(rank=4, node='4809917ea058'), StragglerId(rank=0, node='4809917ea058'), StragglerId(rank=2, node='4809917ea058')}\n", "ining-demo/0 [default0]:[NeMo E 2025-03-07 22:43:58 straggler_det_callback:239] Detected stragglers. Terminating training...\n", "ining-demo/0 [default0]:[NeMo I 2025-03-07 22:43:58 model_checkpoint:497] Scheduled async checkpoint save for /tmp/nemo_run/checkpoints/resiliency-in-pretraining-demo/checkpoints/model_name=0--val_loss=0.00-step=16-consumed_samples=8704.0-last.ckpt\n", "ining-demo/0 [default0]:[NeMo I 2025-03-07 22:43:58 straggler_det_callback:245] Async checkpointing detected, waiting for it to complete...\n", "ining-demo/0 [default0]:[NeMo W 2025-03-07 22:43:58 validation:389] There is difference in the common state dict in different ranks. 
The differences are {2: ([], [('optimizer', 0, 'optimizer', 'param_groups', 1, 'step')], []), 3: ([], [('optimizer', 0, 'optimizer', 'param_groups', 1, 'step')], []), 4: ([], [('optimizer', 0, 'optimizer', 'param_groups', 1, 'step')], []), 5: ([], [('optimizer', 0, 'optimizer', 'param_groups', 1, 'step')], [])}\n", "ining-demo/0 [default0]:[NeMo I 2025-03-07 22:43:58 model_checkpoint:522] Async checkpoint save for step 17 (/tmp/nemo_run/checkpoints/resiliency-in-pretraining-demo/checkpoints/model_name=0--val_loss=0.00-step=16-consumed_samples=8704.0-last.ckpt) finalized successfully.\n", "ining-demo/0 W0307 22:44:03.158000 192252 torch/distributed/elastic/multiprocessing/api.py:890] Sending process 192323 closing signal SIGTERM\n", "ining-demo/0 W0307 22:44:03.163000 192252 torch/distributed/elastic/multiprocessing/api.py:890] Sending process 192324 closing signal SIGTERM\n", "ining-demo/0 W0307 22:44:03.170000 192252 torch/distributed/elastic/multiprocessing/api.py:890] Sending process 192325 closing signal SIGTERM\n", "ining-demo/0 W0307 22:44:03.176000 192252 torch/distributed/elastic/multiprocessing/api.py:890] Sending process 192326 closing signal SIGTERM\n", "ining-demo/0 W0307 22:44:03.176000 192252 torch/distributed/elastic/multiprocessing/api.py:890] Sending process 192327 closing signal SIGTERM\n", "ining-demo/0 W0307 22:44:03.178000 192252 torch/distributed/elastic/multiprocessing/api.py:890] Sending process 192328 closing signal SIGTERM\n", "ining-demo/0 W0307 22:44:03.179000 192252 torch/distributed/elastic/multiprocessing/api.py:890] Sending process 192329 closing signal SIGTERM\n", "ining-demo/0 E0307 22:44:04.918000 192252 torch/distributed/elastic/multiprocessing/api.py:862] failed (exitcode: 1) local_rank: 0 (pid: 192322) of binary: /usr/bin/python\n", "ining-demo/0 I0307 22:44:04.924000 192252 torch/distributed/elastic/multiprocessing/errors/__init__.py:368] ('local_rank %s FAILED with no error file. Decorate your entrypoint fn with @record for traceback info. 
See: https://pytorch.org/docs/stable/elastic/errors.html', 0)\n", "ining-demo/0 Traceback (most recent call last):\n", "ining-demo/0 File \"/usr/local/bin/torchrun\", line 33, in \n", "ining-demo/0 sys.exit(load_entry_point('torch==2.5.0a0+e000cf0ad9.nv24.10', 'console_scripts', 'torchrun')())\n", "ining-demo/0 File \"/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py\", line 355, in wrapper\n", "ining-demo/0 return f(*args, **kwargs)\n", "ining-demo/0 File \"/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py\", line 919, in main\n", "ining-demo/0 run(args)\n", "ining-demo/0 File \"/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py\", line 910, in run\n", "ining-demo/0 elastic_launch(\n", "ining-demo/0 File \"/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py\", line 138, in __call__\n", "ining-demo/0 return launch_agent(self._config, self._entrypoint, list(args))\n", "ining-demo/0 File \"/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py\", line 269, in launch_agent\n", "ining-demo/0 raise ChildFailedError(\n", "ining-demo/0 torch.distributed.elastic.multiprocessing.errors.ChildFailedError: \n", "ining-demo/0 ============================================================\n", "ining-demo/0 nemo_run.core.runners.fdl_runner FAILED\n", "ining-demo/0 ------------------------------------------------------------\n", "ining-demo/0 Failures:\n", "ining-demo/0 \n", "ining-demo/0 ------------------------------------------------------------\n", "ining-demo/0 Root Cause (first observed failure):\n", "ining-demo/0 [0]:\n", "ining-demo/0 time : 2025-03-07_22:44:03\n", "ining-demo/0 host : 4809917ea058\n", "ining-demo/0 rank : 0 (local_rank: 0)\n", "ining-demo/0 exitcode : 1 (pid: 192322)\n", "ining-demo/0 error_file: \n", "ining-demo/0 traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html\n", "ining-demo/0 ============================================================\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "Job resiliency-in-pretraining-demo-nr166h9790700 finished: FAILED\n" ] }, { "data": { "text/html": [ "
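{ "cell_type": "markdown", "id": "straggler-threshold-note", "metadata": {}, "source": [ "The `FAILED` status above is the intended outcome of this experiment: with `gpu_relative_perf_threshold=0.99`, ranks whose relative scores dip to 0.97-0.98 through ordinary timing jitter are flagged as stragglers, the callback terminates training, and torchrun propagates a non-zero exit code. Note from the log that the callback waited for the in-flight async checkpoint to finalize before tearing the job down, so the run can resume from that checkpoint. As a rough sketch (not a tuning recommendation), a long-running job would use much looser thresholds so that only genuinely degraded GPUs trigger termination; the 0.7 values below are illustrative assumptions:" ] },
{ "cell_type": "code", "execution_count": null, "id": "straggler-threshold-sketch", "metadata": {}, "outputs": [], "source": [ "# Illustrative sketch: looser thresholds than the 0.99 used in this demo,\n", "# so minor step-time jitter is reported but does not tear the job down.\n", "relaxed_straggler_cb = straggler_det_callback(\n", "    straggler_report_time_interval=1,   # report as often as in this notebook\n", "    gpu_relative_perf_threshold=0.7,    # illustrative value\n", "    gpu_individual_perf_threshold=0.7,  # illustrative value\n", ")" ] },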
                                                                                                                   \n",
       "# The experiment was run with the following tasks: ['resiliency-in-pretraining-demo']                              \n",
       "# You can inspect and reconstruct this experiment at a later point in time using:                                  \n",
       "experiment = run.Experiment.from_id(\"resiliency-in-pretraining-demo_1741387286\")                                   \n",
       "experiment.status() # Gets the overall status                                                                      \n",
       "experiment.logs(\"resiliency-in-pretraining-demo\") # Gets the log for the provided task                             \n",
       "experiment.cancel(\"resiliency-in-pretraining-demo\") # Cancels the provided task if still running                   \n",
       "                                                                                                                   \n",
       "
\n" ], "text/plain": [ "\u001b[48;2;39;40;34m \u001b[0m\n", "\u001b[38;2;149;144;119;48;2;39;40;34m# The experiment was run with the following tasks: ['resiliency-in-pretraining-demo']\u001b[0m\u001b[48;2;39;40;34m \u001b[0m\n", "\u001b[38;2;149;144;119;48;2;39;40;34m# You can inspect and reconstruct this experiment at a later point in time using:\u001b[0m\u001b[48;2;39;40;34m \u001b[0m\n", "\u001b[38;2;248;248;242;48;2;39;40;34mexperiment\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;255;70;137;48;2;39;40;34m=\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mrun\u001b[0m\u001b[38;2;255;70;137;48;2;39;40;34m.\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mExperiment\u001b[0m\u001b[38;2;255;70;137;48;2;39;40;34m.\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mfrom_id\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m(\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m\"\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34mresiliency-in-pretraining-demo_1741387286\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m\"\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m)\u001b[0m\u001b[48;2;39;40;34m \u001b[0m\n", "\u001b[38;2;248;248;242;48;2;39;40;34mexperiment\u001b[0m\u001b[38;2;255;70;137;48;2;39;40;34m.\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mstatus\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m(\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m)\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;149;144;119;48;2;39;40;34m# Gets the overall status\u001b[0m\u001b[48;2;39;40;34m \u001b[0m\n", "\u001b[38;2;248;248;242;48;2;39;40;34mexperiment\u001b[0m\u001b[38;2;255;70;137;48;2;39;40;34m.\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mlogs\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m(\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m\"\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34mresiliency-in-pretraining-demo\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m\"\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m)\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;149;144;119;48;2;39;40;34m# Gets the log for the provided task\u001b[0m\u001b[48;2;39;40;34m \u001b[0m\n", "\u001b[38;2;248;248;242;48;2;39;40;34mexperiment\u001b[0m\u001b[38;2;255;70;137;48;2;39;40;34m.\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mcancel\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m(\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m\"\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34mresiliency-in-pretraining-demo\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m\"\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m)\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;149;144;119;48;2;39;40;34m# Cancels the provided task if still running\u001b[0m\u001b[48;2;39;40;34m \u001b[0m\n", "\u001b[48;2;39;40;34m \u001b[0m\n" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
                                                                                                                   \n",
       "# You can inspect this experiment at a later point in time using the CLI as well:                                  \n",
       "nemo experiment status resiliency-in-pretraining-demo_1741387286                                                   \n",
       "nemo experiment logs resiliency-in-pretraining-demo_1741387286 0                                                   \n",
       "nemo experiment cancel resiliency-in-pretraining-demo_1741387286 0                                                 \n",
       "                                                                                                                   \n",
       "
\n" ], "text/plain": [ "\u001b[48;2;39;40;34m \u001b[0m\n", "\u001b[38;2;149;144;119;48;2;39;40;34m# You can inspect this experiment at a later point in time using the CLI as well:\u001b[0m\u001b[48;2;39;40;34m \u001b[0m\n", "\u001b[38;2;248;248;242;48;2;39;40;34mnemo\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mexperiment\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mstatus\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mresiliency-in-pretraining-demo_1741387286\u001b[0m\u001b[48;2;39;40;34m \u001b[0m\n", "\u001b[38;2;248;248;242;48;2;39;40;34mnemo\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mexperiment\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mlogs\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mresiliency-in-pretraining-demo_1741387286\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;174;129;255;48;2;39;40;34m0\u001b[0m\u001b[48;2;39;40;34m \u001b[0m\n", "\u001b[38;2;248;248;242;48;2;39;40;34mnemo\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mexperiment\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mcancel\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mresiliency-in-pretraining-demo_1741387286\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;174;129;255;48;2;39;40;34m0\u001b[0m\u001b[48;2;39;40;34m \u001b[0m\n", "\u001b[48;2;39;40;34m \u001b[0m\n" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Automatically detect and mitigate mock stragglers during training\n", "pretrain.trainer.callbacks.append(straggler_det_callback(straggler_report_time_interval=1, gpu_relative_perf_threshold=0.99, gpu_individual_perf_threshold=0.99))\n", "\n", "# run the experiment\n", "run_experiment(exp_name, pretrain, executor, run_plugins, dryrun=False)" ] }, { "cell_type": "markdown", "id": "87f7577e-3dec-4e42-90cd-98e3fa481000", "metadata": {}, "source": [ "## 3.2 Cleanup" ] }, { "cell_type": "code", "execution_count": 16, "id": "3fb5eea6-f2f7-4b7b-9c6d-a50a398d1c48", "metadata": {}, "outputs": [], "source": [ "%%bash\n", "# delete old checkpoints\n", "rm -rf /tmp/nemo_run/checkpoints/" ] }, { "cell_type": "code", "execution_count": 17, "id": "988af6cf-d834-4902-a208-3cf82d32381d", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[]" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# restore pretrain.trainer.callbacks and drop Straggler Detection callback\n", "pretrain.trainer.callbacks = copy.deepcopy(pretrain_trainer_callbacks)\n", "run_plugins = []\n", "pretrain.trainer.callbacks" ] }, { "cell_type": "markdown", "id": "37c4d592-9a77-4840-9e8d-1bbac2f3aff0", "metadata": {}, "source": [ "## 3.3 Manually reduce the performance of specific GPUs using the nvidia-smi utility" ] }, { "cell_type": "code", "execution_count": 18, "id": "8c9ba50b-06e8-4758-ae59-0524b6d87f3f", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "index, clocks.current.sm [MHz], clocks.current.memory [MHz]\n", "0, 450 MHz, 9000 MHz\n", "1, 675 MHz, 9000 MHz\n", "2, 630 MHz, 9000 MHz\n", "3, 465 MHz, 9000 MHz\n", "4, 285 MHz, 405 MHz\n", "5, 2370 MHz, 9000 MHz\n", "6, 
2130 MHz, 9000 MHz\n", "7, 2400 MHz, 9000 MHz\n" ] } ], "source": [ "### Simulating the Straggling GPUs\n", "!nvidia-smi --query-gpu=index,clocks.current.sm,clocks.current.memory --format=csv" ] }, { "cell_type": "code", "execution_count": 19, "id": "e47a0720-b27e-48dd-9c3b-6bccffc2e102", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The current user does not have permission to change clocks for GPU 00000000:01:00.0.\n", "Terminating early due to previous errors.\n" ] } ], "source": [ "!nvidia-smi -i 0,2,4,6 --lock-gpu-clocks=150" ] }, { "cell_type": "code", "execution_count": 20, "id": "27ae0c09-eb74-4dd5-a7cc-7e24766e4e3e", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
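{ "cell_type": "markdown", "id": "clock-lock-note", "metadata": {}, "source": [ "Locking GPU clocks is a privileged operation, so the permission error above is expected when this notebook runs as an unprivileged user inside a container. If you do have root access on the host, the sketch below (commented out so it cannot run by accident) pins GPUs 0, 2, 4 and 6 to a low SM clock to create real stragglers and then restores default clock management; 150 MHz mirrors the value attempted above." ] },
{ "cell_type": "code", "execution_count": null, "id": "clock-lock-sketch", "metadata": {}, "outputs": [], "source": [ "# Requires root; run on the host or in a privileged container.\n", "# !sudo nvidia-smi -i 0,2,4,6 --lock-gpu-clocks=150    # slow down every other GPU\n", "# After the experiment, restore default clock management:\n", "# !sudo nvidia-smi -i 0,2,4,6 --reset-gpu-clocks" ] },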
────── Entering Experiment resiliency-in-pretraining-demo with id: resiliency-in-pretraining-demo_1741387447 ──────\n",
       "
\n" ], "text/plain": [ "\u001b[92m────── \u001b[0m\u001b[1;35mEntering Experiment resiliency-in-pretraining-demo with id: resiliency-in-pretraining-demo_1741387447\u001b[0m\u001b[92m ──────\u001b[0m\n" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "Launching training run ...\n" ] }, { "data": { "text/html": [ "
[22:44:07]  Cannot detach from this experiment. Please keep it running until completion.          experiment.py:651\n",
       "
\n" ], "text/plain": [ "\u001b[2;36m[22:44:07]\u001b[0m\u001b[2;36m \u001b[0m\u001b[1;31m Cannot detach from this experiment. Please keep it running until completion.\u001b[0m \u001b]8;id=154117;file:///opt/NeMo-Run/src/nemo_run/run/experiment.py\u001b\\\u001b[2mexperiment.py\u001b[0m\u001b]8;;\u001b\\\u001b[2m:\u001b[0m\u001b]8;id=798019;file:///opt/NeMo-Run/src/nemo_run/run/experiment.py#651\u001b\\\u001b[2m651\u001b[0m\u001b]8;;\u001b\\\n" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stderr", "output_type": "stream", "text": [ "Log directory is: /root/.nemo_run/experiments/resiliency-in-pretraining-demo/resiliency-in-pretraining-demo_1741387447/resiliency-in-pretraining-demo\n" ] }, { "data": { "text/html": [ "
           Launching job resiliency-in-pretraining-demo for experiment                            experiment.py:724\n",
       "           resiliency-in-pretraining-demo                                                                          \n",
       "
\n" ], "text/plain": [ "\u001b[2;36m \u001b[0m\u001b[2;36m \u001b[0m\u001b[1;36mLaunching job resiliency-in-pretraining-demo for experiment \u001b[0m \u001b]8;id=357545;file:///opt/NeMo-Run/src/nemo_run/run/experiment.py\u001b\\\u001b[2mexperiment.py\u001b[0m\u001b]8;;\u001b\\\u001b[2m:\u001b[0m\u001b]8;id=831697;file:///opt/NeMo-Run/src/nemo_run/run/experiment.py#724\u001b\\\u001b[2m724\u001b[0m\u001b]8;;\u001b\\\n", "\u001b[2;36m \u001b[0m\u001b[1;36mresiliency-in-pretraining-demo\u001b[0m \u001b[2m \u001b[0m\n" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stderr", "output_type": "stream", "text": [ "Log directory is: /root/.nemo_run/experiments/resiliency-in-pretraining-demo/resiliency-in-pretraining-demo_1741387447/resiliency-in-pretraining-demo\n", "Launched app: local_persistent://nemo_run/resiliency-in-pretraining-demo-n9hk3pk4hcz23c\n", "AppStatus:\n", " State: RUNNING\n", " Num Restarts: 0\n", " Roles: \n", " Msg: \n", " Structured Error Msg: \n", " UI URL: file:///root/.nemo_run/experiments/resiliency-in-pretraining-demo/resiliency-in-pretraining-demo_1741387447/resiliency-in-pretraining-demo/nemo_run/resiliency-in-pretraining-demo-n9hk3pk4hcz23c\n", " \n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Experiment executed successfully.\n" ] }, { "data": { "text/html": [ "
─────────────────── Waiting for Experiment resiliency-in-pretraining-demo_1741387447 to finish ────────────────────\n",
       "
\n" ], "text/plain": [ "\u001b[92m─────────────────── \u001b[0m\u001b[1;35mWaiting for Experiment resiliency-in-pretraining-demo_1741387447 to finish\u001b[0m\u001b[92m ────────────────────\u001b[0m\n" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
\n",
       "
\n" ], "text/plain": [ "\n" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
Experiment Status for resiliency-in-pretraining-demo_1741387447\n",
       "
\n" ], "text/plain": [ "\u001b[1;32mExperiment Status for\u001b[0m \u001b[1;38;5;214mresiliency-in-pretraining-demo_1741387447\u001b[0m\n" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
\n",
       "Task 0: resiliency-in-pretraining-demo\n",
       "- Status: RUNNING\n",
       "- Executor: LocalExecutor\n",
       "- Job id: resiliency-in-pretraining-demo-n9hk3pk4hcz23c\n",
       "- Local Directory: /root/.nemo_run/experiments/resiliency-in-pretraining-demo/resiliency-in-pretraining-demo_1741387447/resiliency-in-pretraining-demo\n",
       "
\n" ], "text/plain": [ "\n", "\u001b[1;32mTask 0\u001b[0m: \u001b[1;38;5;214mresiliency-in-pretraining-demo\u001b[0m\n", "- \u001b[1;32mStatus\u001b[0m: RUNNING\n", "- \u001b[1;32mExecutor\u001b[0m: LocalExecutor\n", "- \u001b[1;32mJob id\u001b[0m: resiliency-in-pretraining-demo-n9hk3pk4hcz23c\n", "- \u001b[1;32mLocal Directory\u001b[0m: /root/.nemo_run/experiments/resiliency-in-pretraining-demo/resiliency-in-pretraining-demo_1741387447/resiliency-in-pretraining-demo\n" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
\n",
       "
\n" ], "text/plain": [ "\n" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stderr", "output_type": "stream", "text": [ "Waiting for job resiliency-in-pretraining-demo-n9hk3pk4hcz23c to finish [log=True]...\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "ining-demo/0 W0307 22:44:09.270000 199266 torch/distributed/run.py:793] \n", "ining-demo/0 W0307 22:44:09.270000 199266 torch/distributed/run.py:793] *****************************************\n", "ining-demo/0 W0307 22:44:09.270000 199266 torch/distributed/run.py:793] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. \n", "ining-demo/0 W0307 22:44:09.270000 199266 torch/distributed/run.py:793] *****************************************\n", "ining-demo/0 I0307 22:44:09.270000 199266 torch/distributed/launcher/api.py:194] Starting elastic_operator with launch configs:\n", "ining-demo/0 I0307 22:44:09.270000 199266 torch/distributed/launcher/api.py:194] entrypoint : nemo_run.core.runners.fdl_runner\n", "ining-demo/0 I0307 22:44:09.270000 199266 torch/distributed/launcher/api.py:194] min_nodes : 1\n", "ining-demo/0 I0307 22:44:09.270000 199266 torch/distributed/launcher/api.py:194] max_nodes : 1\n", "ining-demo/0 I0307 22:44:09.270000 199266 torch/distributed/launcher/api.py:194] nproc_per_node : 8\n", "ining-demo/0 I0307 22:44:09.270000 199266 torch/distributed/launcher/api.py:194] run_id : 5723\n", "ining-demo/0 I0307 22:44:09.270000 199266 torch/distributed/launcher/api.py:194] rdzv_backend : c10d\n", "ining-demo/0 I0307 22:44:09.270000 199266 torch/distributed/launcher/api.py:194] rdzv_endpoint : localhost:0\n", "ining-demo/0 I0307 22:44:09.270000 199266 torch/distributed/launcher/api.py:194] rdzv_configs : {'timeout': 900}\n", "ining-demo/0 I0307 22:44:09.270000 199266 torch/distributed/launcher/api.py:194] max_restarts : 0\n", "ining-demo/0 I0307 22:44:09.270000 199266 torch/distributed/launcher/api.py:194] monitor_interval : 0.1\n", "ining-demo/0 I0307 22:44:09.270000 199266 torch/distributed/launcher/api.py:194] log_dir : /root/.nemo_run/experiments/resiliency-in-pretraining-demo/resiliency-in-pretraining-demo_1741387447/resiliency-in-pretraining-demo/nemo_run/resiliency-in-pretraining-demo-n9hk3pk4hcz23c/torchelastic/resiliency-in-pretraining-demo\n", "ining-demo/0 I0307 22:44:09.270000 199266 torch/distributed/launcher/api.py:194] metrics_cfg : {}\n", "ining-demo/0 I0307 22:44:09.270000 199266 torch/distributed/launcher/api.py:194] \n", "ining-demo/0 I0307 22:44:09.274000 199266 torch/distributed/elastic/agent/server/api.py:845] [default] starting workers for entrypoint: python\n", "ining-demo/0 I0307 22:44:09.274000 199266 torch/distributed/elastic/agent/server/api.py:662] [default] Rendezvous'ing worker group\n", "ining-demo/0 I0307 22:44:09.377000 199266 torch/distributed/elastic/agent/server/api.py:525] [default] Rendezvous complete for workers. 
Result:\n", "ining-demo/0 I0307 22:44:09.377000 199266 torch/distributed/elastic/agent/server/api.py:525] restart_count=0\n", "ining-demo/0 I0307 22:44:09.377000 199266 torch/distributed/elastic/agent/server/api.py:525] master_addr=localhost\n", "ining-demo/0 I0307 22:44:09.377000 199266 torch/distributed/elastic/agent/server/api.py:525] master_port=36403\n", "ining-demo/0 I0307 22:44:09.377000 199266 torch/distributed/elastic/agent/server/api.py:525] group_rank=0\n", "ining-demo/0 I0307 22:44:09.377000 199266 torch/distributed/elastic/agent/server/api.py:525] group_world_size=1\n", "ining-demo/0 I0307 22:44:09.377000 199266 torch/distributed/elastic/agent/server/api.py:525] local_ranks=[0, 1, 2, 3, 4, 5, 6, 7]\n", "ining-demo/0 I0307 22:44:09.377000 199266 torch/distributed/elastic/agent/server/api.py:525] role_ranks=[0, 1, 2, 3, 4, 5, 6, 7]\n", "ining-demo/0 I0307 22:44:09.377000 199266 torch/distributed/elastic/agent/server/api.py:525] global_ranks=[0, 1, 2, 3, 4, 5, 6, 7]\n", "ining-demo/0 I0307 22:44:09.377000 199266 torch/distributed/elastic/agent/server/api.py:525] role_world_sizes=[8, 8, 8, 8, 8, 8, 8, 8]\n", "ining-demo/0 I0307 22:44:09.377000 199266 torch/distributed/elastic/agent/server/api.py:525] global_world_sizes=[8, 8, 8, 8, 8, 8, 8, 8]\n", "ining-demo/0 I0307 22:44:09.377000 199266 torch/distributed/elastic/agent/server/api.py:525] \n", "ining-demo/0 I0307 22:44:09.377000 199266 torch/distributed/elastic/agent/server/api.py:670] [default] Starting worker group\n", "ining-demo/0 I0307 22:44:09.377000 199266 torch/distributed/elastic/agent/server/local_elastic_agent.py:291] use_agent_store: True\n", "ining-demo/0 I0307 22:44:09.378000 199266 torch/distributed/elastic/agent/server/local_elastic_agent.py:192] Environment variable 'TORCHELASTIC_ENABLE_FILE_TIMER' not found. Do not start FileTimerServer.\n", "ining-demo/0 I0307 22:44:09.378000 199266 torch/distributed/elastic/agent/server/local_elastic_agent.py:229] Environment variable 'TORCHELASTIC_HEALTH_CHECK_PORT' not found. Do not start health check.\n", "ining-demo/0 [default0]:[NeMo W 2025-03-07 22:44:18 nemo_logging:361] /usr/local/lib/python3.10/dist-packages/pyannote/core/notebook.py:134: MatplotlibDeprecationWarning: The get_cmap function was deprecated in Matplotlib 3.7 and will be removed in 3.11. 
Use ``matplotlib.colormaps[name]`` or ``matplotlib.colormaps.get_cmap()`` or ``pyplot.get_cmap()`` instead.\n", "ining-demo/0 [default0]: cm = get_cmap(\"Set1\")\n", "ining-demo/0 [default0]: \n", "ining-demo/0 [default0]:[NeMo I 2025-03-07 22:44:19 tokenizer_utils:224] Getting Megatron tokenizer for pretrained model name: megatron-gpt-345m, custom vocab file: None, and merges file: None\n", "ining-demo/0 [default0]:[NeMo I 2025-03-07 22:44:19 tokenizer_utils:130] Getting HuggingFace AutoTokenizer with pretrained_model_name: gpt2, vocab_file: /root/.cache/torch/megatron/megatron-gpt-345m_vocab, merges_files: /root/.cache/torch/megatron/megatron-gpt-345m_merges, special_tokens_dict: {}, and use_fast: False\n", "ining-demo/0 [default0]:[NeMo I 2025-03-07 22:44:19 nemo_logger:145] Experiments will be logged at /tmp/nemo_run/checkpoints/resiliency-in-pretraining-demo\n", "ining-demo/0 [default0]:[NeMo I 2025-03-07 22:44:19 megatron_strategy:315] Fixing mis-match between ddp-config & mcore-optimizer config\n", "ining-demo/0 [default0]:GPU available: True (cuda), used: True\n", "ining-demo/0 [default0]:TPU available: False, using: 0 TPU cores\n", "ining-demo/0 [default0]:HPU available: False, using: 0 HPUs\n", "ining-demo/0 [default0]:[NeMo W 2025-03-07 22:44:19 nemo_logger:123] No version folders would be created under the log folder as 'resume_if_exists' is enabled.\n", "ining-demo/0 [default0]:[NeMo W 2025-03-07 22:44:19 nemo_logger:173] \"update_logger_directory\" is True. Overwriting tensorboard logger \"save_dir\" to /tmp/nemo_run/checkpoints/tb_logs\n", "ining-demo/0 [default0]:[NeMo W 2025-03-07 22:44:19 nemo_logger:189] The Trainer already contains a ModelCheckpoint callback. This will be overwritten.\n", "ining-demo/0 [default0]:[NeMo W 2025-03-07 22:44:19 nemo_logger:212] The checkpoint callback was told to monitor a validation value and trainer's max_steps was set to 20. Please ensure that max_steps will run for at least 1 epochs to ensure that checkpointing will not error out.\n", "ining-demo/0 [default0]:[NeMo W 2025-03-07 22:44:19 resume:228] There were no checkpoints found in checkpoint_dir or no checkpoint folder at checkpoint_dir :/tmp/nemo_run/checkpoints/resiliency-in-pretraining-demo/checkpoints. 
Training from scratch.\n", "ining-demo/0 [default0]:[NeMo I 2025-03-07 22:44:19 megatron_init:426] Rank 0 has data parallel group : [0, 2, 4, 6]\n", "ining-demo/0 [default0]:[NeMo I 2025-03-07 22:44:19 megatron_init:432] Rank 0 has combined group of data parallel and context parallel : [0, 2, 4, 6]\n", "ining-demo/0 [default0]:[NeMo I 2025-03-07 22:44:19 megatron_init:437] All data parallel group ranks with context parallel combined: [[0, 2, 4, 6], [1, 3, 5, 7]]\n", "ining-demo/0 [default0]:[NeMo I 2025-03-07 22:44:19 megatron_init:440] Ranks 0 has data parallel rank: 0\n", "ining-demo/0 [default0]:[NeMo I 2025-03-07 22:44:19 megatron_init:448] Rank 0 has context parallel group: [0]\n", "ining-demo/0 [default0]:[NeMo I 2025-03-07 22:44:19 megatron_init:451] All context parallel group ranks: [[0], [1], [2], [3], [4], [5], [6], [7]]\n", "ining-demo/0 [default0]:[NeMo I 2025-03-07 22:44:19 megatron_init:452] Ranks 0 has context parallel rank: 0\n", "ining-demo/0 [default0]:[NeMo I 2025-03-07 22:44:19 megatron_init:459] Rank 0 has model parallel group: [0, 1]\n", "ining-demo/0 [default0]:[NeMo I 2025-03-07 22:44:19 megatron_init:460] All model parallel group ranks: [[0, 1], [2, 3], [4, 5], [6, 7]]\n", "ining-demo/0 [default0]:[NeMo I 2025-03-07 22:44:19 megatron_init:469] Rank 0 has tensor model parallel group: [0, 1]\n", "ining-demo/0 [default0]:[NeMo I 2025-03-07 22:44:19 megatron_init:473] All tensor model parallel group ranks: [[0, 1], [2, 3], [4, 5], [6, 7]]\n", "ining-demo/0 [default0]:[NeMo I 2025-03-07 22:44:19 megatron_init:474] Rank 0 has tensor model parallel rank: 0\n", "ining-demo/0 [default0]:[NeMo I 2025-03-07 22:44:19 megatron_init:494] Rank 0 has pipeline model parallel group: [0]\n", "ining-demo/0 [default0]:[NeMo I 2025-03-07 22:44:19 megatron_init:506] Rank 0 has embedding group: [0]\n", "ining-demo/0 [default0]:[NeMo I 2025-03-07 22:44:19 megatron_init:512] All pipeline model parallel group ranks: [[0], [1], [2], [3], [4], [5], [6], [7]]\n", "ining-demo/0 [default0]:[NeMo I 2025-03-07 22:44:19 megatron_init:513] Rank 0 has pipeline model parallel rank 0\n", "ining-demo/0 [default0]:[NeMo I 2025-03-07 22:44:19 megatron_init:514] All embedding group ranks: [[0], [1], [2], [3], [4], [5], [6], [7]]\n", "ining-demo/0 [default0]:[NeMo I 2025-03-07 22:44:19 megatron_init:515] Rank 0 has embedding rank: 0\n", "ining-demo/0 [default0]:Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/8\n", "ining-demo/0 [default0]:----------------------------------------------------------------------------------------------------\n", "ining-demo/0 [default0]:distributed_backend=nccl\n", "ining-demo/0 [default0]:All distributed processes registered. 
Starting with 8 processes\n", "ining-demo/0 [default0]:----------------------------------------------------------------------------------------------------\n", "ining-demo/0 [default0]:\n", "ining-demo/0 [default7]:Initializing distributed: GLOBAL_RANK: 7, MEMBER: 8/8\n", "ining-demo/0 [default6]:Initializing distributed: GLOBAL_RANK: 6, MEMBER: 7/8\n", "ining-demo/0 [default1]:Initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/8\n", "ining-demo/0 [default5]:Initializing distributed: GLOBAL_RANK: 5, MEMBER: 6/8\n", "ining-demo/0 [default4]:Initializing distributed: GLOBAL_RANK: 4, MEMBER: 5/8\n", "ining-demo/0 [default3]:Initializing distributed: GLOBAL_RANK: 3, MEMBER: 4/8\n", "ining-demo/0 [default2]:Initializing distributed: GLOBAL_RANK: 2, MEMBER: 3/8\n", "ining-demo/0 [default0]:[NeMo I 2025-03-07 22:44:23 base:44] Padded vocab_size: 50432, original vocab_size: 50257, dummy tokens: 175.\n", "ining-demo/0 [default0]:[NeMo I 2025-03-07 22:44:23 megatron_strategy:327] Copying Trainer's 'max_steps' (20) to LR scheduler's 'max_steps'.\n", "ining-demo/0 [default0]:[NeMo I 2025-03-07 22:44:23 num_microbatches_calculator:228] setting number of microbatches to constant 128\n", "ining-demo/0 [default0]:[NeMo I 2025-03-07 22:44:23 megatron_parallel:549] > number of parameters on (tensor, pipeline) model parallel rank (0, 0): 54663936\n", "ining-demo/0 [default0]:[NeMo I 2025-03-07 22:44:23 utils:302] Setting up DistributedDataParallel with config DistributedDataParallelConfig(grad_reduce_in_fp32=True, overlap_grad_reduce=True, overlap_param_gather=True, align_param_gather=False, use_distributed_optimizer=True, num_distributed_optimizer_instances=1, check_for_nan_in_grad=True, bucket_size=40000000, average_in_collective=True, fp8_param_gather=False)\n", "ining-demo/0 [default0]:[NeMo I 2025-03-07 22:44:23 utils:323] Number of buckets for gradient all-reduce / reduce-scatter: 1\n", "ining-demo/0 [default0]: Params for bucket 1 (54663936 elements):\n", "ining-demo/0 [default0]: \tmodule.output_layer.weight\n", "ining-demo/0 [default0]: \tmodule.decoder.layers.3.self_attention.linear_qkv.weight\n", "ining-demo/0 [default0]: \tmodule.decoder.layers.2.self_attention.linear_proj.weight\n", "ining-demo/0 [default0]: \tmodule.decoder.layers.0.mlp.linear_fc1.layer_norm_weight\n", "ining-demo/0 [default0]: \tmodule.decoder.layers.3.self_attention.linear_qkv.layer_norm_weight\n", "ining-demo/0 [default0]: \tmodule.decoder.layers.2.mlp.linear_fc1.layer_norm_weight\n", "ining-demo/0 [default0]: \tmodule.decoder.layers.1.mlp.linear_fc2.weight\n", "ining-demo/0 [default0]: \tmodule.decoder.layers.1.self_attention.linear_proj.weight\n", "ining-demo/0 [default0]: \tmodule.decoder.layers.0.self_attention.linear_proj.weight\n", "ining-demo/0 [default0]: \tmodule.decoder.layers.3.self_attention.linear_proj.weight\n", "ining-demo/0 [default0]: \tmodule.decoder.layers.0.mlp.linear_fc2.weight\n", "ining-demo/0 [default0]: \tmodule.decoder.layers.1.mlp.linear_fc1.weight\n", "ining-demo/0 [default0]: \tmodule.decoder.layers.3.mlp.linear_fc1.layer_norm_weight\n", "ining-demo/0 [default0]: \tmodule.decoder.layers.2.mlp.linear_fc2.weight\n", "ining-demo/0 [default0]: \tmodule.decoder.layers.0.self_attention.linear_qkv.weight\n", "ining-demo/0 [default0]: \tmodule.decoder.layers.1.self_attention.linear_qkv.weight\n", "ining-demo/0 [default0]: \tmodule.decoder.layers.0.self_attention.linear_qkv.layer_norm_weight\n", "ining-demo/0 [default0]: \tmodule.decoder.layers.3.mlp.linear_fc2.weight\n", "ining-demo/0 [default0]: 
\tmodule.decoder.layers.2.mlp.linear_fc1.weight\n", "ining-demo/0 [default0]: \tmodule.decoder.layers.1.self_attention.linear_qkv.layer_norm_weight\n", "ining-demo/0 [default0]: \tmodule.embedding.word_embeddings.weight\n", "ining-demo/0 [default0]: \tmodule.decoder.layers.2.self_attention.linear_qkv.weight\n", "ining-demo/0 [default0]: \tmodule.decoder.layers.1.mlp.linear_fc1.layer_norm_weight\n", "ining-demo/0 [default0]: \tmodule.decoder.layers.3.mlp.linear_fc1.weight\n", "ining-demo/0 [default0]: \tmodule.decoder.final_layernorm.weight\n", "ining-demo/0 [default0]: \tmodule.decoder.layers.2.self_attention.linear_qkv.layer_norm_weight\n", "ining-demo/0 [default0]: \tmodule.decoder.layers.0.mlp.linear_fc1.weight\n", "ining-demo/0 [default0]:[NeMo I 2025-03-07 22:44:23 utils:302] Setting up optimizer with config OptimizerConfig(optimizer='adam', lr=0.0003, min_lr=None, decoupled_lr=None, decoupled_min_lr=None, weight_decay=0.1, fp16=False, bf16=True, params_dtype=torch.bfloat16, loss_scale=None, initial_loss_scale=4294967296, min_loss_scale=1.0, loss_scale_window=1000, hysteresis=2, adam_beta1=0.9, adam_beta2=0.95, adam_eps=1e-05, sgd_momentum=0.9, use_distributed_optimizer=True, overlap_param_gather_with_optimizer_step=False, clip_grad=1.0, log_num_zeros_in_grad=False, barrier_with_L1_time=False, timers=None, config_logger_dir='')\n", "ining-demo/0 [default2]:LOCAL_RANK: 2 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]\n", "ining-demo/0 [default3]:LOCAL_RANK: 3 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]\n", "ining-demo/0 [default0]:LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]\n", "ining-demo/0 [default0]:\n", "ining-demo/0 [default0]: | Name | Type | Params | Mode \n", "ining-demo/0 [default0]:----------------------------------------\n", "ining-demo/0 [default4]:LOCAL_RANK: 4 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]\n", "ining-demo/0 [default0]:0 | module | DDP | 54.7 M | train\n", "ining-demo/0 [default0]:----------------------------------------\n", "ining-demo/0 [default0]:54.7 M Trainable params\n", "ining-demo/0 [default0]:0 Non-trainable params\n", "ining-demo/0 [default0]:54.7 M Total params\n", "ining-demo/0 [default0]:218.656 Total estimated model params size (MB)\n", "ining-demo/0 [default0]:91 Modules in train mode\n", "ining-demo/0 [default0]:0 Modules in eval mode\n", "ining-demo/0 [default1]:LOCAL_RANK: 1 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]\n", "ining-demo/0 [default5]:LOCAL_RANK: 5 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]\n", "ining-demo/0 [default6]:LOCAL_RANK: 6 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]\n", "ining-demo/0 [default7]:LOCAL_RANK: 7 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]\n", "ining-demo/0 [default0]:[NeMo W 2025-03-07 22:44:33 rerun_state_machine:1088] Implicit initialization of Rerun State Machine!\n", "ining-demo/0 [default0]:[NeMo W 2025-03-07 22:44:33 rerun_state_machine:211] RerunStateMachine initialized in mode disabled\n", "ining-demo/0 [default0]:Training epoch 0, iteration 0/19 | lr: 1.499e-07 | global_batch_size: 512 | global_step: 0 | reduced_train_loss: 11.03 | train_step_timing in s: 10.4\n", "ining-demo/0 [default0]:Training epoch 0, iteration 1/19 | lr: 2.999e-07 | global_batch_size: 512 | global_step: 1 | reduced_train_loss: 11.03 | train_step_timing in s: 7.615 | consumed_samples: 1024\n", "ining-demo/0 [default0]:Training epoch 0, iteration 2/19 | lr: 4.498e-07 | global_batch_size: 512 | global_step: 2 | reduced_train_loss: 11.03 | train_step_timing in s: 7.625 | consumed_samples: 1536\n", "ining-demo/0 [default0]:Training 
epoch 0, iteration 3/19 | lr: 5.997e-07 | global_batch_size: 512 | global_step: 3 | reduced_train_loss: 11.03 | train_step_timing in s: 7.618 | consumed_samples: 2048\n", "ining-demo/0 [default0]:Training epoch 0, iteration 4/19 | lr: 7.496e-07 | global_batch_size: 512 | global_step: 4 | reduced_train_loss: 11.03 | train_step_timing in s: 7.624 | consumed_samples: 2560\n", "ining-demo/0 [default0]:Training epoch 0, iteration 5/19 | lr: 8.996e-07 | global_batch_size: 512 | global_step: 5 | reduced_train_loss: 11.03 | train_step_timing in s: 7.629 | consumed_samples: 3072\n", "ining-demo/0 [default0]:Training epoch 0, iteration 6/19 | lr: 1.049e-06 | global_batch_size: 512 | global_step: 6 | reduced_train_loss: 11.03 | train_step_timing in s: 7.63 | consumed_samples: 3584\n", "ining-demo/0 [default0]:Training epoch 0, iteration 7/19 | lr: 1.199e-06 | global_batch_size: 512 | global_step: 7 | reduced_train_loss: 11.03 | train_step_timing in s: 7.632 | consumed_samples: 4096\n", "ining-demo/0 [default0]:Training epoch 0, iteration 8/19 | lr: 1.349e-06 | global_batch_size: 512 | global_step: 8 | reduced_train_loss: 11.03 | train_step_timing in s: 7.636 | consumed_samples: 4608\n", "ining-demo/0 [default0]:Training epoch 0, iteration 9/19 | lr: 1.499e-06 | global_batch_size: 512 | global_step: 9 | reduced_train_loss: 11.03 | train_step_timing in s: 7.625 | consumed_samples: 5120\n", "ining-demo/0 [default0]:[NeMo W 2025-03-07 22:45:43 validation:389] There is difference in the common state dict in different ranks. The differences are {2: ([], [('optimizer', 0, 'optimizer', 'param_groups', 1, 'step')], []), 3: ([], [('optimizer', 0, 'optimizer', 'param_groups', 1, 'step')], []), 4: ([], [('optimizer', 0, 'optimizer', 'param_groups', 1, 'step')], []), 5: ([], [('optimizer', 0, 'optimizer', 'param_groups', 1, 'step')], [])}\n", "ining-demo/0 [default0]:[NeMo I 2025-03-07 22:45:46 model_checkpoint:497] Scheduled async checkpoint save for /tmp/nemo_run/checkpoints/resiliency-in-pretraining-demo/checkpoints/model_name=0--val_loss=0.00-step=9-consumed_samples=5120.0-last.ckpt\n", "ining-demo/0 [default0]:Training epoch 0, iteration 10/19 | lr: 1.649e-06 | global_batch_size: 512 | global_step: 10 | reduced_train_loss: 11.03 | train_step_timing in s: 7.633 | consumed_samples: 5632\n", "ining-demo/0 [default0]:[NeMo I 2025-03-07 22:45:53 model_checkpoint:522] Async checkpoint save for step 10 (/tmp/nemo_run/checkpoints/resiliency-in-pretraining-demo/checkpoints/model_name=0--val_loss=0.00-step=9-consumed_samples=5120.0-last.ckpt) finalized successfully.\n", "ining-demo/0 [default0]:Training epoch 0, iteration 11/19 | lr: 1.799e-06 | global_batch_size: 512 | global_step: 11 | reduced_train_loss: 11.03 | train_step_timing in s: 7.626 | consumed_samples: 6144\n", "ining-demo/0 [default0]:Training epoch 0, iteration 12/19 | lr: 1.949e-06 | global_batch_size: 512 | global_step: 12 | reduced_train_loss: 11.03 | train_step_timing in s: 7.629 | consumed_samples: 6656\n", "ining-demo/0 [default0]:Training epoch 0, iteration 13/19 | lr: 2.099e-06 | global_batch_size: 512 | global_step: 13 | reduced_train_loss: 11.03 | train_step_timing in s: 7.637 | consumed_samples: 7168\n", "ining-demo/0 [default0]:Training epoch 0, iteration 14/19 | lr: 2.249e-06 | global_batch_size: 512 | global_step: 14 | reduced_train_loss: 11.03 | train_step_timing in s: 7.626 | consumed_samples: 7680\n", "ining-demo/0 [default0]:Training epoch 0, iteration 15/19 | lr: 2.399e-06 | global_batch_size: 512 | global_step: 15 | 
reduced_train_loss: 11.03 | train_step_timing in s: 7.637 | consumed_samples: 8192\n", "ining-demo/0 [default0]:[NeMo I 2025-03-07 22:46:39 straggler_det_callback:144] \n", "ining-demo/0 [default0]: GPU relative performance:\n", "ining-demo/0 [default0]: Rank=4 Node=4809917ea058 Score=0.97\n", "ining-demo/0 [default0]: Rank=5 Node=4809917ea058 Score=0.98\n", "ining-demo/0 [default0]: Rank=1 Node=4809917ea058 Score=0.98\n", "ining-demo/0 [default0]: Rank=0 Node=4809917ea058 Score=0.98\n", "ining-demo/0 [default0]: Rank=6 Node=4809917ea058 Score=0.98\n", "ining-demo/0 [default0]: Rank=7 Node=4809917ea058 Score=0.99\n", "ining-demo/0 [default0]: Rank=2 Node=4809917ea058 Score=0.99\n", "ining-demo/0 [default0]: Rank=3 Node=4809917ea058 Score=1.00\n", "ining-demo/0 [default0]: \n", "ining-demo/0 [default0]:[NeMo I 2025-03-07 22:46:39 straggler_det_callback:153] \n", "ining-demo/0 [default0]: GPU individual performance:\n", "ining-demo/0 [default0]: Rank=0 Node=4809917ea058 Score=1.00\n", "ining-demo/0 [default0]: Rank=1 Node=4809917ea058 Score=1.00\n", "ining-demo/0 [default0]: Rank=2 Node=4809917ea058 Score=1.00\n", "ining-demo/0 [default0]: Rank=3 Node=4809917ea058 Score=1.00\n", "ining-demo/0 [default0]: Rank=4 Node=4809917ea058 Score=1.00\n", "ining-demo/0 [default0]: Rank=5 Node=4809917ea058 Score=1.00\n", "ining-demo/0 [default0]: Rank=6 Node=4809917ea058 Score=1.00\n", "ining-demo/0 [default0]: Rank=7 Node=4809917ea058 Score=1.00\n", "ining-demo/0 [default0]: \n", "ining-demo/0 [default0]:[NeMo I 2025-03-07 22:46:39 straggler_det_callback:236] Straggler report processing time: 0.040 sec.\n", "ining-demo/0 [default0]:Training epoch 0, iteration 16/19 | lr: 2.549e-06 | global_batch_size: 512 | global_step: 16 | reduced_train_loss: 11.03 | train_step_timing in s: 7.635 | consumed_samples: 8704\n", "ining-demo/0 [default0]:[NeMo I 2025-03-07 22:46:47 straggler_det_callback:144] \n", "ining-demo/0 [default0]: GPU relative performance:\n", "ining-demo/0 [default0]: Rank=4 Node=4809917ea058 Score=0.97\n", "ining-demo/0 [default0]: Rank=5 Node=4809917ea058 Score=0.98\n", "ining-demo/0 [default0]: Rank=1 Node=4809917ea058 Score=0.98\n", "ining-demo/0 [default0]: Rank=0 Node=4809917ea058 Score=0.98\n", "ining-demo/0 [default0]: Rank=6 Node=4809917ea058 Score=0.99\n", "ining-demo/0 [default0]: Rank=7 Node=4809917ea058 Score=0.99\n", "ining-demo/0 [default0]: Rank=2 Node=4809917ea058 Score=0.99\n", "ining-demo/0 [default0]: Rank=3 Node=4809917ea058 Score=1.00\n", "ining-demo/0 [default0]: \n", "ining-demo/0 [default0]:[NeMo I 2025-03-07 22:46:47 straggler_det_callback:153] \n", "ining-demo/0 [default0]: GPU individual performance:\n", "ining-demo/0 [default0]: Rank=3 Node=4809917ea058 Score=1.00\n", "ining-demo/0 [default0]: Rank=4 Node=4809917ea058 Score=1.00\n", "ining-demo/0 [default0]: Rank=6 Node=4809917ea058 Score=1.00\n", "ining-demo/0 [default0]: Rank=1 Node=4809917ea058 Score=1.00\n", "ining-demo/0 [default0]: Rank=0 Node=4809917ea058 Score=1.00\n", "ining-demo/0 [default0]: Rank=2 Node=4809917ea058 Score=1.00\n", "ining-demo/0 [default0]: Rank=5 Node=4809917ea058 Score=1.00\n", "ining-demo/0 [default0]: Rank=7 Node=4809917ea058 Score=1.00\n", "ining-demo/0 [default0]: \n", "ining-demo/0 [default0]:[NeMo I 2025-03-07 22:46:47 straggler_det_callback:236] Straggler report processing time: 0.015 sec.\n", "ining-demo/0 [default0]:Training epoch 0, iteration 17/19 | lr: 2.699e-06 | global_batch_size: 512 | global_step: 17 | reduced_train_loss: 11.03 | train_step_timing in s: 7.632 | 
consumed_samples: 9216\n", "ining-demo/0 [default0]:[NeMo I 2025-03-07 22:46:55 straggler_det_callback:144] \n", "ining-demo/0 [default0]: GPU relative performance:\n", "ining-demo/0 [default0]: Rank=4 Node=4809917ea058 Score=0.97\n", "ining-demo/0 [default0]: Rank=5 Node=4809917ea058 Score=0.98\n", "ining-demo/0 [default0]: Rank=1 Node=4809917ea058 Score=0.98\n", "ining-demo/0 [default0]: Rank=0 Node=4809917ea058 Score=0.98\n", "ining-demo/0 [default0]: Rank=6 Node=4809917ea058 Score=0.98\n", "ining-demo/0 [default0]: Rank=7 Node=4809917ea058 Score=0.99\n", "ining-demo/0 [default0]: Rank=2 Node=4809917ea058 Score=0.99\n", "ining-demo/0 [default0]: Rank=3 Node=4809917ea058 Score=0.99\n", "ining-demo/0 [default0]: \n", "ining-demo/0 [default0]:[NeMo I 2025-03-07 22:46:55 straggler_det_callback:153] \n", "ining-demo/0 [default0]: GPU individual performance:\n", "ining-demo/0 [default0]: Rank=3 Node=4809917ea058 Score=1.00\n", "ining-demo/0 [default0]: Rank=0 Node=4809917ea058 Score=1.00\n", "ining-demo/0 [default0]: Rank=6 Node=4809917ea058 Score=1.00\n", "ining-demo/0 [default0]: Rank=1 Node=4809917ea058 Score=1.00\n", "ining-demo/0 [default0]: Rank=4 Node=4809917ea058 Score=1.00\n", "ining-demo/0 [default0]: Rank=7 Node=4809917ea058 Score=1.00\n", "ining-demo/0 [default0]: Rank=2 Node=4809917ea058 Score=1.00\n", "ining-demo/0 [default0]: Rank=5 Node=4809917ea058 Score=1.00\n", "ining-demo/0 [default0]: \n", "ining-demo/0 [default0]:[NeMo I 2025-03-07 22:46:55 straggler_det_callback:236] Straggler report processing time: 0.015 sec.\n", "ining-demo/0 [default0]:Training epoch 0, iteration 18/19 | lr: 2.849e-06 | global_batch_size: 512 | global_step: 18 | reduced_train_loss: 11.03 | train_step_timing in s: 7.637 | consumed_samples: 9728\n", "ining-demo/0 [default0]:[NeMo W 2025-03-07 22:46:55 validation:389] There is difference in the common state dict in different ranks. 
The differences are {2: ([], [('optimizer', 0, 'optimizer', 'param_groups', 1, 'step')], []), 3: ([], [('optimizer', 0, 'optimizer', 'param_groups', 1, 'step')], []), 4: ([], [('optimizer', 0, 'optimizer', 'param_groups', 1, 'step')], []), 5: ([], [('optimizer', 0, 'optimizer', 'param_groups', 1, 'step')], [])}\n", "ining-demo/0 [default0]:[NeMo I 2025-03-07 22:46:55 model_checkpoint:497] Scheduled async checkpoint save for /tmp/nemo_run/checkpoints/resiliency-in-pretraining-demo/checkpoints/model_name=0--val_loss=0.00-step=18-consumed_samples=9728.0-last.ckpt\n", "ining-demo/0 [default0]:[NeMo I 2025-03-07 22:47:03 straggler_det_callback:144] \n", "ining-demo/0 [default0]: GPU relative performance:\n", "ining-demo/0 [default0]: Rank=4 Node=4809917ea058 Score=0.98\n", "ining-demo/0 [default0]: Rank=5 Node=4809917ea058 Score=0.98\n", "ining-demo/0 [default0]: Rank=1 Node=4809917ea058 Score=0.98\n", "ining-demo/0 [default0]: Rank=0 Node=4809917ea058 Score=0.99\n", "ining-demo/0 [default0]: Rank=6 Node=4809917ea058 Score=0.99\n", "ining-demo/0 [default0]: Rank=7 Node=4809917ea058 Score=0.99\n", "ining-demo/0 [default0]: Rank=2 Node=4809917ea058 Score=0.99\n", "ining-demo/0 [default0]: Rank=3 Node=4809917ea058 Score=1.00\n", "ining-demo/0 [default0]: \n", "ining-demo/0 [default0]:[NeMo I 2025-03-07 22:47:03 straggler_det_callback:153] \n", "ining-demo/0 [default0]: GPU individual performance:\n", "ining-demo/0 [default0]: Rank=3 Node=4809917ea058 Score=0.99\n", "ining-demo/0 [default0]: Rank=2 Node=4809917ea058 Score=1.00\n", "ining-demo/0 [default0]: Rank=1 Node=4809917ea058 Score=1.00\n", "ining-demo/0 [default0]: Rank=6 Node=4809917ea058 Score=1.00\n", "ining-demo/0 [default0]: Rank=4 Node=4809917ea058 Score=1.00\n", "ining-demo/0 [default0]: Rank=5 Node=4809917ea058 Score=1.00\n", "ining-demo/0 [default0]: Rank=7 Node=4809917ea058 Score=1.00\n", "ining-demo/0 [default0]: Rank=0 Node=4809917ea058 Score=1.00\n", "ining-demo/0 [default0]: \n", "ining-demo/0 [default0]:[NeMo I 2025-03-07 22:47:03 straggler_det_callback:236] Straggler report processing time: 0.014 sec.\n", "ining-demo/0 [default0]:Training epoch 1, iteration 0/19 | lr: 2.999e-06 | global_batch_size: 512 | global_step: 19 | reduced_train_loss: 11.03 | train_step_timing in s: 7.73 | consumed_samples: 10240\n", "ining-demo/0 [default0]:[NeMo I 2025-03-07 22:47:03 model_checkpoint:522] Async checkpoint save for step 19 (/tmp/nemo_run/checkpoints/resiliency-in-pretraining-demo/checkpoints/model_name=0--val_loss=0.00-step=18-consumed_samples=9728.0-last.ckpt) finalized successfully.\n", "ining-demo/0 [default0]:`Trainer.fit` stopped: `max_steps=20` reached.\n", "ining-demo/0 I0307 22:47:10.001000 199266 torch/distributed/elastic/agent/server/api.py:864] [default] worker group successfully finished. Waiting 300 seconds for other agents to finish.\n", "ining-demo/0 I0307 22:47:10.001000 199266 torch/distributed/elastic/agent/server/api.py:917] Local worker group finished (WorkerState.SUCCEEDED). Waiting 300 seconds for other agents to finish\n", "ining-demo/0 I0307 22:47:10.001000 199266 torch/distributed/elastic/agent/server/api.py:931] Done waiting for other agents. Elapsed: 0.00034165382385253906 seconds\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "Job resiliency-in-pretraining-demo-n9hk3pk4hcz23c finished: SUCCEEDED\n" ] }, { "data": { "text/html": [ "
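{ "cell_type": "markdown", "id": "success-note", "metadata": {}, "source": [ "Since the clock lock did not take effect, the straggler reports in the log above are informational only: all scores stay in the 0.97-1.00 range, no straggler warning is raised, and training runs through `max_steps=20` to a `SUCCEEDED` status. You can double-check the final state through the experiment API, using the experiment id printed in the banner above:" ] },
{ "cell_type": "code", "execution_count": null, "id": "success-check", "metadata": {}, "outputs": [], "source": [ "# Reconstruct the experiment from its id (printed in the banner above)\n", "experiment = run.Experiment.from_id(\"resiliency-in-pretraining-demo_1741387447\")\n", "experiment.status()  # expected: SUCCEEDED for task 0" ] },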
                                                                                                                   \n",
       "# The experiment was run with the following tasks: ['resiliency-in-pretraining-demo']                              \n",
       "# You can inspect and reconstruct this experiment at a later point in time using:                                  \n",
       "experiment = run.Experiment.from_id(\"resiliency-in-pretraining-demo_1741387447\")                                   \n",
       "experiment.status() # Gets the overall status                                                                      \n",
       "experiment.logs(\"resiliency-in-pretraining-demo\") # Gets the log for the provided task                             \n",
       "experiment.cancel(\"resiliency-in-pretraining-demo\") # Cancels the provided task if still running                   \n",
       "                                                                                                                   \n",
       "
\n" ], "text/plain": [ "\u001b[48;2;39;40;34m \u001b[0m\n", "\u001b[38;2;149;144;119;48;2;39;40;34m# The experiment was run with the following tasks: ['resiliency-in-pretraining-demo']\u001b[0m\u001b[48;2;39;40;34m \u001b[0m\n", "\u001b[38;2;149;144;119;48;2;39;40;34m# You can inspect and reconstruct this experiment at a later point in time using:\u001b[0m\u001b[48;2;39;40;34m \u001b[0m\n", "\u001b[38;2;248;248;242;48;2;39;40;34mexperiment\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;255;70;137;48;2;39;40;34m=\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mrun\u001b[0m\u001b[38;2;255;70;137;48;2;39;40;34m.\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mExperiment\u001b[0m\u001b[38;2;255;70;137;48;2;39;40;34m.\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mfrom_id\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m(\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m\"\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34mresiliency-in-pretraining-demo_1741387447\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m\"\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m)\u001b[0m\u001b[48;2;39;40;34m \u001b[0m\n", "\u001b[38;2;248;248;242;48;2;39;40;34mexperiment\u001b[0m\u001b[38;2;255;70;137;48;2;39;40;34m.\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mstatus\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m(\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m)\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;149;144;119;48;2;39;40;34m# Gets the overall status\u001b[0m\u001b[48;2;39;40;34m \u001b[0m\n", "\u001b[38;2;248;248;242;48;2;39;40;34mexperiment\u001b[0m\u001b[38;2;255;70;137;48;2;39;40;34m.\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mlogs\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m(\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m\"\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34mresiliency-in-pretraining-demo\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m\"\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m)\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;149;144;119;48;2;39;40;34m# Gets the log for the provided task\u001b[0m\u001b[48;2;39;40;34m \u001b[0m\n", "\u001b[38;2;248;248;242;48;2;39;40;34mexperiment\u001b[0m\u001b[38;2;255;70;137;48;2;39;40;34m.\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mcancel\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m(\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m\"\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34mresiliency-in-pretraining-demo\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m\"\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m)\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;149;144;119;48;2;39;40;34m# Cancels the provided task if still running\u001b[0m\u001b[48;2;39;40;34m \u001b[0m\n", "\u001b[48;2;39;40;34m \u001b[0m\n" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
                                                                                                                   \n",
       "# You can inspect this experiment at a later point in time using the CLI as well:                                  \n",
       "nemo experiment status resiliency-in-pretraining-demo_1741387447                                                   \n",
       "nemo experiment logs resiliency-in-pretraining-demo_1741387447 0                                                   \n",
       "nemo experiment cancel resiliency-in-pretraining-demo_1741387447 0                                                 \n",
       "                                                                                                                   \n",
       "
\n" ], "text/plain": [ "\u001b[48;2;39;40;34m \u001b[0m\n", "\u001b[38;2;149;144;119;48;2;39;40;34m# You can inspect this experiment at a later point in time using the CLI as well:\u001b[0m\u001b[48;2;39;40;34m \u001b[0m\n", "\u001b[38;2;248;248;242;48;2;39;40;34mnemo\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mexperiment\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mstatus\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mresiliency-in-pretraining-demo_1741387447\u001b[0m\u001b[48;2;39;40;34m \u001b[0m\n", "\u001b[38;2;248;248;242;48;2;39;40;34mnemo\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mexperiment\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mlogs\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mresiliency-in-pretraining-demo_1741387447\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;174;129;255;48;2;39;40;34m0\u001b[0m\u001b[48;2;39;40;34m \u001b[0m\n", "\u001b[38;2;248;248;242;48;2;39;40;34mnemo\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mexperiment\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mcancel\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mresiliency-in-pretraining-demo_1741387447\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;174;129;255;48;2;39;40;34m0\u001b[0m\u001b[48;2;39;40;34m \u001b[0m\n", "\u001b[48;2;39;40;34m \u001b[0m\n" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Automatically detect and mitigate mock stragglers during training\n", "# gpu_relative_perf_threshold and gpu_individual_perf_threshold default to 0.7 if not set explicitly\n", "pretrain.trainer.callbacks.append(straggler_det_callback(straggler_report_time_interval=1))\n", "\n", "# run the experiment\n", "run_experiment(exp_name, pretrain, executor, run_plugins, dryrun=False)" ] }, { "cell_type": "markdown", "id": "e5b91070-e9f0-4fb0-ac3a-2460e1a948d9", "metadata": {}, "source": [ "The straggler detection system identifies GPUs that are lagging behind in performance, halts the job to prevent inefficiencies, and provides detailed information about which GPUs are struggling. It monitors GPU performance of ranks to pinpoint slower ranks that may hinder overall training efficiency, thus enabling targeted optimization for distributed training setups." ] }, { "cell_type": "code", "execution_count": 21, "id": "9ac65fee-c78c-4462-b372-a8377797bb0d", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The current user does not have permission to change clocks for GPU 00000000:01:00.0.\n", "Terminating early due to previous errors.\n", "The current user does not have permission to change clocks for GPU 00000000:01:00.0.\n", "Terminating early due to previous errors.\n" ] } ], "source": [ "### !!!! IMPORTANT !!!! 
###\n", "### Reset the GPU clocks\n", "!nvidia-smi --reset-gpu-clocks\n", "!nvidia-smi --reset-memory-clocks" ] }, { "cell_type": "markdown", "id": "a8838d46-3d6f-40d1-9741-d8702f6c8a45", "metadata": {}, "source": [ "## 4.2 Cleanup" ] }, { "cell_type": "code", "execution_count": 22, "id": "937799e5-a4f0-40d8-88fe-508961d4d8f5", "metadata": {}, "outputs": [], "source": [ "%%bash\n", "# delete old checkpoints\n", "rm -rf /tmp/nemo_run/checkpoints/" ] }, { "cell_type": "code", "execution_count": 23, "id": "59138478-1bd1-47b9-87da-2c7516464221", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[]" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# restore pretrain.trainer.callbacks\n", "pretrain.trainer.callbacks = copy.deepcopy(pretrain_trainer_callbacks)\n", "run_plugins = []\n", "pretrain.trainer.callbacks" ] }, { "cell_type": "markdown", "id": "ef6d3d72-eefa-4e5e-b800-8f81ded7fdc4", "metadata": {}, "source": [ "# 4. Demonstrate Preemption\n", "The [Preemption Plugin](https://github.com/NVIDIA/NeMo/blob/main/nemo/lightning/pytorch/callbacks/preemption.py) provides graceful shutdown capabilities:\n", "- Monitors for shutdown signals (default: `signal.SIGTERM`)\n", "- Saves a checkpoint when a shutdown signal is received\n", "- Ensures training progress is preserved before termination" ] }, { "cell_type": "markdown", "id": "f430b91a-a19e-479a-862a-e009b69fd58f", "metadata": {}, "source": [ "## 4.1 Setup the preemption simulator\n", "We use the `PreemptionSimulationCallback` to simulate a `signal.SIGTERM` during training. This callback is configured to raise a `signal.SIGTERM` at step 4.\n", "\n", "Expected workflow:\n", "- Start training: Trainer Step counter = 0\n", "- After 4 trainer steps: Trainer Step counter = 10 -> raise `signal.SIGTERM` -> Preemption callback saves an async checkpoint before gracefully exiting" ] }, { "cell_type": "code", "execution_count": 24, "id": "f24a1d67-da9f-490c-91a7-b1b3a3f25273", "metadata": {}, "outputs": [], "source": [ "# Add Preemption plugin\n", "run_plugins = [plugins.PreemptionPlugin()]\n", "\n", "# Enable a preemption simulation callback\n", "pretrain.trainer.callbacks.append(run.Config(PreemptionSimulationCallback, preemption_step=4))" ] }, { "cell_type": "markdown", "id": "9ab72b7f-20b5-4513-a65a-69052b82b50d", "metadata": {}, "source": [ "## 4.2 Run the experiment" ] }, { "cell_type": "code", "execution_count": 25, "id": "145c4c35-4aff-4985-a5de-24e826f33fc1", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
────── Entering Experiment resiliency-in-pretraining-demo with id: resiliency-in-pretraining-demo_1741387633 ──────\n",
       "
\n" ], "text/plain": [ "\u001b[92m────── \u001b[0m\u001b[1;35mEntering Experiment resiliency-in-pretraining-demo with id: resiliency-in-pretraining-demo_1741387633\u001b[0m\u001b[92m ──────\u001b[0m\n" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "Launching training run ...\n" ] }, { "data": { "text/html": [ "
[22:47:13]  Cannot detach from this experiment. Please keep it running until completion.          experiment.py:651\n",
       "
\n" ], "text/plain": [ "\u001b[2;36m[22:47:13]\u001b[0m\u001b[2;36m \u001b[0m\u001b[1;31m Cannot detach from this experiment. Please keep it running until completion.\u001b[0m \u001b]8;id=867994;file:///opt/NeMo-Run/src/nemo_run/run/experiment.py\u001b\\\u001b[2mexperiment.py\u001b[0m\u001b]8;;\u001b\\\u001b[2m:\u001b[0m\u001b]8;id=660744;file:///opt/NeMo-Run/src/nemo_run/run/experiment.py#651\u001b\\\u001b[2m651\u001b[0m\u001b]8;;\u001b\\\n" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stderr", "output_type": "stream", "text": [ "Log directory is: /root/.nemo_run/experiments/resiliency-in-pretraining-demo/resiliency-in-pretraining-demo_1741387633/resiliency-in-pretraining-demo\n" ] }, { "data": { "text/html": [ "
           Launching job resiliency-in-pretraining-demo for experiment                            experiment.py:724\n",
       "           resiliency-in-pretraining-demo                                                                          \n",
       "
\n" ], "text/plain": [ "\u001b[2;36m \u001b[0m\u001b[2;36m \u001b[0m\u001b[1;36mLaunching job resiliency-in-pretraining-demo for experiment \u001b[0m \u001b]8;id=583095;file:///opt/NeMo-Run/src/nemo_run/run/experiment.py\u001b\\\u001b[2mexperiment.py\u001b[0m\u001b]8;;\u001b\\\u001b[2m:\u001b[0m\u001b]8;id=29280;file:///opt/NeMo-Run/src/nemo_run/run/experiment.py#724\u001b\\\u001b[2m724\u001b[0m\u001b]8;;\u001b\\\n", "\u001b[2;36m \u001b[0m\u001b[1;36mresiliency-in-pretraining-demo\u001b[0m \u001b[2m \u001b[0m\n" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stderr", "output_type": "stream", "text": [ "Log directory is: /root/.nemo_run/experiments/resiliency-in-pretraining-demo/resiliency-in-pretraining-demo_1741387633/resiliency-in-pretraining-demo\n", "Launched app: local_persistent://nemo_run/resiliency-in-pretraining-demo-bngwzzcstc0p3\n", "AppStatus:\n", " State: RUNNING\n", " Num Restarts: 0\n", " Roles: \n", " Msg: \n", " Structured Error Msg: \n", " UI URL: file:///root/.nemo_run/experiments/resiliency-in-pretraining-demo/resiliency-in-pretraining-demo_1741387633/resiliency-in-pretraining-demo/nemo_run/resiliency-in-pretraining-demo-bngwzzcstc0p3\n", " \n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Experiment executed successfully.\n" ] }, { "data": { "text/html": [ "
─────────────────── Waiting for Experiment resiliency-in-pretraining-demo_1741387633 to finish ────────────────────\n",
       "
\n" ], "text/plain": [ "\u001b[92m─────────────────── \u001b[0m\u001b[1;35mWaiting for Experiment resiliency-in-pretraining-demo_1741387633 to finish\u001b[0m\u001b[92m ────────────────────\u001b[0m\n" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
\n",
       "
\n" ], "text/plain": [ "\n" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
Experiment Status for resiliency-in-pretraining-demo_1741387633\n",
       "
\n" ], "text/plain": [ "\u001b[1;32mExperiment Status for\u001b[0m \u001b[1;38;5;214mresiliency-in-pretraining-demo_1741387633\u001b[0m\n" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
\n",
       "Task 0: resiliency-in-pretraining-demo\n",
       "- Status: RUNNING\n",
       "- Executor: LocalExecutor\n",
       "- Job id: resiliency-in-pretraining-demo-bngwzzcstc0p3\n",
       "- Local Directory: /root/.nemo_run/experiments/resiliency-in-pretraining-demo/resiliency-in-pretraining-demo_1741387633/resiliency-in-pretraining-demo\n",
       "
\n" ], "text/plain": [ "\n", "\u001b[1;32mTask 0\u001b[0m: \u001b[1;38;5;214mresiliency-in-pretraining-demo\u001b[0m\n", "- \u001b[1;32mStatus\u001b[0m: RUNNING\n", "- \u001b[1;32mExecutor\u001b[0m: LocalExecutor\n", "- \u001b[1;32mJob id\u001b[0m: resiliency-in-pretraining-demo-bngwzzcstc0p3\n", "- \u001b[1;32mLocal Directory\u001b[0m: /root/.nemo_run/experiments/resiliency-in-pretraining-demo/resiliency-in-pretraining-demo_1741387633/resiliency-in-pretraining-demo\n" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
\n",
       "
\n" ], "text/plain": [ "\n" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stderr", "output_type": "stream", "text": [ "Waiting for job resiliency-in-pretraining-demo-bngwzzcstc0p3 to finish [log=True]...\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "ining-demo/0 W0307 22:47:14.814000 207080 torch/distributed/run.py:793] \n", "ining-demo/0 W0307 22:47:14.814000 207080 torch/distributed/run.py:793] *****************************************\n", "ining-demo/0 W0307 22:47:14.814000 207080 torch/distributed/run.py:793] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. \n", "ining-demo/0 W0307 22:47:14.814000 207080 torch/distributed/run.py:793] *****************************************\n", "ining-demo/0 I0307 22:47:14.814000 207080 torch/distributed/launcher/api.py:194] Starting elastic_operator with launch configs:\n", "ining-demo/0 I0307 22:47:14.814000 207080 torch/distributed/launcher/api.py:194] entrypoint : nemo_run.core.runners.fdl_runner\n", "ining-demo/0 I0307 22:47:14.814000 207080 torch/distributed/launcher/api.py:194] min_nodes : 1\n", "ining-demo/0 I0307 22:47:14.814000 207080 torch/distributed/launcher/api.py:194] max_nodes : 1\n", "ining-demo/0 I0307 22:47:14.814000 207080 torch/distributed/launcher/api.py:194] nproc_per_node : 8\n", "ining-demo/0 I0307 22:47:14.814000 207080 torch/distributed/launcher/api.py:194] run_id : 5493\n", "ining-demo/0 I0307 22:47:14.814000 207080 torch/distributed/launcher/api.py:194] rdzv_backend : c10d\n", "ining-demo/0 I0307 22:47:14.814000 207080 torch/distributed/launcher/api.py:194] rdzv_endpoint : localhost:0\n", "ining-demo/0 I0307 22:47:14.814000 207080 torch/distributed/launcher/api.py:194] rdzv_configs : {'timeout': 900}\n", "ining-demo/0 I0307 22:47:14.814000 207080 torch/distributed/launcher/api.py:194] max_restarts : 0\n", "ining-demo/0 I0307 22:47:14.814000 207080 torch/distributed/launcher/api.py:194] monitor_interval : 0.1\n", "ining-demo/0 I0307 22:47:14.814000 207080 torch/distributed/launcher/api.py:194] log_dir : /root/.nemo_run/experiments/resiliency-in-pretraining-demo/resiliency-in-pretraining-demo_1741387633/resiliency-in-pretraining-demo/nemo_run/resiliency-in-pretraining-demo-bngwzzcstc0p3/torchelastic/resiliency-in-pretraining-demo\n", "ining-demo/0 I0307 22:47:14.814000 207080 torch/distributed/launcher/api.py:194] metrics_cfg : {}\n", "ining-demo/0 I0307 22:47:14.814000 207080 torch/distributed/launcher/api.py:194] \n", "ining-demo/0 I0307 22:47:14.819000 207080 torch/distributed/elastic/agent/server/api.py:845] [default] starting workers for entrypoint: python\n", "ining-demo/0 I0307 22:47:14.819000 207080 torch/distributed/elastic/agent/server/api.py:662] [default] Rendezvous'ing worker group\n", "ining-demo/0 I0307 22:47:14.945000 207080 torch/distributed/elastic/agent/server/api.py:525] [default] Rendezvous complete for workers. 
Result:\n", "ining-demo/0 I0307 22:47:14.945000 207080 torch/distributed/elastic/agent/server/api.py:525] restart_count=0\n", "ining-demo/0 I0307 22:47:14.945000 207080 torch/distributed/elastic/agent/server/api.py:525] master_addr=localhost\n", "ining-demo/0 I0307 22:47:14.945000 207080 torch/distributed/elastic/agent/server/api.py:525] master_port=39547\n", "ining-demo/0 I0307 22:47:14.945000 207080 torch/distributed/elastic/agent/server/api.py:525] group_rank=0\n", "ining-demo/0 I0307 22:47:14.945000 207080 torch/distributed/elastic/agent/server/api.py:525] group_world_size=1\n", "ining-demo/0 I0307 22:47:14.945000 207080 torch/distributed/elastic/agent/server/api.py:525] local_ranks=[0, 1, 2, 3, 4, 5, 6, 7]\n", "ining-demo/0 I0307 22:47:14.945000 207080 torch/distributed/elastic/agent/server/api.py:525] role_ranks=[0, 1, 2, 3, 4, 5, 6, 7]\n", "ining-demo/0 I0307 22:47:14.945000 207080 torch/distributed/elastic/agent/server/api.py:525] global_ranks=[0, 1, 2, 3, 4, 5, 6, 7]\n", "ining-demo/0 I0307 22:47:14.945000 207080 torch/distributed/elastic/agent/server/api.py:525] role_world_sizes=[8, 8, 8, 8, 8, 8, 8, 8]\n", "ining-demo/0 I0307 22:47:14.945000 207080 torch/distributed/elastic/agent/server/api.py:525] global_world_sizes=[8, 8, 8, 8, 8, 8, 8, 8]\n", "ining-demo/0 I0307 22:47:14.945000 207080 torch/distributed/elastic/agent/server/api.py:525] \n", "ining-demo/0 I0307 22:47:14.945000 207080 torch/distributed/elastic/agent/server/api.py:670] [default] Starting worker group\n", "ining-demo/0 I0307 22:47:14.945000 207080 torch/distributed/elastic/agent/server/local_elastic_agent.py:291] use_agent_store: True\n", "ining-demo/0 I0307 22:47:14.946000 207080 torch/distributed/elastic/agent/server/local_elastic_agent.py:192] Environment variable 'TORCHELASTIC_ENABLE_FILE_TIMER' not found. Do not start FileTimerServer.\n", "ining-demo/0 I0307 22:47:14.946000 207080 torch/distributed/elastic/agent/server/local_elastic_agent.py:229] Environment variable 'TORCHELASTIC_HEALTH_CHECK_PORT' not found. Do not start health check.\n", "ining-demo/0 [default0]:[NeMo W 2025-03-07 22:47:24 nemo_logging:361] /usr/local/lib/python3.10/dist-packages/pyannote/core/notebook.py:134: MatplotlibDeprecationWarning: The get_cmap function was deprecated in Matplotlib 3.7 and will be removed in 3.11. 
Use ``matplotlib.colormaps[name]`` or ``matplotlib.colormaps.get_cmap()`` or ``pyplot.get_cmap()`` instead.\n", "ining-demo/0 [default0]: cm = get_cmap(\"Set1\")\n", "ining-demo/0 [default0]: \n", "ining-demo/0 [default0]:[NeMo I 2025-03-07 22:47:26 tokenizer_utils:224] Getting Megatron tokenizer for pretrained model name: megatron-gpt-345m, custom vocab file: None, and merges file: None\n", "ining-demo/0 [default0]:[NeMo I 2025-03-07 22:47:26 tokenizer_utils:130] Getting HuggingFace AutoTokenizer with pretrained_model_name: gpt2, vocab_file: /root/.cache/torch/megatron/megatron-gpt-345m_vocab, merges_files: /root/.cache/torch/megatron/megatron-gpt-345m_merges, special_tokens_dict: {}, and use_fast: False\n", "ining-demo/0 [default0]:Setup to simulate a preemption if step == 4\n", "ining-demo/0 [default3]:Setup to simulate a preemption if step == 4\n", "ining-demo/0 [default5]:Setup to simulate a preemption if step == 4\n", "ining-demo/0 [default0]:[NeMo I 2025-03-07 22:47:26 nemo_logger:145] Experiments will be logged at /tmp/nemo_run/checkpoints/resiliency-in-pretraining-demo\n", "ining-demo/0 [default0]:[NeMo I 2025-03-07 22:47:26 megatron_strategy:315] Fixing mis-match between ddp-config & mcore-optimizer config\n", "ining-demo/0 [default4]:Setup to simulate a preemption if step == 4\n", "ining-demo/0 [default1]:Setup to simulate a preemption if step == 4\n", "ining-demo/0 [default7]:Setup to simulate a preemption if step == 4\n", "ining-demo/0 [default6]:Setup to simulate a preemption if step == 4\n", "ining-demo/0 [default0]:GPU available: True (cuda), used: True\n", "ining-demo/0 [default0]:TPU available: False, using: 0 TPU cores\n", "ining-demo/0 [default0]:HPU available: False, using: 0 HPUs\n", "ining-demo/0 [default0]:[NeMo W 2025-03-07 22:47:26 nemo_logger:123] No version folders would be created under the log folder as 'resume_if_exists' is enabled.\n", "ining-demo/0 [default0]:[NeMo W 2025-03-07 22:47:26 nemo_logger:173] \"update_logger_directory\" is True. Overwriting tensorboard logger \"save_dir\" to /tmp/nemo_run/checkpoints/tb_logs\n", "ining-demo/0 [default0]:[NeMo W 2025-03-07 22:47:26 nemo_logger:189] The Trainer already contains a ModelCheckpoint callback. This will be overwritten.\n", "ining-demo/0 [default0]:[NeMo W 2025-03-07 22:47:26 nemo_logger:212] The checkpoint callback was told to monitor a validation value and trainer's max_steps was set to 20. Please ensure that max_steps will run for at least 1 epochs to ensure that checkpointing will not error out.\n", "ining-demo/0 [default0]:[NeMo W 2025-03-07 22:47:26 resume:228] There were no checkpoints found in checkpoint_dir or no checkpoint folder at checkpoint_dir :/tmp/nemo_run/checkpoints/resiliency-in-pretraining-demo/checkpoints. 
Training from scratch.\n", "ining-demo/0 [default2]:Setup to simulate a preemption if step == 4\n", "ining-demo/0 [default0]:[NeMo I 2025-03-07 22:47:27 megatron_init:426] Rank 0 has data parallel group : [0, 2, 4, 6]\n", "ining-demo/0 [default0]:[NeMo I 2025-03-07 22:47:27 megatron_init:432] Rank 0 has combined group of data parallel and context parallel : [0, 2, 4, 6]\n", "ining-demo/0 [default0]:[NeMo I 2025-03-07 22:47:27 megatron_init:437] All data parallel group ranks with context parallel combined: [[0, 2, 4, 6], [1, 3, 5, 7]]\n", "ining-demo/0 [default0]:[NeMo I 2025-03-07 22:47:27 megatron_init:440] Ranks 0 has data parallel rank: 0\n", "ining-demo/0 [default0]:[NeMo I 2025-03-07 22:47:27 megatron_init:448] Rank 0 has context parallel group: [0]\n", "ining-demo/0 [default0]:[NeMo I 2025-03-07 22:47:27 megatron_init:451] All context parallel group ranks: [[0], [1], [2], [3], [4], [5], [6], [7]]\n", "ining-demo/0 [default0]:[NeMo I 2025-03-07 22:47:27 megatron_init:452] Ranks 0 has context parallel rank: 0\n", "ining-demo/0 [default0]:[NeMo I 2025-03-07 22:47:27 megatron_init:459] Rank 0 has model parallel group: [0, 1]\n", "ining-demo/0 [default0]:[NeMo I 2025-03-07 22:47:27 megatron_init:460] All model parallel group ranks: [[0, 1], [2, 3], [4, 5], [6, 7]]\n", "ining-demo/0 [default0]:[NeMo I 2025-03-07 22:47:27 megatron_init:469] Rank 0 has tensor model parallel group: [0, 1]\n", "ining-demo/0 [default0]:[NeMo I 2025-03-07 22:47:27 megatron_init:473] All tensor model parallel group ranks: [[0, 1], [2, 3], [4, 5], [6, 7]]\n", "ining-demo/0 [default0]:[NeMo I 2025-03-07 22:47:27 megatron_init:474] Rank 0 has tensor model parallel rank: 0\n", "ining-demo/0 [default0]:[NeMo I 2025-03-07 22:47:27 megatron_init:494] Rank 0 has pipeline model parallel group: [0]\n", "ining-demo/0 [default0]:[NeMo I 2025-03-07 22:47:27 megatron_init:506] Rank 0 has embedding group: [0]\n", "ining-demo/0 [default0]:[NeMo I 2025-03-07 22:47:27 megatron_init:512] All pipeline model parallel group ranks: [[0], [1], [2], [3], [4], [5], [6], [7]]\n", "ining-demo/0 [default0]:[NeMo I 2025-03-07 22:47:27 megatron_init:513] Rank 0 has pipeline model parallel rank 0\n", "ining-demo/0 [default0]:[NeMo I 2025-03-07 22:47:27 megatron_init:514] All embedding group ranks: [[0], [1], [2], [3], [4], [5], [6], [7]]\n", "ining-demo/0 [default0]:[NeMo I 2025-03-07 22:47:27 megatron_init:515] Rank 0 has embedding rank: 0\n", "ining-demo/0 [default0]:Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/8\n", "ining-demo/0 [default0]:----------------------------------------------------------------------------------------------------\n", "ining-demo/0 [default0]:distributed_backend=nccl\n", "ining-demo/0 [default0]:All distributed processes registered. 
Starting with 8 processes\n", "ining-demo/0 [default0]:----------------------------------------------------------------------------------------------------\n", "ining-demo/0 [default0]:\n", "ining-demo/0 [default3]:Initializing distributed: GLOBAL_RANK: 3, MEMBER: 4/8\n", "ining-demo/0 [default5]:Initializing distributed: GLOBAL_RANK: 5, MEMBER: 6/8\n", "ining-demo/0 [default6]:Initializing distributed: GLOBAL_RANK: 6, MEMBER: 7/8\n", "ining-demo/0 [default1]:Initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/8\n", "ining-demo/0 [default2]:Initializing distributed: GLOBAL_RANK: 2, MEMBER: 3/8\n", "ining-demo/0 [default4]:Initializing distributed: GLOBAL_RANK: 4, MEMBER: 5/8\n", "ining-demo/0 [default7]:Initializing distributed: GLOBAL_RANK: 7, MEMBER: 8/8\n", "ining-demo/0 [default0]:[NeMo I 2025-03-07 22:47:29 base:44] Padded vocab_size: 50432, original vocab_size: 50257, dummy tokens: 175.\n", "ining-demo/0 [default0]:[NeMo I 2025-03-07 22:47:29 megatron_strategy:327] Copying Trainer's 'max_steps' (20) to LR scheduler's 'max_steps'.\n", "ining-demo/0 [default0]:[NeMo I 2025-03-07 22:47:29 num_microbatches_calculator:228] setting number of microbatches to constant 128\n", "ining-demo/0 [default0]:[NeMo I 2025-03-07 22:47:29 megatron_parallel:549] > number of parameters on (tensor, pipeline) model parallel rank (0, 0): 54663936\n", "ining-demo/0 [default0]:[NeMo I 2025-03-07 22:47:29 utils:302] Setting up DistributedDataParallel with config DistributedDataParallelConfig(grad_reduce_in_fp32=True, overlap_grad_reduce=True, overlap_param_gather=True, align_param_gather=False, use_distributed_optimizer=True, num_distributed_optimizer_instances=1, check_for_nan_in_grad=True, bucket_size=40000000, average_in_collective=True, fp8_param_gather=False)\n", "ining-demo/0 [default0]:[NeMo I 2025-03-07 22:47:29 utils:323] Number of buckets for gradient all-reduce / reduce-scatter: 1\n", "ining-demo/0 [default0]: Params for bucket 1 (54663936 elements):\n", "ining-demo/0 [default0]: \tmodule.decoder.layers.3.mlp.linear_fc1.weight\n", "ining-demo/0 [default0]: \tmodule.decoder.layers.2.self_attention.linear_qkv.layer_norm_weight\n", "ining-demo/0 [default0]: \tmodule.decoder.layers.1.mlp.linear_fc1.layer_norm_weight\n", "ining-demo/0 [default0]: \tmodule.decoder.layers.0.mlp.linear_fc2.weight\n", "ining-demo/0 [default0]: \tmodule.decoder.layers.0.self_attention.linear_qkv.weight\n", "ining-demo/0 [default0]: \tmodule.decoder.layers.3.self_attention.linear_qkv.weight\n", "ining-demo/0 [default0]: \tmodule.decoder.layers.2.self_attention.linear_proj.weight\n", "ining-demo/0 [default0]: \tmodule.decoder.layers.1.self_attention.linear_proj.weight\n", "ining-demo/0 [default0]: \tmodule.decoder.layers.3.self_attention.linear_qkv.layer_norm_weight\n", "ining-demo/0 [default0]: \tmodule.decoder.layers.2.mlp.linear_fc1.layer_norm_weight\n", "ining-demo/0 [default0]: \tmodule.decoder.layers.1.mlp.linear_fc2.weight\n", "ining-demo/0 [default0]: \tmodule.decoder.final_layernorm.weight\n", "ining-demo/0 [default0]: \tmodule.decoder.layers.3.self_attention.linear_proj.weight\n", "ining-demo/0 [default0]: \tmodule.decoder.layers.0.mlp.linear_fc1.layer_norm_weight\n", "ining-demo/0 [default0]: \tmodule.decoder.layers.3.mlp.linear_fc1.layer_norm_weight\n", "ining-demo/0 [default0]: \tmodule.decoder.layers.2.mlp.linear_fc2.weight\n", "ining-demo/0 [default0]: \tmodule.decoder.layers.1.mlp.linear_fc1.weight\n", "ining-demo/0 [default0]: \tmodule.decoder.layers.0.self_attention.linear_proj.weight\n", "ining-demo/0 
[default0]: \tmodule.decoder.layers.1.self_attention.linear_qkv.weight\n", "ining-demo/0 [default0]: \tmodule.decoder.layers.3.mlp.linear_fc2.weight\n", "ining-demo/0 [default0]: \tmodule.decoder.layers.2.mlp.linear_fc1.weight\n", "ining-demo/0 [default0]: \tmodule.decoder.layers.1.self_attention.linear_qkv.layer_norm_weight\n", "ining-demo/0 [default0]: \tmodule.embedding.word_embeddings.weight\n", "ining-demo/0 [default0]: \tmodule.output_layer.weight\n", "ining-demo/0 [default0]: \tmodule.decoder.layers.2.self_attention.linear_qkv.weight\n", "ining-demo/0 [default0]: \tmodule.decoder.layers.0.mlp.linear_fc1.weight\n", "ining-demo/0 [default0]: \tmodule.decoder.layers.0.self_attention.linear_qkv.layer_norm_weight\n", "ining-demo/0 [default0]:[NeMo I 2025-03-07 22:47:29 utils:302] Setting up optimizer with config OptimizerConfig(optimizer='adam', lr=0.0003, min_lr=None, decoupled_lr=None, decoupled_min_lr=None, weight_decay=0.1, fp16=False, bf16=True, params_dtype=torch.bfloat16, loss_scale=None, initial_loss_scale=4294967296, min_loss_scale=1.0, loss_scale_window=1000, hysteresis=2, adam_beta1=0.9, adam_beta2=0.95, adam_eps=1e-05, sgd_momentum=0.9, use_distributed_optimizer=True, overlap_param_gather_with_optimizer_step=False, clip_grad=1.0, log_num_zeros_in_grad=False, barrier_with_L1_time=False, timers=None, config_logger_dir='')\n", "ining-demo/0 [default0]:LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]\n", "ining-demo/0 [default0]:\n", "ining-demo/0 [default0]: | Name | Type | Params | Mode \n", "ining-demo/0 [default0]:----------------------------------------\n", "ining-demo/0 [default0]:0 | module | DDP | 54.7 M | train\n", "ining-demo/0 [default0]:----------------------------------------\n", "ining-demo/0 [default0]:54.7 M Trainable params\n", "ining-demo/0 [default0]:0 Non-trainable params\n", "ining-demo/0 [default0]:54.7 M Total params\n", "ining-demo/0 [default0]:218.656 Total estimated model params size (MB)\n", "ining-demo/0 [default0]:91 Modules in train mode\n", "ining-demo/0 [default0]:0 Modules in eval mode\n", "ining-demo/0 [default6]:LOCAL_RANK: 6 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]\n", "ining-demo/0 [default2]:LOCAL_RANK: 2 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]\n", "ining-demo/0 [default7]:LOCAL_RANK: 7 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]\n", "ining-demo/0 [default4]:LOCAL_RANK: 4 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]\n", "ining-demo/0 [default5]:LOCAL_RANK: 5 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]\n", "ining-demo/0 [default3]:LOCAL_RANK: 3 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]\n", "ining-demo/0 [default1]:LOCAL_RANK: 1 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]\n", "ining-demo/0 [default0]:[NeMo W 2025-03-07 22:47:39 rerun_state_machine:1088] Implicit initialization of Rerun State Machine!\n", "ining-demo/0 [default0]:[NeMo W 2025-03-07 22:47:39 rerun_state_machine:211] RerunStateMachine initialized in mode disabled\n", "ining-demo/0 [default0]:Training epoch 0, iteration 0/19 | lr: 1.499e-07 | global_batch_size: 512 | global_step: 0 | reduced_train_loss: 11.03 | train_step_timing in s: 10.43\n", "ining-demo/0 [default0]:Training epoch 0, iteration 1/19 | lr: 2.999e-07 | global_batch_size: 512 | global_step: 1 | reduced_train_loss: 11.03 | train_step_timing in s: 7.6 | consumed_samples: 1024\n", "ining-demo/0 [default0]:Training epoch 0, iteration 2/19 | lr: 4.498e-07 | global_batch_size: 512 | global_step: 2 | reduced_train_loss: 11.03 | train_step_timing in s: 7.599 | consumed_samples: 1536\n", "ining-demo/0 [default0]:Simulating 
preemption by raising a SIGTERM at step 4!\n", "ining-demo/0 [default0]:[NeMo I 2025-03-07 22:48:03 preemption:87] Received signal 15, initiating graceful stop\n", "ining-demo/0 [default0]:[NeMo I 2025-03-07 22:48:03 preemption:67] Preemption detected, saving checkpoint and exiting\n", "ining-demo/0 [default4]:Simulating preemption by raising a SIGTERM at step 4!\n", "ining-demo/0 [default1]:Simulating preemption by raising a SIGTERM at step 4!\n", "ining-demo/0 [default3]:Simulating preemption by raising a SIGTERM at step 4!\n", "ining-demo/0 [default2]:Simulating preemption by raising a SIGTERM at step 4!\n", "ining-demo/0 [default6]:Simulating preemption by raising a SIGTERM at step 4!\n", "ining-demo/0 [default5]:Simulating preemption by raising a SIGTERM at step 4!\n", "ining-demo/0 [default7]:Simulating preemption by raising a SIGTERM at step 4!\n", "ining-demo/0 [default0]:[NeMo W 2025-03-07 22:48:03 validation:389] There is difference in the common state dict in different ranks. The differences are {2: ([], [('optimizer', 0, 'optimizer', 'param_groups', 1, 'step')], []), 3: ([], [('optimizer', 0, 'optimizer', 'param_groups', 1, 'step')], []), 4: ([], [('optimizer', 0, 'optimizer', 'param_groups', 1, 'step')], []), 5: ([], [('optimizer', 0, 'optimizer', 'param_groups', 1, 'step')], [])}\n", "ining-demo/0 [default0]:[NeMo I 2025-03-07 22:48:06 model_checkpoint:497] Scheduled async checkpoint save for /tmp/nemo_run/checkpoints/resiliency-in-pretraining-demo/checkpoints/model_name=0--val_loss=0.00-step=3-consumed_samples=2048.0-last.ckpt\n", "ining-demo/0 [default0]:[NeMo I 2025-03-07 22:48:06 preemption:73] Async checkpointing detected, waiting for it to complete\n", "ining-demo/0 [default0]:[NeMo I 2025-03-07 22:48:07 model_checkpoint:522] Async checkpoint save for step 4 (/tmp/nemo_run/checkpoints/resiliency-in-pretraining-demo/checkpoints/model_name=0--val_loss=0.00-step=3-consumed_samples=2048.0-last.ckpt) finalized successfully.\n", "ining-demo/0 I0307 22:48:12.869000 207080 torch/distributed/elastic/agent/server/api.py:864] [default] worker group successfully finished. Waiting 300 seconds for other agents to finish.\n", "ining-demo/0 I0307 22:48:12.870000 207080 torch/distributed/elastic/agent/server/api.py:917] Local worker group finished (WorkerState.SUCCEEDED). Waiting 300 seconds for other agents to finish\n", "ining-demo/0 I0307 22:48:12.870000 207080 torch/distributed/elastic/agent/server/api.py:931] Done waiting for other agents. Elapsed: 0.00019741058349609375 seconds\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "Job resiliency-in-pretraining-demo-bngwzzcstc0p3 finished: SUCCEEDED\n" ] }, { "data": { "text/html": [ "
                                                                                                                   \n",
       "# The experiment was run with the following tasks: ['resiliency-in-pretraining-demo']                              \n",
       "# You can inspect and reconstruct this experiment at a later point in time using:                                  \n",
       "experiment = run.Experiment.from_id(\"resiliency-in-pretraining-demo_1741387633\")                                   \n",
       "experiment.status() # Gets the overall status                                                                      \n",
       "experiment.logs(\"resiliency-in-pretraining-demo\") # Gets the log for the provided task                             \n",
       "experiment.cancel(\"resiliency-in-pretraining-demo\") # Cancels the provided task if still running                   \n",
       "                                                                                                                   \n",
       "
\n" ], "text/plain": [ "\u001b[48;2;39;40;34m \u001b[0m\n", "\u001b[38;2;149;144;119;48;2;39;40;34m# The experiment was run with the following tasks: ['resiliency-in-pretraining-demo']\u001b[0m\u001b[48;2;39;40;34m \u001b[0m\n", "\u001b[38;2;149;144;119;48;2;39;40;34m# You can inspect and reconstruct this experiment at a later point in time using:\u001b[0m\u001b[48;2;39;40;34m \u001b[0m\n", "\u001b[38;2;248;248;242;48;2;39;40;34mexperiment\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;255;70;137;48;2;39;40;34m=\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mrun\u001b[0m\u001b[38;2;255;70;137;48;2;39;40;34m.\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mExperiment\u001b[0m\u001b[38;2;255;70;137;48;2;39;40;34m.\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mfrom_id\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m(\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m\"\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34mresiliency-in-pretraining-demo_1741387633\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m\"\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m)\u001b[0m\u001b[48;2;39;40;34m \u001b[0m\n", "\u001b[38;2;248;248;242;48;2;39;40;34mexperiment\u001b[0m\u001b[38;2;255;70;137;48;2;39;40;34m.\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mstatus\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m(\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m)\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;149;144;119;48;2;39;40;34m# Gets the overall status\u001b[0m\u001b[48;2;39;40;34m \u001b[0m\n", "\u001b[38;2;248;248;242;48;2;39;40;34mexperiment\u001b[0m\u001b[38;2;255;70;137;48;2;39;40;34m.\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mlogs\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m(\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m\"\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34mresiliency-in-pretraining-demo\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m\"\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m)\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;149;144;119;48;2;39;40;34m# Gets the log for the provided task\u001b[0m\u001b[48;2;39;40;34m \u001b[0m\n", "\u001b[38;2;248;248;242;48;2;39;40;34mexperiment\u001b[0m\u001b[38;2;255;70;137;48;2;39;40;34m.\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mcancel\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m(\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m\"\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34mresiliency-in-pretraining-demo\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m\"\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m)\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;149;144;119;48;2;39;40;34m# Cancels the provided task if still running\u001b[0m\u001b[48;2;39;40;34m \u001b[0m\n", "\u001b[48;2;39;40;34m \u001b[0m\n" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
                                                                                                                   \n",
       "# You can inspect this experiment at a later point in time using the CLI as well:                                  \n",
       "nemo experiment status resiliency-in-pretraining-demo_1741387633                                                   \n",
       "nemo experiment logs resiliency-in-pretraining-demo_1741387633 0                                                   \n",
       "nemo experiment cancel resiliency-in-pretraining-demo_1741387633 0                                                 \n",
       "                                                                                                                   \n",
       "
\n" ], "text/plain": [ "\u001b[48;2;39;40;34m \u001b[0m\n", "\u001b[38;2;149;144;119;48;2;39;40;34m# You can inspect this experiment at a later point in time using the CLI as well:\u001b[0m\u001b[48;2;39;40;34m \u001b[0m\n", "\u001b[38;2;248;248;242;48;2;39;40;34mnemo\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mexperiment\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mstatus\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mresiliency-in-pretraining-demo_1741387633\u001b[0m\u001b[48;2;39;40;34m \u001b[0m\n", "\u001b[38;2;248;248;242;48;2;39;40;34mnemo\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mexperiment\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mlogs\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mresiliency-in-pretraining-demo_1741387633\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;174;129;255;48;2;39;40;34m0\u001b[0m\u001b[48;2;39;40;34m \u001b[0m\n", "\u001b[38;2;248;248;242;48;2;39;40;34mnemo\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mexperiment\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mcancel\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mresiliency-in-pretraining-demo_1741387633\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;174;129;255;48;2;39;40;34m0\u001b[0m\u001b[48;2;39;40;34m \u001b[0m\n", "\u001b[48;2;39;40;34m \u001b[0m\n" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# run the experiment\n", "run_experiment(exp_name, pretrain, executor, run_plugins, dryrun=False)" ] }, { "cell_type": "markdown", "id": "1bf9c6f8-ef79-4499-845a-72d1a28e43b1", "metadata": {}, "source": [ "## 4.2 Cleanup" ] }, { "cell_type": "code", "execution_count": 26, "id": "e5dbf5ea-9ef3-494e-b2d7-faafb1a1be58", "metadata": {}, "outputs": [], "source": [ "%%bash\n", "# delete old checkpoints\n", "rm -rf /tmp/nemo_run/checkpoints/" ] }, { "cell_type": "code", "execution_count": 27, "id": "80849855-972d-4121-ac77-65a10fe610a8", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[]" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# restore pretrain.trainer.callbacks\n", "pretrain.trainer.callbacks = copy.deepcopy(pretrain_trainer_callbacks)\n", "run_plugins = []\n", "pretrain.trainer.callbacks" ] }, { "cell_type": "markdown", "id": "bf3acd19-c456-4a81-a933-12f11032bfb5", "metadata": {}, "source": [ "# 5. Discuss asynchronous distributed checkpointing\n", "Checkpointing is important for recovering from failures, but traditional checkpointing has drawbacks:\n", "\n", "1. Training pauses while saving checkpoints\n", "2. To minimize these pauses, checkpoints are usually only saved once per epoch\n", "3. 
If training fails between checkpoints, work must be redone from the last checkpoint\n", "\n", "For example, with:\n", "- 500 steps per epoch\n", "- 10 seconds per step\n", "- 3 epochs total\n", "\n", "Best case (no failures):\n", "- Training time = 15,000 seconds (500 steps × 10 seconds × 3 epochs)\n", "\n", "Worst case (failure at step 799):\n", "- Must redo nearly 2 full epochs\n", "- Training time = 20,000 seconds (nearly 5,000 seconds wasted)\n", "\n", "Asynchronous checkpointing solves these problems by:\n", "- Saving checkpoints without pausing training\n", "- Using fast distributed checkpointing via Megatron-Core\n", "- Allowing frequent checkpoints with minimal overhead\n", "\n", "This means you can checkpoint often to minimize lost work, without slowing down training.\n", "\n", "For more details, see:\n", "- [Megatron-Core distributed checkpointing](https://docs.nvidia.com/megatron-core/developer-guide/latest/api-guide/dist_checkpointing.html)\n", "- [NeMo documentation](https://github.com/NVIDIA/NeMo/blob/main/docs/source/checkpoints/dist_ckpt.rst)\n", "\n", "Note: NeMo enables asynchronous and parallel checkpointing by default through MegatronStrategy's \n", "ckpt_async_save and ckpt_parallel_save options, so users automatically get these benefits\n", "without any additional configuration needed.\n" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.12" } }, "nbformat": 4, "nbformat_minor": 5 }