{ "cells": [ { "cell_type": "markdown", "id": "d63d934c-f709-4e6d-aa44-03f6b1926180", "metadata": {}, "source": [ "# Resilient LLM Training with NeMo Framework\n", "\n", "This notebook demonstrates how to use NeMo's resiliency features for robust LLM training. It covers:\n", "\n", "1. **Crash Recovery**: Using in-job restart capabilities to automatically recover from failures during training\n", "2. **Straggler Detection**: Identifying and handling slow/stuck processes in distributed training\n", "3. **Checkpointing**: Implementing asynchronous checkpointing for efficient model saving\n", "\n", "The demo uses a small LLaMA model and simulated crashes to showcase these features in action. We'll walk through:\n", "- Setting up a local executor with fault tolerance enabled\n", "- Configuring the straggler detection callbacks\n", "- Launching distributed training with resiliency features\n", "- Monitoring training progress and recovery from failures\n", "- Analyzing logs and checkpoints\n", "\n", "This demonstrates how NeMo makes LLM training more robust and production-ready by handling common failure modes automatically.\n", "\n", "NeMo Framework integrates resiliency features from the [NVIDIA Resiliency Extension](https://github.com/NVIDIA/nvidia-resiliency-ext) to minimize training disruptions and handle failures gracefully.\n", "\n", "The key features include\n", "- Fault Tolerance: Automatically resumes training from the last checkpoint in case of interruptions.\n", "- Straggler Detection: Identifies and mitigates slow-performing nodes to ensure efficient training.\n", "\n", "For detailed documentation on these resiliency features, see the [NeMo Framework Resiliency Guide](https://docs.nvidia.com/nemo-framework/user-guide/latest/resiliency.html)" ] }, { "cell_type": "code", "execution_count": 1, "id": "f3f9cea8-a917-4c81-b80e-4fc52ce3359c", "metadata": {}, "outputs": [], "source": [ "%%bash\n", "# delete old checkpoints and prepare for a fresh run\n", "rm -rf /tmp/nemo_run/checkpoints/" ] }, { "cell_type": "markdown", "id": "e03a6466-45a4-4987-a5ac-10cdd4dbaf86", "metadata": {}, "source": [ "# 1. Setup a simple training job and demostrate successful training" ] }, { "cell_type": "code", "execution_count": 2, "id": "2dfb9b8b-3359-4d00-88fa-abf0e24f7850", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/usr/local/lib/python3.10/dist-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html\n", " from .autonotebook import tqdm as notebook_tqdm\n", "[NeMo W 2025-03-07 22:33:41 nemo_logging:361] /usr/local/lib/python3.10/dist-packages/pyannote/core/notebook.py:134: MatplotlibDeprecationWarning: The get_cmap function was deprecated in Matplotlib 3.7 and will be removed in 3.11. Use ``matplotlib.colormaps[name]`` or ``matplotlib.colormaps.get_cmap()`` or ``pyplot.get_cmap()`` instead.\n", " cm = get_cmap(\"Set1\")\n", " \n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Required libraries loaded.\n" ] } ], "source": [ "# Copyright (c) 2025, NVIDIA CORPORATION. 
All rights reserved.\n", "\n", "# Required Libraries\n", "import argparse\n", "import copy\n", "import math\n", "import os\n", "from functools import partial\n", "from typing import Any\n", "import torch\n", "\n", "import nemo_run as run\n", "from lightning.pytorch.callbacks import Callback\n", "\n", "from nemo.collections import llm\n", "from nemo.collections.llm.recipes.callbacks.common import straggler_det_callback\n", "from nemo.lightning.run import plugins\n", "\n", "from crash_simulator import CrashSimulationCallback\n", "from preemption_simulator import PreemptionSimulationCallback\n", "\n", "print(\"Required libraries loaded.\")" ] }, { "cell_type": "markdown", "id": "e2a19a6d-8df8-4930-bb50-d622b6b72af7", "metadata": {}, "source": [ "## 1.1 Define the executor\n", "\n", "Define and initialize a local executor, which is used to manage distributed computing tasks. The executor encapsulates configurations for launching jobs (e.g., number of devices, environment variables, task distribution)." ] }, { "cell_type": "code", "execution_count": 3, "id": "8740b1b8-0f89-40a9-a361-a88a6371d073", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Executor setup complete.\n" ] } ], "source": [ "def local_executor(devices: int = 8) -> run.LocalExecutor:\n", " \"\"\"\n", " Factory method for creating a LocalExecutor instance.\n", " This sets up environment variables and configures the number of devices.\n", "\n", " Args:\n", " devices (int): Number of devices to be used per node.\n", "\n", " Returns:\n", " run.LocalExecutor: Configured local executor object.\n", " \"\"\"\n", " env_vars = {\n", " \"TRANSFORMERS_OFFLINE\": \"1\", # Run Transformer models offline\n", " \"TORCH_NCCL_AVOID_RECORD_STREAMS\": \"1\", # Optimize PyTorch NCCL\n", " \"NCCL_NVLS_ENABLE\": \"0\", # Disable NVLink SHARP (NVLS) collectives\n", " \"NVTE_DP_AMAX_REDUCE_INTERVAL\": \"0\",\n", " \"NVTE_ASYNC_AMAX_REDUCTION\": \"1\",\n", " }\n", " # Create a LocalExecutor that launches tasks with the `torchrun` launcher\n", " executor = run.LocalExecutor(ntasks_per_node=devices, launcher=\"torchrun\", env_vars=env_vars)\n", " return executor\n", "\n", "# Initialize the executor based on the arguments\n", "executor = local_executor(devices=8)\n", "\n", "print(\"Executor setup complete.\")" ] }, { "cell_type": "markdown", "id": "994ffd38-5f97-4001-ad86-46b686edb0e8", "metadata": {}, "source": [ "## 1.2 Model setup\n", "Load and configure a Llama pretrain recipe. We choose a small 54M-parameter Llama3-based model for faster execution. This model is obtained by reducing the sequence length, number of layers, hidden size, and number of attention heads from the original Llama3 8B model configuration as defined in the [Llama3Config8B class](https://github.com/NVIDIA/NeMo/blob/main/nemo/collections/llm/gpt/model/llama.py)."
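, "\n", "\n", "As a quick orientation, the snippet below is a small sketch (it assumes only that the `Llama3Config8B` dataclass from the linked file can be instantiated standalone) that prints the defaults our reduced configuration overrides:\n", "\n", "```python\n", "from nemo.collections import llm\n", "\n", "# Defaults of the full Llama3 8B config; these are the fields the demo shrinks\n", "full_cfg = llm.Llama3Config8B()\n", "print(full_cfg.seq_length, full_cfg.num_layers, full_cfg.hidden_size, full_cfg.num_attention_heads)\n", "```"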
] }, { "cell_type": "code", "execution_count": 4, "id": "52245a88", "metadata": {}, "outputs": [], "source": [ "# Create a small Llama3 model configuration\n", "def small_llama_cfg() -> llm.GPTConfig:\n", " \"\"\"Small 54M parameter model\"\"\"\n", " return run.Config(\n", " llm.Llama3Config8B,\n", " rotary_base=500_000,\n", " seq_length=128,\n", " num_layers=4,\n", " hidden_size=768,\n", " ffn_hidden_size=2688,\n", " num_attention_heads=16,\n", " init_method_std=0.023,\n", " )\n" ] }, { "cell_type": "markdown", "id": "b6988ce3", "metadata": {}, "source": [ "## 1.3 Modify the training recipe\n", "`pretrain` is a partial function that takes in the experiment name and checkpoint directory, and returns a pretrain recipe. It is set up to use `num_nodes=1` and `num_gpus_per_node=8` by default, but this can be changed by modifying the `num_nodes` and `num_gpus_per_node` arguments. This demo uses the Llama 3.1 8B pretrain recipe as defined in the `llama31_8b.pretrain_recipe` [module](https://github.com/NVIDIA/NeMo/blob/main/nemo/collections/llm/recipes/llama3_8b.py). The recipe defaults to using a mock dataset ([MockDataModule](https://github.com/NVIDIA/NeMo/blob/main/nemo/collections/llm/gpt/data/mock.py)); please refer to the [Llama3_8b recipe](https://github.com/NVIDIA/NeMo/blob/main/nemo/collections/llm/recipes/llama3_8b.py) for instructions on how to use a custom dataset. Since we are using a mock dataset, we set `max_steps` to 20 so we can run the experiment in a reasonable time.\n", "\n", "We also disable validation sanity checks to reduce startup time, and set the tensor model parallel size to 2 and the context parallel size to 1." ] }, { "cell_type": "code", "execution_count": 5, "id": "f5c99ffc-3718-4383-b77a-161f387ce302", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Model recipe setup complete.\n" ] } ], "source": [ "# Experiment name\n", "exp_name = \"resiliency-in-pretraining-demo\"\n", "\n", "# Preliminary setup for the Llama pretrain recipe\n", "pretrain = partial(llm.llama31_8b.pretrain_recipe, num_nodes=1, num_gpus_per_node=8)(\n", " name=exp_name, dir=\"/tmp/nemo_run/checkpoints\"\n", ")\n", "pretrain.model = run.Config(llm.LlamaModel, small_llama_cfg())\n", "pretrain.trainer.strategy.tensor_model_parallel_size = 2\n", "pretrain.trainer.strategy.context_parallel_size = 1\n", "pretrain.trainer.num_sanity_val_steps = 0\n", "pretrain.broadcast(max_steps=20)\n", "pretrain.trainer.limit_val_batches = 2\n", "pretrain.trainer.log_every_n_steps = 1\n", "pretrain.trainer.val_check_interval = 10\n", "print(\"Model recipe setup complete.\")" ] }, { "cell_type": "markdown", "id": "46ae75f7-1a91-4429-bfbf-3ebe62bce123", "metadata": {}, "source": [ "## 1.4 Running the Experiment\n", "Run the entire pretraining experiment. Depending on the arguments passed:\n", "- If `dryrun` is True, it performs a dry run (to validate configurations).\n", "- Otherwise, it launches the actual training run locally."
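, "\n", "\n", "For example, once `run_experiment` is defined in the next cell, a configuration-only validation pass could look like this (a sketch reusing the objects set up above):\n", "\n", "```python\n", "# Validate the recipe/executor wiring without launching any training processes\n", "run_experiment(exp_name, pretrain, executor, run_plugins=[], dryrun=True)\n", "```"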
] }, { "cell_type": "code", "execution_count": 6, "id": "03887dd7-a23b-44c9-825a-311849729531", "metadata": {}, "outputs": [], "source": [ "def run_experiment(exp_name, pretrain, executor, run_plugins, dryrun=False):\n", " \"\"\"\n", " Run the pretraining experiment either as a dry run or actual training.\n", " \n", " Args:\n", " exp_name: Name of the experiment\n", " pretrain: Pretrain configuration object\n", " executor: Executor to run the experiment\n", " run_plugins: List of runtime plugins\n", " dryrun: Boolean flag to perform a dry run\n", " \"\"\"\n", " with run.Experiment(f\"{exp_name}\") as exp:\n", " # Add the pretrain job to the experiment\n", " exp.add(\n", " pretrain,\n", " executor=executor,\n", " name=exp_name,\n", " plugins=run_plugins,\n", " tail_logs=True,\n", " )\n", "\n", " # Execute the experiment based on the dryrun flag\n", " if dryrun:\n", " print(\"Performing dry run ...\")\n", " exp.dryrun()\n", " else:\n", " print(\"Launching training run ...\")\n", " exp.run(sequential=True, detach=True)\n", " print(\"Experiment executed successfully.\")" ] }, { "cell_type": "markdown", "id": "6998265d-628a-4a68-bfd2-c40c107c2a43", "metadata": {}, "source": [ "Note: This run generally fails the first time around since we are using a mock dataset and the tokenizer files cannot be found. The error is usually `FileNotFoundError: [Errno 2] No such file or directory: 'gpt2-merges.txt'`.\n", "\n", "To avoid this, you can manually download the following files before launching a run:" ] }, { "cell_type": "code", "execution_count": 7, "id": "27a817a6-61dc-4122-8a03-dade66d0cb03", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "--2025-03-07 22:33:43-- https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-vocab.json\n", "Resolving s3.amazonaws.com (s3.amazonaws.com)... 52.216.209.88, 54.231.195.168, 52.217.224.64, ...\n", "Connecting to s3.amazonaws.com (s3.amazonaws.com)|52.216.209.88|:443... connected.\n", "HTTP request sent, awaiting response... 200 OK\n", "Length: 1042301 (1018K) [application/json]\n", "Saving to: ‘gpt2-vocab.json’\n", "\n", " 0K .......... .......... .......... .......... .......... 4% 824K 1s\n", " 50K .......... .......... .......... .......... .......... 9% 814K 1s\n", " 100K .......... .......... .......... .......... .......... 14% 810K 1s\n", " 150K .......... .......... .......... .......... .......... 19% 311M 1s\n", " 200K .......... .......... .......... .......... .......... 24% 110M 1s\n", " 250K .......... .......... .......... .......... .......... 29% 268M 0s\n", " 300K .......... .......... .......... .......... .......... 34% 821K 0s\n", " 350K .......... .......... .......... .......... .......... 39% 104M 0s\n", " 400K .......... .......... .......... .......... .......... 44% 331M 0s\n", " 450K .......... .......... .......... .......... .......... 49% 602M 0s\n", " 500K .......... .......... .......... .......... .......... 54% 518M 0s\n", " 550K .......... .......... .......... .......... .......... 58% 545M 0s\n", " 600K .......... .......... .......... .......... .......... 63% 829K 0s\n", " 650K .......... .......... .......... .......... .......... 68% 111M 0s\n", " 700K .......... .......... .......... .......... .......... 73% 328M 0s\n", " 750K .......... .......... .......... .......... .......... 78% 325M 0s\n", " 800K .......... .......... .......... .......... .......... 83% 277M 0s\n", " 850K .......... .......... .......... .......... .......... 88% 333M 0s\n", " 900K .......... .......... 
.......... .......... .......... 93% 320M 0s\n", " 950K .......... .......... .......... .......... .......... 98% 487M 0s\n", " 1000K .......... ....... 100% 616M=0.3s\n", "\n", "2025-03-07 22:33:43 (3.23 MB/s) - ‘gpt2-vocab.json’ saved [1042301/1042301]\n", "\n", "--2025-03-07 22:33:43-- https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-merges.txt\n", "Resolving s3.amazonaws.com (s3.amazonaws.com)... 52.217.224.64, 52.217.163.184, 52.216.207.197, ...\n", "Connecting to s3.amazonaws.com (s3.amazonaws.com)|52.217.224.64|:443... connected.\n", "HTTP request sent, awaiting response... 200 OK\n", "Length: 456318 (446K) [text/plain]\n", "Saving to: ‘gpt2-merges.txt’\n", "\n", " 0K .......... .......... .......... .......... .......... 11% 801K 0s\n", " 50K .......... .......... .......... .......... .......... 22% 786K 0s\n", " 100K .......... .......... .......... .......... .......... 33% 786K 0s\n", " 150K .......... .......... .......... .......... .......... 44% 272M 0s\n", " 200K .......... .......... .......... .......... .......... 56% 454M 0s\n", " 250K .......... .......... .......... .......... .......... 67% 797K 0s\n", " 300K .......... .......... .......... .......... .......... 78% 106M 0s\n", " 350K .......... .......... .......... .......... .......... 89% 296M 0s\n", " 400K .......... .......... .......... .......... ..... 100% 334M=0.3s\n", "\n", "2025-03-07 22:33:44 (1.72 MB/s) - ‘gpt2-merges.txt’ saved [456318/456318]\n", "\n" ] } ], "source": [ "%%bash\n", "mkdir -p /root/.cache/torch/megatron\n", "wget https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-vocab.json && mv gpt2-vocab.json /root/.cache/torch/megatron/megatron-gpt-345m_vocab\n", "wget https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-merges.txt && mv gpt2-merges.txt /root/.cache/torch/megatron/megatron-gpt-345m_merges" ] }, { "cell_type": "code", "execution_count": 8, "id": "2836df0e-3a43-4dfd-9534-4633f5aa2441", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
────── Entering Experiment resiliency-in-pretraining-demo with id: resiliency-in-pretraining-demo_1741386824 ──────\n", "\n" ], "text/plain": [ "\u001b[92m────── \u001b[0m\u001b[1;35mEntering Experiment resiliency-in-pretraining-demo with id: resiliency-in-pretraining-demo_1741386824\u001b[0m\u001b[92m ──────\u001b[0m\n" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "Launching training run ...\n" ] }, { "data": { "text/html": [ "
[22:33:44] Cannot detach from this experiment. Please keep it running until completion. experiment.py:651\n", "\n" ], "text/plain": [ "\u001b[2;36m[22:33:44]\u001b[0m\u001b[2;36m \u001b[0m\u001b[1;31m Cannot detach from this experiment. Please keep it running until completion.\u001b[0m \u001b]8;id=506192;file:///opt/NeMo-Run/src/nemo_run/run/experiment.py\u001b\\\u001b[2mexperiment.py\u001b[0m\u001b]8;;\u001b\\\u001b[2m:\u001b[0m\u001b]8;id=636450;file:///opt/NeMo-Run/src/nemo_run/run/experiment.py#651\u001b\\\u001b[2m651\u001b[0m\u001b]8;;\u001b\\\n" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stderr", "output_type": "stream", "text": [ "Log directory is: /root/.nemo_run/experiments/resiliency-in-pretraining-demo/resiliency-in-pretraining-demo_1741386824/resiliency-in-pretraining-demo\n" ] }, { "data": { "text/html": [ "
Launching job resiliency-in-pretraining-demo for experiment experiment.py:724\n", " resiliency-in-pretraining-demo \n", "\n" ], "text/plain": [ "\u001b[2;36m \u001b[0m\u001b[2;36m \u001b[0m\u001b[1;36mLaunching job resiliency-in-pretraining-demo for experiment \u001b[0m \u001b]8;id=291970;file:///opt/NeMo-Run/src/nemo_run/run/experiment.py\u001b\\\u001b[2mexperiment.py\u001b[0m\u001b]8;;\u001b\\\u001b[2m:\u001b[0m\u001b]8;id=992860;file:///opt/NeMo-Run/src/nemo_run/run/experiment.py#724\u001b\\\u001b[2m724\u001b[0m\u001b]8;;\u001b\\\n", "\u001b[2;36m \u001b[0m\u001b[1;36mresiliency-in-pretraining-demo\u001b[0m \u001b[2m \u001b[0m\n" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stderr", "output_type": "stream", "text": [ "Log directory is: /root/.nemo_run/experiments/resiliency-in-pretraining-demo/resiliency-in-pretraining-demo_1741386824/resiliency-in-pretraining-demo\n", "Launched app: local_persistent://nemo_run/resiliency-in-pretraining-demo-xb3fnk3npq9wn\n", "AppStatus:\n", " State: RUNNING\n", " Num Restarts: 0\n", " Roles: \n", " Msg:
─────────────────── Waiting for Experiment resiliency-in-pretraining-demo_1741386824 to finish ────────────────────\n", "\n" ], "text/plain": [ "\u001b[92m─────────────────── \u001b[0m\u001b[1;35mWaiting for Experiment resiliency-in-pretraining-demo_1741386824 to finish\u001b[0m\u001b[92m ────────────────────\u001b[0m\n" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
\n",
"\n"
],
"text/plain": [
"\n"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"Experiment Status for resiliency-in-pretraining-demo_1741386824\n", "\n" ], "text/plain": [ "\u001b[1;32mExperiment Status for\u001b[0m \u001b[1;38;5;214mresiliency-in-pretraining-demo_1741386824\u001b[0m\n" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
\n",
"Task 0: resiliency-in-pretraining-demo\n",
"- Status: RUNNING\n",
"- Executor: LocalExecutor\n",
"- Job id: resiliency-in-pretraining-demo-xb3fnk3npq9wn\n",
"- Local Directory: /root/.nemo_run/experiments/resiliency-in-pretraining-demo/resiliency-in-pretraining-demo_1741386824/resiliency-in-pretraining-demo\n",
"\n"
],
"text/plain": [
"\n",
"\u001b[1;32mTask 0\u001b[0m: \u001b[1;38;5;214mresiliency-in-pretraining-demo\u001b[0m\n",
"- \u001b[1;32mStatus\u001b[0m: RUNNING\n",
"- \u001b[1;32mExecutor\u001b[0m: LocalExecutor\n",
"- \u001b[1;32mJob id\u001b[0m: resiliency-in-pretraining-demo-xb3fnk3npq9wn\n",
"- \u001b[1;32mLocal Directory\u001b[0m: /root/.nemo_run/experiments/resiliency-in-pretraining-demo/resiliency-in-pretraining-demo_1741386824/resiliency-in-pretraining-demo\n"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"\n",
"\n"
],
"text/plain": [
"\n"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"Waiting for job resiliency-in-pretraining-demo-xb3fnk3npq9wn to finish [log=True]...\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"ining-demo/0 W0307 22:33:45.902000 170829 torch/distributed/run.py:793] \n",
"ining-demo/0 W0307 22:33:45.902000 170829 torch/distributed/run.py:793] *****************************************\n",
"ining-demo/0 W0307 22:33:45.902000 170829 torch/distributed/run.py:793] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. \n",
"ining-demo/0 W0307 22:33:45.902000 170829 torch/distributed/run.py:793] *****************************************\n",
"ining-demo/0 I0307 22:33:45.902000 170829 torch/distributed/launcher/api.py:194] Starting elastic_operator with launch configs:\n",
"ining-demo/0 I0307 22:33:45.902000 170829 torch/distributed/launcher/api.py:194] entrypoint : nemo_run.core.runners.fdl_runner\n",
"ining-demo/0 I0307 22:33:45.902000 170829 torch/distributed/launcher/api.py:194] min_nodes : 1\n",
"ining-demo/0 I0307 22:33:45.902000 170829 torch/distributed/launcher/api.py:194] max_nodes : 1\n",
"ining-demo/0 I0307 22:33:45.902000 170829 torch/distributed/launcher/api.py:194] nproc_per_node : 8\n",
"ining-demo/0 I0307 22:33:45.902000 170829 torch/distributed/launcher/api.py:194] run_id : 8931\n",
"ining-demo/0 I0307 22:33:45.902000 170829 torch/distributed/launcher/api.py:194] rdzv_backend : c10d\n",
"ining-demo/0 I0307 22:33:45.902000 170829 torch/distributed/launcher/api.py:194] rdzv_endpoint : localhost:0\n",
"ining-demo/0 I0307 22:33:45.902000 170829 torch/distributed/launcher/api.py:194] rdzv_configs : {'timeout': 900}\n",
"ining-demo/0 I0307 22:33:45.902000 170829 torch/distributed/launcher/api.py:194] max_restarts : 0\n",
"ining-demo/0 I0307 22:33:45.902000 170829 torch/distributed/launcher/api.py:194] monitor_interval : 0.1\n",
"ining-demo/0 I0307 22:33:45.902000 170829 torch/distributed/launcher/api.py:194] log_dir : /root/.nemo_run/experiments/resiliency-in-pretraining-demo/resiliency-in-pretraining-demo_1741386824/resiliency-in-pretraining-demo/nemo_run/resiliency-in-pretraining-demo-xb3fnk3npq9wn/torchelastic/resiliency-in-pretraining-demo\n",
"ining-demo/0 I0307 22:33:45.902000 170829 torch/distributed/launcher/api.py:194] metrics_cfg : {}\n",
"ining-demo/0 I0307 22:33:45.902000 170829 torch/distributed/launcher/api.py:194] \n",
"ining-demo/0 I0307 22:33:45.906000 170829 torch/distributed/elastic/agent/server/api.py:845] [default] starting workers for entrypoint: python\n",
"ining-demo/0 I0307 22:33:45.906000 170829 torch/distributed/elastic/agent/server/api.py:662] [default] Rendezvous'ing worker group\n",
"ining-demo/0 I0307 22:33:45.964000 170829 torch/distributed/elastic/agent/server/api.py:525] [default] Rendezvous complete for workers. Result:\n",
"ining-demo/0 I0307 22:33:45.964000 170829 torch/distributed/elastic/agent/server/api.py:525] restart_count=0\n",
"ining-demo/0 I0307 22:33:45.964000 170829 torch/distributed/elastic/agent/server/api.py:525] master_addr=localhost\n",
"ining-demo/0 I0307 22:33:45.964000 170829 torch/distributed/elastic/agent/server/api.py:525] master_port=42531\n",
"ining-demo/0 I0307 22:33:45.964000 170829 torch/distributed/elastic/agent/server/api.py:525] group_rank=0\n",
"ining-demo/0 I0307 22:33:45.964000 170829 torch/distributed/elastic/agent/server/api.py:525] group_world_size=1\n",
"ining-demo/0 I0307 22:33:45.964000 170829 torch/distributed/elastic/agent/server/api.py:525] local_ranks=[0, 1, 2, 3, 4, 5, 6, 7]\n",
"ining-demo/0 I0307 22:33:45.964000 170829 torch/distributed/elastic/agent/server/api.py:525] role_ranks=[0, 1, 2, 3, 4, 5, 6, 7]\n",
"ining-demo/0 I0307 22:33:45.964000 170829 torch/distributed/elastic/agent/server/api.py:525] global_ranks=[0, 1, 2, 3, 4, 5, 6, 7]\n",
"ining-demo/0 I0307 22:33:45.964000 170829 torch/distributed/elastic/agent/server/api.py:525] role_world_sizes=[8, 8, 8, 8, 8, 8, 8, 8]\n",
"ining-demo/0 I0307 22:33:45.964000 170829 torch/distributed/elastic/agent/server/api.py:525] global_world_sizes=[8, 8, 8, 8, 8, 8, 8, 8]\n",
"ining-demo/0 I0307 22:33:45.964000 170829 torch/distributed/elastic/agent/server/api.py:525] \n",
"ining-demo/0 I0307 22:33:45.964000 170829 torch/distributed/elastic/agent/server/api.py:670] [default] Starting worker group\n",
"ining-demo/0 I0307 22:33:45.964000 170829 torch/distributed/elastic/agent/server/local_elastic_agent.py:291] use_agent_store: True\n",
"ining-demo/0 I0307 22:33:45.964000 170829 torch/distributed/elastic/agent/server/local_elastic_agent.py:192] Environment variable 'TORCHELASTIC_ENABLE_FILE_TIMER' not found. Do not start FileTimerServer.\n",
"ining-demo/0 I0307 22:33:45.964000 170829 torch/distributed/elastic/agent/server/local_elastic_agent.py:229] Environment variable 'TORCHELASTIC_HEALTH_CHECK_PORT' not found. Do not start health check.\n",
"ining-demo/0 [default0]:[NeMo W 2025-03-07 22:33:55 nemo_logging:361] /usr/local/lib/python3.10/dist-packages/pyannote/core/notebook.py:134: MatplotlibDeprecationWarning: The get_cmap function was deprecated in Matplotlib 3.7 and will be removed in 3.11. Use ``matplotlib.colormaps[name]`` or ``matplotlib.colormaps.get_cmap()`` or ``pyplot.get_cmap()`` instead.\n",
"ining-demo/0 [default0]: cm = get_cmap(\"Set1\")\n",
"ining-demo/0 [default0]: \n",
"ining-demo/0 [default0]:[NeMo I 2025-03-07 22:33:56 tokenizer_utils:224] Getting Megatron tokenizer for pretrained model name: megatron-gpt-345m, custom vocab file: None, and merges file: None\n",
"ining-demo/0 [default0]:[NeMo I 2025-03-07 22:33:56 tokenizer_utils:130] Getting HuggingFace AutoTokenizer with pretrained_model_name: gpt2, vocab_file: /root/.cache/torch/megatron/megatron-gpt-345m_vocab, merges_files: /root/.cache/torch/megatron/megatron-gpt-345m_merges, special_tokens_dict: {}, and use_fast: False\n",
"ining-demo/0 [default0]:[NeMo I 2025-03-07 22:33:56 nemo_logger:145] Experiments will be logged at /tmp/nemo_run/checkpoints/resiliency-in-pretraining-demo\n",
"ining-demo/0 [default0]:[NeMo I 2025-03-07 22:33:56 megatron_strategy:315] Fixing mis-match between ddp-config & mcore-optimizer config\n",
"ining-demo/0 [default0]:GPU available: True (cuda), used: True\n",
"ining-demo/0 [default0]:TPU available: False, using: 0 TPU cores\n",
"ining-demo/0 [default0]:HPU available: False, using: 0 HPUs\n",
"ining-demo/0 [default0]:[NeMo W 2025-03-07 22:33:56 nemo_logger:123] No version folders would be created under the log folder as 'resume_if_exists' is enabled.\n",
"ining-demo/0 [default0]:[NeMo W 2025-03-07 22:33:56 nemo_logger:173] \"update_logger_directory\" is True. Overwriting tensorboard logger \"save_dir\" to /tmp/nemo_run/checkpoints/tb_logs\n",
"ining-demo/0 [default0]:[NeMo W 2025-03-07 22:33:56 nemo_logger:189] The Trainer already contains a ModelCheckpoint callback. This will be overwritten.\n",
"ining-demo/0 [default0]:[NeMo W 2025-03-07 22:33:56 nemo_logger:212] The checkpoint callback was told to monitor a validation value and trainer's max_steps was set to 20. Please ensure that max_steps will run for at least 1 epochs to ensure that checkpointing will not error out.\n",
"ining-demo/0 [default0]:[NeMo W 2025-03-07 22:33:56 resume:228] There were no checkpoints found in checkpoint_dir or no checkpoint folder at checkpoint_dir :/tmp/nemo_run/checkpoints/resiliency-in-pretraining-demo/checkpoints. Training from scratch.\n",
"ining-demo/0 [default3]:Initializing distributed: GLOBAL_RANK: 3, MEMBER: 4/8\n",
"ining-demo/0 [default5]:Initializing distributed: GLOBAL_RANK: 5, MEMBER: 6/8\n",
"ining-demo/0 [default0]:[NeMo I 2025-03-07 22:33:57 megatron_init:426] Rank 0 has data parallel group : [0, 2, 4, 6]\n",
"ining-demo/0 [default0]:[NeMo I 2025-03-07 22:33:57 megatron_init:432] Rank 0 has combined group of data parallel and context parallel : [0, 2, 4, 6]\n",
"ining-demo/0 [default0]:[NeMo I 2025-03-07 22:33:57 megatron_init:437] All data parallel group ranks with context parallel combined: [[0, 2, 4, 6], [1, 3, 5, 7]]\n",
"ining-demo/0 [default0]:[NeMo I 2025-03-07 22:33:57 megatron_init:440] Ranks 0 has data parallel rank: 0\n",
"ining-demo/0 [default0]:[NeMo I 2025-03-07 22:33:57 megatron_init:448] Rank 0 has context parallel group: [0]\n",
"ining-demo/0 [default0]:[NeMo I 2025-03-07 22:33:57 megatron_init:451] All context parallel group ranks: [[0], [1], [2], [3], [4], [5], [6], [7]]\n",
"ining-demo/0 [default0]:[NeMo I 2025-03-07 22:33:57 megatron_init:452] Ranks 0 has context parallel rank: 0\n",
"ining-demo/0 [default0]:[NeMo I 2025-03-07 22:33:57 megatron_init:459] Rank 0 has model parallel group: [0, 1]\n",
"ining-demo/0 [default0]:[NeMo I 2025-03-07 22:33:57 megatron_init:460] All model parallel group ranks: [[0, 1], [2, 3], [4, 5], [6, 7]]\n",
"ining-demo/0 [default0]:[NeMo I 2025-03-07 22:33:57 megatron_init:469] Rank 0 has tensor model parallel group: [0, 1]\n",
"ining-demo/0 [default0]:[NeMo I 2025-03-07 22:33:57 megatron_init:473] All tensor model parallel group ranks: [[0, 1], [2, 3], [4, 5], [6, 7]]\n",
"ining-demo/0 [default0]:[NeMo I 2025-03-07 22:33:57 megatron_init:474] Rank 0 has tensor model parallel rank: 0\n",
"ining-demo/0 [default0]:[NeMo I 2025-03-07 22:33:57 megatron_init:494] Rank 0 has pipeline model parallel group: [0]\n",
"ining-demo/0 [default0]:[NeMo I 2025-03-07 22:33:57 megatron_init:506] Rank 0 has embedding group: [0]\n",
"ining-demo/0 [default0]:[NeMo I 2025-03-07 22:33:57 megatron_init:512] All pipeline model parallel group ranks: [[0], [1], [2], [3], [4], [5], [6], [7]]\n",
"ining-demo/0 [default0]:[NeMo I 2025-03-07 22:33:57 megatron_init:513] Rank 0 has pipeline model parallel rank 0\n",
"ining-demo/0 [default0]:[NeMo I 2025-03-07 22:33:57 megatron_init:514] All embedding group ranks: [[0], [1], [2], [3], [4], [5], [6], [7]]\n",
"ining-demo/0 [default0]:[NeMo I 2025-03-07 22:33:57 megatron_init:515] Rank 0 has embedding rank: 0\n",
"ining-demo/0 [default0]:Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/8\n",
"ining-demo/0 [default0]:----------------------------------------------------------------------------------------------------\n",
"ining-demo/0 [default0]:distributed_backend=nccl\n",
"ining-demo/0 [default0]:All distributed processes registered. Starting with 8 processes\n",
"ining-demo/0 [default0]:----------------------------------------------------------------------------------------------------\n",
"ining-demo/0 [default0]:\n",
"ining-demo/0 [default4]:Initializing distributed: GLOBAL_RANK: 4, MEMBER: 5/8\n",
"ining-demo/0 [default2]:Initializing distributed: GLOBAL_RANK: 2, MEMBER: 3/8\n",
"ining-demo/0 [default1]:Initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/8\n",
"ining-demo/0 [default7]:Initializing distributed: GLOBAL_RANK: 7, MEMBER: 8/8\n",
"ining-demo/0 [default6]:Initializing distributed: GLOBAL_RANK: 6, MEMBER: 7/8\n",
"ining-demo/0 [default0]:[NeMo I 2025-03-07 22:33:59 base:44] Padded vocab_size: 50432, original vocab_size: 50257, dummy tokens: 175.\n",
"ining-demo/0 [default0]:[NeMo I 2025-03-07 22:33:59 megatron_strategy:327] Copying Trainer's 'max_steps' (20) to LR scheduler's 'max_steps'.\n",
"ining-demo/0 [default0]:[NeMo I 2025-03-07 22:33:59 num_microbatches_calculator:228] setting number of microbatches to constant 128\n",
"ining-demo/0 [default0]:[NeMo I 2025-03-07 22:33:59 megatron_parallel:549] > number of parameters on (tensor, pipeline) model parallel rank (0, 0): 54663936\n",
"ining-demo/0 [default0]:[NeMo I 2025-03-07 22:33:59 utils:302] Setting up DistributedDataParallel with config DistributedDataParallelConfig(grad_reduce_in_fp32=True, overlap_grad_reduce=True, overlap_param_gather=True, align_param_gather=False, use_distributed_optimizer=True, num_distributed_optimizer_instances=1, check_for_nan_in_grad=True, bucket_size=40000000, average_in_collective=True, fp8_param_gather=False)\n",
"ining-demo/0 [default0]:[NeMo I 2025-03-07 22:33:59 utils:323] Number of buckets for gradient all-reduce / reduce-scatter: 1\n",
"ining-demo/0 [default0]: Params for bucket 1 (54663936 elements):\n",
"ining-demo/0 [default0]: \tmodule.decoder.layers.3.self_attention.linear_proj.weight\n",
"ining-demo/0 [default0]: \tmodule.decoder.layers.1.mlp.linear_fc1.layer_norm_weight\n",
"ining-demo/0 [default0]: \tmodule.decoder.layers.2.self_attention.linear_proj.weight\n",
"ining-demo/0 [default0]: \tmodule.decoder.layers.3.mlp.linear_fc1.layer_norm_weight\n",
"ining-demo/0 [default0]: \tmodule.decoder.layers.2.mlp.linear_fc2.weight\n",
"ining-demo/0 [default0]: \tmodule.decoder.layers.1.self_attention.linear_proj.weight\n",
"ining-demo/0 [default0]: \tmodule.decoder.layers.0.self_attention.linear_qkv.weight\n",
"ining-demo/0 [default0]: \tmodule.decoder.layers.1.mlp.linear_fc2.weight\n",
"ining-demo/0 [default0]: \tmodule.decoder.layers.1.mlp.linear_fc1.weight\n",
"ining-demo/0 [default0]: \tmodule.decoder.layers.3.mlp.linear_fc2.weight\n",
"ining-demo/0 [default0]: \tmodule.decoder.layers.2.mlp.linear_fc1.weight\n",
"ining-demo/0 [default0]: \tmodule.decoder.layers.0.mlp.linear_fc1.weight\n",
"ining-demo/0 [default0]: \tmodule.decoder.layers.0.mlp.linear_fc1.layer_norm_weight\n",
"ining-demo/0 [default0]: \tmodule.decoder.layers.2.self_attention.linear_qkv.weight\n",
"ining-demo/0 [default0]: \tmodule.decoder.layers.0.mlp.linear_fc2.weight\n",
"ining-demo/0 [default0]: \tmodule.output_layer.weight\n",
"ining-demo/0 [default0]: \tmodule.decoder.layers.3.mlp.linear_fc1.weight\n",
"ining-demo/0 [default0]: \tmodule.decoder.final_layernorm.weight\n",
"ining-demo/0 [default0]: \tmodule.decoder.layers.2.self_attention.linear_qkv.layer_norm_weight\n",
"ining-demo/0 [default0]: \tmodule.decoder.layers.0.self_attention.linear_qkv.layer_norm_weight\n",
"ining-demo/0 [default0]: \tmodule.decoder.layers.0.self_attention.linear_proj.weight\n",
"ining-demo/0 [default0]: \tmodule.decoder.layers.3.self_attention.linear_qkv.weight\n",
"ining-demo/0 [default0]: \tmodule.decoder.layers.1.self_attention.linear_qkv.weight\n",
"ining-demo/0 [default0]: \tmodule.embedding.word_embeddings.weight\n",
"ining-demo/0 [default0]: \tmodule.decoder.layers.1.self_attention.linear_qkv.layer_norm_weight\n",
"ining-demo/0 [default0]: \tmodule.decoder.layers.3.self_attention.linear_qkv.layer_norm_weight\n",
"ining-demo/0 [default0]: \tmodule.decoder.layers.2.mlp.linear_fc1.layer_norm_weight\n",
"ining-demo/0 [default0]:[NeMo I 2025-03-07 22:33:59 utils:302] Setting up optimizer with config OptimizerConfig(optimizer='adam', lr=0.0003, min_lr=None, decoupled_lr=None, decoupled_min_lr=None, weight_decay=0.1, fp16=False, bf16=True, params_dtype=torch.bfloat16, loss_scale=None, initial_loss_scale=4294967296, min_loss_scale=1.0, loss_scale_window=1000, hysteresis=2, adam_beta1=0.9, adam_beta2=0.95, adam_eps=1e-05, sgd_momentum=0.9, use_distributed_optimizer=True, overlap_param_gather_with_optimizer_step=False, clip_grad=1.0, log_num_zeros_in_grad=False, barrier_with_L1_time=False, timers=None, config_logger_dir='')\n",
"ining-demo/0 [default5]:LOCAL_RANK: 5 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]\n",
"ining-demo/0 [default3]:LOCAL_RANK: 3 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]\n",
"ining-demo/0 [default4]:LOCAL_RANK: 4 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]\n",
"ining-demo/0 [default2]:LOCAL_RANK: 2 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]\n",
"ining-demo/0 [default1]:LOCAL_RANK: 1 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]\n",
"ining-demo/0 [default0]:LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]\n",
"ining-demo/0 [default0]:\n",
"ining-demo/0 [default7]:LOCAL_RANK: 7 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]\n",
"ining-demo/0 [default0]: | Name | Type | Params | Mode \n",
"ining-demo/0 [default0]:----------------------------------------\n",
"ining-demo/0 [default0]:0 | module | DDP | 54.7 M | train\n",
"ining-demo/0 [default0]:----------------------------------------\n",
"ining-demo/0 [default0]:54.7 M Trainable params\n",
"ining-demo/0 [default0]:0 Non-trainable params\n",
"ining-demo/0 [default0]:54.7 M Total params\n",
"ining-demo/0 [default0]:218.656 Total estimated model params size (MB)\n",
"ining-demo/0 [default0]:91 Modules in train mode\n",
"ining-demo/0 [default0]:0 Modules in eval mode\n",
"ining-demo/0 [default6]:LOCAL_RANK: 6 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]\n",
"ining-demo/0 [default0]:[NeMo W 2025-03-07 22:34:10 rerun_state_machine:1088] Implicit initialization of Rerun State Machine!\n",
"ining-demo/0 [default0]:[NeMo W 2025-03-07 22:34:10 rerun_state_machine:211] RerunStateMachine initialized in mode disabled\n",
"ining-demo/0 [default0]:Training epoch 0, iteration 0/19 | lr: 1.499e-07 | global_batch_size: 512 | global_step: 0 | reduced_train_loss: 11.03 | train_step_timing in s: 10.31\n",
"ining-demo/0 [default0]:Training epoch 0, iteration 1/19 | lr: 2.999e-07 | global_batch_size: 512 | global_step: 1 | reduced_train_loss: 11.03 | train_step_timing in s: 7.549 | consumed_samples: 1024\n",
"ining-demo/0 [default0]:Training epoch 0, iteration 2/19 | lr: 4.498e-07 | global_batch_size: 512 | global_step: 2 | reduced_train_loss: 11.03 | train_step_timing in s: 7.556 | consumed_samples: 1536\n",
"ining-demo/0 [default0]:Training epoch 0, iteration 3/19 | lr: 5.997e-07 | global_batch_size: 512 | global_step: 3 | reduced_train_loss: 11.03 | train_step_timing in s: 7.561 | consumed_samples: 2048\n",
"ining-demo/0 [default0]:Training epoch 0, iteration 4/19 | lr: 7.496e-07 | global_batch_size: 512 | global_step: 4 | reduced_train_loss: 11.03 | train_step_timing in s: 7.566 | consumed_samples: 2560\n",
"ining-demo/0 [default0]:Training epoch 0, iteration 5/19 | lr: 8.996e-07 | global_batch_size: 512 | global_step: 5 | reduced_train_loss: 11.03 | train_step_timing in s: 7.559 | consumed_samples: 3072\n",
"ining-demo/0 [default0]:Training epoch 0, iteration 6/19 | lr: 1.049e-06 | global_batch_size: 512 | global_step: 6 | reduced_train_loss: 11.03 | train_step_timing in s: 7.568 | consumed_samples: 3584\n",
"ining-demo/0 [default0]:Training epoch 0, iteration 7/19 | lr: 1.199e-06 | global_batch_size: 512 | global_step: 7 | reduced_train_loss: 11.03 | train_step_timing in s: 7.565 | consumed_samples: 4096\n",
"ining-demo/0 [default0]:Training epoch 0, iteration 8/19 | lr: 1.349e-06 | global_batch_size: 512 | global_step: 8 | reduced_train_loss: 11.03 | train_step_timing in s: 7.572 | consumed_samples: 4608\n",
"ining-demo/0 [default0]:Training epoch 0, iteration 9/19 | lr: 1.499e-06 | global_batch_size: 512 | global_step: 9 | reduced_train_loss: 11.03 | train_step_timing in s: 7.572 | consumed_samples: 5120\n",
"ining-demo/0 [default0]:[NeMo W 2025-03-07 22:35:19 validation:389] There is difference in the common state dict in different ranks. The differences are {2: ([], [('optimizer', 0, 'optimizer', 'param_groups', 1, 'step')], []), 3: ([], [('optimizer', 0, 'optimizer', 'param_groups', 1, 'step')], []), 4: ([], [('optimizer', 0, 'optimizer', 'param_groups', 1, 'step')], []), 5: ([], [('optimizer', 0, 'optimizer', 'param_groups', 1, 'step')], [])}\n",
"ining-demo/0 [default0]:[NeMo I 2025-03-07 22:35:22 model_checkpoint:497] Scheduled async checkpoint save for /tmp/nemo_run/checkpoints/resiliency-in-pretraining-demo/checkpoints/model_name=0--val_loss=0.00-step=9-consumed_samples=5120.0-last.ckpt\n",
"ining-demo/0 [default0]:Training epoch 0, iteration 10/19 | lr: 1.649e-06 | global_batch_size: 512 | global_step: 10 | reduced_train_loss: 11.03 | train_step_timing in s: 7.568 | consumed_samples: 5632\n",
"ining-demo/0 [default0]:[NeMo I 2025-03-07 22:35:29 model_checkpoint:522] Async checkpoint save for step 10 (/tmp/nemo_run/checkpoints/resiliency-in-pretraining-demo/checkpoints/model_name=0--val_loss=0.00-step=9-consumed_samples=5120.0-last.ckpt) finalized successfully.\n",
"ining-demo/0 [default0]:Training epoch 0, iteration 11/19 | lr: 1.799e-06 | global_batch_size: 512 | global_step: 11 | reduced_train_loss: 11.03 | train_step_timing in s: 7.578 | consumed_samples: 6144\n",
"ining-demo/0 [default0]:Training epoch 0, iteration 12/19 | lr: 1.949e-06 | global_batch_size: 512 | global_step: 12 | reduced_train_loss: 11.03 | train_step_timing in s: 7.577 | consumed_samples: 6656\n",
"ining-demo/0 [default0]:Training epoch 0, iteration 13/19 | lr: 2.099e-06 | global_batch_size: 512 | global_step: 13 | reduced_train_loss: 11.03 | train_step_timing in s: 7.575 | consumed_samples: 7168\n",
"ining-demo/0 [default0]:Training epoch 0, iteration 14/19 | lr: 2.249e-06 | global_batch_size: 512 | global_step: 14 | reduced_train_loss: 11.03 | train_step_timing in s: 7.58 | consumed_samples: 7680\n",
"ining-demo/0 [default0]:Training epoch 0, iteration 15/19 | lr: 2.399e-06 | global_batch_size: 512 | global_step: 15 | reduced_train_loss: 11.03 | train_step_timing in s: 7.578 | consumed_samples: 8192\n",
"ining-demo/0 [default0]:Training epoch 0, iteration 16/19 | lr: 2.549e-06 | global_batch_size: 512 | global_step: 16 | reduced_train_loss: 11.03 | train_step_timing in s: 7.578 | consumed_samples: 8704\n",
"ining-demo/0 [default0]:Training epoch 0, iteration 17/19 | lr: 2.699e-06 | global_batch_size: 512 | global_step: 17 | reduced_train_loss: 11.03 | train_step_timing in s: 7.584 | consumed_samples: 9216\n",
"ining-demo/0 [default0]:Training epoch 0, iteration 18/19 | lr: 2.849e-06 | global_batch_size: 512 | global_step: 18 | reduced_train_loss: 11.03 | train_step_timing in s: 7.579 | consumed_samples: 9728\n",
"ining-demo/0 [default0]:[NeMo W 2025-03-07 22:36:30 validation:389] There is difference in the common state dict in different ranks. The differences are {2: ([], [('optimizer', 0, 'optimizer', 'param_groups', 1, 'step')], []), 3: ([], [('optimizer', 0, 'optimizer', 'param_groups', 1, 'step')], []), 4: ([], [('optimizer', 0, 'optimizer', 'param_groups', 1, 'step')], []), 5: ([], [('optimizer', 0, 'optimizer', 'param_groups', 1, 'step')], [])}\n",
"ining-demo/0 [default0]:[NeMo I 2025-03-07 22:36:31 model_checkpoint:497] Scheduled async checkpoint save for /tmp/nemo_run/checkpoints/resiliency-in-pretraining-demo/checkpoints/model_name=0--val_loss=0.00-step=18-consumed_samples=9728.0-last.ckpt\n",
"ining-demo/0 [default0]:Training epoch 1, iteration 0/19 | lr: 2.999e-06 | global_batch_size: 512 | global_step: 19 | reduced_train_loss: 11.03 | train_step_timing in s: 7.592 | consumed_samples: 10240\n",
"ining-demo/0 [default0]:[NeMo I 2025-03-07 22:36:39 model_checkpoint:522] Async checkpoint save for step 19 (/tmp/nemo_run/checkpoints/resiliency-in-pretraining-demo/checkpoints/model_name=0--val_loss=0.00-step=18-consumed_samples=9728.0-last.ckpt) finalized successfully.\n",
"ining-demo/0 [default0]:`Trainer.fit` stopped: `max_steps=20` reached.\n",
"ining-demo/0 I0307 22:36:44.958000 170829 torch/distributed/elastic/agent/server/api.py:864] [default] worker group successfully finished. Waiting 300 seconds for other agents to finish.\n",
"ining-demo/0 I0307 22:36:44.958000 170829 torch/distributed/elastic/agent/server/api.py:917] Local worker group finished (WorkerState.SUCCEEDED). Waiting 300 seconds for other agents to finish\n",
"ining-demo/0 I0307 22:36:44.959000 170829 torch/distributed/elastic/agent/server/api.py:931] Done waiting for other agents. Elapsed: 0.0002548694610595703 seconds\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"Job resiliency-in-pretraining-demo-xb3fnk3npq9wn finished: SUCCEEDED\n"
]
},
{
"data": {
"text/html": [
"\n", "# The experiment was run with the following tasks: ['resiliency-in-pretraining-demo'] \n", "# You can inspect and reconstruct this experiment at a later point in time using: \n", "experiment = run.Experiment.from_id(\"resiliency-in-pretraining-demo_1741386824\") \n", "experiment.status() # Gets the overall status \n", "experiment.logs(\"resiliency-in-pretraining-demo\") # Gets the log for the provided task \n", "experiment.cancel(\"resiliency-in-pretraining-demo\") # Cancels the provided task if still running \n", " \n", "\n" ], "text/plain": [ "\u001b[48;2;39;40;34m \u001b[0m\n", "\u001b[38;2;149;144;119;48;2;39;40;34m# The experiment was run with the following tasks: ['resiliency-in-pretraining-demo']\u001b[0m\u001b[48;2;39;40;34m \u001b[0m\n", "\u001b[38;2;149;144;119;48;2;39;40;34m# You can inspect and reconstruct this experiment at a later point in time using:\u001b[0m\u001b[48;2;39;40;34m \u001b[0m\n", "\u001b[38;2;248;248;242;48;2;39;40;34mexperiment\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;255;70;137;48;2;39;40;34m=\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mrun\u001b[0m\u001b[38;2;255;70;137;48;2;39;40;34m.\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mExperiment\u001b[0m\u001b[38;2;255;70;137;48;2;39;40;34m.\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mfrom_id\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m(\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m\"\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34mresiliency-in-pretraining-demo_1741386824\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m\"\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m)\u001b[0m\u001b[48;2;39;40;34m \u001b[0m\n", "\u001b[38;2;248;248;242;48;2;39;40;34mexperiment\u001b[0m\u001b[38;2;255;70;137;48;2;39;40;34m.\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mstatus\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m(\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m)\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;149;144;119;48;2;39;40;34m# Gets the overall status\u001b[0m\u001b[48;2;39;40;34m \u001b[0m\n", "\u001b[38;2;248;248;242;48;2;39;40;34mexperiment\u001b[0m\u001b[38;2;255;70;137;48;2;39;40;34m.\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mlogs\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m(\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m\"\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34mresiliency-in-pretraining-demo\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m\"\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m)\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;149;144;119;48;2;39;40;34m# Gets the log for the provided task\u001b[0m\u001b[48;2;39;40;34m \u001b[0m\n", "\u001b[38;2;248;248;242;48;2;39;40;34mexperiment\u001b[0m\u001b[38;2;255;70;137;48;2;39;40;34m.\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mcancel\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m(\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m\"\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34mresiliency-in-pretraining-demo\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m\"\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m)\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;149;144;119;48;2;39;40;34m# Cancels the provided task if still running\u001b[0m\u001b[48;2;39;40;34m \u001b[0m\n", "\u001b[48;2;39;40;34m \u001b[0m\n" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
\n", "# You can inspect this experiment at a later point in time using the CLI as well: \n", "nemo experiment status resiliency-in-pretraining-demo_1741386824 \n", "nemo experiment logs resiliency-in-pretraining-demo_1741386824 0 \n", "nemo experiment cancel resiliency-in-pretraining-demo_1741386824 0 \n", " \n", "\n" ], "text/plain": [ "\u001b[48;2;39;40;34m \u001b[0m\n", "\u001b[38;2;149;144;119;48;2;39;40;34m# You can inspect this experiment at a later point in time using the CLI as well:\u001b[0m\u001b[48;2;39;40;34m \u001b[0m\n", "\u001b[38;2;248;248;242;48;2;39;40;34mnemo\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mexperiment\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mstatus\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mresiliency-in-pretraining-demo_1741386824\u001b[0m\u001b[48;2;39;40;34m \u001b[0m\n", "\u001b[38;2;248;248;242;48;2;39;40;34mnemo\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mexperiment\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mlogs\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mresiliency-in-pretraining-demo_1741386824\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;174;129;255;48;2;39;40;34m0\u001b[0m\u001b[48;2;39;40;34m \u001b[0m\n", "\u001b[38;2;248;248;242;48;2;39;40;34mnemo\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mexperiment\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mcancel\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mresiliency-in-pretraining-demo_1741386824\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;174;129;255;48;2;39;40;34m0\u001b[0m\u001b[48;2;39;40;34m \u001b[0m\n", "\u001b[48;2;39;40;34m \u001b[0m\n" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# run the experiment\n", "run_plugins = []\n", "run_experiment(exp_name, pretrain, executor, run_plugins, dryrun=False)" ] }, { "cell_type": "markdown", "id": "029deebc-505f-4609-a274-38825bb91971", "metadata": {}, "source": [ "## 1.5 Cleanup and save clean states" ] }, { "cell_type": "code", "execution_count": 9, "id": "0bf0f185-b943-4050-b74d-6d8a4c0333bc", "metadata": {}, "outputs": [], "source": [ "%%bash\n", "# delete old checkpoints\n", "rm -rf /tmp/nemo_run/checkpoints/" ] }, { "cell_type": "code", "execution_count": 10, "id": "57feef98-da24-4ac5-b218-9dad6249d7c1", "metadata": {}, "outputs": [], "source": [ "pretrain_trainer_callbacks = copy.deepcopy(pretrain.trainer.callbacks)\n", "pretrain_trainer_callbacks\n", "run_plugins = []" ] }, { "cell_type": "markdown", "id": "71a72081-6f2e-46e7-9bec-2122b1d45acf", "metadata": {}, "source": [ "# 2. 
Demonstrate fault tolerance with crash detection and in-job restart\n", "The [Fault Tolerance plugin](https://github.com/NVIDIA/NeMo/blob/main/nemo/lightning/run/plugins.py):\n", "- Detects hangs/crashes during training and relaunches the training job without manual intervention\n", "- Uses the NVIDIA Resiliency Extension's `ft_launcher`, which has been integrated into [NeMo-Run](https://github.com/NVIDIA/NeMo-Run) as [FaultTolerance](https://github.com/NVIDIA/NeMo-Run/blob/main/nemo_run/core/execution/launcher.py).\n", "- Also uses the `FaultToleranceCallback` from the NVIDIA Resiliency Extension, which sets up the heartbeats." ] }, { "cell_type": "markdown", "id": "5d4c0e6b-551b-46d9-8c3e-73626c5ab147", "metadata": {}, "source": [ "## 2.1 Set up the FaultTolerancePlugin\n", "The following environment variables also need to be set:\n", "- `FAULT_TOL_CFG_PATH` is the path to the fault tolerance config file. If it is empty, the default configuration is used.\n", "- `FAULT_TOL_FINISHED_FLAG_FILE` is the path where the fault tolerance package writes a flag once a run completes successfully, so that a relaunch is not triggered." ] }, { "cell_type": "code", "execution_count": 11, "id": "99a1e083", "metadata": {}, "outputs": [], "source": [ "# Add the FaultTolerancePlugin and set up the required env vars\n", "run_plugins = [plugins.FaultTolerancePlugin()]\n", "\n", "os.environ[\"FAULT_TOL_CFG_PATH\"] = \"/tmp/sample_job_ft_cfg.yml\"\n", "os.environ[\"FAULT_TOL_FINISHED_FLAG_FILE\"] = \"/tmp/sample_job_finished_flag\"" ] }, { "cell_type": "markdown", "id": "bd88a1a5", "metadata": {}, "source": [ "## 2.2 Set up the crash simulator and run the experiment\n", "We use the `CrashSimulationCallback` to simulate a crash during training. This callback is configured to crash the process at step 17 if a crash has not already occurred.\n", "\n", "Expected workflow:\n", "- Start training: Trainer Step counter = 0\n", "- After 10 trainer steps: Trainer Step counter = 10 -> save checkpoint\n", "- After 17 trainer steps: Trainer Step counter = 17 -> crash simulated, set `has_simulated_crash_happened` to `True`\n", "- Automatic in-job restart from checkpoint at step 10: Trainer Step counter = 10\n", "- After 17 trainer steps: Trainer Step counter = 17 -> no crash simulated as `has_simulated_crash_happened == True`\n", "- After 20 trainer steps: Trainer Step counter = 20 -> successfully completes training" ] }, { "cell_type": "code", "execution_count": 12, "id": "dd2943a2", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
────── Entering Experiment resiliency-in-pretraining-demo with id: resiliency-in-pretraining-demo_1741387007 ──────\n", "\n" ], "text/plain": [ "\u001b[92m────── \u001b[0m\u001b[1;35mEntering Experiment resiliency-in-pretraining-demo with id: resiliency-in-pretraining-demo_1741387007\u001b[0m\u001b[92m ──────\u001b[0m\n" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "Launching training run ...\n" ] }, { "data": { "text/html": [ "
[22:36:47] Cannot detach from this experiment. Please keep it running until completion. experiment.py:651\n", "\n" ], "text/plain": [ "\u001b[2;36m[22:36:47]\u001b[0m\u001b[2;36m \u001b[0m\u001b[1;31m Cannot detach from this experiment. Please keep it running until completion.\u001b[0m \u001b]8;id=246338;file:///opt/NeMo-Run/src/nemo_run/run/experiment.py\u001b\\\u001b[2mexperiment.py\u001b[0m\u001b]8;;\u001b\\\u001b[2m:\u001b[0m\u001b]8;id=43525;file:///opt/NeMo-Run/src/nemo_run/run/experiment.py#651\u001b\\\u001b[2m651\u001b[0m\u001b]8;;\u001b\\\n" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stderr", "output_type": "stream", "text": [ "Log directory is: /root/.nemo_run/experiments/resiliency-in-pretraining-demo/resiliency-in-pretraining-demo_1741387007/resiliency-in-pretraining-demo\n" ] }, { "data": { "text/html": [ "
Launching job resiliency-in-pretraining-demo for experiment experiment.py:724\n", " resiliency-in-pretraining-demo \n", "\n" ], "text/plain": [ "\u001b[2;36m \u001b[0m\u001b[2;36m \u001b[0m\u001b[1;36mLaunching job resiliency-in-pretraining-demo for experiment \u001b[0m \u001b]8;id=900784;file:///opt/NeMo-Run/src/nemo_run/run/experiment.py\u001b\\\u001b[2mexperiment.py\u001b[0m\u001b]8;;\u001b\\\u001b[2m:\u001b[0m\u001b]8;id=467950;file:///opt/NeMo-Run/src/nemo_run/run/experiment.py#724\u001b\\\u001b[2m724\u001b[0m\u001b]8;;\u001b\\\n", "\u001b[2;36m \u001b[0m\u001b[1;36mresiliency-in-pretraining-demo\u001b[0m \u001b[2m \u001b[0m\n" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stderr", "output_type": "stream", "text": [ "Log directory is: /root/.nemo_run/experiments/resiliency-in-pretraining-demo/resiliency-in-pretraining-demo_1741387007/resiliency-in-pretraining-demo\n", "Launched app: local_persistent://nemo_run/resiliency-in-pretraining-demo-ghpmrpzqnhtb0\n", "AppStatus:\n", " State: RUNNING\n", " Num Restarts: 0\n", " Roles: \n", " Msg:
─────────────────── Waiting for Experiment resiliency-in-pretraining-demo_1741387007 to finish ────────────────────\n", "\n" ], "text/plain": [ "\u001b[92m─────────────────── \u001b[0m\u001b[1;35mWaiting for Experiment resiliency-in-pretraining-demo_1741387007 to finish\u001b[0m\u001b[92m ────────────────────\u001b[0m\n" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
\n",
"\n"
],
"text/plain": [
"\n"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"Experiment Status for resiliency-in-pretraining-demo_1741387007\n", "\n" ], "text/plain": [ "\u001b[1;32mExperiment Status for\u001b[0m \u001b[1;38;5;214mresiliency-in-pretraining-demo_1741387007\u001b[0m\n" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
\n",
"Task 0: resiliency-in-pretraining-demo\n",
"- Status: RUNNING\n",
"- Executor: LocalExecutor\n",
"- Job id: resiliency-in-pretraining-demo-ghpmrpzqnhtb0\n",
"- Local Directory: /root/.nemo_run/experiments/resiliency-in-pretraining-demo/resiliency-in-pretraining-demo_1741387007/resiliency-in-pretraining-demo\n",
"\n"
],
"text/plain": [
"\n",
"\u001b[1;32mTask 0\u001b[0m: \u001b[1;38;5;214mresiliency-in-pretraining-demo\u001b[0m\n",
"- \u001b[1;32mStatus\u001b[0m: RUNNING\n",
"- \u001b[1;32mExecutor\u001b[0m: LocalExecutor\n",
"- \u001b[1;32mJob id\u001b[0m: resiliency-in-pretraining-demo-ghpmrpzqnhtb0\n",
"- \u001b[1;32mLocal Directory\u001b[0m: /root/.nemo_run/experiments/resiliency-in-pretraining-demo/resiliency-in-pretraining-demo_1741387007/resiliency-in-pretraining-demo\n"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"\n",
"\n"
],
"text/plain": [
"\n"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"Waiting for job resiliency-in-pretraining-demo-ghpmrpzqnhtb0 to finish [log=True]...\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"ining-demo/0 [2025-03-07 22:36:48,812] [WARNING] [ft_launcher@4809917ea058] master_addr is only used for static rdzv_backend and when rdzv_endpoint is not specified.\n",
"ining-demo/0 [2025-03-07 22:36:48,812] [WARNING] [ft_launcher@4809917ea058] \n",
"ining-demo/0 *****************************************\n",
"ining-demo/0 Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. \n",
"ining-demo/0 *****************************************\n",
"ining-demo/0 [2025-03-07 22:36:48,816] [INFO] [ft_launcher@4809917ea058] [default] starting workers for entrypoint: python\n",
"ining-demo/0 [2025-03-07 22:36:48,817] [INFO] [ft_launcher@4809917ea058] [default] Rendezvous'ing worker group\n",
"ining-demo/0 [2025-03-07 22:36:49,126] [INFO] [ft_launcher@4809917ea058] [default] Rendezvous complete for workers. Result:\n",
"ining-demo/0 restart_count=0\n",
"ining-demo/0 master_addr=4809917ea058\n",
"ining-demo/0 master_port=47865\n",
"ining-demo/0 group_rank=0\n",
"ining-demo/0 group_world_size=1\n",
"ining-demo/0 local_ranks=[0, 1, 2, 3, 4, 5, 6, 7]\n",
"ining-demo/0 role_ranks=[0, 1, 2, 3, 4, 5, 6, 7]\n",
"ining-demo/0 global_ranks=[0, 1, 2, 3, 4, 5, 6, 7]\n",
"ining-demo/0 role_world_sizes=[8, 8, 8, 8, 8, 8, 8, 8]\n",
"ining-demo/0 global_world_sizes=[8, 8, 8, 8, 8, 8, 8, 8]\n",
"ining-demo/0 \n",
"ining-demo/0 [2025-03-07 22:36:49,126] [INFO] [ft_launcher@4809917ea058] [default] Starting worker group\n",
"ining-demo/0 [2025-03-07 22:37:00,008] [INFO] [ft_launcher@4809917ea058] Setting worker0 reply file to: /root/.nemo_run/experiments/resiliency-in-pretraining-demo/resiliency-in-pretraining-demo_1741387007/resiliency-in-pretraining-demo/nemo_run/resiliency-in-pretraining-demo-ghpmrpzqnhtb0/torchelastic/resiliency-in-pretraining-demo/6224_gnxzn9y9/attempt_0/0/error.json\n",
"ining-demo/0 [2025-03-07 22:37:00,008] [INFO] [ft_launcher@4809917ea058] Setting worker1 reply file to: /root/.nemo_run/experiments/resiliency-in-pretraining-demo/resiliency-in-pretraining-demo_1741387007/resiliency-in-pretraining-demo/nemo_run/resiliency-in-pretraining-demo-ghpmrpzqnhtb0/torchelastic/resiliency-in-pretraining-demo/6224_gnxzn9y9/attempt_0/1/error.json\n",
"ining-demo/0 [2025-03-07 22:37:00,008] [INFO] [ft_launcher@4809917ea058] Setting worker2 reply file to: /root/.nemo_run/experiments/resiliency-in-pretraining-demo/resiliency-in-pretraining-demo_1741387007/resiliency-in-pretraining-demo/nemo_run/resiliency-in-pretraining-demo-ghpmrpzqnhtb0/torchelastic/resiliency-in-pretraining-demo/6224_gnxzn9y9/attempt_0/2/error.json\n",
"ining-demo/0 [2025-03-07 22:37:00,008] [INFO] [ft_launcher@4809917ea058] Setting worker3 reply file to: /root/.nemo_run/experiments/resiliency-in-pretraining-demo/resiliency-in-pretraining-demo_1741387007/resiliency-in-pretraining-demo/nemo_run/resiliency-in-pretraining-demo-ghpmrpzqnhtb0/torchelastic/resiliency-in-pretraining-demo/6224_gnxzn9y9/attempt_0/3/error.json\n",
"ining-demo/0 [2025-03-07 22:37:00,008] [INFO] [ft_launcher@4809917ea058] Setting worker4 reply file to: /root/.nemo_run/experiments/resiliency-in-pretraining-demo/resiliency-in-pretraining-demo_1741387007/resiliency-in-pretraining-demo/nemo_run/resiliency-in-pretraining-demo-ghpmrpzqnhtb0/torchelastic/resiliency-in-pretraining-demo/6224_gnxzn9y9/attempt_0/4/error.json\n",
"ining-demo/0 [2025-03-07 22:37:00,008] [INFO] [ft_launcher@4809917ea058] Setting worker5 reply file to: /root/.nemo_run/experiments/resiliency-in-pretraining-demo/resiliency-in-pretraining-demo_1741387007/resiliency-in-pretraining-demo/nemo_run/resiliency-in-pretraining-demo-ghpmrpzqnhtb0/torchelastic/resiliency-in-pretraining-demo/6224_gnxzn9y9/attempt_0/5/error.json\n",
"ining-demo/0 [2025-03-07 22:37:00,009] [INFO] [ft_launcher@4809917ea058] Setting worker6 reply file to: /root/.nemo_run/experiments/resiliency-in-pretraining-demo/resiliency-in-pretraining-demo_1741387007/resiliency-in-pretraining-demo/nemo_run/resiliency-in-pretraining-demo-ghpmrpzqnhtb0/torchelastic/resiliency-in-pretraining-demo/6224_gnxzn9y9/attempt_0/6/error.json\n",
"ining-demo/0 [2025-03-07 22:37:00,009] [INFO] [ft_launcher@4809917ea058] Setting worker7 reply file to: /root/.nemo_run/experiments/resiliency-in-pretraining-demo/resiliency-in-pretraining-demo_1741387007/resiliency-in-pretraining-demo/nemo_run/resiliency-in-pretraining-demo-ghpmrpzqnhtb0/torchelastic/resiliency-in-pretraining-demo/6224_gnxzn9y9/attempt_0/7/error.json\n",
"ining-demo/0 [default0]:[NeMo W 2025-03-07 22:37:09 nemo_logging:361] /usr/local/lib/python3.10/dist-packages/pyannote/core/notebook.py:134: MatplotlibDeprecationWarning: The get_cmap function was deprecated in Matplotlib 3.7 and will be removed in 3.11. Use ``matplotlib.colormaps[name]`` or ``matplotlib.colormaps.get_cmap()`` or ``pyplot.get_cmap()`` instead.\n",
"ining-demo/0 [default0]: cm = get_cmap(\"Set1\")\n",
"ining-demo/0 [default0]: \n",
"ining-demo/0 [default2]:Setup to simulate a crash if step == 17 and a crash hasn't been simulated before\n",
"ining-demo/0 [default2]:Initializing distributed: GLOBAL_RANK: 2, MEMBER: 3/8\n",
"ining-demo/0 [default6]:Setup to simulate a crash if step == 17 and a crash hasn't been simulated before\n",
"ining-demo/0 [default0]:[NeMo I 2025-03-07 22:37:10 tokenizer_utils:224] Getting Megatron tokenizer for pretrained model name: megatron-gpt-345m, custom vocab file: None, and merges file: None\n",
"ining-demo/0 [default0]:[NeMo I 2025-03-07 22:37:10 tokenizer_utils:130] Getting HuggingFace AutoTokenizer with pretrained_model_name: gpt2, vocab_file: /root/.cache/torch/megatron/megatron-gpt-345m_vocab, merges_files: /root/.cache/torch/megatron/megatron-gpt-345m_merges, special_tokens_dict: {}, and use_fast: False\n",
"ining-demo/0 [default4]:Setup to simulate a crash if step == 17 and a crash hasn't been simulated before\n",
"ining-demo/0 [default1]:Setup to simulate a crash if step == 17 and a crash hasn't been simulated before\n",
"ining-demo/0 [default0]:Setup to simulate a crash if step == 17 and a crash hasn't been simulated before\n",
"ining-demo/0 [default0]:[NeMo I 2025-03-07 22:37:11 nemo_logger:145] Experiments will be logged at /tmp/nemo_run/checkpoints/resiliency-in-pretraining-demo\n",
"ining-demo/0 [default0]:[NeMo I 2025-03-07 22:37:11 megatron_strategy:315] Fixing mis-match between ddp-config & mcore-optimizer config\n",
"ining-demo/0 [default7]:Setup to simulate a crash if step == 17 and a crash hasn't been simulated before\n",
"ining-demo/0 [default0]:GPU available: True (cuda), used: True\n",
"ining-demo/0 [default0]:TPU available: False, using: 0 TPU cores\n",
"ining-demo/0 [default0]:HPU available: False, using: 0 HPUs\n",
"ining-demo/0 [default0]:[NeMo W 2025-03-07 22:37:11 nemo_logger:123] No version folders would be created under the log folder as 'resume_if_exists' is enabled.\n",
"ining-demo/0 [default0]:[NeMo W 2025-03-07 22:37:11 nemo_logger:173] \"update_logger_directory\" is True. Overwriting tensorboard logger \"save_dir\" to /tmp/nemo_run/checkpoints/tb_logs\n",
"ining-demo/0 [default0]:[NeMo W 2025-03-07 22:37:11 nemo_logger:189] The Trainer already contains a ModelCheckpoint callback. This will be overwritten.\n",
"ining-demo/0 [default0]:[NeMo W 2025-03-07 22:37:11 nemo_logger:212] The checkpoint callback was told to monitor a validation value and trainer's max_steps was set to 20. Please ensure that max_steps will run for at least 1 epochs to ensure that checkpointing will not error out.\n",
"ining-demo/0 [default0]:[NeMo W 2025-03-07 22:37:11 resume:228] There were no checkpoints found in checkpoint_dir or no checkpoint folder at checkpoint_dir :/tmp/nemo_run/checkpoints/resiliency-in-pretraining-demo/checkpoints. Training from scratch.\n",
"ining-demo/0 [default3]:Setup to simulate a crash if step == 17 and a crash hasn't been simulated before\n",
"ining-demo/0 [default5]:Setup to simulate a crash if step == 17 and a crash hasn't been simulated before\n",
"ining-demo/0 [default6]:Initializing distributed: GLOBAL_RANK: 6, MEMBER: 7/8\n",
"ining-demo/0 [default0]:[NeMo I 2025-03-07 22:37:12 megatron_init:426] Rank 0 has data parallel group : [0, 2, 4, 6]\n",
"ining-demo/0 [default0]:[NeMo I 2025-03-07 22:37:12 megatron_init:432] Rank 0 has combined group of data parallel and context parallel : [0, 2, 4, 6]\n",
"ining-demo/0 [default0]:[NeMo I 2025-03-07 22:37:12 megatron_init:437] All data parallel group ranks with context parallel combined: [[0, 2, 4, 6], [1, 3, 5, 7]]\n",
"ining-demo/0 [default0]:[NeMo I 2025-03-07 22:37:12 megatron_init:440] Ranks 0 has data parallel rank: 0\n",
"ining-demo/0 [default0]:[NeMo I 2025-03-07 22:37:12 megatron_init:448] Rank 0 has context parallel group: [0]\n",
"ining-demo/0 [default0]:[NeMo I 2025-03-07 22:37:12 megatron_init:451] All context parallel group ranks: [[0], [1], [2], [3], [4], [5], [6], [7]]\n",
"ining-demo/0 [default0]:[NeMo I 2025-03-07 22:37:12 megatron_init:452] Ranks 0 has context parallel rank: 0\n",
"ining-demo/0 [default0]:[NeMo I 2025-03-07 22:37:12 megatron_init:459] Rank 0 has model parallel group: [0, 1]\n",
"ining-demo/0 [default0]:[NeMo I 2025-03-07 22:37:12 megatron_init:460] All model parallel group ranks: [[0, 1], [2, 3], [4, 5], [6, 7]]\n",
"ining-demo/0 [default0]:[NeMo I 2025-03-07 22:37:12 megatron_init:469] Rank 0 has tensor model parallel group: [0, 1]\n",
"ining-demo/0 [default0]:[NeMo I 2025-03-07 22:37:12 megatron_init:473] All tensor model parallel group ranks: [[0, 1], [2, 3], [4, 5], [6, 7]]\n",
"ining-demo/0 [default0]:[NeMo I 2025-03-07 22:37:12 megatron_init:474] Rank 0 has tensor model parallel rank: 0\n",
"ining-demo/0 [default0]:[NeMo I 2025-03-07 22:37:12 megatron_init:494] Rank 0 has pipeline model parallel group: [0]\n",
"ining-demo/0 [default0]:[NeMo I 2025-03-07 22:37:12 megatron_init:506] Rank 0 has embedding group: [0]\n",
"ining-demo/0 [default0]:[NeMo I 2025-03-07 22:37:12 megatron_init:512] All pipeline model parallel group ranks: [[0], [1], [2], [3], [4], [5], [6], [7]]\n",
"ining-demo/0 [default0]:[NeMo I 2025-03-07 22:37:12 megatron_init:513] Rank 0 has pipeline model parallel rank 0\n",
"ining-demo/0 [default0]:[NeMo I 2025-03-07 22:37:12 megatron_init:514] All embedding group ranks: [[0], [1], [2], [3], [4], [5], [6], [7]]\n",
"ining-demo/0 [default0]:[NeMo I 2025-03-07 22:37:12 megatron_init:515] Rank 0 has embedding rank: 0\n",
"ining-demo/0 [default0]:Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/8\n",
"ining-demo/0 [default7]:Initializing distributed: GLOBAL_RANK: 7, MEMBER: 8/8\n",
"ining-demo/0 [default4]:Initializing distributed: GLOBAL_RANK: 4, MEMBER: 5/8\n",
"ining-demo/0 [default1]:Initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/8\n",
"ining-demo/0 [default3]:Initializing distributed: GLOBAL_RANK: 3, MEMBER: 4/8\n",
"ining-demo/0 [default0]:----------------------------------------------------------------------------------------------------\n",
"ining-demo/0 [default0]:distributed_backend=nccl\n",
"ining-demo/0 [default0]:All distributed processes registered. Starting with 8 processes\n",
"ining-demo/0 [default0]:----------------------------------------------------------------------------------------------------\n",
"ining-demo/0 [default0]:\n",
"ining-demo/0 [default5]:Initializing distributed: GLOBAL_RANK: 5, MEMBER: 6/8\n",
"ining-demo/0 [default0]:[NeMo I 2025-03-07 22:37:13 fault_tolerance_callback:311] [FaultToleranceCallback@rank0] Fault tolerance dir: /tmp/nemo_run/checkpoints\n",
"ining-demo/0 [default0]:[NeMo I 2025-03-07 22:37:13 fault_tolerance_callback:311] [FaultToleranceCallback@rank0] Fault tolerance client initialized. Timeouts: HeartbeatTimeouts(initial=1800.00, subsequent=300.00, were_calculated=False)\n",
"ining-demo/0 [default0]:[NeMo I 2025-03-07 22:37:13 base:44] Padded vocab_size: 50432, original vocab_size: 50257, dummy tokens: 175.\n",
"ining-demo/0 [default0]:[NeMo I 2025-03-07 22:37:13 megatron_strategy:327] Copying Trainer's 'max_steps' (20) to LR scheduler's 'max_steps'.\n",
"ining-demo/0 [default0]:[NeMo I 2025-03-07 22:37:13 num_microbatches_calculator:228] setting number of microbatches to constant 128\n",
"ining-demo/0 [default0]:[NeMo I 2025-03-07 22:37:13 megatron_parallel:549] > number of parameters on (tensor, pipeline) model parallel rank (0, 0): 54663936\n",
"ining-demo/0 [default0]:[NeMo I 2025-03-07 22:37:13 utils:302] Setting up DistributedDataParallel with config DistributedDataParallelConfig(grad_reduce_in_fp32=True, overlap_grad_reduce=True, overlap_param_gather=True, align_param_gather=False, use_distributed_optimizer=True, num_distributed_optimizer_instances=1, check_for_nan_in_grad=True, bucket_size=40000000, average_in_collective=True, fp8_param_gather=False)\n",
"ining-demo/0 [default0]:[NeMo I 2025-03-07 22:37:13 utils:323] Number of buckets for gradient all-reduce / reduce-scatter: 1\n",
"ining-demo/0 [default0]: Params for bucket 1 (54663936 elements):\n",
"ining-demo/0 [default0]: \tmodule.output_layer.weight\n",
"ining-demo/0 [default0]: \tmodule.decoder.layers.1.self_attention.linear_qkv.weight\n",
"ining-demo/0 [default0]: \tmodule.decoder.layers.1.self_attention.linear_qkv.layer_norm_weight\n",
"ining-demo/0 [default0]: \tmodule.decoder.layers.3.mlp.linear_fc2.weight\n",
"ining-demo/0 [default0]: \tmodule.decoder.layers.2.mlp.linear_fc1.weight\n",
"ining-demo/0 [default0]: \tmodule.decoder.layers.2.self_attention.linear_qkv.weight\n",
"ining-demo/0 [default0]: \tmodule.decoder.layers.0.mlp.linear_fc2.weight\n",
"ining-demo/0 [default0]: \tmodule.decoder.layers.0.mlp.linear_fc1.weight\n",
"ining-demo/0 [default0]: \tmodule.decoder.layers.1.mlp.linear_fc1.layer_norm_weight\n",
"ining-demo/0 [default0]: \tmodule.decoder.layers.3.mlp.linear_fc1.weight\n",
"ining-demo/0 [default0]: \tmodule.decoder.layers.2.self_attention.linear_qkv.layer_norm_weight\n",
"ining-demo/0 [default0]: \tmodule.decoder.layers.0.self_attention.linear_qkv.weight\n",
"ining-demo/0 [default0]: \tmodule.decoder.final_layernorm.weight\n",
"ining-demo/0 [default0]: \tmodule.decoder.layers.3.self_attention.linear_qkv.weight\n",
"ining-demo/0 [default0]: \tmodule.decoder.layers.2.self_attention.linear_proj.weight\n",
"ining-demo/0 [default0]: \tmodule.decoder.layers.3.self_attention.linear_qkv.layer_norm_weight\n",
"ining-demo/0 [default0]: \tmodule.decoder.layers.2.mlp.linear_fc1.layer_norm_weight\n",
"ining-demo/0 [default0]: \tmodule.decoder.layers.1.mlp.linear_fc2.weight\n",
"ining-demo/0 [default0]: \tmodule.decoder.layers.3.self_attention.linear_proj.weight\n",
"ining-demo/0 [default0]: \tmodule.decoder.layers.0.mlp.linear_fc1.layer_norm_weight\n",
"ining-demo/0 [default0]: \tmodule.decoder.layers.0.self_attention.linear_proj.weight\n",
"ining-demo/0 [default0]: \tmodule.embedding.word_embeddings.weight\n",
"ining-demo/0 [default0]: \tmodule.decoder.layers.3.mlp.linear_fc1.layer_norm_weight\n",
"ining-demo/0 [default0]: \tmodule.decoder.layers.2.mlp.linear_fc2.weight\n",
"ining-demo/0 [default0]: \tmodule.decoder.layers.1.mlp.linear_fc1.weight\n",
"ining-demo/0 [default0]: \tmodule.decoder.layers.1.self_attention.linear_proj.weight\n",
"ining-demo/0 [default0]: \tmodule.decoder.layers.0.self_attention.linear_qkv.layer_norm_weight\n",
"ining-demo/0 [default0]:[NeMo I 2025-03-07 22:37:13 utils:302] Setting up optimizer with config OptimizerConfig(optimizer='adam', lr=0.0003, min_lr=None, decoupled_lr=None, decoupled_min_lr=None, weight_decay=0.1, fp16=False, bf16=True, params_dtype=torch.bfloat16, loss_scale=None, initial_loss_scale=4294967296, min_loss_scale=1.0, loss_scale_window=1000, hysteresis=2, adam_beta1=0.9, adam_beta2=0.95, adam_eps=1e-05, sgd_momentum=0.9, use_distributed_optimizer=True, overlap_param_gather_with_optimizer_step=False, clip_grad=1.0, log_num_zeros_in_grad=False, barrier_with_L1_time=False, timers=None, config_logger_dir='')\n",
"ining-demo/0 [default7]:LOCAL_RANK: 7 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]\n",
"ining-demo/0 [default6]:LOCAL_RANK: 6 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]\n",
"ining-demo/0 [default3]:LOCAL_RANK: 3 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]\n",
"ining-demo/0 [default0]:LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]\n",
"ining-demo/0 [default0]:\n",
"ining-demo/0 [default0]: | Name | Type | Params | Mode \n",
"ining-demo/0 [default0]:----------------------------------------\n",
"ining-demo/0 [default0]:0 | module | DDP | 54.7 M | train\n",
"ining-demo/0 [default0]:----------------------------------------\n",
"ining-demo/0 [default0]:54.7 M Trainable params\n",
"ining-demo/0 [default0]:0 Non-trainable params\n",
"ining-demo/0 [default0]:54.7 M Total params\n",
"ining-demo/0 [default0]:218.656 Total estimated model params size (MB)\n",
"ining-demo/0 [default0]:91 Modules in train mode\n",
"ining-demo/0 [default0]:0 Modules in eval mode\n",
"ining-demo/0 [default1]:LOCAL_RANK: 1 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]\n",
"ining-demo/0 [default2]:LOCAL_RANK: 2 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]\n",
"ining-demo/0 [default5]:LOCAL_RANK: 5 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]\n",
"ining-demo/0 [default4]:LOCAL_RANK: 4 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]\n",
"ining-demo/0 [default0]:[NeMo W 2025-03-07 22:37:24 rerun_state_machine:1088] Implicit initialization of Rerun State Machine!\n",
"ining-demo/0 [default0]:[NeMo W 2025-03-07 22:37:24 rerun_state_machine:211] RerunStateMachine initialized in mode disabled\n",
"ining-demo/0 [default0]:Training epoch 0, iteration 0/19 | lr: 1.499e-07 | global_batch_size: 512 | global_step: 0 | reduced_train_loss: 11.03 | train_step_timing in s: 10.51\n",
"ining-demo/0 [default0]:Training epoch 0, iteration 1/19 | lr: 2.999e-07 | global_batch_size: 512 | global_step: 1 | reduced_train_loss: 11.03 | train_step_timing in s: 7.667 | consumed_samples: 1024\n",
"ining-demo/0 [default0]:Training epoch 0, iteration 2/19 | lr: 4.498e-07 | global_batch_size: 512 | global_step: 2 | reduced_train_loss: 11.03 | train_step_timing in s: 7.582 | consumed_samples: 1536\n",
"ining-demo/0 [default0]:Training epoch 0, iteration 3/19 | lr: 5.997e-07 | global_batch_size: 512 | global_step: 3 | reduced_train_loss: 11.03 | train_step_timing in s: 7.585 | consumed_samples: 2048\n",
"ining-demo/0 [default0]:Training epoch 0, iteration 4/19 | lr: 7.496e-07 | global_batch_size: 512 | global_step: 4 | reduced_train_loss: 11.03 | train_step_timing in s: 7.588 | consumed_samples: 2560\n",
"ining-demo/0 [default0]:Training epoch 0, iteration 5/19 | lr: 8.996e-07 | global_batch_size: 512 | global_step: 5 | reduced_train_loss: 11.03 | train_step_timing in s: 7.587 | consumed_samples: 3072\n",
"ining-demo/0 [default0]:Training epoch 0, iteration 6/19 | lr: 1.049e-06 | global_batch_size: 512 | global_step: 6 | reduced_train_loss: 11.03 | train_step_timing in s: 7.588 | consumed_samples: 3584\n",
"ining-demo/0 [default0]:Training epoch 0, iteration 7/19 | lr: 1.199e-06 | global_batch_size: 512 | global_step: 7 | reduced_train_loss: 11.03 | train_step_timing in s: 7.584 | consumed_samples: 4096\n",
"ining-demo/0 [default0]:Training epoch 0, iteration 8/19 | lr: 1.349e-06 | global_batch_size: 512 | global_step: 8 | reduced_train_loss: 11.03 | train_step_timing in s: 7.589 | consumed_samples: 4608\n",
"ining-demo/0 [default0]:Training epoch 0, iteration 9/19 | lr: 1.499e-06 | global_batch_size: 512 | global_step: 9 | reduced_train_loss: 11.03 | train_step_timing in s: 7.588 | consumed_samples: 5120\n",
"ining-demo/0 [default0]:[NeMo W 2025-03-07 22:38:34 validation:389] There is difference in the common state dict in different ranks. The differences are {2: ([], [('optimizer', 0, 'optimizer', 'param_groups', 1, 'step')], []), 3: ([], [('optimizer', 0, 'optimizer', 'param_groups', 1, 'step')], []), 4: ([], [('optimizer', 0, 'optimizer', 'param_groups', 1, 'step')], []), 5: ([], [('optimizer', 0, 'optimizer', 'param_groups', 1, 'step')], [])}\n",
"ining-demo/0 [default0]:[NeMo I 2025-03-07 22:38:36 model_checkpoint:497] Scheduled async checkpoint save for /tmp/nemo_run/checkpoints/resiliency-in-pretraining-demo/checkpoints/model_name=0--val_loss=0.00-step=9-consumed_samples=5120.0-last.ckpt\n",
"ining-demo/0 [default0]:Training epoch 0, iteration 10/19 | lr: 1.649e-06 | global_batch_size: 512 | global_step: 10 | reduced_train_loss: 11.03 | train_step_timing in s: 7.589 | consumed_samples: 5632\n",
"ining-demo/0 [default0]:[NeMo I 2025-03-07 22:38:44 model_checkpoint:522] Async checkpoint save for step 10 (/tmp/nemo_run/checkpoints/resiliency-in-pretraining-demo/checkpoints/model_name=0--val_loss=0.00-step=9-consumed_samples=5120.0-last.ckpt) finalized successfully.\n",
"ining-demo/0 [default0]:Training epoch 0, iteration 11/19 | lr: 1.799e-06 | global_batch_size: 512 | global_step: 11 | reduced_train_loss: 11.03 | train_step_timing in s: 7.59 | consumed_samples: 6144\n",
"ining-demo/0 [default0]:Training epoch 0, iteration 12/19 | lr: 1.949e-06 | global_batch_size: 512 | global_step: 12 | reduced_train_loss: 11.03 | train_step_timing in s: 7.59 | consumed_samples: 6656\n",
"ining-demo/0 [default0]:Training epoch 0, iteration 13/19 | lr: 2.099e-06 | global_batch_size: 512 | global_step: 13 | reduced_train_loss: 11.03 | train_step_timing in s: 7.63 | consumed_samples: 7168\n",
"ining-demo/0 [default0]:Training epoch 0, iteration 14/19 | lr: 2.249e-06 | global_batch_size: 512 | global_step: 14 | reduced_train_loss: 11.03 | train_step_timing in s: 7.587 | consumed_samples: 7680\n",
"ining-demo/0 [default0]:Training epoch 0, iteration 15/19 | lr: 2.399e-06 | global_batch_size: 512 | global_step: 15 | reduced_train_loss: 11.03 | train_step_timing in s: 7.589 | consumed_samples: 8192\n",
"ining-demo/0 [default2]:[rank2]: Traceback (most recent call last):\n",
"ining-demo/0 [default2]:[rank2]: File \"/usr/lib/python3.10/runpy.py\", line 196, in _run_module_as_main\n",
"ining-demo/0 [default2]:[rank2]: return _run_code(code, main_globals, None,\n",
"ining-demo/0 [default2]:[rank2]: File \"/usr/lib/python3.10/runpy.py\", line 86, in _run_code\n",
"ining-demo/0 [default2]:[rank2]: exec(code, run_globals)\n",
"ining-demo/0 [default2]:[rank2]: File \"/opt/NeMo-Run/src/nemo_run/core/runners/fdl_runner.py\", line 66, in \n", "# The experiment was run with the following tasks: ['resiliency-in-pretraining-demo'] \n", "# You can inspect and reconstruct this experiment at a later point in time using: \n", "experiment = run.Experiment.from_id(\"resiliency-in-pretraining-demo_1741387007\") \n", "experiment.status() # Gets the overall status \n", "experiment.logs(\"resiliency-in-pretraining-demo\") # Gets the log for the provided task \n", "experiment.cancel(\"resiliency-in-pretraining-demo\") # Cancels the provided task if still running \n", " \n", "\n" ], "text/plain": [ "\u001b[48;2;39;40;34m \u001b[0m\n", "\u001b[38;2;149;144;119;48;2;39;40;34m# The experiment was run with the following tasks: ['resiliency-in-pretraining-demo']\u001b[0m\u001b[48;2;39;40;34m \u001b[0m\n", "\u001b[38;2;149;144;119;48;2;39;40;34m# You can inspect and reconstruct this experiment at a later point in time using:\u001b[0m\u001b[48;2;39;40;34m \u001b[0m\n", "\u001b[38;2;248;248;242;48;2;39;40;34mexperiment\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;255;70;137;48;2;39;40;34m=\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mrun\u001b[0m\u001b[38;2;255;70;137;48;2;39;40;34m.\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mExperiment\u001b[0m\u001b[38;2;255;70;137;48;2;39;40;34m.\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mfrom_id\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m(\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m\"\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34mresiliency-in-pretraining-demo_1741387007\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m\"\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m)\u001b[0m\u001b[48;2;39;40;34m \u001b[0m\n", "\u001b[38;2;248;248;242;48;2;39;40;34mexperiment\u001b[0m\u001b[38;2;255;70;137;48;2;39;40;34m.\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mstatus\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m(\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m)\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;149;144;119;48;2;39;40;34m# Gets the overall status\u001b[0m\u001b[48;2;39;40;34m \u001b[0m\n", "\u001b[38;2;248;248;242;48;2;39;40;34mexperiment\u001b[0m\u001b[38;2;255;70;137;48;2;39;40;34m.\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mlogs\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m(\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m\"\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34mresiliency-in-pretraining-demo\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m\"\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m)\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;149;144;119;48;2;39;40;34m# Gets the log for the provided task\u001b[0m\u001b[48;2;39;40;34m \u001b[0m\n", "\u001b[38;2;248;248;242;48;2;39;40;34mexperiment\u001b[0m\u001b[38;2;255;70;137;48;2;39;40;34m.\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mcancel\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m(\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m\"\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34mresiliency-in-pretraining-demo\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m\"\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m)\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;149;144;119;48;2;39;40;34m# Cancels the provided task if still running\u001b[0m\u001b[48;2;39;40;34m \u001b[0m\n", "\u001b[48;2;39;40;34m \u001b[0m\n" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
\n", "# You can inspect this experiment at a later point in time using the CLI as well: \n", "nemo experiment status resiliency-in-pretraining-demo_1741387007 \n", "nemo experiment logs resiliency-in-pretraining-demo_1741387007 0 \n", "nemo experiment cancel resiliency-in-pretraining-demo_1741387007 0 \n", " \n", "\n" ], "text/plain": [ "\u001b[48;2;39;40;34m \u001b[0m\n", "\u001b[38;2;149;144;119;48;2;39;40;34m# You can inspect this experiment at a later point in time using the CLI as well:\u001b[0m\u001b[48;2;39;40;34m \u001b[0m\n", "\u001b[38;2;248;248;242;48;2;39;40;34mnemo\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mexperiment\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mstatus\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mresiliency-in-pretraining-demo_1741387007\u001b[0m\u001b[48;2;39;40;34m \u001b[0m\n", "\u001b[38;2;248;248;242;48;2;39;40;34mnemo\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mexperiment\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mlogs\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mresiliency-in-pretraining-demo_1741387007\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;174;129;255;48;2;39;40;34m0\u001b[0m\u001b[48;2;39;40;34m \u001b[0m\n", "\u001b[38;2;248;248;242;48;2;39;40;34mnemo\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mexperiment\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mcancel\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mresiliency-in-pretraining-demo_1741387007\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;174;129;255;48;2;39;40;34m0\u001b[0m\u001b[48;2;39;40;34m \u001b[0m\n", "\u001b[48;2;39;40;34m \u001b[0m\n" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Enable a crash simulation callback\n", "pretrain.trainer.callbacks.append(run.Config(CrashSimulationCallback, crash_step=17))\n", "\n", "# run the experiment\n", "run_experiment(exp_name, pretrain, executor, run_plugins, dryrun=False)" ] }, { "cell_type": "markdown", "id": "947346fd-d62f-42e6-848b-b0e1ea2bf0e8", "metadata": {}, "source": [ "## 2.3 Cleanup" ] }, { "cell_type": "code", "execution_count": 13, "id": "452b9478-25bc-491d-a9a6-71fd698cac29", "metadata": {}, "outputs": [], "source": [ "%%bash\n", "# delete old checkpoints\n", "rm -rf /tmp/nemo_run/checkpoints/" ] }, { "cell_type": "code", "execution_count": 14, "id": "af2f22b1-8cc7-4281-a2f5-e3afb603e44b", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[
────── Entering Experiment resiliency-in-pretraining-demo with id: resiliency-in-pretraining-demo_1741387286 ──────\n", "\n" ], "text/plain": [ "\u001b[92m────── \u001b[0m\u001b[1;35mEntering Experiment resiliency-in-pretraining-demo with id: resiliency-in-pretraining-demo_1741387286\u001b[0m\u001b[92m ──────\u001b[0m\n" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "Launching training run ...\n" ] }, { "data": { "text/html": [ "
[22:41:26] Cannot detach from this experiment. Please keep it running until completion. experiment.py:651\n", "\n" ], "text/plain": [ "\u001b[2;36m[22:41:26]\u001b[0m\u001b[2;36m \u001b[0m\u001b[1;31m Cannot detach from this experiment. Please keep it running until completion.\u001b[0m \u001b]8;id=101587;file:///opt/NeMo-Run/src/nemo_run/run/experiment.py\u001b\\\u001b[2mexperiment.py\u001b[0m\u001b]8;;\u001b\\\u001b[2m:\u001b[0m\u001b]8;id=506023;file:///opt/NeMo-Run/src/nemo_run/run/experiment.py#651\u001b\\\u001b[2m651\u001b[0m\u001b]8;;\u001b\\\n" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stderr", "output_type": "stream", "text": [ "Log directory is: /root/.nemo_run/experiments/resiliency-in-pretraining-demo/resiliency-in-pretraining-demo_1741387286/resiliency-in-pretraining-demo\n" ] }, { "data": { "text/html": [ "
Launching job resiliency-in-pretraining-demo for experiment experiment.py:724\n", " resiliency-in-pretraining-demo \n", "\n" ], "text/plain": [ "\u001b[2;36m \u001b[0m\u001b[2;36m \u001b[0m\u001b[1;36mLaunching job resiliency-in-pretraining-demo for experiment \u001b[0m \u001b]8;id=230211;file:///opt/NeMo-Run/src/nemo_run/run/experiment.py\u001b\\\u001b[2mexperiment.py\u001b[0m\u001b]8;;\u001b\\\u001b[2m:\u001b[0m\u001b]8;id=581800;file:///opt/NeMo-Run/src/nemo_run/run/experiment.py#724\u001b\\\u001b[2m724\u001b[0m\u001b]8;;\u001b\\\n", "\u001b[2;36m \u001b[0m\u001b[1;36mresiliency-in-pretraining-demo\u001b[0m \u001b[2m \u001b[0m\n" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stderr", "output_type": "stream", "text": [ "Log directory is: /root/.nemo_run/experiments/resiliency-in-pretraining-demo/resiliency-in-pretraining-demo_1741387286/resiliency-in-pretraining-demo\n", "Launched app: local_persistent://nemo_run/resiliency-in-pretraining-demo-nr166h9790700\n", "AppStatus:\n", " State: RUNNING\n", " Num Restarts: 0\n", " Roles: \n", " Msg:
─────────────────── Waiting for Experiment resiliency-in-pretraining-demo_1741387286 to finish ────────────────────\n", "\n" ], "text/plain": [ "\u001b[92m─────────────────── \u001b[0m\u001b[1;35mWaiting for Experiment resiliency-in-pretraining-demo_1741387286 to finish\u001b[0m\u001b[92m ────────────────────\u001b[0m\n" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
\n",
"\n"
],
"text/plain": [
"\n"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"Experiment Status for resiliency-in-pretraining-demo_1741387286\n", "\n" ], "text/plain": [ "\u001b[1;32mExperiment Status for\u001b[0m \u001b[1;38;5;214mresiliency-in-pretraining-demo_1741387286\u001b[0m\n" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
\n",
"Task 0: resiliency-in-pretraining-demo\n",
"- Status: RUNNING\n",
"- Executor: LocalExecutor\n",
"- Job id: resiliency-in-pretraining-demo-nr166h9790700\n",
"- Local Directory: /root/.nemo_run/experiments/resiliency-in-pretraining-demo/resiliency-in-pretraining-demo_1741387286/resiliency-in-pretraining-demo\n",
"\n"
],
"text/plain": [
"\n",
"\u001b[1;32mTask 0\u001b[0m: \u001b[1;38;5;214mresiliency-in-pretraining-demo\u001b[0m\n",
"- \u001b[1;32mStatus\u001b[0m: RUNNING\n",
"- \u001b[1;32mExecutor\u001b[0m: LocalExecutor\n",
"- \u001b[1;32mJob id\u001b[0m: resiliency-in-pretraining-demo-nr166h9790700\n",
"- \u001b[1;32mLocal Directory\u001b[0m: /root/.nemo_run/experiments/resiliency-in-pretraining-demo/resiliency-in-pretraining-demo_1741387286/resiliency-in-pretraining-demo\n"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"\n",
"\n"
],
"text/plain": [
"\n"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"Waiting for job resiliency-in-pretraining-demo-nr166h9790700 to finish [log=True]...\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"ining-demo/0 W0307 22:41:27.659000 192252 torch/distributed/run.py:793] \n",
"ining-demo/0 W0307 22:41:27.659000 192252 torch/distributed/run.py:793] *****************************************\n",
"ining-demo/0 W0307 22:41:27.659000 192252 torch/distributed/run.py:793] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. \n",
"ining-demo/0 W0307 22:41:27.659000 192252 torch/distributed/run.py:793] *****************************************\n",
"ining-demo/0 I0307 22:41:27.659000 192252 torch/distributed/launcher/api.py:194] Starting elastic_operator with launch configs:\n",
"ining-demo/0 I0307 22:41:27.659000 192252 torch/distributed/launcher/api.py:194] entrypoint : nemo_run.core.runners.fdl_runner\n",
"ining-demo/0 I0307 22:41:27.659000 192252 torch/distributed/launcher/api.py:194] min_nodes : 1\n",
"ining-demo/0 I0307 22:41:27.659000 192252 torch/distributed/launcher/api.py:194] max_nodes : 1\n",
"ining-demo/0 I0307 22:41:27.659000 192252 torch/distributed/launcher/api.py:194] nproc_per_node : 8\n",
"ining-demo/0 I0307 22:41:27.659000 192252 torch/distributed/launcher/api.py:194] run_id : 2769\n",
"ining-demo/0 I0307 22:41:27.659000 192252 torch/distributed/launcher/api.py:194] rdzv_backend : c10d\n",
"ining-demo/0 I0307 22:41:27.659000 192252 torch/distributed/launcher/api.py:194] rdzv_endpoint : localhost:0\n",
"ining-demo/0 I0307 22:41:27.659000 192252 torch/distributed/launcher/api.py:194] rdzv_configs : {'timeout': 900}\n",
"ining-demo/0 I0307 22:41:27.659000 192252 torch/distributed/launcher/api.py:194] max_restarts : 0\n",
"ining-demo/0 I0307 22:41:27.659000 192252 torch/distributed/launcher/api.py:194] monitor_interval : 0.1\n",
"ining-demo/0 I0307 22:41:27.659000 192252 torch/distributed/launcher/api.py:194] log_dir : /root/.nemo_run/experiments/resiliency-in-pretraining-demo/resiliency-in-pretraining-demo_1741387286/resiliency-in-pretraining-demo/nemo_run/resiliency-in-pretraining-demo-nr166h9790700/torchelastic/resiliency-in-pretraining-demo\n",
"ining-demo/0 I0307 22:41:27.659000 192252 torch/distributed/launcher/api.py:194] metrics_cfg : {}\n",
"ining-demo/0 I0307 22:41:27.659000 192252 torch/distributed/launcher/api.py:194] \n",
"ining-demo/0 I0307 22:41:27.663000 192252 torch/distributed/elastic/agent/server/api.py:845] [default] starting workers for entrypoint: python\n",
"ining-demo/0 I0307 22:41:27.663000 192252 torch/distributed/elastic/agent/server/api.py:662] [default] Rendezvous'ing worker group\n",
"ining-demo/0 I0307 22:41:27.733000 192252 torch/distributed/elastic/agent/server/api.py:525] [default] Rendezvous complete for workers. Result:\n",
"ining-demo/0 I0307 22:41:27.733000 192252 torch/distributed/elastic/agent/server/api.py:525] restart_count=0\n",
"ining-demo/0 I0307 22:41:27.733000 192252 torch/distributed/elastic/agent/server/api.py:525] master_addr=localhost\n",
"ining-demo/0 I0307 22:41:27.733000 192252 torch/distributed/elastic/agent/server/api.py:525] master_port=46031\n",
"ining-demo/0 I0307 22:41:27.733000 192252 torch/distributed/elastic/agent/server/api.py:525] group_rank=0\n",
"ining-demo/0 I0307 22:41:27.733000 192252 torch/distributed/elastic/agent/server/api.py:525] group_world_size=1\n",
"ining-demo/0 I0307 22:41:27.733000 192252 torch/distributed/elastic/agent/server/api.py:525] local_ranks=[0, 1, 2, 3, 4, 5, 6, 7]\n",
"ining-demo/0 I0307 22:41:27.733000 192252 torch/distributed/elastic/agent/server/api.py:525] role_ranks=[0, 1, 2, 3, 4, 5, 6, 7]\n",
"ining-demo/0 I0307 22:41:27.733000 192252 torch/distributed/elastic/agent/server/api.py:525] global_ranks=[0, 1, 2, 3, 4, 5, 6, 7]\n",
"ining-demo/0 I0307 22:41:27.733000 192252 torch/distributed/elastic/agent/server/api.py:525] role_world_sizes=[8, 8, 8, 8, 8, 8, 8, 8]\n",
"ining-demo/0 I0307 22:41:27.733000 192252 torch/distributed/elastic/agent/server/api.py:525] global_world_sizes=[8, 8, 8, 8, 8, 8, 8, 8]\n",
"ining-demo/0 I0307 22:41:27.733000 192252 torch/distributed/elastic/agent/server/api.py:525] \n",
"ining-demo/0 I0307 22:41:27.733000 192252 torch/distributed/elastic/agent/server/api.py:670] [default] Starting worker group\n",
"ining-demo/0 I0307 22:41:27.733000 192252 torch/distributed/elastic/agent/server/local_elastic_agent.py:291] use_agent_store: True\n",
"ining-demo/0 I0307 22:41:27.733000 192252 torch/distributed/elastic/agent/server/local_elastic_agent.py:192] Environment variable 'TORCHELASTIC_ENABLE_FILE_TIMER' not found. Do not start FileTimerServer.\n",
"ining-demo/0 I0307 22:41:27.733000 192252 torch/distributed/elastic/agent/server/local_elastic_agent.py:229] Environment variable 'TORCHELASTIC_HEALTH_CHECK_PORT' not found. Do not start health check.\n",
"ining-demo/0 [default0]:[NeMo W 2025-03-07 22:41:37 nemo_logging:361] /usr/local/lib/python3.10/dist-packages/pyannote/core/notebook.py:134: MatplotlibDeprecationWarning: The get_cmap function was deprecated in Matplotlib 3.7 and will be removed in 3.11. Use ``matplotlib.colormaps[name]`` or ``matplotlib.colormaps.get_cmap()`` or ``pyplot.get_cmap()`` instead.\n",
"ining-demo/0 [default0]: cm = get_cmap(\"Set1\")\n",
"ining-demo/0 [default0]: \n",
"ining-demo/0 [default0]:[NeMo I 2025-03-07 22:41:38 tokenizer_utils:224] Getting Megatron tokenizer for pretrained model name: megatron-gpt-345m, custom vocab file: None, and merges file: None\n",
"ining-demo/0 [default0]:[NeMo I 2025-03-07 22:41:38 tokenizer_utils:130] Getting HuggingFace AutoTokenizer with pretrained_model_name: gpt2, vocab_file: /root/.cache/torch/megatron/megatron-gpt-345m_vocab, merges_files: /root/.cache/torch/megatron/megatron-gpt-345m_merges, special_tokens_dict: {}, and use_fast: False\n",
"ining-demo/0 [default1]:Initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/8\n",
"ining-demo/0 [default0]:[NeMo I 2025-03-07 22:41:38 nemo_logger:145] Experiments will be logged at /tmp/nemo_run/checkpoints/resiliency-in-pretraining-demo\n",
"ining-demo/0 [default0]:[NeMo I 2025-03-07 22:41:38 megatron_strategy:315] Fixing mis-match between ddp-config & mcore-optimizer config\n",
"ining-demo/0 [default0]:GPU available: True (cuda), used: True\n",
"ining-demo/0 [default0]:TPU available: False, using: 0 TPU cores\n",
"ining-demo/0 [default0]:HPU available: False, using: 0 HPUs\n",
"ining-demo/0 [default0]:[NeMo W 2025-03-07 22:41:38 nemo_logger:123] No version folders would be created under the log folder as 'resume_if_exists' is enabled.\n",
"ining-demo/0 [default0]:[NeMo W 2025-03-07 22:41:38 nemo_logger:173] \"update_logger_directory\" is True. Overwriting tensorboard logger \"save_dir\" to /tmp/nemo_run/checkpoints/tb_logs\n",
"ining-demo/0 [default0]:[NeMo W 2025-03-07 22:41:38 nemo_logger:189] The Trainer already contains a ModelCheckpoint callback. This will be overwritten.\n",
"ining-demo/0 [default0]:[NeMo W 2025-03-07 22:41:38 nemo_logger:212] The checkpoint callback was told to monitor a validation value and trainer's max_steps was set to 20. Please ensure that max_steps will run for at least 1 epochs to ensure that checkpointing will not error out.\n",
"ining-demo/0 [default0]:[NeMo W 2025-03-07 22:41:38 resume:228] There were no checkpoints found in checkpoint_dir or no checkpoint folder at checkpoint_dir :/tmp/nemo_run/checkpoints/resiliency-in-pretraining-demo/checkpoints. Training from scratch.\n",
"ining-demo/0 [default0]:[NeMo I 2025-03-07 22:41:39 megatron_init:426] Rank 0 has data parallel group : [0, 2, 4, 6]\n",
"ining-demo/0 [default0]:[NeMo I 2025-03-07 22:41:39 megatron_init:432] Rank 0 has combined group of data parallel and context parallel : [0, 2, 4, 6]\n",
"ining-demo/0 [default0]:[NeMo I 2025-03-07 22:41:39 megatron_init:437] All data parallel group ranks with context parallel combined: [[0, 2, 4, 6], [1, 3, 5, 7]]\n",
"ining-demo/0 [default0]:[NeMo I 2025-03-07 22:41:39 megatron_init:440] Ranks 0 has data parallel rank: 0\n",
"ining-demo/0 [default0]:[NeMo I 2025-03-07 22:41:39 megatron_init:448] Rank 0 has context parallel group: [0]\n",
"ining-demo/0 [default0]:[NeMo I 2025-03-07 22:41:39 megatron_init:451] All context parallel group ranks: [[0], [1], [2], [3], [4], [5], [6], [7]]\n",
"ining-demo/0 [default0]:[NeMo I 2025-03-07 22:41:39 megatron_init:452] Ranks 0 has context parallel rank: 0\n",
"ining-demo/0 [default0]:[NeMo I 2025-03-07 22:41:39 megatron_init:459] Rank 0 has model parallel group: [0, 1]\n",
"ining-demo/0 [default0]:[NeMo I 2025-03-07 22:41:39 megatron_init:460] All model parallel group ranks: [[0, 1], [2, 3], [4, 5], [6, 7]]\n",
"ining-demo/0 [default0]:[NeMo I 2025-03-07 22:41:39 megatron_init:469] Rank 0 has tensor model parallel group: [0, 1]\n",
"ining-demo/0 [default0]:[NeMo I 2025-03-07 22:41:39 megatron_init:473] All tensor model parallel group ranks: [[0, 1], [2, 3], [4, 5], [6, 7]]\n",
"ining-demo/0 [default0]:[NeMo I 2025-03-07 22:41:39 megatron_init:474] Rank 0 has tensor model parallel rank: 0\n",
"ining-demo/0 [default0]:[NeMo I 2025-03-07 22:41:39 megatron_init:494] Rank 0 has pipeline model parallel group: [0]\n",
"ining-demo/0 [default0]:[NeMo I 2025-03-07 22:41:39 megatron_init:506] Rank 0 has embedding group: [0]\n",
"ining-demo/0 [default0]:[NeMo I 2025-03-07 22:41:39 megatron_init:512] All pipeline model parallel group ranks: [[0], [1], [2], [3], [4], [5], [6], [7]]\n",
"ining-demo/0 [default0]:[NeMo I 2025-03-07 22:41:39 megatron_init:513] Rank 0 has pipeline model parallel rank 0\n",
"ining-demo/0 [default0]:[NeMo I 2025-03-07 22:41:39 megatron_init:514] All embedding group ranks: [[0], [1], [2], [3], [4], [5], [6], [7]]\n",
"ining-demo/0 [default0]:[NeMo I 2025-03-07 22:41:39 megatron_init:515] Rank 0 has embedding rank: 0\n",
"ining-demo/0 [default0]:Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/8\n",
"ining-demo/0 [default0]:----------------------------------------------------------------------------------------------------\n",
"ining-demo/0 [default0]:distributed_backend=nccl\n",
"ining-demo/0 [default0]:All distributed processes registered. Starting with 8 processes\n",
"ining-demo/0 [default0]:----------------------------------------------------------------------------------------------------\n",
"ining-demo/0 [default0]:\n",
"ining-demo/0 [default2]:Initializing distributed: GLOBAL_RANK: 2, MEMBER: 3/8\n",
"ining-demo/0 [default3]:Initializing distributed: GLOBAL_RANK: 3, MEMBER: 4/8\n",
"ining-demo/0 [default4]:Initializing distributed: GLOBAL_RANK: 4, MEMBER: 5/8\n",
"ining-demo/0 [default5]:Initializing distributed: GLOBAL_RANK: 5, MEMBER: 6/8\n",
"ining-demo/0 [default7]:Initializing distributed: GLOBAL_RANK: 7, MEMBER: 8/8\n",
"ining-demo/0 [default6]:Initializing distributed: GLOBAL_RANK: 6, MEMBER: 7/8\n",
"ining-demo/0 [default0]:[NeMo I 2025-03-07 22:41:41 base:44] Padded vocab_size: 50432, original vocab_size: 50257, dummy tokens: 175.\n",
"ining-demo/0 [default0]:[NeMo I 2025-03-07 22:41:41 megatron_strategy:327] Copying Trainer's 'max_steps' (20) to LR scheduler's 'max_steps'.\n",
"ining-demo/0 [default0]:[NeMo I 2025-03-07 22:41:41 num_microbatches_calculator:228] setting number of microbatches to constant 128\n",
"ining-demo/0 [default0]:[NeMo I 2025-03-07 22:41:41 megatron_parallel:549] > number of parameters on (tensor, pipeline) model parallel rank (0, 0): 54663936\n",
"ining-demo/0 [default0]:[NeMo I 2025-03-07 22:41:41 utils:302] Setting up DistributedDataParallel with config DistributedDataParallelConfig(grad_reduce_in_fp32=True, overlap_grad_reduce=True, overlap_param_gather=True, align_param_gather=False, use_distributed_optimizer=True, num_distributed_optimizer_instances=1, check_for_nan_in_grad=True, bucket_size=40000000, average_in_collective=True, fp8_param_gather=False)\n",
"ining-demo/0 [default0]:[NeMo I 2025-03-07 22:41:41 utils:323] Number of buckets for gradient all-reduce / reduce-scatter: 1\n",
"ining-demo/0 [default0]: Params for bucket 1 (54663936 elements):\n",
"ining-demo/0 [default0]: \tmodule.decoder.layers.1.self_attention.linear_qkv.weight\n",
"ining-demo/0 [default0]: \tmodule.decoder.layers.1.self_attention.linear_qkv.layer_norm_weight\n",
"ining-demo/0 [default0]: \tmodule.decoder.layers.3.mlp.linear_fc2.weight\n",
"ining-demo/0 [default0]: \tmodule.decoder.layers.2.mlp.linear_fc1.weight\n",
"ining-demo/0 [default0]: \tmodule.decoder.layers.1.self_attention.linear_proj.weight\n",
"ining-demo/0 [default0]: \tmodule.decoder.layers.0.mlp.linear_fc1.layer_norm_weight\n",
"ining-demo/0 [default0]: \tmodule.decoder.layers.0.self_attention.linear_proj.weight\n",
"ining-demo/0 [default0]: \tmodule.decoder.layers.2.self_attention.linear_qkv.weight\n",
"ining-demo/0 [default0]: \tmodule.decoder.layers.0.mlp.linear_fc2.weight\n",
"ining-demo/0 [default0]: \tmodule.decoder.layers.3.mlp.linear_fc1.weight\n",
"ining-demo/0 [default0]: \tmodule.decoder.layers.2.self_attention.linear_qkv.layer_norm_weight\n",
"ining-demo/0 [default0]: \tmodule.decoder.layers.1.mlp.linear_fc1.layer_norm_weight\n",
"ining-demo/0 [default0]: \tmodule.decoder.final_layernorm.weight\n",
"ining-demo/0 [default0]: \tmodule.decoder.layers.3.self_attention.linear_qkv.weight\n",
"ining-demo/0 [default0]: \tmodule.decoder.layers.2.self_attention.linear_proj.weight\n",
"ining-demo/0 [default0]: \tmodule.embedding.word_embeddings.weight\n",
"ining-demo/0 [default0]: \tmodule.decoder.layers.3.self_attention.linear_qkv.layer_norm_weight\n",
"ining-demo/0 [default0]: \tmodule.decoder.layers.2.mlp.linear_fc1.layer_norm_weight\n",
"ining-demo/0 [default0]: \tmodule.decoder.layers.1.mlp.linear_fc2.weight\n",
"ining-demo/0 [default0]: \tmodule.decoder.layers.0.mlp.linear_fc1.weight\n",
"ining-demo/0 [default0]: \tmodule.decoder.layers.3.self_attention.linear_proj.weight\n",
"ining-demo/0 [default0]: \tmodule.decoder.layers.0.self_attention.linear_qkv.weight\n",
"ining-demo/0 [default0]: \tmodule.decoder.layers.0.self_attention.linear_qkv.layer_norm_weight\n",
"ining-demo/0 [default0]: \tmodule.decoder.layers.1.mlp.linear_fc1.weight\n",
"ining-demo/0 [default0]: \tmodule.output_layer.weight\n",
"ining-demo/0 [default0]: \tmodule.decoder.layers.3.mlp.linear_fc1.layer_norm_weight\n",
"ining-demo/0 [default0]: \tmodule.decoder.layers.2.mlp.linear_fc2.weight\n",
"ining-demo/0 [default0]:[NeMo I 2025-03-07 22:41:41 utils:302] Setting up optimizer with config OptimizerConfig(optimizer='adam', lr=0.0003, min_lr=None, decoupled_lr=None, decoupled_min_lr=None, weight_decay=0.1, fp16=False, bf16=True, params_dtype=torch.bfloat16, loss_scale=None, initial_loss_scale=4294967296, min_loss_scale=1.0, loss_scale_window=1000, hysteresis=2, adam_beta1=0.9, adam_beta2=0.95, adam_eps=1e-05, sgd_momentum=0.9, use_distributed_optimizer=True, overlap_param_gather_with_optimizer_step=False, clip_grad=1.0, log_num_zeros_in_grad=False, barrier_with_L1_time=False, timers=None, config_logger_dir='')\n",
"ining-demo/0 [default2]:LOCAL_RANK: 2 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]\n",
"ining-demo/0 [default3]:LOCAL_RANK: 3 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]\n",
"ining-demo/0 [default1]:LOCAL_RANK: 1 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]\n",
"ining-demo/0 [default5]:LOCAL_RANK: 5 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]\n",
"ining-demo/0 [default7]:LOCAL_RANK: 7 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]\n",
"ining-demo/0 [default6]:LOCAL_RANK: 6 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]\n",
"ining-demo/0 [default4]:LOCAL_RANK: 4 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]\n",
"ining-demo/0 [default0]:LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]\n",
"ining-demo/0 [default0]:\n",
"ining-demo/0 [default0]: | Name | Type | Params | Mode \n",
"ining-demo/0 [default0]:----------------------------------------\n",
"ining-demo/0 [default0]:0 | module | DDP | 54.7 M | train\n",
"ining-demo/0 [default0]:----------------------------------------\n",
"ining-demo/0 [default0]:54.7 M Trainable params\n",
"ining-demo/0 [default0]:0 Non-trainable params\n",
"ining-demo/0 [default0]:54.7 M Total params\n",
"ining-demo/0 [default0]:218.656 Total estimated model params size (MB)\n",
"ining-demo/0 [default0]:91 Modules in train mode\n",
"ining-demo/0 [default0]:0 Modules in eval mode\n",
"ining-demo/0 [default0]:[NeMo W 2025-03-07 22:41:52 rerun_state_machine:1088] Implicit initialization of Rerun State Machine!\n",
"ining-demo/0 [default0]:[NeMo W 2025-03-07 22:41:52 rerun_state_machine:211] RerunStateMachine initialized in mode disabled\n",
"ining-demo/0 [default0]:Training epoch 0, iteration 0/19 | lr: 1.499e-07 | global_batch_size: 512 | global_step: 0 | reduced_train_loss: 11.03 | train_step_timing in s: 10.4\n",
"ining-demo/0 [default0]:Training epoch 0, iteration 1/19 | lr: 2.999e-07 | global_batch_size: 512 | global_step: 1 | reduced_train_loss: 11.03 | train_step_timing in s: 7.591 | consumed_samples: 1024\n",
"ining-demo/0 [default0]:Training epoch 0, iteration 2/19 | lr: 4.498e-07 | global_batch_size: 512 | global_step: 2 | reduced_train_loss: 11.03 | train_step_timing in s: 7.593 | consumed_samples: 1536\n",
"ining-demo/0 [default0]:Training epoch 0, iteration 3/19 | lr: 5.997e-07 | global_batch_size: 512 | global_step: 3 | reduced_train_loss: 11.03 | train_step_timing in s: 7.59 | consumed_samples: 2048\n",
"ining-demo/0 [default0]:Training epoch 0, iteration 4/19 | lr: 7.496e-07 | global_batch_size: 512 | global_step: 4 | reduced_train_loss: 11.03 | train_step_timing in s: 7.595 | consumed_samples: 2560\n",
"ining-demo/0 [default0]:Training epoch 0, iteration 5/19 | lr: 8.996e-07 | global_batch_size: 512 | global_step: 5 | reduced_train_loss: 11.03 | train_step_timing in s: 7.599 | consumed_samples: 3072\n",
"ining-demo/0 [default0]:Training epoch 0, iteration 6/19 | lr: 1.049e-06 | global_batch_size: 512 | global_step: 6 | reduced_train_loss: 11.03 | train_step_timing in s: 7.605 | consumed_samples: 3584\n",
"ining-demo/0 [default0]:Training epoch 0, iteration 7/19 | lr: 1.199e-06 | global_batch_size: 512 | global_step: 7 | reduced_train_loss: 11.03 | train_step_timing in s: 7.61 | consumed_samples: 4096\n",
"ining-demo/0 [default0]:Training epoch 0, iteration 8/19 | lr: 1.349e-06 | global_batch_size: 512 | global_step: 8 | reduced_train_loss: 11.03 | train_step_timing in s: 7.608 | consumed_samples: 4608\n",
"ining-demo/0 [default0]:Training epoch 0, iteration 9/19 | lr: 1.499e-06 | global_batch_size: 512 | global_step: 9 | reduced_train_loss: 11.03 | train_step_timing in s: 7.6 | consumed_samples: 5120\n",
"ining-demo/0 [default0]:[NeMo W 2025-03-07 22:43:01 validation:389] There is difference in the common state dict in different ranks. The differences are {2: ([], [('optimizer', 0, 'optimizer', 'param_groups', 1, 'step')], []), 3: ([], [('optimizer', 0, 'optimizer', 'param_groups', 1, 'step')], []), 4: ([], [('optimizer', 0, 'optimizer', 'param_groups', 1, 'step')], []), 5: ([], [('optimizer', 0, 'optimizer', 'param_groups', 1, 'step')], [])}\n",
"ining-demo/0 [default0]:[NeMo I 2025-03-07 22:43:04 model_checkpoint:497] Scheduled async checkpoint save for /tmp/nemo_run/checkpoints/resiliency-in-pretraining-demo/checkpoints/model_name=0--val_loss=0.00-step=9-consumed_samples=5120.0-last.ckpt\n",
"ining-demo/0 [default0]:Training epoch 0, iteration 10/19 | lr: 1.649e-06 | global_batch_size: 512 | global_step: 10 | reduced_train_loss: 11.03 | train_step_timing in s: 7.599 | consumed_samples: 5632\n",
"ining-demo/0 [default0]:[NeMo I 2025-03-07 22:43:12 model_checkpoint:522] Async checkpoint save for step 10 (/tmp/nemo_run/checkpoints/resiliency-in-pretraining-demo/checkpoints/model_name=0--val_loss=0.00-step=9-consumed_samples=5120.0-last.ckpt) finalized successfully.\n",
"ining-demo/0 [default0]:Training epoch 0, iteration 11/19 | lr: 1.799e-06 | global_batch_size: 512 | global_step: 11 | reduced_train_loss: 11.03 | train_step_timing in s: 7.6 | consumed_samples: 6144\n",
"ining-demo/0 [default0]:Training epoch 0, iteration 12/19 | lr: 1.949e-06 | global_batch_size: 512 | global_step: 12 | reduced_train_loss: 11.03 | train_step_timing in s: 7.6 | consumed_samples: 6656\n",
"ining-demo/0 [default0]:Training epoch 0, iteration 13/19 | lr: 2.099e-06 | global_batch_size: 512 | global_step: 13 | reduced_train_loss: 11.03 | train_step_timing in s: 7.604 | consumed_samples: 7168\n",
"ining-demo/0 [default0]:Training epoch 0, iteration 14/19 | lr: 2.249e-06 | global_batch_size: 512 | global_step: 14 | reduced_train_loss: 11.03 | train_step_timing in s: 7.605 | consumed_samples: 7680\n",
"ining-demo/0 [default0]:Training epoch 0, iteration 15/19 | lr: 2.399e-06 | global_batch_size: 512 | global_step: 15 | reduced_train_loss: 11.03 | train_step_timing in s: 7.611 | consumed_samples: 8192\n",
"ining-demo/0 [default0]:[NeMo I 2025-03-07 22:43:58 straggler_det_callback:144] \n",
"ining-demo/0 [default0]: GPU relative performance:\n",
"ining-demo/0 [default0]: Rank=5 Node=4809917ea058 Score=0.97\n",
"ining-demo/0 [default0]: Rank=4 Node=4809917ea058 Score=0.97\n",
"ining-demo/0 [default0]: Rank=1 Node=4809917ea058 Score=0.98\n",
"ining-demo/0 [default0]: Rank=0 Node=4809917ea058 Score=0.98\n",
"ining-demo/0 [default0]: Rank=6 Node=4809917ea058 Score=0.98\n",
"ining-demo/0 [default0]: Rank=2 Node=4809917ea058 Score=0.99\n",
"ining-demo/0 [default0]: Rank=7 Node=4809917ea058 Score=0.99\n",
"ining-demo/0 [default0]: Rank=3 Node=4809917ea058 Score=0.99\n",
"ining-demo/0 [default0]: \n",
"ining-demo/0 [default0]:[NeMo I 2025-03-07 22:43:58 straggler_det_callback:153] \n",
"ining-demo/0 [default0]: GPU individual performance:\n",
"ining-demo/0 [default0]: Rank=0 Node=4809917ea058 Score=1.00\n",
"ining-demo/0 [default0]: Rank=1 Node=4809917ea058 Score=1.00\n",
"ining-demo/0 [default0]: Rank=2 Node=4809917ea058 Score=1.00\n",
"ining-demo/0 [default0]: Rank=3 Node=4809917ea058 Score=1.00\n",
"ining-demo/0 [default0]: Rank=4 Node=4809917ea058 Score=1.00\n",
"ining-demo/0 [default0]: Rank=5 Node=4809917ea058 Score=1.00\n",
"ining-demo/0 [default0]: Rank=6 Node=4809917ea058 Score=1.00\n",
"ining-demo/0 [default0]: Rank=7 Node=4809917ea058 Score=1.00\n",
"ining-demo/0 [default0]: \n",
"ining-demo/0 [default0]:[NeMo W 2025-03-07 22:43:58 straggler_det_callback:106] STRAGGLER DETECTION WARNING: Some GPUs have worse relative performance. Affected ranks: {StragglerId(rank=5, node='4809917ea058'), StragglerId(rank=6, node='4809917ea058'), StragglerId(rank=1, node='4809917ea058'), StragglerId(rank=4, node='4809917ea058'), StragglerId(rank=0, node='4809917ea058'), StragglerId(rank=2, node='4809917ea058')}\n",
"ining-demo/0 [default0]:[NeMo E 2025-03-07 22:43:58 straggler_det_callback:239] Detected stragglers. Terminating training...\n",
"ining-demo/0 [default0]:[NeMo I 2025-03-07 22:43:58 model_checkpoint:497] Scheduled async checkpoint save for /tmp/nemo_run/checkpoints/resiliency-in-pretraining-demo/checkpoints/model_name=0--val_loss=0.00-step=16-consumed_samples=8704.0-last.ckpt\n",
"ining-demo/0 [default0]:[NeMo I 2025-03-07 22:43:58 straggler_det_callback:245] Async checkpointing detected, waiting for it to complete...\n",
"ining-demo/0 [default0]:[NeMo W 2025-03-07 22:43:58 validation:389] There is difference in the common state dict in different ranks. The differences are {2: ([], [('optimizer', 0, 'optimizer', 'param_groups', 1, 'step')], []), 3: ([], [('optimizer', 0, 'optimizer', 'param_groups', 1, 'step')], []), 4: ([], [('optimizer', 0, 'optimizer', 'param_groups', 1, 'step')], []), 5: ([], [('optimizer', 0, 'optimizer', 'param_groups', 1, 'step')], [])}\n",
"ining-demo/0 [default0]:[NeMo I 2025-03-07 22:43:58 model_checkpoint:522] Async checkpoint save for step 17 (/tmp/nemo_run/checkpoints/resiliency-in-pretraining-demo/checkpoints/model_name=0--val_loss=0.00-step=16-consumed_samples=8704.0-last.ckpt) finalized successfully.\n",
"ining-demo/0 W0307 22:44:03.158000 192252 torch/distributed/elastic/multiprocessing/api.py:890] Sending process 192323 closing signal SIGTERM\n",
"ining-demo/0 W0307 22:44:03.163000 192252 torch/distributed/elastic/multiprocessing/api.py:890] Sending process 192324 closing signal SIGTERM\n",
"ining-demo/0 W0307 22:44:03.170000 192252 torch/distributed/elastic/multiprocessing/api.py:890] Sending process 192325 closing signal SIGTERM\n",
"ining-demo/0 W0307 22:44:03.176000 192252 torch/distributed/elastic/multiprocessing/api.py:890] Sending process 192326 closing signal SIGTERM\n",
"ining-demo/0 W0307 22:44:03.176000 192252 torch/distributed/elastic/multiprocessing/api.py:890] Sending process 192327 closing signal SIGTERM\n",
"ining-demo/0 W0307 22:44:03.178000 192252 torch/distributed/elastic/multiprocessing/api.py:890] Sending process 192328 closing signal SIGTERM\n",
"ining-demo/0 W0307 22:44:03.179000 192252 torch/distributed/elastic/multiprocessing/api.py:890] Sending process 192329 closing signal SIGTERM\n",
"ining-demo/0 E0307 22:44:04.918000 192252 torch/distributed/elastic/multiprocessing/api.py:862] failed (exitcode: 1) local_rank: 0 (pid: 192322) of binary: /usr/bin/python\n",
"ining-demo/0 I0307 22:44:04.924000 192252 torch/distributed/elastic/multiprocessing/errors/__init__.py:368] ('local_rank %s FAILED with no error file. Decorate your entrypoint fn with @record for traceback info. See: https://pytorch.org/docs/stable/elastic/errors.html', 0)\n",
"ining-demo/0 Traceback (most recent call last):\n",
"ining-demo/0 File \"/usr/local/bin/torchrun\", line 33, in \n", "# The experiment was run with the following tasks: ['resiliency-in-pretraining-demo'] \n", "# You can inspect and reconstruct this experiment at a later point in time using: \n", "experiment = run.Experiment.from_id(\"resiliency-in-pretraining-demo_1741387286\") \n", "experiment.status() # Gets the overall status \n", "experiment.logs(\"resiliency-in-pretraining-demo\") # Gets the log for the provided task \n", "experiment.cancel(\"resiliency-in-pretraining-demo\") # Cancels the provided task if still running \n", " \n", "\n" ], "text/plain": [ "\u001b[48;2;39;40;34m \u001b[0m\n", "\u001b[38;2;149;144;119;48;2;39;40;34m# The experiment was run with the following tasks: ['resiliency-in-pretraining-demo']\u001b[0m\u001b[48;2;39;40;34m \u001b[0m\n", "\u001b[38;2;149;144;119;48;2;39;40;34m# You can inspect and reconstruct this experiment at a later point in time using:\u001b[0m\u001b[48;2;39;40;34m \u001b[0m\n", "\u001b[38;2;248;248;242;48;2;39;40;34mexperiment\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;255;70;137;48;2;39;40;34m=\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mrun\u001b[0m\u001b[38;2;255;70;137;48;2;39;40;34m.\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mExperiment\u001b[0m\u001b[38;2;255;70;137;48;2;39;40;34m.\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mfrom_id\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m(\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m\"\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34mresiliency-in-pretraining-demo_1741387286\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m\"\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m)\u001b[0m\u001b[48;2;39;40;34m \u001b[0m\n", "\u001b[38;2;248;248;242;48;2;39;40;34mexperiment\u001b[0m\u001b[38;2;255;70;137;48;2;39;40;34m.\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mstatus\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m(\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m)\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;149;144;119;48;2;39;40;34m# Gets the overall status\u001b[0m\u001b[48;2;39;40;34m \u001b[0m\n", "\u001b[38;2;248;248;242;48;2;39;40;34mexperiment\u001b[0m\u001b[38;2;255;70;137;48;2;39;40;34m.\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mlogs\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m(\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m\"\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34mresiliency-in-pretraining-demo\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m\"\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m)\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;149;144;119;48;2;39;40;34m# Gets the log for the provided task\u001b[0m\u001b[48;2;39;40;34m \u001b[0m\n", "\u001b[38;2;248;248;242;48;2;39;40;34mexperiment\u001b[0m\u001b[38;2;255;70;137;48;2;39;40;34m.\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mcancel\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m(\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m\"\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34mresiliency-in-pretraining-demo\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m\"\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m)\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;149;144;119;48;2;39;40;34m# Cancels the provided task if still running\u001b[0m\u001b[48;2;39;40;34m \u001b[0m\n", "\u001b[48;2;39;40;34m \u001b[0m\n" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
\n", "# You can inspect this experiment at a later point in time using the CLI as well: \n", "nemo experiment status resiliency-in-pretraining-demo_1741387286 \n", "nemo experiment logs resiliency-in-pretraining-demo_1741387286 0 \n", "nemo experiment cancel resiliency-in-pretraining-demo_1741387286 0 \n", " \n", "\n" ], "text/plain": [ "\u001b[48;2;39;40;34m \u001b[0m\n", "\u001b[38;2;149;144;119;48;2;39;40;34m# You can inspect this experiment at a later point in time using the CLI as well:\u001b[0m\u001b[48;2;39;40;34m \u001b[0m\n", "\u001b[38;2;248;248;242;48;2;39;40;34mnemo\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mexperiment\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mstatus\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mresiliency-in-pretraining-demo_1741387286\u001b[0m\u001b[48;2;39;40;34m \u001b[0m\n", "\u001b[38;2;248;248;242;48;2;39;40;34mnemo\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mexperiment\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mlogs\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mresiliency-in-pretraining-demo_1741387286\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;174;129;255;48;2;39;40;34m0\u001b[0m\u001b[48;2;39;40;34m \u001b[0m\n", "\u001b[38;2;248;248;242;48;2;39;40;34mnemo\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mexperiment\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mcancel\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mresiliency-in-pretraining-demo_1741387286\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;174;129;255;48;2;39;40;34m0\u001b[0m\u001b[48;2;39;40;34m \u001b[0m\n", "\u001b[48;2;39;40;34m \u001b[0m\n" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Automatically detect and mitigate mock stragglers during training\n", "pretrain.trainer.callbacks.append(straggler_det_callback(straggler_report_time_interval=1, gpu_relative_perf_threshold=0.99, gpu_individual_perf_threshold=0.99))\n", "\n", "# run the experiment\n", "run_experiment(exp_name, pretrain, executor, run_plugins, dryrun=False)" ] }, { "cell_type": "markdown", "id": "87f7577e-3dec-4e42-90cd-98e3fa481000", "metadata": {}, "source": [ "## 3.2 Cleanup" ] }, { "cell_type": "code", "execution_count": 16, "id": "3fb5eea6-f2f7-4b7b-9c6d-a50a398d1c48", "metadata": {}, "outputs": [], "source": [ "%%bash\n", "# delete old checkpoints\n", "rm -rf /tmp/nemo_run/checkpoints/" ] }, { "cell_type": "code", "execution_count": 17, "id": "988af6cf-d834-4902-a208-3cf82d32381d", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[
────── Entering Experiment resiliency-in-pretraining-demo with id: resiliency-in-pretraining-demo_1741387447 ──────\n", "\n" ], "text/plain": [ "\u001b[92m────── \u001b[0m\u001b[1;35mEntering Experiment resiliency-in-pretraining-demo with id: resiliency-in-pretraining-demo_1741387447\u001b[0m\u001b[92m ──────\u001b[0m\n" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "Launching training run ...\n" ] }, { "data": { "text/html": [ "
[22:44:07] Cannot detach from this experiment. Please keep it running until completion. experiment.py:651\n", "\n" ], "text/plain": [ "\u001b[2;36m[22:44:07]\u001b[0m\u001b[2;36m \u001b[0m\u001b[1;31m Cannot detach from this experiment. Please keep it running until completion.\u001b[0m \u001b]8;id=154117;file:///opt/NeMo-Run/src/nemo_run/run/experiment.py\u001b\\\u001b[2mexperiment.py\u001b[0m\u001b]8;;\u001b\\\u001b[2m:\u001b[0m\u001b]8;id=798019;file:///opt/NeMo-Run/src/nemo_run/run/experiment.py#651\u001b\\\u001b[2m651\u001b[0m\u001b]8;;\u001b\\\n" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stderr", "output_type": "stream", "text": [ "Log directory is: /root/.nemo_run/experiments/resiliency-in-pretraining-demo/resiliency-in-pretraining-demo_1741387447/resiliency-in-pretraining-demo\n" ] }, { "data": { "text/html": [ "
Launching job resiliency-in-pretraining-demo for experiment experiment.py:724\n", " resiliency-in-pretraining-demo \n", "\n" ], "text/plain": [ "\u001b[2;36m \u001b[0m\u001b[2;36m \u001b[0m\u001b[1;36mLaunching job resiliency-in-pretraining-demo for experiment \u001b[0m \u001b]8;id=357545;file:///opt/NeMo-Run/src/nemo_run/run/experiment.py\u001b\\\u001b[2mexperiment.py\u001b[0m\u001b]8;;\u001b\\\u001b[2m:\u001b[0m\u001b]8;id=831697;file:///opt/NeMo-Run/src/nemo_run/run/experiment.py#724\u001b\\\u001b[2m724\u001b[0m\u001b]8;;\u001b\\\n", "\u001b[2;36m \u001b[0m\u001b[1;36mresiliency-in-pretraining-demo\u001b[0m \u001b[2m \u001b[0m\n" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stderr", "output_type": "stream", "text": [ "Log directory is: /root/.nemo_run/experiments/resiliency-in-pretraining-demo/resiliency-in-pretraining-demo_1741387447/resiliency-in-pretraining-demo\n", "Launched app: local_persistent://nemo_run/resiliency-in-pretraining-demo-n9hk3pk4hcz23c\n", "AppStatus:\n", " State: RUNNING\n", " Num Restarts: 0\n", " Roles: \n", " Msg:
─────────────────── Waiting for Experiment resiliency-in-pretraining-demo_1741387447 to finish ────────────────────\n", "\n" ], "text/plain": [ "\u001b[92m─────────────────── \u001b[0m\u001b[1;35mWaiting for Experiment resiliency-in-pretraining-demo_1741387447 to finish\u001b[0m\u001b[92m ────────────────────\u001b[0m\n" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
\n",
"\n"
],
"text/plain": [
"\n"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"Experiment Status for resiliency-in-pretraining-demo_1741387447\n", "\n" ], "text/plain": [ "\u001b[1;32mExperiment Status for\u001b[0m \u001b[1;38;5;214mresiliency-in-pretraining-demo_1741387447\u001b[0m\n" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
\n",
"Task 0: resiliency-in-pretraining-demo\n",
"- Status: RUNNING\n",
"- Executor: LocalExecutor\n",
"- Job id: resiliency-in-pretraining-demo-n9hk3pk4hcz23c\n",
"- Local Directory: /root/.nemo_run/experiments/resiliency-in-pretraining-demo/resiliency-in-pretraining-demo_1741387447/resiliency-in-pretraining-demo\n",
"\n"
],
"text/plain": [
"\n",
"\u001b[1;32mTask 0\u001b[0m: \u001b[1;38;5;214mresiliency-in-pretraining-demo\u001b[0m\n",
"- \u001b[1;32mStatus\u001b[0m: RUNNING\n",
"- \u001b[1;32mExecutor\u001b[0m: LocalExecutor\n",
"- \u001b[1;32mJob id\u001b[0m: resiliency-in-pretraining-demo-n9hk3pk4hcz23c\n",
"- \u001b[1;32mLocal Directory\u001b[0m: /root/.nemo_run/experiments/resiliency-in-pretraining-demo/resiliency-in-pretraining-demo_1741387447/resiliency-in-pretraining-demo\n"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"\n",
"\n"
],
"text/plain": [
"\n"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"Waiting for job resiliency-in-pretraining-demo-n9hk3pk4hcz23c to finish [log=True]...\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"ining-demo/0 W0307 22:44:09.270000 199266 torch/distributed/run.py:793] \n",
"ining-demo/0 W0307 22:44:09.270000 199266 torch/distributed/run.py:793] *****************************************\n",
"ining-demo/0 W0307 22:44:09.270000 199266 torch/distributed/run.py:793] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. \n",
"ining-demo/0 W0307 22:44:09.270000 199266 torch/distributed/run.py:793] *****************************************\n",
"ining-demo/0 I0307 22:44:09.270000 199266 torch/distributed/launcher/api.py:194] Starting elastic_operator with launch configs:\n",
"ining-demo/0 I0307 22:44:09.270000 199266 torch/distributed/launcher/api.py:194] entrypoint : nemo_run.core.runners.fdl_runner\n",
"ining-demo/0 I0307 22:44:09.270000 199266 torch/distributed/launcher/api.py:194] min_nodes : 1\n",
"ining-demo/0 I0307 22:44:09.270000 199266 torch/distributed/launcher/api.py:194] max_nodes : 1\n",
"ining-demo/0 I0307 22:44:09.270000 199266 torch/distributed/launcher/api.py:194] nproc_per_node : 8\n",
"ining-demo/0 I0307 22:44:09.270000 199266 torch/distributed/launcher/api.py:194] run_id : 5723\n",
"ining-demo/0 I0307 22:44:09.270000 199266 torch/distributed/launcher/api.py:194] rdzv_backend : c10d\n",
"ining-demo/0 I0307 22:44:09.270000 199266 torch/distributed/launcher/api.py:194] rdzv_endpoint : localhost:0\n",
"ining-demo/0 I0307 22:44:09.270000 199266 torch/distributed/launcher/api.py:194] rdzv_configs : {'timeout': 900}\n",
"ining-demo/0 I0307 22:44:09.270000 199266 torch/distributed/launcher/api.py:194] max_restarts : 0\n",
"ining-demo/0 I0307 22:44:09.270000 199266 torch/distributed/launcher/api.py:194] monitor_interval : 0.1\n",
"ining-demo/0 I0307 22:44:09.270000 199266 torch/distributed/launcher/api.py:194] log_dir : /root/.nemo_run/experiments/resiliency-in-pretraining-demo/resiliency-in-pretraining-demo_1741387447/resiliency-in-pretraining-demo/nemo_run/resiliency-in-pretraining-demo-n9hk3pk4hcz23c/torchelastic/resiliency-in-pretraining-demo\n",
"ining-demo/0 I0307 22:44:09.270000 199266 torch/distributed/launcher/api.py:194] metrics_cfg : {}\n",
"ining-demo/0 I0307 22:44:09.270000 199266 torch/distributed/launcher/api.py:194] \n",
"ining-demo/0 I0307 22:44:09.274000 199266 torch/distributed/elastic/agent/server/api.py:845] [default] starting workers for entrypoint: python\n",
"ining-demo/0 I0307 22:44:09.274000 199266 torch/distributed/elastic/agent/server/api.py:662] [default] Rendezvous'ing worker group\n",
"ining-demo/0 I0307 22:44:09.377000 199266 torch/distributed/elastic/agent/server/api.py:525] [default] Rendezvous complete for workers. Result:\n",
"ining-demo/0 I0307 22:44:09.377000 199266 torch/distributed/elastic/agent/server/api.py:525] restart_count=0\n",
"ining-demo/0 I0307 22:44:09.377000 199266 torch/distributed/elastic/agent/server/api.py:525] master_addr=localhost\n",
"ining-demo/0 I0307 22:44:09.377000 199266 torch/distributed/elastic/agent/server/api.py:525] master_port=36403\n",
"ining-demo/0 I0307 22:44:09.377000 199266 torch/distributed/elastic/agent/server/api.py:525] group_rank=0\n",
"ining-demo/0 I0307 22:44:09.377000 199266 torch/distributed/elastic/agent/server/api.py:525] group_world_size=1\n",
"ining-demo/0 I0307 22:44:09.377000 199266 torch/distributed/elastic/agent/server/api.py:525] local_ranks=[0, 1, 2, 3, 4, 5, 6, 7]\n",
"ining-demo/0 I0307 22:44:09.377000 199266 torch/distributed/elastic/agent/server/api.py:525] role_ranks=[0, 1, 2, 3, 4, 5, 6, 7]\n",
"ining-demo/0 I0307 22:44:09.377000 199266 torch/distributed/elastic/agent/server/api.py:525] global_ranks=[0, 1, 2, 3, 4, 5, 6, 7]\n",
"ining-demo/0 I0307 22:44:09.377000 199266 torch/distributed/elastic/agent/server/api.py:525] role_world_sizes=[8, 8, 8, 8, 8, 8, 8, 8]\n",
"ining-demo/0 I0307 22:44:09.377000 199266 torch/distributed/elastic/agent/server/api.py:525] global_world_sizes=[8, 8, 8, 8, 8, 8, 8, 8]\n",
"ining-demo/0 I0307 22:44:09.377000 199266 torch/distributed/elastic/agent/server/api.py:525] \n",
"ining-demo/0 I0307 22:44:09.377000 199266 torch/distributed/elastic/agent/server/api.py:670] [default] Starting worker group\n",
"ining-demo/0 I0307 22:44:09.377000 199266 torch/distributed/elastic/agent/server/local_elastic_agent.py:291] use_agent_store: True\n",
"ining-demo/0 I0307 22:44:09.378000 199266 torch/distributed/elastic/agent/server/local_elastic_agent.py:192] Environment variable 'TORCHELASTIC_ENABLE_FILE_TIMER' not found. Do not start FileTimerServer.\n",
"ining-demo/0 I0307 22:44:09.378000 199266 torch/distributed/elastic/agent/server/local_elastic_agent.py:229] Environment variable 'TORCHELASTIC_HEALTH_CHECK_PORT' not found. Do not start health check.\n",
"ining-demo/0 [default0]:[NeMo W 2025-03-07 22:44:18 nemo_logging:361] /usr/local/lib/python3.10/dist-packages/pyannote/core/notebook.py:134: MatplotlibDeprecationWarning: The get_cmap function was deprecated in Matplotlib 3.7 and will be removed in 3.11. Use ``matplotlib.colormaps[name]`` or ``matplotlib.colormaps.get_cmap()`` or ``pyplot.get_cmap()`` instead.\n",
"ining-demo/0 [default0]: cm = get_cmap(\"Set1\")\n",
"ining-demo/0 [default0]: \n",
"ining-demo/0 [default0]:[NeMo I 2025-03-07 22:44:19 tokenizer_utils:224] Getting Megatron tokenizer for pretrained model name: megatron-gpt-345m, custom vocab file: None, and merges file: None\n",
"ining-demo/0 [default0]:[NeMo I 2025-03-07 22:44:19 tokenizer_utils:130] Getting HuggingFace AutoTokenizer with pretrained_model_name: gpt2, vocab_file: /root/.cache/torch/megatron/megatron-gpt-345m_vocab, merges_files: /root/.cache/torch/megatron/megatron-gpt-345m_merges, special_tokens_dict: {}, and use_fast: False\n",
"ining-demo/0 [default0]:[NeMo I 2025-03-07 22:44:19 nemo_logger:145] Experiments will be logged at /tmp/nemo_run/checkpoints/resiliency-in-pretraining-demo\n",
"ining-demo/0 [default0]:[NeMo I 2025-03-07 22:44:19 megatron_strategy:315] Fixing mis-match between ddp-config & mcore-optimizer config\n",
"ining-demo/0 [default0]:GPU available: True (cuda), used: True\n",
"ining-demo/0 [default0]:TPU available: False, using: 0 TPU cores\n",
"ining-demo/0 [default0]:HPU available: False, using: 0 HPUs\n",
"ining-demo/0 [default0]:[NeMo W 2025-03-07 22:44:19 nemo_logger:123] No version folders would be created under the log folder as 'resume_if_exists' is enabled.\n",
"ining-demo/0 [default0]:[NeMo W 2025-03-07 22:44:19 nemo_logger:173] \"update_logger_directory\" is True. Overwriting tensorboard logger \"save_dir\" to /tmp/nemo_run/checkpoints/tb_logs\n",
"ining-demo/0 [default0]:[NeMo W 2025-03-07 22:44:19 nemo_logger:189] The Trainer already contains a ModelCheckpoint callback. This will be overwritten.\n",
"ining-demo/0 [default0]:[NeMo W 2025-03-07 22:44:19 nemo_logger:212] The checkpoint callback was told to monitor a validation value and trainer's max_steps was set to 20. Please ensure that max_steps will run for at least 1 epochs to ensure that checkpointing will not error out.\n",
"ining-demo/0 [default0]:[NeMo W 2025-03-07 22:44:19 resume:228] There were no checkpoints found in checkpoint_dir or no checkpoint folder at checkpoint_dir :/tmp/nemo_run/checkpoints/resiliency-in-pretraining-demo/checkpoints. Training from scratch.\n",
"ining-demo/0 [default0]:[NeMo I 2025-03-07 22:44:19 megatron_init:426] Rank 0 has data parallel group : [0, 2, 4, 6]\n",
"ining-demo/0 [default0]:[NeMo I 2025-03-07 22:44:19 megatron_init:432] Rank 0 has combined group of data parallel and context parallel : [0, 2, 4, 6]\n",
"ining-demo/0 [default0]:[NeMo I 2025-03-07 22:44:19 megatron_init:437] All data parallel group ranks with context parallel combined: [[0, 2, 4, 6], [1, 3, 5, 7]]\n",
"ining-demo/0 [default0]:[NeMo I 2025-03-07 22:44:19 megatron_init:440] Ranks 0 has data parallel rank: 0\n",
"ining-demo/0 [default0]:[NeMo I 2025-03-07 22:44:19 megatron_init:448] Rank 0 has context parallel group: [0]\n",
"ining-demo/0 [default0]:[NeMo I 2025-03-07 22:44:19 megatron_init:451] All context parallel group ranks: [[0], [1], [2], [3], [4], [5], [6], [7]]\n",
"ining-demo/0 [default0]:[NeMo I 2025-03-07 22:44:19 megatron_init:452] Ranks 0 has context parallel rank: 0\n",
"ining-demo/0 [default0]:[NeMo I 2025-03-07 22:44:19 megatron_init:459] Rank 0 has model parallel group: [0, 1]\n",
"ining-demo/0 [default0]:[NeMo I 2025-03-07 22:44:19 megatron_init:460] All model parallel group ranks: [[0, 1], [2, 3], [4, 5], [6, 7]]\n",
"ining-demo/0 [default0]:[NeMo I 2025-03-07 22:44:19 megatron_init:469] Rank 0 has tensor model parallel group: [0, 1]\n",
"ining-demo/0 [default0]:[NeMo I 2025-03-07 22:44:19 megatron_init:473] All tensor model parallel group ranks: [[0, 1], [2, 3], [4, 5], [6, 7]]\n",
"ining-demo/0 [default0]:[NeMo I 2025-03-07 22:44:19 megatron_init:474] Rank 0 has tensor model parallel rank: 0\n",
"ining-demo/0 [default0]:[NeMo I 2025-03-07 22:44:19 megatron_init:494] Rank 0 has pipeline model parallel group: [0]\n",
"ining-demo/0 [default0]:[NeMo I 2025-03-07 22:44:19 megatron_init:506] Rank 0 has embedding group: [0]\n",
"ining-demo/0 [default0]:[NeMo I 2025-03-07 22:44:19 megatron_init:512] All pipeline model parallel group ranks: [[0], [1], [2], [3], [4], [5], [6], [7]]\n",
"ining-demo/0 [default0]:[NeMo I 2025-03-07 22:44:19 megatron_init:513] Rank 0 has pipeline model parallel rank 0\n",
"ining-demo/0 [default0]:[NeMo I 2025-03-07 22:44:19 megatron_init:514] All embedding group ranks: [[0], [1], [2], [3], [4], [5], [6], [7]]\n",
"ining-demo/0 [default0]:[NeMo I 2025-03-07 22:44:19 megatron_init:515] Rank 0 has embedding rank: 0\n",
"ining-demo/0 [default0]:Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/8\n",
"ining-demo/0 [default0]:----------------------------------------------------------------------------------------------------\n",
"ining-demo/0 [default0]:distributed_backend=nccl\n",
"ining-demo/0 [default0]:All distributed processes registered. Starting with 8 processes\n",
"ining-demo/0 [default0]:----------------------------------------------------------------------------------------------------\n",
"ining-demo/0 [default0]:\n",
"ining-demo/0 [default7]:Initializing distributed: GLOBAL_RANK: 7, MEMBER: 8/8\n",
"ining-demo/0 [default6]:Initializing distributed: GLOBAL_RANK: 6, MEMBER: 7/8\n",
"ining-demo/0 [default1]:Initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/8\n",
"ining-demo/0 [default5]:Initializing distributed: GLOBAL_RANK: 5, MEMBER: 6/8\n",
"ining-demo/0 [default4]:Initializing distributed: GLOBAL_RANK: 4, MEMBER: 5/8\n",
"ining-demo/0 [default3]:Initializing distributed: GLOBAL_RANK: 3, MEMBER: 4/8\n",
"ining-demo/0 [default2]:Initializing distributed: GLOBAL_RANK: 2, MEMBER: 3/8\n",
"ining-demo/0 [default0]:[NeMo I 2025-03-07 22:44:23 base:44] Padded vocab_size: 50432, original vocab_size: 50257, dummy tokens: 175.\n",
"ining-demo/0 [default0]:[NeMo I 2025-03-07 22:44:23 megatron_strategy:327] Copying Trainer's 'max_steps' (20) to LR scheduler's 'max_steps'.\n",
"ining-demo/0 [default0]:[NeMo I 2025-03-07 22:44:23 num_microbatches_calculator:228] setting number of microbatches to constant 128\n",
"ining-demo/0 [default0]:[NeMo I 2025-03-07 22:44:23 megatron_parallel:549] > number of parameters on (tensor, pipeline) model parallel rank (0, 0): 54663936\n",
"ining-demo/0 [default0]:[NeMo I 2025-03-07 22:44:23 utils:302] Setting up DistributedDataParallel with config DistributedDataParallelConfig(grad_reduce_in_fp32=True, overlap_grad_reduce=True, overlap_param_gather=True, align_param_gather=False, use_distributed_optimizer=True, num_distributed_optimizer_instances=1, check_for_nan_in_grad=True, bucket_size=40000000, average_in_collective=True, fp8_param_gather=False)\n",
"ining-demo/0 [default0]:[NeMo I 2025-03-07 22:44:23 utils:323] Number of buckets for gradient all-reduce / reduce-scatter: 1\n",
"ining-demo/0 [default0]: Params for bucket 1 (54663936 elements):\n",
"ining-demo/0 [default0]: \tmodule.output_layer.weight\n",
"ining-demo/0 [default0]: \tmodule.decoder.layers.3.self_attention.linear_qkv.weight\n",
"ining-demo/0 [default0]: \tmodule.decoder.layers.2.self_attention.linear_proj.weight\n",
"ining-demo/0 [default0]: \tmodule.decoder.layers.0.mlp.linear_fc1.layer_norm_weight\n",
"ining-demo/0 [default0]: \tmodule.decoder.layers.3.self_attention.linear_qkv.layer_norm_weight\n",
"ining-demo/0 [default0]: \tmodule.decoder.layers.2.mlp.linear_fc1.layer_norm_weight\n",
"ining-demo/0 [default0]: \tmodule.decoder.layers.1.mlp.linear_fc2.weight\n",
"ining-demo/0 [default0]: \tmodule.decoder.layers.1.self_attention.linear_proj.weight\n",
"ining-demo/0 [default0]: \tmodule.decoder.layers.0.self_attention.linear_proj.weight\n",
"ining-demo/0 [default0]: \tmodule.decoder.layers.3.self_attention.linear_proj.weight\n",
"ining-demo/0 [default0]: \tmodule.decoder.layers.0.mlp.linear_fc2.weight\n",
"ining-demo/0 [default0]: \tmodule.decoder.layers.1.mlp.linear_fc1.weight\n",
"ining-demo/0 [default0]: \tmodule.decoder.layers.3.mlp.linear_fc1.layer_norm_weight\n",
"ining-demo/0 [default0]: \tmodule.decoder.layers.2.mlp.linear_fc2.weight\n",
"ining-demo/0 [default0]: \tmodule.decoder.layers.0.self_attention.linear_qkv.weight\n",
"ining-demo/0 [default0]: \tmodule.decoder.layers.1.self_attention.linear_qkv.weight\n",
"ining-demo/0 [default0]: \tmodule.decoder.layers.0.self_attention.linear_qkv.layer_norm_weight\n",
"ining-demo/0 [default0]: \tmodule.decoder.layers.3.mlp.linear_fc2.weight\n",
"ining-demo/0 [default0]: \tmodule.decoder.layers.2.mlp.linear_fc1.weight\n",
"ining-demo/0 [default0]: \tmodule.decoder.layers.1.self_attention.linear_qkv.layer_norm_weight\n",
"ining-demo/0 [default0]: \tmodule.embedding.word_embeddings.weight\n",
"ining-demo/0 [default0]: \tmodule.decoder.layers.2.self_attention.linear_qkv.weight\n",
"ining-demo/0 [default0]: \tmodule.decoder.layers.1.mlp.linear_fc1.layer_norm_weight\n",
"ining-demo/0 [default0]: \tmodule.decoder.layers.3.mlp.linear_fc1.weight\n",
"ining-demo/0 [default0]: \tmodule.decoder.final_layernorm.weight\n",
"ining-demo/0 [default0]: \tmodule.decoder.layers.2.self_attention.linear_qkv.layer_norm_weight\n",
"ining-demo/0 [default0]: \tmodule.decoder.layers.0.mlp.linear_fc1.weight\n",
"ining-demo/0 [default0]:[NeMo I 2025-03-07 22:44:23 utils:302] Setting up optimizer with config OptimizerConfig(optimizer='adam', lr=0.0003, min_lr=None, decoupled_lr=None, decoupled_min_lr=None, weight_decay=0.1, fp16=False, bf16=True, params_dtype=torch.bfloat16, loss_scale=None, initial_loss_scale=4294967296, min_loss_scale=1.0, loss_scale_window=1000, hysteresis=2, adam_beta1=0.9, adam_beta2=0.95, adam_eps=1e-05, sgd_momentum=0.9, use_distributed_optimizer=True, overlap_param_gather_with_optimizer_step=False, clip_grad=1.0, log_num_zeros_in_grad=False, barrier_with_L1_time=False, timers=None, config_logger_dir='')\n",
"ining-demo/0 [default2]:LOCAL_RANK: 2 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]\n",
"ining-demo/0 [default3]:LOCAL_RANK: 3 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]\n",
"ining-demo/0 [default0]:LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]\n",
"ining-demo/0 [default0]:\n",
"ining-demo/0 [default0]: | Name | Type | Params | Mode \n",
"ining-demo/0 [default0]:----------------------------------------\n",
"ining-demo/0 [default4]:LOCAL_RANK: 4 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]\n",
"ining-demo/0 [default0]:0 | module | DDP | 54.7 M | train\n",
"ining-demo/0 [default0]:----------------------------------------\n",
"ining-demo/0 [default0]:54.7 M Trainable params\n",
"ining-demo/0 [default0]:0 Non-trainable params\n",
"ining-demo/0 [default0]:54.7 M Total params\n",
"ining-demo/0 [default0]:218.656 Total estimated model params size (MB)\n",
"ining-demo/0 [default0]:91 Modules in train mode\n",
"ining-demo/0 [default0]:0 Modules in eval mode\n",
"ining-demo/0 [default1]:LOCAL_RANK: 1 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]\n",
"ining-demo/0 [default5]:LOCAL_RANK: 5 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]\n",
"ining-demo/0 [default6]:LOCAL_RANK: 6 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]\n",
"ining-demo/0 [default7]:LOCAL_RANK: 7 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]\n",
"ining-demo/0 [default0]:[NeMo W 2025-03-07 22:44:33 rerun_state_machine:1088] Implicit initialization of Rerun State Machine!\n",
"ining-demo/0 [default0]:[NeMo W 2025-03-07 22:44:33 rerun_state_machine:211] RerunStateMachine initialized in mode disabled\n",
"ining-demo/0 [default0]:Training epoch 0, iteration 0/19 | lr: 1.499e-07 | global_batch_size: 512 | global_step: 0 | reduced_train_loss: 11.03 | train_step_timing in s: 10.4\n",
"ining-demo/0 [default0]:Training epoch 0, iteration 1/19 | lr: 2.999e-07 | global_batch_size: 512 | global_step: 1 | reduced_train_loss: 11.03 | train_step_timing in s: 7.615 | consumed_samples: 1024\n",
"ining-demo/0 [default0]:Training epoch 0, iteration 2/19 | lr: 4.498e-07 | global_batch_size: 512 | global_step: 2 | reduced_train_loss: 11.03 | train_step_timing in s: 7.625 | consumed_samples: 1536\n",
"ining-demo/0 [default0]:Training epoch 0, iteration 3/19 | lr: 5.997e-07 | global_batch_size: 512 | global_step: 3 | reduced_train_loss: 11.03 | train_step_timing in s: 7.618 | consumed_samples: 2048\n",
"ining-demo/0 [default0]:Training epoch 0, iteration 4/19 | lr: 7.496e-07 | global_batch_size: 512 | global_step: 4 | reduced_train_loss: 11.03 | train_step_timing in s: 7.624 | consumed_samples: 2560\n",
"ining-demo/0 [default0]:Training epoch 0, iteration 5/19 | lr: 8.996e-07 | global_batch_size: 512 | global_step: 5 | reduced_train_loss: 11.03 | train_step_timing in s: 7.629 | consumed_samples: 3072\n",
"ining-demo/0 [default0]:Training epoch 0, iteration 6/19 | lr: 1.049e-06 | global_batch_size: 512 | global_step: 6 | reduced_train_loss: 11.03 | train_step_timing in s: 7.63 | consumed_samples: 3584\n",
"ining-demo/0 [default0]:Training epoch 0, iteration 7/19 | lr: 1.199e-06 | global_batch_size: 512 | global_step: 7 | reduced_train_loss: 11.03 | train_step_timing in s: 7.632 | consumed_samples: 4096\n",
"ining-demo/0 [default0]:Training epoch 0, iteration 8/19 | lr: 1.349e-06 | global_batch_size: 512 | global_step: 8 | reduced_train_loss: 11.03 | train_step_timing in s: 7.636 | consumed_samples: 4608\n",
"ining-demo/0 [default0]:Training epoch 0, iteration 9/19 | lr: 1.499e-06 | global_batch_size: 512 | global_step: 9 | reduced_train_loss: 11.03 | train_step_timing in s: 7.625 | consumed_samples: 5120\n",
"ining-demo/0 [default0]:[NeMo W 2025-03-07 22:45:43 validation:389] There is difference in the common state dict in different ranks. The differences are {2: ([], [('optimizer', 0, 'optimizer', 'param_groups', 1, 'step')], []), 3: ([], [('optimizer', 0, 'optimizer', 'param_groups', 1, 'step')], []), 4: ([], [('optimizer', 0, 'optimizer', 'param_groups', 1, 'step')], []), 5: ([], [('optimizer', 0, 'optimizer', 'param_groups', 1, 'step')], [])}\n",
"ining-demo/0 [default0]:[NeMo I 2025-03-07 22:45:46 model_checkpoint:497] Scheduled async checkpoint save for /tmp/nemo_run/checkpoints/resiliency-in-pretraining-demo/checkpoints/model_name=0--val_loss=0.00-step=9-consumed_samples=5120.0-last.ckpt\n",
"ining-demo/0 [default0]:Training epoch 0, iteration 10/19 | lr: 1.649e-06 | global_batch_size: 512 | global_step: 10 | reduced_train_loss: 11.03 | train_step_timing in s: 7.633 | consumed_samples: 5632\n",
"ining-demo/0 [default0]:[NeMo I 2025-03-07 22:45:53 model_checkpoint:522] Async checkpoint save for step 10 (/tmp/nemo_run/checkpoints/resiliency-in-pretraining-demo/checkpoints/model_name=0--val_loss=0.00-step=9-consumed_samples=5120.0-last.ckpt) finalized successfully.\n",
"ining-demo/0 [default0]:Training epoch 0, iteration 11/19 | lr: 1.799e-06 | global_batch_size: 512 | global_step: 11 | reduced_train_loss: 11.03 | train_step_timing in s: 7.626 | consumed_samples: 6144\n",
"ining-demo/0 [default0]:Training epoch 0, iteration 12/19 | lr: 1.949e-06 | global_batch_size: 512 | global_step: 12 | reduced_train_loss: 11.03 | train_step_timing in s: 7.629 | consumed_samples: 6656\n",
"ining-demo/0 [default0]:Training epoch 0, iteration 13/19 | lr: 2.099e-06 | global_batch_size: 512 | global_step: 13 | reduced_train_loss: 11.03 | train_step_timing in s: 7.637 | consumed_samples: 7168\n",
"ining-demo/0 [default0]:Training epoch 0, iteration 14/19 | lr: 2.249e-06 | global_batch_size: 512 | global_step: 14 | reduced_train_loss: 11.03 | train_step_timing in s: 7.626 | consumed_samples: 7680\n",
"ining-demo/0 [default0]:Training epoch 0, iteration 15/19 | lr: 2.399e-06 | global_batch_size: 512 | global_step: 15 | reduced_train_loss: 11.03 | train_step_timing in s: 7.637 | consumed_samples: 8192\n",
"ining-demo/0 [default0]:[NeMo I 2025-03-07 22:46:39 straggler_det_callback:144] \n",
"ining-demo/0 [default0]: GPU relative performance:\n",
"ining-demo/0 [default0]: Rank=4 Node=4809917ea058 Score=0.97\n",
"ining-demo/0 [default0]: Rank=5 Node=4809917ea058 Score=0.98\n",
"ining-demo/0 [default0]: Rank=1 Node=4809917ea058 Score=0.98\n",
"ining-demo/0 [default0]: Rank=0 Node=4809917ea058 Score=0.98\n",
"ining-demo/0 [default0]: Rank=6 Node=4809917ea058 Score=0.98\n",
"ining-demo/0 [default0]: Rank=7 Node=4809917ea058 Score=0.99\n",
"ining-demo/0 [default0]: Rank=2 Node=4809917ea058 Score=0.99\n",
"ining-demo/0 [default0]: Rank=3 Node=4809917ea058 Score=1.00\n",
"ining-demo/0 [default0]: \n",
"ining-demo/0 [default0]:[NeMo I 2025-03-07 22:46:39 straggler_det_callback:153] \n",
"ining-demo/0 [default0]: GPU individual performance:\n",
"ining-demo/0 [default0]: Rank=0 Node=4809917ea058 Score=1.00\n",
"ining-demo/0 [default0]: Rank=1 Node=4809917ea058 Score=1.00\n",
"ining-demo/0 [default0]: Rank=2 Node=4809917ea058 Score=1.00\n",
"ining-demo/0 [default0]: Rank=3 Node=4809917ea058 Score=1.00\n",
"ining-demo/0 [default0]: Rank=4 Node=4809917ea058 Score=1.00\n",
"ining-demo/0 [default0]: Rank=5 Node=4809917ea058 Score=1.00\n",
"ining-demo/0 [default0]: Rank=6 Node=4809917ea058 Score=1.00\n",
"ining-demo/0 [default0]: Rank=7 Node=4809917ea058 Score=1.00\n",
"ining-demo/0 [default0]: \n",
"ining-demo/0 [default0]:[NeMo I 2025-03-07 22:46:39 straggler_det_callback:236] Straggler report processing time: 0.040 sec.\n",
"ining-demo/0 [default0]:Training epoch 0, iteration 16/19 | lr: 2.549e-06 | global_batch_size: 512 | global_step: 16 | reduced_train_loss: 11.03 | train_step_timing in s: 7.635 | consumed_samples: 8704\n",
"ining-demo/0 [default0]:[NeMo I 2025-03-07 22:46:47 straggler_det_callback:144] \n",
"ining-demo/0 [default0]: GPU relative performance:\n",
"ining-demo/0 [default0]: Rank=4 Node=4809917ea058 Score=0.97\n",
"ining-demo/0 [default0]: Rank=5 Node=4809917ea058 Score=0.98\n",
"ining-demo/0 [default0]: Rank=1 Node=4809917ea058 Score=0.98\n",
"ining-demo/0 [default0]: Rank=0 Node=4809917ea058 Score=0.98\n",
"ining-demo/0 [default0]: Rank=6 Node=4809917ea058 Score=0.99\n",
"ining-demo/0 [default0]: Rank=7 Node=4809917ea058 Score=0.99\n",
"ining-demo/0 [default0]: Rank=2 Node=4809917ea058 Score=0.99\n",
"ining-demo/0 [default0]: Rank=3 Node=4809917ea058 Score=1.00\n",
"ining-demo/0 [default0]: \n",
"ining-demo/0 [default0]:[NeMo I 2025-03-07 22:46:47 straggler_det_callback:153] \n",
"ining-demo/0 [default0]: GPU individual performance:\n",
"ining-demo/0 [default0]: Rank=3 Node=4809917ea058 Score=1.00\n",
"ining-demo/0 [default0]: Rank=4 Node=4809917ea058 Score=1.00\n",
"ining-demo/0 [default0]: Rank=6 Node=4809917ea058 Score=1.00\n",
"ining-demo/0 [default0]: Rank=1 Node=4809917ea058 Score=1.00\n",
"ining-demo/0 [default0]: Rank=0 Node=4809917ea058 Score=1.00\n",
"ining-demo/0 [default0]: Rank=2 Node=4809917ea058 Score=1.00\n",
"ining-demo/0 [default0]: Rank=5 Node=4809917ea058 Score=1.00\n",
"ining-demo/0 [default0]: Rank=7 Node=4809917ea058 Score=1.00\n",
"ining-demo/0 [default0]: \n",
"ining-demo/0 [default0]:[NeMo I 2025-03-07 22:46:47 straggler_det_callback:236] Straggler report processing time: 0.015 sec.\n",
"ining-demo/0 [default0]:Training epoch 0, iteration 17/19 | lr: 2.699e-06 | global_batch_size: 512 | global_step: 17 | reduced_train_loss: 11.03 | train_step_timing in s: 7.632 | consumed_samples: 9216\n",
"ining-demo/0 [default0]:[NeMo I 2025-03-07 22:46:55 straggler_det_callback:144] \n",
"ining-demo/0 [default0]: GPU relative performance:\n",
"ining-demo/0 [default0]: Rank=4 Node=4809917ea058 Score=0.97\n",
"ining-demo/0 [default0]: Rank=5 Node=4809917ea058 Score=0.98\n",
"ining-demo/0 [default0]: Rank=1 Node=4809917ea058 Score=0.98\n",
"ining-demo/0 [default0]: Rank=0 Node=4809917ea058 Score=0.98\n",
"ining-demo/0 [default0]: Rank=6 Node=4809917ea058 Score=0.98\n",
"ining-demo/0 [default0]: Rank=7 Node=4809917ea058 Score=0.99\n",
"ining-demo/0 [default0]: Rank=2 Node=4809917ea058 Score=0.99\n",
"ining-demo/0 [default0]: Rank=3 Node=4809917ea058 Score=0.99\n",
"ining-demo/0 [default0]: \n",
"ining-demo/0 [default0]:[NeMo I 2025-03-07 22:46:55 straggler_det_callback:153] \n",
"ining-demo/0 [default0]: GPU individual performance:\n",
"ining-demo/0 [default0]: Rank=3 Node=4809917ea058 Score=1.00\n",
"ining-demo/0 [default0]: Rank=0 Node=4809917ea058 Score=1.00\n",
"ining-demo/0 [default0]: Rank=6 Node=4809917ea058 Score=1.00\n",
"ining-demo/0 [default0]: Rank=1 Node=4809917ea058 Score=1.00\n",
"ining-demo/0 [default0]: Rank=4 Node=4809917ea058 Score=1.00\n",
"ining-demo/0 [default0]: Rank=7 Node=4809917ea058 Score=1.00\n",
"ining-demo/0 [default0]: Rank=2 Node=4809917ea058 Score=1.00\n",
"ining-demo/0 [default0]: Rank=5 Node=4809917ea058 Score=1.00\n",
"ining-demo/0 [default0]: \n",
"ining-demo/0 [default0]:[NeMo I 2025-03-07 22:46:55 straggler_det_callback:236] Straggler report processing time: 0.015 sec.\n",
"ining-demo/0 [default0]:Training epoch 0, iteration 18/19 | lr: 2.849e-06 | global_batch_size: 512 | global_step: 18 | reduced_train_loss: 11.03 | train_step_timing in s: 7.637 | consumed_samples: 9728\n",
"ining-demo/0 [default0]:[NeMo W 2025-03-07 22:46:55 validation:389] There is difference in the common state dict in different ranks. The differences are {2: ([], [('optimizer', 0, 'optimizer', 'param_groups', 1, 'step')], []), 3: ([], [('optimizer', 0, 'optimizer', 'param_groups', 1, 'step')], []), 4: ([], [('optimizer', 0, 'optimizer', 'param_groups', 1, 'step')], []), 5: ([], [('optimizer', 0, 'optimizer', 'param_groups', 1, 'step')], [])}\n",
"ining-demo/0 [default0]:[NeMo I 2025-03-07 22:46:55 model_checkpoint:497] Scheduled async checkpoint save for /tmp/nemo_run/checkpoints/resiliency-in-pretraining-demo/checkpoints/model_name=0--val_loss=0.00-step=18-consumed_samples=9728.0-last.ckpt\n",
"ining-demo/0 [default0]:[NeMo I 2025-03-07 22:47:03 straggler_det_callback:144] \n",
"ining-demo/0 [default0]: GPU relative performance:\n",
"ining-demo/0 [default0]: Rank=4 Node=4809917ea058 Score=0.98\n",
"ining-demo/0 [default0]: Rank=5 Node=4809917ea058 Score=0.98\n",
"ining-demo/0 [default0]: Rank=1 Node=4809917ea058 Score=0.98\n",
"ining-demo/0 [default0]: Rank=0 Node=4809917ea058 Score=0.99\n",
"ining-demo/0 [default0]: Rank=6 Node=4809917ea058 Score=0.99\n",
"ining-demo/0 [default0]: Rank=7 Node=4809917ea058 Score=0.99\n",
"ining-demo/0 [default0]: Rank=2 Node=4809917ea058 Score=0.99\n",
"ining-demo/0 [default0]: Rank=3 Node=4809917ea058 Score=1.00\n",
"ining-demo/0 [default0]: \n",
"ining-demo/0 [default0]:[NeMo I 2025-03-07 22:47:03 straggler_det_callback:153] \n",
"ining-demo/0 [default0]: GPU individual performance:\n",
"ining-demo/0 [default0]: Rank=3 Node=4809917ea058 Score=0.99\n",
"ining-demo/0 [default0]: Rank=2 Node=4809917ea058 Score=1.00\n",
"ining-demo/0 [default0]: Rank=1 Node=4809917ea058 Score=1.00\n",
"ining-demo/0 [default0]: Rank=6 Node=4809917ea058 Score=1.00\n",
"ining-demo/0 [default0]: Rank=4 Node=4809917ea058 Score=1.00\n",
"ining-demo/0 [default0]: Rank=5 Node=4809917ea058 Score=1.00\n",
"ining-demo/0 [default0]: Rank=7 Node=4809917ea058 Score=1.00\n",
"ining-demo/0 [default0]: Rank=0 Node=4809917ea058 Score=1.00\n",
"ining-demo/0 [default0]: \n",
"ining-demo/0 [default0]:[NeMo I 2025-03-07 22:47:03 straggler_det_callback:236] Straggler report processing time: 0.014 sec.\n",
"ining-demo/0 [default0]:Training epoch 1, iteration 0/19 | lr: 2.999e-06 | global_batch_size: 512 | global_step: 19 | reduced_train_loss: 11.03 | train_step_timing in s: 7.73 | consumed_samples: 10240\n",
"ining-demo/0 [default0]:[NeMo I 2025-03-07 22:47:03 model_checkpoint:522] Async checkpoint save for step 19 (/tmp/nemo_run/checkpoints/resiliency-in-pretraining-demo/checkpoints/model_name=0--val_loss=0.00-step=18-consumed_samples=9728.0-last.ckpt) finalized successfully.\n",
"ining-demo/0 [default0]:`Trainer.fit` stopped: `max_steps=20` reached.\n",
"ining-demo/0 I0307 22:47:10.001000 199266 torch/distributed/elastic/agent/server/api.py:864] [default] worker group successfully finished. Waiting 300 seconds for other agents to finish.\n",
"ining-demo/0 I0307 22:47:10.001000 199266 torch/distributed/elastic/agent/server/api.py:917] Local worker group finished (WorkerState.SUCCEEDED). Waiting 300 seconds for other agents to finish\n",
"ining-demo/0 I0307 22:47:10.001000 199266 torch/distributed/elastic/agent/server/api.py:931] Done waiting for other agents. Elapsed: 0.00034165382385253906 seconds\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"Job resiliency-in-pretraining-demo-n9hk3pk4hcz23c finished: SUCCEEDED\n"
]
},
{
"data": {
"text/html": [
"\n", "# The experiment was run with the following tasks: ['resiliency-in-pretraining-demo'] \n", "# You can inspect and reconstruct this experiment at a later point in time using: \n", "experiment = run.Experiment.from_id(\"resiliency-in-pretraining-demo_1741387447\") \n", "experiment.status() # Gets the overall status \n", "experiment.logs(\"resiliency-in-pretraining-demo\") # Gets the log for the provided task \n", "experiment.cancel(\"resiliency-in-pretraining-demo\") # Cancels the provided task if still running \n", " \n", "\n" ], "text/plain": [ "\u001b[48;2;39;40;34m \u001b[0m\n", "\u001b[38;2;149;144;119;48;2;39;40;34m# The experiment was run with the following tasks: ['resiliency-in-pretraining-demo']\u001b[0m\u001b[48;2;39;40;34m \u001b[0m\n", "\u001b[38;2;149;144;119;48;2;39;40;34m# You can inspect and reconstruct this experiment at a later point in time using:\u001b[0m\u001b[48;2;39;40;34m \u001b[0m\n", "\u001b[38;2;248;248;242;48;2;39;40;34mexperiment\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;255;70;137;48;2;39;40;34m=\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mrun\u001b[0m\u001b[38;2;255;70;137;48;2;39;40;34m.\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mExperiment\u001b[0m\u001b[38;2;255;70;137;48;2;39;40;34m.\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mfrom_id\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m(\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m\"\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34mresiliency-in-pretraining-demo_1741387447\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m\"\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m)\u001b[0m\u001b[48;2;39;40;34m \u001b[0m\n", "\u001b[38;2;248;248;242;48;2;39;40;34mexperiment\u001b[0m\u001b[38;2;255;70;137;48;2;39;40;34m.\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mstatus\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m(\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m)\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;149;144;119;48;2;39;40;34m# Gets the overall status\u001b[0m\u001b[48;2;39;40;34m \u001b[0m\n", "\u001b[38;2;248;248;242;48;2;39;40;34mexperiment\u001b[0m\u001b[38;2;255;70;137;48;2;39;40;34m.\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mlogs\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m(\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m\"\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34mresiliency-in-pretraining-demo\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m\"\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m)\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;149;144;119;48;2;39;40;34m# Gets the log for the provided task\u001b[0m\u001b[48;2;39;40;34m \u001b[0m\n", "\u001b[38;2;248;248;242;48;2;39;40;34mexperiment\u001b[0m\u001b[38;2;255;70;137;48;2;39;40;34m.\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mcancel\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m(\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m\"\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34mresiliency-in-pretraining-demo\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m\"\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m)\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;149;144;119;48;2;39;40;34m# Cancels the provided task if still running\u001b[0m\u001b[48;2;39;40;34m \u001b[0m\n", "\u001b[48;2;39;40;34m \u001b[0m\n" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
\n", "# You can inspect this experiment at a later point in time using the CLI as well: \n", "nemo experiment status resiliency-in-pretraining-demo_1741387447 \n", "nemo experiment logs resiliency-in-pretraining-demo_1741387447 0 \n", "nemo experiment cancel resiliency-in-pretraining-demo_1741387447 0 \n", " \n", "\n" ], "text/plain": [ "\u001b[48;2;39;40;34m \u001b[0m\n", "\u001b[38;2;149;144;119;48;2;39;40;34m# You can inspect this experiment at a later point in time using the CLI as well:\u001b[0m\u001b[48;2;39;40;34m \u001b[0m\n", "\u001b[38;2;248;248;242;48;2;39;40;34mnemo\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mexperiment\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mstatus\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mresiliency-in-pretraining-demo_1741387447\u001b[0m\u001b[48;2;39;40;34m \u001b[0m\n", "\u001b[38;2;248;248;242;48;2;39;40;34mnemo\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mexperiment\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mlogs\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mresiliency-in-pretraining-demo_1741387447\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;174;129;255;48;2;39;40;34m0\u001b[0m\u001b[48;2;39;40;34m \u001b[0m\n", "\u001b[38;2;248;248;242;48;2;39;40;34mnemo\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mexperiment\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mcancel\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mresiliency-in-pretraining-demo_1741387447\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;174;129;255;48;2;39;40;34m0\u001b[0m\u001b[48;2;39;40;34m \u001b[0m\n", "\u001b[48;2;39;40;34m \u001b[0m\n" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Automatically detect and mitigate mock stragglers during training\n", "# gpu_relative_perf_threshold and gpu_individual_perf_threshold default to 0.7 if not set explicitly\n", "pretrain.trainer.callbacks.append(straggler_det_callback(straggler_report_time_interval=1))\n", "\n", "# run the experiment\n", "run_experiment(exp_name, pretrain, executor, run_plugins, dryrun=False)" ] }, { "cell_type": "markdown", "id": "e5b91070-e9f0-4fb0-ac3a-2460e1a948d9", "metadata": {}, "source": [ "The straggler detection system identifies GPUs that are lagging behind in performance, halts the job to prevent inefficiencies, and provides detailed information about which GPUs are struggling. It monitors GPU performance of ranks to pinpoint slower ranks that may hinder overall training efficiency, thus enabling targeted optimization for distributed training setups." ] }, { "cell_type": "code", "execution_count": 21, "id": "9ac65fee-c78c-4462-b372-a8377797bb0d", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The current user does not have permission to change clocks for GPU 00000000:01:00.0.\n", "Terminating early due to previous errors.\n", "The current user does not have permission to change clocks for GPU 00000000:01:00.0.\n", "Terminating early due to previous errors.\n" ] } ], "source": [ "### !!!! IMPORTANT !!!! 
###\n", "### Reset the GPU clocks\n", "!nvidia-smi --reset-gpu-clocks\n", "!nvidia-smi --reset-memory-clocks" ] }, { "cell_type": "markdown", "id": "a8838d46-3d6f-40d1-9741-d8702f6c8a45", "metadata": {}, "source": [ "## 4.2 Cleanup" ] }, { "cell_type": "code", "execution_count": 22, "id": "937799e5-a4f0-40d8-88fe-508961d4d8f5", "metadata": {}, "outputs": [], "source": [ "%%bash\n", "# delete old checkpoints\n", "rm -rf /tmp/nemo_run/checkpoints/" ] }, { "cell_type": "code", "execution_count": 23, "id": "59138478-1bd1-47b9-87da-2c7516464221", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[
────── Entering Experiment resiliency-in-pretraining-demo with id: resiliency-in-pretraining-demo_1741387633 ──────\n", "\n" ], "text/plain": [ "\u001b[92m────── \u001b[0m\u001b[1;35mEntering Experiment resiliency-in-pretraining-demo with id: resiliency-in-pretraining-demo_1741387633\u001b[0m\u001b[92m ──────\u001b[0m\n" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "Launching training run ...\n" ] }, { "data": { "text/html": [ "
[22:47:13] Cannot detach from this experiment. Please keep it running until completion. experiment.py:651\n", "\n" ], "text/plain": [ "\u001b[2;36m[22:47:13]\u001b[0m\u001b[2;36m \u001b[0m\u001b[1;31m Cannot detach from this experiment. Please keep it running until completion.\u001b[0m \u001b]8;id=867994;file:///opt/NeMo-Run/src/nemo_run/run/experiment.py\u001b\\\u001b[2mexperiment.py\u001b[0m\u001b]8;;\u001b\\\u001b[2m:\u001b[0m\u001b]8;id=660744;file:///opt/NeMo-Run/src/nemo_run/run/experiment.py#651\u001b\\\u001b[2m651\u001b[0m\u001b]8;;\u001b\\\n" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stderr", "output_type": "stream", "text": [ "Log directory is: /root/.nemo_run/experiments/resiliency-in-pretraining-demo/resiliency-in-pretraining-demo_1741387633/resiliency-in-pretraining-demo\n" ] }, { "data": { "text/html": [ "
Launching job resiliency-in-pretraining-demo for experiment experiment.py:724\n", " resiliency-in-pretraining-demo \n", "\n" ], "text/plain": [ "\u001b[2;36m \u001b[0m\u001b[2;36m \u001b[0m\u001b[1;36mLaunching job resiliency-in-pretraining-demo for experiment \u001b[0m \u001b]8;id=583095;file:///opt/NeMo-Run/src/nemo_run/run/experiment.py\u001b\\\u001b[2mexperiment.py\u001b[0m\u001b]8;;\u001b\\\u001b[2m:\u001b[0m\u001b]8;id=29280;file:///opt/NeMo-Run/src/nemo_run/run/experiment.py#724\u001b\\\u001b[2m724\u001b[0m\u001b]8;;\u001b\\\n", "\u001b[2;36m \u001b[0m\u001b[1;36mresiliency-in-pretraining-demo\u001b[0m \u001b[2m \u001b[0m\n" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stderr", "output_type": "stream", "text": [ "Log directory is: /root/.nemo_run/experiments/resiliency-in-pretraining-demo/resiliency-in-pretraining-demo_1741387633/resiliency-in-pretraining-demo\n", "Launched app: local_persistent://nemo_run/resiliency-in-pretraining-demo-bngwzzcstc0p3\n", "AppStatus:\n", " State: RUNNING\n", " Num Restarts: 0\n", " Roles: \n", " Msg:
─────────────────── Waiting for Experiment resiliency-in-pretraining-demo_1741387633 to finish ────────────────────\n", "\n" ], "text/plain": [ "\u001b[92m─────────────────── \u001b[0m\u001b[1;35mWaiting for Experiment resiliency-in-pretraining-demo_1741387633 to finish\u001b[0m\u001b[92m ────────────────────\u001b[0m\n" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
\n",
"\n"
],
"text/plain": [
"\n"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"Experiment Status for resiliency-in-pretraining-demo_1741387633\n", "\n" ], "text/plain": [ "\u001b[1;32mExperiment Status for\u001b[0m \u001b[1;38;5;214mresiliency-in-pretraining-demo_1741387633\u001b[0m\n" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
\n",
"Task 0: resiliency-in-pretraining-demo\n",
"- Status: RUNNING\n",
"- Executor: LocalExecutor\n",
"- Job id: resiliency-in-pretraining-demo-bngwzzcstc0p3\n",
"- Local Directory: /root/.nemo_run/experiments/resiliency-in-pretraining-demo/resiliency-in-pretraining-demo_1741387633/resiliency-in-pretraining-demo\n",
"\n"
],
"text/plain": [
"\n",
"\u001b[1;32mTask 0\u001b[0m: \u001b[1;38;5;214mresiliency-in-pretraining-demo\u001b[0m\n",
"- \u001b[1;32mStatus\u001b[0m: RUNNING\n",
"- \u001b[1;32mExecutor\u001b[0m: LocalExecutor\n",
"- \u001b[1;32mJob id\u001b[0m: resiliency-in-pretraining-demo-bngwzzcstc0p3\n",
"- \u001b[1;32mLocal Directory\u001b[0m: /root/.nemo_run/experiments/resiliency-in-pretraining-demo/resiliency-in-pretraining-demo_1741387633/resiliency-in-pretraining-demo\n"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"\n",
"\n"
],
"text/plain": [
"\n"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"Waiting for job resiliency-in-pretraining-demo-bngwzzcstc0p3 to finish [log=True]...\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"ining-demo/0 W0307 22:47:14.814000 207080 torch/distributed/run.py:793] \n",
"ining-demo/0 W0307 22:47:14.814000 207080 torch/distributed/run.py:793] *****************************************\n",
"ining-demo/0 W0307 22:47:14.814000 207080 torch/distributed/run.py:793] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. \n",
"ining-demo/0 W0307 22:47:14.814000 207080 torch/distributed/run.py:793] *****************************************\n",
"ining-demo/0 I0307 22:47:14.814000 207080 torch/distributed/launcher/api.py:194] Starting elastic_operator with launch configs:\n",
"ining-demo/0 I0307 22:47:14.814000 207080 torch/distributed/launcher/api.py:194] entrypoint : nemo_run.core.runners.fdl_runner\n",
"ining-demo/0 I0307 22:47:14.814000 207080 torch/distributed/launcher/api.py:194] min_nodes : 1\n",
"ining-demo/0 I0307 22:47:14.814000 207080 torch/distributed/launcher/api.py:194] max_nodes : 1\n",
"ining-demo/0 I0307 22:47:14.814000 207080 torch/distributed/launcher/api.py:194] nproc_per_node : 8\n",
"ining-demo/0 I0307 22:47:14.814000 207080 torch/distributed/launcher/api.py:194] run_id : 5493\n",
"ining-demo/0 I0307 22:47:14.814000 207080 torch/distributed/launcher/api.py:194] rdzv_backend : c10d\n",
"ining-demo/0 I0307 22:47:14.814000 207080 torch/distributed/launcher/api.py:194] rdzv_endpoint : localhost:0\n",
"ining-demo/0 I0307 22:47:14.814000 207080 torch/distributed/launcher/api.py:194] rdzv_configs : {'timeout': 900}\n",
"ining-demo/0 I0307 22:47:14.814000 207080 torch/distributed/launcher/api.py:194] max_restarts : 0\n",
"ining-demo/0 I0307 22:47:14.814000 207080 torch/distributed/launcher/api.py:194] monitor_interval : 0.1\n",
"ining-demo/0 I0307 22:47:14.814000 207080 torch/distributed/launcher/api.py:194] log_dir : /root/.nemo_run/experiments/resiliency-in-pretraining-demo/resiliency-in-pretraining-demo_1741387633/resiliency-in-pretraining-demo/nemo_run/resiliency-in-pretraining-demo-bngwzzcstc0p3/torchelastic/resiliency-in-pretraining-demo\n",
"ining-demo/0 I0307 22:47:14.814000 207080 torch/distributed/launcher/api.py:194] metrics_cfg : {}\n",
"ining-demo/0 I0307 22:47:14.814000 207080 torch/distributed/launcher/api.py:194] \n",
"ining-demo/0 I0307 22:47:14.819000 207080 torch/distributed/elastic/agent/server/api.py:845] [default] starting workers for entrypoint: python\n",
"ining-demo/0 I0307 22:47:14.819000 207080 torch/distributed/elastic/agent/server/api.py:662] [default] Rendezvous'ing worker group\n",
"ining-demo/0 I0307 22:47:14.945000 207080 torch/distributed/elastic/agent/server/api.py:525] [default] Rendezvous complete for workers. Result:\n",
"ining-demo/0 I0307 22:47:14.945000 207080 torch/distributed/elastic/agent/server/api.py:525] restart_count=0\n",
"ining-demo/0 I0307 22:47:14.945000 207080 torch/distributed/elastic/agent/server/api.py:525] master_addr=localhost\n",
"ining-demo/0 I0307 22:47:14.945000 207080 torch/distributed/elastic/agent/server/api.py:525] master_port=39547\n",
"ining-demo/0 I0307 22:47:14.945000 207080 torch/distributed/elastic/agent/server/api.py:525] group_rank=0\n",
"ining-demo/0 I0307 22:47:14.945000 207080 torch/distributed/elastic/agent/server/api.py:525] group_world_size=1\n",
"ining-demo/0 I0307 22:47:14.945000 207080 torch/distributed/elastic/agent/server/api.py:525] local_ranks=[0, 1, 2, 3, 4, 5, 6, 7]\n",
"ining-demo/0 I0307 22:47:14.945000 207080 torch/distributed/elastic/agent/server/api.py:525] role_ranks=[0, 1, 2, 3, 4, 5, 6, 7]\n",
"ining-demo/0 I0307 22:47:14.945000 207080 torch/distributed/elastic/agent/server/api.py:525] global_ranks=[0, 1, 2, 3, 4, 5, 6, 7]\n",
"ining-demo/0 I0307 22:47:14.945000 207080 torch/distributed/elastic/agent/server/api.py:525] role_world_sizes=[8, 8, 8, 8, 8, 8, 8, 8]\n",
"ining-demo/0 I0307 22:47:14.945000 207080 torch/distributed/elastic/agent/server/api.py:525] global_world_sizes=[8, 8, 8, 8, 8, 8, 8, 8]\n",
"ining-demo/0 I0307 22:47:14.945000 207080 torch/distributed/elastic/agent/server/api.py:525] \n",
"ining-demo/0 I0307 22:47:14.945000 207080 torch/distributed/elastic/agent/server/api.py:670] [default] Starting worker group\n",
"ining-demo/0 I0307 22:47:14.945000 207080 torch/distributed/elastic/agent/server/local_elastic_agent.py:291] use_agent_store: True\n",
"ining-demo/0 I0307 22:47:14.946000 207080 torch/distributed/elastic/agent/server/local_elastic_agent.py:192] Environment variable 'TORCHELASTIC_ENABLE_FILE_TIMER' not found. Do not start FileTimerServer.\n",
"ining-demo/0 I0307 22:47:14.946000 207080 torch/distributed/elastic/agent/server/local_elastic_agent.py:229] Environment variable 'TORCHELASTIC_HEALTH_CHECK_PORT' not found. Do not start health check.\n",
"ining-demo/0 [default0]:[NeMo W 2025-03-07 22:47:24 nemo_logging:361] /usr/local/lib/python3.10/dist-packages/pyannote/core/notebook.py:134: MatplotlibDeprecationWarning: The get_cmap function was deprecated in Matplotlib 3.7 and will be removed in 3.11. Use ``matplotlib.colormaps[name]`` or ``matplotlib.colormaps.get_cmap()`` or ``pyplot.get_cmap()`` instead.\n",
"ining-demo/0 [default0]: cm = get_cmap(\"Set1\")\n",
"ining-demo/0 [default0]: \n",
"ining-demo/0 [default0]:[NeMo I 2025-03-07 22:47:26 tokenizer_utils:224] Getting Megatron tokenizer for pretrained model name: megatron-gpt-345m, custom vocab file: None, and merges file: None\n",
"ining-demo/0 [default0]:[NeMo I 2025-03-07 22:47:26 tokenizer_utils:130] Getting HuggingFace AutoTokenizer with pretrained_model_name: gpt2, vocab_file: /root/.cache/torch/megatron/megatron-gpt-345m_vocab, merges_files: /root/.cache/torch/megatron/megatron-gpt-345m_merges, special_tokens_dict: {}, and use_fast: False\n",
"ining-demo/0 [default0]:Setup to simulate a preemption if step == 4\n",
"ining-demo/0 [default3]:Setup to simulate a preemption if step == 4\n",
"ining-demo/0 [default5]:Setup to simulate a preemption if step == 4\n",
"ining-demo/0 [default0]:[NeMo I 2025-03-07 22:47:26 nemo_logger:145] Experiments will be logged at /tmp/nemo_run/checkpoints/resiliency-in-pretraining-demo\n",
"ining-demo/0 [default0]:[NeMo I 2025-03-07 22:47:26 megatron_strategy:315] Fixing mis-match between ddp-config & mcore-optimizer config\n",
"ining-demo/0 [default4]:Setup to simulate a preemption if step == 4\n",
"ining-demo/0 [default1]:Setup to simulate a preemption if step == 4\n",
"ining-demo/0 [default7]:Setup to simulate a preemption if step == 4\n",
"ining-demo/0 [default6]:Setup to simulate a preemption if step == 4\n",
"ining-demo/0 [default0]:GPU available: True (cuda), used: True\n",
"ining-demo/0 [default0]:TPU available: False, using: 0 TPU cores\n",
"ining-demo/0 [default0]:HPU available: False, using: 0 HPUs\n",
"ining-demo/0 [default0]:[NeMo W 2025-03-07 22:47:26 nemo_logger:123] No version folders would be created under the log folder as 'resume_if_exists' is enabled.\n",
"ining-demo/0 [default0]:[NeMo W 2025-03-07 22:47:26 nemo_logger:173] \"update_logger_directory\" is True. Overwriting tensorboard logger \"save_dir\" to /tmp/nemo_run/checkpoints/tb_logs\n",
"ining-demo/0 [default0]:[NeMo W 2025-03-07 22:47:26 nemo_logger:189] The Trainer already contains a ModelCheckpoint callback. This will be overwritten.\n",
"ining-demo/0 [default0]:[NeMo W 2025-03-07 22:47:26 nemo_logger:212] The checkpoint callback was told to monitor a validation value and trainer's max_steps was set to 20. Please ensure that max_steps will run for at least 1 epochs to ensure that checkpointing will not error out.\n",
"ining-demo/0 [default0]:[NeMo W 2025-03-07 22:47:26 resume:228] There were no checkpoints found in checkpoint_dir or no checkpoint folder at checkpoint_dir :/tmp/nemo_run/checkpoints/resiliency-in-pretraining-demo/checkpoints. Training from scratch.\n",
"ining-demo/0 [default2]:Setup to simulate a preemption if step == 4\n",
"ining-demo/0 [default0]:[NeMo I 2025-03-07 22:47:27 megatron_init:426] Rank 0 has data parallel group : [0, 2, 4, 6]\n",
"ining-demo/0 [default0]:[NeMo I 2025-03-07 22:47:27 megatron_init:432] Rank 0 has combined group of data parallel and context parallel : [0, 2, 4, 6]\n",
"ining-demo/0 [default0]:[NeMo I 2025-03-07 22:47:27 megatron_init:437] All data parallel group ranks with context parallel combined: [[0, 2, 4, 6], [1, 3, 5, 7]]\n",
"ining-demo/0 [default0]:[NeMo I 2025-03-07 22:47:27 megatron_init:440] Ranks 0 has data parallel rank: 0\n",
"ining-demo/0 [default0]:[NeMo I 2025-03-07 22:47:27 megatron_init:448] Rank 0 has context parallel group: [0]\n",
"ining-demo/0 [default0]:[NeMo I 2025-03-07 22:47:27 megatron_init:451] All context parallel group ranks: [[0], [1], [2], [3], [4], [5], [6], [7]]\n",
"ining-demo/0 [default0]:[NeMo I 2025-03-07 22:47:27 megatron_init:452] Ranks 0 has context parallel rank: 0\n",
"ining-demo/0 [default0]:[NeMo I 2025-03-07 22:47:27 megatron_init:459] Rank 0 has model parallel group: [0, 1]\n",
"ining-demo/0 [default0]:[NeMo I 2025-03-07 22:47:27 megatron_init:460] All model parallel group ranks: [[0, 1], [2, 3], [4, 5], [6, 7]]\n",
"ining-demo/0 [default0]:[NeMo I 2025-03-07 22:47:27 megatron_init:469] Rank 0 has tensor model parallel group: [0, 1]\n",
"ining-demo/0 [default0]:[NeMo I 2025-03-07 22:47:27 megatron_init:473] All tensor model parallel group ranks: [[0, 1], [2, 3], [4, 5], [6, 7]]\n",
"ining-demo/0 [default0]:[NeMo I 2025-03-07 22:47:27 megatron_init:474] Rank 0 has tensor model parallel rank: 0\n",
"ining-demo/0 [default0]:[NeMo I 2025-03-07 22:47:27 megatron_init:494] Rank 0 has pipeline model parallel group: [0]\n",
"ining-demo/0 [default0]:[NeMo I 2025-03-07 22:47:27 megatron_init:506] Rank 0 has embedding group: [0]\n",
"ining-demo/0 [default0]:[NeMo I 2025-03-07 22:47:27 megatron_init:512] All pipeline model parallel group ranks: [[0], [1], [2], [3], [4], [5], [6], [7]]\n",
"ining-demo/0 [default0]:[NeMo I 2025-03-07 22:47:27 megatron_init:513] Rank 0 has pipeline model parallel rank 0\n",
"ining-demo/0 [default0]:[NeMo I 2025-03-07 22:47:27 megatron_init:514] All embedding group ranks: [[0], [1], [2], [3], [4], [5], [6], [7]]\n",
"ining-demo/0 [default0]:[NeMo I 2025-03-07 22:47:27 megatron_init:515] Rank 0 has embedding rank: 0\n",
"ining-demo/0 [default0]:Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/8\n",
"ining-demo/0 [default0]:----------------------------------------------------------------------------------------------------\n",
"ining-demo/0 [default0]:distributed_backend=nccl\n",
"ining-demo/0 [default0]:All distributed processes registered. Starting with 8 processes\n",
"ining-demo/0 [default0]:----------------------------------------------------------------------------------------------------\n",
"ining-demo/0 [default0]:\n",
"ining-demo/0 [default3]:Initializing distributed: GLOBAL_RANK: 3, MEMBER: 4/8\n",
"ining-demo/0 [default5]:Initializing distributed: GLOBAL_RANK: 5, MEMBER: 6/8\n",
"ining-demo/0 [default6]:Initializing distributed: GLOBAL_RANK: 6, MEMBER: 7/8\n",
"ining-demo/0 [default1]:Initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/8\n",
"ining-demo/0 [default2]:Initializing distributed: GLOBAL_RANK: 2, MEMBER: 3/8\n",
"ining-demo/0 [default4]:Initializing distributed: GLOBAL_RANK: 4, MEMBER: 5/8\n",
"ining-demo/0 [default7]:Initializing distributed: GLOBAL_RANK: 7, MEMBER: 8/8\n",
"ining-demo/0 [default0]:[NeMo I 2025-03-07 22:47:29 base:44] Padded vocab_size: 50432, original vocab_size: 50257, dummy tokens: 175.\n",
"ining-demo/0 [default0]:[NeMo I 2025-03-07 22:47:29 megatron_strategy:327] Copying Trainer's 'max_steps' (20) to LR scheduler's 'max_steps'.\n",
"ining-demo/0 [default0]:[NeMo I 2025-03-07 22:47:29 num_microbatches_calculator:228] setting number of microbatches to constant 128\n",
"ining-demo/0 [default0]:[NeMo I 2025-03-07 22:47:29 megatron_parallel:549] > number of parameters on (tensor, pipeline) model parallel rank (0, 0): 54663936\n",
"ining-demo/0 [default0]:[NeMo I 2025-03-07 22:47:29 utils:302] Setting up DistributedDataParallel with config DistributedDataParallelConfig(grad_reduce_in_fp32=True, overlap_grad_reduce=True, overlap_param_gather=True, align_param_gather=False, use_distributed_optimizer=True, num_distributed_optimizer_instances=1, check_for_nan_in_grad=True, bucket_size=40000000, average_in_collective=True, fp8_param_gather=False)\n",
"ining-demo/0 [default0]:[NeMo I 2025-03-07 22:47:29 utils:323] Number of buckets for gradient all-reduce / reduce-scatter: 1\n",
"ining-demo/0 [default0]: Params for bucket 1 (54663936 elements):\n",
"ining-demo/0 [default0]: \tmodule.decoder.layers.3.mlp.linear_fc1.weight\n",
"ining-demo/0 [default0]: \tmodule.decoder.layers.2.self_attention.linear_qkv.layer_norm_weight\n",
"ining-demo/0 [default0]: \tmodule.decoder.layers.1.mlp.linear_fc1.layer_norm_weight\n",
"ining-demo/0 [default0]: \tmodule.decoder.layers.0.mlp.linear_fc2.weight\n",
"ining-demo/0 [default0]: \tmodule.decoder.layers.0.self_attention.linear_qkv.weight\n",
"ining-demo/0 [default0]: \tmodule.decoder.layers.3.self_attention.linear_qkv.weight\n",
"ining-demo/0 [default0]: \tmodule.decoder.layers.2.self_attention.linear_proj.weight\n",
"ining-demo/0 [default0]: \tmodule.decoder.layers.1.self_attention.linear_proj.weight\n",
"ining-demo/0 [default0]: \tmodule.decoder.layers.3.self_attention.linear_qkv.layer_norm_weight\n",
"ining-demo/0 [default0]: \tmodule.decoder.layers.2.mlp.linear_fc1.layer_norm_weight\n",
"ining-demo/0 [default0]: \tmodule.decoder.layers.1.mlp.linear_fc2.weight\n",
"ining-demo/0 [default0]: \tmodule.decoder.final_layernorm.weight\n",
"ining-demo/0 [default0]: \tmodule.decoder.layers.3.self_attention.linear_proj.weight\n",
"ining-demo/0 [default0]: \tmodule.decoder.layers.0.mlp.linear_fc1.layer_norm_weight\n",
"ining-demo/0 [default0]: \tmodule.decoder.layers.3.mlp.linear_fc1.layer_norm_weight\n",
"ining-demo/0 [default0]: \tmodule.decoder.layers.2.mlp.linear_fc2.weight\n",
"ining-demo/0 [default0]: \tmodule.decoder.layers.1.mlp.linear_fc1.weight\n",
"ining-demo/0 [default0]: \tmodule.decoder.layers.0.self_attention.linear_proj.weight\n",
"ining-demo/0 [default0]: \tmodule.decoder.layers.1.self_attention.linear_qkv.weight\n",
"ining-demo/0 [default0]: \tmodule.decoder.layers.3.mlp.linear_fc2.weight\n",
"ining-demo/0 [default0]: \tmodule.decoder.layers.2.mlp.linear_fc1.weight\n",
"ining-demo/0 [default0]: \tmodule.decoder.layers.1.self_attention.linear_qkv.layer_norm_weight\n",
"ining-demo/0 [default0]: \tmodule.embedding.word_embeddings.weight\n",
"ining-demo/0 [default0]: \tmodule.output_layer.weight\n",
"ining-demo/0 [default0]: \tmodule.decoder.layers.2.self_attention.linear_qkv.weight\n",
"ining-demo/0 [default0]: \tmodule.decoder.layers.0.mlp.linear_fc1.weight\n",
"ining-demo/0 [default0]: \tmodule.decoder.layers.0.self_attention.linear_qkv.layer_norm_weight\n",
"ining-demo/0 [default0]:[NeMo I 2025-03-07 22:47:29 utils:302] Setting up optimizer with config OptimizerConfig(optimizer='adam', lr=0.0003, min_lr=None, decoupled_lr=None, decoupled_min_lr=None, weight_decay=0.1, fp16=False, bf16=True, params_dtype=torch.bfloat16, loss_scale=None, initial_loss_scale=4294967296, min_loss_scale=1.0, loss_scale_window=1000, hysteresis=2, adam_beta1=0.9, adam_beta2=0.95, adam_eps=1e-05, sgd_momentum=0.9, use_distributed_optimizer=True, overlap_param_gather_with_optimizer_step=False, clip_grad=1.0, log_num_zeros_in_grad=False, barrier_with_L1_time=False, timers=None, config_logger_dir='')\n",
"ining-demo/0 [default0]:LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]\n",
"ining-demo/0 [default0]:\n",
"ining-demo/0 [default0]: | Name | Type | Params | Mode \n",
"ining-demo/0 [default0]:----------------------------------------\n",
"ining-demo/0 [default0]:0 | module | DDP | 54.7 M | train\n",
"ining-demo/0 [default0]:----------------------------------------\n",
"ining-demo/0 [default0]:54.7 M Trainable params\n",
"ining-demo/0 [default0]:0 Non-trainable params\n",
"ining-demo/0 [default0]:54.7 M Total params\n",
"ining-demo/0 [default0]:218.656 Total estimated model params size (MB)\n",
"ining-demo/0 [default0]:91 Modules in train mode\n",
"ining-demo/0 [default0]:0 Modules in eval mode\n",
"ining-demo/0 [default6]:LOCAL_RANK: 6 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]\n",
"ining-demo/0 [default2]:LOCAL_RANK: 2 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]\n",
"ining-demo/0 [default7]:LOCAL_RANK: 7 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]\n",
"ining-demo/0 [default4]:LOCAL_RANK: 4 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]\n",
"ining-demo/0 [default5]:LOCAL_RANK: 5 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]\n",
"ining-demo/0 [default3]:LOCAL_RANK: 3 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]\n",
"ining-demo/0 [default1]:LOCAL_RANK: 1 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]\n",
"ining-demo/0 [default0]:[NeMo W 2025-03-07 22:47:39 rerun_state_machine:1088] Implicit initialization of Rerun State Machine!\n",
"ining-demo/0 [default0]:[NeMo W 2025-03-07 22:47:39 rerun_state_machine:211] RerunStateMachine initialized in mode disabled\n",
"ining-demo/0 [default0]:Training epoch 0, iteration 0/19 | lr: 1.499e-07 | global_batch_size: 512 | global_step: 0 | reduced_train_loss: 11.03 | train_step_timing in s: 10.43\n",
"ining-demo/0 [default0]:Training epoch 0, iteration 1/19 | lr: 2.999e-07 | global_batch_size: 512 | global_step: 1 | reduced_train_loss: 11.03 | train_step_timing in s: 7.6 | consumed_samples: 1024\n",
"ining-demo/0 [default0]:Training epoch 0, iteration 2/19 | lr: 4.498e-07 | global_batch_size: 512 | global_step: 2 | reduced_train_loss: 11.03 | train_step_timing in s: 7.599 | consumed_samples: 1536\n",
"ining-demo/0 [default0]:Simulating preemption by raising a SIGTERM at step 4!\n",
"ining-demo/0 [default0]:[NeMo I 2025-03-07 22:48:03 preemption:87] Received signal 15, initiating graceful stop\n",
"ining-demo/0 [default0]:[NeMo I 2025-03-07 22:48:03 preemption:67] Preemption detected, saving checkpoint and exiting\n",
"ining-demo/0 [default4]:Simulating preemption by raising a SIGTERM at step 4!\n",
"ining-demo/0 [default1]:Simulating preemption by raising a SIGTERM at step 4!\n",
"ining-demo/0 [default3]:Simulating preemption by raising a SIGTERM at step 4!\n",
"ining-demo/0 [default2]:Simulating preemption by raising a SIGTERM at step 4!\n",
"ining-demo/0 [default6]:Simulating preemption by raising a SIGTERM at step 4!\n",
"ining-demo/0 [default5]:Simulating preemption by raising a SIGTERM at step 4!\n",
"ining-demo/0 [default7]:Simulating preemption by raising a SIGTERM at step 4!\n",
"ining-demo/0 [default0]:[NeMo W 2025-03-07 22:48:03 validation:389] There is difference in the common state dict in different ranks. The differences are {2: ([], [('optimizer', 0, 'optimizer', 'param_groups', 1, 'step')], []), 3: ([], [('optimizer', 0, 'optimizer', 'param_groups', 1, 'step')], []), 4: ([], [('optimizer', 0, 'optimizer', 'param_groups', 1, 'step')], []), 5: ([], [('optimizer', 0, 'optimizer', 'param_groups', 1, 'step')], [])}\n",
"ining-demo/0 [default0]:[NeMo I 2025-03-07 22:48:06 model_checkpoint:497] Scheduled async checkpoint save for /tmp/nemo_run/checkpoints/resiliency-in-pretraining-demo/checkpoints/model_name=0--val_loss=0.00-step=3-consumed_samples=2048.0-last.ckpt\n",
"ining-demo/0 [default0]:[NeMo I 2025-03-07 22:48:06 preemption:73] Async checkpointing detected, waiting for it to complete\n",
"ining-demo/0 [default0]:[NeMo I 2025-03-07 22:48:07 model_checkpoint:522] Async checkpoint save for step 4 (/tmp/nemo_run/checkpoints/resiliency-in-pretraining-demo/checkpoints/model_name=0--val_loss=0.00-step=3-consumed_samples=2048.0-last.ckpt) finalized successfully.\n",
"ining-demo/0 I0307 22:48:12.869000 207080 torch/distributed/elastic/agent/server/api.py:864] [default] worker group successfully finished. Waiting 300 seconds for other agents to finish.\n",
"ining-demo/0 I0307 22:48:12.870000 207080 torch/distributed/elastic/agent/server/api.py:917] Local worker group finished (WorkerState.SUCCEEDED). Waiting 300 seconds for other agents to finish\n",
"ining-demo/0 I0307 22:48:12.870000 207080 torch/distributed/elastic/agent/server/api.py:931] Done waiting for other agents. Elapsed: 0.00019741058349609375 seconds\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"Job resiliency-in-pretraining-demo-bngwzzcstc0p3 finished: SUCCEEDED\n"
]
},
{
"data": {
"text/html": [
"\n", "# The experiment was run with the following tasks: ['resiliency-in-pretraining-demo'] \n", "# You can inspect and reconstruct this experiment at a later point in time using: \n", "experiment = run.Experiment.from_id(\"resiliency-in-pretraining-demo_1741387633\") \n", "experiment.status() # Gets the overall status \n", "experiment.logs(\"resiliency-in-pretraining-demo\") # Gets the log for the provided task \n", "experiment.cancel(\"resiliency-in-pretraining-demo\") # Cancels the provided task if still running \n", " \n", "\n" ], "text/plain": [ "\u001b[48;2;39;40;34m \u001b[0m\n", "\u001b[38;2;149;144;119;48;2;39;40;34m# The experiment was run with the following tasks: ['resiliency-in-pretraining-demo']\u001b[0m\u001b[48;2;39;40;34m \u001b[0m\n", "\u001b[38;2;149;144;119;48;2;39;40;34m# You can inspect and reconstruct this experiment at a later point in time using:\u001b[0m\u001b[48;2;39;40;34m \u001b[0m\n", "\u001b[38;2;248;248;242;48;2;39;40;34mexperiment\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;255;70;137;48;2;39;40;34m=\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mrun\u001b[0m\u001b[38;2;255;70;137;48;2;39;40;34m.\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mExperiment\u001b[0m\u001b[38;2;255;70;137;48;2;39;40;34m.\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mfrom_id\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m(\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m\"\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34mresiliency-in-pretraining-demo_1741387633\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m\"\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m)\u001b[0m\u001b[48;2;39;40;34m \u001b[0m\n", "\u001b[38;2;248;248;242;48;2;39;40;34mexperiment\u001b[0m\u001b[38;2;255;70;137;48;2;39;40;34m.\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mstatus\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m(\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m)\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;149;144;119;48;2;39;40;34m# Gets the overall status\u001b[0m\u001b[48;2;39;40;34m \u001b[0m\n", "\u001b[38;2;248;248;242;48;2;39;40;34mexperiment\u001b[0m\u001b[38;2;255;70;137;48;2;39;40;34m.\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mlogs\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m(\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m\"\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34mresiliency-in-pretraining-demo\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m\"\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m)\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;149;144;119;48;2;39;40;34m# Gets the log for the provided task\u001b[0m\u001b[48;2;39;40;34m \u001b[0m\n", "\u001b[38;2;248;248;242;48;2;39;40;34mexperiment\u001b[0m\u001b[38;2;255;70;137;48;2;39;40;34m.\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mcancel\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m(\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m\"\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34mresiliency-in-pretraining-demo\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m\"\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m)\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;149;144;119;48;2;39;40;34m# Cancels the provided task if still running\u001b[0m\u001b[48;2;39;40;34m \u001b[0m\n", "\u001b[48;2;39;40;34m \u001b[0m\n" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
\n", "# You can inspect this experiment at a later point in time using the CLI as well: \n", "nemo experiment status resiliency-in-pretraining-demo_1741387633 \n", "nemo experiment logs resiliency-in-pretraining-demo_1741387633 0 \n", "nemo experiment cancel resiliency-in-pretraining-demo_1741387633 0 \n", " \n", "\n" ], "text/plain": [ "\u001b[48;2;39;40;34m \u001b[0m\n", "\u001b[38;2;149;144;119;48;2;39;40;34m# You can inspect this experiment at a later point in time using the CLI as well:\u001b[0m\u001b[48;2;39;40;34m \u001b[0m\n", "\u001b[38;2;248;248;242;48;2;39;40;34mnemo\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mexperiment\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mstatus\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mresiliency-in-pretraining-demo_1741387633\u001b[0m\u001b[48;2;39;40;34m \u001b[0m\n", "\u001b[38;2;248;248;242;48;2;39;40;34mnemo\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mexperiment\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mlogs\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mresiliency-in-pretraining-demo_1741387633\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;174;129;255;48;2;39;40;34m0\u001b[0m\u001b[48;2;39;40;34m \u001b[0m\n", "\u001b[38;2;248;248;242;48;2;39;40;34mnemo\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mexperiment\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mcancel\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mresiliency-in-pretraining-demo_1741387633\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;174;129;255;48;2;39;40;34m0\u001b[0m\u001b[48;2;39;40;34m \u001b[0m\n", "\u001b[48;2;39;40;34m \u001b[0m\n" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# run the experiment\n", "run_experiment(exp_name, pretrain, executor, run_plugins, dryrun=False)" ] }, { "cell_type": "markdown", "id": "1bf9c6f8-ef79-4499-845a-72d1a28e43b1", "metadata": {}, "source": [ "## 4.2 Cleanup" ] }, { "cell_type": "code", "execution_count": 26, "id": "e5dbf5ea-9ef3-494e-b2d7-faafb1a1be58", "metadata": {}, "outputs": [], "source": [ "%%bash\n", "# delete old checkpoints\n", "rm -rf /tmp/nemo_run/checkpoints/" ] }, { "cell_type": "code", "execution_count": 27, "id": "80849855-972d-4121-ac77-65a10fe610a8", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[