62M checkpoint appears incomplete/truncated despite matching LFS sha256

by KantaHayashiAI - opened about 12 hours ago

Hi NVIDIA team,

I may have found an issue with the 62M proxy checkpoint artifact:

nemotron_climb_proxy_model_62m/iter_2500000/mp_rank_00/model_optim_rng.pt

The downloaded file matches the Hugging Face LFS metadata exactly, so this does not look like local download or cache corruption:

file size: 770,527,232 bytes
LFS sha256 / local sha256: f339e80c501ead58cdd067442a68e3930fd3438ac47f9b95d555ea532c84ca01
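
For anyone who wants to repeat the size and hash comparison, here is a minimal sketch using only the Python standard library. The helper name file_size_and_sha256 is hypothetical, and the path is assumed to point at a local copy of the file; the result is meant to be compared by hand against the values shown on the repository's file page.

```python
import hashlib
import os


def file_size_and_sha256(path, chunk_size=1 << 20):
    """Return (size_in_bytes, hex_sha256) for a file, reading it in chunks.

    Hypothetical helper for illustration; chunked reads keep memory use
    low for a file of several hundred megabytes.
    """
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # iter() with a sentinel stops when read() returns an empty bytes object.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return os.path.getsize(path), digest.hexdigest()


if __name__ == "__main__":
    # Assumed local path; adjust to wherever the checkpoint was downloaded.
    path = "nemotron_climb_proxy_model_62m/iter_2500000/mp_rank_00/model_optim_rng.pt"
    if os.path.exists(path):
        size, sha = file_size_and_sha256(path)
        print(f"file size: {size:,} bytes")
        print(f"sha256: {sha}")
```

If both values match the LFS metadata, the local copy is byte-identical to what was uploaded, which points at the published artifact rather than the download.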

However, the checkpoint cannot be loaded as a PyTorch checkpoint:

PytorchStreamReader failed reading zip archive: failed finding central directory

I also inspected the zip/PyTorch-serialization structure:

the file starts with a valid local zip header (PK\x03\x04)
the tail contains no EOCD (PK\x05\x06)
the tail contains no ZIP64 EOCD (PK\x06\x06)
the tail contains no central directory headers (PK\x01\x02)
the last stream-local entry I can find is model_optim_rng/data/798
expected later records such as model_optim_rng/data/8, model_optim_rng/version, and model_optim_rng/.data/serialization_id are absent
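
The signature checks above can be sketched roughly as follows. This is an illustrative version, not the exact script I ran: the function name inspect_zip_markers and the 1 MiB tail window are assumptions, and it only looks for the raw signature bytes rather than parsing the records.

```python
import os

# ZIP record signatures searched for in the tail of the file.
SIGNATURES = {
    "EOCD": b"PK\x05\x06",
    "ZIP64 EOCD": b"PK\x06\x06",
    "central directory header": b"PK\x01\x02",
}


def inspect_zip_markers(path, tail_bytes=1 << 20):
    """Report which ZIP markers are present (hypothetical helper).

    Checks the first 4 bytes for a local file header and searches the
    last `tail_bytes` bytes for end-of-archive structures.
    """
    size = os.path.getsize(path)
    with open(path, "rb") as f:
        head = f.read(4)
        f.seek(max(0, size - tail_bytes))
        tail = f.read()
    report = {"local header at start": head == b"PK\x03\x04"}
    for name, signature in SIGNATURES.items():
        report[name] = signature in tail
    return report


if __name__ == "__main__":
    # Assumed local path to the downloaded checkpoint.
    path = "nemotron_climb_proxy_model_62m/iter_2500000/mp_rank_00/model_optim_rng.pt"
    if os.path.exists(path):
        for name, found in inspect_zip_markers(path).items():
            print(f"{name}: {'found' if found else 'missing'}")
```

A raw byte search can produce false positives if tensor data happens to contain a signature, so a "found" result is weaker evidence than a "missing" one. A complete archive should report the EOCD and at least one central directory header.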

As a repairability check, I rebuilt a zip central directory from the surviving local file headers and added version / .data/serialization_id. The rebuilt archive became valid as a zip file, but torch.load still failed:

PytorchStreamReader failed locating file data/8: file not found

So this appears to be more than a missing central directory; some PyTorch storage records themselves seem to be missing from the published 62M artifact.
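
To see which records actually survive in a file with no central directory, one option is to scan for local file headers and read the entry name stored in each. The sketch below is a heuristic under stated assumptions, not the repair script itself: list_local_entry_names is a hypothetical helper, and because it searches for the raw PK\x03\x04 signature it can report spurious names if that byte sequence occurs inside tensor data.

```python
import mmap
import struct

LOCAL_HEADER_SIG = b"PK\x03\x04"
LOCAL_HEADER_SIZE = 30  # fixed-size part of a ZIP local file header


def list_local_entry_names(path):
    """Return entry names found in local file headers, in file order.

    Hypothetical helper: scans the raw bytes, so it also works on
    archives whose central directory is missing.
    """
    names = []
    with open(path, "rb") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        pos = mm.find(LOCAL_HEADER_SIG)
        while pos != -1:
            header = mm[pos:pos + LOCAL_HEADER_SIZE]
            if len(header) == LOCAL_HEADER_SIZE:
                # The file name length is a little-endian uint16 at offset 26.
                (name_len,) = struct.unpack_from("<H", header, 26)
                raw_name = mm[pos + LOCAL_HEADER_SIZE:pos + LOCAL_HEADER_SIZE + name_len]
                names.append(raw_name.decode("utf-8", errors="replace"))
            pos = mm.find(LOCAL_HEADER_SIG, pos + 4)
    return names
```

Comparing the returned names against the records that torch.load asks for (for example data/8 in the error above) shows which storages are absent from the file, as opposed to merely unindexed.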

As a control, the 350M checkpoint:

nemotron_climb_proxy_model_350m/iter_2384053/mp_rank_00/model_optim_rng.pt

does load correctly in the same environment, and its zip structure includes the expected central directory and EOCD records.

Could you please verify whether the 62M model_optim_rng.pt upload is complete, or re-upload the 62M checkpoint? If the checkpoint was intended to be split or there is an alternate 62M checkpoint source, it would be helpful to document that as well.

Thanks for releasing these proxy models.

sarahyurick

NVIDIA org about 1 hour ago

Hi @KantaHayashiAI, thanks for reporting! I am able to reproduce the error too. Let me see about uploading a fix soon.

For documentation purposes, my code for reproducing the error is just:

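import torch

path = "nemotron_climb_proxy_model_62m/iter_2500000/mp_rank_00/model_optim_rng.pt"
# weights_only=False allows full unpickling, so only use it on trusted files.
# With the current 62M artifact this call fails with the PytorchStreamReader error reported above.
torch.load(path, map_location="cpu", weights_only=False)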
