SAM2-OCT: Fine-tuned SAM2 checkpoints for OCT segmentation

Fine-tuned SAM 2 (Segment Anything Model 2) checkpoints for interactive, multi-class segmentation of retinal Optical Coherence Tomography (OCT) images. Developed as part of an MSc dissertation at the University of the Witwatersrand.

These weights are designed to be used with the companion annotation tool CVAT-OCT (a fork of CVAT with a SAM2-OCT serverless interactor).

⚠️ Research use only. These models are experimental research artifacts and are not a medical device. They must not be used for clinical diagnosis, screening, or treatment decisions.

Model description

These checkpoints adapt SAM 2 for multi-class retinal layer segmentation using a semantically aware modification: SAM's generic mask tokens are replaced with dedicated per-layer mask tokens and per-class output heads, so that every retinal layer class is predicted in a single forward pass, while SAM 2's interactive prompting interface is preserved for optional manual refinement. The image encoder is SAM 2.1's Hiera Base+ backbone, fine-tuned end-to-end together with the modified mask decoder.

Key properties:

  • Single-pass multi-class output: one mask channel per retinal layer, rather than one binary mask per prompt.
  • Interactive-ready: point / box / rough-mask prompting is retained for human-in-the-loop correction (see the MGU_prompted checkpoint).
  • Data-efficient: on the macular (NR206) task the approach substantially outperforms a purpose-built specialised baseline when annotated data is scarce.
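
As an illustration of the single-pass multi-class output, the sketch below collapses one mask channel per layer into a single label map. It is a minimal sketch, not the released inference code: the function name `logits_to_label_map`, the zero threshold, and the rule that a pixel is background when no channel exceeds the threshold are all assumptions made for the example.

```python
from typing import List

Grid = List[List[float]]


def logits_to_label_map(class_logits: List[Grid], threshold: float = 0.0) -> List[List[int]]:
    """Collapse per-class mask logits (one H x W grid per foreground class) into a label map.

    Label 0 is background; label k (k >= 1) is the k-th foreground channel.
    ASSUMPTION: a pixel is background when no channel exceeds `threshold`.
    """
    height, width = len(class_logits[0]), len(class_logits[0][0])
    label_map = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            best_class, best_score = 0, threshold
            for k, grid in enumerate(class_logits, start=1):
                if grid[y][x] > best_score:
                    best_class, best_score = k, grid[y][x]
            label_map[y][x] = best_class
    return label_map


# Toy example: two foreground channels on a 2 x 3 image.
toy_logits = [
    [[2.0, -1.0, -3.0], [0.5, 0.2, -2.0]],  # channel 1
    [[1.0, 3.0, -1.0], [-0.5, 0.9, -4.0]],  # channel 2
]
label_map = logits_to_label_map(toy_logits)
print(label_map)  # [[1, 2, 0], [1, 2, 0]]
```

With a binary mask per prompt, the same result would need one prompted forward pass per layer; here every layer is read from the channels of a single output.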

Checkpoints

File: MGU/final_runs_Glaucoma_last.pt
Base: SAM2.1 Hiera Base+
Description: Semantically aware SAM2 trained on the MGU peripapillary (glaucoma) dataset. Automatic single-pass segmentation of nine retinal layers plus the optic-disc region (ten foreground classes + background) on peripapillary OCT B-scans.

File: MGU_prompted/MGU_prompt_training_last.pt
Base: SAM2.1 Hiera Base+
Description: Prompted variant of the MGU model. Adds class-aware point and rough-mask prompt encoders so a reviewer can interactively guide or correct the output. Its automatic prediction matches the standard MGU model; brushing a rough mask over an error region improves the local segmentation (≈ +6.5% mIoU in the automatic-prompt evaluation), whereas point prompts did not yield a measurable gain in the current form.

File: NR206/final_runs_NR206_last.pt
Base: SAM2.1 Hiera Base+
Description: Semantically aware SAM2 trained on the NR206 macular dataset (healthy eyes). Automatic single-pass segmentation of eight retinal layers (+ background) on macular OCT B-scans.

Each checkpoint is ~880 MB.

The base sam2.1_hiera_base_plus.pt checkpoint is not included here; download it from Meta's SAM2 releases. Only the fine-tuned OCT weights are hosted in this repository.
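
For scripting against the repository, the three checkpoints and their label sets can be summarised in a small lookup. This is a sketch only: the `CheckpointInfo` type and helper are hypothetical, and the class order simply follows the order in which this card lists the layers, which may not match the channel order stored in the weights.

```python
from typing import Dict, List, NamedTuple


class CheckpointInfo(NamedTuple):
    path: str                      # path inside the repository
    region: str                    # scan region the checkpoint is specialised to
    foreground_classes: List[str]  # ASSUMPTION: order as listed in this card


MGU_CLASSES = ["RNFL", "GCL", "IPL", "INL", "OPL", "ONL", "IS/OS", "RPE", "Choroid", "Disc"]
NR206_CLASSES = ["NFL", "GCL+IPL", "INL", "OPL", "ONL", "ELM+IS", "OS", "RPE"]

CHECKPOINTS: Dict[str, CheckpointInfo] = {
    "MGU": CheckpointInfo("MGU/final_runs_Glaucoma_last.pt", "peripapillary", MGU_CLASSES),
    "MGU_prompted": CheckpointInfo(
        "MGU_prompted/MGU_prompt_training_last.pt", "peripapillary", MGU_CLASSES
    ),
    "NR206": CheckpointInfo("NR206/final_runs_NR206_last.pt", "macular", NR206_CLASSES),
}


def num_labels(name: str) -> int:
    """Number of labels including background (foreground classes + 1)."""
    return len(CHECKPOINTS[name].foreground_classes) + 1


for name, info in CHECKPOINTS.items():
    print(f"{name}: {info.path} ({info.region}, {num_labels(name)} labels incl. background)")
```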

Intended use

  • Interactive / automatic segmentation of OCT structures within the CVAT-OCT tool.
  • Research and educational exploration of SAM2 for medical image segmentation.

Out of scope

  • Any clinical, diagnostic, or patient-facing use.
  • Deployment on imaging modalities or populations other than those the checkpoints were trained on (results are not expected to transfer). In particular, each checkpoint is specialised to its dataset's acquisition device, scan region (macular vs. peripapillary), and label set; cross-device / cross-region generalisation is not guaranteed.

How to use (with CVAT-OCT)

Download the checkpoint(s) into the matching models/ sub-directory of the SAM2-OCT serverless function, then start CVAT-OCT:

# from the root of a cvat-oct clone
mkdir -p serverless/pytorch/sam2-OCT-interactor/models/MGU

# Option A: huggingface_hub (recommended)
pip install huggingface_hub
python - <<'PY'
from huggingface_hub import hf_hub_download
hf_hub_download(
    repo_id="enslinr/sam2-oct",                     # <-- your HF repo id
    filename="MGU/final_runs_Glaucoma_last.pt",
    local_dir="serverless/pytorch/sam2-OCT-interactor/models",
)
PY

# Option B: direct download
# wget https://huggingface.co/enslinr/sam2-oct/resolve/main/MGU/final_runs_Glaucoma_last.pt \
#   -O serverless/pytorch/sam2-OCT-interactor/models/MGU/final_runs_Glaucoma_last.pt

Point the function at the checkpoint via the SAM2_CHECKPOINT environment variable (see serverless/pytorch/sam2-OCT-interactor/function.yaml and docker-compose.override.yml in the CVAT-OCT repo). A short video walkthrough of the end-to-end annotation workflow is available at https://enslinr.github.io/cvat-oct/.
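
To check the setting before starting the stack, something like the following can resolve SAM2_CHECKPOINT and confirm that the file is present. This is a hypothetical helper, not code from CVAT-OCT: how the serverless function actually reads the variable is not shown here, and `DEFAULT_CHECKPOINT` is an assumed fallback used only for illustration.

```python
import os
from pathlib import Path
from typing import Mapping, Optional

# ASSUMPTION: illustrative fallback, used only when SAM2_CHECKPOINT is unset.
DEFAULT_CHECKPOINT = "models/MGU/final_runs_Glaucoma_last.pt"


def resolve_checkpoint(env: Optional[Mapping[str, str]] = None) -> Path:
    """Return the checkpoint path named by SAM2_CHECKPOINT (or the fallback)."""
    env = os.environ if env is None else env
    path = Path(env.get("SAM2_CHECKPOINT", DEFAULT_CHECKPOINT))
    if path.suffix != ".pt":
        raise ValueError(f"expected a .pt checkpoint, got: {path}")
    return path


def describe(path: Path) -> str:
    """Report whether the checkpoint exists and, if so, its size in MB."""
    if not path.is_file():
        return f"missing: {path}"
    size_mb = path.stat().st_size / 1e6
    return f"found: {path} ({size_mb:.0f} MB)"
```

Calling `describe(resolve_checkpoint())` from the directory that contains `models/` should report a file of roughly 880 MB when the download succeeded.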

Training data

All training data are publicly available, fully anonymised OCT datasets; no new human data were collected. Images are grayscale OCT B-scans.

NR206 (macular, healthy eyes): 206 macular B-scans of healthy human eyes, derived from the OCTID database. Acquired with a Cirrus HD-OCT device (Carl Zeiss Meditec) using an 840 nm source (≈ 5 µm axial resolution); original resolution 500 × 750 px. Labels cover 8 retinal-layer classes (NFL, GCL+IPL, INL, OPL, ONL, ELM+IS, OS, RPE) plus background. Author-provided split: 126 train / 40 val / 40 test. Dataset: He et al., Frontiers in Bioengineering and Biotechnology, 2023 (NR206).

MGU (peripapillary, glaucoma): peripapillary OCT from 61 subjects (Shanghai General Hospital), acquired with a DRI OCT-1 Atlantis device (Topcon) over a 20.48 × 7.94 mm field centred on the optic nerve head; original resolution 1024 × 992 px. 122 manually annotated B-scans covering 10 foreground classes (nine retinal layers: RNFL, GCL, IPL, INL, OPL, ONL, IS/OS, RPE, Choroid; and the optic-disc region) plus background. Author-provided split: 74 train / 24 val / 24 test. Dataset: Li et al., Biomedical Optics Express, 2021 (MGU).

Preprocessing. B-scans are resized to SAM 2's 1024 × 1024 input (stretch resizing, selected from a resize-strategy comparison), the single grayscale channel is duplicated across the three input channels, and images are normalised with ImageNet mean/standard deviation. Training used data augmentation (rotation, brightness/contrast jitter, Gaussian noise and blur, elastic and grid distortion, gamma adjustment, and CLAHE).
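
The inference-time part of this pipeline (stretch resize, channel duplication, ImageNet normalisation) can be sketched in plain Python as below. It is a sketch under stated assumptions rather than the training code: nearest-neighbour interpolation and scaling 8-bit intensities to [0, 1] before normalisation are assumptions, and the function names are illustrative. The mean and standard deviation are the standard ImageNet values.

```python
from typing import List

Image = List[List[float]]

IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)


def stretch_resize(image: Image, size: int) -> Image:
    """Stretch to size x size, ignoring aspect ratio.

    ASSUMPTION: nearest-neighbour sampling, chosen only to keep the sketch short.
    """
    src_h, src_w = len(image), len(image[0])
    return [
        [image[(y * src_h) // size][(x * src_w) // size] for x in range(size)]
        for y in range(size)
    ]


def preprocess(image: Image, size: int = 1024) -> List[Image]:
    """Grayscale B-scan (0..255) -> three normalised channels, indexed [C][H][W]."""
    resized = stretch_resize(image, size)
    channels = []
    for mean, std in zip(IMAGENET_MEAN, IMAGENET_STD):
        # The same grayscale values feed every channel; only mean/std differ.
        channels.append([[(px / 255.0 - mean) / std for px in row] for row in resized])
    return channels


# Toy 2 x 3 "B-scan", resized to 4 x 4 instead of 1024 x 1024 for readability.
toy = [[0, 255, 128], [64, 32, 16]]
out = preprocess(toy, size=4)
print(len(out), len(out[0]), len(out[0][0]))  # 3 4 4
```

Because the resize ignores aspect ratio, the two axes are scaled by different factors, which is what "stretch" means here.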

Ethics / data use. The study used only publicly available, de-identified datasets and collected no new human data; the University of the Witwatersrand granted a waiver of ethics clearance (Ethics Waiver Number: WCSAM-2024-19). Because the models are derived solely from these public datasets, releasing the fine-tuned weights is consistent with that use. Users should nonetheless comply with the terms of the underlying NR206 and MGU datasets.

Training procedure

  • Fine-tuned end-to-end from SAM 2.1 Hiera Base+, with SAM's mask tokens replaced by per-layer tokens and per-class MLP output heads.
  • Loss: a combined objective (Focal + Soft Dice + Soft IoU).
  • Optimisation: separate learning rates for the mask decoder (≈ 7.2 × 10⁻³) and the image encoder (≈ 2.2 × 10⁻⁷), AdamW-style weight decay 0.01, gradient clipping 2.0, cosine schedule with warmup. Final hyperparameters were selected via a 110-run Bayesian sweep (Weights & Biases) optimising validation mIoU.
  • Hardware: a single NVIDIA GeForce RTX 3090.
  • The MGU_prompted checkpoint additionally trains class-aware sparse (point) and dense (rough-mask) prompt encoders so that interactive prompts can target a specific layer.
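
The combined objective named in the list above can be sketched per mask channel as follows. This is a minimal sketch, not the dissertation's implementation: the equal term weights, the focal exponent `gamma = 2.0`, the smoothing constant, and the binary per-channel formulation are all assumptions, since the card does not state them.

```python
import math
from typing import Sequence, Tuple

EPS = 1e-6  # ASSUMPTION: smoothing constant for the soft overlap terms


def focal_loss(probs: Sequence[float], targets: Sequence[int], gamma: float = 2.0) -> float:
    """Mean binary focal loss; down-weights pixels that are already well classified."""
    total = 0.0
    for p, t in zip(probs, targets):
        p_t = p if t == 1 else 1.0 - p
        p_t = min(max(p_t, EPS), 1.0)
        total += -((1.0 - p_t) ** gamma) * math.log(p_t)
    return total / len(probs)


def soft_dice_loss(probs: Sequence[float], targets: Sequence[int]) -> float:
    """1 - soft Dice, computed on probabilities rather than thresholded masks."""
    inter = sum(p * t for p, t in zip(probs, targets))
    return 1.0 - (2.0 * inter + EPS) / (sum(probs) + sum(targets) + EPS)


def soft_iou_loss(probs: Sequence[float], targets: Sequence[int]) -> float:
    """1 - soft IoU (soft Jaccard)."""
    inter = sum(p * t for p, t in zip(probs, targets))
    union = sum(probs) + sum(targets) - inter
    return 1.0 - (inter + EPS) / (union + EPS)


def combined_loss(
    probs: Sequence[float],
    targets: Sequence[int],
    weights: Tuple[float, float, float] = (1.0, 1.0, 1.0),  # ASSUMPTION: equal weights
) -> float:
    """Weighted sum of the focal, soft Dice and soft IoU terms for one channel."""
    w_focal, w_dice, w_iou = weights
    return (
        w_focal * focal_loss(probs, targets)
        + w_dice * soft_dice_loss(probs, targets)
        + w_iou * soft_iou_loss(probs, targets)
    )


confident = combined_loss([0.9, 0.1, 0.8, 0.2], [1, 0, 1, 0])
uncertain = combined_loss([0.6, 0.4, 0.6, 0.4], [1, 0, 1, 0])
print(confident < uncertain)  # True
```

For a multi-class model, one plausible reduction is to average this quantity over the per-layer channels; that choice is likewise an assumption.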

Evaluation

Models are evaluated on the authors' held-out test splits using per-layer Dice, mean IoU (mIoU), and mean Dice. Statistical comparisons against retrained specialised baselines (EMV-Net and LightReSeg) use two-sided Wilcoxon signed-rank tests on per-image mIoU.
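
For reference, per-layer Dice and IoU on a single image can be computed from integer label maps as in the sketch below. The helper names are illustrative, and two choices are assumptions rather than documented behaviour: background (label 0) is left out of the means, and a class absent from both prediction and ground truth is skipped instead of being scored.

```python
from typing import List, Optional, Sequence, Tuple

LabelMap = List[List[int]]


def dice_and_iou(pred: LabelMap, target: LabelMap, class_id: int) -> Optional[Tuple[float, float]]:
    """Return (Dice, IoU) for one class, or None if the class appears in neither map."""
    tp = fp = fn = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            if p == class_id and t == class_id:
                tp += 1
            elif p == class_id:
                fp += 1
            elif t == class_id:
                fn += 1
    if tp + fp + fn == 0:
        return None
    return 2 * tp / (2 * tp + fp + fn), tp / (tp + fp + fn)


def mean_scores(pred: LabelMap, target: LabelMap, class_ids: Sequence[int]) -> Tuple[float, float]:
    """Unweighted mean Dice and mIoU over the given foreground classes."""
    scores = [dice_and_iou(pred, target, c) for c in class_ids]
    scores = [s for s in scores if s is not None]
    mean_dice = sum(d for d, _ in scores) / len(scores)
    mean_iou = sum(i for _, i in scores) / len(scores)
    return mean_dice, mean_iou


# Toy 2 x 3 label maps with two foreground classes (1 and 2).
pred = [[1, 1, 2], [0, 2, 2]]
target = [[1, 2, 2], [0, 2, 2]]
mean_dice, miou = mean_scores(pred, target, class_ids=[1, 2])
print(round(mean_dice, 4), round(miou, 4))  # 0.7619 0.625
```

The per-image mIoU values produced this way are the kind of paired samples that a Wilcoxon signed-rank test compares between two models.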

NR206 test set (macular, healthy; n = 40). Our model attains the highest score on every aggregate and per-layer metric, significantly outperforming both retrained baselines (mIoU and mean Dice, p < 0.001).

Model        mIoU  Dice  NFL   GCL+IPL  INL   OPL   ONL   ELM+IS  OS    RPE
Ours (SAM2)  85.6  92.0  91.5  96.6     91.4  83.5  95.5  92.7    88.4  96.4

(mIoU/Dice are the aggregate scores; the remaining columns are per-layer Dice.)

MGU test set (peripapillary, glaucoma; n = 24). Our model significantly outperforms the retrained baselines on aggregate mIoU (p < 0.001) and matches the purpose-built published EMV-Net to within 0.2 mIoU, despite being a general-purpose foundation-model adaptation.

Model        mIoU  Dice  RNFL  GCL   IPL   INL   OPL   ONL   IS/OS  RPE   Choroid  Disc
Ours (SAM2)  68.6  80.4  81.6  65.9  70.8  76.0  80.3  90.7  85.8   81.7  89.3     82.2
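
As a quick consistency check on the two result tables, the aggregate Dice values agree with an unweighted mean of the listed per-layer Dice scores (foreground classes only). The snippet below reproduces that arithmetic from the numbers reported above; the interpretation that the aggregate is exactly this mean is inferred from the match, not stated by the source.

```python
from typing import Dict

# Per-layer Dice scores copied from the tables above.
NR206_DICE: Dict[str, float] = {
    "NFL": 91.5, "GCL+IPL": 96.6, "INL": 91.4, "OPL": 83.5,
    "ONL": 95.5, "ELM+IS": 92.7, "OS": 88.4, "RPE": 96.4,
}
MGU_DICE: Dict[str, float] = {
    "RNFL": 81.6, "GCL": 65.9, "IPL": 70.8, "INL": 76.0, "OPL": 80.3,
    "ONL": 90.7, "IS/OS": 85.8, "RPE": 81.7, "Choroid": 89.3, "Disc": 82.2,
}


def mean_dice(per_layer: Dict[str, float]) -> float:
    """Unweighted mean of per-layer Dice, rounded to one decimal as in the tables."""
    return round(sum(per_layer.values()) / len(per_layer), 1)


print(mean_dice(NR206_DICE))  # 92.0
print(mean_dice(MGU_DICE))    # 80.4
```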

The same approach was additionally evaluated on a diabetic macular oedema dataset and on combined multi-dataset training (approaching a purpose-built universal baseline); those results are reported in the dissertation but the corresponding checkpoints are not released here. See the dissertation for full tables, per-image statistics, ablations, and the prompted-refinement study.

Limitations

  • Single-run point estimates for some development comparisons; final test-set numbers above are single-run results interpreted against measured seed-to-seed variability (≈ 0.1–0.3 mIoU).
  • Dataset-specific. Each checkpoint is trained and evaluated on one dataset/device; performance on other devices, protocols, or pathologies is not expected to transfer.
  • Input resolution. SAM 2's fixed 1024 × 1024 input requires upscaling OCT B-scans, which can introduce interpolation artefacts affecting fine boundary precision.
  • No pathology segmentation beyond the labelled layer/disc classes (e.g. drusen or fluid are not segmented by the released macular/glaucoma checkpoints).

Base model & license

  • Fine-tuned from SAM 2.1 Hiera Base+ (facebook/sam2.1-hiera-base-plus), released by Meta AI under the Apache-2.0 license.
  • These fine-tuned weights are released under CC BY-NC 4.0 (attribution, non-commercial). Use of the weights must also respect the terms of the underlying public NR206 and MGU datasets.

Citation

If you use these checkpoints, please cite the dissertation and this repository, along with the CVAT-OCT project (see its CITATION.cff) and the upstream SAM 2 and CVAT projects.

@mastersthesis{roux_sam_oct_2026,
  author  = {Roux, Enslin},
  title   = {Adapting the Segment Anything Model (SAM) for Retinal OCT Layer Segmentation},
  school  = {University of the Witwatersrand},
  year    = {2026}
}

@software{roux_cvat_oct,
  author  = {Roux, Enslin},
  title   = {CVAT-OCT: AI-assisted segmentation of OCT images (a CVAT fork)},
  year    = {2026},
  url     = {https://github.com/enslinr/cvat-oct}
}

Author

Enslin Roux, University of the Witwatersrand.
