StyleTTS2 fine-tuned for Northern English (Bolton/Lancashire)

Fine-tuned from yl4579/StyleTTS2-LibriTTS on a small (~163 min) corpus of clean, single-speaker British recordings from three speakers from the Bolton / Lancashire region: Sara Cox, Maxine Peake, and Diane Morgan.

What this is

A 4-epoch fine-tune of StyleTTS2 on Northern English speech that produces moderate Northern intonation with phonetic stability. Specifically:

  • ✅ Recognisable Northern intonation (FOOT-STRUT collapse on common words)
  • ✅ Clean phonetics: no truncated word endings, no dropped function words
  • ✅ Stays in the Bolton/Lancashire sub-region of accent space
  • ✗ Less aggressive accent commitment than F5-TTS at comparable training depth

This is one of two checkpoints from the same project. See the F5-TTS variant for the stronger-accent / less-precise alternative.

Why epoch 4 specifically

We trained to epoch 5+ and observed the model drifting past the target accent: "down" was rendered as "doon" (a Geordie/Scots realisation rather than the Bolton/Lancashire target). Epoch 4 is the sweet spot: it is committed to Northern intonation but has not yet over-fit toward the broader Scots cluster. Details are in the architecture trade-off write-up.
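When comparing neighbouring epochs by ear, it helps to map an epoch number to its checkpoint file. The sketch below assumes that the fine-tuning script names second-stage checkpoints with a zero-based epoch index, so that the fourth epoch is saved as epoch_2nd_00003.pth (the file loaded under Usage). That naming convention and the helper name checkpoint_name are assumptions for illustration, not something this repo documents.

```python
def checkpoint_name(epoch: int) -> str:
    """Return the assumed checkpoint file name for a 1-based epoch number."""
    if epoch < 1:
        raise ValueError("epoch is 1-based and must be >= 1")
    # ASSUMPTION: checkpoint files are numbered from zero, so epoch 4 -> index 3.
    return f"epoch_2nd_{epoch - 1:05d}.pth"


# Candidate checkpoints for a listening comparison around the chosen epoch.
candidates = [checkpoint_name(e) for e in (3, 4, 5)]
```

Rendering the same test sentence from each candidate makes drift such as "down" becoming "doon" easy to hear side by side.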

Usage

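import soundfile as sf
from styletts2_infer import build, make_sampler, compute_style, inference

# Load the fine-tuned checkpoint together with its config.
model, _ = build("config.yml", "epoch_2nd_00003.pth")
# Build the sampler that inference() expects.
sampler = make_sampler(model)
# Compute a style vector from a reference recording.
ref_s = compute_style(model, "ref.wav")

audio = inference(
    model, sampler,
    text="Hello from a fine-tuned model.",
    ref_s=ref_s,
)
# StyleTTS2 generates 24 kHz audio.
sf.write("out.wav", audio, 24000)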

The full inference helper (~150 LOC) lives in the companion file of the minimal F5-TTS trainer gist; the same shape applies to StyleTTS2 inference.

Architecture and base

  • Base model: yl4579/StyleTTS2-LibriTTS, checkpoint epochs_2nd_00020.pth from the LibriTTS multi-speaker training.
  • Fine-tune: 4 epochs at lr=1e-4 (the default from config_ft.yml), batch_size=2, max_len=100 (reduced to fit an RTX 3060 12 GB), lambda_slm=0 (WavLM SLM loss disabled to fit in memory).
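The settings above can be expressed as overrides merged into a loaded config. This is a minimal sketch: the key layout (batch_size and max_len at the top level, lambda_slm under loss_params, lr under optimizer_params) is assumed from the upstream config_ft.yml and should be checked against the file you actually train with. The names FINE_TUNE_OVERRIDES and apply_overrides are hypothetical.

```python
import copy

# Values stated in this model card; the key paths are ASSUMED, see above.
FINE_TUNE_OVERRIDES = {
    "batch_size": 2,
    "max_len": 100,
    "loss_params": {"lambda_slm": 0},
    "optimizer_params": {"lr": 1e-4},
}


def apply_overrides(config: dict, overrides: dict) -> dict:
    """Return a copy of config with overrides merged in recursively."""
    merged = copy.deepcopy(config)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Merge nested sections so sibling keys are preserved.
            merged[key] = apply_overrides(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged
```

In practice the base dict would come from parsing config_ft.yml (for example with PyYAML's yaml.safe_load), and the merged result would be written back out before training.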

Training corpus composition

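Speaker      | Source                                          | Segments | Duration
-------------|-------------------------------------------------|----------|---------
Sara Cox     | "Till the Cows Come Home" + "Thrown" audiobooks | 641      | 101 min
Maxine Peake | BFI "Working Class Heroes" keynote              | 249      | 36 min
Diane Morgan | BFI "Mandy" Q&A                                 | 109      | 26 min
Total        |                                                 | 999      | 163 min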

All clean single-speaker content; interview / multi-speaker sources were filtered out at the manifest level.
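As a quick sanity check on the corpus figures, the totals and per-speaker shares can be recomputed from the per-speaker rows. The variable names are illustrative, and because the durations are whole minutes, the mean segment lengths are approximate.

```python
# (speaker, segments, minutes) as listed in the corpus composition above.
CORPUS = [
    ("Sara Cox", 641, 101),
    ("Maxine Peake", 249, 36),
    ("Diane Morgan", 109, 26),
]

total_segments = sum(segments for _, segments, _ in CORPUS)
total_minutes = sum(minutes for _, _, minutes in CORPUS)

# Share of total duration per speaker, as a rounded percentage.
duration_share = {
    name: round(100 * minutes / total_minutes) for name, _, minutes in CORPUS
}

# Approximate mean segment length in seconds per speaker.
mean_segment_sec = {
    name: round(60 * minutes / segments, 1) for name, segments, minutes in CORPUS
}
```

The totals come out at 999 segments and 163 minutes, with duration shares of roughly 62%, 22% and 16% for the three speakers.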

Companion writeups

Limitations

  • Not for commercial use of cloned voices. Per yl4579/StyleTTS2's license terms, only use voices whose speakers consent to cloning, or publicly disclose synthesis. The training-corpus speakers did not consent to having their voices cloned; this model demonstrates the technique and is not a production voice.
  • Bolton/Lancashire-specific. The fine-tune drifts past this sub-region toward Geordie/Scots if pushed beyond epoch 4. For other Northern sub-regions you'd need different training data.
  • Memory-constrained training. max_len=100 (~1 sec per sample) limits the prosody-level patterns the model sees; longer-context training on a bigger GPU would likely produce stronger results.
  • Single-speaker training-data dominance. Sara Cox accounts for 62% of the corpus by duration (101 of 163 min), so the model pulls toward an audiobook-narration register.
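To make the memory-constrained max_len figure concrete, the frame count can be converted to seconds. This sketch assumes max_len counts mel frames and that the hop size is 300 samples per frame; both are assumptions about the upstream StyleTTS2 defaults and are not stated in this card. The 24 kHz sample rate is the one used in the Usage snippet.

```python
SAMPLE_RATE = 24_000  # Hz, the rate used when writing out.wav under Usage
HOP_LENGTH = 300      # samples per mel frame; ASSUMED upstream default


def frames_to_seconds(frames: int,
                      sample_rate: int = SAMPLE_RATE,
                      hop_length: int = HOP_LENGTH) -> float:
    """Convert a number of mel frames to seconds of audio."""
    return frames * hop_length / sample_rate


clip_sec = frames_to_seconds(100)  # training window at max_len=100
```

Under these assumptions, max_len=100 corresponds to 1.25 s of audio per training sample, in line with the "~1 sec" figure above and well short of a full sentence.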

Citation

Built from yl4579/StyleTTS2 plus the corpus above. If you use this work, cite the underlying StyleTTS2 paper and link back to this repo:

@misc{styletts2-northern-english-ft-2026,
  author = {netlinux-ai},
  title  = {StyleTTS2 fine-tuned for Northern English (Bolton/Lancashire)},
  year   = {2026},
  url    = {https://huggingface.co/grahamathf/styletts2-northern-english-ft},
}