StyleTTS2 fine-tuned for Northern English (Bolton/Lancashire)
Fine-tuned from yl4579/StyleTTS2-LibriTTS on a small corpus (~163 min) of
clean, single-voice recordings from three speakers from the Bolton /
Lancashire region: Sara Cox, Maxine Peake, and Diane Morgan.
What this is
A 4-epoch fine-tune of StyleTTS2 on Northern English speech that produces moderate Northern intonation with phonetic stability. Specifically:
- Recognisable Northern intonation (FOOT-STRUT collapse on common words)
- Clean phonetics: no truncated word endings, no dropped function words
- Stays in the Bolton/Lancashire sub-region of accent space
- Less aggressive accent commitment than F5-TTS at comparable training depth
This is one of two checkpoints from the same project. See the F5-TTS variant for the stronger-accent / less-precise alternative.
Why epoch 4 specifically
We trained to epoch 5 and beyond and observed the model drifting past the
target accent: "down" was rendered as "doon" (a Geordie/Scots realisation
rather than the Bolton/Lancashire target). Epoch 4 is the sweet spot: committed
to Northern intonation, but not yet over-fitted toward the broader Scots cluster.
Details are in the architecture trade-off write-up.
Usage
import torch
import soundfile as sf
from styletts2_infer import build, make_sampler, compute_style, inference

# Load the model from its config and the fine-tuned checkpoint.
model, _ = build("config.yml", "epoch_2nd_00003.pth")
sampler = make_sampler(model)

# Derive the style vector from a reference recording.
ref_s = compute_style(model, "ref.wav")

audio = inference(
    model, sampler,
    text="Hello from a fine-tuned model.",
    ref_s=ref_s,
)

# Write the result as a 24 kHz WAV file.
sf.write("out.wav", audio, 24000)
The full inference helper (~150 LOC) lives in the companion file of the minimal F5-TTS trainer gist; the same shape applies for StyleTTS2 inference.
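The usage snippet loads epoch_2nd_00003.pth even though this is the epoch-4 checkpoint. One plausible reading, assuming the fine-tuning script numbers saved checkpoints from zero, is that the fourth epoch is written with index 3. The helper below is a hypothetical illustration of that mapping and is not part of styletts2_infer; check the file names in your own log directory before relying on it.

```python
def checkpoint_name(epoch_one_based: int) -> str:
    """Return the second-stage checkpoint file name for a 1-based epoch count.

    Assumption: checkpoints are saved with a zero-based epoch index,
    zero-padded to five digits.
    """
    if epoch_one_based < 1:
        raise ValueError("epoch must be >= 1")
    return f"epoch_2nd_{epoch_one_based - 1:05d}.pth"


print(checkpoint_name(4))  # epoch_2nd_00003.pth
```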
Architecture and base
- Base model: yl4579/StyleTTS2-LibriTTS, checkpoint epochs_2nd_00020.pth from the LibriTTS multi-speaker training.
- Fine-tune: 4 epochs at lr=1e-4 (defaults from config_ft.yml), batch_size=2, max_len=100 (memory-constrained for RTX 3060 12GB), lambda_slm=0 (WavLM SLM disabled to fit in memory).
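As a rough sketch of how the memory-saving settings above sit on top of a base fine-tuning config: the nested key names (loss_params, optimizer_params) and the base values for batch_size, max_len and lambda_slm are assumptions for illustration and should be checked against your copy of config_ft.yml. Only the override values and lr=1e-4 come from this README.

```python
import copy

# Hypothetical subset of config_ft.yml. Key layout and the base values for
# batch_size, max_len and lambda_slm are placeholders, not confirmed defaults.
base_config = {
    "batch_size": 8,
    "max_len": 400,
    "loss_params": {"lambda_slm": 1.0},
    "optimizer_params": {"lr": 1e-4},  # left at the default, per this README
}

# Settings stated in this README for the RTX 3060 12GB run.
overrides = {
    "batch_size": 2,
    "max_len": 100,
    "loss_params": {"lambda_slm": 0.0},  # disables the WavLM SLM loss term
}


def deep_update(base: dict, updates: dict) -> dict:
    """Return a copy of base with updates merged in, recursing into dicts."""
    merged = copy.deepcopy(base)
    for key, value in updates.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_update(merged[key], value)
        else:
            merged[key] = value
    return merged


ft_config = deep_update(base_config, overrides)
print(ft_config)
```

Keeping the overrides in a separate mapping makes it easy to see which settings were changed for memory reasons and which were left at their defaults.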
Training corpus composition
| Speaker | Source | Segments | Duration |
|---|---|---|---|
| Sara Cox | "Till the Cows Come Home" + "Thrown" audiobooks | 641 | 101 min |
| Maxine Peake | BFI "Working Class Heroes" keynote | 249 | 36 min |
| Diane Morgan | BFI "Mandy" Q&A | 109 | 26 min |
| Total | | 999 | 163 min |
All clean single-speaker content; interview / multi-speaker sources were filtered out at the manifest level.
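The totals and per-speaker shares can be re-derived from the table with a few lines of arithmetic. This is only a sanity check on the published figures; the mean segment lengths are approximate because the durations in the table are rounded to whole minutes.

```python
# Figures copied from the corpus table above.
corpus = {
    "Sara Cox": {"segments": 641, "minutes": 101},
    "Maxine Peake": {"segments": 249, "minutes": 36},
    "Diane Morgan": {"segments": 109, "minutes": 26},
}

total_segments = sum(row["segments"] for row in corpus.values())
total_minutes = sum(row["minutes"] for row in corpus.values())

# Share of total duration per speaker, in percent.
shares = {
    name: 100 * row["minutes"] / total_minutes for name, row in corpus.items()
}

for name, row in corpus.items():
    mean_seconds = 60 * row["minutes"] / row["segments"]
    print(f"{name}: {shares[name]:.0f}% of duration, "
          f"~{mean_seconds:.1f} s per segment")

print(f"Total: {total_segments} segments, {total_minutes} min")
```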
Companion writeups
- Architecture trade-off (F5 vs StyleTTS2)
- How human feedback steers TTS fine-tuning
- Non-AVX2 CPU TTS compatibility notes
- Project canonical site
Limitations
- Not for commercial use of cloned voices. Per yl4579/StyleTTS2's license terms, only use voices whose speakers consent to cloning, or publicly disclose synthesis. The training-corpus speakers did not consent to having their voices cloned. This model demonstrates the technique; it is not a production voice.
- Bolton/Lancashire-specific. The fine-tune drifts past this sub-region toward Geordie/Scots if pushed beyond epoch 4. For other Northern sub-regions you'd need different training data.
- Memory-constrained training. max_len=100 (~1 sec per sample) limits the prosody-level patterns the model sees; longer-context training on a bigger GPU would likely produce stronger results.
- Single-speaker training-data dominance. Sara Cox is 62% of the corpus by duration, so the model pulls toward audiobook-narration register.
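To make the "~1 sec per sample" figure concrete, here is a small conversion sketch. It assumes max_len counts mel-spectrogram frames and that the hop length is 300 samples; neither is stated in this README, so treat both as assumptions to verify against your preprocessing settings. The 24 kHz sample rate matches the value passed to sf.write in the usage example.

```python
SAMPLE_RATE = 24000  # Hz, as used when writing out.wav in the usage example
HOP_LENGTH = 300     # samples per mel frame; assumed, not stated in this README


def frames_to_seconds(n_frames: int,
                      hop_length: int = HOP_LENGTH,
                      sample_rate: int = SAMPLE_RATE) -> float:
    """Convert a number of mel frames into seconds of audio."""
    return n_frames * hop_length / sample_rate


print(frames_to_seconds(100))  # 1.25
```

Under these assumptions each training sample covers about 1.25 seconds of audio, which is shorter than most full phrases and fits the note above about limited prosody context.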
Citation
Built from yl4579/StyleTTS2 plus the corpus above. If you use this work, cite the underlying StyleTTS2 paper and link back to this repo:
@misc{styletts2-northern-english-ft-2026,
author = {netlinux-ai},
title = {StyleTTS2 fine-tuned for Northern English (Bolton/Lancashire)},
year = {2026},
url = {https://huggingface.co/grahamathf/styletts2-northern-english-ft},
}