Model Description

This model is a fine-tune of mirth/chonky_distilbert_base_uncased_1. The goal was to see whether training on more data would improve it further.

MTCB Nano Benchmark (Aggregated Score)

Score = Mean of mean_recall, mean_precision, mean_mrr, and mean_ndcg across k=[1, 3, 5, 10] (Metrics reference)
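
As a rough illustration of the aggregation above, here is a minimal sketch. It assumes each metric is first averaged over the four k values and the four resulting means are then averaged with equal weight. The function name, the input layout, and the numbers are hypothetical and are not taken from the benchmark code or its results.

```python
from statistics import mean

KS = (1, 3, 5, 10)
METRICS = ("recall", "precision", "mrr", "ndcg")


def aggregated_score(per_k):
    """Average each metric over k, then average the four metric means.

    per_k maps a metric name to a dict of {k: value}.
    This layout is an assumption made for the sketch.
    """
    metric_means = [mean([per_k[m][k] for k in KS]) for m in METRICS]
    return mean(metric_means)


# Made-up values for illustration only, not benchmark results.
example = {
    "recall": {1: 0.40, 3: 0.60, 5: 0.70, 10: 0.80},
    "precision": {1: 0.40, 3: 0.30, 5: 0.20, 10: 0.10},
    "mrr": {1: 0.50, 3: 0.50, 5: 0.50, 10: 0.50},
    "ndcg": {1: 0.45, 3: 0.55, 5: 0.60, 10: 0.70},
}

print(round(aggregated_score(example), 4))
```

Because every metric is evaluated at the same four k values, averaging per metric first gives the same result as averaging all sixteen values at once.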

Model / Chunker	Chunk Size 512	Chunk Size 1024	Chunk Size 2048	Avg Score
mirth/chonky_modernbert_large_1	0.5621	0.5621	0.5621	0.5621
mamei16/chonky_mdistilbert-base-english-cased	0.5517	0.5517	0.5517	0.5517
mamei16/chonky_distilbert_base_uncased_1.1	0.5342	0.5342	0.5342	0.5342
mirth/chonky_modernbert_base_1	0.5305	0.5305	0.5305	0.5305
mamei16/chonky_distilbert-base-multilingual-cased	0.5294	0.5294	0.5294	0.5294
mirth/chonky_distilbert_base_uncased_1	0.5116	0.5116	0.5116	0.5116
RecursiveChunker	0.4596	0.5214	0.5431	0.5080
SentenceChunker	0.4612	0.5026	0.5263	0.4967
TokenChunker	0.3155	0.4338	0.4801	0.4098
SemanticChunker_potion-32M	0.4022	0.4021	0.4019	0.4021
SemanticChunker_potion-multi-128M	0.4004	0.3999	0.3991	0.4001
SemanticChunker_potion-8M	0.3987	0.3966	0.3966	0.3973
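
The Avg Score column appears to be the unweighted mean of the three chunk-size columns. That is an assumption, since the card does not define the column. A short check against the RecursiveChunker row from the table:

```python
from statistics import mean

# RecursiveChunker scores from the table, keyed by chunk size.
recursive_chunker = {512: 0.4596, 1024: 0.5214, 2048: 0.5431}

# Assumed definition: unweighted mean over chunk sizes, rounded to 4 decimals.
avg_score = round(mean(recursive_chunker.values()), 4)
print(avg_score)
```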

Model Implementation Details

Benchmark Name	Implementation
RecursiveChunker	`RecursiveChunker(chunk_size=chunk_size)`
SentenceChunker	`SentenceChunker(chunk_size=chunk_size)`
TokenChunker	`TokenChunker(chunk_size=chunk_size)`
SemanticChunker_potion-32M	`SemanticChunker(chunk_size=chunk_size, embedding_model="minishlab/potion-base-32M")`
SemanticChunker_potion-multi-128M	`SemanticChunker(chunk_size=chunk_size, embedding_model="minishlab/potion-multilingual-128M")`
SemanticChunker_potion-8M	`SemanticChunker(chunk_size=chunk_size, embedding_model="minishlab/potion-base-8M")`

Training Data, Code and Hardware

The model was fine-tuned for one epoch on mamei16/wikipedia_paragraphs. The training code can be found here. Fine-tuning ran on an RTX 5090 for about 3 hours and 45 minutes.

Model size: 66.4M params (Safetensors, tensor type F32)
