Model Description

This model is a fine-tune of mirth/chonky_distilbert_base_uncased_1, trained to test whether additional training data improves it further.

MTCB Nano Benchmark (Aggregated Score)

Score = Mean of mean_recall, mean_precision, mean_mrr, and mean_ndcg across k=[1, 3, 5, 10] (Metrics reference)
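The aggregation above can be sketched in a few lines of Python. This is a minimal illustration, not the benchmark's own code: the input layout (a dict mapping each metric name to per-k values), the function name `aggregated_score`, and the sample numbers are all assumptions made for the example.

```python
from statistics import mean

# k values and metric names as stated in the score definition.
K_VALUES = (1, 3, 5, 10)
METRICS = ("recall", "precision", "mrr", "ndcg")


def aggregated_score(per_k):
    """Average each metric over K_VALUES, then average the four metric means.

    `per_k` is a hypothetical layout: {metric_name: {k: value}}.
    """
    metric_means = [mean(per_k[m][k] for k in K_VALUES) for m in METRICS]
    return mean(metric_means)


# Made-up values for illustration only (not benchmark results).
example = {
    "recall": {1: 0.2, 3: 0.4, 5: 0.6, 10: 0.8},
    "precision": {1: 0.5, 3: 0.5, 5: 0.5, 10: 0.5},
    "mrr": {1: 0.7, 3: 0.7, 5: 0.7, 10: 0.7},
    "ndcg": {1: 0.3, 3: 0.3, 5: 0.3, 10: 0.3},
}
score = aggregated_score(example)
print(round(score, 4))
```

Because every metric is averaged over the same four k values, this mean of means equals the plain mean of all sixteen numbers.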

| Model / Chunker | Chunk Size 512 | Chunk Size 1024 | Chunk Size 2048 | Avg Score |
|---|---|---|---|---|
| mirth/chonky_modernbert_large_1 | 0.5621 | 0.5621 | 0.5621 | 0.5621 |
| mamei16/chonky_mdistilbert-base-english-cased | 0.5517 | 0.5517 | 0.5517 | 0.5517 |
| mamei16/chonky_distilbert_base_uncased_1.1 | 0.5342 | 0.5342 | 0.5342 | 0.5342 |
| mirth/chonky_modernbert_base_1 | 0.5305 | 0.5305 | 0.5305 | 0.5305 |
| mamei16/chonky_distilbert-base-multilingual-cased | 0.5294 | 0.5294 | 0.5294 | 0.5294 |
| mirth/chonky_distilbert_base_uncased_1 | 0.5116 | 0.5116 | 0.5116 | 0.5116 |
| RecursiveChunker | 0.4596 | 0.5214 | 0.5431 | 0.5080 |
| SentenceChunker | 0.4612 | 0.5026 | 0.5263 | 0.4967 |
| TokenChunker | 0.3155 | 0.4338 | 0.4801 | 0.4098 |
| SemanticChunker_potion-32M | 0.4022 | 0.4021 | 0.4019 | 0.4021 |
| SemanticChunker_potion-multi-128M | 0.4004 | 0.3999 | 0.3991 | 0.4001 |
| SemanticChunker_potion-8M | 0.3987 | 0.3966 | 0.3966 | 0.3973 |

Model Implementation Details

| Benchmark Name | Implementation |
|---|---|
| RecursiveChunker | `RecursiveChunker(chunk_size=chunk_size)` |
| SentenceChunker | `SentenceChunker(chunk_size=chunk_size)` |
| TokenChunker | `TokenChunker(chunk_size=chunk_size)` |
| SemanticChunker_potion-32M | `SemanticChunker(chunk_size=chunk_size, embedding_model="minishlab/potion-base-32M")` |
| SemanticChunker_potion-multi-128M | `SemanticChunker(chunk_size=chunk_size, embedding_model="minishlab/potion-multilingual-128M")` |
| SemanticChunker_potion-8M | `SemanticChunker(chunk_size=chunk_size, embedding_model="minishlab/potion-base-8M")` |

Training Data, Code and Hardware

The model was fine-tuned for one epoch on mamei16/wikipedia_paragraphs. The training code can be found here. Fine-tuning ran on an RTX 5090 and took about 3 hours and 45 minutes.

Model size: 66.4M params (Safetensors, tensor type F32)