How to use mamei16/chonky_distilbert_base_uncased_1.1 with Transformers:
# Use a pipeline as a high-level helper
from transformers import pipeline
pipe = pipeline("token-classification", model="mamei16/chonky_distilbert_base_uncased_1.1")
# Load model directly
from transformers import AutoTokenizer, AutoModelForTokenClassification
tokenizer = AutoTokenizer.from_pretrained("mamei16/chonky_distilbert_base_uncased_1.1")
model = AutoModelForTokenClassification.from_pretrained("mamei16/chonky_distilbert_base_uncased_1.1")

This is a fine-tune of mirth/chonky_distilbert_base_uncased_1. The goal is to see whether the model can be improved further by training it on more data.
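The snippets above only load the model. As a rough sketch of one way to turn token-classification output into text chunks, the helper below cuts the input after the `end` character offset of every prediction. It assumes that the model flags the tokens that close a paragraph, which this card does not state. `split_at_predictions` is a hypothetical helper, and the hand-written prediction list (including the `"separator"` label) is a placeholder that only mimics the shape of the dicts a Transformers pipeline returns.

```python
from typing import Dict, Iterable, List


def split_at_predictions(text: str, predictions: Iterable[Dict]) -> List[str]:
    """Cut `text` after the `end` offset of every prediction (hypothetical helper)."""
    chunks = []
    prev = 0
    # Sort by offset so the cuts are applied left to right.
    for pred in sorted(predictions, key=lambda p: p["end"]):
        end = pred["end"]
        chunk = text[prev:end].strip()
        if chunk:
            chunks.append(chunk)
        prev = end
    # Whatever follows the last cut forms the final chunk.
    tail = text[prev:].strip()
    if tail:
        chunks.append(tail)
    return chunks


text = "Cats purr. They sleep a lot. Dogs bark. They like walks."

# Placeholder prediction, NOT real model output: one flagged token, the
# period that ends the second sentence (characters 27 to 28).
fake_predictions = [
    {"entity_group": "separator", "score": 0.99, "word": ".", "start": 27, "end": 28},
]

chunks = split_at_predictions(text, fake_predictions)
for chunk in chunks:
    print(chunk)
```

With a real model, the list returned by `pipe(text)` would take the place of `fake_predictions`. Texts longer than the model's context window would need to be processed in windows, which this sketch does not handle.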
Score = Mean of mean_recall, mean_precision, mean_mrr, and mean_ndcg across k=[1, 3, 5, 10] (Metrics reference)
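To make the score definition concrete, here is a minimal sketch of the aggregation, assuming the per-metric results are available as a nested mapping of the form `results[metric][k]`. That layout, the function name `overall_score`, and the example numbers are illustrative assumptions and are not taken from the benchmark code.

```python
from statistics import mean

K_VALUES = [1, 3, 5, 10]
METRICS = ["mean_recall", "mean_precision", "mean_mrr", "mean_ndcg"]


def overall_score(results):
    """Average the four metrics over all k values.

    Every metric contributes the same number of k values, so the flat mean
    over all 16 numbers equals the mean of the four per-metric means.
    """
    return mean(results[m][k] for m in METRICS for k in K_VALUES)


# Made-up numbers for illustration only, not benchmark results.
example_results = {
    "mean_recall": {1: 0.2, 3: 0.4, 5: 0.6, 10: 0.8},
    "mean_precision": {1: 0.5, 3: 0.5, 5: 0.5, 10: 0.5},
    "mean_mrr": {1: 0.5, 3: 0.5, 5: 0.5, 10: 0.5},
    "mean_ndcg": {1: 0.5, 3: 0.5, 5: 0.5, 10: 0.5},
}

score = overall_score(example_results)
print(f"{score:.4f}")
```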
| Model / Chunker | Chunk Size 512 | Chunk Size 1024 | Chunk Size 2048 | Avg Score |
|---|---|---|---|---|
| mirth/chonky_modernbert_large_1 | 0.5621 | 0.5621 | 0.5621 | 0.5621 |
| mamei16/chonky_mdistilbert-base-english-cased | 0.5517 | 0.5517 | 0.5517 | 0.5517 |
| mamei16/chonky_distilbert_base_uncased_1.1 | 0.5342 | 0.5342 | 0.5342 | 0.5342 |
| mirth/chonky_modernbert_base_1 | 0.5305 | 0.5305 | 0.5305 | 0.5305 |
| mamei16/chonky_distilbert-base-multilingual-cased | 0.5294 | 0.5294 | 0.5294 | 0.5294 |
| mirth/chonky_distilbert_base_uncased_1 | 0.5116 | 0.5116 | 0.5116 | 0.5116 |
| RecursiveChunker | 0.4596 | 0.5214 | 0.5431 | 0.5080 |
| SentenceChunker | 0.4612 | 0.5026 | 0.5263 | 0.4967 |
| TokenChunker | 0.3155 | 0.4338 | 0.4801 | 0.4098 |
| SemanticChunker_potion-32M | 0.4022 | 0.4021 | 0.4019 | 0.4021 |
| SemanticChunker_potion-multi-128M | 0.4004 | 0.3999 | 0.3991 | 0.4001 |
| SemanticChunker_potion-8M | 0.3987 | 0.3966 | 0.3966 | 0.3973 |
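The Avg Score column appears to be the arithmetic mean of the three chunk-size columns. The short check below reproduces it for three of the baseline rows, using values copied from the table. The dictionary layout and the function name `avg_score` are illustrative choices.

```python
# Scores copied from the table above: (512, 1024, 2048, reported Avg Score).
rows = {
    "RecursiveChunker": (0.4596, 0.5214, 0.5431, 0.5080),
    "SentenceChunker": (0.4612, 0.5026, 0.5263, 0.4967),
    "TokenChunker": (0.3155, 0.4338, 0.4801, 0.4098),
}


def avg_score(s512, s1024, s2048):
    """Mean over the three chunk sizes, rounded to four decimals like the table."""
    return round((s512 + s1024 + s2048) / 3, 4)


recomputed = {name: avg_score(*vals[:3]) for name, vals in rows.items()}
for name, value in recomputed.items():
    print(f"{name}: recomputed {value:.4f}, reported {rows[name][3]:.4f}")
```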
| Benchmark Name | Implementation |
|---|---|
| RecursiveChunker | RecursiveChunker(chunk_size=chunk_size) |
| SentenceChunker | SentenceChunker(chunk_size=chunk_size) |
| TokenChunker | TokenChunker(chunk_size=chunk_size) |
| SemanticChunker_potion-32M | SemanticChunker(chunk_size=chunk_size, embedding_model="minishlab/potion-base-32M") |
| SemanticChunker_potion-multi-128M | SemanticChunker(chunk_size=chunk_size, embedding_model="minishlab/potion-multilingual-128M") |
| SemanticChunker_potion-8M | SemanticChunker(chunk_size=chunk_size, embedding_model="minishlab/potion-base-8M") |
The model was fine-tuned for one epoch on mamei16/wikipedia_paragraphs. The training code can be found here. Fine-tuning ran on an RTX 5090 and took about 3 hours and 45 minutes.
Base model
distilbert/distilbert-base-uncased