DenseOn with the LateOn: Open State-of-the-Art Single and Multi-Vector Models
We are thrilled to release LateOn and DenseOn, two open retrieval models that outperform existing state-of-the-art (SOTA) models on BEIR, alongside the large-scale datasets, mixture studies, and curation experiments that made them possible. SOTA retrieval has long been locked behind closed data, stalling progress for everyone outside a handful of private labs. With this release, everything is on the table, and our decontamination experiments confirm the gains come from genuine generalization, not leakage.
DenseOn | LateOn | PyLate | FastPLAID
Table of Contents
- Introduction
- Pre-Training Data Pipeline
- Fine-tuning Data
- Results
- What We Learned During our Explorations
- Conclusion
- Citation
- Acknowledgements
- Appendix
- References
- 🤗 Models: LateOn | DenseOn
- 📚 Datasets: Pre-training | Pre-training (curated) | Fine-tuning
- 🛠️ Tools: PyLate | FastPLAID
Introduction
The capabilities of information retrieval models are evolving rapidly. Due to this progress, leaderboards are increasingly topped by closed models: some hidden behind APIs, some under restrictive licenses. Even when models are released under permissive licenses, they are trained on closed data. This prevents any examination of possible data leakage and contamination (or even purposeful overfitting). Perhaps more importantly, it prevents broader exploration of new ideas and gatekeeps the demonstration of state-of-the-art performance to the few private companies that own this data (which also puts academia at a strong disadvantage).
Among open efforts, Nomic Embed stood out by releasing both models and training data. Despite this awesome effort, which enabled interesting controlled explorations (such as our ColBERT-Zero study), the data has since become outdated and is now weaker than what closed labs use. This is clearly illustrated by comparing modernbert-embed, trained by Nomic AI on their open data, against gte-modernbert: both share the exact same ModernBERT backbone, yet gte-modernbert is significantly stronger. While differences in hyperparameters or training setup may also play a role, this comparison strongly suggests that the data recipe is the primary differentiator.
While the mGTE technical report thoroughly describes the data sources used to train the GTE family of models, it does not release the data itself, making the recipe only theoretically reproducible and preventing people from easily using it. We thus decided to gather most of the datasets from the mGTE publication, alongside other sources, and share them publicly here. We also created a new, high-quality common-crawl split by leveraging the FineWeb data. Beyond gathering the data, we annotated it to enable further curation. We then complemented this large-scale contrastive pre-training dataset with smaller, higher-quality datasets with mined hard negatives. Besides being fully open, all of this data is filtered in a non-destructive way, allowing everyone to either directly use our mixture or extend/replace our filters.
We used these datasets to run ablations and identify a winning recipe, which we share openly to allow the community to build strong models. However, this was only the first step as many of the public datasets in the mixture carry restrictive licenses. We applied our findings to construct a proprietary training dataset with no meaningful difference in performance compared to our ablation models.
At 149M parameters, both LateOn and DenseOn sit at the Pareto frontier of the size-quality trade-off, outperforming models up to 4x larger (Jina ColBERT v2 at 559M, Arctic Embed L v2 at 568M).
Based on our findings, we release LateOn and DenseOn, two model families that cover both ends of the retrieval spectrum: multi-vector (ColBERT) and single-vector (dense). Both are built on the ModernBERT backbone at 149M parameters, a size we believe hits the sweet spot: large enough to capture the complexity of real-world queries and documents, small enough to serve at high throughput in latency-sensitive production systems. LateOn reaches 57.22 NDCG@10 on BEIR, surpassing every existing ColBERT model, including those 4× its size. DenseOn achieves 56.20, a score that punches well above its 149M weight class and remains competitive with dense retrievers several times larger. Crucially, these results hold up under decontamination: when we strip overlapping training data from the evaluation corpora, both models remain at the top of the rankings, confirming that the gains come from genuine generalization rather than data leakage. All models and intermediate checkpoints are released under Apache 2.0.
Training Pipeline
The training follows a two-stage pipeline, each stage building on the previous one:
- Large scale contrastive pre-training on a filtered corpus of 665M query-document pairs, selected from 1.4B total pairs collected across 34 sources. Our carefully designed multi-stage filtering pipeline retains 48% of the original data (see Filtering Pipeline below). Training then uses in-batch negatives with batch sizes up to 16k to produce the unsupervised model variants.
- Supervised contrastive fine-tuning on a smaller, higher-quality filtered corpus of 1.69M contrastive pairs, retained from an initial 1.88M pairs, using hard negatives mined following the NV-Retriever approach (see Fine-tuning Data) in addition to in-batch negatives. This yields the supervised variants.
It is worth noting that, unlike our usual setups, we did not perform knowledge distillation after the second stage for two reasons. First, we found that it is difficult to make KL-divergence distillation work for dense models, and it did not improve results despite extensive exploration. Second, despite the general usefulness of KD for ColBERT models, our multi-vector models reach such strong BEIR results that standard KD on MS MARCO provided no meaningful gains. We leave for future work both the exploration of KD for dense models and the use of stronger distillation setups, such as more diverse data or a stronger teacher, that may lead to further gains.
Pre-Training Data Pipeline
Data Mixture
The first step in training a retrieval model from a masked-language-model backbone such as ModernBERT is large-scale unsupervised contrastive training. In this setup, each query is associated only with a positive document and no explicit negatives; only in-batch negatives are used. At this stage, the key ingredients are the quality and diversity of the data mixture.
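To make the objective concrete, here is a minimal sketch of contrastive training with in-batch negatives; the temperature value is an illustrative assumption, not our exact hyperparameter.

```python
import torch
import torch.nn.functional as F

def in_batch_contrastive_loss(
    q_emb: torch.Tensor, d_emb: torch.Tensor, temperature: float = 0.02
) -> torch.Tensor:
    """InfoNCE with in-batch negatives: the i-th document is the positive
    for the i-th query; every other document in the batch is a negative."""
    q_emb = F.normalize(q_emb, dim=-1)
    d_emb = F.normalize(d_emb, dim=-1)
    # (batch, batch) similarity matrix; the diagonal holds the positives.
    logits = q_emb @ d_emb.T / temperature
    labels = torch.arange(q_emb.size(0), device=q_emb.device)
    return F.cross_entropy(logits, labels)
```

With batch sizes up to 16k, each query sees thousands of negatives essentially for free, which is why batch size is the main lever at this stage.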
The mGTE technical report describes the data sources used to train the GTE family of models but does not release the data itself. We painstakingly reassembled this mixture from scratch, tracking down each source, matching formats, and rebuilding the full pipeline. The final dataset contains all these sources, plus additional ones, as well as a new dataset we created to replace the "common crawl" component. Indeed, most studies mention using data from the web but either do not specify which data exactly or rely on heavily outdated sources. We thus decided to build upon the more recent and stronger FineWeb-Edu data.
We release this reconstructed dataset (alongside all our filtering/annotation values) publicly to enable the community to reproduce and build upon our work. Each source retains its original license; users are responsible for verifying compliance with the license terms of each individual source. We then ran extensive ablations on this mixture to identify which sources, ratios, and filtering thresholds yielded the best downstream results.
Filtered dataset composition by category. Mixed/Multi-Domain and Instructional sources account for half the dataset.
Filtering Pipeline
After gathering the various data sources, the next step is to properly filter and curate them, as described for example by Arctic-embed. Thus, besides reconstructing the mixture from mGTE, we also design an extensible, non-destructive data filtering pipeline: rather than deleting examples, we retain all data internally and expose filtering signals as metadata — a boolean filter column for structural quality filtering, a duplicate column for deduplication, and a similarity column holding a strong cross-encoder score rating the relevance of the document to the query. All filtering is applied only at training time, to facilitate further exploration and the design of other filtering approaches.
Filtering in action: with a relevance-score threshold of 3.0, we retain examples #01, #02, #04, and #05. Examples #03 and #06 are dropped because their filter flag marks them for removal and their score falls below the threshold.
The structural filtering pipeline consists of 30+ composable filters (a minimal sketch of two of them follows the list):
- Surface-level cleanup: removal of legal boilerplate (terms of use, cookie banners), residual HTML tags and inline styles, non-printable characters, and URL placeholders. A short-text filter also drops rows with fewer than 3 whitespace tokens in either field.
- Language and script consistency: FastText language identification (≥50% confidence) verifies that both query and document are in the target language. A family of Unicode-range filters removes non-target scripts (Chinese, Cyrillic, Arabic, Hebrew, Japanese kana, Korean hangul), auto-injected as a group so each source opts in with a single config entry.
- Statistical quality heuristics: unigram log-probability scoring against Google's 1T frequency table (documents below −10.0 are flagged), a repeated-uncommon-word detector (any token with frequency < 10⁻⁶ accounting for >30% of all tokens), adaptive special-character ratio filtering (top 4% per source), and digit/uppercase ratio checks (>70% digits or >90% uppercase).
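As referenced above, here is a minimal sketch of two such filters and their per-source composition; the function names and config layout are hypothetical, but the thresholds follow the values listed above.

```python
def short_text(row: dict) -> bool:
    """Flag rows with fewer than 3 whitespace tokens in either field."""
    return min(len(row["query"].split()), len(row["document"].split())) < 3

def digit_uppercase_ratio(row: dict) -> bool:
    """Flag documents that are >70% digits or >90% uppercase."""
    text = row["document"]
    if not text:
        return True
    digits = sum(c.isdigit() for c in text) / len(text)
    upper = sum(c.isupper() for c in text) / len(text)
    return digits > 0.70 or upper > 0.90

# Each source opts into its own ordered stack; a row is flagged
# (filter = True) as soon as any filter in the stack fires.
FILTER_STACKS = {"agnews": [short_text, digit_uppercase_ratio]}

def is_flagged(row: dict, source: str) -> bool:
    return any(f(row) for f in FILTER_STACKS.get(source, []))
```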
Semantic relevance is assessed by scoring every pair with mxbai-rerank-large-v2, a cross-encoder reranker. Pairs with a similarity score below 3.0 are flagged. This is an absolute threshold, not a per-source quantile, so curated scientific abstracts keep >99% of their content while noisy web Q&A loses its semantically mismatched tail.
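A hedged sketch of the scoring step, written against sentence-transformers' generic CrossEncoder interface; the actual loading code for mxbai-rerank-large-v2 may differ, and `rows` stands in for shards of the real dataset.

```python
from sentence_transformers import CrossEncoder

# Assumption: a CrossEncoder-style predict() interface; the exact loader
# for mxbai-rerank-large-v2 may differ in practice.
reranker = CrossEncoder("mixedbread-ai/mxbai-rerank-large-v2")

rows = [{"query": "what is late interaction?",
         "document": "ColBERT scores queries against documents token by token."}]
scores = reranker.predict([(r["query"], r["document"]) for r in rows])

for row, score in zip(rows, scores):
    # Scores are stored as metadata; the 3.0 floor is applied at training time.
    row["similarity"] = float(score)
```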
Deduplication is performed by computing an MD5 hash over the normalized (query, document) pair. Within each group of identical hashes, the first occurrence is retained as canonical, while every subsequent row records a pointer to that canonical index in the duplicate column. As with the other signals, duplicates are flagged rather than deleted, and are filtered out at training time.
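A minimal sketch of that dedup pass; the exact normalization applied before hashing is our assumption here.

```python
import hashlib

def annotate_duplicates(rows: list[dict]) -> list[dict]:
    """First occurrence of each normalized (query, document) pair is canonical;
    later occurrences store a pointer to that canonical index."""
    seen: dict[str, int] = {}
    for i, row in enumerate(rows):
        key = f"{row['query'].strip().lower()}\t{row['document'].strip().lower()}"
        digest = hashlib.md5(key.encode("utf-8")).hexdigest()
        row["duplicate"] = seen.get(digest)  # None for the canonical row
        seen.setdefault(digest, i)
    return rows
```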
The pipeline is also source-aware, not monolithic: each source has its own ordered list of filter calls. The same cleaning rule that rescues a FAQ scrape can destroy scientific notation. Scientific abstracts use Greek and math symbols that a strict character allowlist would obliterate, so it is simply turned off for those sources. This per-source configuration is the key design decision: on our 5-shard sample, end-to-end retention ranges from ~49% on noisy news headlines (AG News) to ~97% on curated scientific abstracts (biorxiv), with the variance driven by intrinsic source quality rather than filter aggressiveness.
For the FineWeb-Edu split (and FineWeb-2 for multilingual data), which accounts for over half the mixture, we had to run some custom data processing to extract the titles of the webpages (used for contrastive learning with the document) from CommonCrawl, as they were missing in the original datasets. Then we took a different approach to filtering. Since this source was already processed with its own quality curation, we did not run the rule-based filtering stack or the MD5 deduplication step. We thus only apply the cross-encoder reranker filtering step, with a much more aggressive similarity threshold: we keep roughly the top third of pairs per shard (effective threshold ~10.6–11.0, compared to 3.0 for every other subset), yielding a consistent ~34% retention rate across all 119 shards.
Cross-encoder score distributions across our unfiltered pre-training dataset. Higher scores indicate better relevance between a query and a document. We used this score to filter our dataset.
For the Wikipedia-flavored sources (the two Atlas HLP splits, hlp_wikipedia_cm and hlp_wikipedia_dl, totaling 20M rows), we took yet another approach: these pairs were included as-is, without rule-based filtering, deduplication, or cross-encoder scoring. Because the Atlas pipeline already produces clean, well-structured query-document pairs from Wikipedia, we judged that additional filtering would add cost without meaningful quality gains. The smaller encyclopedia sources (wikianswers, wikihow) did go through cross-encoder scoring with the standard 3.0 threshold, retaining over 99% of their rows.
The training-ready dataset, which reproduces the mGTE technical report recipe, is then a simple view of the annotated embeddings-pre-training parent:
-- Standard sources (e.g. agnews, altlex, amazon_qa, arxiv, biorxiv, ...)
SELECT index, query, document, similarity
FROM lightonai/embeddings-pre-training
WHERE config NOT IN ('fw_edu', 'hlp_wikipedia_cm', 'hlp_wikipedia_dl')
AND filter = FALSE
AND duplicate IS NULL
AND similarity >= 3.0
-- FineWeb-Edu: no rule-based filter or dedup applied,
-- Keep only the top ~35%
UNION ALL
SELECT index, query, document, similarity
FROM (
SELECT *, PERCENT_RANK() OVER (ORDER BY similarity DESC, index ASC) AS prank
FROM lightonai/embeddings-pre-training
WHERE config IN ('fw_edu')
) AS ranked
WHERE prank <= 0.35
-- Wikipedia HLP splits: included as-is, no filtering applied
UNION ALL
SELECT index, query, document, similarity
FROM lightonai/embeddings-pre-training
WHERE config IN ('hlp_wikipedia_cm', 'hlp_wikipedia_dl')
Downstream users who want a stricter or more permissive cut can adjust the similarity floor without re-running the pipeline. They can also replace any of these filters by their own or add new ones, because all of the original data is still available.
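For example, a stricter cut with the Hugging Face datasets library might look like this; the config name is one of the sources listed above, and the 3.5 floor is an arbitrary illustration.

```python
from datasets import load_dataset

# Raise the similarity floor from the released 3.0 to a stricter 3.5.
ds = load_dataset("lightonai/embeddings-pre-training", "agnews", split="train")
ds = ds.filter(
    lambda row: not row["filter"]
    and row["duplicate"] is None
    and row["similarity"] >= 3.5
)
```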
To make this recipe immediately actionable, we also release a fully curated version of the dataset, embeddings-pre-training-curated, which materializes the exact mixture described above into a single, training-ready resource. All the work of navigating sources, applying the per-source filter stacks, thresholding cross-encoder relevance scores, removing duplicates, and balancing the final composition is already done. It allows practitioners and academic labs to reproduce our final ablation pre-training setup end-to-end, or use it as a strong, transparent baseline to benchmark their own data curation ideas against. The curated dataset and its annotated parent together give researchers a real shot at training state-of-the-art retrieval models.
Fine-tuning Data
After running the unsupervised pre-training on the large scale dataset of query-document pairs, we then further refine the model by fine-tuning it using a higher-quality dataset where each query is paired not only with its positive document but also with specifically mined hard negatives. This stage reduces reliance on large batch sizes alone and teaches the model to better distinguish positive documents from semantically related but irrelevant passages.
Specifically, we apply the NV-Retriever approach for hard negative mining. We retrieve the top-2048 nearest passages per query with GTE-ModernBERT, then keep the top-50 eligible passages whose similarity score falls below 99% of the query-positive similarity score as negatives. At each training step, 7 negatives are randomly selected for each sample in addition to in-batch negatives.
Several aspects make this resource especially valuable. First, negatives are mined from text-deduplicated passages while excluding the positive document, preventing relevant passages from reappearing in the negative pool and contaminating the contrastive data. Second, this dataset allows us to filter out examples for which NV-Retriever thresholding yields fewer than 50 negatives. We found this criterion particularly important, as failing to identify enough negatives often suggests that the original query-positive pair was weakly matched or incorrectly annotated, producing a low query-positive similarity. In practice, this gives us a simple and effective way to filter noisy supervision without requiring explicit reranker-based validation of query-positive pairs.
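A minimal sketch of this positive-aware selection, assuming candidates arrive sorted by retrieval score (descending) with the positive already excluded; names and defaults are illustrative.

```python
def select_hard_negatives(
    pos_score: float,
    candidates: list[tuple[str, float]],  # (passage, score), sorted descending
    margin: float = 0.99,
    top_k: int = 50,
) -> list[tuple[str, float]] | None:
    """NV-Retriever-style filtering: keep the top-k candidates whose score
    falls below margin * pos_score; drop the example entirely otherwise."""
    threshold = margin * pos_score
    eligible = [(doc, s) for doc, s in candidates if s < threshold]
    if len(eligible) < top_k:
        return None  # likely a weakly matched or mislabeled query-positive pair
    # At each training step, 7 of these are sampled alongside in-batch negatives.
    return eligible[:top_k]
```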
Unfiltered fine-tuning dataset composition with hard negatives mined across 7 sources by GTE-ModernBERT.
As with the pre-training data, we release our unfiltered fine-tuning dataset here with 2048 mined passages and their retrieval scores for each query-positive pair. This makes the dataset highly reusable, as users can apply different NV-Retriever thresholds, vary the number of retained negatives, use other sampling methods, or build distillation-based pipelines on top of the strong retriever scores. We apply this methodology to seven widely used fine-tuning datasets: FiQA, Natural Questions, HotpotQA, MS MARCO, FEVER, SQuAD v2, and TriviaQA. Collectively, they span financial retrieval, real-user web search, multi-hop reasoning, claim verification, extractive question answering with unanswerable questions, and trivia grounded in external evidence, providing a diverse and challenging training basis for a general-purpose retrieval model. In total, this represents 1.88M pairs. Here again, we believe that this is a very valuable source of data, as mining and creating a strong set of hard negatives is something every team does at the start of their project.
Results
BEIR Benchmark: Multi-Vector (ColBERT) Models
Multi-vector (ColBERT) models comparison on BEIR (NDCG@10). Best in bold. Results in italic are the ones we ran because values were missing from official results.
| Model | Average | Size (M) | Embed dim | ArguAna | CQADupstackRetrieval | ClimateFEVER | DBPedia | FEVER | FiQA2018 | HotpotQA | MSMARCO | NFCorpus | NQ | QuoraRetrieval | SCIDOCS | SciFact | TRECCOVID | Touche2020 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ColBERTv2 | 48.63 | 110 | 128 | 46.50 | 38.30 | 17.60 | 45.20 | 78.50 | 35.40 | 67.50 | 46.00 | 33.70 | 52.40 | 85.50 | 15.40 | 68.90 | 72.60 | 26.00 |
| Jina-ColBERT-v2 | 51.85 | 559 | 128 | 36.60 | 40.80 | 23.90 | 47.10 | 80.50 | 40.80 | 76.60 | 46.90 | 34.60 | 64.00 | 88.70 | 18.60 | 67.80 | 83.40 | 27.40 |
| ColBERT-small | 53.79 | 33 | 96 | 50.09 | 38.75 | 33.07 | 45.58 | 90.96 | 41.15 | 76.11 | 43.50 | 37.30 | 59.10 | 87.72 | 18.42 | 74.77 | 84.59 | 25.69 |
| GTE-ModernColBERT-v1 | 54.75 | 149 | 128 | 47.52 | 41.08 | 31.33 | 47.56 | 87.67 | 45.25 | 77.48 | 45.60 | 37.83 | 61.62 | 86.71 | 19.22 | 76.33 | 84.84 | 31.25 |
| ColBERT-Zero | 55.39 | 149 | 128 | 52.82 | 41.41 | 35.90 | 47.43 | 90.52 | 42.50 | 79.45 | 45.95 | 37.21 | 61.82 | 85.19 | 19.84 | 76.33 | 78.27 | 36.24 |
| LateOn-unsupervised | 50.11 | 149 | 128 | 43.12 | 47.71 | 18.76 | 43.36 | 65.74 | 51.94 | 68.17 | 37.51 | 37.15 | 58.41 | 89.48 | 21.13 | 76.89 | 69.81 | 22.53 |
| LateOn | 57.22 | 149 | 128 | 50.52 | 47.36 | 39.67 | 45.99 | 92.02 | 53.12 | 79.98 | 45.67 | 37.79 | 63.91 | 89.67 | 21.90 | 76.61 | 83.60 | 30.52 |
LateOn achieves 57.22 average NDCG@10, becoming the first ColBERT (and sub-150M-parameter) model to break the 57 mark on BEIR, surpassing the previous best ColBERT model (ColBERT-Zero at 55.39) by almost two points and exceeding GTE-ModernColBERT-v1 by two and a half points despite sharing the same backbone. It is worth noting that we achieve this performance with contrastive data alone and without a prompt prefix (which has been shown to be beneficial in the ColBERT-Zero study, but makes models harder and more expensive to use). This work focused on exploring contrastive data, but we believe even stronger results are within reach through a KD phase and the addition of prompts serving as query expansion. Given such strong performance, one fair concern would be overfitting to BEIR. As the decontaminated BEIR experiments show, our models stay very strong on benchmarks cleansed of any possible data leakage, especially in the multi-vector case and contrary to some other models.
BEIR Benchmark: Single-Vector (Dense) Models
Single-vector (dense) models comparison on BEIR (NDCG@10). Best in bold. Results in italic are the ones we ran because values were missing from official results.
| Model | Average | Size | Emb dim | ArguAna | CQADupstackRetrieval | ClimateFEVER | DBPedia | FEVER | FiQA2018 | HotpotQA | MSMARCO | NFCorpus | NQ | QuoraRetrieval | SCIDOCS | SciFact | TRECCOVID | Touche2020 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| modernbert-embed-base | 52.89 | 149 | 768 | 48.96 | 42.08 | 35.67 | 41.50 | 87.35 | 40.59 | 67.11 | 41.47 | 33.40 | 62.15 | 88.85 | 18.59 | 69.63 | 84.15 | 31.91 |
| bge-large-en-v1.5 | 54.34 | 335 | 1024 | 64.52 | 42.23 | 36.57 | 44.11 | 87.18 | 45.02 | 74.10 | 42.49 | 38.06 | 55.03 | 89.07 | 22.63 | 74.64 | 74.70 | 24.81 |
| gte-modernbert-base | 55.19 | 149 | 768 | 74.56 | 42.64 | 45.90 | 41.39 | 93.98 | 49.54 | 70.39 | 39.93 | 34.32 | 56.10 | 88.57 | 20.44 | 76.41 | 75.75 | 17.97 |
| snowflake-arctic-embed-l-v2.0 | 55.22 | 568 | 1024 | 59.11 | 45.88 | 41.82 | 43.40 | 91.54 | 45.35 | 68.15 | 44.86 | 35.08 | 63.67 | 88.75 | 20.28 | 70.90 | 83.63 | 25.89 |
| Qwen3-Embedding-0.6B | 55.52 | 595 | 1024 | 70.97 | 46.03 | 42.11 | 39.48 | 88.15 | 46.61 | 65.74 | 37.99 | 36.71 | 53.46 | 87.78 | 24.41 | 69.72 | 90.52 | 33.18 |
| jina-embeddings-v5-text-nano | 56.06 | 239 | 768 | 65.70 | 44.66 | 39.60 | 45.26 | 89.51 | 47.85 | 69.07 | 41.64 | 38.69 | 63.38 | 88.87 | 22.60 | 75.78 | 77.60 | 30.70 |
| pplx-embed-v1-0.6b | 56.70 | 596 | 1024 | 60.45 | 45.96 | 39.82 | 44.30 | 90.66 | 52.05 | 74.41 | 43.86 | 35.80 | 62.04 | 88.96 | 22.84 | 74.78 | 85.63 | 28.98 |
| DenseOn-unsupervised | 49.05 | 149 | 768 | 54.94 | 46.28 | 18.20 | 37.39 | 70.68 | 52.34 | 59.77 | 29.30 | 37.92 | 50.62 | 88.98 | 23.05 | 76.35 | 68.12 | 21.87 |
| DenseOn | 56.20 | 149 | 768 | 54.65 | 46.89 | 37.49 | 44.65 | 90.69 | 53.86 | 74.51 | 43.58 | 39.03 | 59.25 | 89.31 | 22.35 | 75.95 | 82.33 | 28.43 |
DenseOn reaches 56.20 NDCG@10 on BEIR, making it the top base-size dense retriever and the first sub-150M model to clear the 56 bar. At 149M parameters it decisively beats GTE-ModernBERT (55.19) at the same size, and more tellingly outperforms snowflake-arctic-embed-l-v2.0 (55.22, 568M) and Qwen3-Embedding-0.6B (55.52, 595M) despite being roughly 4× smaller. DenseOn also stays within half a point of the strongest current-generation dense baselines, pplx-embed-v1-0.6b (56.70, 596M) and jina-embeddings-v5-text-nano (56.06, 239M), both substantially larger.
Decontaminated BEIR
Standard benchmarks risk overestimating model quality when training data overlaps with evaluation corpora. This is a non-negligible risk in our case, as our mixture explorations are mostly built on BEIR evaluation. To quantify this and ensure that our model has not memorized possible leakage, we built decontaminated versions of the BEIR datasets by removing samples found in both the mGTE training dataset and in our internal pre-training dataset. Since many retrieval models draw from similar public sources (Wikipedia, MS MARCO, Common Crawl, academic corpora), we expect significant overlap across models and believe the decontaminated benchmarks provide a meaningful, if imperfect, stress test. The decontaminated datasets are publicly available on HuggingFace (see Appendix: Decontaminated BEIR Datasets).
Methodology. Contamination was detected using a two-pass approach (a minimal sketch follows the list):
- Exact hash matching. All texts (queries and corpus documents) were normalized (lowercased, Unicode NFKD, whitespace collapsed) and hashed with xxHash-64. The same normalization and hashing were applied to every query and document field in mgte-en. Any sample whose hash appeared in mgte-en was flagged as contaminated.
- 13-gram containment (GPT-3 style). Following the methodology introduced in the GPT-3 paper (Brown et al., 2020), word-level 13-grams were extracted from all remaining samples. For each sample, containment was computed as: containment = |ngrams_sample ∩ ngrams_mgte| / |ngrams_sample|. Samples with containment ≥ 0.5 were flagged as near-duplicates.
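As referenced above, a sketch of both passes; it assumes the mgte-en side has been pre-hashed and pre-n-grammed with the same normalization.

```python
import unicodedata
import xxhash

def normalize(text: str) -> str:
    """Lowercase, Unicode NFKD, collapse whitespace."""
    return " ".join(unicodedata.normalize("NFKD", text.lower()).split())

def text_hash(text: str) -> int:
    return xxhash.xxh64(normalize(text).encode("utf-8")).intdigest()

def ngrams(text: str, n: int = 13) -> set[tuple[str, ...]]:
    words = normalize(text).split()
    return {tuple(words[i : i + n]) for i in range(len(words) - n + 1)}

def is_contaminated(
    sample: str, train_hashes: set[int], train_ngrams: set[tuple[str, ...]]
) -> bool:
    # Pass 1: exact hash match after normalization.
    if text_hash(sample) in train_hashes:
        return True
    # Pass 2: GPT-3-style 13-gram containment >= 0.5.
    grams = ngrams(sample)
    return bool(grams) and len(grams & train_ngrams) / len(grams) >= 0.5
```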
Relevance judgments (qrels) referencing any removed query or corpus document were also removed. The resulting decontaminated corpora range from near-intact (ArguAna: 1.5% of documents removed) to heavily reduced (NQ: 88.6% removed, since it is largely built on Wikipedia), providing a stress test for model robustness.
ColBERT Models Generalize Better Under Decontamination
Comparing how ColBERT and dense models react to decontaminated evaluation reveals a consistent pattern: multi-vector models are more robust to corpus perturbation than single-vector models.
Scores are average NDCG@10 over 12 decontaminated BEIR datasets (excluding ClimateFEVER and FEVER). Avg Delta = Decontaminated BEIR - BEIR, per dataset.
LateOn and DenseOn stay consistent under decontamination. LateOn keeps its #1 position and DenseOn stays in the top four (ceding a single place to ColBERT-Zero, our other strong multi-vector model), even though the decontamination was built from their own pre-training data and would surface any train/test overlap as a sharp drop. Neither model flinches, which is direct evidence of generalization rather than overfitting.
All three ColBERT models hold or improve their ranking, and they take 2 of the top 3 decontaminated positions. The biggest climbers are AnswerAI ColBERT and ColBERT-Zero, each gaining 4 places, which is particularly striking for AnswerAI at only 33M parameters. The biggest loser is GTE-ModernBERT, dropping from 8th to last, which is particularly interesting considering our base mixture is derived from theirs. This highlights the strength of our curation methodology.
Late-interaction models generalize better. The three ColBERT models average a delta of +3.30 under decontamination, more than double the dense-model average of +1.44. Late interaction matches at the token level instead of compressing each document into a single vector, which is harder to overfit.
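For intuition, the MaxSim operator at the heart of late interaction fits in a few lines (a generic sketch, not PyLate's internal implementation):

```python
import torch

def maxsim(query_tokens: torch.Tensor, doc_tokens: torch.Tensor) -> torch.Tensor:
    """ColBERT late interaction. query_tokens: (num_q, dim),
    doc_tokens: (num_d, dim), both L2-normalized. Each query token keeps
    its best-matching document token; the per-token maxima are summed."""
    sim = query_tokens @ doc_tokens.T    # (num_q, num_d) token similarities
    return sim.max(dim=1).values.sum()   # row-wise max, summed over query tokens
```

Because every query token must independently find support in the document, a single memorized global direction cannot score well, which is consistent with the robustness pattern above.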
Two established dense baselines show clear overfitting. Decontamination shrinks the candidate pool and should hand every model free NDCG@10 points. Qwen3-Emb 0.6B is the most telling case: it posts one of the strongest standard-BEIR scores in the dense family (rank 5), yet gains barely half a point once decontaminated (Δ +0.58) and slides from 5th to 8th. At 595M parameters, a near-flat delta on an easier task strongly suggests overlap-driven inflation on standard BEIR. GTE-ModernBERT is worse still: it is the only model in the leaderboard whose absolute score actually drops (Δ −0.91) on an easier task, falling three positions to dead last. Arctic Embed L v2 (Δ +0.86, 7th → 9th) sits in a softer middle ground: its gain is smaller than corpus shrinkage alone would predict, but it avoids the sharp overfitting symptoms of Qwen3 and GTE-ModernBERT. The remaining dense models (pplx-v1-0.6b, jina-v5-text-nano, DenseOn) cluster around Δ +1.5 to +1.8 with at most a single-rank shift, consistent with honest generalization.
Datasets with very high removal rates tend to show large positive deltas across all models because the reduced corpus makes retrieval mechanically easier. The more informative signal comes from datasets with moderate or low removal rates, where corpus changes are meaningful but do not dominate. Full decontamination statistics and per-model results can be found in the Appendix.
What We Learned During our Explorations
Multinomial Sampling
Pre-training sources vary enormously in size. Proportional sampling would over-represent the largest (and often noisiest) sources. Following the mGTE technical report, we use multinomial sampling to smooth dataset probabilities, preventing forgetting on smaller sources and enabling higher learning rates. In addition, at every stage of training, each batch is drawn from a single data source, so the model cannot rely on superficial cues (e.g. formatting or domain style) to distinguish positives from negatives instead of learning genuine semantic relevance.
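A minimal sketch of this sampling scheme; the smoothing exponent and source sizes below are illustrative, not our exact values.

```python
import numpy as np

def source_probs(sizes: dict[str, int], alpha: float = 0.5) -> dict[str, float]:
    """Multinomial smoothing: p_i proportional to n_i ** alpha, so alpha < 1
    up-weights small sources relative to proportional sampling."""
    weights = {name: n**alpha for name, n in sizes.items()}
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}

sizes = {"fineweb_edu": 350_000_000, "msmarco": 500_000, "biorxiv": 50_000}
probs = source_probs(sizes)
rng = np.random.default_rng(0)
# Every batch comes from a single source, drawn multinomially.
batch_source = rng.choice(list(probs), p=list(probs.values()))
```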
Results Before Supervised Fine-Tuning Are Unreliable
As observed in other works such as Arctic Embed, performance after the unsupervised phase does not fully correlate with final performance after supervised fine-tuning. Some of our best-performing final models had worse pre-fine-tuning scores than runs that ultimately ended up much weaker. Finding a way to measure this "hidden knowledge" would make exploration faster and cheaper. We believe decontaminating evaluation datasets before training, rather than after, would sharpen this signal, and plan to adopt this approach going forward.
NanoBEIR Gives Signal, Until It Does Not
BEIR is the standard benchmark for evaluating retrieval models, but running it is expensive. NanoBEIR is a lightweight subset designed to give a quick approximation of BEIR scores during training, without the full computational cost.
We found NanoBEIR useful as a sanity check to verify that training is not diverging, but unreliable for model comparison and ablations. Models can score high on NanoBEIR yet underperform on full BEIR, and vice versa. We believe this is due to the benchmark's reduced passage set, but increasing its size may be impractical for fast evaluation during training. Coupled with the unreliability of pre-fine-tuning results, this unfortunately means that, to get signal on a pre-training mixture, we had to run the full training pipeline (including fine-tuning/knowledge distillation) and then the full BEIR evaluation, which made exploration quite expensive.
Prompts and Pooling Method Matter
Transformer encoders produce one vector per token. For dense models, obtaining a single embedding for the whole input requires a pooling strategy. The default approach in Sentence Transformers is mean pooling: averaging all token vectors. The alternative is CLS pooling: taking the vector of the special [CLS] token that sits at the very beginning of the sequence.
One of our largest gains for the dense model came from switching to CLS pooling combined with asymmetric prompts: we prepend query: to queries and document: to documents before encoding. We believe this asymmetry is the primary driver of the improvement, because it gives the model an explicit signal about what kind of input it is processing, allowing it to build distinct encoding strategies for queries and documents.
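A sketch of CLS pooling with asymmetric prompts using plain transformers; the checkpoint ID is a placeholder, and the released models may already handle prompts internally.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("lightonai/DenseOn")  # placeholder ID
model = AutoModel.from_pretrained("lightonai/DenseOn")

def embed(texts: list[str], prompt: str) -> torch.Tensor:
    batch = tokenizer([prompt + t for t in texts],
                      padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        out = model(**batch)
    # CLS pooling: the first token's hidden state represents the whole input.
    return out.last_hidden_state[:, 0]

queries = embed(["what is late interaction?"], prompt="query: ")
documents = embed(["ColBERT matches queries and documents token by token."],
                  prompt="document: ")
scores = torch.nn.functional.cosine_similarity(queries, documents)
```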
Conclusion
LateOn and DenseOn are the strongest open retrieval models at their size. LateOn achieves 57.22 average NDCG@10 on BEIR, surpassing every existing ColBERT model and outperforming dense models up to 4x larger. DenseOn reaches 56.20, topping all base-size dense models. On decontaminated BEIR, both models extend their lead: LateOn climbs to 60.36 and DenseOn to 57.71, with ColBERT models taking two of the top three positions and demonstrating stronger generalization than any dense alternative.
We release all models and intermediate checkpoints under Apache 2.0, together with decontaminated versions of all 14 BEIR datasets to support more rigorous evaluation across the community. All LateOn models were trained with PyLate, our open-source library for training and fine-tuning ColBERT models, and can be evaluated efficiently with FastPLAID. We trained DenseOn models with SentenceTransformers.
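A minimal quickstart, assuming the released checkpoints load through the standard PyLate and SentenceTransformers entry points (model IDs and prompt handling are placeholders; check the model cards):

```python
from pylate import models
from sentence_transformers import SentenceTransformer

# Multi-vector: encode queries and documents with LateOn via PyLate.
colbert = models.ColBERT(model_name_or_path="lightonai/LateOn")  # placeholder ID
q_emb = colbert.encode(["what is late interaction?"], is_query=True)
d_emb = colbert.encode(["ColBERT matches token by token."], is_query=False)

# Single-vector: encode with DenseOn via SentenceTransformers.
dense = SentenceTransformer("lightonai/DenseOn")  # placeholder ID
q = dense.encode(["what is late interaction?"], prompt="query: ")
d = dense.encode(["ColBERT matches token by token."], prompt="document: ")
```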
Get started now:
- 🤗 Models: LateOn | DenseOn
- 📚 Datasets: Pre-training | Pre-training (curated) | Fine-tuning
- 🛠️ Tools: PyLate | FastPLAID
Citation
@misc{sourty2026denseonlateon,
title={DenseOn with the LateOn: Open State-of-the-Art Single and Multi-Vector Models},
author={Sourty, Raphael and Chaffin, Antoine and Weller, Orion and Moura Junior, Paulo Roberto and Chatelain, Amelie},
year={2026},
howpublished={\url{https://huggingface.co/blog/lightonai/denseon-lateon}},
}
Acknowledgements
We thank Xin Zhang, Zach Nussbaum, Tom Aarsen, Bo Wang, Eugene Yang, Benjamin Clavié, Nandan Thakur, Oskar Hallström and Iacopo Poli for their valuable contributions and feedback. We are grateful to the teams behind Sentence Transformers and the BEIR benchmark, and to the open-source retrieval community, in particular the authors of Nomic Embed. This work was granted access to the HPC resources of IDRIS under GENCI allocations AS011016449, A0181016214, and A0171015706 (Jean Zay supercomputer). We also acknowledge the Barcelona Supercomputing Center (BSC-CNS) for providing access to MareNostrum 5 under EuroHPC AI Factory Fast Lane project EHPC-AIF-2025FL01-445.
Appendix
All Checkpoints
In addition to the main DenseOn and LateOn models, we release every intermediate checkpoint from our training pipeline:
- DenseOn-unsupervised: single-vector model after large-scale contrastive pre-training.
- DenseOn (recommended): further refined with hard-negative contrastive fine-tuning.
- LateOn-unsupervised: ColBERT model after contrastive pre-training.
- LateOn (recommended): further refined with hard-negative contrastive fine-tuning.
Decontaminated BEIR Datasets
All 14 decontaminated BEIR datasets are publicly available on HuggingFace:
| Dataset | Link |
|---|---|
| ArguAna | lightonai/arguana-decontaminated |
| ClimateFEVER | lightonai/climate-fever-decontaminated |
| DBPedia | lightonai/dbpedia-entity-decontaminated |
| FEVER | lightonai/fever-decontaminated |
| FiQA | lightonai/fiqa-decontaminated |
| HotpotQA | lightonai/hotpotqa-decontaminated |
| MS MARCO | lightonai/msmarco-decontaminated |
| NFCorpus | lightonai/nfcorpus-decontaminated |
| NQ | lightonai/nq-decontaminated |
| Quora | lightonai/quora-decontaminated |
| SCIDOCS | lightonai/scidocs-decontaminated |
| SciFact | lightonai/scifact-decontaminated |
| TREC-COVID | lightonai/trec-covid-decontaminated |
| Touche2020 | lightonai/webis-touche2020-decontaminated |
Full Decontaminated BEIR Results
Decontaminated BEIR: Multi-Vector (ColBERT) Models
Multi-vector (ColBERT) models on decontaminated BEIR (NDCG@10). Best in bold.
| Model | Params | ArguAna | DBPedia | FiQA | HotpotQA | MS MARCO | NFCorpus | NQ | Quora | SCIDOCS | SciFact | TREC-COVID | Touche2020 | Average (12ds) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| LateOn | 149M | 52.16 | 31.72 | 57.92 | 78.94 | 70.31 | 26.97 | 93.05 | 91.54 | 15.09 | 88.92 | 80.94 | 36.77 | 60.36 |
| ColBERT-Zero | 149M | 54.49 | 32.95 | 46.55 | 77.84 | 74.21 | 26.59 | 91.13 | 88.29 | 14.22 | 89.48 | 75.32 | 40.93 | 59.33 |
| AnswerAI | 33M | 47.63 | 30.38 | 45.75 | 76.35 | 72.39 | 25.41 | 86.10 | 89.36 | 12.72 | 87.82 | 77.51 | 27.81 | 56.60 |
Decontaminated BEIR: Single-Vector (Dense) Models
Single-vector (dense) models on decontaminated BEIR (NDCG@10). Best in bold.
| Model | Params | ArguAna | DBPedia | FiQA | HotpotQA | MS MARCO | NFCorpus | NQ | Quora | SCIDOCS | SciFact | TREC-COVID | Touche2020 | Average |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| pplx-v1-0.6b | 596M | 43.66 | 28.44 | 55.17 | 73.50 | 71.87 | 27.99 | 91.63 | 91.54 | 15.44 | 89.02 | 83.68 | 30.05 | 58.50 |
| DenseOn | 149M | 40.02 | 28.79 | 55.87 | 73.65 | 68.91 | 28.54 | 92.13 | 91.09 | 14.71 | 85.38 | 82.45 | 31.03 | 57.71 |
| jina-v5-text-nano | 239M | 47.22 | 30.18 | 51.45 | 67.48 | 68.58 | 29.38 | 92.26 | 91.33 | 14.93 | 89.36 | 76.80 | 33.16 | 57.68 |
| BGE-large | 335M | 46.01 | 28.91 | 49.26 | 75.17 | 68.87 | 29.84 | 85.86 | 91.26 | 13.97 | 86.54 | 72.71 | 26.92 | 56.28 |
| Qwen3-Emb. | 595M | 48.40 | 25.30 | 49.10 | 62.20 | 63.60 | 25.80 | 88.30 | 90.00 | 15.30 | 85.50 | 87.90 | 31.80 | 56.10 |
| Arctic-L v2 | 568M | 43.10 | 28.02 | 50.35 | 63.07 | 71.00 | 25.98 | 90.68 | 91.26 | 13.90 | 87.43 | 81.40 | 26.82 | 56.08 |
| BGE-base | 109M | 45.62 | 26.74 | 44.52 | 72.72 | 66.78 | 27.44 | 85.64 | 91.13 | 13.84 | 87.64 | 76.57 | 28.13 | 55.56 |
| MBEmb.-base | 149M | 36.51 | 24.69 | 46.04 | 62.71 | 65.28 | 24.27 | 89.33 | 89.93 | 12.91 | 85.53 | 82.67 | 33.12 | 54.42 |
| Nomic v1.5 | 137M | 35.82 | 28.78 | 44.67 | 72.71 | 67.44 | 24.36 | 85.11 | 87.22 | 12.69 | 83.30 | 80.65 | 29.38 | 54.34 |
| GTE-MB | 149M | 52.45 | 25.94 | 55.53 | 65.50 | 64.84 | 26.07 | 84.48 | 90.79 | 11.62 | 88.62 | 62.37 | 23.09 | 54.28 |
Decontamination Statistics
Percentage of documents and queries removed per dataset during decontamination, sorted by document removal rate.
| Dataset | Docs (orig.) | Docs (decon.) | Docs removed | Queries (orig.) | Queries (decon.) | Queries removed |
|---|---|---|---|---|---|---|
| NQ | 2,681,468 | 305,674 | 88.60% | 3,452 | 26 | 99.25% |
| SciFact | 5,183 | 858 | 83.45% | 300 | 41 | 86.33% |
| SCIDOCS | 25,657 | 5,833 | 77.27% | 1,000 | 201 | 79.90% |
| NFCorpus | 3,633 | 912 | 74.90% | 323 | 172 | 46.75% |
| DBPedia | 4,635,922 | 1,678,309 | 63.80% | 400 | 349 | 12.75% |
| HotpotQA | 5,233,329 | 2,314,813 | 55.77% | 7,405 | 2,092 | 71.75% |
| MS MARCO | 8,841,823 | 4,036,967 | 54.34% | 6,980 | 41 | 99.41% |
| TREC-COVID | 171,332 | 99,522 | 41.91% | 50 | 50 | 0.00% |
| Quora | 522,931 | 413,157 | 20.99% | 10,000 | 2,788 | 72.12% |
| FiQA | 57,638 | 47,617 | 17.39% | 648 | 61 | 90.59% |
| ArguAna | 8,674 | 8,546 | 1.48% | 1,406 | 1,373 | 2.35% |
| Touche2020 | 382,545 | 378,223 | 1.13% | 49 | 30 | 38.78% |
Ranking changes between BEIR and decontaminated BEIR (12 datasets*). Avg Δ = average score change per dataset. Positive rank change = places gained.
| Decon Rank | Model | Type | Params | BEIR* | Decontaminated BEIR | Avg Δ | BEIR Rank | Change |
|---|---|---|---|---|---|---|---|---|
| 1 | LateOn | ColBERT | 149M | 57.22 | 60.36 | +3.14 | 1 | = |
| 2 | ColBERT-Zero | ColBERT | 149M | 55.39 | 59.33 | +3.94 | 6 | +4 ▲ |
| 3 | pplx-v1-0.6b | Dense | 596M | 56.70 | 58.50 | +1.80 | 2 | −1 ▼ |
| 4 | DenseOn | Dense | 149M | 56.20 | 57.71 | +1.51 | 3 | −1 ▼ |
| 5 | jina-v5-text-nano | Dense | 239M | 56.08 | 57.68 | +1.60 | 4 | −1 ▼ |
| 6 | AnswerAI ColBERT | ColBERT | 33M | 53.79 | 56.60 | +2.81 | 10 | +4 ▲ |
| 7 | BGE Large v1.5 | Dense | 335M | 54.34 | 56.28 | +1.94 | 9 | +2 ▲ |
| 8 | Qwen3-Emb 0.6B | Dense | 595M | 55.52 | 56.10 | +0.58 | 5 | −3 ▼ |
| 9 | Arctic Embed L v2 | Dense | 568M | 55.22 | 56.08 | +0.86 | 7 | −2 ▼ |
| 10 | MBEmb base | Dense | 149M | 52.89 | 54.42 | +1.53 | 11 | +1 ▲ |
| 11 | GTE-ModernBERT | Dense | 149M | 55.19 | 54.28 | −0.91 | 8 | −3 ▼ |
*BEIR column computed on the same 12 datasets as the decontaminated evaluation (excluding ClimateFEVER and FEVER) so that Avg Δ = Decontaminated BEIR - BEIR.
References
- Zhang, P., et al. mGTE: Generalized Long-Context Text Representation and Reranking Models for Multilingual Text Retrieval. arXiv preprint arXiv:2407.19669, 2024.
- Nussbaum, Z., et al. Nomic Embed: Training a Reproducible Long Context Text Embedder. arXiv preprint arXiv:2402.01613, 2024.
- Moreira, G., et al. NV-Retriever: Improving text embedding models with effective hard-negative mining. arXiv preprint arXiv:2407.15831, 2024.
- Merrick, L., et al. Arctic-Embed: Scalable, Efficient, and Accurate Text Embedding Models. arXiv preprint arXiv:2405.05374, 2024.
- Chaffin, A., Arnaboldi, L., Chatelain, A., Krzakala, F. ColBERT-Zero: To Pre-train Or Not To Pre-train ColBERT models. arXiv preprint arXiv:2602.16609, 2026.
- Thakur, N., et al. BEIR: A Heterogeneous Benchmark for Zero-shot Evaluation of Information Retrieval Models. NeurIPS 2021 Datasets and Benchmarks Track. arXiv preprint arXiv:2104.08663, 2021.
- Warner, B., et al. ModernBERT: A Smarter BERT for the Modern Era. arXiv preprint arXiv:2412.13663, 2024.
- Khattab, O., Zaharia, M. ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT. SIGIR 2020. arXiv preprint arXiv:2004.12832, 2020.
- Chen, J., et al. bge-reranker-v2-gemma: A Lightweight Cross-Encoder Reranker. BAAI, 2024.
- Chaffin, A., Sourty, R. PyLate: Flexible Training and Retrieval for Late Interaction Models. Proceedings of CIKM 2025. arXiv preprint arXiv:2508.03555, 2025.
- Jha, R., et al. Jina-ColBERT-v2: A General-Purpose Multilingual Late Interaction Retriever. arXiv preprint arXiv:2408.16672, 2024.
