ModernBERT Embed base Legal Matryoshka

This is a sentence-transformers model fine-tuned from nomic-ai/modernbert-embed-base on the json dataset. It maps sentences and paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

Model Details

Model Description

  • Model Type: Sentence Transformer
  • Base model: nomic-ai/modernbert-embed-base
  • Maximum Sequence Length: 8192 tokens
  • Output Dimensionality: 768 dimensions
  • Similarity Function: Cosine Similarity
  • Training Dataset:
    • json
  • Language: en
  • License: apache-2.0

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 8192, 'do_lower_case': False, 'architecture': 'ModernBertModel'})
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)
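
The Pooling and Normalize modules above can be illustrated with a minimal pure-Python sketch: mean pooling averages the token vectors that are not padding, and normalization rescales the result to unit length. The toy 4-dimensional vectors and the helper names `mean_pool` and `l2_normalize` are illustrative assumptions, not part of the library.

```python
import math

def mean_pool(token_embeddings, attention_mask):
    """Average the token vectors whose mask value is 1 (padding is skipped)."""
    dim = len(token_embeddings[0])
    kept = [vec for vec, m in zip(token_embeddings, attention_mask) if m == 1]
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]

def l2_normalize(vec):
    """Rescale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

# Toy input: three real tokens plus one padding token, 4 dimensions each.
tokens = [
    [1.0, 0.0, 2.0, 0.0],
    [3.0, 0.0, 2.0, 0.0],
    [2.0, 0.0, 2.0, 3.0],
    [9.0, 9.0, 9.0, 9.0],  # padding, ignored because its mask value is 0
]
mask = [1, 1, 1, 0]

pooled = mean_pool(tokens, mask)
sentence_embedding = l2_normalize(pooled)
print(pooled)
print(sentence_embedding)
```

Because the output is unit length, the dot product of two such embeddings equals their cosine similarity.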

Usage

Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

pip install -U sentence-transformers

Then you can load this model and run inference.

from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("aaa961/modernbert-embed-base-legal-original")
# Run inference
sentences = [
    'conclusion, however, is weighty—steeped in myriad complexity and fraught with tension—and in the Court’s view, \nthis conclusion has significant implications for the scope of the FOIA.  The Court will further discuss the two-fold \nreasoning that leads to this result. \nFirst, permitting a member of the public to request from an agency a listing of search results or a listing that',
    'What does the Court believe about the conclusion?',
    'Where can the statement about the best value basis for awards in Polaris be found?',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# (3, 768)

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities)
# tensor([[1.0000, 0.5555, 0.0863],
#         [0.5555, 1.0000, 0.1753],
#         [0.0863, 0.1753, 1.0000]])
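
Since the model was trained with MatryoshkaLoss over the dimensions 768, 512, 256, 128, and 64, embeddings can in principle be shortened to any of those sizes by keeping the leading values. Sentence Transformers exposes a `truncate_dim` argument on `SentenceTransformer` for this; the sketch below only illustrates the underlying idea on toy 8-dimensional vectors, and the helper names are hypothetical.

```python
import math

def truncate_and_normalize(embedding, dim):
    """Keep the first `dim` values and rescale the result to unit length."""
    head = embedding[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head]

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

# Toy 8-dimensional vectors standing in for 768-dimensional model outputs.
query = [0.6, 0.4, 0.3, 0.2, 0.4, 0.3, 0.2, 0.2]
passage = [0.5, 0.5, 0.2, 0.3, 0.1, 0.4, 0.3, 0.3]

for dim in (8, 4, 2):
    q = truncate_and_normalize(query, dim)
    p = truncate_and_normalize(passage, dim)
    # On unit-length vectors the dot product equals cosine similarity.
    print(dim, round(dot(q, p), 4))
```

Re-normalizing after truncation matters because a slice of a unit-length vector is generally shorter than unit length.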

Evaluation

Metrics

Information Retrieval (dim_768)

Metric Value
cosine_accuracy@1 0.5379
cosine_accuracy@3 0.5842
cosine_accuracy@5 0.6708
cosine_accuracy@10 0.7512
cosine_precision@1 0.5379
cosine_precision@3 0.5085
cosine_precision@5 0.3852
cosine_precision@10 0.2306
cosine_recall@1 0.1919
cosine_recall@3 0.5063
cosine_recall@5 0.6242
cosine_recall@10 0.7361
cosine_ndcg@10 0.6428
cosine_mrr@10 0.5854
cosine_map@100 0.6291

Information Retrieval (dim_512)

Metric Value
cosine_accuracy@1 0.5317
cosine_accuracy@3 0.5719
cosine_accuracy@5 0.6708
cosine_accuracy@10 0.7573
cosine_precision@1 0.5317
cosine_precision@3 0.4992
cosine_precision@5 0.3821
cosine_precision@10 0.2325
cosine_recall@1 0.1892
cosine_recall@3 0.4957
cosine_recall@5 0.6185
cosine_recall@10 0.7393
cosine_ndcg@10 0.6404
cosine_mrr@10 0.5797
cosine_map@100 0.6222

Information Retrieval (dim_256)

Metric Value
cosine_accuracy@1 0.4884
cosine_accuracy@3 0.5363
cosine_accuracy@5 0.6321
cosine_accuracy@10 0.7187
cosine_precision@1 0.4884
cosine_precision@3 0.4673
cosine_precision@5 0.362
cosine_precision@10 0.2204
cosine_recall@1 0.1716
cosine_recall@3 0.4623
cosine_recall@5 0.5846
cosine_recall@10 0.705
cosine_ndcg@10 0.6021
cosine_mrr@10 0.5401
cosine_map@100 0.5854

Information Retrieval (dim_128)

Metric Value
cosine_accuracy@1 0.4389
cosine_accuracy@3 0.4838
cosine_accuracy@5 0.5641
cosine_accuracy@10 0.6631
cosine_precision@1 0.4389
cosine_precision@3 0.4163
cosine_precision@5 0.3233
cosine_precision@10 0.202
cosine_recall@1 0.1561
cosine_recall@3 0.4138
cosine_recall@5 0.5252
cosine_recall@10 0.648
cosine_ndcg@10 0.547
cosine_mrr@10 0.4867
cosine_map@100 0.5313

Information Retrieval (dim_64)

Metric Value
cosine_accuracy@1 0.3261
cosine_accuracy@3 0.3648
cosine_accuracy@5 0.4328
cosine_accuracy@10 0.5286
cosine_precision@1 0.3261
cosine_precision@3 0.3081
cosine_precision@5 0.2457
cosine_precision@10 0.1575
cosine_recall@1 0.1179
cosine_recall@3 0.3072
cosine_recall@5 0.3986
cosine_recall@10 0.5082
cosine_ndcg@10 0.4198
cosine_mrr@10 0.3678
cosine_map@100 0.4135
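
As a reading aid for the metric names above, the following sketch computes accuracy@k, precision@k, recall@k, MRR@k, and NDCG@k for a single query with binary relevance. The document IDs and the function name `ir_metrics` are made up for illustration; the reported values are presumably averages of such per-query scores over the whole evaluation set.

```python
import math

def ir_metrics(ranked_ids, relevant_ids, k):
    """Per-query retrieval metrics at cutoff k, assuming binary relevance."""
    top_k = ranked_ids[:k]
    hits = [1 if doc in relevant_ids else 0 for doc in top_k]

    # MRR@k: reciprocal rank of the first relevant result, 0 if none in top k.
    mrr = 0.0
    for rank, hit in enumerate(hits, start=1):
        if hit:
            mrr = 1.0 / rank
            break

    # NDCG@k: discounted gain divided by the gain of an ideal ordering.
    dcg = sum(hit / math.log2(rank + 1) for rank, hit in enumerate(hits, start=1))
    ideal_hits = min(len(relevant_ids), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))

    return {
        "accuracy": 1.0 if any(hits) else 0.0,   # at least one relevant hit
        "precision": sum(hits) / k,              # relevant share of the top k
        "recall": sum(hits) / len(relevant_ids), # share of relevant docs found
        "mrr": mrr,
        "ndcg": dcg / idcg,
    }

# Hypothetical query with three relevant passages, two of them retrieved.
ranked = ["d7", "d2", "d9", "d4", "d1"]
relevant = {"d2", "d4", "d8"}
scores = ir_metrics(ranked, relevant, k=5)
print(scores)
```

Under these definitions, recall@1 can be far below accuracy@1 whenever a query has several relevant passages, which is one possible reading of the gap between those two rows in the tables.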

Training Details

Training Dataset

json

  • Dataset: json
  • Size: 5,822 training samples
  • Columns: positive and anchor
  • Approximate statistics based on the first 1000 samples:
    • positive (type: string): min 33 tokens, mean 97.83 tokens, max 160 tokens
    • anchor (type: string): min 8 tokens, mean 16.69 tokens, max 38 tokens
  • Samples:
    • Sample 1
      positive:
      the IRGs. Id. at 248 & n.15. It did not matter that the NIMH “may be greatly influenced” by an
      IRG’s “expert view.” Id. at 248. Given the functions that IRGs were “empowered by law to
      perform,” they did not wield “substantial independent authority.” Id. at 247–48.

      Two months after Washington Research Project, Congress enacted the 1974 amendment
      anchor:
      What did Congress enact two months after Washington Research Project?
    • Sample 2
      positive:
      GSA’s interpretation of 13 C.F.R. § 125.9(b)(3)(i) harms protégés has broad implications. If
      exclusion from bidding on the SB Solicitation indeed harms either protégé member of SHS or
      VCH, perhaps this suggests the mentor-protégé relationships should not have been approved in the
      first instance. See 13 C.F.R. § 125.9(b)(3) (“In order for SBA to agree to allow a mentor to have
      anchor:
      Which two protégés could be harmed by exclusion from bidding on the SB Solicitation?
    • Sample 3
      positive:
      Black’s Law Dictionary 742 (9th ed. 2009) (defining “function” as “[a]ctivity that is appropriate
      to a particular business or profession”); Webster’s Third New Int’l Dictionary 920 (1981)
      (defining “function” as “the action for which a person or thing is specially fitted, used, or
      responsible or for which a thing exists”).
      anchor:
      What year was the 9th edition of Black’s Law Dictionary published?
  • Loss: MatryoshkaLoss with these parameters:
    {
        "loss": "MultipleNegativesRankingLoss",
        "matryoshka_dims": [
            768,
            512,
            256,
            128,
            64
        ],
        "matryoshka_weights": [
            1,
            1,
            1,
            1,
            1
        ],
        "n_dims_per_step": -1
    }
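
The configuration above can be read as follows: MultipleNegativesRankingLoss is evaluated once per entry in matryoshka_dims on embeddings truncated to that size, and the results are combined using matryoshka_weights. The sketch below illustrates that idea in pure Python on toy 4-dimensional vectors. The similarity scale of 20.0 and all function names are assumptions for illustration, not a copy of the library implementation.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def ranking_loss(anchors, positives, scale=20.0):
    """In-batch-negatives cross-entropy: anchor i should rank positive i first.

    `scale` is an assumed temperature-like factor applied to the similarities.
    """
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [scale * cosine(anchor, p) for p in positives]
        peak = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_sum_exp = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_sum_exp - logits[i]
    return total / len(anchors)

def matryoshka_loss(anchors, positives, dims, weights):
    """Weighted sum of the ranking loss over truncated embedding sizes."""
    return sum(
        w * ranking_loss([a[:d] for a in anchors], [p[:d] for p in positives])
        for d, w in zip(dims, weights)
    )

# Toy batch of two (anchor, positive) pairs with 4-dimensional embeddings.
anchors = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
positives = [[0.9, 0.1, 0.0, 0.0], [0.1, 0.9, 0.0, 0.0]]

loss = matryoshka_loss(anchors, positives, dims=[4, 2], weights=[1, 1])
print(loss)
```

With equal weights, every truncation size contributes equally, so the model is pushed to keep useful information in the leading dimensions.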
    

Training Hyperparameters

Non-Default Hyperparameters

  • per_device_train_batch_size: 32
  • num_train_epochs: 4
  • learning_rate: 2e-05
  • lr_scheduler_type: cosine
  • warmup_steps: 0.1
  • optim: adamw_torch_fused
  • gradient_accumulation_steps: 16
  • bf16: True
  • tf32: True
  • eval_strategy: epoch
  • per_device_eval_batch_size: 16
  • load_best_model_at_end: True
  • batch_sampler: no_duplicates
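
The batch size and gradient accumulation settings above determine the effective batch size and the number of optimizer steps. The following arithmetic sketch works them out from the 5,822 training samples, assuming a single device (the device count is not stated in this card) and no dropped last batch.

```python
import math

# Values taken from the card.
train_samples = 5822
per_device_train_batch_size = 32
gradient_accumulation_steps = 16
num_train_epochs = 4

# Assumption: training ran on one device.
num_devices = 1

# Samples that contribute to a single optimizer update.
effective_batch_size = (
    per_device_train_batch_size * gradient_accumulation_steps * num_devices
)

# Forward/backward passes per epoch (the last batch may be smaller).
micro_batches_per_epoch = math.ceil(
    train_samples / (per_device_train_batch_size * num_devices)
)

# Optimizer updates per epoch (the last update may accumulate fewer batches).
optimizer_steps_per_epoch = math.ceil(
    micro_batches_per_epoch / gradient_accumulation_steps
)
total_optimizer_steps = optimizer_steps_per_epoch * num_train_epochs

print(effective_batch_size, micro_batches_per_epoch,
      optimizer_steps_per_epoch, total_optimizer_steps)
```

Under these assumptions the totals agree with the Training Logs, where epoch 1.0 falls on step 12 and epoch 4.0 on step 48.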

All Hyperparameters

  • per_device_train_batch_size: 32
  • num_train_epochs: 4
  • max_steps: -1
  • learning_rate: 2e-05
  • lr_scheduler_type: cosine
  • lr_scheduler_kwargs: None
  • warmup_steps: 0.1
  • optim: adamw_torch_fused
  • optim_args: None
  • weight_decay: 0.0
  • adam_beta1: 0.9
  • adam_beta2: 0.999
  • adam_epsilon: 1e-08
  • optim_target_modules: None
  • gradient_accumulation_steps: 16
  • average_tokens_across_devices: True
  • max_grad_norm: 1.0
  • label_smoothing_factor: 0.0
  • bf16: True
  • fp16: False
  • bf16_full_eval: False
  • fp16_full_eval: False
  • tf32: True
  • gradient_checkpointing: False
  • gradient_checkpointing_kwargs: None
  • torch_compile: False
  • torch_compile_backend: None
  • torch_compile_mode: None
  • use_liger_kernel: False
  • liger_kernel_config: None
  • use_cache: False
  • neftune_noise_alpha: None
  • torch_empty_cache_steps: None
  • auto_find_batch_size: False
  • log_on_each_node: True
  • logging_nan_inf_filter: True
  • include_num_input_tokens_seen: no
  • log_level: passive
  • log_level_replica: warning
  • disable_tqdm: False
  • project: huggingface
  • trackio_space_id: trackio
  • eval_strategy: epoch
  • per_device_eval_batch_size: 16
  • prediction_loss_only: True
  • eval_on_start: False
  • eval_do_concat_batches: True
  • eval_use_gather_object: False
  • eval_accumulation_steps: None
  • include_for_metrics: []
  • batch_eval_metrics: False
  • save_only_model: False
  • save_on_each_node: False
  • enable_jit_checkpoint: False
  • push_to_hub: False
  • hub_private_repo: None
  • hub_model_id: None
  • hub_strategy: every_save
  • hub_always_push: False
  • hub_revision: None
  • load_best_model_at_end: True
  • ignore_data_skip: False
  • restore_callback_states_from_checkpoint: False
  • full_determinism: False
  • seed: 42
  • data_seed: None
  • use_cpu: False
  • accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
  • parallelism_config: None
  • dataloader_drop_last: False
  • dataloader_num_workers: 0
  • dataloader_pin_memory: True
  • dataloader_persistent_workers: False
  • dataloader_prefetch_factor: None
  • remove_unused_columns: True
  • label_names: None
  • train_sampling_strategy: random
  • length_column_name: length
  • ddp_find_unused_parameters: None
  • ddp_bucket_cap_mb: None
  • ddp_broadcast_buffers: False
  • ddp_backend: None
  • ddp_timeout: 1800
  • fsdp: []
  • fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
  • deepspeed: None
  • debug: []
  • skip_memory_metrics: True
  • do_predict: False
  • resume_from_checkpoint: None
  • warmup_ratio: None
  • local_rank: -1
  • prompts: None
  • batch_sampler: no_duplicates
  • multi_dataset_batch_sampler: proportional
  • router_mapping: {}
  • learning_rate_mapping: {}

Training Logs

Epoch   Step  Training Loss  dim_768_cosine_ndcg@10  dim_512_cosine_ndcg@10  dim_256_cosine_ndcg@10  dim_128_cosine_ndcg@10  dim_64_cosine_ndcg@10
0.8791  10    5.6857         -                       -                       -                       -                       -
1.0     12    -              0.6080                  0.5895                  0.5496                  0.4814                  0.3514
1.7033  20    2.7243         -                       -                       -                       -                       -
2.0     24    -              0.6351                  0.6230                  0.5869                  0.5244                  0.3940
2.5275  30    2.0143         -                       -                       -                       -                       -
3.0     36    -              0.6404                  0.6403                  0.6022                  0.5458                  0.4158
3.3516  40    1.7492         -                       -                       -                       -                       -
4.0     48    -              0.6428                  0.6404                  0.6021                  0.547                   0.4198
  • The bold row denotes the saved checkpoint.
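
To make the size versus quality trade-off in the last logged evaluation easier to read, the sketch below expresses each dimension's NDCG@10 as a fraction of the 768-dimensional score and pairs it with the storage cost per vector. The float32 storage format (4 bytes per value) is an assumption; the scores are copied from the epoch 4.0 row.

```python
# NDCG@10 per embedding size, taken from the epoch 4.0 row of the logs.
final_ndcg_at_10 = {768: 0.6428, 512: 0.6404, 256: 0.6021, 128: 0.547, 64: 0.4198}

# Assumption: vectors are stored as float32.
BYTES_PER_VALUE = 4

full_score = final_ndcg_at_10[768]
retention = {}
storage_bytes = {}
for dim, score in final_ndcg_at_10.items():
    retention[dim] = score / full_score        # share of full-size quality kept
    storage_bytes[dim] = dim * BYTES_PER_VALUE # bytes needed per embedding
    print(f"dim={dim:>3}  retained={retention[dim]:.1%}  bytes={storage_bytes[dim]}")
```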

Framework Versions

  • Python: 3.12.11
  • Sentence Transformers: 5.3.0
  • Transformers: 5.3.0
  • PyTorch: 2.5.1+cu121
  • Accelerate: 1.13.0
  • Datasets: 4.8.2
  • Tokenizers: 0.22.2

Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}

MatryoshkaLoss

@misc{kusupati2024matryoshka,
    title={Matryoshka Representation Learning},
    author={Aditya Kusupati and Gantavya Bhatt and Aniket Rege and Matthew Wallingford and Aditya Sinha and Vivek Ramanujan and William Howard-Snyder and Kaifeng Chen and Sham Kakade and Prateek Jain and Ali Farhadi},
    year={2024},
    eprint={2205.13147},
    archivePrefix={arXiv},
    primaryClass={cs.LG}
}

MultipleNegativesRankingLoss

@misc{oord2019representationlearningcontrastivepredictive,
      title={Representation Learning with Contrastive Predictive Coding},
      author={Aaron van den Oord and Yazhe Li and Oriol Vinyals},
      year={2019},
      eprint={1807.03748},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/1807.03748},
}