RoBERTa Model - Custom Implementation

A from-scratch implementation of RoBERTa (Robustly Optimized BERT Pre-training Approach) with a masked language modeling (MLM) objective, trained on a mix of encyclopedic, instruction-following, and narrative text.

Overview

This project implements a RoBERTa-style transformer encoder trained with the masked language modeling objective. The model learns to predict masked tokens from their surrounding context, and in doing so learns contextual representations of tokens and sequences.

Model Architecture

Core Components

1. Embeddings Layer

  • Token Embeddings: Maps input tokens to dense vectors (VOCAB_SIZE × HIDDEN_SIZE)
  • Positional Embeddings: Encodes absolute position information (MAX_LEN × HIDDEN_SIZE)
  • Layer Normalization: Normalizes embedding outputs for stable training
  • Dropout: Regularization with rate = 0.1
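
A minimal sketch of how these four components could fit together in PyTorch. It assumes a learned positional table and a `pad_id` of 0 passed as `padding_idx`; neither detail is confirmed by the notebook, and the class name `Embeddings` is illustrative.

```python
import torch
import torch.nn as nn

VOCAB_SIZE, MAX_LEN, HIDDEN_SIZE, DROPOUT = 16000, 128, 256, 0.1


class Embeddings(nn.Module):
    """Token + position embeddings, then LayerNorm and dropout."""

    def __init__(self, pad_id: int = 0):
        super().__init__()
        # Assumption: the pad row is frozen at zero via padding_idx.
        self.token = nn.Embedding(VOCAB_SIZE, HIDDEN_SIZE, padding_idx=pad_id)
        # Assumption: positions are learned, not sinusoidal.
        self.position = nn.Embedding(MAX_LEN, HIDDEN_SIZE)
        self.norm = nn.LayerNorm(HIDDEN_SIZE)
        self.dropout = nn.Dropout(DROPOUT)

    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        # input_ids: (batch, seq_len) with seq_len <= MAX_LEN
        positions = torch.arange(input_ids.size(1), device=input_ids.device)
        # Position vectors broadcast over the batch dimension.
        x = self.token(input_ids) + self.position(positions)
        return self.dropout(self.norm(x))
```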

2. Encoder Blocks (Stacked Transformer Encoders)

Each encoder block consists of:

  • Multi-Head Self-Attention:

    • Number of heads: 8
    • Allows the model to attend to different parts of the sequence
    • Uses key padding mask to ignore padding tokens
  • Feed-Forward Network (FFN):

    • Linear layer: HIDDEN_SIZE → FFN_DIM (256 → 1024)
    • Activation: GELU
    • Linear layer: FFN_DIM → HIDDEN_SIZE (1024 → 256)
    • Expands and compresses information
  • Residual Connections & Layer Normalization:

    • Applied after attention and FFN layers
    • Stabilizes deep network training
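
The block described above can be sketched as follows. This is an illustration built on `torch.nn.MultiheadAttention`, not necessarily how the notebook implements attention; the class name `EncoderBlock` and the dropout placement are assumptions.

```python
import torch
import torch.nn as nn


class EncoderBlock(nn.Module):
    """Post-norm encoder block: self-attention and FFN, each with a residual."""

    def __init__(self, hidden=256, heads=8, ffn_dim=1024, dropout=0.1):
        super().__init__()
        self.attn = nn.MultiheadAttention(
            hidden, heads, dropout=dropout, batch_first=True
        )
        self.norm1 = nn.LayerNorm(hidden)
        self.ffn = nn.Sequential(
            nn.Linear(hidden, ffn_dim),  # expand 256 -> 1024
            nn.GELU(),
            nn.Linear(ffn_dim, hidden),  # compress 1024 -> 256
        )
        self.norm2 = nn.LayerNorm(hidden)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, pad_mask):
        # pad_mask: bool (batch, seq_len), True where the token is padding.
        attn_out, _ = self.attn(
            x, x, x, key_padding_mask=pad_mask, need_weights=False
        )
        x = self.norm1(x + self.dropout(attn_out))    # residual, then LayerNorm
        x = self.norm2(x + self.dropout(self.ffn(x)))
        return x
```

With the key padding mask in place, the outputs at real-token positions do not depend on whatever sits in the padded positions.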

3. MLM (Masked Language Model) Head

  • Dense Layer: HIDDEN_SIZE → HIDDEN_SIZE (256 → 256)
  • GELU Activation: Non-linear transformation
  • Layer Normalization: Output normalization
  • Decoder: HIDDEN_SIZE → VOCAB_SIZE (256 → 16000)
  • Outputs logits for vocabulary prediction
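
As a rough sketch, the head maps each hidden state to one logit per vocabulary entry. The class name `MLMHead` is illustrative, and the decoder here is a separate linear layer; whether the project ties it to the token embedding matrix is not stated.

```python
import torch
import torch.nn as nn


class MLMHead(nn.Module):
    """Dense -> GELU -> LayerNorm -> decoder, producing vocabulary logits."""

    def __init__(self, hidden=256, vocab_size=16000):
        super().__init__()
        self.dense = nn.Linear(hidden, hidden)
        self.act = nn.GELU()
        self.norm = nn.LayerNorm(hidden)
        # Assumption: untied decoder weights with a bias term.
        self.decoder = nn.Linear(hidden, vocab_size)

    def forward(self, hidden_states):
        # hidden_states: (batch, seq_len, hidden)
        x = self.norm(self.act(self.dense(hidden_states)))
        return self.decoder(x)  # (batch, seq_len, vocab_size)
```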

Architecture Summary

Input IDs (batch_size, seq_len)
    ↓
Embeddings Layer (Token + Position + LayerNorm + Dropout)
    ↓
Encoder Block 1 (Multi-Head Attention + FFN + Residual + LayerNorm)
    ↓
Encoder Block 2 (Multi-Head Attention + FFN + Residual + LayerNorm)
    ↓
Encoder Block 3 (Multi-Head Attention + FFN + Residual + LayerNorm)
    ↓
Encoder Block 4 (Multi-Head Attention + FFN + Residual + LayerNorm)
    ↓
MLM Head (Dense + GELU + LayerNorm + Decoder)
    ↓
Output Logits (batch_size, seq_len, vocab_size)

Model Hyperparameters

Parameter     Value   Description
MAX_LEN       128     Maximum sequence length
HIDDEN_SIZE   256     Dimension of hidden states
NUM_LAYERS    4       Number of encoder blocks
NUM_HEADS     8       Number of attention heads
FFN_DIM       1024    Feed-forward network hidden dimension
DROPOUT       0.1     Dropout probability
VOCAB_SIZE    16000   Vocabulary size
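
These values imply a fairly small model. The estimate below is back-of-the-envelope arithmetic under stated assumptions (a learned position table, biases on every linear layer, LayerNorm with weight and bias, and an untied decoder); the real count will differ if the notebook makes other choices.

```python
def count_parameters(vocab=16000, max_len=128, hidden=256, layers=4, ffn=1024):
    """Estimate parameter counts from the hyperparameters alone."""
    layer_norm = 2 * hidden  # weight + bias
    embeddings = vocab * hidden + max_len * hidden + layer_norm
    # Q, K, V and output projections, each hidden x hidden plus a bias.
    # The number of heads does not change this count.
    attention = 4 * (hidden * hidden + hidden)
    feed_forward = (hidden * ffn + ffn) + (ffn * hidden + hidden)
    block = attention + feed_forward + 2 * layer_norm
    mlm_head = (hidden * hidden + hidden) + layer_norm + (hidden * vocab + vocab)
    encoder = layers * block
    return {
        "embeddings": embeddings,
        "encoder": encoder,
        "mlm_head": mlm_head,
        "total": embeddings + encoder + mlm_head,
    }


counts = count_parameters()
print(counts["total"])  # 11466624, roughly 11.5M parameters
```

Under these assumptions, the two vocabulary-sized matrices (token embeddings and decoder) account for about 8.2M of the 11.5M parameters.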

Data

Data Sources

The model is trained on three diverse datasets combined via interleaving:

  1. Wikipedia 2023

    • Source: wikimedia/wikipedia (20231101.en split)
    • Samples used: 100,000
    • High-quality encyclopedic text
  2. Alpaca Dataset

    • Source: tatsu-lab/alpaca
    • Samples used: 100,000
    • Instruction-following examples and outputs
  3. TinyStories

    • Source: roneneldan/TinyStories
    • Samples used: 100,000
    • Narrative and story-based text

Total samples: ~300,000 text documents

Preprocessing Pipeline

1. Corpus Generation

  • Combined datasets are interleaved to create a unified corpus
  • Text from all sources is concatenated into corpus.txt
  • One document per line
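
The corpus step can be illustrated with a plain-Python stand-in. This is not the Hugging Face `datasets` interleaving call the notebook may use; the helper names and the whitespace collapsing (to keep each document on one line) are assumptions.

```python
from itertools import zip_longest


def interleave(*sources):
    """Yield documents round-robin from each source until all are exhausted."""
    sentinel = object()
    for group in zip_longest(*sources, fillvalue=sentinel):
        for doc in group:
            if doc is not sentinel:
                yield doc


def write_corpus(path, *sources):
    """Write interleaved documents to `path`, one per line; return the count."""
    count = 0
    with open(path, "w", encoding="utf-8") as f:
        for doc in interleave(*sources):
            # Collapse internal newlines so a document never spans two lines.
            line = " ".join(doc.split())
            if line:
                f.write(line + "\n")
                count += 1
    return count
```

For example, `write_corpus("corpus.txt", wiki_texts, alpaca_texts, story_texts)` would alternate between the three sources, where the three iterables are hypothetical names for the loaded text columns.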

2. Tokenization

  • Tokenizer: SentencePiece (BPE model type)
  • Vocabulary size: 16,000 tokens
  • Special tokens:
    • <pad> (ID: 0) - Padding token
    • <unk> (ID: 1) - Unknown token
    • BOS (ID: 2) - Beginning of sequence
    • EOS (ID: 3) - End of sequence
    • [MASK] (custom) - Masking token

3. Chunking Strategy

  • Sequence length: 128 tokens (including BOS and EOS)
  • Content length: 126 tokens (128 - 2 for BOS/EOS)
  • Each text is split into overlapping chunks of 126 tokens
  • Chunks are padded to length 128 with pad tokens
  • Saved as PyTorch tensors in chunks/ directory (10,000 chunks per file)

Example chunk structure:

[BOS, token_1, token_2, ..., token_n, EOS, PAD, PAD, ...]   (n ≤ 126; a full chunk with n = 126 fills all 128 positions and needs no padding)
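
A minimal sketch of the chunking step, using the special-token IDs listed under Tokenization. The amount of overlap between chunks is not specified, so `stride` is a hypothetical parameter: the default gives back-to-back chunks, and a smaller value makes consecutive chunks overlap.

```python
import torch

PAD_ID, BOS_ID, EOS_ID = 0, 2, 3
MAX_LEN = 128
CONTENT_LEN = MAX_LEN - 2  # 126 content tokens per chunk


def chunk_ids(token_ids, stride=CONTENT_LEN):
    """Split a list of token IDs into padded chunks of length MAX_LEN."""
    chunks = []
    for start in range(0, len(token_ids), stride):
        content = token_ids[start:start + CONTENT_LEN]
        chunk = [BOS_ID] + content + [EOS_ID]
        chunk += [PAD_ID] * (MAX_LEN - len(chunk))  # pad the short tail
        chunks.append(chunk)
        if start + CONTENT_LEN >= len(token_ids):
            break  # this chunk already reached the end of the text
    if not chunks:
        return torch.empty((0, MAX_LEN), dtype=torch.long)
    return torch.tensor(chunks, dtype=torch.long)
```

The returned tensor has shape `(num_chunks, 128)` and could be collected in groups of 10,000 rows and written with `torch.save`.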

Dataset Statistics

  • Chunk files generated: 200+
  • Chunks per file: 10,000
  • Total chunks available for training: 2,000,000+
  • Sequence length: 128 tokens

Training

Masked Language Modeling (MLM) Objective

The model is trained to predict original tokens from a masked version of the input.

MLM Process (15% masking probability):

  1. Token Selection: 15% of tokens are randomly selected for masking

  2. Masking Strategy (applied to selected tokens):

    • 80% replaced with [MASK] token
    • 10% replaced with a random token from vocabulary
    • 10% left unchanged
  3. Loss Calculation:

    • Only masked positions contribute to the loss
    • Non-masked positions have labels set to -100 (ignored)
    • Cross-entropy loss is computed on masked token predictions

Example:

Original:     "the quick brown fox jumps over the lazy dog"
Tokens:       [2, 101, 102, 103, 104, 105, 106, 107, 108, 109, 3, 0, ...]
Masked:       [2, [MASK], 102, [MASK], 104, random_token, 106, 107, [MASK], 109, 3, 0, ...]
Labels:       [-100, 101, -100, 103, -100, 105, -100, -100, 108, -100, -100, -100, ...]
                      ↑          ↑          ↑                ↑
                      Predict original tokens at these positions
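
The masking procedure can be sketched as below. `mask_id` is a parameter because the ID of the custom [MASK] token is not given here. Two simplifications are assumptions of this sketch: BOS, EOS, and PAD are never selected, and random replacements are drawn from the whole vocabulary (so they may occasionally be special tokens). The sketch runs on CPU tensors.

```python
import torch

PAD_ID, BOS_ID, EOS_ID = 0, 2, 3
IGNORE_INDEX = -100


def mask_tokens(input_ids, mask_id, vocab_size, mlm_prob=0.15, generator=None):
    """Return (masked_ids, labels) following the 80/10/10 scheme."""
    labels = input_ids.clone()
    masked = input_ids.clone()

    special = (input_ids == PAD_ID) | (input_ids == BOS_ID) | (input_ids == EOS_ID)
    selected = (torch.rand(input_ids.shape, generator=generator) < mlm_prob) & ~special
    labels[~selected] = IGNORE_INDEX  # only selected positions are scored

    # One uniform draw per position decides what happens to a selected token.
    action = torch.rand(input_ids.shape, generator=generator)
    masked[selected & (action < 0.8)] = mask_id                 # 80%: [MASK]
    random_pos = selected & (action >= 0.8) & (action < 0.9)    # 10%: random token
    random_ids = torch.randint(0, vocab_size, input_ids.shape, generator=generator)
    masked[random_pos] = random_ids[random_pos]
    # Remaining 10% of selected positions keep their original token.
    return masked, labels
```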

Training Configuration

  • Loss function: CrossEntropyLoss
  • Optimizer: AdamW
  • Batch size: 4
  • Training subset: 200,000 samples (a subset of the full chunk set, used for test runs)
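
A sketch of a single optimization step with this configuration. The learning rate is an assumption (none is listed above), and the tiny stand-in model exists only so the snippet runs on its own; the real model would take its place.

```python
import torch
import torch.nn as nn


def train_step(model, optimizer, loss_fn, masked_ids, labels):
    """Run one forward/backward pass and return the scalar loss."""
    model.train()
    optimizer.zero_grad()
    logits = model(masked_ids)  # (batch, seq_len, vocab)
    # Flatten so each token position is one classification example.
    loss = loss_fn(logits.reshape(-1, logits.size(-1)), labels.reshape(-1))
    loss.backward()
    optimizer.step()
    return loss.item()


# Stand-in model (hypothetical): embedding followed by a linear decoder.
vocab_size = 100
model = nn.Sequential(nn.Embedding(vocab_size, 32), nn.Linear(32, vocab_size))
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)  # lr is an assumption
loss_fn = nn.CrossEntropyLoss(ignore_index=-100)  # skips non-masked positions

masked_ids = torch.randint(0, vocab_size, (4, 128))  # batch size 4
labels = torch.full((4, 128), -100, dtype=torch.long)
labels[:, 5] = masked_ids[:, 5]  # pretend position 5 was masked in every row
loss = train_step(model, optimizer, loss_fn, masked_ids, labels)
```

If a batch contained no labelled positions at all, the mean-reduced loss would be NaN, which is one reason to guarantee at least one masked token per sequence.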

Project Structure

d:\BERT/
├── main.ipynb                      # Main training and evaluation notebook
├── requirements.txt                # Python dependencies
├── README.md                       # This file
│
├── corpus.txt                      # Combined text corpus
├── chunks.txt                      # Chunk metadata
├── english_tokenizer.model         # SentencePiece tokenizer model
├── english_tokenizer.vocab         # Tokenizer vocabulary file
│
├── chunks/                         # Preprocessed token chunks
│   ├── all_chunks.pt              # All chunks combined
│   ├── chunks_0.pt
│   ├── chunks_1.pt
│   ├── ... (up to chunks_200+.pt)
│
├── models/                         # Trained model checkpoints
│   └── (saved model files)
│
├── model_checkpoints/              # Training checkpoints
│   └── (checkpoint files)
│
├── sentence_embeddings.html        # 3D visualization of sentence embeddings
├── word_embeddings.html            # 3D visualization of word embeddings
├── sentence_embeddings_2d.png      # 2D PCA visualization
│
└── bert_venv/                      # Python virtual environment

Usage

Dependencies

See requirements.txt for full list. Key packages:

torch                    # Deep learning framework
transformers            # For tokenizer utilities
datasets                # For loading public datasets
sentencepiece           # Tokenization
huggingface_hub         # Model hub integration
matplotlib              # Visualization
scikit-learn            # PCA visualization
plotly                  # Interactive plots
pandas                  # Data manipulation

Installation

# Create virtual environment
python -m venv bert_venv
source bert_venv/Scripts/activate  # On Windows (Git Bash)
# source bert_venv/bin/activate    # On Linux/macOS

# Install dependencies
pip install -r requirements.txt

Training

Run the notebook:

jupyter notebook main.ipynb

Key training steps:

  1. Load and authenticate with Hugging Face (requires HF_TOKEN in .env)
  2. Generate corpus from combined datasets
  3. Train SentencePiece tokenizer
  4. Create text chunks and save as tensors
  5. Initialize RoBERTa model
  6. Train with masked language modeling objective
  7. Save checkpoints and evaluate

Evaluation

The notebook includes:

  • Mask accuracy: Percentage of correctly predicted masked tokens
  • Embedding analysis:
    • Word embeddings visualization (3D and 2D PCA)
    • Sentence embeddings visualization
    • Similarity matrices between sequences
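
One common way to obtain sentence embeddings and a similarity matrix from encoder outputs is mean pooling followed by cosine similarity. The sketch below assumes that approach; the notebook may pool differently (for example, by taking the first token's hidden state), and both function names are illustrative.

```python
import torch
import torch.nn.functional as F


def sentence_embeddings(hidden_states, pad_mask):
    """Mean-pool (batch, seq_len, hidden) over non-padding positions."""
    # pad_mask: bool (batch, seq_len), True where the token is padding.
    keep = (~pad_mask).unsqueeze(-1).to(hidden_states.dtype)
    summed = (hidden_states * keep).sum(dim=1)
    counts = keep.sum(dim=1).clamp(min=1.0)  # avoid division by zero
    return summed / counts


def similarity_matrix(embeddings):
    """Pairwise cosine similarity between rows of (batch, hidden)."""
    normed = F.normalize(embeddings, dim=-1)
    return normed @ normed.T
```

The pooled vectors can then be projected to two or three dimensions with PCA for plotting.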

Key Features

✅ Custom transformer architecture - Built from scratch using PyTorch
✅ Efficient tokenization - SentencePiece BPE with 16K vocabulary
✅ Masked language modeling - Industry-standard pre-training objective
✅ Multi-source training data - Wikipedia, Alpaca, TinyStories
✅ Visualization tools - 3D embeddings and similarity analysis
✅ Checkpoint management - Save and resume training

Implementation Details

Special Design Choices

  1. Dynamic masking guarantee: If a sequence has no masked tokens, at least one token is randomly selected for masking to ensure effective learning

  2. Attention masking: Padding tokens are properly masked to prevent attention to pad positions

  3. Residual connections: Applied in both attention and FFN blocks for training stability

  4. GELU activation: Used instead of ReLU for smoother gradients and better performance

  5. Layer normalization placement: Applied after each sublayer's residual connection (post-norm), as in the original BERT and RoBERTa
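
The masking guarantee from item 1 could look roughly like this. It is a sketch under assumptions: `selected` marks the positions already chosen for masking, `maskable` marks positions that are allowed to be masked (for example, everything except special tokens), and both names are hypothetical.

```python
import torch


def ensure_one_masked(selected, maskable, generator=None):
    """Force at least one selected position in every row that has candidates.

    selected, maskable: bool tensors of shape (batch, seq_len).
    """
    selected = selected.clone()
    # Rows with nothing selected but at least one maskable position.
    empty_rows = ~selected.any(dim=1) & maskable.any(dim=1)
    for row in torch.nonzero(empty_rows).flatten().tolist():
        candidates = torch.nonzero(maskable[row]).flatten()
        pick = torch.randint(0, len(candidates), (1,), generator=generator).item()
        selected[row, candidates[pick]] = True
    return selected
```

Rows that contain no maskable position at all (for example, all padding) are left untouched.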

Performance Metrics

The model tracks:

  • Training loss: Cross-entropy loss on masked tokens
  • Masking accuracy: Correct predictions / total masked tokens
  • Per-layer hidden states: For analysis and visualization
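
The masking accuracy defined above can be computed from the logits and labels as in this sketch, which reuses the -100 convention for non-masked positions. The function name is illustrative.

```python
import torch


def mask_accuracy(logits, labels, ignore_index=-100):
    """Correct predictions / total masked tokens.

    logits: (batch, seq_len, vocab); labels: (batch, seq_len).
    """
    masked = labels != ignore_index
    total = masked.sum().item()
    if total == 0:
        return 0.0  # nothing was masked, so there is nothing to score
    predictions = logits.argmax(dim=-1)
    correct = (predictions[masked] == labels[masked]).sum().item()
    return correct / total
```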

License

MIT License - Feel free to use for research and educational purposes

Author

Custom implementation for learning and research purposes


Last Updated: 2026
Status: Active Development
