RoBERTa Model - Custom Implementation

A from-scratch implementation of RoBERTa (Robustly Optimized BERT Pretraining Approach) with a masked language modeling (MLM) objective, trained on diverse text data.

Overview

This project implements a RoBERTa-style transformer encoder model trained using the masked language modeling objective. The model learns to predict masked tokens in text sequences, developing a rich understanding of language patterns and semantics.

Model Architecture

Core Components

1. Embeddings Layer

Token Embeddings: Maps input tokens to dense vectors (VOCAB_SIZE × HIDDEN_SIZE)
Positional Embeddings: Encodes absolute position information (MAX_LEN × HIDDEN_SIZE)
Layer Normalization: Normalizes embedding outputs for stable training
Dropout: Regularization with rate = 0.1
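
The embeddings layer described above could be sketched in PyTorch roughly as follows. This is a minimal illustration, not the notebook's actual code: the class name `Embeddings`, the argument names, and the use of `padding_idx` for the pad token are assumptions.

```python
import torch
import torch.nn as nn


class Embeddings(nn.Module):
    """Token + learned absolute position embeddings, then LayerNorm and dropout."""

    def __init__(self, vocab_size=16000, max_len=128, hidden_size=256,
                 dropout=0.1, pad_id=0):
        super().__init__()
        # Assumption: the pad embedding is pinned to zero via padding_idx.
        self.token = nn.Embedding(vocab_size, hidden_size, padding_idx=pad_id)
        self.position = nn.Embedding(max_len, hidden_size)
        self.norm = nn.LayerNorm(hidden_size)
        self.dropout = nn.Dropout(dropout)

    def forward(self, input_ids):
        # input_ids: (batch_size, seq_len)
        positions = torch.arange(input_ids.size(1), device=input_ids.device)
        # Position embeddings (seq_len, hidden) broadcast over the batch dimension.
        x = self.token(input_ids) + self.position(positions)
        return self.dropout(self.norm(x))
```

Calling `Embeddings()(input_ids)` on a `(batch_size, seq_len)` tensor of token IDs returns a `(batch_size, seq_len, 256)` tensor.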

2. Encoder Blocks (Stacked Transformer Encoders)

Each encoder block consists of:

Multi-Head Self-Attention:
- Number of heads: 8
- Allows the model to attend to different parts of the sequence
- Uses key padding mask to ignore padding tokens
Feed-Forward Network (FFN):
- Linear layer: HIDDEN_SIZE → FFN_DIM (256 → 1024)
- Activation: GELU
- Linear layer: FFN_DIM → HIDDEN_SIZE (1024 → 256)
- Expands each position to 4× the hidden size, then projects back
Residual Connections & Layer Normalization:
- Applied after attention and FFN layers
- Stabilizes deep network training
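
One way to express a single encoder block with these components is shown below. It is a sketch under assumptions: it relies on PyTorch's built-in `nn.MultiheadAttention`, whereas the notebook may implement attention by hand, and the class and attribute names are hypothetical.

```python
import torch
import torch.nn as nn


class EncoderBlock(nn.Module):
    """Post-norm transformer encoder block: attention, then FFN."""

    def __init__(self, hidden_size=256, num_heads=8, ffn_dim=1024, dropout=0.1):
        super().__init__()
        self.attn = nn.MultiheadAttention(
            hidden_size, num_heads, dropout=dropout, batch_first=True
        )
        self.norm1 = nn.LayerNorm(hidden_size)
        self.ffn = nn.Sequential(
            nn.Linear(hidden_size, ffn_dim),   # 256 -> 1024
            nn.GELU(),
            nn.Linear(ffn_dim, hidden_size),   # 1024 -> 256
        )
        self.norm2 = nn.LayerNorm(hidden_size)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, key_padding_mask=None):
        # key_padding_mask: (batch, seq_len) bool, True where the token is padding.
        attn_out, _ = self.attn(
            x, x, x, key_padding_mask=key_padding_mask, need_weights=False
        )
        x = self.norm1(x + self.dropout(attn_out))      # residual, then LayerNorm
        x = self.norm2(x + self.dropout(self.ffn(x)))   # residual, then LayerNorm
        return x
```

The padding mask can be derived from the inputs as `input_ids == 0`, since the pad token has ID 0.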

3. MLM (Masked Language Model) Head

Dense Layer: HIDDEN_SIZE → HIDDEN_SIZE (256 → 256)
GELU Activation: Non-linear transformation
Layer Normalization: Output normalization
Decoder: HIDDEN_SIZE → VOCAB_SIZE (256 → 16000)
Outputs logits for vocabulary prediction
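
A minimal sketch of the MLM head follows. The class name is hypothetical, and the README does not say whether the decoder weights are tied to the token embeddings, so this version keeps them separate.

```python
import torch
import torch.nn as nn


class MLMHead(nn.Module):
    """Dense -> GELU -> LayerNorm -> decoder projection to the vocabulary."""

    def __init__(self, hidden_size=256, vocab_size=16000):
        super().__init__()
        self.dense = nn.Linear(hidden_size, hidden_size)
        self.act = nn.GELU()
        self.norm = nn.LayerNorm(hidden_size)
        # Assumption: untied decoder weights (no sharing with the embedding matrix).
        self.decoder = nn.Linear(hidden_size, vocab_size)

    def forward(self, hidden_states):
        # hidden_states: (batch, seq_len, hidden) -> logits: (batch, seq_len, vocab)
        return self.decoder(self.norm(self.act(self.dense(hidden_states))))
```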

Architecture Summary

Input IDs (batch_size, seq_len)
    ↓
Embeddings Layer (Token + Position + LayerNorm + Dropout)
    ↓
Encoder Block 1 (Multi-Head Attention + FFN + Residual + LayerNorm)
    ↓
Encoder Block 2 (Multi-Head Attention + FFN + Residual + LayerNorm)
    ↓
Encoder Block 3 (Multi-Head Attention + FFN + Residual + LayerNorm)
    ↓
Encoder Block 4 (Multi-Head Attention + FFN + Residual + LayerNorm)
    ↓
MLM Head (Dense + GELU + LayerNorm + Decoder)
    ↓
Output Logits (batch_size, seq_len, vocab_size)

Model Hyperparameters

Parameter	Value	Description
`MAX_LEN`	128	Maximum sequence length
`HIDDEN_SIZE`	256	Dimension of hidden states
`NUM_LAYERS`	4	Number of encoder blocks
`NUM_HEADS`	8	Number of attention heads
`FFN_DIM`	1024	Feed-forward network hidden dimension
`DROPOUT`	0.1	Dropout probability
`VOCAB_SIZE`	16000	Vocabulary size
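
For illustration, the hyperparameters in the table could be collected in a small config object. The `ModelConfig` name and the `head_dim` property are assumptions, not part of the notebook; the check simply makes explicit that the hidden size must divide evenly across the attention heads.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelConfig:
    max_len: int = 128
    hidden_size: int = 256
    num_layers: int = 4
    num_heads: int = 8
    ffn_dim: int = 1024
    dropout: float = 0.1
    vocab_size: int = 16000

    def __post_init__(self):
        # Each attention head works on an equal slice of the hidden state.
        if self.hidden_size % self.num_heads != 0:
            raise ValueError("hidden_size must be divisible by num_heads")

    @property
    def head_dim(self) -> int:
        return self.hidden_size // self.num_heads


config = ModelConfig()
print(config.head_dim)  # 32
```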

Data

Data Sources

The model is trained on three diverse datasets combined via interleaving:

Wikipedia 2023
- Source: wikimedia/wikipedia (20231101.en configuration)
- Samples used: 100,000
- High-quality encyclopedic text
Alpaca Dataset
- Source: tatsu-lab/alpaca
- Samples used: 100,000
- Instruction-following examples and outputs
TinyStories
- Source: roneneldan/TinyStories
- Samples used: 100,000
- Narrative and story-based text

Total samples: ~300,000 text documents

Preprocessing Pipeline

1. Corpus Generation

Combined datasets are interleaved to create a unified corpus
Text from all sources is concatenated into corpus.txt
One document per line
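
The corpus step can be illustrated with plain Python. This is a stand-in sketch: the notebook presumably uses the Hugging Face datasets library for interleaving, while the round-robin `interleave` helper and the whitespace collapsing in `write_corpus` below are assumptions chosen to guarantee one document per line.

```python
from itertools import zip_longest


def interleave(*sources):
    """Yield documents round-robin from each source until all are exhausted."""
    sentinel = object()
    for group in zip_longest(*sources, fillvalue=sentinel):
        for doc in group:
            if doc is not sentinel:
                yield doc


def write_corpus(docs, path):
    """Write one document per line and return the number of lines written."""
    count = 0
    with open(path, "w", encoding="utf-8") as f:
        for doc in docs:
            # Collapse newlines and repeated spaces so each document stays on one line.
            line = " ".join(doc.split())
            if line:
                f.write(line + "\n")
                count += 1
    return count
```

For example, `write_corpus(interleave(wiki_texts, alpaca_texts, story_texts), "corpus.txt")` would alternate between the three sources, where the three iterables of strings are hypothetical names.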

2. Tokenization

Tokenizer: SentencePiece (BPE model type)
Vocabulary size: 16,000 tokens
Special tokens:
- <pad> (ID: 0) - Padding token
- <unk> (ID: 1) - Unknown token
- BOS (ID: 2) - Beginning of sequence
- EOS (ID: 3) - End of sequence
- [MASK] (custom) - Masking token
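
A tokenizer with these settings could be trained along the following lines. This is a sketch, not the notebook's exact call: the helper names are hypothetical, and registering [MASK] through `user_defined_symbols` is an assumption about how the custom token was added. The `english_tokenizer` prefix matches the model files listed under Project Structure.

```python
def tokenizer_training_args(corpus_path="corpus.txt", model_prefix="english_tokenizer"):
    """Keyword arguments for SentencePieceTrainer.train matching the settings above."""
    return dict(
        input=corpus_path,
        model_prefix=model_prefix,       # writes <prefix>.model and <prefix>.vocab
        model_type="bpe",
        vocab_size=16000,
        pad_id=0,
        unk_id=1,
        bos_id=2,
        eos_id=3,
        user_defined_symbols=["[MASK]"],  # assumption: how [MASK] is registered
    )


def train_tokenizer(**overrides):
    """Train the tokenizer; keyword overrides replace the defaults above."""
    # Imported lazily so the arguments can be inspected without the package installed.
    import sentencepiece as spm

    spm.SentencePieceTrainer.train(**{**tokenizer_training_args(), **overrides})
```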

3. Chunking Strategy

Sequence length: 128 tokens (including BOS and EOS)
Content length: 126 tokens (128 - 2 for BOS/EOS)
Each text is split into overlapping chunks of 126 tokens
Chunks are padded to length 128 with pad tokens
Saved as PyTorch tensors in chunks/ directory (10,000 chunks per file)

Example chunk structure:

[BOS, token_1, token_2, ..., token_n, EOS, PAD, PAD, ...]   (n ≤ 126; padding appears only when n < 126)
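
The chunking strategy can be sketched in plain Python as follows. The amount of overlap between chunks is not stated, so `stride` is a hypothetical parameter: the default gives back-to-back windows, and a smaller value gives overlapping ones.

```python
PAD_ID, BOS_ID, EOS_ID = 0, 2, 3
MAX_LEN = 128
CONTENT_LEN = MAX_LEN - 2  # room left after BOS and EOS


def make_chunks(token_ids, stride=CONTENT_LEN):
    """Split a list of token IDs into BOS/EOS-wrapped windows padded to MAX_LEN."""
    chunks = []
    for start in range(0, len(token_ids), stride):
        window = token_ids[start:start + CONTENT_LEN]
        chunk = [BOS_ID] + window + [EOS_ID]
        chunk += [PAD_ID] * (MAX_LEN - len(chunk))  # right-pad short chunks
        chunks.append(chunk)
        if start + CONTENT_LEN >= len(token_ids):
            break  # this window already reached the end of the text
    return chunks
```

Each returned chunk has exactly 128 IDs, so a list of chunks can be passed directly to `torch.tensor` before saving.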

Dataset Statistics

Chunk files generated: 200+
Samples per file: 10,000
Total training samples: 2,000,000+
Sequence length: 128 tokens

Training

Masked Language Modeling (MLM) Objective

The model is trained to predict original tokens from a masked version of the input.

MLM Process (15% masking probability):

Token Selection: 15% of tokens are randomly selected for masking
Masking Strategy (applied to selected tokens):
- 80% replaced with [MASK] token
- 10% replaced with a random token from vocabulary
- 10% left unchanged
Loss Calculation:
- Only masked positions contribute to the loss
- Non-masked positions have labels set to -100 (ignored)
- Cross-entropy loss is computed on masked token predictions

Example:

Original:     "the quick brown fox jumps over the lazy dog"
Tokens:       [2, 101, 102, 103, 104, 105, 106, 107, 108, 109, 3, 0, ...]
Masked:       [2, [MASK], 102, [MASK], 104, random_token, 106, 107, [MASK], 109, 3, 0, ...]
Labels:       [-100, 101, -100, 103, -100, 105, -100, -100, 108, -100, -100, -100, ...]
                      ↑          ↑          ↑                ↑
                 Predict original tokens at these positions (masking rate exaggerated for illustration)
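
The masking procedure can be sketched in PyTorch as follows. Two things are assumptions here: the README does not give the ID of [MASK], so `mask_id` is a required argument, and the sketch never selects BOS, EOS, or PAD for masking, which the README does not state explicitly.

```python
import torch

PAD_ID, BOS_ID, EOS_ID = 0, 2, 3


def mask_tokens(input_ids, mask_id, vocab_size=16000, mlm_prob=0.15, generator=None):
    """Return (masked_inputs, labels) using 15% selection and the 80/10/10 rule."""
    device = input_ids.device
    inputs = input_ids.clone()
    labels = input_ids.clone()

    # Assumption: special tokens are never selected for masking.
    special = (input_ids == PAD_ID) | (input_ids == BOS_ID) | (input_ids == EOS_ID)
    probs = torch.full(input_ids.shape, mlm_prob, device=device)
    probs.masked_fill_(special, 0.0)
    selected = torch.bernoulli(probs, generator=generator).bool()
    labels[~selected] = -100  # ignored by CrossEntropyLoss

    # One uniform draw per position decides which treatment a selected token gets.
    roll = torch.rand(input_ids.shape, generator=generator, device=device)
    use_mask = selected & (roll < 0.8)                      # 80%: [MASK]
    use_random = selected & (roll >= 0.8) & (roll < 0.9)    # 10%: random token
    random_ids = torch.randint(
        0, vocab_size, input_ids.shape, generator=generator, device=device
    )
    inputs[use_mask] = mask_id
    inputs[use_random] = random_ids[use_random]
    # The remaining 10% of selected tokens are left unchanged.
    return inputs, labels
```

Because the function draws fresh random numbers on every call, applying it per batch gives each epoch a different mask for the same chunk.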

Training Configuration

Loss function: CrossEntropyLoss
Optimizer: AdamW
Batch size: 4
Dataset split: 200,000 training samples (subset for testing)
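
A single training step with this configuration might look like the sketch below. The tiny stand-in model exists only so the example runs on its own and is not the architecture described above; the learning rate is also an assumption, since the README does not state one.

```python
import torch
import torch.nn as nn


def train_step(model, optimizer, inputs, labels):
    """Run one optimization step and return the loss as a float."""
    model.train()
    loss_fn = nn.CrossEntropyLoss(ignore_index=-100)  # skips non-masked positions
    logits = model(inputs)  # (batch, seq_len, vocab_size)
    loss = loss_fn(logits.reshape(-1, logits.size(-1)), labels.reshape(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()


# Stand-in model (assumption): embedding + linear layer, just to exercise the step.
vocab_size = 100
model = nn.Sequential(nn.Embedding(vocab_size, 32), nn.Linear(32, vocab_size))
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)  # lr is an assumption

inputs = torch.randint(0, vocab_size, (4, 128))  # batch size 4
labels = torch.full((4, 128), -100)
labels[:, 5] = inputs[:, 5]  # pretend position 5 was masked in every sequence
loss = train_step(model, optimizer, inputs, labels)
```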

Project Structure

d:\BERT/
├── main.ipynb                      # Main training and evaluation notebook
├── requirements.txt                # Python dependencies
├── README.md                       # This file
│
├── corpus.txt                      # Combined text corpus
├── chunks.txt                      # Chunk metadata
├── english_tokenizer.model         # SentencePiece tokenizer model
├── english_tokenizer.vocab         # Tokenizer vocabulary file
│
├── chunks/                         # Preprocessed token chunks
│   ├── all_chunks.pt              # All chunks combined
│   ├── chunks_0.pt
│   ├── chunks_1.pt
│   └── ... (up to chunks_200+.pt)
│
├── models/                         # Trained model checkpoints
│   └── (saved model files)
│
├── model_checkpoints/              # Training checkpoints
│   └── (checkpoint files)
│
├── sentence_embeddings.html        # 3D visualization of sentence embeddings
├── word_embeddings.html            # 3D visualization of word embeddings
├── sentence_embeddings_2d.png      # 2D PCA visualization
│
└── bert_venv/                      # Python virtual environment

Usage

Dependencies

See requirements.txt for full list. Key packages:

torch                    # Deep learning framework
transformers            # For tokenizer utilities
datasets                # For loading public datasets
sentencepiece           # Tokenization
huggingface_hub         # Model hub integration
matplotlib              # Visualization
scikit-learn            # PCA visualization
plotly                  # Interactive plots
pandas                  # Data manipulation

Installation

# Create virtual environment
python -m venv bert_venv
source bert_venv/Scripts/activate  # Windows (Git Bash); on Linux/macOS use: source bert_venv/bin/activate

# Install dependencies
pip install -r requirements.txt

Training

Run the notebook:

jupyter notebook main.ipynb

Key training steps:

Load HF_TOKEN from .env and authenticate with Hugging Face
Generate corpus from combined datasets
Train SentencePiece tokenizer
Create text chunks and save as tensors
Initialize RoBERTa model
Train with masked language modeling objective
Save checkpoints and evaluate

Evaluation

The notebook includes:

Mask accuracy: Percentage of correctly predicted masked tokens
Embedding analysis:
- Word embeddings visualization (3D and 2D PCA)
- Sentence embeddings visualization
- Similarity matrices between sequences
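
Mask accuracy as defined above can be computed as in this sketch, assuming labels use -100 for non-masked positions as in the training objective. The function name is hypothetical.

```python
import torch


def mask_accuracy(logits, labels):
    """Fraction of masked positions (labels != -100) whose top prediction is correct."""
    masked = labels != -100
    if not masked.any():
        return float("nan")  # no masked positions to score
    preds = logits.argmax(dim=-1)
    return (preds[masked] == labels[masked]).float().mean().item()
```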

Key Features

✅ Custom transformer architecture - Built from scratch using PyTorch
✅ Efficient tokenization - SentencePiece BPE with 16K vocabulary
✅ Masked language modeling - Industry-standard pre-training objective
✅ Multi-source training data - Wikipedia, Alpaca, TinyStories
✅ Visualization tools - 3D embeddings and similarity analysis
✅ Checkpoint management - Save and resume training

Implementation Details

Special Design Choices

Dynamic masking guarantee: If a sequence has no masked tokens, at least one token is randomly selected for masking to ensure effective learning
Attention masking: Padding tokens are properly masked to prevent attention to pad positions
Residual connections: Applied in both attention and FFN blocks for training stability
GELU activation: Used instead of ReLU, following BERT and RoBERTa; its smooth curve avoids ReLU's hard cutoff at zero
Layer normalization placement: Applied after each sublayer's residual addition (post-norm), as in the original BERT and RoBERTa
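
The dynamic masking guarantee could be implemented along these lines. This is a sketch with hypothetical names: `selected` marks the positions already chosen for masking, and `candidates` marks the positions that are allowed to be masked (for example, everything except special tokens, which is an assumption).

```python
import torch


def ensure_one_masked(selected, candidates, generator=None):
    """Force one masked position in every row where nothing was selected.

    selected:   (batch, seq_len) bool, positions chosen for masking
    candidates: (batch, seq_len) bool, positions allowed to be masked
    """
    selected = selected.clone()
    # Rows with no selection that still have at least one eligible position.
    empty_rows = (~selected.any(dim=1)) & candidates.any(dim=1)
    for row in torch.nonzero(empty_rows).flatten().tolist():
        options = torch.nonzero(candidates[row]).flatten()
        pick = options[torch.randint(len(options), (1,), generator=generator)]
        selected[row, pick] = True
    return selected
```

Rows that already contain a masked position are returned unchanged, so the guarantee only affects the rare sequences where the 15% draw selected nothing.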