RoBERTa Model - Custom Implementation
A from-scratch implementation of RoBERTa (Robustly Optimized BERT Pre-training Approach), trained with a masked language modeling (MLM) objective on diverse text data.
Overview
This project implements a RoBERTa-style transformer encoder model trained using the masked language modeling objective. The model learns to predict masked tokens in text sequences, developing a rich understanding of language patterns and semantics.
Model Architecture
Core Components
1. Embeddings Layer
- Token Embeddings: Maps input tokens to dense vectors (VOCAB_SIZE × HIDDEN_SIZE)
- Positional Embeddings: Encodes absolute position information (MAX_LEN × HIDDEN_SIZE)
- Layer Normalization: Normalizes embedding outputs for stable training
- Dropout: Regularization with rate = 0.1
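The embeddings layer described above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the class name `Embeddings` and the use of a learned position table are assumptions inferred from the stated shapes.

```python
import torch
import torch.nn as nn

VOCAB_SIZE, MAX_LEN, HIDDEN_SIZE, DROPOUT = 16000, 128, 256, 0.1


class Embeddings(nn.Module):
    """Token + absolute position embeddings, then LayerNorm and dropout."""

    def __init__(self, vocab_size=VOCAB_SIZE, max_len=MAX_LEN,
                 hidden_size=HIDDEN_SIZE, dropout=DROPOUT):
        super().__init__()
        self.token = nn.Embedding(vocab_size, hidden_size)
        # Assumption: positions are a learned lookup table of size MAX_LEN.
        self.position = nn.Embedding(max_len, hidden_size)
        self.norm = nn.LayerNorm(hidden_size)
        self.dropout = nn.Dropout(dropout)

    def forward(self, input_ids):
        # input_ids: (batch_size, seq_len), with seq_len <= max_len
        positions = torch.arange(input_ids.size(1), device=input_ids.device)
        # Position vectors broadcast over the batch dimension.
        x = self.token(input_ids) + self.position(positions)
        return self.dropout(self.norm(x))
```

Calling the module on a `(batch_size, seq_len)` tensor of token IDs returns a `(batch_size, seq_len, HIDDEN_SIZE)` tensor.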
2. Encoder Blocks (Stacked Transformer Encoders)
Each encoder block consists of:
Multi-Head Self-Attention:
- Number of heads: 8
- Allows the model to attend to different parts of the sequence
- Uses key padding mask to ignore padding tokens
Feed-Forward Network (FFN):
- Linear layer: HIDDEN_SIZE → FFN_DIM (256 → 1024)
- Activation: GELU
- Linear layer: FFN_DIM → HIDDEN_SIZE (1024 → 256)
- Expands each position's representation, then projects it back to the hidden size
Residual Connections & Layer Normalization:
- Applied after attention and FFN layers
- Stabilizes deep network training
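One way to put these pieces together is sketched below. It assumes PyTorch's built-in `nn.MultiheadAttention` for the attention sublayer (the source mentions a key padding mask but does not show the code), and the class name `EncoderBlock` is hypothetical.

```python
import torch
import torch.nn as nn


class EncoderBlock(nn.Module):
    """Post-norm encoder block: attention and FFN, each with residual + LayerNorm."""

    def __init__(self, hidden_size=256, num_heads=8, ffn_dim=1024, dropout=0.1):
        super().__init__()
        self.attn = nn.MultiheadAttention(
            hidden_size, num_heads, dropout=dropout, batch_first=True
        )
        self.norm1 = nn.LayerNorm(hidden_size)
        self.ffn = nn.Sequential(
            nn.Linear(hidden_size, ffn_dim),
            nn.GELU(),
            nn.Linear(ffn_dim, hidden_size),
        )
        self.norm2 = nn.LayerNorm(hidden_size)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, key_padding_mask=None):
        # key_padding_mask: (batch, seq_len) bool, True where the token is padding.
        attn_out, _ = self.attn(
            x, x, x, key_padding_mask=key_padding_mask, need_weights=False
        )
        x = self.norm1(x + self.dropout(attn_out))       # residual, then LayerNorm
        x = self.norm2(x + self.dropout(self.ffn(x)))    # residual, then LayerNorm
        return x
```

With the mask set, real tokens never attend to pad positions, so their outputs do not depend on whatever sits in the padded slots.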
3. MLM (Masked Language Model) Head
- Dense Layer: HIDDEN_SIZE → HIDDEN_SIZE (256 → 256)
- GELU Activation: Non-linear transformation
- Layer Normalization: Output normalization
- Decoder: HIDDEN_SIZE → VOCAB_SIZE (256 → 16000)
- Outputs logits for vocabulary prediction
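The head listed above maps each hidden state to a score per vocabulary entry. A minimal sketch, with a hypothetical class name and no weight tying to the token embeddings (the source does not say whether weights are tied):

```python
import torch
import torch.nn as nn


class MLMHead(nn.Module):
    """Dense -> GELU -> LayerNorm -> projection to vocabulary logits."""

    def __init__(self, hidden_size=256, vocab_size=16000):
        super().__init__()
        self.dense = nn.Linear(hidden_size, hidden_size)
        self.act = nn.GELU()
        self.norm = nn.LayerNorm(hidden_size)
        self.decoder = nn.Linear(hidden_size, vocab_size)

    def forward(self, hidden_states):
        # hidden_states: (batch_size, seq_len, hidden_size)
        x = self.norm(self.act(self.dense(hidden_states)))
        return self.decoder(x)  # (batch_size, seq_len, vocab_size)
```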
Architecture Summary
Input IDs (batch_size, seq_len)
↓
Embeddings Layer (Token + Position + LayerNorm + Dropout)
↓
Encoder Block 1 (Multi-Head Attention + FFN + Residual + LayerNorm)
↓
Encoder Block 2 (Multi-Head Attention + FFN + Residual + LayerNorm)
↓
Encoder Block 3 (Multi-Head Attention + FFN + Residual + LayerNorm)
↓
Encoder Block 4 (Multi-Head Attention + FFN + Residual + LayerNorm)
↓
MLM Head (Dense + GELU + LayerNorm + Decoder)
↓
Output Logits (batch_size, seq_len, vocab_size)
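To check the shapes in this flow end to end, the sketch below wires the same stages together. It deliberately swaps in PyTorch's built-in `nn.TransformerEncoder` (post-norm, GELU) as a stand-in for the project's hand-written encoder blocks, so it illustrates the data flow rather than the actual implementation. The class name and `PAD_ID = 0` usage are assumptions.

```python
import torch
import torch.nn as nn

VOCAB_SIZE, MAX_LEN, HIDDEN_SIZE = 16000, 128, 256
NUM_LAYERS, NUM_HEADS, FFN_DIM, DROPOUT, PAD_ID = 4, 8, 1024, 0.1, 0


class TinyRobertaSketch(nn.Module):
    def __init__(self):
        super().__init__()
        self.tok = nn.Embedding(VOCAB_SIZE, HIDDEN_SIZE)
        self.pos = nn.Embedding(MAX_LEN, HIDDEN_SIZE)
        self.emb_norm = nn.LayerNorm(HIDDEN_SIZE)
        self.emb_drop = nn.Dropout(DROPOUT)
        # Stand-in for the custom encoder blocks (norm_first=False means post-norm).
        layer = nn.TransformerEncoderLayer(
            d_model=HIDDEN_SIZE, nhead=NUM_HEADS, dim_feedforward=FFN_DIM,
            dropout=DROPOUT, activation="gelu", batch_first=True, norm_first=False,
        )
        self.encoder = nn.TransformerEncoder(
            layer, num_layers=NUM_LAYERS, enable_nested_tensor=False
        )
        self.head = nn.Sequential(
            nn.Linear(HIDDEN_SIZE, HIDDEN_SIZE), nn.GELU(),
            nn.LayerNorm(HIDDEN_SIZE), nn.Linear(HIDDEN_SIZE, VOCAB_SIZE),
        )

    def forward(self, input_ids):
        pad_mask = input_ids.eq(PAD_ID)  # True at padding positions
        pos = torch.arange(input_ids.size(1), device=input_ids.device)
        x = self.emb_drop(self.emb_norm(self.tok(input_ids) + self.pos(pos)))
        x = self.encoder(x, src_key_padding_mask=pad_mask)
        return self.head(x)  # (batch_size, seq_len, vocab_size)
```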
Model Hyperparameters
| Parameter | Value | Description |
|---|---|---|
| MAX_LEN | 128 | Maximum sequence length |
| HIDDEN_SIZE | 256 | Dimension of hidden states |
| NUM_LAYERS | 4 | Number of encoder blocks |
| NUM_HEADS | 8 | Number of attention heads |
| FFN_DIM | 1024 | Feed-forward network hidden dimension |
| DROPOUT | 0.1 | Dropout probability |
| VOCAB_SIZE | 16000 | Vocabulary size |
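For reference, the same values can be collected in one configuration object. The `ModelConfig` dataclass below is a hypothetical convenience, not something the notebook is known to define. It also makes explicit a constraint the table implies: the hidden size must divide evenly across the attention heads.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelConfig:
    max_len: int = 128
    hidden_size: int = 256
    num_layers: int = 4
    num_heads: int = 8
    ffn_dim: int = 1024
    dropout: float = 0.1
    vocab_size: int = 16000

    def __post_init__(self):
        # Multi-head attention splits hidden_size evenly across heads.
        if self.hidden_size % self.num_heads != 0:
            raise ValueError("hidden_size must be divisible by num_heads")

    @property
    def head_dim(self) -> int:
        return self.hidden_size // self.num_heads


config = ModelConfig()
print(config.head_dim)  # prints 32 (256 / 8)
```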
Data
Data Sources
The model is trained on three diverse datasets combined via interleaving:
Wikipedia 2023
- Source: wikimedia/wikipedia (20231101.en subset)
- Samples used: 100,000
- High-quality encyclopedic text
Alpaca Dataset
- Source: tatsu-lab/alpaca
- Samples used: 100,000
- Instruction-following examples and outputs
TinyStories
- Source: roneneldan/TinyStories
- Samples used: 100,000
- Narrative and story-based text
Total samples: ~300,000 text documents
Preprocessing Pipeline
1. Corpus Generation
- Combined datasets are interleaved to create a unified corpus
- Text from all sources is concatenated into corpus.txt
- One document per line
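The corpus step can be illustrated without downloading anything. The sketch below uses plain Python iterables as stand-ins for the three datasets and a simple round-robin interleave. The function names are hypothetical, and the notebook may well use the `datasets` library's own interleaving instead.

```python
from itertools import zip_longest


def interleave(*sources):
    """Yield documents round-robin from each source until all are exhausted."""
    sentinel = object()
    for group in zip_longest(*sources, fillvalue=sentinel):
        for doc in group:
            if doc is not sentinel:
                yield doc


def write_corpus(docs, path):
    """Write one document per line; returns the number of lines written."""
    count = 0
    with open(path, "w", encoding="utf-8") as f:
        for doc in docs:
            # Collapse newlines and repeated spaces so each document stays on one line.
            line = " ".join(doc.split())
            if line:  # skip empty documents
                f.write(line + "\n")
                count += 1
    return count
```

Collapsing internal newlines matters here because the tokenizer trainer treats each line of corpus.txt as a separate training sentence.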
2. Tokenization
- Tokenizer: SentencePiece (BPE model type)
- Vocabulary size: 16,000 tokens
- Special tokens:
  - <pad> (ID: 0) - Padding token
  - <unk> (ID: 1) - Unknown token
  - BOS (ID: 2) - Beginning of sequence
  - EOS (ID: 3) - End of sequence
  - [MASK] (custom) - Masking token
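A tokenizer with these settings could be trained along the following lines. This is a sketch under assumptions: the helper names are hypothetical, the file names are taken from the project structure, and registering `[MASK]` through `user_defined_symbols` is one plausible way to add the custom token (the source does not say how it was added or which ID it receives).

```python
def tokenizer_args(corpus_path, model_prefix, vocab_size=16000):
    """Collect SentencePiece trainer options matching the settings listed above."""
    return dict(
        input=corpus_path,
        model_prefix=model_prefix,      # writes <prefix>.model and <prefix>.vocab
        vocab_size=vocab_size,
        model_type="bpe",
        pad_id=0, unk_id=1, bos_id=2, eos_id=3,
        user_defined_symbols="[MASK]",  # assumption: how the custom token is registered
    )


def train_tokenizer(corpus_path="corpus.txt", model_prefix="english_tokenizer"):
    # Imported lazily so the argument helper can be inspected without the package.
    import sentencepiece as spm

    spm.SentencePieceTrainer.train(**tokenizer_args(corpus_path, model_prefix))
```

After training, `spm.SentencePieceProcessor(model_file="english_tokenizer.model")` loads the model, and `piece_to_id("[MASK]")` returns the ID to use when masking.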
3. Chunking Strategy
- Sequence length: 128 tokens (including BOS and EOS)
- Content length: 126 tokens (128 - 2 for BOS/EOS)
- Each text is split into overlapping chunks of 126 tokens
- Chunks are padded to length 128 with pad tokens
- Saved as PyTorch tensors in the chunks/ directory (10,000 chunks per file)
Example chunk structure:
[BOS, token_1, token_2, ..., token_n, EOS, PAD, PAD, ...] (n ≤ 126; PAD appears only when n < 126)
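The chunking strategy can be sketched as a sliding window over each document's token IDs. The amount of overlap is not stated in this README, so the `stride` default below (half a window) is purely an assumption, as is the function name.

```python
import torch

PAD_ID, BOS_ID, EOS_ID = 0, 2, 3
MAX_LEN = 128
CONTENT_LEN = MAX_LEN - 2  # room for BOS and EOS


def chunk_token_ids(token_ids, stride=CONTENT_LEN // 2):
    """Split token IDs into overlapping windows, add BOS/EOS, pad to MAX_LEN."""
    if not token_ids:
        return torch.empty((0, MAX_LEN), dtype=torch.long)
    chunks = []
    start = 0
    while True:
        window = list(token_ids[start:start + CONTENT_LEN])
        ids = [BOS_ID] + window + [EOS_ID]
        ids += [PAD_ID] * (MAX_LEN - len(ids))  # only the last window needs padding
        chunks.append(ids)
        if start + CONTENT_LEN >= len(token_ids):
            break
        start += stride  # hypothetical overlap: consecutive windows share tokens
    return torch.tensor(chunks, dtype=torch.long)
```

A 200-token document with the default stride yields three chunks: two full windows and a final, padded one.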
Dataset Statistics
- Chunk files generated: 200+
- Samples per file: 10,000
- Total training samples: 2,000,000+
- Sequence length: 128 tokens
Training
Masked Language Modeling (MLM) Objective
The model is trained to predict original tokens from a masked version of the input.
MLM Process (15% masking probability):
Token Selection: 15% of tokens are randomly selected for masking
Masking Strategy (applied to selected tokens):
- 80% replaced with the [MASK] token
- 10% replaced with a random token from the vocabulary
- 10% left unchanged
Loss Calculation:
- Only masked positions contribute to the loss
- Non-masked positions have labels set to -100 (ignored)
- Cross-entropy loss is computed on masked token predictions
Example:
Original: "the quick brown fox jumps over the lazy dog"
Tokens: [2, 101, 102, 103, 104, 105, 106, 107, 108, 109, 3, 0, ...]
Masked: [2, [MASK], 102, [MASK], 104, random_token, 106, 107, [MASK], 109, 3, 0, ...]
Labels: [-100, 101, -100, 103, -100, 105, -100, -100, 108, -100, -100, -100, ...]
↑ ↑ ↑ ↑
Predict the original tokens at the selected positions (indices 1, 3, 5, and 8)
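The masking procedure and label construction can be sketched as below. Several details are assumptions not confirmed by this README: the function name, passing the `[MASK]` ID in as `mask_id`, excluding PAD/BOS/EOS from selection (consistent with the example above, where they are never masked), and operating on CPU tensors.

```python
import torch

PAD_ID, BOS_ID, EOS_ID = 0, 2, 3


def mask_tokens(input_ids, mask_id, vocab_size, mlm_prob=0.15, generator=None):
    """Return (masked_ids, labels) following the 80/10/10 MLM recipe."""
    special = (input_ids == PAD_ID) | (input_ids == BOS_ID) | (input_ids == EOS_ID)

    # Select roughly 15% of the non-special positions.
    probs = torch.full(input_ids.shape, mlm_prob)
    probs.masked_fill_(special, 0.0)
    selected = torch.bernoulli(probs, generator=generator).bool()

    # Labels keep the original token at selected positions, -100 elsewhere.
    labels = input_ids.clone()
    labels[~selected] = -100

    # One uniform draw per position decides between the three cases.
    roll = torch.rand(input_ids.shape, generator=generator)
    use_mask = selected & (roll < 0.8)                    # 80%: [MASK]
    use_random = selected & (roll >= 0.8) & (roll < 0.9)  # 10%: random token
    # Remaining 10% of selected positions are left unchanged.

    masked = input_ids.clone()
    masked[use_mask] = mask_id
    # Note: random replacements may occasionally draw a special-token ID.
    random_ids = torch.randint(0, vocab_size, input_ids.shape, generator=generator)
    masked[use_random] = random_ids[use_random]
    return masked, labels
```

Because the mask is drawn on every call, each pass over the data sees a different masked version of the same chunk.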
Training Configuration
- Loss function: CrossEntropyLoss
- Optimizer: AdamW
- Batch size: 4
- Training set size: 200,000 samples (a subset used for testing)
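A single optimization step under this configuration might look like the sketch below. The helper names are hypothetical, the model is assumed to map `(batch, seq_len)` token IDs to `(batch, seq_len, vocab_size)` logits, and the learning rate is left to the caller because this README does not state one.

```python
import torch
import torch.nn as nn

# ignore_index=-100 drops every non-masked position from the loss.
criterion = nn.CrossEntropyLoss(ignore_index=-100)


def mlm_loss(logits, labels):
    """Cross-entropy over masked positions only."""
    return criterion(logits.reshape(-1, logits.size(-1)), labels.reshape(-1))


def train_step(model, optimizer, masked_ids, labels):
    """Run one forward/backward/update pass and return the scalar loss."""
    model.train()
    logits = model(masked_ids)
    loss = mlm_loss(logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Typical usage would construct `torch.optim.AdamW(model.parameters(), lr=...)` once and call `train_step` for each batch of 4 sequences.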
Project Structure
d:\BERT/
├── main.ipynb                    # Main training and evaluation notebook
├── requirements.txt              # Python dependencies
├── README.md                     # This file
│
├── corpus.txt                    # Combined text corpus
├── chunks.txt                    # Chunk metadata
├── english_tokenizer.model       # SentencePiece tokenizer model
├── english_tokenizer.vocab       # Tokenizer vocabulary file
│
├── chunks/                       # Preprocessed token chunks
│   ├── all_chunks.pt             # All chunks combined
│   ├── chunks_0.pt
│   ├── chunks_1.pt
│   └── ... (up to chunks_200+.pt)
│
├── models/                       # Trained model checkpoints
│   └── (saved model files)
│
├── model_checkpoints/            # Training checkpoints
│   └── (checkpoint files)
│
├── sentence_embeddings.html      # 3D visualization of sentence embeddings
├── word_embeddings.html          # 3D visualization of word embeddings
├── sentence_embeddings_2d.png    # 2D PCA visualization
│
└── bert_venv/                    # Python virtual environment
Usage
Dependencies
See requirements.txt for the full list. Key packages:
torch # Deep learning framework
transformers # For tokenizer utilities
datasets # For loading public datasets
sentencepiece # Tokenization
huggingface_hub # Model hub integration
matplotlib # Visualization
scikit-learn # PCA visualization
plotly # Interactive plots
pandas # Data manipulation
Installation
# Create virtual environment
python -m venv bert_venv
source bert_venv/Scripts/activate # On Windows (Git Bash); on Linux/macOS use bert_venv/bin/activate
# Install dependencies
pip install -r requirements.txt
Training
Run the notebook:
jupyter notebook main.ipynb
Key training steps:
- Load and authenticate with Hugging Face (requires HF_TOKEN in .env)
- Generate corpus from combined datasets
- Train SentencePiece tokenizer
- Create text chunks and save as tensors
- Initialize RoBERTa model
- Train with masked language modeling objective
- Save checkpoints and evaluate
Evaluation
The notebook includes:
- Mask accuracy: Percentage of correctly predicted masked tokens
- Embedding analysis:
- Word embeddings visualization (3D and 2D PCA)
- Sentence embeddings visualization
- Similarity matrices between sequences
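The similarity analysis can be sketched as follows. This README does not say how sentence embeddings are derived from token-level hidden states, so mean pooling over non-padding positions is an assumption here, and both function names are hypothetical.

```python
import torch
import torch.nn.functional as F


def mean_pool(hidden_states, input_ids, pad_id=0):
    """Average hidden states over non-padding positions (BOS/EOS included)."""
    mask = (input_ids != pad_id).unsqueeze(-1).to(hidden_states.dtype)
    summed = (hidden_states * mask).sum(dim=1)
    counts = mask.sum(dim=1).clamp(min=1.0)  # avoid division by zero
    return summed / counts                   # (batch_size, hidden_size)


def similarity_matrix(embeddings):
    """Pairwise cosine similarity between sequence embeddings."""
    normed = F.normalize(embeddings, dim=-1)
    return normed @ normed.T                 # (batch_size, batch_size)
```

The pooled vectors can also be fed to `sklearn.decomposition.PCA` to produce 2D or 3D coordinates for plotting.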
Key Features
- Custom transformer architecture - Built from scratch using PyTorch
- Efficient tokenization - SentencePiece BPE with 16K vocabulary
- Masked language modeling - Industry-standard pre-training objective
- Multi-source training data - Wikipedia, Alpaca, TinyStories
- Visualization tools - 3D embeddings and similarity analysis
- Checkpoint management - Save and resume training
Implementation Details
Special Design Choices
- Dynamic masking guarantee: If a sequence has no masked tokens, at least one token is randomly selected for masking to ensure effective learning
- Attention masking: Padding tokens are masked so that no position attends to them
- Residual connections: Applied in both attention and FFN blocks for training stability
- GELU activation: Used instead of ReLU for smoother gradients, as in BERT and RoBERTa
- Layer normalization placement: Applied after each sublayer (post-norm), as in the original BERT and RoBERTa
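The dynamic masking guarantee can be sketched as a small fix-up applied to the selection mask before the 80/10/10 replacement. The function name and the `eligible` argument (True at positions that may be masked, i.e. not PAD/BOS/EOS) are assumptions for illustration.

```python
import torch


def ensure_one_selected(selected, eligible, generator=None):
    """Make sure every row with at least one eligible token has one selected."""
    selected = selected.clone()
    rows_without_mask = torch.nonzero(~selected.any(dim=1)).flatten().tolist()
    for row in rows_without_mask:
        candidates = torch.nonzero(eligible[row]).flatten()
        if candidates.numel() == 0:
            continue  # row holds only special tokens; nothing to mask
        idx = torch.randint(candidates.numel(), (1,), generator=generator)
        selected[row, candidates[idx]] = True
    return selected
```

Without a guarantee like this, a batch in which every label is -100 would produce an undefined (NaN) mean cross-entropy, since there are no positions to average over.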
Performance Metrics
The model tracks:
- Training loss: Cross-entropy loss on masked tokens
- Mask accuracy: Correct predictions / total masked tokens
- Per-layer hidden states: For analysis and visualization
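Mask accuracy as defined above can be computed directly from the logits and labels. A minimal sketch, with a hypothetical function name, assuming labels use -100 at non-masked positions:

```python
import torch


def mask_accuracy(logits, labels, ignore_index=-100):
    """Fraction of masked positions where the top prediction is the original token."""
    masked = labels != ignore_index
    total = masked.sum().item()
    if total == 0:
        return 0.0  # no masked tokens in this batch
    preds = logits.argmax(dim=-1)
    correct = (preds[masked] == labels[masked]).sum().item()
    return correct / total
```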
References
- RoBERTa: RoBERTa: A Robustly Optimized BERT Pretraining Approach
- BERT: BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
- Transformer: Attention Is All You Need
License
MIT License - Feel free to use for research and educational purposes
Author
Custom implementation for learning and research purposes
Last Updated: 2026
Status: Active Development