ViGPT2 AIO Mixed One Step

VLAI-AIVN/vigpt2-aio-mixed-one-step is a Vietnamese GPT-2-style causal language model trained in a single mixed pretraining run over general Vietnamese text and a poem corpus.

This checkpoint is intended for Vietnamese text generation and research experiments around mixed-domain pretraining. It is not an instruction-tuned chat model.

Model Summary

  • Architecture: GPT2LMHeadModel
  • Layers: 12
  • Hidden size: 768
  • Attention heads: 12
  • Context length: 1024 tokens
  • Vocabulary size: 50,257
  • Parameter count: 124,439,808
  • Saved weights format: safetensors
  • Framework: Hugging Face Transformers
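
The parameter count can be cross-checked from the listed dimensions. The sketch below assumes the standard GPT-2 layout: learned position embeddings, biases on every linear layer, a 4x MLP expansion, and an output head tied to the token embedding. None of these details is stated in this card, so treat the helper (a hypothetical name) as a sanity check rather than a description of the saved config.

```python
def gpt2_param_count(vocab_size: int, n_positions: int, n_embd: int, n_layer: int) -> int:
    """Count parameters for a GPT-2-style decoder with a tied LM head (assumed layout)."""
    d = n_embd
    # Token and position embedding tables.
    embeddings = vocab_size * d + n_positions * d
    # Attention: fused QKV projection plus output projection, each with a bias.
    attention = (d * 3 * d + 3 * d) + (d * d + d)
    # MLP: up-projection to 4*d and down-projection back to d, each with a bias.
    mlp = (d * 4 * d + 4 * d) + (4 * d * d + d)
    # Two LayerNorms per block, each with a weight and a bias vector.
    layer_norms = 2 * (2 * d)
    per_layer = attention + mlp + layer_norms
    # Final LayerNorm after the last block.
    final_ln = 2 * d
    return embeddings + n_layer * per_layer + final_ln


total = gpt2_param_count(vocab_size=50257, n_positions=1024, n_embd=768, n_layer=12)
print(f"{total:,}")  # 124,439,808
```

Under these assumptions the result matches the parameter count listed above.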

Training Data

The model was trained on a mixed Vietnamese corpus built from:

  • Deduplicated BKAI training data
  • Deduplicated Vietnamese Wikipedia articles
  • A Vietnamese poem stanza corpus

Text is normalized, tokenized, concatenated, and packed into fixed 1024-token blocks. An end-of-text token is appended between samples before packing.
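
The packing step can be illustrated with a minimal sketch. The function name, the choice to append the end-of-text token after every sample (including the last), and the choice to drop a trailing partial block are assumptions for illustration; the project's actual preprocessing code may differ.

```python
from typing import Iterable, List


def pack_blocks(samples: Iterable[List[int]], eos_id: int, block_size: int = 1024) -> List[List[int]]:
    """Concatenate tokenized samples with an end-of-text separator and cut fixed-size blocks."""
    stream: List[int] = []
    for ids in samples:
        stream.extend(ids)
        stream.append(eos_id)  # separator so samples stay distinguishable after packing
    # Assumption: a trailing remainder shorter than block_size is discarded.
    usable = (len(stream) // block_size) * block_size
    return [stream[i:i + block_size] for i in range(0, usable, block_size)]


# Toy example with block_size=4 and 0 standing in for the end-of-text id.
blocks = pack_blocks([[1, 2, 3], [4, 5], [6, 7, 8, 9]], eos_id=0, block_size=4)
print(blocks)  # [[1, 2, 3, 0], [4, 5, 0, 6], [7, 8, 9, 0]]
```

Note that a block can span sample boundaries, which is why the separator token matters.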

Training Procedure

This model was trained with the mixed pretraining recipe implemented in src/train_mixed.py.

Important detail: the training script loads the tokenizer and config from the project checkpoint at artifacts/checkpoints/sft_poem/final, but creates the model with AutoModelForCausalLM.from_config(config). In other words, this run reuses the saved architecture and tokenizer configuration but starts from freshly initialized weights; it does not load the pretrained weights stored in that checkpoint.

Saved training arguments from this checkpoint:

Setting                        Value
max_steps                      18920
per_device_train_batch_size    2
per_device_eval_batch_size     2
gradient_accumulation_steps    64
learning_rate                  5e-4
weight_decay                   0.01
warmup_ratio                   0.1
lr_scheduler_type              cosine
bf16                           true
fp16                           false
eval_steps                     2000
save_steps                     2000
logging_steps                  100
seed                           42
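
To make the warmup_ratio and lr_scheduler_type settings concrete, here is a rough sketch of a linear-warmup, cosine-decay schedule using the values above. It assumes the warmup length is the ratio times max_steps and that the cosine decays to zero at the final step; the scheduler actually used by the training script is not shown in this card, so the function below is illustrative only.

```python
import math

MAX_STEPS = 18920
PEAK_LR = 5e-4
WARMUP_RATIO = 0.1
# Assumption: warmup length is the ratio applied to the total step count.
WARMUP_STEPS = round(WARMUP_RATIO * MAX_STEPS)


def lr_at(step: int) -> float:
    """Learning rate at a given optimizer step (linear warmup, then cosine decay to 0)."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / max(1, WARMUP_STEPS)
    progress = (step - WARMUP_STEPS) / max(1, MAX_STEPS - WARMUP_STEPS)
    return PEAK_LR * 0.5 * (1.0 + math.cos(math.pi * progress))


for step in (0, WARMUP_STEPS, MAX_STEPS // 2, MAX_STEPS):
    print(step, f"{lr_at(step):.2e}")
```

Under these assumptions the rate peaks at 5e-4 after 1,892 steps and then falls smoothly toward zero.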

The launch script in this project runs mixed training with torchrun --nproc_per_node=2.
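
Combining the saved arguments with the two-process launch gives an estimate of the effective batch size and the number of tokens processed. This is back-of-the-envelope arithmetic that assumes both processes use the per-device batch size listed above and that every packed block holds a full 1024 tokens; it is not a figure reported by the training logs.

```python
per_device_batch = 2
grad_accum_steps = 64
num_processes = 2      # from torchrun --nproc_per_node=2
block_size = 1024      # tokens per packed block
max_steps = 18920

# Sequences contributing to one optimizer step across all processes.
sequences_per_step = per_device_batch * grad_accum_steps * num_processes
tokens_per_step = sequences_per_step * block_size
total_tokens = tokens_per_step * max_steps

print(sequences_per_step)     # 256
print(f"{tokens_per_step:,}")  # 262,144
print(f"{total_tokens:,}")     # 4,959,764,480
```

Under these assumptions the run processes roughly 4.96 billion tokens, counting any repeated passes over the corpus.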

Training Metrics

  • Final saved training step: 18920
  • Final eval loss: 2.5013
  • Approximate perplexity: 12.20
  • Final reported train loss: 2.8971

These numbers should be treated as run-level reference metrics, not as a full benchmark.
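
The perplexity figure follows from the eval loss, assuming the loss is the mean per-token cross-entropy in nats (the usual convention for causal language model training in Transformers). A quick check:

```python
import math

eval_loss = 2.5013  # final eval loss reported above
perplexity = math.exp(eval_loss)

print(f"{perplexity:.2f}")  # 12.20
```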

Usage

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "VLAI-AIVN/vigpt2-aio-mixed-one-step"

tokenizer = AutoTokenizer.from_pretrained(model_id)
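# torch_dtype="auto" loads the weights in the dtype recorded with the checkpoint.
# device_map="auto" requires the accelerate package to be installed.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",
    device_map="auto",
)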

prompt = "Hà Nội là thủ đô của Việt Nam. "
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=120,
        do_sample=True,
        temperature=0.8,
        top_p=0.95,
        pad_token_id=tokenizer.eos_token_id,
    )

print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Intended Uses

  • Vietnamese language modeling experiments
  • Vietnamese text generation baselines
  • Research on mixed-domain pretraining
  • Further fine-tuning for downstream Vietnamese generation tasks

Out-of-Scope Uses

  • Safety-critical decision making
  • Factual question answering without verification
  • Use as a chat assistant without additional instruction tuning
  • Deployment in production without task-specific evaluation and filtering

Limitations

  • The model can generate incorrect, biased, repetitive, or low-quality text.
  • The training mixture includes general web-like and Wikipedia-style text as well as poem data, so style may drift depending on the prompt.
  • This is a base generative model, not an aligned assistant model.
  • The local project snapshot used to produce this checkpoint does not declare a license. Confirm licensing before broad redistribution or commercial use.

Repository Context

This checkpoint comes from the Vietnamese GPT-2 pretraining project in this repository, which includes:

  • Tokenizer training
  • Deduplication and corpus preparation
  • Base pretraining
  • Poem-domain continued pretraining
  • Mixed one-step pretraining

Citation

If you use this model, cite the repository or link back to the Hugging Face model page.
