Residual K-Means Tokenizer

A residual K-means model for vector quantization. It encodes continuous embeddings into discrete codes by applying K-means layer by layer: each layer quantizes the residual left over by the previous layers.
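The encoding step described above can be sketched as follows. This is an illustrative toy, not the repo's actual API; the codebooks here are random, whereas the real ones are learned by K-means, and all names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(42)
n_layers, codebook_size, dim = 3, 8, 4

# One codebook per residual layer (random stand-ins for trained centroids).
codebooks = [rng.normal(size=(codebook_size, dim)) for _ in range(n_layers)]

def encode(x, codebooks):
    """Return one discrete code per layer; each layer quantizes the
    residual the previous layers failed to explain."""
    residual = x.copy()
    codes = []
    for cb in codebooks:
        dists = np.linalg.norm(cb - residual, axis=1)  # (codebook_size,)
        idx = int(np.argmin(dists))                    # nearest centroid
        codes.append(idx)
        residual = residual - cb[idx]                  # pass on the remainder
    return codes

x = rng.normal(size=dim)
codes = encode(x, codebooks)  # one code index per layer, e.g. three ints
```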

Files

  • res_kmeans.py - Model definition
  • train_res_kmeans.py - Training script
  • infer_res_kmeans.py - Inference script

Installation

pip install torch numpy pandas pyarrow faiss-cpu tqdm

Note: FAISS is published on PyPI as faiss-cpu (use faiss-gpu for a CUDA build).

Usage

Training

python train_res_kmeans.py \
    --data_path ./data/embeddings.parquet \
    --model_path ./checkpoints \
    --n_layers 3 \
    --codebook_size 8192 \
    --dim 4096

Arguments:

  • --data_path: Path to parquet file(s) with embedding column
  • --model_path: Directory to save the model
  • --n_layers: Number of residual layers (default: 3)
  • --codebook_size: Size of each codebook (default: 8192)
  • --dim: Embedding dimension (default: 4096)
  • --seed: Random seed (default: 42)

Inference

python infer_res_kmeans.py \
    --model_path ./checkpoints/model.pt \
    --emb_path ./data/embeddings.parquet \
    --output_path ./output/codes.parquet

Arguments:

  • --model_path: Path to trained model checkpoint
  • --emb_path: Path to parquet file with pid and embedding columns
  • --output_path: Output path (default: {emb_path}_codes.parquet)
  • --batch_size: Inference batch size (default: 10000)
  • --device: Device to use (default: cuda if available)
  • --n_layers: Number of layers to use (default: all)
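Inference is the encoding pass applied to whole batches at once. A vectorized sketch of that per-batch step, with hypothetical toy shapes (the real script streams `--batch_size` rows from the parquet file and uses trained codebooks):

```python
import numpy as np

rng = np.random.default_rng(0)
n, dim, k, n_layers = 100, 8, 16, 3
codebooks = [rng.normal(size=(k, dim)) for _ in range(n_layers)]
emb = rng.normal(size=(n, dim))               # one batch of embeddings

residual = emb.copy()
codes = np.empty((n, n_layers), dtype=np.int64)
for layer, cb in enumerate(codebooks):
    # Pairwise batch-vs-codebook distances, nearest centroid per row.
    d = np.linalg.norm(residual[:, None, :] - cb[None, :, :], axis=2)
    codes[:, layer] = d.argmin(axis=1)
    residual -= cb[codes[:, layer]]
# codes now holds one code per layer per embedding: shape (100, 3)
```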

Input format: Parquet with columns pid, embedding

Output format: Parquet with columns pid, codes
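A row's codes can be turned back into an approximate embedding by summing the selected centroid from each layer, which is the standard residual-quantization decode. A minimal sketch with hypothetical random codebooks (the repo does not necessarily ship a decode helper):

```python
import numpy as np

rng = np.random.default_rng(1)
n_layers, k, dim = 3, 16, 8
codebooks = [rng.normal(size=(k, dim)) for _ in range(n_layers)]

def decode(codes, codebooks):
    """Approximate reconstruction: sum one centroid per layer."""
    return sum(cb[c] for cb, c in zip(codebooks, codes))

x_hat = decode([3, 0, 7], codebooks)          # shape (dim,)
```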
