Residual K-Means Tokenizer
A residual K-means model for vector quantization. It encodes continuous embeddings into discrete codes through hierarchical clustering: each layer runs K-means on the residual error left by the previous layer, so every vector is represented by one code per layer.
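For illustration, here is a minimal sketch of how residual K-means encoding works, assuming the trained codebooks are available as a list of (codebook_size, dim) NumPy arrays. The function and variable names are hypothetical, not the repo's API:

import numpy as np

def encode(embedding, codebooks):
    # Quantize one embedding into one code index per layer.
    residual = np.asarray(embedding, dtype=np.float32).copy()
    codes = []
    for codebook in codebooks:
        # Pick the centroid closest to the current residual ...
        dists = np.linalg.norm(codebook - residual, axis=1)
        idx = int(np.argmin(dists))
        codes.append(idx)
        # ... and carry the remaining quantization error to the next layer.
        residual = residual - codebook[idx]
    return codes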
Files
res_kmeans.py - Model definition
train_res_kmeans.py - Training script
infer_res_kmeans.py - Inference script
Installation
pip install torch numpy pandas pyarrow faiss-cpu tqdm
Usage
Training
python train_res_kmeans.py \
--data_path ./data/embeddings.parquet \
--model_path ./checkpoints \
--n_layers 3 \
--codebook_size 8192 \
--dim 4096
Arguments:
--data_path: Path to parquet file(s) with an embedding column
--model_path: Directory to save the model
--n_layers: Number of residual layers (default: 3)
--codebook_size: Size of each codebook (default: 8192)
--dim: Embedding dimension (default: 4096)
--seed: Random seed (default: 42)
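The sketch below shows what residual K-means training amounts to: fit one FAISS K-means codebook per layer, then cluster the residuals left by that layer. It is a simplified approximation under assumed defaults; the actual train_res_kmeans.py may differ in iteration count, GPU usage, and checkpoint format:

import faiss
import numpy as np
import pandas as pd

def train_residual_kmeans(embeddings, n_layers=3, codebook_size=8192, seed=42):
    residual = np.ascontiguousarray(embeddings, dtype=np.float32)
    codebooks = []
    for _ in range(n_layers):
        km = faiss.Kmeans(d=residual.shape[1], k=codebook_size, niter=20, seed=seed)
        km.train(residual)
        centroids = km.centroids.reshape(codebook_size, -1)
        # Assign each vector to its nearest centroid and subtract it,
        # so the next layer clusters the remaining error.
        _, assign = km.index.search(residual, 1)
        residual = residual - centroids[assign[:, 0]]
        codebooks.append(centroids)
    return codebooks

# Example: load the embedding column from the training parquet.
df = pd.read_parquet("./data/embeddings.parquet")
embs = np.stack(df["embedding"].to_numpy())
codebooks = train_residual_kmeans(embs)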
Inference
python infer_res_kmeans.py \
--model_path ./checkpoints/model.pt \
--emb_path ./data/embeddings.parquet \
--output_path ./output/codes.parquet
Arguments:
--model_path: Path to trained model checkpoint
--emb_path: Path to parquet file with pid and embedding columns
--output_path: Output path (default: {emb_path}_codes.parquet)
--batch_size: Inference batch size (default: 10000)
--device: Device to use (default: cuda if available)
--n_layers: Number of layers to use (default: all)
Input format: Parquet with columns pid, embedding
Output format: Parquet with columns pid, codes
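A quick way to sanity-check the inference output, assuming the documented column layout:

import pandas as pd

codes_df = pd.read_parquet("./output/codes.parquet")
print(codes_df.columns.tolist())   # expected: ['pid', 'codes']
print(codes_df.head())
# Each `codes` entry should hold one code per residual layer,
# each index in the range [0, codebook_size).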