Instructions for using ai21labs/Jamba-v0.1 with libraries, inference providers, notebooks, and local apps. Follow the links below to get started.
- Libraries
- Transformers
How to use ai21labs/Jamba-v0.1 with Transformers:
```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="ai21labs/Jamba-v0.1", trust_remote_code=True)

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("ai21labs/Jamba-v0.1", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("ai21labs/Jamba-v0.1", trust_remote_code=True)
```
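Once the tokenizer and model from the snippet above are loaded, a minimal generation call might look like this sketch (the prompt and sampling settings are just illustrative, not part of the official example):

```python
# Minimal sketch, reusing `tokenizer` and `model` from the snippet above; settings are illustrative
inputs = tokenizer("Once upon a time,", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=100, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```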
- Notebooks
- Google Colab
- Kaggle
- Local Apps
- vLLM
How to use ai21labs/Jamba-v0.1 with vLLM:
Install from pip and serve model
```sh
# Install vLLM from pip:
pip install vllm

# Start the vLLM server:
vllm serve "ai21labs/Jamba-v0.1"

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "ai21labs/Jamba-v0.1",
    "prompt": "Once upon a time,",
    "max_tokens": 512,
    "temperature": 0.5
  }'
```
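Because the vLLM server exposes an OpenAI-compatible API, it can also be queried from Python with the official `openai` client instead of curl. A minimal sketch against the server started above (the `api_key` value is a placeholder, since vLLM does not require one by default):

```python
# Minimal sketch: query the locally running vLLM server via its OpenAI-compatible API
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # placeholder key
completion = client.completions.create(
    model="ai21labs/Jamba-v0.1",
    prompt="Once upon a time,",
    max_tokens=512,
    temperature=0.5,
)
print(completion.choices[0].text)
```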
Use Docker
```sh
docker model run hf.co/ai21labs/Jamba-v0.1
```
- SGLang
How to use ai21labs/Jamba-v0.1 with SGLang:
Install from pip and serve model
```sh
# Install SGLang from pip:
pip install sglang

# Start the SGLang server:
python3 -m sglang.launch_server \
  --model-path "ai21labs/Jamba-v0.1" \
  --host 0.0.0.0 \
  --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "ai21labs/Jamba-v0.1",
    "prompt": "Once upon a time,",
    "max_tokens": 512,
    "temperature": 0.5
  }'
```
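The same OpenAI-compatible endpoint can be called from Python with the `requests` library; a small sketch mirroring the curl call above (port 30000 as configured when the server was launched):

```python
# Minimal sketch: call the SGLang server's OpenAI-compatible completions endpoint
import requests

response = requests.post(
    "http://localhost:30000/v1/completions",
    json={
        "model": "ai21labs/Jamba-v0.1",
        "prompt": "Once upon a time,",
        "max_tokens": 512,
        "temperature": 0.5,
    },
)
print(response.json()["choices"][0]["text"])
```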
Use Docker images
```sh
docker run --gpus all \
  --shm-size 32g \
  -p 30000:30000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HF_TOKEN=<secret>" \
  --ipc=host \
  lmsysorg/sglang:latest \
  python3 -m sglang.launch_server \
    --model-path "ai21labs/Jamba-v0.1" \
    --host 0.0.0.0 \
    --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "ai21labs/Jamba-v0.1",
    "prompt": "Once upon a time,",
    "max_tokens": 512,
    "temperature": 0.5
  }'
```
- Docker Model Runner
How to use ai21labs/Jamba-v0.1 with Docker Model Runner:
```sh
docker model run hf.co/ai21labs/Jamba-v0.1
```
Smaller version to ease implementation experiments?
Hi. I've worked on implementing Mamba support in llama.cpp before (see https://github.com/ggerganov/llama.cpp/pull/5328), and I'd like to eventually implement support for Jamba too.
However, for my hardware, this model is too big for quick experimentation, so I'd really appreciate it if you'd also release a smaller model with the same architecture. It doesn't need to be good (though some coherency is preferred). Ideally a Jamba model with less than 1B parameters would help a lot with this, if possible.
I second this. Loading the weights takes a really long time. Some light version (with pruning?), even if the end result is not effective at all, would be great for quick testing iteration.
I third this
I trained a Jamba architecture model with some code data. It's very small and has some basic code generation capabilities. Might be useful for this.
https://huggingface.co/TechxGenus/Mini-Jamba
Nice! Unfortunately, there seems to be no Mamba+MoE layer(s) in your model. I only see Mamba+MLP layers alternated with Attention+MoE layers. The attn_layer_offset and attn_layer_period keys in config.json differ from those in the official Jamba-v0.1 model, and might have caused this at training time, I guess?
Ah, this is because I set expert_layer_offset and expert_layer_period to be the same as attn_layer_offset and attn_layer_period. When making this version, I wanted to first test the results of using MoE only in the attention layers.
I will make a new version later that includes Mamba+MoE, Mamba+MLP, Attention+MoE, and Attention+MLP layers at the same time.
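For reference, the interleaving rule these config keys control (as I understand it from the Jamba modeling code) looks roughly like the sketch below; the example values mirror my reading of the official Jamba-v0.1 config.json, so treat them as illustrative rather than authoritative:

```python
# Sketch of how the config keys select layer types (my reading of the Jamba modeling code):
# a layer is Attention when i % attn_layer_period == attn_layer_offset (otherwise Mamba),
# and MoE when i % expert_layer_period == expert_layer_offset (otherwise a plain MLP).
def layer_types(num_hidden_layers, attn_layer_period, attn_layer_offset,
                expert_layer_period, expert_layer_offset):
    types = []
    for i in range(num_hidden_layers):
        mixer = "Attention" if i % attn_layer_period == attn_layer_offset else "Mamba"
        ffn = "MoE" if i % expert_layer_period == expert_layer_offset else "MLP"
        types.append(f"{mixer}+{ffn}")
    return types

# Values as in the official Jamba-v0.1 config (to the best of my understanding):
print(layer_types(32, attn_layer_period=8, attn_layer_offset=4,
                  expert_layer_period=2, expert_layer_offset=1))
```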
Hi, we uploaded this version for debugging and development purposes (random weights, no training whatsoever)
https://huggingface.co/ai21labs/Jamba-tiny-random