Instructions to use zerofata/Q3.5-BlueStar-v2-27B-GGUF with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- llama-cpp-python
How to use zerofata/Q3.5-BlueStar-v2-27B-GGUF with llama-cpp-python:
# !pip install llama-cpp-python from llama_cpp import Llama llm = Llama.from_pretrained( repo_id="zerofata/Q3.5-BlueStar-v2-27B-GGUF", filename="Q3.5-BlueStar-v2-IQ3_M.gguf", )
llm.create_chat_completion( messages = "No input example has been defined for this model task." )
- Notebooks
- Google Colab
- Kaggle
- Local Apps
- llama.cpp
How to use zerofata/Q3.5-BlueStar-v2-27B-GGUF with llama.cpp:
Install from brew
brew install llama.cpp # Start a local OpenAI-compatible server with a web UI: llama-server -hf zerofata/Q3.5-BlueStar-v2-27B-GGUF:Q4_K_M # Run inference directly in the terminal: llama-cli -hf zerofata/Q3.5-BlueStar-v2-27B-GGUF:Q4_K_M
Install from WinGet (Windows)
winget install llama.cpp # Start a local OpenAI-compatible server with a web UI: llama-server -hf zerofata/Q3.5-BlueStar-v2-27B-GGUF:Q4_K_M # Run inference directly in the terminal: llama-cli -hf zerofata/Q3.5-BlueStar-v2-27B-GGUF:Q4_K_M
Use pre-built binary
# Download pre-built binary from: # https://github.com/ggerganov/llama.cpp/releases # Start a local OpenAI-compatible server with a web UI: ./llama-server -hf zerofata/Q3.5-BlueStar-v2-27B-GGUF:Q4_K_M # Run inference directly in the terminal: ./llama-cli -hf zerofata/Q3.5-BlueStar-v2-27B-GGUF:Q4_K_M
Build from source code
git clone https://github.com/ggerganov/llama.cpp.git cd llama.cpp cmake -B build cmake --build build -j --target llama-server llama-cli # Start a local OpenAI-compatible server with a web UI: ./build/bin/llama-server -hf zerofata/Q3.5-BlueStar-v2-27B-GGUF:Q4_K_M # Run inference directly in the terminal: ./build/bin/llama-cli -hf zerofata/Q3.5-BlueStar-v2-27B-GGUF:Q4_K_M
Use Docker
docker model run hf.co/zerofata/Q3.5-BlueStar-v2-27B-GGUF:Q4_K_M
- LM Studio
- Jan
- Ollama
How to use zerofata/Q3.5-BlueStar-v2-27B-GGUF with Ollama:
ollama run hf.co/zerofata/Q3.5-BlueStar-v2-27B-GGUF:Q4_K_M
- Unsloth Studio new
How to use zerofata/Q3.5-BlueStar-v2-27B-GGUF with Unsloth Studio:
Install Unsloth Studio (macOS, Linux, WSL)
curl -fsSL https://unsloth.ai/install.sh | sh # Run unsloth studio unsloth studio -H 0.0.0.0 -p 8888 # Then open http://localhost:8888 in your browser # Search for zerofata/Q3.5-BlueStar-v2-27B-GGUF to start chatting
Install Unsloth Studio (Windows)
irm https://unsloth.ai/install.ps1 | iex # Run unsloth studio unsloth studio -H 0.0.0.0 -p 8888 # Then open http://localhost:8888 in your browser # Search for zerofata/Q3.5-BlueStar-v2-27B-GGUF to start chatting
Using HuggingFace Spaces for Unsloth
# No setup required # Open https://huggingface.co/spaces/unsloth/studio in your browser # Search for zerofata/Q3.5-BlueStar-v2-27B-GGUF to start chatting
- Pi new
How to use zerofata/Q3.5-BlueStar-v2-27B-GGUF with Pi:
Start the llama.cpp server
# Install llama.cpp: brew install llama.cpp # Start a local OpenAI-compatible server: llama-server -hf zerofata/Q3.5-BlueStar-v2-27B-GGUF:Q4_K_M
Configure the model in Pi
# Install Pi: npm install -g @mariozechner/pi-coding-agent # Add to ~/.pi/agent/models.json: { "providers": { "llama-cpp": { "baseUrl": "http://localhost:8080/v1", "api": "openai-completions", "apiKey": "none", "models": [ { "id": "zerofata/Q3.5-BlueStar-v2-27B-GGUF:Q4_K_M" } ] } } }Run Pi
# Start Pi in your project directory: pi
- Hermes Agent new
How to use zerofata/Q3.5-BlueStar-v2-27B-GGUF with Hermes Agent:
Start the llama.cpp server
# Install llama.cpp: brew install llama.cpp # Start a local OpenAI-compatible server: llama-server -hf zerofata/Q3.5-BlueStar-v2-27B-GGUF:Q4_K_M
Configure Hermes
# Install Hermes: curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash hermes setup # Point Hermes at the local server: hermes config set model.provider custom hermes config set model.base_url http://127.0.0.1:8080/v1 hermes config set model.default zerofata/Q3.5-BlueStar-v2-27B-GGUF:Q4_K_M
Run Hermes
hermes
- Docker Model Runner
How to use zerofata/Q3.5-BlueStar-v2-27B-GGUF with Docker Model Runner:
docker model run hf.co/zerofata/Q3.5-BlueStar-v2-27B-GGUF:Q4_K_M
- Lemonade
How to use zerofata/Q3.5-BlueStar-v2-27B-GGUF with Lemonade:
Pull the model
# Download Lemonade from https://lemonade-server.ai/ lemonade pull zerofata/Q3.5-BlueStar-v2-27B-GGUF:Q4_K_M
Run and chat with the model
lemonade run user.Q3.5-BlueStar-v2-27B-GGUF-Q4_K_M
List all available models
lemonade list
BlueStar v2
Qwen3.5 27BDesigned for RP and writing tasks.
Feels like a good improvement on v1. This version aims to fix the rep and improve the intelligence while keeping the creativity.
Non thinking and thinking are both supported. If you want to use thinking, it is required to prefill the <think>\n as that is how it was trained.
Creation Process: SFT
SFT on approx 27 million tokens.
I've confirmed the repetition coming from the RP datasets. Despite the extensive filtering, human editing, rewriting and deduping. Compared to other types of data like chat and writing, RP is just somewhat repetitive in nature. One idea to fix this is to just not use the RP datasets, or use less of them. This does seem to *sort of* work, but the model performs noticably worse at RP as a result. Which makes sense, given that's the entire idea of having RP data to begin with.
The current solution I'm testing is using custom loss masking with the RP datasets. Most common phrases of slop are masked out, so the model doesn't get rewarded for learning these patterns. Overused words within a conversation also get masked out in later turns.
It... seems to have worked? Repetition from my testing is greatly reduced after a few hours of using the model. It can still latch onto phrases, but I've seen much less verbatim repetition.
Trained using Axolotl.
Axolotl Config
base_model: Qwen/Qwen3.5-27B
plugins:
- axolotl.integrations.cut_cross_entropy.CutCrossEntropyPlugin
strict: false
datasets:
- path: ./data/bluestar_v2_sft_3_all_rp_attempt_masked_20260318_075236.jsonl
val_set_size: 0.02
output_dir: ./Qwen3.5-27B-v2-SFT-5
sequence_len: 10756
sample_packing: true
load_in_8bit: true
adapter: lora
lora_r: 128
lora_alpha: 128
peft_use_rslora: true
lora_target_modules:
- q_proj
- k_proj
- v_proj
- o_proj
- down_proj
- up_proj
# Uncomment below to also target the linear attention projections.
# These use separate in_proj_qkv / in_proj_z / out_proj (Qwen3.5-specific).
- linear_attn.in_proj_qkv
- linear_attn.in_proj_z
- linear_attn.out_proj
wandb_project: Qwen3.5-27B-SFT
wandb_name: Qwen3.5-27B-v2-SFT-5
gradient_accumulation_steps: 4
micro_batch_size: 1
num_epochs: 2
optimizer: adamw_torch_8bit
lr_scheduler: cosine
learning_rate: 1.2e-5
weight_decay: 0.01
warmup_ratio: 0.05
bf16: auto
tf32: true
resume_from_checkpoint:
logging_steps: 1
flash_attention: true
evals_per_epoch: 4
saves_per_epoch: 4
special_tokens:
fsdp_config:
fsdp_version: 2
offload_params: false
cpu_ram_efficient_loading: false
auto_wrap_policy: TRANSFORMER_BASED_WRAP
transformer_layer_cls_to_wrap: Qwen3_5DecoderLayer
state_dict_type: FULL_STATE_DICT
sharding_strategy: FULL_SHARD
reshard_after_forward: true
activation_checkpointing: true
- Downloads last month
- 652
3-bit
4-bit
5-bit
6-bit
8-bit
16-bit