AntAngelMed-eagle3
Model Overview
AntAngelMed-eagle3 is a high-performance draft model designed for inference acceleration. It uses EAGLE3 speculative sampling to balance inference speed with model stability.
The model is trained on high-quality medical datasets and significantly boosts inference throughput while maintaining high accuracy, making it well suited to high-load production environments.
Key Features
- Speculative Sampling Optimization: Based on EAGLE3 technology, achieving a high verification pass rate with a speculative length of 4
- Outstanding Throughput: The FP8 quantization + EAGLE3 configuration improves throughput by up to ~90%
- Production-Grade Optimization: 3267 tokens/s output throughput on a single NVIDIA H200
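To make the draft/verify mechanism behind these numbers concrete, here is a minimal, illustrative sketch of one speculative decoding step. The "models" are toy next-token functions (not the real networks), and the acceptance rule shown is the simple greedy variant (accept the longest prefix where draft and target agree); EAGLE3's actual sampling-based verification is more involved.

```python
def draft_model(prefix):
    # Toy draft model: predicts last token + 1 (stand-in only).
    return prefix[-1] + 1

def target_model(prefix):
    # Toy target model: agrees with the draft except at every 4th position.
    nxt = prefix[-1] + 1
    return nxt if len(prefix) % 4 != 0 else nxt + 10

def speculative_step(prefix, k=4):
    """Draft k tokens, verify with the target, return the accepted tokens.

    The target always contributes one token (a correction or a bonus),
    so each step accepts between 1 and k+1 tokens -- this is why the
    average acceptance length reported below sits between 1 and k.
    """
    # 1) Draft proposes k tokens autoregressively (cheap, 1-layer model).
    draft_ctx = list(prefix)
    proposed = []
    for _ in range(k):
        t = draft_model(draft_ctx)
        proposed.append(t)
        draft_ctx.append(t)
    # 2) Target verifies all k positions (in practice: one batched forward).
    accepted = []
    ctx = list(prefix)
    for t in proposed:
        expected = target_model(ctx)
        if expected == t:
            accepted.append(t)
            ctx.append(t)
        else:
            accepted.append(expected)  # target's token replaces the mismatch
            return accepted
    # All k drafts accepted; the target appends one bonus token.
    accepted.append(target_model(ctx))
    return accepted

tokens = speculative_step([1, 2, 3], k=4)
print(tokens)
```

Because every accepted token is one the target model would have produced anyway, speculation changes latency but not output quality.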
Performance
Speculative Sampling Efficiency
Average Acceptance Length with speculative length of 4:
| Benchmark | Average Acceptance Length |
|---|---|
| HumanEval | 2.816 |
| GSM8K | 3.24 |
| Math-500 | 3.326 |
| Med_MCPA | 2.600 |
| Health_Bench | 2.446 |
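A rough way to relate acceptance length to end-to-end gain is the following back-of-the-envelope estimate. The draft-to-target cost ratio here is an assumed placeholder, and real throughput also depends on batching, KV-cache traffic, and kernel overheads, so this is illustrative only.

```python
def estimated_speedup(accept_len, k=4, draft_cost=0.05):
    """Rough speedup of speculative decoding over plain autoregression.

    Per verification cycle, the target runs one forward pass and the
    1-layer draft runs k cheap passes (draft_cost = assumed per-pass
    cost relative to one target pass). Each cycle yields accept_len
    tokens on average, versus 1 token per target pass without speculation.
    """
    cost_per_cycle = 1.0 + k * draft_cost
    return accept_len / cost_per_cycle

# Acceptance lengths from the table above.
for name, a in [("HumanEval", 2.816), ("GSM8K", 3.24), ("Math-500", 3.326)]:
    print(f"{name}: ~{estimated_speedup(a):.2f}x")
```

Under these assumptions, a higher acceptance length translates almost linearly into higher throughput, which matches the ordering of the measured improvements below.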
Throughput Improvement
Throughput improvement of FP8 quantization + EAGLE3 over FP8-only, measured at a concurrency of 16:
| Benchmark | Throughput Improvement |
|---|---|
| HumanEval | +67.3% |
| GSM8K | +58.6% |
| Math-500 | +89.8% |
| Med_MCPA | +46% |
| Health_Bench | +45.3% |
Peak Inference Performance
- Hardware Environment: NVIDIA H200 single GPU
Figure: Throughput performance comparison and accuracy metrics under equal compute on 1xH200
Technical Specifications
- Model Architecture: LlamaForCausalLMEagle3
- Number of Layers: 1 (draft model)
- Hidden Size: 4096
- Attention Heads: 32 (KV heads: 8)
- Intermediate Size: 14336
- Vocabulary Size: 157,184
- Max Position Embeddings: 32,768
- Data Type: bfloat16
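A few useful quantities are implied by the specification table but not stated directly; this short sanity check derives them (the per-head dimension and GQA group size follow from the listed values).

```python
# Values from the specification table above.
hidden_size = 4096
num_heads = 32
num_kv_heads = 8

# Derived quantities (not stated explicitly in the table).
head_dim = hidden_size // num_heads        # per-head dimension
gqa_group = num_heads // num_kv_heads      # query heads sharing one KV head
kv_per_token = 2 * num_kv_heads * head_dim # K+V values cached per token, per layer

print(head_dim, gqa_group, kv_per_token)
```

With only one layer and 8 KV heads, the draft model's KV cache is tiny relative to the target's, which is what keeps the speculative passes cheap.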
Quick Start
Requirements
- NVIDIA H200-class GPU (or comparable compute)
- CUDA 12.0+
- PyTorch 2.0+
Installation
pip install sglang==0.5.6
Then apply the changes from PR https://github.com/sgl-project/sglang/pull/15119.
Inference with SGLang
python3 -m sglang.launch_server \
--model-path MedAIBase/AntAngelMed-FP8 \
--host 0.0.0.0 --port 30012 \
--trust-remote-code \
--attention-backend fa3 \
--mem-fraction-static 0.9 \
--tp-size 1 \
--speculative-algorithm EAGLE3 \
--speculative-draft-model-path MedAIBase/AntAngelMed-eagle3 \
--speculative-num-steps 3 \
--speculative-eagle-topk 1 \
--speculative-num-draft-tokens 4
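Once the server is up, speculative decoding is transparent to clients: requests go to SGLang's OpenAI-compatible completions endpoint with no extra fields. A minimal client sketch (the prompt and sampling parameters are placeholders; the host/port match the launch command above):

```python
import json
import urllib.request

def build_request(prompt, max_tokens=256, temperature=0.0):
    # Payload for POST /v1/completions on the server launched above.
    return {
        "model": "MedAIBase/AntAngelMed-FP8",
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }

def complete(prompt, host="http://127.0.0.1:30012"):
    # Send the request to the running SGLang server and return the text.
    req = urllib.request.Request(
        f"{host}/v1/completions",
        data=json.dumps(build_request(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["choices"][0]["text"]

payload = build_request("List three common symptoms of influenza.")
print(json.dumps(payload, indent=2))
```

Call `complete(...)` against a running server to get completions; the speculative draft model is applied server-side with no client changes.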
Training Data
- Data Quality: Rigorously filtered and cleaned to ensure high-quality training data
Use Cases
- High-concurrency inference services
- Real-time dialogue systems
- Code generation and completion
- Mathematical reasoning and computation
- Production environments requiring low-latency responses
Open Source Contribution
We actively contribute back to the open-source community. Related optimization achievements have been submitted to the SGLang community:
- PR #15119: EAGLE3 Optimization Implementation
Limitations and Notes
- This is a draft model; it must be paired with its target model to perform speculative sampling
- FP8 quantization is recommended for optimal performance
- Performance may vary across different hardware platforms
- Medical domain applications must comply with relevant regulations; model outputs are for reference only
License
This code repository is licensed under the MIT License.
Model Tree
Base model: inclusionAI/Ling-flash-base-2.0