AntAngelMed-eagle3
Model Overview
AntAngelMed-eagle3 is a high-performance draft model designed for inference acceleration. It uses EAGLE3 speculative sampling to balance inference speed with model stability.
The model is trained on high-quality medical datasets and significantly boosts inference throughput while maintaining high accuracy, making it well suited to high-load production environments.
Key Features
- Speculative Sampling Optimization: Based on EAGLE3 technology, achieving a high verification pass rate with a speculative length of 4
- Outstanding Throughput: The FP8 quantization + EAGLE3 configuration improves throughput by up to ~90%
- Production-Grade Optimization: 3267 tokens/s output throughput on a single NVIDIA H200
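To make the draft/verify mechanism behind these numbers concrete, here is a minimal, illustrative sketch of one speculative decoding step. The "models" are toy next-token functions (not the real networks), and the acceptance rule shown is the simple greedy variant (accept the longest prefix where draft and target agree); EAGLE3's actual sampling-based verification is more involved.

```python
def draft_model(prefix):
    # Toy draft model: predicts last token + 1 (stand-in only).
    return prefix[-1] + 1

def target_model(prefix):
    # Toy target model: agrees with the draft except at every 4th position.
    nxt = prefix[-1] + 1
    return nxt if len(prefix) % 4 != 0 else nxt + 10

def speculative_step(prefix, k=4):
    """Draft k tokens, verify with the target, return the accepted tokens.

    The target always contributes one token (a correction or a bonus),
    so each step accepts between 1 and k+1 tokens -- this is why the
    average acceptance length reported below sits between 1 and k.
    """
    # 1) Draft proposes k tokens autoregressively (cheap, 1-layer model).
    draft_ctx = list(prefix)
    proposed = []
    for _ in range(k):
        t = draft_model(draft_ctx)
        proposed.append(t)
        draft_ctx.append(t)
    # 2) Target verifies all k positions (in practice: one batched forward).
    accepted = []
    ctx = list(prefix)
    for t in proposed:
        expected = target_model(ctx)
        if expected == t:
            accepted.append(t)
            ctx.append(t)
        else:
            accepted.append(expected)  # target's token replaces the mismatch
            return accepted
    # All k drafts accepted; the target appends one bonus token.
    accepted.append(target_model(ctx))
    return accepted

tokens = speculative_step([1, 2, 3], k=4)
print(tokens)
```

Because every accepted token is one the target model would have produced anyway, speculation changes latency but not output quality.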
Performance
Speculative Sampling Efficiency
Average Acceptance Length with speculative length of 4:
| Benchmark | Average Acceptance Length |
|---|---|
| HumanEval | 2.816 |
| GSM8K | 3.24 |
| Math-500 | 3.326 |
| Med_MCPA | 2.600 |
| Health_Bench | 2.446 |
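A rough way to relate acceptance length to end-to-end gain is the following back-of-the-envelope estimate. The draft-to-target cost ratio here is an assumed placeholder, and real throughput also depends on batching, KV-cache traffic, and kernel overheads, so this is illustrative only.

```python
def estimated_speedup(accept_len, k=4, draft_cost=0.05):
    """Rough speedup of speculative decoding over plain autoregression.

    Per verification cycle, the target runs one forward pass and the
    1-layer draft runs k cheap passes (draft_cost = assumed per-pass
    cost relative to one target pass). Each cycle yields accept_len
    tokens on average, versus 1 token per target pass without speculation.
    """
    cost_per_cycle = 1.0 + k * draft_cost
    return accept_len / cost_per_cycle

# Acceptance lengths from the table above.
for name, a in [("HumanEval", 2.816), ("GSM8K", 3.24), ("Math-500", 3.326)]:
    print(f"{name}: ~{estimated_speedup(a):.2f}x")
```

Under these assumptions, a higher acceptance length translates almost linearly into higher throughput, which matches the ordering of the measured improvements below.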
Throughput Improvement
Throughput improvement of FP8 quantization + EAGLE3 over FP8-only, measured at a concurrency of 16:
| Benchmark | Throughput Improvement |
|---|---|
| HumanEval | +67.3% |
| GSM8K | +58.6% |
| Math-500 | +89.8% |
| Med_MCPA | +46% |
| Health_Bench | +45.3% |
Peak Inference Performance
- Hardware Environment: NVIDIA H200 single GPU
Figure: Throughput performance comparison and accuracy metrics under equal compute on 1xH200
Technical Specifications
- Model Architecture: LlamaForCausalLMEagle3
- Number of Layers: 1 (draft model)
- Hidden Size: 4096
- Attention Heads: 32 (KV heads: 8)
- Intermediate Size: 14336
- Vocabulary Size: 157,184
- Max Position Embeddings: 32,768
- Data Type: bfloat16
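A few useful quantities are implied by the specification table but not stated directly; this short sanity check derives them (the per-head dimension and GQA group size follow from the listed values).

```python
# Values from the specification table above.
hidden_size = 4096
num_heads = 32
num_kv_heads = 8

# Derived quantities (not stated explicitly in the table).
head_dim = hidden_size // num_heads        # per-head dimension
gqa_group = num_heads // num_kv_heads      # query heads sharing one KV head
kv_per_token = 2 * num_kv_heads * head_dim # K+V values cached per token, per layer

print(head_dim, gqa_group, kv_per_token)
```

With only one layer and 8 KV heads, the draft model's KV cache is tiny relative to the target's, which is what keeps the speculative passes cheap.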
Quick Start
Requirements
- NVIDIA H200-class GPU (or comparable compute)
- CUDA 12.0+
- PyTorch 2.0+
Installation
pip install sglang==0.5.6
Then apply the changes from PR https://github.com/sgl-project/sglang/pull/15119.
Inference with SGLang
python3 -m sglang.launch_server \
--model-path MedAIBase/AntAngelMed-FP8 \
--host 0.0.0.0 --port 30012 \
--trust-remote-code \
--attention-backend fa3 \
--mem-fraction-static 0.9 \
--tp-size 1 \
--speculative-algorithm EAGLE3 \
--speculative-draft-model-path MedAIBase/AntAngelMed-eagle3 \
--speculative-num-steps 3 \
--speculative-eagle-topk 1 \
--speculative-num-draft-tokens 4
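Once the server is up, speculative decoding is transparent to clients: requests go to SGLang's OpenAI-compatible completions endpoint with no extra fields. A minimal client sketch (the prompt and sampling parameters are placeholders; the host/port match the launch command above):

```python
import json
import urllib.request

def build_request(prompt, max_tokens=256, temperature=0.0):
    # Payload for POST /v1/completions on the server launched above.
    return {
        "model": "MedAIBase/AntAngelMed-FP8",
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }

def complete(prompt, host="http://127.0.0.1:30012"):
    # Send the request to the running SGLang server and return the text.
    req = urllib.request.Request(
        f"{host}/v1/completions",
        data=json.dumps(build_request(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["choices"][0]["text"]

payload = build_request("List three common symptoms of influenza.")
print(json.dumps(payload, indent=2))
```

Call `complete(...)` against a running server to get completions; the speculative draft model is applied server-side with no client changes.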
Training Data
- Data Quality: Rigorously filtered and cleaned to ensure high-quality training data
Use Cases
- High-concurrency inference services
- Real-time dialogue systems
- Code generation and completion
- Mathematical reasoning and computation
- Production environments requiring low-latency responses
Open Source Contribution
We actively contribute back to the open-source community. Related optimization achievements have been submitted to the SGLang community:
- PR #15119: EAGLE3 Optimization Implementation
Limitations and Notes
- This is a draft model; it must be paired with its target model to perform speculative sampling
- FP8 quantization is recommended for optimal performance
- Performance may vary across different hardware platforms
- Medical domain applications must comply with relevant regulations; model outputs are for reference only
License
This code repository is licensed under the MIT License.
Model Tree
Base model: inclusionAI/Ling-flash-base-2.0