nvidia/audio-flamingo-3-chat

@@ -15,6 +15,8 @@ datasets:
 - nvidia/AudioSkills
 - nvidia/AF-Think
 - nvidia/AF-Chat
 ---
 # Model Overview
@@ -72,20 +74,19 @@ Audio Flamingo 3 (AF3) is a fully open, state-of-the-art Large Audio-Language Mo
 Extensive evaluations confirm AF3’s effectiveness, setting new benchmarks on over 20 public audio understanding and reasoning tasks.
 **This model is for non-commercial research purposes only.**
 ## Results:
 <center><img src="static/af3_radial-1.png" width="400"></center>
-<br>
 ## Model Architecture:
 Audio Flamingo 3 uses the AF-Whisper unified audio encoder, an MLP-based audio adaptor, a decoder-only LLM backbone (Qwen2.5-7B), and a streaming TTS module (AF3-Chat). It accepts up to 10 minutes of audio input.
 <center><img src="static/af3_main_diagram-1.png" width="800"></center>
 ## License / Terms of Use
 The model is released under the [NVIDIA OneWay Noncommercial License](static/NVIDIA_OneWay_Noncommercial_License.docx). Portions of the dataset generation are also subject to the [Qwen Research License](https://huggingface.co/Qwen/Qwen2.5-3B/blob/main/LICENSE) and OpenAI’s [Terms of Use](https://openai.com/policies/terms-of-use).
@@ -123,21 +124,21 @@ AF3 uses:
 **This model was developed based on [NVILA](https://github.com/NVlabs/VILA/tree/main/scripts/NVILA-Lite) and [Qwen-2.5-7B](https://huggingface.co/Qwen/Qwen2.5-7B).** <br>
 ## Input:
-Input Type: Audio, Text <br>
-Input Format: WAV/MP3/FLAC, UTF-8 text <br>
-Input Parameters: Audio is Two-Dimensional (2D) and Text is One-Dimensional (1D)<br>
-Other Properties Related to Input: <br>
--Max Audio Length: 10 Minutes <br>
--Max Text Length: 16000 tokens<br>
 ## Output:
-Output Type: Text (and optional speech) <br>
-Text Format: UTF-8 string  <br>
-Output Parameters: One-Dimensional (1D)<br>
-Other Properties Related to Output: <br>
--Max Text Length: 1024 tokens <br>
--Speech Format: streaming TTS (text-to-speech) waveform<br>
 Our AI models are designed and/or optimized to run on NVIDIA GPU-accelerated systems (A100/H100). By leveraging NVIDIA’s hardware (e.g. GPU cores) and software frameworks (e.g., CUDA libraries), the model achieves faster training and inference times compared to CPU-only solutions. <br>

 - nvidia/AudioSkills
 - nvidia/AF-Think
 - nvidia/AF-Chat
+base_model:
+- nvidia/audio-flamingo-3
 ---
 # Model Overview
 Extensive evaluations confirm AF3’s effectiveness, setting new benchmarks on over 20 public audio understanding and reasoning tasks.
+**This model is the chat version of AF3, capable of voice chat and multi-turn, multi-audio dialogue. The non-chat version can be found [here](https://huggingface.co/nvidia/audio-flamingo-3/).**
 **This model is for non-commercial research purposes only.**
 ## Results:
 <center><img src="static/af3_radial-1.png" width="400"></center>
 ## Model Architecture:
 Audio Flamingo 3 uses the AF-Whisper unified audio encoder, an MLP-based audio adaptor, a decoder-only LLM backbone (Qwen2.5-7B), and a streaming TTS module (AF3-Chat). It accepts up to 10 minutes of audio input.
 <center><img src="static/af3_main_diagram-1.png" width="800"></center>
 ## License / Terms of Use
 The model is released under the [NVIDIA OneWay Noncommercial License](static/NVIDIA_OneWay_Noncommercial_License.docx). Portions of the dataset generation are also subject to the [Qwen Research License](https://huggingface.co/Qwen/Qwen2.5-3B/blob/main/LICENSE) and OpenAI’s [Terms of Use](https://openai.com/policies/terms-of-use).
 **This model was developed based on [NVILA](https://github.com/NVlabs/VILA/tree/main/scripts/NVILA-Lite) and [Qwen-2.5-7B](https://huggingface.co/Qwen/Qwen2.5-7B).** <br>
 ## Input:
+- Input Type: Audio, Text <br>
+- Input Format: WAV/MP3/FLAC, UTF-8 text <br>
+- Input Parameters: Audio is Two-Dimensional (2D) and Text is One-Dimensional (1D)<br>
+- Other Properties Related to Input: <br>
+- Max Audio Length: 10 Minutes <br>
+- Max Text Length: 16000 tokens<br>
 ## Output:
+- Output Type: Text (and optional speech) <br>
+- Text Format: UTF-8 string  <br>
+- Output Parameters: One-Dimensional (1D)<br>
+- Other Properties Related to Output: <br>
+- Max Text Length: 1024 tokens <br>
+- Speech Format: streaming TTS (text-to-speech) waveform<br>
 Our AI models are designed and/or optimized to run on NVIDIA GPU-accelerated systems (A100/H100). By leveraging NVIDIA’s hardware (e.g. GPU cores) and software frameworks (e.g., CUDA libraries), the model achieves faster training and inference times compared to CPU-only solutions. <br>
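
The input limits listed in the card (WAV/MP3/FLAC audio, at most 10 minutes, at most 16000 text tokens) can be checked before a request reaches the model. The following is a minimal sketch of such a pre-flight check, not part of the AF3 codebase: the function name `check_af3_input` and its parameters are hypothetical, and it assumes the caller has already measured the audio duration and counted the prompt tokens with whatever tooling they use.

```python
from pathlib import Path

# Limits copied from the Input section of the model card.
SUPPORTED_AUDIO_SUFFIXES = {".wav", ".mp3", ".flac"}
MAX_AUDIO_SECONDS = 10 * 60      # "Max Audio Length: 10 Minutes"
MAX_INPUT_TEXT_TOKENS = 16000    # "Max Text Length: 16000 tokens"


def check_af3_input(audio_path, audio_seconds, text_token_count):
    """Hypothetical helper: return a list of problems with a request.

    An empty list means the request fits the documented input limits.
    The duration and token count are supplied by the caller; this sketch
    does not decode audio or tokenize text itself.
    """
    problems = []

    # Compare the file extension case-insensitively against the listed formats.
    suffix = Path(audio_path).suffix.lower()
    if suffix not in SUPPORTED_AUDIO_SUFFIXES:
        problems.append(f"unsupported audio format: {suffix or '(none)'}")

    if audio_seconds > MAX_AUDIO_SECONDS:
        problems.append(
            f"audio is {audio_seconds:.0f}s, limit is {MAX_AUDIO_SECONDS}s"
        )

    if text_token_count > MAX_INPUT_TEXT_TOKENS:
        problems.append(
            f"prompt has {text_token_count} tokens, limit is {MAX_INPUT_TEXT_TOKENS}"
        )

    return problems


if __name__ == "__main__":
    # A 30-second WAV clip with a short prompt fits all three limits.
    print(check_af3_input("clip.wav", 30.0, 120))
    # An OGG file that is too long, with an oversized prompt, reports three problems.
    for problem in check_af3_input("clip.ogg", 700.0, 20000):
        print(problem)
```

Returning a list of problems rather than raising on the first one lets a caller report every violated limit at once. The 1024-token output limit is not checked here because it applies to generation, not to the request.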