SreyanG-NVIDIA committed on
Commit
5d7aa11
·
verified ·
1 Parent(s): 8b18d0f

Update README.md

Files changed (1)
  1. README.md +16 -15
README.md CHANGED
@@ -15,6 +15,8 @@ datasets:
 - nvidia/AudioSkills
 - nvidia/AF-Think
 - nvidia/AF-Chat
+base_model:
+- nvidia/audio-flamingo-3
 ---
 # Model Overview
 
@@ -72,20 +74,19 @@ Audio Flamingo 3 (AF3) is a fully open, state-of-the-art Large Audio-Language Mo
 
 Extensive evaluations confirm AF3’s effectiveness, setting new benchmarks on over 20 public audio understanding and reasoning tasks.
 
+**This model is the chat version of AF3, capable of voice chat and multi-turn, multi-audio dialogue. The non-chat version can be found [here](https://huggingface.co/nvidia/audio-flamingo-3/).**
+
 **This model is for non-commercial research purposes only.**
 
 
 ## Results:
 <center><img src="static/af3_radial-1.png" width="400"></center>
 
-<br>
-
 ## Model Architecture:
 Audio Flamingo 3 uses the AF-Whisper unified audio encoder, an MLP-based audio adaptor, a decoder-only LLM backbone (Qwen2.5-7B), and a streaming TTS module (AF3-Chat). Audio Flamingo 3 can take up to 10 minutes of audio input.
 
 <center><img src="static/af3_main_diagram-1.png" width="800"></center>
 
-
 ## License / Terms of Use
 The model is released under the [NVIDIA OneWay Noncommercial License](static/NVIDIA_OneWay_Noncommercial_License.docx). Portions of the dataset generation are also subject to the [Qwen Research License](https://huggingface.co/Qwen/Qwen2.5-3B/blob/main/LICENSE) and OpenAI’s [Terms of Use](https://openai.com/policies/terms-of-use).
 
@@ -123,21 +124,21 @@ AF3 uses:
 **This model was developed based on [NVILA](https://github.com/NVlabs/VILA/tree/main/scripts/NVILA-Lite) and [Qwen-2.5-7B](https://huggingface.co/Qwen/Qwen2.5-7B)** <br>
 
 ## Input:
-Input Type: Audio, Text <br>
-Input Format: WAV/MP3/FLAC, UTF-8 text <br>
-Input Parameters: Audio is Two-Dimensional (2D) and Text is One-Dimensional (1D)<br>
-Other Properties Related to Input: <br>
--Max Audio Length: 10 Minutes <br>
--Max Text Length: 16000 tokens<br>
+- Input Type: Audio, Text <br>
+- Input Format: WAV/MP3/FLAC, UTF-8 text <br>
+- Input Parameters: Audio is Two-Dimensional (2D) and Text is One-Dimensional (1D)<br>
+- Other Properties Related to Input: <br>
+- Max Audio Length: 10 Minutes <br>
+- Max Text Length: 16000 tokens<br>
 
 
 ## Output:
-Output Type: Text (and optional speech) <br>
-Text Format: UTF-8 string <br>
-Output Parameters: One-Dimensional (1D)<br>
-Other Properties Related to Output: <br>
--Max Text Length: 1024 tokens <br>
--Speech Format: streaming TTS (text-to-speech) waveform<br>
+- Output Type: Text (and optional speech) <br>
+- Text Format: UTF-8 string <br>
+- Output Parameters: One-Dimensional (1D)<br>
+- Other Properties Related to Output: <br>
+- Max Text Length: 1024 tokens <br>
+- Speech Format: streaming TTS (text-to-speech) waveform<br>
 
 
 Our AI models are designed and/or optimized to run on NVIDIA GPU-accelerated systems (A100/H100). By leveraging NVIDIA’s hardware (e.g., GPU cores) and software frameworks (e.g., CUDA libraries), the model achieves faster training and inference times compared to CPU-only solutions. <br>
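
The input spec pins a concrete limit: audio of up to 10 minutes, with WAV among the accepted formats. A minimal sketch of checking that cap client-side before submitting audio, using only the Python standard library — the helper name is illustrative and not part of AF3's API:

```python
import wave

# From the input spec above: max audio length is 10 minutes.
MAX_AUDIO_SECONDS = 10 * 60

def audio_within_limit(path: str) -> bool:
    """Return True if a WAV file's duration fits the 10-minute input cap."""
    with wave.open(path, "rb") as wf:
        duration_seconds = wf.getnframes() / wf.getframerate()
    return duration_seconds <= MAX_AUDIO_SECONDS
```

Longer recordings would need to be trimmed or split before being passed to the model.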