---
license: apache-2.0
pipeline_tag: audio-text-to-text
language:
- en
- zh
base_model:
- Yi3852/MuFun-Base
---
An instruct version of the MuFun model proposed in [Advancing the Foundation Model for Music Understanding](https://arxiv.org/abs/2508.01178).

Gradio demo: http://47.121.209.64/mufun_demo_chat

Training code: https://github.com/laitselec/MuFun
|
## Usage

Some audio processing packages, such as `mutagen` and `torchaudio`, need to be installed first.
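For example, the dependencies mentioned above can be installed with pip (exact versions are not pinned by this model card):

```shell
# install the audio processing dependencies named in the note above
pip install mutagen torchaudio
```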
```python
from transformers import AutoTokenizer, AutoModelForCausalLM

hf_path = 'Yi3852/MuFun-Instruct'
tokenizer = AutoTokenizer.from_pretrained(hf_path, use_fast=False)
device = 'cuda'
model = AutoModelForCausalLM.from_pretrained(hf_path, trust_remote_code=True, torch_dtype="bfloat16")
model.to(device)

# single audio
# during inference the audio (converted to a sequence of embeddings) is placed at the position of the <audio> tag in the prompt
aud = "/path/to/your/song.mp3"
inp = "\n<audio>Can you listen to this song and tell me its lyrics?"
res = model.chat(prompt=inp, audio_files=aud, tokenizer=tokenizer)
print(res)

# multiple audios
# each song is placed at its corresponding <audio> tag in the prompt
aud = ["/path/to/your/song1.mp3", '/path/to/your/song2.mp3']
inp = "\n<audio> This is song1. <audio> This is song2. Which song do you like more? Tell me the reason."
res = model.chat(prompt=inp, audio_files=aud, tokenizer=tokenizer)
print(res)

# analyze only a specific segment of the audio using the segs parameter
# format is [start_time, end_time] in seconds; for multiple audios, segs can be passed as [[0,30],[60,90]] or [None,[0,30.0]]
aud = "/path/to/your/song.mp3"
inp = "\n<audio>How is the rhythm of this music clip?"
res = model.chat(prompt=inp, audio_files=aud, segs=[0,30.0], tokenizer=tokenizer)
print(res)

# setting audio_files=None also works, but using MuFun as a text-only model is not recommended
```
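Since each audio file is matched to an `<audio>` tag in order, a mismatch between tag count, file count, and `segs` entries is an easy mistake. A small sanity-check helper (hypothetical, not part of the MuFun API) can catch this before calling `model.chat`:

```python
# Hypothetical helper (not part of MuFun): validate that a prompt, the
# audio_files argument, and an optional segs argument are consistent,
# following the pairing conventions described in the usage examples above.
def check_audio_prompt(prompt, audio_files, segs=None):
    # normalize a single path to a one-element list, as model.chat accepts both
    files = [audio_files] if isinstance(audio_files, str) else list(audio_files)
    n_tags = prompt.count("<audio>")
    if n_tags != len(files):
        raise ValueError(f"prompt has {n_tags} <audio> tags for {len(files)} files")
    if segs is not None:
        # a flat [start, end] pair applies to a single file;
        # otherwise segs is a list of pairs/None, one per file
        flat = bool(segs) and not isinstance(segs[0], (list, type(None)))
        pairs = [segs] if flat else list(segs)
        if len(pairs) != len(files):
            raise ValueError(f"{len(pairs)} segs entries for {len(files)} files")
        for p in pairs:
            if p is not None and not (0 <= p[0] < p[1]):
                raise ValueError(f"bad segment {p}")
    return True
```

For example, `check_audio_prompt("\n<audio>lyrics?", "song.mp3", segs=[0, 30.0])` passes, while a prompt with one `<audio>` tag and two files raises a `ValueError`.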

## Citation

```bibtex
@misc{jiang2025advancingfoundationmodelmusic,
      title={Advancing the Foundation Model for Music Understanding},
      author={Yi Jiang and Wei Wang and Xianwen Guo and Huiyun Liu and Hanrui Wang and Youri Xu and Haoqi Gu and Zhongqian Xie and Chuanjiang Luo},
      year={2025},
      eprint={2508.01178},
      archivePrefix={arXiv},
      primaryClass={cs.SD},
      url={https://arxiv.org/abs/2508.01178},
}
```