Audio Flamingo Next Think: Temporally Grounded Audio Reasoning for Speech, Sound, and Music

Upload audio or paste a YouTube URL and ask multi-step, timestamp-grounded questions that require combining evidence across events, speakers, and long-form context with the AF-Next Think checkpoint.

Authors: Sreyan Ghosh1,2, Arushi Goel1, Kaousheik Jayakumar2, Lasha Koroshinadze2, Nishit Anand2, Zhifeng Kong1, Siddharth Gururani1, Sang-gil Lee1, Jaehyeon Kim1, Aya Aljafari1, Chao-Han Huck Yang1, Sungwon Kim1, Ramani Duraiswami2, Dinesh Manocha2, Mohammad Shoeybi1, Bryan Catanzaro1, Ming-Yu Liu1, Wei Ping1

1NVIDIA, CA, USA | 2University of Maryland, College Park, USA

Correspondence: sreyang@umd.edu, arushig@nvidia.com

Prompting note: AF-Next Think performs best when the prompt explicitly requests step-by-step, timestamp-grounded reasoning followed by a final answer.
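
One way to apply this note is to wrap each question in a fixed instruction before sending it to the model. The sketch below is illustrative only: the helper name `build_think_prompt` and the exact instruction wording are assumptions, not an official template for the checkpoint.

```python
def build_think_prompt(question: str) -> str:
    """Append a request for timestamp-grounded reasoning to a question.

    The instruction wording is an assumption made for illustration;
    it is not an official prompt template.
    """
    question = question.strip()
    if not question:
        raise ValueError("question must not be empty")
    instruction = (
        "Reason step by step, citing timestamps from the audio for each "
        "piece of evidence, and then state the final answer."
    )
    # Keep the user's question first so the task stays unambiguous.
    return f"{question}\n\n{instruction}"


if __name__ == "__main__":
    print(build_think_prompt(
        "What precise description did the commentator use for the punch "
        "that ended the fight?"
    ))
```

Keeping the instruction in one place makes it easy to adjust the wording and compare responses across prompts.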

Prompt Guide

| Task | Prompt | Recommended Checkpoint(s) |
|---|---|---|
| ASR | Transcribe the input speech. | Instruct, Think |
| AST | Translate any speech you hear from <src_lang> into <tgt_lang>. | Instruct, Think |
| Short Audio Captioning | Generate a caption for the input audio. | Captioner, Think |
| Long Audio Captioning | Generate a detailed caption for the input audio. In the caption, transcribe all spoken content by all speakers in the audio precisely. | Captioner, Think |
| Music Captioning | Summarize the track with precision: mention its musical style, BPM, key, arrangement, production choices, and the emotions or story it conveys. | Captioner, Instruct, Think |
| Lyrics | Generate a lyrics transcription from the input song. | Instruct, Captioner, Think |
| QA | What precise description did the commentator use for the punch that ended the fight? | Instruct, Think |
| Timestamped Multi-Talker ASR | Transcribe the input audio. If multiple speakers are present, provide diarized transcripts with speaker labels.<br>[Speaker 1] ...<br>[Speaker 2] ... | Instruct, Think |
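
The Prompt Guide above can be kept as a small lookup so that prompts are reused verbatim and the AST placeholders are filled consistently. This is a minimal sketch covering only a subset of the tasks: the names `PROMPT_GUIDE`, `get_prompt`, and `recommended_checkpoints` are hypothetical helpers, and only the prompt strings and checkpoint names come from the guide.

```python
# Prompt text and checkpoint names are copied from the Prompt Guide;
# the data structure and helper names are illustrative assumptions.
PROMPT_GUIDE = {
    "ASR": (
        "Transcribe the input speech.",
        ("Instruct", "Think"),
    ),
    "AST": (
        "Translate any speech you hear from <src_lang> into <tgt_lang>.",
        ("Instruct", "Think"),
    ),
    "Short Audio Captioning": (
        "Generate a caption for the input audio.",
        ("Captioner", "Think"),
    ),
    "Lyrics": (
        "Generate a lyrics transcription from the input song.",
        ("Instruct", "Captioner", "Think"),
    ),
}


def get_prompt(task, src_lang=None, tgt_lang=None):
    """Return the prompt for a task, filling AST language placeholders."""
    template, _ = PROMPT_GUIDE[task]
    if "<src_lang>" in template or "<tgt_lang>" in template:
        if not src_lang or not tgt_lang:
            raise ValueError("this task needs both src_lang and tgt_lang")
        template = template.replace("<src_lang>", src_lang)
        template = template.replace("<tgt_lang>", tgt_lang)
    return template


def recommended_checkpoints(task):
    """Return the checkpoint names the guide recommends for a task."""
    return PROMPT_GUIDE[task][1]


if __name__ == "__main__":
    print(get_prompt("AST", src_lang="German", tgt_lang="English"))
    print(recommended_checkpoints("Lyrics"))
```

Raising an error when a language is missing avoids sending a prompt that still contains a literal `<src_lang>` or `<tgt_lang>` placeholder.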

Example Prompts
YouTube URL Prompt