Audio Flamingo Next Think: Temporally Grounded Audio Reasoning for Speech, Sound, and Music

Upload audio or paste a YouTube URL and ask multi-step, timestamp-grounded questions that require combining evidence across events, speakers, and long-form context with the AF-Next Think checkpoint.

Authors: Sreyan Ghosh¹,², Arushi Goel¹, Kaousheik Jayakumar², Lasha Koroshinadze², Nishit Anand², Zhifeng Kong¹, Siddharth Gururani¹, Sang-gil Lee¹, Jaehyeon Kim¹, Aya Aljafari¹, Chao-Han Huck Yang¹, Sungwon Kim¹, Ramani Duraiswami², Dinesh Manocha², Mohammad Shoeybi¹, Bryan Catanzaro¹, Ming-Yu Liu¹, Wei Ping¹

¹NVIDIA, CA, USA | ²University of Maryland, College Park, USA

Correspondence: sreyang@umd.edu, arushig@nvidia.com

Prompting note: AF-Next Think is strongest when you explicitly request step-by-step, timestamp-grounded reasoning followed by a final answer.
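
As a rough illustration of the prompting note above, the sketch below wraps a plain question with an explicit request for step-by-step, timestamp-grounded reasoning and a final answer. The helper name `build_think_prompt` and the exact instruction wording are assumptions made for illustration; they are not an official template from the authors.

```python
def build_think_prompt(question: str) -> str:
    """Append an explicit reasoning request to a question (illustrative only)."""
    question = question.strip()
    if not question:
        raise ValueError("question must be non-empty")
    # Assumed wording: the source only says to request step-by-step,
    # timestamp-grounded reasoning and then a final answer.
    instruction = (
        "Reason step by step, citing timestamps for the audio evidence "
        "you rely on, then give a final answer."
    )
    return f"{question}\n{instruction}"


prompt = build_think_prompt(
    "What precise description did the commentator use for the punch that ended the fight?"
)
print(prompt)
```

The resulting string can be pasted into the prompt field as-is; the question stays on the first line so it remains easy to spot.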

Prompt Guide

Task	Prompt	Recommended Checkpoint(s)
ASR	`Transcribe the input speech.`	`Instruct`, `Think`
AST	`Translate any speech you hear from <src_lang> into <tgt_lang>.`	`Instruct`, `Think`
Short Audio Captioning	`Generate a caption for the input audio.`	`Captioner`, `Think`
Long Audio Captioning	`Generate a detailed caption for the input audio. In the caption, transcribe all spoken content by all speakers in the audio precisely.`	`Captioner`, `Think`
Music Captioning	`Summarize the track with precision: mention its musical style, BPM, key, arrangement, production choices, and the emotions or story it conveys.`	`Captioner`, `Instruct`, `Think`
Lyrics	`Generate a lyrics transcription from the input song.`	`Instruct`, `Captioner`, `Think`
QA	`What precise description did the commentator use for the punch that ended the fight?`	`Instruct`, `Think`
Timestamped Multi-Talker ASR	`Transcribe the input audio. If multiple speakers are present, provide diarized transcripts with speaker labels.` `[Speaker 1] ...` `[Speaker 2] ...`	`Instruct`, `Think`
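
For scripted use, the prompt guide above could be kept as a small lookup table. The sketch below copies a few prompts verbatim from the guide and fills the `<src_lang>` and `<tgt_lang>` placeholders of the AST prompt. The task keys (`"asr"`, `"ast"`, and so on) and the `get_prompt` helper are hypothetical names chosen for this example, not part of any released API.

```python
# Prompts copied from the guide; the dictionary keys are hypothetical labels.
PROMPTS = {
    "asr": "Transcribe the input speech.",
    "ast": "Translate any speech you hear from <src_lang> into <tgt_lang>.",
    "short_caption": "Generate a caption for the input audio.",
    "lyrics": "Generate a lyrics transcription from the input song.",
}


def get_prompt(task: str, src_lang: str = "", tgt_lang: str = "") -> str:
    """Return the guide prompt for a task, filling language placeholders if present."""
    if task not in PROMPTS:
        raise KeyError(f"unknown task: {task!r}")
    template = PROMPTS[task]
    if "<src_lang>" in template or "<tgt_lang>" in template:
        if not src_lang or not tgt_lang:
            raise ValueError("this task needs both src_lang and tgt_lang")
        template = template.replace("<src_lang>", src_lang)
        template = template.replace("<tgt_lang>", tgt_lang)
    return template


print(get_prompt("asr"))
print(get_prompt("ast", src_lang="German", tgt_lang="English"))
```

Keeping the prompts as data rather than inline strings makes it easier to stay close to the recommended wording when switching between tasks.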

Audio Input

Upload Audio File

YouTube URL

Paste any YouTube URL - we'll extract high-quality audio automatically
