
Audio Reasoning Challenge

Interspeech 2026

News
  • 2025-12-03
    Please refer to the FAQs page for the frequently asked questions.
  • 2025-12-01
Please join our Slack and WeChat groups (see the Contact section) for real-time communication.
  • 2025-12-01
Registration (link) for teams is now open! The deadline for registration is 2026-01-15. Register early to get the latest updates.
  • 2025-12-01
    Baselines (link) released!
  • 2025-11-25
    Website goes live!

Introduction

Understanding and reasoning about sound is a fundamental aspect of human intelligence. From spoken conversations and musical compositions to subtle environmental cues, humans can not only perceive a wide variety of auditory signals but also interpret their meanings, draw inferences, and make decisions in complex acoustic scenarios. Replicating this capability in artificial systems has long been a key goal of AI research.

Recent progress in Large Language Models (LLMs), combined with advances in audio processing, has given rise to Large Audio Language Models (LALMs)[1-10]. Leveraging large-scale multimodal training and sophisticated architectures, LALMs have achieved impressive results in audio perception tasks such as automatic speech recognition (ASR) and automated audio captioning (AAC). Beyond perception, several recent works have made initial attempts to bring explicit Chain-of-Thought (CoT) reasoning into the audio domain, including Audio-CoT[11], Audio-Reasoner[12], Qwen3-Omni-Thinking[13], and Audio Flamingo 3[14], demonstrating improved reasoning performance by integrating advanced cross-modal thinking strategies.

However, despite these advances, current LALMs still exhibit limited and unstable reasoning capabilities. Even on established reasoning benchmarks like MMAR[15] and MMAU-Pro[16], they often produce direct answers without presenting the underlying reasoning process, or show inconsistent performance across tasks and inputs. This lack of transparent and reliable reasoning limits interpretability, trustworthiness, and the potential to generalize reasoning ability to unseen audio scenarios.

Challenge Goals

To address this gap, we have enriched the MMAR benchmark with manually labeled CoT annotations and explicit reasoning cues, enabling systematic evaluation of LALMs in reasoning-intensive tasks. Building on this resource, we propose the Audio Reasoning Challenge at Interspeech 2026, designed to push LALMs beyond surface-level response accuracy toward robust, interpretable reasoning.

Our evaluation framework adopts a stricter criterion: a prediction is considered correct only if both the reasoning path and the final answer are accurate, ensuring that models are rewarded for genuine, logically consistent thought processes. The challenge features two complementary tracks:

  1. Single Model Track: Participants can use open-source data to post-train open-source models, focusing on intrinsic model reasoning capabilities.
  2. Agent Track: Participants can use open-source models to build an agent system or pipeline without human-in-the-loop intervention, emphasizing system-level orchestration and tool use.

Examples (with CoT annotated) from the MMAR benchmark, spanning audio, speech, music, and their mix, and illustrating challenges at the signal, perceptual, semantic, and cultural levels.

Challenge Tracks

Track 1: Single Model Track

Participants build a single, end-to-end Audio–Language Model that consumes the audio and produces (i) a Chain-of-Thought (CoT) reasoning trace and (ii) a final answer. Systems must perform intrinsic reasoning within a single forward pass, without delegating to external tools, APIs, search engines, or separate controllers. The goal is to isolate model-internal reasoning quality under our strict criterion: a prediction is counted correct only if both the CoT and the final answer are validated.

Learn more about Track 1

Track 2: Agent Track

Participants design an audio reasoning agent that may orchestrate multiple open-source models and tools (e.g., ASR, separation, beat/onset tracking, captioners, planners) to produce a CoT path and a final answer. This track evaluates system-level reasoning: planning, tool selection, and self-checking under the same strict correctness criterion. The emphasis is on transparent trajectories that reveal how intermediate audio analyses contribute to decisions, moving beyond answer-only pipelines.

Learn more about Track 2

Benchmark and Evaluation Metrics

Benchmark

All submissions will be evaluated on an updated version of the MMAR benchmark, a 1,000-item dataset designed for deep audio reasoning across speech, sound, music, and mixed-modality scenarios. Each sample contains audio, a question, a ground-truth answer, and a newly annotated CoT rationale.

Submission Format

Participants must submit a JSONL file to the online leaderboard, where each line contains:

{
  "id": "<sample_id>",
  "thinking_prediction": "<model_or_agent_generated_CoT>",
  "answer_prediction": "<final_prediction>"
}

The leaderboard will automatically compute all metrics and rank systems by the primary score.
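
For reference, below is a minimal Python sketch that writes predictions in this JSONL layout. The sample id, prediction strings, and output filename are placeholders for illustration, not values provided by the challenge.

import json

# Hypothetical predictions; in practice these come from your own model or agent.
predictions = [
    {
        "id": "sample_0001",
        "thinking_prediction": "The clip opens with applause, then a host speaks ...",
        "answer_prediction": "B",
    },
]

# One JSON object per line, matching the submission format above.
with open("submission.jsonl", "w", encoding="utf-8") as f:
    for item in predictions:
        # Every line must carry exactly these three fields.
        assert set(item) == {"id", "thinking_prediction", "answer_prediction"}
        f.write(json.dumps(item, ensure_ascii=False) + "\n")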

Evaluation Metrics

We adopt a two-stage scoring protocol that jointly assesses Answer Correctness and Reasoning Quality.

  • Metric 1: Reasoning Score (Primary)
    1. If the answer_prediction is incorrect, the model’s score is immediately set to 0.
    2. If the answer_prediction is correct, the score is based on the quality of the thinking_prediction, assigned in 0.2 increments from 0.2 to 1.0.
    3. This metric is computed using an LLM-as-a-judge protocol with carefully designed evaluation criteria for consistency and reliability.
  • Metric 2: Cue Score
    \[\text{Cue Score} = \frac{\#\text{correct cues mentioned}}{\#\text{all ground-truth cues}}\]
    1. We measure the proportion of correctly recovered reasoning cues within the generated CoT.
    2. This captures how well a model identifies perceptual and structural evidence in the audio, even if its final answer is incorrect.

Note that systems are ranked by Metric 1. Metric 2 is used for tie-breaking and qualitative leaderboard highlights (e.g., “Best Evidence Alignment”). Both Metric 1 and Metric 2 will be computed over 5 evaluation runs; the final score for each metric is the mean of the 3 middle runs, discarding the highest and lowest results.
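
To make the scoring concrete, here is a small Python sketch of the Cue Score formula and the five-run aggregation described above. Treating cues as exact-matching strings is an illustrative assumption; the official evaluation uses an LLM-as-a-judge protocol with its own cue-matching criteria.

from statistics import mean

def cue_score(predicted_cues: set[str], ground_truth_cues: set[str]) -> float:
    # Proportion of ground-truth cues recovered in the generated CoT.
    if not ground_truth_cues:
        return 0.0
    return len(predicted_cues & ground_truth_cues) / len(ground_truth_cues)

def aggregate_runs(run_scores: list[float]) -> float:
    # Mean of the 3 middle values out of 5 runs (drop the lowest and the highest).
    assert len(run_scores) == 5
    return mean(sorted(run_scores)[1:-1])

# Example: a sample with 4 annotated cues, 3 of which appear in the generated CoT.
print(cue_score({"applause", "male host", "crowd cheering"},
                {"applause", "male host", "crowd cheering", "stadium reverb"}))  # 0.75

# Example: aggregating five evaluation runs of one metric.
print(aggregate_runs([0.58, 0.62, 0.60, 0.55, 0.64]))  # 0.60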

Registration and Leaderboard

Both leaderboard registration and Google Form submission are required. Refer to the Leaderboard tab for more details.

Learn more about Leaderboard

Paper Submission

Participants can submit a paper describing their model or system to Interspeech 2026. Submissions describing competition systems or reporting research results based on the competition benchmark are equally welcome. Submitted papers will go through the same review process as regular papers and will be indexed and included in the ISCA archive.

Learn more about Timeline

Contact

We have a Slack channel and a WeChat group for real-time communication. Please send us an email if you have any private questions.

Slack Channel (scan the QR code to join)
WeChat Group (scan the QR code to join)

Organizers

Organizer 1
Shanghai Jiao Tong University
Nanyang Technological University
Organizer 2
Queen Mary University of London
Organizer 3
NVIDIA Research
Organizer 4
Shanghai Jiao Tong University
Organizer 5
Shanghai Jiao Tong University
Organizer 6
Carnegie Mellon University
Organizer 7
Alibaba Group
Organizer 8
Microsoft Corporation
Organizer 9
Carnegie Mellon University
Organizer 10
Shanghai Jiao Tong University
Organizer 11
Nanyang Technological University
Organizer 12
Shanghai Jiao Tong University

References

  • [1] Gong, Yuan, et al. "Joint audio and speech understanding." Proc. ASRU (2023).
  • [2] Tang, Changli, et al. "SALMONN: Towards generic hearing abilities for large language models." Proc. ICLR (2024).
  • [3] Chu, Yunfei, et al. "Qwen-Audio: Advancing universal audio understanding via unified large-scale audio-language models." arXiv preprint arXiv:2311.07919 (2023).
  • [4] Chu, Yunfei, et al. "Qwen2-Audio technical report." arXiv preprint arXiv:2407.10759 (2024).
  • [5] Ghosh, Sreyan, et al. "GAMA: A large audio-language model with advanced audio understanding and complex reasoning abilities." Proc. EMNLP (2024).
  • [6] Kong, Zhifeng, et al. "Audio Flamingo: A novel audio language model with few-shot learning and dialogue abilities." Proc. ICML (2024).
  • [7] Ghosh, Sreyan, et al. "Audio Flamingo 2: An audio-language model with long-audio understanding and expert reasoning abilities." Proc. ICML (2025).
  • [8] Huang, Ailin, et al. "Step-Audio: Unified understanding and generation in intelligent speech interaction." arXiv preprint arXiv:2502.11946 (2025).
  • [9] Xu, Jin, et al. "Qwen2.5-Omni technical report." arXiv preprint arXiv:2503.20215 (2025).
  • [10] Ding, Ding, et al. "Kimi-Audio technical report." arXiv preprint arXiv:2504.18425 (2025).
  • [11] Ma, Ziyang, et al. "Audio-CoT: Exploring chain-of-thought reasoning in large audio language model." Proc. ASRU (2025).
  • [12] Xie, Zhifei, et al. "Audio-Reasoner: Improving reasoning capability in large audio language models." Proc. EMNLP (2025).
  • [13] Xu, Jin, et al. "Qwen3-Omni technical report." arXiv preprint arXiv:2509.17765 (2025).
  • [14] Goel, Arushi, et al. "Audio Flamingo 3: Advancing audio intelligence with fully open large audio language models." arXiv preprint arXiv:2507.08128 (2025).
  • [15] Ma, Ziyang, et al. "MMAR: A Challenging Benchmark for Deep Reasoning in Speech, Audio, Music, and Their Mix." Proc. NeurIPS (2025).
  • [16] Kumar, Sonal, et al. "MMAU-Pro: A challenging and comprehensive benchmark for holistic evaluation of audio general intelligence." arXiv preprint arXiv:2508.13992 (2025).

  • Follow us on GitHub for updates: @Audio-Reasoning-Challenge