Leaderboard

Final Stage Results

Single Model Track

Rank	Team Name	Affiliation(s)	Rubrics	Accuracy
🏅	xianghe	Tencent AI Lab, The Hong Kong University of Science and Technology (Guangzhou)	65.29	74.0
🥈	tju-cca	Tianjin University Tianjin Key Laboratory of Cognitive Computing and Application	62.55	71.0
🥉	TeleAI-NPU	The Institute of Artificial Intelligence, China Telecom (TeleAI); Audio, Speech and Language Processing Lab, Northwestern Polytechnical University (NPU-ASLP)	62.22	71.7
4th	sujitnoronha	Independent Researcher	60.61	73.4
5th	(◍•ᴗ•◍)	Bilibili Inc.	58.86	71.2
6th	jtjtjt	China Mobile Jiutian Research	58.53	69.3
7th	ABSTRACT	National Yang Ming Chiao Tung University	57.77	69.4
8th	alm-tf	Maum AI Inc.	57.15	68.1
9th	zionhuang	National Taiwan University	56.03	68.4
10th	mimanchi	Beijing Institute of Technology	47.83	63.6
11th	Aniket Tathe	University of Illinois Urbana-Champaign, Carnegie Mellon University	47.49	62.5
12th	erq111	Tsinghua University	38.93	62.6
13th	RobinLy	Ho Chi Minh City University of Science	23.77	46.4
14th	bismarck91	Johns Hopkins University	19.03	40.4

Agent Track

View Results on Codabench

Rank	Team Name	Affiliation(s)	Rubrics	Accuracy
🏅	TalTech	Tallinn University of Technology	69.83	76.9
🥈	AISpeech	AISpeech Co.Ltd, Shanghai Jiao Tong University	66.23	77.4
🥉	AI^2	The Hong Kong University of Science and Technology (Guangzhou), Tencent AI Lab	66.09	75.1
4th	AaronLi	Hangzhou Yiwise Tech Co.Ltd	64.61	72.2
5th	zsy20030814	Tianjin University Tianjin Key Laboratory of Cognitive Computing and Application	63.0	71.0
6th	tt	Shanghai Jiao Tong University	62.63	73.6
7th	sunsunsun	ByteDance	62.35	72.6
8th	alm-tf	Maum AI Inc.	61.85	72.7
9th	roysun2006	Ant Group	60.25	77.1
10th	AudioMind	Southwestern University of Finance and Economics, Wuhan University, Xiaomi Inc.	59.47	68.2
11th	adasp	Telecom Paris	57.51	69.8
12th	artagent	Guangzhou HKUST	54.16	70.5
13th	liusiqi	ByteDance	53.3	71.7
14th	dalietnguyen34	VinUniversity	53.03	68.6
15th	ABSTRACT	National Yang Ming Chiao Tung University	47.84	65.0
16th	RobinLy	Ho Chi Minh City University of Science	23.91	46.4

Registration Process

Welcome to the competition! To ensure a fair and rigorous evaluation, all participants must adhere to the following registration and submission protocols.

To participate in the challenge and appear on the leaderboard, you must complete the following steps:

Step 1: Complete the Google Form with your team details.
Step 2: Once submitted, you will receive an email with detailed instructions for the Leaderboard (Single Model Track) and (Agent Track) registration and submission.

Benchmark and Evaluation Protocol

Benchmark

All submissions will be evaluated on the updated version of MMAR benchmark, a 1,000-item dataset designed for deep audio reasoning across speech, sound, music, and mixed-modality scenarios. Each sample contains audio, a question, a ground-truth answer, and a newly annotated CoT rationale.

Submission Format

Participants must submit a JSONL file to the online leaderboard, where each line contains:

{
  "id": "<sample_id>",
  "thinking_prediction": "<model_or_agent_generated_CoT>",
  "answer_prediction": "<final_prediction>"
}

The leaderboard will automatically compute all metrics and rank systems by the primary score.

Evaluation Metrics

Answer Correctness: If the answer_prediction is incorrect, the score is 0.
Reasoning Quality: If the answer is correct, an LLM-as-a-judge evaluates the thinking_prediction on a scale of 0.2 to 1.0 (in 0.2 increments).
Stability Mechanism: To account for variance, each submission is calculated based on 5 independent evaluation runs. The final score for each metric will be the mean of the 3 middle runs, effectively discarding the highest and lowest results.

Competition Timeline and Phases

The evaluation is split into two distinct phases. Please note the dataset size and dates:

Phase	Start Dates	End Dates	Dataset Size	Submission Limit
Preliminary Stage	2026-01-01 00:00:00	2026-01-29 23:59:59	500 Questions	2 submissions / day
Final Stage	2026-01-30 00:00:00	2026-02-01 23:59:59	1,000 Questions	3 submissions total (over 3 days)

Refresh Time: Submission quotas reset daily at 00:00 UTC+0.

Submission Technical Constraints

To ensure system stability and fair judging, please adhere to the following character limits:

Thinking Prediction: Each individual thinking_prediction must < 10,000 characters.

Note: For the Agent Track, your system should be able to manage the final thinking_prediction.
Daily Limit: During the Preliminary Stage, you are allowed a maximum of 2 submissions per day.