## Leaderboard
### Final Stage Results
#### Single Model Track
| Rank | Team Name | Affiliation(s) | Rubric Score | Accuracy |
|---|---|---|---|---|
| 🏅 | xianghe | Tencent AI Lab, The Hong Kong University of Science and Technology (Guangzhou) | 65.29 | 74.0 |
| 🥈 | tju-cca | Tianjin University Tianjin Key Laboratory of Cognitive Computing and Application | 62.55 | 71.0 |
| 🥉 | TeleAI-NPU | The Institute of Artificial Intelligence, China Telecom (TeleAI); Audio, Speech and Language Processing Lab, Northwestern Polytechnical University (NPU-ASLP) | 62.22 | 71.7 |
| 4th | sujitnoronha | Independent Researcher | 60.61 | 73.4 |
| 5th | (◍•ᴗ•◍) | Bilibili Inc. | 58.86 | 71.2 |
| 6th | jtjtjt | China Mobile Jiutian Research | 58.53 | 69.3 |
| 7th | ABSTRACT | National Yang Ming Chiao Tung University | 57.77 | 69.4 |
| 8th | alm-tf | Maum AI Inc. | 57.15 | 68.1 |
| 9th | zionhuang | National Taiwan University | 56.03 | 68.4 |
| 10th | mimanchi | Beijing Institute of Technology | 47.83 | 63.6 |
| 11th | Aniket Tathe | University of Illinois Urbana-Champaign, Carnegie Mellon University | 47.49 | 62.5 |
| 12th | erq111 | Tsinghua University | 38.93 | 62.6 |
| 13th | RobinLy | Ho Chi Minh City University of Science | 23.77 | 46.4 |
| 14th | bismarck91 | Johns Hopkins University | 19.03 | 40.4 |
#### Agent Track
| Rank | Team Name | Affiliation(s) | Rubric Score | Accuracy |
|---|---|---|---|---|
| 🏅 | TalTech | Tallinn University of Technology | 69.83 | 76.9 |
| 🥈 | AISpeech | AISpeech Co.Ltd, Shanghai Jiao Tong University | 66.23 | 77.4 |
| 🥉 | AI^2 | The Hong Kong University of Science and Technology (Guangzhou), Tencent AI Lab | 66.09 | 75.1 |
| 4th | AaronLi | Hangzhou Yiwise Tech Co.Ltd | 64.61 | 72.2 |
| 5th | zsy20030814 | Tianjin University Tianjin Key Laboratory of Cognitive Computing and Application | 63.0 | 71.0 |
| 6th | tt | Shanghai Jiao Tong University | 62.63 | 73.6 |
| 7th | sunsunsun | ByteDance | 62.35 | 72.6 |
| 8th | alm-tf | Maum AI Inc. | 61.85 | 72.7 |
| 9th | roysun2006 | Ant Group | 60.25 | 77.1 |
| 10th | AudioMind | Southwestern University of Finance and Economics, Wuhan University, Xiaomi Inc. | 59.47 | 68.2 |
| 11th | adasp | Telecom Paris | 57.51 | 69.8 |
| 12th | artagent | Guangzhou HKUST | 54.16 | 70.5 |
| 13th | liusiqi | ByteDance | 53.3 | 71.7 |
| 14th | dalietnguyen34 | VinUniversity | 53.03 | 68.6 |
| 15th | ABSTRACT | National Yang Ming Chiao Tung University | 47.84 | 65.0 |
| 16th | RobinLy | Ho Chi Minh City University of Science | 23.91 | 46.4 |
## Registration Process
Welcome to the competition! To ensure a fair and rigorous evaluation, all participants must adhere to the following registration and submission protocols.
To participate in the challenge and appear on the leaderboard, you must complete the following steps:
- Step 1: Complete the Google Form with your team details.
- Step 2: Once the form is submitted, you will receive an email with detailed instructions for registering and submitting to the leaderboards of both the Single Model Track and the Agent Track.
## Benchmark and Evaluation Protocol
### Benchmark
All submissions will be evaluated on an updated version of the MMAR benchmark, a 1,000-item dataset designed for deep audio reasoning across speech, sound, music, and mixed-modality scenarios. Each sample contains audio, a question, a ground-truth answer, and a newly annotated chain-of-thought (CoT) rationale.
### Submission Format
Participants must submit a JSONL file to the online leaderboard, where each line contains:
```json
{
  "id": "<sample_id>",
  "thinking_prediction": "<model_or_agent_generated_CoT>",
  "answer_prediction": "<final_prediction>"
}
```
The leaderboard will automatically compute all metrics and rank systems by the primary score.
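The submission file can be assembled with a few lines of Python. This is a minimal sketch; the sample ID and prediction strings below are placeholders, and in practice each record would come from your model or agent.

```python
import json

# Placeholder predictions; replace with your model's or agent's outputs.
predictions = [
    {
        "id": "sample_001",
        "thinking_prediction": "The recording features a bowed string timbre, so ...",
        "answer_prediction": "violin",
    },
]

# JSONL: one JSON object per line, newline-terminated.
with open("submission.jsonl", "w", encoding="utf-8") as f:
    for record in predictions:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
```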
### Evaluation Metrics
- Answer Correctness: If the `answer_prediction` is incorrect, the score is 0.
- Reasoning Quality: If the answer is correct, an LLM-as-a-judge evaluates the `thinking_prediction` on a scale of 0.2 to 1.0 (in 0.2 increments).
- Stability Mechanism: To account for variance, each submission is scored over 5 independent evaluation runs. The final score for each metric is the mean of the 3 middle runs, discarding the highest and lowest results.
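The scoring rules above can be sketched as follows. This is an illustrative reimplementation, not the official evaluation code; the function names are my own.

```python
def answer_score(answer_correct: bool, reasoning_grade: float) -> float:
    """Score for one sample: 0 if the answer is wrong; otherwise the
    LLM-judge reasoning grade in {0.2, 0.4, 0.6, 0.8, 1.0}."""
    return reasoning_grade if answer_correct else 0.0

def stabilized_score(run_scores: list[float]) -> float:
    """Stability mechanism: given 5 independent run scores, drop the
    highest and lowest, and average the 3 middle values."""
    assert len(run_scores) == 5, "exactly 5 evaluation runs expected"
    middle_three = sorted(run_scores)[1:4]
    return sum(middle_three) / 3
```

For example, runs scoring [0.5, 0.9, 0.7, 0.6, 0.8] yield a final score of 0.7.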
## Competition Timeline and Phases
The evaluation is split into two distinct phases. Please note the dataset sizes and submission windows:
| Phase | Start Date | End Date | Dataset Size | Submission Limit |
|---|---|---|---|---|
| Preliminary Stage | 2026-01-01 00:00:00 | 2026-01-29 23:59:59 | 500 Questions | 2 submissions / day |
| Final Stage | 2026-01-30 00:00:00 | 2026-02-01 23:59:59 | 1,000 Questions | 3 submissions total (over 3 days) |
Refresh Time: Submission quotas reset daily at 00:00 UTC+0.
## Submission Technical Constraints
To ensure system stability and fair judging, please adhere to the following character limits:
- Thinking Prediction: Each individual `thinking_prediction` must be under 10,000 characters. Note: for the Agent Track, your system is responsible for keeping the final `thinking_prediction` within this limit.
- Daily Limit: During the Preliminary Stage, you are allowed a maximum of 2 submissions per day.