Leaderboard
Welcome to the competition! To ensure a fair and rigorous evaluation, all participants must adhere to the following registration and submission protocols.
Registration Process
To participate in the challenge and appear on the leaderboard, you must complete the following steps:
- Step 1: Complete the Google Form with your team details.
- Step 2: Once submitted, you will receive an email with detailed instructions for the Leaderboard (Single Model Track) and (Agent Track) registration and submission.
Benchmark and Evaluation Protocol
Benchmark
All submissions will be evaluated on the updated version of MMAR benchmark, a 1,000-item dataset designed for deep audio reasoning across speech, sound, music, and mixed-modality scenarios. Each sample contains audio, a question, a ground-truth answer, and a newly annotated CoT rationale.
Submission Format
Participants must submit a JSONL file to the online leaderboard, where each line contains:
{
"id": "<sample_id>",
"thinking_prediction": "<model_or_agent_generated_CoT>",
"answer_prediction": "<final_prediction>"
}
The leaderboard will automatically compute all metrics and rank systems by the primary score.
Evaluation Metrics
- Answer Correctness: If the
answer_predictionis incorrect, the score is 0. - Reasoning Quality: If the answer is correct, an LLM-as-a-judge evaluates the
thinking_predictionon a scale of 0.2 to 1.0 (in 0.2 increments). - Stability Mechanism: To account for variance, each submission is calculated based on 5 independent evaluation runs. The final score for each metric will be the mean of the 3 middle runs, effectively discarding the highest and lowest results.
Competition Timeline and Phases
The evaluation is split into two distinct phases. Please note the dataset size and dates:
| Phase | Start Dates | End Dates | Dataset Size | Submission Limit |
|---|---|---|---|---|
| Preliminary Stage | 2026-01-01 00:00:00 | 2026-01-29 23:59:59 | 500 Questions | 2 submissions / day |
| Final Stage | 2026-01-30 00:00:00 | 2026-02-01 23:59:59 | 1,000 Questions | 3 submissions total (over 3 days) |
Refresh Time: Submission quotas reset daily at 00:00 UTC+0.
Submission Technical Constraints
To ensure system stability and fair judging, please adhere to the following character limits:
- Thinking Prediction: Each individual
thinking_predictionmust < 10,000 characters.Note: For the Agent Track, your system should be able to manage the final
thinking_prediction. - Daily Limit: During the Preliminary Stage, you are allowed a maximum of 2 submissions per day.