Track 2: Agent Track
Participants design an audio reasoning agent that may orchestrate multiple open-source models and tools (e.g., ASR, source separation, beat/onset tracking, captioners, planners) to produce a chain-of-thought (CoT) path and a final answer. This track evaluates system-level reasoning: planning, tool selection, and self-checking under the same strict correctness criterion. The emphasis is on transparent trajectories that reveal how intermediate audio analyses contribute to decisions, moving beyond answer-only pipelines. Agents can be implemented with explicit planning (plan–execute–reflect), structured memory for intermediate artifacts (e.g., transcriptions, stems, captions), and self-verification or cross-checking. We encourage modular designs with well-documented interfaces to facilitate ablations and community reuse.
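As a rough illustration only, the sketch below shows one way a plan–execute–reflect loop with structured memory could be organized in Python. The tool functions (`transcribe`, `track_beats`), the keyword-based planner, and the verifier are hypothetical placeholders, not a prescribed interface; a real submission would wrap actual open-source models behind such a registry.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical tool stubs; a real agent would wrap open-source models here
# (e.g., an ASR model behind transcribe, a beat tracker behind track_beats).
def transcribe(audio_path: str) -> str:
    return "placeholder transcription"

def track_beats(audio_path: str) -> List[float]:
    return [0.5, 1.0, 1.5]

TOOLS: Dict[str, Callable[[str], object]] = {
    "transcribe": transcribe,
    "track_beats": track_beats,
}

@dataclass
class Memory:
    """Structured memory for intermediate artifacts and the execution trajectory."""
    artifacts: Dict[str, object] = field(default_factory=dict)
    trajectory: List[dict] = field(default_factory=list)

def plan(question: str) -> List[str]:
    """Toy planner: select tools by keyword. A real agent would use an LLM planner."""
    steps = []
    if "say" in question or "lyric" in question:
        steps.append("transcribe")
    if "tempo" in question or "beat" in question:
        steps.append("track_beats")
    return steps or ["transcribe"]

def reflect(memory: Memory, question: str) -> str:
    """Toy self-check: confirm every planned step produced an artifact before answering."""
    if any(step["output"] is None for step in memory.trajectory):
        return "UNCERTAIN: missing intermediate results"
    return f"Answer derived from {list(memory.artifacts)} for: {question}"

def run_agent(audio_path: str, question: str) -> str:
    memory = Memory()
    for tool_name in plan(question):            # plan
        output = TOOLS[tool_name](audio_path)   # execute
        memory.artifacts[tool_name] = output
        memory.trajectory.append(
            {"tool": tool_name, "input": audio_path, "output": output}
        )
    return reflect(memory, question)            # reflect / self-verify

if __name__ == "__main__":
    print(run_agent("example.wav", "What is the tempo of this track?"))
```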
Rules and Restrictions:
- Open-source models/tools only. All components (including music/audio tools) must be publicly available under research-permissive licenses; no commercial or hosted APIs.
- Full system + trajectory submission. Finalists must submit runnable code, environment specs, and complete execution logs (tool calls, prompts, intermediate outputs) for audit; one possible log layout is sketched after this list.
- No human-in-the-loop. Inference-time human assistance or manual curation is strictly prohibited.
- Output length limit. The total output generated by the system (including all final and intermediate textual outputs) must not exceed 10,000 words per sample.
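For the trajectory audit and the per-sample word budget, the snippet below sketches one possible (non-prescribed) log-entry layout and a simple length check. The field names and the word-counting convention are illustrative assumptions; follow the organizers' submission instructions for the required schema.

```python
import json

WORD_LIMIT = 10_000  # per-sample cap on all final and intermediate textual output

def count_words(trajectory: list[dict], final_answer: str) -> int:
    """Count words across every textual field emitted for one sample."""
    texts = [final_answer]
    for step in trajectory:
        for key in ("prompt", "output"):
            value = step.get(key)
            if isinstance(value, str):
                texts.append(value)
    return sum(len(t.split()) for t in texts)

# Illustrative log entries; the required submission schema may differ.
trajectory = [
    {"step": 1, "tool": "transcribe", "prompt": "Transcribe the vocals.",
     "output": "placeholder transcription"},
    {"step": 2, "tool": "track_beats", "prompt": "Estimate the tempo.",
     "output": "120 BPM (from median inter-beat interval)"},
]
final_answer = "The track is at approximately 120 BPM."

assert count_words(trajectory, final_answer) <= WORD_LIMIT
print(json.dumps({"trajectory": trajectory, "final_answer": final_answer}, indent=2))
```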
Baseline: Please refer to our baselines repository.