VisualToolAgent (VisTA): A Reinforcement Learning Framework for Visual Tool Selection

Overview

VisTA is a reinforcement learning framework that trains visual AI agents to explore, select, and combine tools from a diverse library based on performance feedback. Existing approaches either rely on fixed prompting or require extensive fine-tuning and human supervision. VisTA instead learns tool-selection strategies through trial and error, using task outcomes as the reward signal. It trains the agent with Group Relative Policy Optimization (GRPO), so no explicit reasoning supervision is needed. Experiments show that VisTA outperforms traditional baselines, especially on unfamiliar tasks, making it a flexible and adaptive solution for visual reasoning.
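To make the role of GRPO concrete, here is a minimal sketch of the group-relative advantage at its core: several tool selections are sampled for the same query, each is scored by the task outcome, and each reward is normalized against the group's mean and standard deviation. The tool names, the binary reward, and the function name are illustrative assumptions, not the paper's actual setup, and the full GRPO objective (clipped policy ratio, KL term) is omitted.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the group it was sampled with."""
    mu = mean(rewards)
    # Population standard deviation; some implementations use the sample one.
    sigma = pstdev(rewards)
    # eps keeps the division finite when every reward in the group is equal.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical group: four tool selections sampled for one query, rewarded
# 1.0 when the reasoner's final answer was correct and 0.0 otherwise.
sampled_tools = [
    ("chart_to_table",),
    ("ocr",),
    ("captioner",),
    ("chart_to_table", "ocr"),
]
rewards = [1.0, 0.0, 0.0, 1.0]

advantages = group_relative_advantages(rewards)
for tools, advantage in zip(sampled_tools, advantages):
    print(tools, round(advantage, 3))
```

Selections that led to a correct answer receive a positive advantage and the rest a negative one, so the policy update favors tool combinations that did better than their peers on the same query. When every sample in a group earns the same reward, all advantages are zero and that group contributes no learning signal.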

The image above highlights key capabilities of our approach. As shown on the left, our method learns to combine multiple task-relevant tools without requiring any human demonstrations. Because the framework is independent of the underlying reasoner model, the trained agent (policy) can be plugged in to guide tool selection for any reasoner.
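The separation between the tool-selection policy and the reasoner can be sketched as below. This is only an illustration of the decoupling: every name here (`answer_with_tools`, the stub tools, the stub policy, and the two stub reasoners) is hypothetical, and real tools and reasoners would be models rather than lambdas.

```python
from typing import Callable, Dict, List

# Hypothetical interfaces, assumed for illustration only.
Tool = Callable[[str], str]                           # image path -> text output
Policy = Callable[[str, str, List[str]], List[str]]   # -> names of chosen tools
Reasoner = Callable[[str, str, List[str]], str]       # -> final answer


def answer_with_tools(policy: Policy, tools: Dict[str, Tool],
                      reasoner: Reasoner, image: str, question: str) -> str:
    """Let the policy pick tools, run them, and pass the outputs to any reasoner."""
    chosen = policy(image, question, sorted(tools))
    evidence = [f"{name}: {tools[name](image)}" for name in chosen]
    return reasoner(image, question, evidence)


# Stub tools standing in for real visual models.
tools: Dict[str, Tool] = {
    "chart_to_table": lambda image: "year,sales 2023,10 2024,15",
    "captioner": lambda image: "a bar chart",
}


def stub_policy(image: str, question: str, tool_names: List[str]) -> List[str]:
    # A trained policy would be a model; this stub just keys on the file name.
    return ["chart_to_table"] if "chart" in image else ["captioner"]


def reasoner_a(image: str, question: str, evidence: List[str]) -> str:
    return f"A saw {len(evidence)} tool output(s)"


def reasoner_b(image: str, question: str, evidence: List[str]) -> str:
    return "B: " + evidence[0].split(":")[0]


# The same policy drives two different reasoners without modification.
for reasoner in (reasoner_a, reasoner_b):
    print(answer_with_tools(stub_policy, tools, reasoner, "chart.png", "Which year is higher?"))
```

Because the policy only returns tool names, swapping the reasoner requires no change to the policy itself, which is the property the paragraph above describes.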

🏅 VisTA ranks among top models on ChartQA, Geometry3K, and MathVerse

Strong performance across structured chart reasoning, geometric QA, and math-based diagram understanding

🤖 Agent uses the right tools during inference

Our agent dynamically selects specialized reasoning modules for each visual task, which supports strong performance on both geometric diagrams and structured charts.

BibTeX

@misc{huang2025visualtoolagentvistareinforcementlearning,
      title={VisualToolAgent (VisTA): A Reinforcement Learning Framework for Visual Tool Selection}, 
      author={Zeyi Huang and Yuyang Ji and Anirudh Sundara Rajan and Zefan Cai and Wen Xiao and Junjie Hu and Yong Jae Lee},
      year={2025},
      eprint={2505.20289},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2505.20289}, 
}