- Web Chair: Douglas Yueming Lai
- Finance & Logistics Chair: Celine Yuqin Xie
We thank our support team for their dedication and behind-the-scenes work that made this workshop possible.
Date: Sun Dec 7th | Location: Upper Level Room 23ABC, San Diego Convention Center, San Diego, USA
The development of intelligent agents – particularly those powered by large language models (LLMs) – has emphasized the critical role of environments in shaping agent behavior and capabilities, especially for achieving end-to-end autonomy. Environments are not merely testing grounds; they are dynamic, interactive contexts that serve as the essential "data" for agents to learn adaptive behavior, complex reasoning, and long-term decision-making skills. Just as scaling the model size, dataset size, and training computation has led to emergent capabilities in LLMs, scaling the structure, fidelity, and diversity of environments is one of the crucial dimensions in advancing agent intelligence. Moreover, recent advances in end-to-end reinforcement learning (RL), particularly when paired with LLM-based agents, have made it increasingly viable to train agents through sustained interaction. These agents can now acquire skills, strategies, and planning abilities through environmental feedback, rather than relying solely on imitation learning or static prompt engineering. As we move toward more autonomous, general-purpose agents, the need for scalable, richly interactive, and diverse environments has become both urgent and foundational.
We invite contributions on the Topics of Interest, which are central to the theme of the workshop. The topic list is not exhaustive, however, and we welcome submissions in related areas.
We manage paper submissions through OpenReview. The review process is double‑blind, so submissions must be anonymized. We welcome work that is (1) original and unpublished, or (2) work‑in‑progress. By default, submissions will not have archival proceedings. However, if you would like your paper to be indexed, please inform us when submitting your paper (there is a checkbox for this).
Please use the NeurIPS 2025 LaTeX style file; it includes a preprint option for non‑anonymous preprints posted online (see additional formatting details here). Submissions should be PDFs of ≤ 9 pages (excluding references and appendices).
We will select outstanding papers for lightning talks. The award for best paper will be announced at the workshop.
| Paper Submission Deadline | |
| Notification of Acceptance | |
| Workshop at NeurIPS | Sun Dec 7th |
| Paper ID | Title | Authors |
|---|---|---|
| 10 | Thinking vs. Doing: Agents that Reason by Scaling Test-Time Interaction | Junhong Shen, Hao Bai, Lunjun Zhang, Yifei Zhou, Amrith Setlur, Shengbang Tong, Diego Caples, Nan Jiang, Tong Zhang, Ameet Talwalkar, Aviral Kumar |
| 14 | VideoAgent2: Enhancing the LLM-Based Agent System for Long-Form Video Understanding by Uncertainty-Aware CoT | Zhuo Zhi, Qiangqiang Wu, Minghe Shen, Wenbo Li, Yinchuan Li, Kun Shao, Kaiwen Zhou |
| 19 | Mobile-Agent-E: Self-Evolving Mobile Assistant for Complex Tasks | Zhenhailong Wang, Haiyang Xu, Junyang Wang, Xi Zhang, Ming Yan, Ji Zhang, Fei Huang, Heng Ji |
| 40 | YuLan-OneSim: Towards the Next Generation of Social Simulator with Large Language Models | Lei Wang, Heyang Gao, Xiaohe Bo, Xu Chen, Ji-Rong Wen |
| 74 | Shop-R1: Rewarding LLMs to Simulate Human Behavior in Online Shopping via Reinforcement Learning | Yimeng Zhang, Ziyi Wang, Yuxuan Lu, Simon Sinong Zhan, Dakuo Wang |
| 95 | What Limits Virtual Agent Application? OmniBench: A Scalable Multi-Dimensional Benchmark for Essential Virtual Agent Capabilities | Wendong Bu, Yang Wu, Qifan Yu, Minghe Gao, Bingchen Miao, Zhenkui Zhang, Kaihang Pan, Yunfei Li, Mengze Li, Wei Ji, Juncheng Li, Siliang Tang, Yueting Zhuang |
| 97 | MAS-ZERO: Designing Multi-Agent Systems with Zero Supervision | Zixuan Ke, Austin Xu, Yifei Ming, Xuan-Phi Nguyen, Caiming Xiong, Shafiq Joty |
| 126 | GEM: A Gym for Agentic LLMs | Zichen Liu, Anya Sims, Keyu Duan, Changyu Chen, Haotian Xu, Simon Yu, Chenmien Tan, Shaopan Xiong, Weixun Wang, Bo Liu, Hao Zhu, Weiyan Shi, Diyi Yang, Wee Sun Lee, Min Lin |
| 132 | RPGBENCH: Evaluating Large Language Models as Role-Playing Game Engines | Pengfei Yu, Dongming Shen, Silin Meng, Jaewon Lee, Weisu Yin, Andrea Yaoyun Cui, Zhenlin Xu, Yi Zhu, Xingjian Shi, Mu Li, Alex Smola |
| 138 | Shaping Smart Personal Assistants through Generative Interactive Environments for Scalable Design and Evaluation | Ziyi Xuan, Yiwen Wu, Vinod Namboodiri, Yu Yang |
| Paper ID | Title | Authors |
|---|---|---|
| 3 | Two Heads are Better Than One: Test-time Scaling of Multi-agent Collaborative Reasoning | Can Jin, Hongwu Peng, Qixin Zhang, Yujin Tang, Tong Che, Dimitris N. Metaxas |
| 4 | Exploring Personality Trait Change of LLM-Based AI Systems | Yuhan Ma, Junjie Wang |
| 5 | All Life is Problem Creation: Learning to Generate Environments that Maximize Performance Gain | Titas Anciukevičius, Yuhui Wang, Piotr Piękos, Li Nanbo, Wenyi Wang, Jürgen Schmidhuber |
| 7 | UserBench: An Interactive Gym Environment for User-Centric Agents | Cheng Qian, Zuxin Liu, Akshara Prabhakar, Zhiwei Liu, Jianguo Zhang, Haolin Chen, Heng Ji, Weiran Yao, Shelby Heinecke, Silvio Savarese, Huan Wang |
| 8 | Beyond Fixed Tasks: Unsupervised Environment Design for Task-Level Pairs | Daniel Furelos-Blanco, Charles Pert, Frederik Kelbel, Alexander F Spies, Alessandra Russo, Michael D Dennis |
| 11 | RealWebAssist: A Benchmark for Long-Horizon Web Assistance with Real-World Users | Suyu Ye, Haojun Shi, Darren Shih, Hyokun Yun, Tanya G. Roosta, Tianmin Shu |
| 12 | Licence to Scale: A Microservice Simulation Environment for Benchmarking Agentic AI | Christopher Lohse, Adrian Selk, Amadou Ba, Jonas Wahl, Marco Ruffini |
| 13 | GLEE: A Unified Framework and Benchmark for Language-based Economic Environments | Eilam Shapira, Omer Madmon, Itamar Reinman, Samuel Joseph Amouyal, Roi Reichart, Moshe Tennenholtz |
| 18 | PrivacyMAS: A Privacy-Preserving Multi-Agent System Framework | Maryam Fatima |
| 20 | Communicating Plans, Not Percepts: Scalable Multi-Agent Coordination with Embodied World Models | Brennen Hill, Mant Koh En Wei, Jishnuanandh Thangavel |
| 21 | Mobile-Agent-V: A Video-Guided Approach for Effortless and Efficient Operational Knowledge Injection in Mobile Automation | Junyang Wang, Haiyang Xu, Xi Zhang, Ming Yan, Ji Zhang, Fei Huang, Jitao Sang |
| 22 | What Limits Agentic Systems Efficiency? | Song Bian, Minghao Yan, Anand Jayarajan, Gennady Pekhimenko, Shivaram Venkataraman |
| 26 | Co-Evolving Complexity: An Adversarial Framework for Automatic MARL Curricula | Brennen Hill |
| 28 | When Developer Aid Becomes Security Debt: A Systematic Analysis of Insecure Behaviors in LLM Coding Agents | Matous Kozak, Roshanak Zilouchian Moghaddam, Kalpathy Sivaraman |
| 32 | The Physical Basis of Prediction: World Model Formation in Neural Organoids via an LLM-Generated Curriculum | Brennen Hill |
| 33 | DEBATE: A Large-Scale Benchmark for Role-Playing LLM Agents in Multi-Agent, Long-Form Debates | Yun-Shiuan Chuang, Ruixuan Tu, Chengtao Dai, Smit Vasani, Binwei Yao, Michael Henry Tessler, Sijia Yang, Dhavan V. Shah, Robert D. Hawkins, Junjie Hu, Timothy T. Rogers |
| 34 | MAPGD: Multi-Agent Prompt Gradient Descent for Collaborative Prompt Optimization | Yichen Han, Bojun Liu, Zhengpeng zhou, Guanyu Liu, Zeng Zhang, Yang Yang, Wenli Wang, Isaac N Shi, Yunyan, Lewei He, TIANYU SHI |
| 36 | BrowseMaster: Towards Scalable Web Browsing via Tool-Augmented Programmatic Agent Pair | Xianghe Pang, Shuo Tang, Rui Ye, Yuwen Du, Yaxin Du, Siheng Chen |
| 37 | TutorTest: Evaluating Language Model-based Tutoring Policies Using Surrogate Tasks | Aishwarya Mandyam |
| 43 | You Don't Know Until You Click: Automated GUI Testing for Production-Ready Software Evaluation | Yutong Bian, Xianhao Lin, Yupeng Xie, Tianyang Liu, Mingchen Zhuge, Siyuan Lu, Haoming Tang, Jinlin Wang, Jiayi Zhang, Jiaqi Chen, Xiangru Tang, Yongxin Ni, Sirui Hong, Chenglin Wu |
| 44 | Paper2Video: Automatic Video Generation from Scientific Papers | Zeyu Zhu, Kevin Qinghong Lin, Mike Zheng Shou |
| 45 | On the Importance of Task Complexity in Evaluating LLM-Based Multi-Agent Systems | Bohan Tang, Huidong Liang, Keyue Jiang, Xiaowen Dong |
| 46 | SimuGen: Multi-modal Agentic Framework for Constructing Block Diagram-Based Simulation Models | Xinxing Ren, Qianbo Zang, Zekun Guo |
| 48 | SEDM: Scalable Self-Evolving Distributed Memory for Agents | Haoran Xu, Jiacong Hu, ZHANG Ke, Lei Yu, Yuxin Tang, Xinyuan Song, Yiqun Duan, Lynn Ai, TIANYU SHI |
| 49 | Similar: A Step-Wise, Multi-Dimensional Reward Model for Virtual Agent Learning and Reasoning | Bingchen Miao, Yang Wu, Minghe Gao, Qifan Yu, Wendong Bu, Wenqiao Zhang, Yunfei Li, Siliang Tang, Tat-Seng Chua, Juncheng Li |
| 50 | BoxingGym: Benchmarking Progress in Automated Experimental Design and Model Discovery | Kanishk Gandhi, Michael Y. Li, Lyle Goodyear, Agam Bhatia, Ying Li, Aditi Bhaskar, Mohammed Zaman, Noah Goodman |
| 53 | GR-Agent: Adaptive Graph Reasoning Agent under Incomplete Knowledge | Dongzhuoran Zhou, Yuqicheng Zhu, Xiaxia Wang, Hongkuan Zhou, Jiaoyan Chen, Steffen Staab, Yuan He, Evgeny Kharlamov |
| 54 | A Multi-agent Reasoning Framework for Video Question Answering | Abhi Kamboj, Gaurav Kumar, Krista Holden, Madhumitha Saravanan, Pradyumna Narayana |
| 59 | LLM Economist: Large Population Models and Mechanism Design in Multi-Agent Generative Simulacra | Seth Karten, Wenzhe Li, Zihan Ding, Samuel Kleiner, Yu Bai, Chi Jin |
| 60 | Ludax: A GPU-Accelerated Domain Specific Language for Board Games | Graham Todd, Alexander George Padula, Dennis J. N. J. Soemers, Julian Togelius |
| 61 | Survival of the Useful: Evolutionary Boids as a Sandbox for Agent Societies | Xisen Wang, Qi Zhang |
| 62 | EVOLVE-MEM: A Self-Adaptive Hierarchical Memory Architecture for Next-Generation Agentic AI Systems | Rishi Ashish Shah, Ujjwal Kakar, Shashvat Singhal, Dinesh K Vishwakarma |
| 67 | ReMAC: Large Language Model-Driven Reward Design for Multi-Agent Manipulation Collaboration | Pengyi Li, Hongyao Tang, Yifu Yuan, Jianye HAO |
| 68 | Revisiting Uncertainty Estimation and Calibration of Large Language Models | Linwei Tao, Yi-Fan Yeh, Minjing Dong, Tao Huang, Jialin Yu, Philip Torr, Chang Xu |
| 70 | Vision-Language Models Unlock Task-Centric Latent Actions | Alexander Nikulin, Ilya Zisman, Albina Klepach, Denis Tarasov, Alexander Derevyagin, Andrei Polubarov, Lyubaykin Nikita, Vladislav Kurenkov |
| 71 | AgentyxCrypt: Advancing Privacy and (Secure) Computation in AI Agent Collaboration | Harish Karthikeyan, Yue Guo, Udari Madhushani Sehwag, Leo de Castro, Antigoni Polychroniadou, Leo Ardon, Sumitra Ganesh |
| 77 | Protein Design with Agent Rosetta: A Case Study for Specialized Scientific Agents | Jacopo Teneggi, Tanya Marwah, Alberto Bietti, P. Douglas Renfrew, Vikram Khipple Mulligan, Siavash Golkar |
| 79 | Zephyrus: An Agentic Framework for Weather Science | Sumanth Varambally, Marshall Fisher, Jas Thakker, Yiwei Chen, Zhirui Xia, Ruijia Niu, Yasaman Jafari, Veeramakali Vignesh Manivannan, Zachary Novack, Luyu Han, Srikar Eranky, Salva Rühling Cachay, Taylor Berg-Kirkpatrick, Duncan Watson-Parris, Yian Ma, Rose Yu |
| 80 | Traxgen: Ground-Truth Trajectory Generation for AI Agent Evaluation | Maria Emilia Mazzolenis, Ruirui Zhang |
| 82 | LLMs as Scalable, General-Purpose Simulators For Evolving Digital Agent Training | Yiming Wang, Da Yin, Yuedong Cui, Zhiqian Li, Ruichen Zheng, Zongyu Lin, Di Wu, Xueqing Wu, Chenchen Ye, Yu Zhou, Kai-Wei Chang |
| 83 | DefenderBench: A Toolkit for Evaluating Language Agents in Cybersecurity Environments | Chiyu Zhang, Marc-Alexandre Côté, Michael Albada, Anush Sankaran, Jack W Stokes, Tong Wang, Amir H. Abdi, William Blum, Muhammad Abdul-Mageed |
| 84 | ChinaTravel: An Open-Ended Benchmark for Language Agents in Chinese Travel Planning | Jie-Jing Shao, Bo-Wen Zhang, Xiao-Wen Yang, Baizhi Chen, Siyu Han, Wen-Da Wei, Guohao Cai, Zhenhua Dong, Lan-Zhe Guo, Yu-Feng Li |
| 86 | Characterizing Deep Research: A Benchmark and Formal Definition | Abhinav Java, Ashmit Khandelwal, Sukruta Prakash Midigeshi, Aaron Halfaker, Amit Deshpande, Navin Goyal, Ankur Gupta, Nagarajan Natarajan, Amit Sharma |
| 88 | Examining the Vulnerability of Multi-Agent Medical Systems to Human Interventions for Clinical Reasoning | Benjamin Liu, Dillon Mehta, Rishi Malhotra, Adam Zobian, Yong Ying Tan, Samir Chopra, Daniella Rand, Natalie Pang, Abhiram Gudimella, Raghav Thallapragada, Derek Jiu, Prisha Shah, Kevin Zhu |
| 89 | IndusGCC: A Data Benchmark and Evaluation Framework for GUI-Based General Computer Control in Industrial Automation | Xiaoran Yang, Yuyang Du, Kexin Chen, Soung Chang Liew, Jiamin Lu, Ziyu Guo, Xiaoyan Liu, Qun Yang, Shiqi XU, Xingyu Fan, Yuchen Pan, Taoyong Cui, Hongyu Deng, Boris Düdder, Jianzhang Pan, Qun Fang, Pheng-Ann Heng |
| 90 | Faithful Simulation of User–Agent–Environment Interactions for Scalable LLM Agent Evaluation | Aleksei Kudrinskii, Saibo Geng, Luca Beurer-Kellner, Marc Fischer |
| 91 | MCP-Universe: Benchmarking Large Language Models with Real-World Model Context Protocol Servers | Ziyang Luo, Zhiqi Shen, Wenzhuo Yang, Zirui Zhao, Prathyusha Jwalapuram, Amrita Saha, Doyen Sahoo, Silvio Savarese, Caiming Xiong, Junnan Li |
| 94 | Code2MCP: Transforming Code Repositories into MCP Services | Chaoqian Ouyang, Ling Yue, Shimin Di, Libin Zheng, Shaowu Pan, Min-Ling Zhang |
| 98 | WebArena Verified: Reliable Evaluation for Web Agents | Amine El hattami, Megh Thakkar, Nicolas Chapados, Christopher Pal |
| 99 | See, Think, Act: Online Shopper Behavior Simulation with VLM Agents | Yimeng Zhang, Ziyi Wang, Yuxuan Lu, Simon Sinong Zhan, Jing Huang, Dakuo Wang |
| 101 | Agent Context Protocols Enhance Collective Inference | Arjun Beniwal, Devansh Bhardwaj, Shreyas Chaudhari, Ashwin Kalyan, Tanmay Rajpurohit, Karthik R Narasimhan, Ameet Deshpande, Vishvak Murahari |
| 102 | Towards Agents That Know When They Don't Know: Uncertainty as a Control Signal for Structured Reasoning | Josefa Lia Stoisser, Marc Boubnovski Martell, Lawrence Phillips, Gianluca Mazzoni, Lea Mørch Harder, Philip Torr, Jesper Ferkinghoff-Borg, Kaspar Märtens, Julien Fauqueur |
| 103 | Enabling User-Created Multi-Agent Simulations: Interactive and Customizable 2D Environments to Study Team Dynamics with LLM Agents | Mohammed Almutairi, Charles Chiang, Haoze Guo, Nandini Banerjee, Maria Milkowski, Daniel Nguyen, Michael G Yankoski, Tim Weninger, Svitlana Volkova, Trenton W. Ford, Diego Gomez-Zara |
| 105 | The Influence of Scaffolds on Coordination Scaling Laws in LLM Agents | Mariana Meireles, Rupali Bhati, Niklas Lauffer, Cameron Allen |
| 108 | Go-Browse: Training Web Agents with Structured Exploration | Apurva Gandhi, Graham Neubig |
| 110 | VendiRL: A Framework for Self-Supervised Reinforcement Learning of Diversely Diverse Skills | Erik M. Lintunen |
| 112 | OPeRA: A Dataset of Observation, Persona, Rationale, and Action for Evaluating LLMs on Human Online Shopping Behavior Simulation | Ziyi Wang, Yuxuan Lu, Wenbo Li, Amirali Amini, Bo Sun, Yakov Bart, Weimin Lyu, Jiri Gesi, Tian Wang, Jing Huang, Yu Su, Upol Ehsan, Malihe Alikhani, Toby Jia-Jun Li, Lydia Chilton, Dakuo Wang |
| 113 | GroundedPRM: Tree-Guided and Fidelity-Aware Process Reward Modeling for Step-Level Reasoning | Yao Zhang, Yu Wu, Haowei Zhang, Weiguo Li, Haokun Chen, Guohao Li, Zhen Han, Volker Tresp |
| 114 | When Agents go Astray: Course-Correcting SWE Agents with PRMs | Shubham Gandhi, Jason Tsay, Jatin Ganhotra, Kiran Kate, Yara Rizk |
| 119 | Natural Language Grounded Reinforcement Learning for Clinical Decision-Making in Virtual Patient Simulations | Niyel Hassan, Benjamin Liu, Jason Tsai, Jeffrey K Jopling, Dana Lin, Edward Melcer, Cara Liebert |
| 120 | Scaling Environments for LLM Agents in the Era of Learning from Interaction: A Survey | Yuchen Huang, Sijia Li, Minghao LIU, Wei Liu, Zhiyuan Fan, Yi R. Fung |
| 121 | RAISE: Reliable Agent Improvement via Simulated Experience | Sahar Omidi Shayegan, Joshua Meyer, Victor Shih, Sebastian Sosa, Tianyi Peng, Kostis Kaffes, Eugene Wu, Andi Partovi, Mehdi Jamei |
| 122 | Learning to Make Friends: Coaching LLM Agents toward Emergent Social Ties | Philipp J. Schneider, LIN TIAN, Marian-Andrei Rizoiu |
| 123 | CoLLAB: A Framework for Designing Scalable Benchmarks for Agentic LLMs | Saaduddin Mahmud, Eugene Bagdasarian, Shlomo Zilberstein |
| 125 | Model Context Protocol for Vision Agents: Schema, Memory, and World Model Implications | Aditi Tiwari, Akshit Bhalla |
| 130 | MEMTRACK: Evaluating Long-Term Memory and State Tracking in Multi-Platform Dynamic Agent Environments | Darshan Girish Deshpande, Varun Prashant Gangal, Hersh Mehta, Jędrzej Rosłaniec, Anand Kannappan, Rebecca Qian, Peng Wang |
| 131 | Enabling multi-agent collaboration in knowledge graph environments | Iñaki Arango, Ayush Noori, Lucas Vittor, Joaquin Polonuer, Marinka Zitnik |
| 135 | AgentSynth: Scalable Task Generation for Generalist Computer-Use Agents | Jingxu Xie, Dylan Xu, Xuandong Zhao, Dawn Song |
| 136 | MIRAI: Evaluating LLM Agents for International Event Forecasting | Chenchen Ye, Ziniu Hu, Yihe Deng, Zijie Huang, Mingyu Derek Ma, Yanqiao Zhu, Wei Wang |
| 137 | PuzzleJAX: A Benchmark for Reasoning and Learning | Sam Earle, Graham Todd, Yuchen Li, Ahmed Khalifa, Zehua Jiang, Muhammad Umair Nasir, Andrzej Banburski-Fahey, Julian Togelius |
| 141 | Agentic Persona Control and Task State Tracking for Realistic User Simulation in Interactive Scenarios | Hareeshwar Karthikeyan |
| 146 | Are LLMs Generalist Hanabi Agents? | Mahesh Ramesh, Aswinkumar Ramkumar, Pavan Thodima, Kaousheik Jayakumar, Aniket Rege |
| 147 | MCP-Bench: Benchmarking Tool-Using LLM Agents with Complex Real-World Tasks via MCP Servers | Zhenting Wang, Qi Chang, Hemani Patel, Shashank Biju, Cheng-En Wu, Quan Liu, Aolin Ding, Alireza Rezazadeh, Ankit Shah, Yujia Bao, Eugene Siow |
| 150 | SpatialThinker: Reinforcing 3D Reasoning in Multimodal LLMs via Spatial Rewards | Hunar Batra, Haoqin Tu, Hardy Chen, Yuanze Lin, Cihang Xie, Ronald Clark |
| 152 | UpSkill: Mutual Information Skill Learning for Structured Response Diversity in LLMs | Devan Shah, Owen Yang, Daniel Yang, Chongyi Zheng, Benjamin Eysenbach |
| 153 | Steering Diffusion Policies with Value-Guided Denoising | Hanming Ye |
| 154 | CUBE: Collaborative Multi-Agent Block-Pushing Environment for Collective Planning with LLM Agents | Hanqing Yang, Narjes Nourzad, Shiyu Chen, Carlee Joe-Wong |
| 156 | Player-Coach Teamwork: Multi-agent Collaboration for Improving LLM Reasoning | Heewon Park, Minhae Kwon |
| 163 | Automated Specialization of Stateful Agent Systems | Myan Vu, Harrish Ayyanar, PANG JIANG, Anwiketh Reddy, Mayank Goel, Kevin Zhu |
| 164 | Scaling Open-Ended Reasoning to Predict the Future | Nikhil Chandak, Shashwat Goel, Ameya Prabhu, Moritz Hardt, Jonas Geiping |
| 165 | Verifiable Chemical Reasoning through Tool-Calling Agentic Workflow | Gabrielle Gaudeau, Shinnosuke Tanaka, Defne Circi, Ian W Kennedy, Movina Moses, Mohab Elkaref |
| 166 | Fathom-Search-4B: Scaling DeepSearch Reasoning Capabilities via RL | Shreyas Singh, Kunal Singh, Pradeep Moturi |
| 167 | SEA: Stateful Execution Environment for Conversational Big Data Analytics | Rohit Kumar, Ajay Anil Kumar |