Scaling Environments for Agents (SEA)

NeurIPS 2025 Workshop

Date: Sun Dec 7th | Location: Upper Level Room 23ABC, San Diego Convention Center, San Diego, USA

The development of intelligent agents – particularly those powered by large language models (LLMs) – has emphasized the critical role of environments in shaping agent behavior and capabilities, especially for achieving end-to-end autonomy. Environments are not merely testing grounds; they are dynamic, interactive contexts that serve as the essential "data" for agents to learn adaptive behavior, complex reasoning, and long-term decision-making skills. Just as scaling the model size, dataset size, and training computation has led to emergent capabilities in LLMs, scaling the structure, fidelity, and diversity of environments is one of the crucial dimensions in advancing agent intelligence. Moreover, recent advances in end-to-end reinforcement learning (RL), particularly when paired with LLM-based agents, have made it increasingly viable to train agents through sustained interaction. These agents can now acquire skills, strategies, and planning abilities through environmental feedback, rather than relying solely on imitation learning or static prompt engineering. As we move toward more autonomous, general-purpose agents, the need for scalable, richly interactive, and diverse environments has become both urgent and foundational.

Topics of Interest
1. Environment Infrastructure Design: Task formulation, action-space design, environment generation, compositionality, and agent integration.
2. Benchmarks and Evaluation: Multi-step interaction metrics, generalization testing, open-ended benchmarks, curriculum scaling, and human-in-the-loop assessments.
3. LLMs in Interactive Environments: Reinforcement learning, policy learning, reward modeling, hybrid training, and fine-tuning through interaction.
4. Tool-Use and Software Environments: Workforce, Agents as programmers, API orchestration, tool design, software manipulation, and web navigation.
5. Multi-Agent Systems and Simulation Environments: Scaling agent populations, emergent behaviors, communication, coordination, competition, and role dynamics.
6. Embodiment and Grounding: Perception-action loops, physical simulation, spatial reasoning, robotics integration, and simulation-to-physical grounding.
7. Sim2Real and Deployment: Domain adaptation, real-world API integration, robustness under scale, safety, and large-scale deployment.

Call for Papers

Submission Tracks

We invite contributions on the Topics of Interest listed above, which are central to the theme of the workshop. The list is not exhaustive, and we welcome submissions in related areas.


Submission Guidelines

We manage paper submissions through OpenReview. The review process is double‑blind, so submissions must be anonymized. We welcome both (1) original, unpublished work and (2) work in progress. By default, submissions are non-archival. However, if you would like your paper to be indexed, please tell us when you submit (there is a checkbox for this).

Please use the NeurIPS 2025 LaTeX style file; it includes a preprint option for non‑anonymous preprints posted online (see additional formatting details here). Submissions should be PDFs of ≤ 9 pages (excluding references and appendices).

We will select outstanding papers for lightning talks. The award for best paper will be announced at the workshop.


Important Dates

Paper Submission Deadline
Notification of Acceptance
Workshop at NeurIPS

Accepted Papers

Oral Presentations

Paper ID | Title | Authors
10 | Thinking vs. Doing: Agents that Reason by Scaling Test-Time Interaction | Junhong Shen, Hao Bai, Lunjun Zhang, Yifei Zhou, Amrith Setlur, Shengbang Tong, Diego Caples, Nan Jiang, Tong Zhang, Ameet Talwalkar, Aviral Kumar
14 | VideoAgent2: Enhancing the LLM-Based Agent System for Long-Form Video Understanding by Uncertainty-Aware CoT | Zhuo Zhi, Qiangqiang Wu, Minghe Shen, Wenbo Li, Yinchuan Li, Kun Shao, Kaiwen Zhou
19 | Mobile-Agent-E: Self-Evolving Mobile Assistant for Complex Tasks | Zhenhailong Wang, Haiyang Xu, Junyang Wang, Xi Zhang, Ming Yan, Ji Zhang, Fei Huang, Heng Ji
40 | YuLan-OneSim: Towards the Next Generation of Social Simulator with Large Language Models | Lei Wang, Heyang Gao, Xiaohe Bo, Xu Chen, Ji-Rong Wen
74 | Shop-R1: Rewarding LLMs to Simulate Human Behavior in Online Shopping via Reinforcement Learning | Yimeng Zhang, Ziyi Wang, Yuxuan Lu, Simon Sinong Zhan, Dakuo Wang
95 | What Limits Virtual Agent Application? OmniBench: A Scalable Multi-Dimensional Benchmark for Essential Virtual Agent Capabilities | Wendong Bu, Yang Wu, Qifan Yu, Minghe Gao, Bingchen Miao, Zhenkui Zhang, Kaihang Pan, Yunfei Li, Mengze Li, Wei Ji, Juncheng Li, Siliang Tang, Yueting Zhuang
97 | MAS-ZERO: Designing Multi-Agent Systems with Zero Supervision | Zixuan Ke, Austin Xu, Yifei Ming, Xuan-Phi Nguyen, Caiming Xiong, Shafiq Joty
126 | GEM: A Gym for Agentic LLMs | Zichen Liu, Anya Sims, Keyu Duan, Changyu Chen, Haotian Xu, Simon Yu, Chenmien Tan, Shaopan Xiong, Weixun Wang, Bo Liu, Hao Zhu, Weiyan Shi, Diyi Yang, Wee Sun Lee, Min Lin
132 | RPGBENCH: Evaluating Large Language Models as Role-Playing Game Engines | Pengfei Yu, Dongming Shen, Silin Meng, Jaewon Lee, Weisu Yin, Andrea Yaoyun Cui, Zhenlin Xu, Yi Zhu, Xingjian Shi, Mu Li, Alex Smola
138 | Shaping Smart Personal Assistants through Generative Interactive Environments for Scalable Design and Evaluation | Ziyi Xuan, Yiwen Wu, Vinod Namboodiri, Yu Yang

Poster Presentations

Paper ID | Title | Authors
3 | Two Heads are Better Than One: Test-time Scaling of Multi-agent Collaborative Reasoning | Can Jin, Hongwu Peng, Qixin Zhang, Yujin Tang, Tong Che, Dimitris N. Metaxas
4 | Exploring Personality Trait Change of LLM-Based AI Systems | Yuhan Ma, Junjie Wang
5 | All Life is Problem Creation: Learning to Generate Environments that Maximize Performance Gain | Titas Anciukevičius, Yuhui Wang, Piotr Piękos, Li Nanbo, Wenyi Wang, Jürgen Schmidhuber
7 | UserBench: An Interactive Gym Environment for User-Centric Agents | Cheng Qian, Zuxin Liu, Akshara Prabhakar, Zhiwei Liu, Jianguo Zhang, Haolin Chen, Heng Ji, Weiran Yao, Shelby Heinecke, Silvio Savarese, Huan Wang
8 | Beyond Fixed Tasks: Unsupervised Environment Design for Task-Level Pairs | Daniel Furelos-Blanco, Charles Pert, Frederik Kelbel, Alexander F Spies, Alessandra Russo, Michael D Dennis
11 | RealWebAssist: A Benchmark for Long-Horizon Web Assistance with Real-World Users | Suyu Ye, Haojun Shi, Darren Shih, Hyokun Yun, Tanya G. Roosta, Tianmin Shu
12 | Licence to Scale: A Microservice Simulation Environment for Benchmarking Agentic AI | Christopher Lohse, Adrian Selk, Amadou Ba, Jonas Wahl, Marco Ruffini
13 | GLEE: A Unified Framework and Benchmark for Language-based Economic Environments | Eilam Shapira, Omer Madmon, Itamar Reinman, Samuel Joseph Amouyal, Roi Reichart, Moshe Tennenholtz
18 | PrivacyMAS: A Privacy-Preserving Multi-Agent System Framework | Maryam Fatima
20 | Communicating Plans, Not Percepts: Scalable Multi-Agent Coordination with Embodied World Models | Brennen Hill, Mant Koh En Wei, Jishnuanandh Thangavel
21 | Mobile-Agent-V: A Video-Guided Approach for Effortless and Efficient Operational Knowledge Injection in Mobile Automation | Junyang Wang, Haiyang Xu, Xi Zhang, Ming Yan, Ji Zhang, Fei Huang, Jitao Sang
22 | What Limits Agentic Systems Efficiency? | Song Bian, Minghao Yan, Anand Jayarajan, Gennady Pekhimenko, Shivaram Venkataraman
26 | Co-Evolving Complexity: An Adversarial Framework for Automatic MARL Curricula | Brennen Hill
28 | When Developer Aid Becomes Security Debt: A Systematic Analysis of Insecure Behaviors in LLM Coding Agents | Matous Kozak, Roshanak Zilouchian Moghaddam, Kalpathy Sivaraman
32 | The Physical Basis of Prediction: World Model Formation in Neural Organoids via an LLM-Generated Curriculum | Brennen Hill
33 | DEBATE: A Large-Scale Benchmark for Role-Playing LLM Agents in Multi-Agent, Long-Form Debates | Yun-Shiuan Chuang, Ruixuan Tu, Chengtao Dai, Smit Vasani, Binwei Yao, Michael Henry Tessler, Sijia Yang, Dhavan V. Shah, Robert D. Hawkins, Junjie Hu, Timothy T. Rogers
34 | MAPGD: Multi-Agent Prompt Gradient Descent for Collaborative Prompt Optimization | Yichen Han, Bojun Liu, Zhengpeng zhou, Guanyu Liu, Zeng Zhang, Yang Yang, Wenli Wang, Isaac N Shi, Yunyan, Lewei He, TIANYU SHI
36 | BrowseMaster: Towards Scalable Web Browsing via Tool-Augmented Programmatic Agent Pair | Xianghe Pang, Shuo Tang, Rui Ye, Yuwen Du, Yaxin Du, Siheng Chen
37 | TutorTest: Evaluating Language Model-based Tutoring Policies Using Surrogate Tasks | Aishwarya Mandyam
43 | You Don't Know Until You Click: Automated GUI Testing for Production-Ready Software Evaluation | Yutong Bian, Xianhao Lin, Yupeng Xie, Tianyang Liu, Mingchen Zhuge, Siyuan Lu, Haoming Tang, Jinlin Wang, Jiayi Zhang, Jiaqi Chen, Xiangru Tang, Yongxin Ni, Sirui Hong, Chenglin Wu
44 | Paper2Video: Automatic Video Generation from Scientific Papers | Zeyu Zhu, Kevin Qinghong Lin, Mike Zheng Shou
45 | On the Importance of Task Complexity in Evaluating LLM-Based Multi-Agent Systems | Bohan Tang, Huidong Liang, Keyue Jiang, Xiaowen Dong
46 | SimuGen: Multi-modal Agentic Framework for Constructing Block Diagram-Based Simulation Models | Xinxing Ren, Qianbo Zang, Zekun Guo
48 | SEDM: Scalable Self-Evolving Distributed Memory for Agents | Haoran Xu, Jiacong Hu, ZHANG Ke, Lei Yu, Yuxin Tang, Xinyuan Song, Yiqun Duan, Lynn Ai, TIANYU SHI
49 | Similar: A Step-Wise, Multi-Dimensional Reward Model for Virtual Agent Learning and Reasoning | Bingchen Miao, Yang Wu, Minghe Gao, Qifan Yu, Wendong Bu, Wenqiao Zhang, Yunfei Li, Siliang Tang, Tat-Seng Chua, Juncheng Li
50 | BoxingGym: Benchmarking Progress in Automated Experimental Design and Model Discovery | Kanishk Gandhi, Michael Y. Li, Lyle Goodyear, Agam Bhatia, Ying Li, Aditi Bhaskar, Mohammed Zaman, Noah Goodman
53 | GR-Agent: Adaptive Graph Reasoning Agent under Incomplete Knowledge | Dongzhuoran Zhou, Yuqicheng Zhu, Xiaxia Wang, Hongkuan Zhou, Jiaoyan Chen, Steffen Staab, Yuan He, Evgeny Kharlamov
54 | A Multi-agent Reasoning Framework for Video Question Answering | Abhi Kamboj, Gaurav Kumar, Krista Holden, Madhumitha Saravanan, Pradyumna Narayana
59 | LLM Economist: Large Population Models and Mechanism Design in Multi-Agent Generative Simulacra | Seth Karten, Wenzhe Li, Zihan Ding, Samuel Kleiner, Yu Bai, Chi Jin
60 | Ludax: A GPU-Accelerated Domain Specific Language for Board Games | Graham Todd, Alexander George Padula, Dennis J. N. J. Soemers, Julian Togelius
61 | Survival of the Useful: Evolutionary Boids as a Sandbox for Agent Societies | Xisen Wang, Qi Zhang
62 | EVOLVE-MEM: A Self-Adaptive Hierarchical Memory Architecture for Next-Generation Agentic AI Systems | Rishi Ashish Shah, Ujjwal Kakar, Shashvat Singhal, Dinesh K Vishwakarma
67 | ReMAC: Large Language Model-Driven Reward Design for Multi-Agent Manipulation Collaboration | Pengyi Li, Hongyao Tang, Yifu Yuan, Jianye HAO
68 | Revisiting Uncertainty Estimation and Calibration of Large Language Models | Linwei Tao, Yi-Fan Yeh, Minjing Dong, Tao Huang, Jialin Yu, Philip Torr, Chang Xu
70 | Vision-Language Models Unlock Task-Centric Latent Actions | Alexander Nikulin, Ilya Zisman, Albina Klepach, Denis Tarasov, Alexander Derevyagin, Andrei Polubarov, Lyubaykin Nikita, Vladislav Kurenkov
71 | AgentyxCrypt: Advancing Privacy and (Secure) Computation in AI Agent Collaboration | Harish Karthikeyan, Yue Guo, Udari Madhushani Sehwag, Leo de Castro, Antigoni Polychroniadou, Leo Ardon, Sumitra Ganesh
77 | Protein Design with Agent Rosetta: A Case Study for Specialized Scientific Agents | Jacopo Teneggi, Tanya Marwah, Alberto Bietti, P. Douglas Renfrew, Vikram Khipple Mulligan, Siavash Golkar
79 | Zephyrus: An Agentic Framework for Weather Science | Sumanth Varambally, Marshall Fisher, Jas Thakker, Yiwei Chen, Zhirui Xia, Ruijia Niu, Yasaman Jafari, Veeramakali Vignesh Manivannan, Zachary Novack, Luyu Han, Srikar Eranky, Salva Rühling Cachay, Taylor Berg-Kirkpatrick, Duncan Watson-Parris, Yian Ma, Rose Yu
80 | Traxgen: Ground-Truth Trajectory Generation for AI Agent Evaluation | Maria Emilia Mazzolenis, Ruirui Zhang
82 | LLMs as Scalable, General-Purpose Simulators For Evolving Digital Agent Training | Yiming Wang, Da Yin, Yuedong Cui, Zhiqian Li, Ruichen Zheng, Zongyu Lin, Di Wu, Xueqing Wu, Chenchen Ye, Yu Zhou, Kai-Wei Chang
83 | DefenderBench: A Toolkit for Evaluating Language Agents in Cybersecurity Environments | Chiyu Zhang, Marc-Alexandre Côté, Michael Albada, Anush Sankaran, Jack W Stokes, Tong Wang, Amir H. Abdi, William Blum, Muhammad Abdul-Mageed
84 | ChinaTravel: An Open-Ended Benchmark for Language Agents in Chinese Travel Planning | Jie-Jing Shao, Bo-Wen Zhang, Xiao-Wen Yang, Baizhi Chen, Siyu Han, Wen-Da Wei, Guohao Cai, Zhenhua Dong, Lan-Zhe Guo, Yu-Feng Li
86 | Characterizing Deep Research: A Benchmark and Formal Definition | Abhinav Java, Ashmit Khandelwal, Sukruta Prakash Midigeshi, Aaron Halfaker, Amit Deshpande, Navin Goyal, Ankur Gupta, Nagarajan Natarajan, Amit Sharma
88 | Examining the Vulnerability of Multi-Agent Medical Systems to Human Interventions for Clinical Reasoning | Benjamin Liu, Dillon Mehta, Rishi Malhotra, Adam Zobian, Yong Ying Tan, Samir Chopra, Daniella Rand, Natalie Pang, Abhiram Gudimella, Raghav Thallapragada, Derek Jiu, Prisha Shah, Kevin Zhu
89 | IndusGCC: A Data Benchmark and Evaluation Framework for GUI-Based General Computer Control in Industrial Automation | Xiaoran Yang, Yuyang Du, Kexin Chen, Soung Chang Liew, Jiamin Lu, Ziyu Guo, Xiaoyan Liu, Qun Yang, Shiqi XU, Xingyu Fan, Yuchen Pan, Taoyong Cui, Hongyu Deng, Boris Düdder, Jianzhang Pan, Qun Fang, Pheng-Ann Heng
90 | Faithful Simulation of User–Agent–Environment Interactions for Scalable LLM Agent Evaluation | Aleksei Kudrinskii, Saibo Geng, Luca Beurer-Kellner, Marc Fischer
91 | MCP-Universe: Benchmarking Large Language Models with Real-World Model Context Protocol Servers | Ziyang Luo, Zhiqi Shen, Wenzhuo Yang, Zirui Zhao, Prathyusha Jwalapuram, Amrita Saha, Doyen Sahoo, Silvio Savarese, Caiming Xiong, Junnan Li
94 | Code2MCP: Transforming Code Repositories into MCP Services | Chaoqian Ouyang, Ling Yue, Shimin Di, Libin Zheng, Shaowu Pan, Min-Ling Zhang
98 | WebArena Verified: Reliable Evaluation for Web Agents | Amine El hattami, Megh Thakkar, Nicolas Chapados, Christopher Pal
99 | See, Think, Act: Online Shopper Behavior Simulation with VLM Agents | Yimeng Zhang, Ziyi Wang, Yuxuan Lu, Simon Sinong Zhan, Jing Huang, Dakuo Wang
101 | Agent Context Protocols Enhance Collective Inference | Arjun Beniwal, Devansh Bhardwaj, Shreyas Chaudhari, Ashwin Kalyan, Tanmay Rajpurohit, Karthik R Narasimhan, Ameet Deshpande, Vishvak Murahari
102 | Towards Agents That Know When They Don't Know: Uncertainty as a Control Signal for Structured Reasoning | Josefa Lia Stoisser, Marc Boubnovski Martell, Lawrence Phillips, Gianluca Mazzoni, Lea Mørch Harder, Philip Torr, Jesper Ferkinghoff-Borg, Kaspar Märtens, Julien Fauqueur
103 | Enabling User-Created Multi-Agent Simulations: Interactive and Customizable 2D Environments to Study Team Dynamics with LLM Agents | Mohammed Almutairi, Charles Chiang, Haoze Guo, Nandini Banerjee, Maria Milkowski, Daniel Nguyen, Michael G Yankoski, Tim Weninger, Svitlana Volkova, Trenton W. Ford, Diego Gomez-Zara
105 | The Influence of Scaffolds on Coordination Scaling Laws in LLM Agents | Mariana Meireles, Rupali Bhati, Niklas Lauffer, Cameron Allen
108 | Go-Browse: Training Web Agents with Structured Exploration | Apurva Gandhi, Graham Neubig
110 | VendiRL: A Framework for Self-Supervised Reinforcement Learning of Diversely Diverse Skills | Erik M. Lintunen
112 | OPeRA: A Dataset of Observation, Persona, Rationale, and Action for Evaluating LLMs on Human Online Shopping Behavior Simulation | Ziyi Wang, Yuxuan Lu, Wenbo Li, Amirali Amini, Bo Sun, Yakov Bart, Weimin Lyu, Jiri Gesi, Tian Wang, Jing Huang, Yu Su, Upol Ehsan, Malihe Alikhani, Toby Jia-Jun Li, Lydia Chilton, Dakuo Wang
113 | GroundedPRM: Tree-Guided and Fidelity-Aware Process Reward Modeling for Step-Level Reasoning | Yao Zhang, Yu Wu, Haowei Zhang, Weiguo Li, Haokun Chen, Guohao Li, Zhen Han, Volker Tresp
114 | When Agents go Astray: Course-Correcting SWE Agents with PRMs | Shubham Gandhi, Jason Tsay, Jatin Ganhotra, Kiran Kate, Yara Rizk
119 | Natural Language Grounded Reinforcement Learning for Clinical Decision-Making in Virtual Patient Simulations | Niyel Hassan, Benjamin Liu, Jason Tsai, Jeffrey K Jopling, Dana Lin, Edward Melcer, Cara Liebert
120 | Scaling Environments for LLM Agents in the Era of Learning from Interaction: A Survey | Yuchen Huang, Sijia Li, Minghao LIU, Wei Liu, Zhiyuan Fan, Yi R. Fung
121 | RAISE: Reliable Agent Improvement via Simulated Experience | Sahar Omidi Shayegan, Joshua Meyer, Victor Shih, Sebastian Sosa, Tianyi Peng, Kostis Kaffes, Eugene Wu, Andi Partovi, Mehdi Jamei
122 | Learning to Make Friends: Coaching LLM Agents toward Emergent Social Ties | Philipp J. Schneider, LIN TIAN, Marian-Andrei Rizoiu
123 | CoLLAB: A Framework for Designing Scalable Benchmarks for Agentic LLMs | Saaduddin Mahmud, Eugene Bagdasarian, Shlomo Zilberstein
125 | Model Context Protocol for Vision Agents: Schema, Memory, and World Model Implications | Aditi Tiwari, Akshit Bhalla
130 | MEMTRACK: Evaluating Long-Term Memory and State Tracking in Multi-Platform Dynamic Agent Environments | Darshan Girish Deshpande, Varun Prashant Gangal, Hersh Mehta, Jędrzej Rosłaniec, Anand Kannappan, Rebecca Qian, Peng Wang
131 | Enabling multi-agent collaboration in knowledge graph environments | Iñaki Arango, Ayush Noori, Lucas Vittor, Joaquin Polonuer, Marinka Zitnik
135 | AgentSynth: Scalable Task Generation for Generalist Computer-Use Agents | Jingxu Xie, Dylan Xu, Xuandong Zhao, Dawn Song
136 | MIRAI: Evaluating LLM Agents for International Event Forecasting | Chenchen Ye, Ziniu Hu, Yihe Deng, Zijie Huang, Mingyu Derek Ma, Yanqiao Zhu, Wei Wang
137 | PuzzleJAX: A Benchmark for Reasoning and Learning | Sam Earle, Graham Todd, Yuchen Li, Ahmed Khalifa, Zehua Jiang, Muhammad Umair Nasir, Andrzej Banburski-Fahey, Julian Togelius
141 | Agentic Persona Control and Task State Tracking for Realistic User Simulation in Interactive Scenarios | Hareeshwar Karthikeyan
146 | Are LLMs Generalist Hanabi Agents? | Mahesh Ramesh, Aswinkumar Ramkumar, Pavan Thodima, Kaousheik Jayakumar, Aniket Rege
147 | MCP-Bench: Benchmarking Tool-Using LLM Agents with Complex Real-World Tasks via MCP Servers | Zhenting Wang, Qi Chang, Hemani Patel, Shashank Biju, Cheng-En Wu, Quan Liu, Aolin Ding, Alireza Rezazadeh, Ankit Shah, Yujia Bao, Eugene Siow
150 | SpatialThinker: Reinforcing 3D Reasoning in Multimodal LLMs via Spatial Rewards | Hunar Batra, Haoqin Tu, Hardy Chen, Yuanze Lin, Cihang Xie, Ronald Clark
152 | UpSkill: Mutual Information Skill Learning for Structured Response Diversity in LLMs | Devan Shah, Owen Yang, Daniel Yang, Chongyi Zheng, Benjamin Eysenbach
153 | Steering Diffusion Policies with Value-Guided Denoising | Hanming Ye
154 | CUBE: Collaborative Multi-Agent Block-Pushing Environment for Collective Planning with LLM Agents | Hanqing Yang, Narjes Nourzad, Shiyu Chen, Carlee Joe-Wong
156 | Player-Coach Teamwork: Multi-agent Collaboration for Improving LLM Reasoning | Heewon Park, Minhae Kwon
163 | Automated Specialization of Stateful Agent Systems | Myan Vu, Harrish Ayyanar, PANG JIANG, Anwiketh Reddy, Mayank Goel, Kevin Zhu
164 | Scaling Open-Ended Reasoning to Predict the Future | Nikhil Chandak, Shashwat Goel, Ameya Prabhu, Moritz Hardt, Jonas Geiping
165 | Verifiable Chemical Reasoning through Tool-Calling Agentic Workflow | Gabrielle Gaudeau, Shinnosuke Tanaka, Defne Circi, Ian W Kennedy, Movina Moses, Mohab Elkaref
166 | Fathom-Search-4B: Scaling DeepSearch Reasoning Capabilities via RL | Shreyas Singh, Kunal Singh, Pradeep Moturi
167 | SEA: Stateful Execution Environment for Conversational Big Data Analytics | Rohit Kumar, Ajay Anil Kumar

Support Team
  • Web Chair: Douglas Yueming Lai
  • Finance & Logistics Chair: Celine Yuqin Xie

We thank our support team for their dedication and behind-the-scenes work that made this workshop possible.

Sponsors