Introducing OSWorld: Transforming Autonomous Agent Development using Real-World Computer Environments

Are you ready to dive into the future of computer assistants? Picture a world where digital agents can handle complex tasks on your computer, navigating across different operating systems and applications with minimal guidance. Such agents could transform productivity and accessibility in the digital realm, but until now, evaluating them has been held back by inadequate benchmarks.

Enter OSWorld, a platform built by a team of researchers to support the development of truly capable computer agents. It is a scalable, real computer environment designed to challenge multimodal agents across operating systems such as Linux, Windows, and macOS. OSWorld stands out for its integrated, controllable environment, which supports task setup, evaluation, and interactive learning and lets agents interact freely with any application installed on the system using raw mouse and keyboard inputs.

The researchers behind OSWorld have curated a benchmark of 369 real-world computer tasks spanning web browsers, office suites, media players, coding IDEs, and more. Each task is meticulously annotated with natural language instructions, setup configurations, and evaluation scripts to ensure accurate assessment. The results from testing state-of-the-art language and vision-language models on these tasks show significant deficiencies in GUI grounding, operational knowledge, and long-horizon planning capabilities.
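To make the task annotation described above more concrete, here is a minimal sketch of how an instruction, a setup configuration, and an evaluation script could fit together. It is an illustration only and not OSWorld's actual format or API: the `Task` class, the `run_task` helper, the dictionary standing in for a computer's state, and both toy agents are hypothetical names invented for this example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical stand-in for a computer's state. A real environment would be a
# full machine with applications, not a dictionary of file names.
State = Dict[str, str]


@dataclass
class Task:
    instruction: str                      # natural language goal given to the agent
    setup: List[Callable[[State], None]]  # steps that prepare the initial state
    evaluate: Callable[[State], bool]     # checks the final state, not the actions taken


def run_task(task: Task, agent: Callable[[str, State], None]) -> bool:
    """Prepare the initial state, let the agent act, then score the result."""
    state: State = {}
    for step in task.setup:
        step(state)
    agent(task.instruction, state)
    return task.evaluate(state)


def create_draft(state: State) -> None:
    # Setup step: place a file the agent is expected to rename.
    state["report.txt"] = "draft"


rename_task = Task(
    instruction="Rename report.txt to final.txt",
    setup=[create_draft],
    evaluate=lambda s: "report.txt" not in s and s.get("final.txt") == "draft",
)


def scripted_agent(instruction: str, state: State) -> None:
    # Toy agent that hard-codes the solution instead of reading the instruction.
    state["final.txt"] = state.pop("report.txt")


def idle_agent(instruction: str, state: State) -> None:
    # Toy agent that does nothing, so the task should be scored as failed.
    pass
```

Calling `run_task(rename_task, scripted_agent)` scores the scripted agent as successful, while `run_task(rename_task, idle_agent)` scores the idle agent as failed. The point of the sketch is that the evaluator inspects the resulting state rather than the steps the agent took, so any sequence of actions that reaches the goal counts.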

But fear not: these findings provide a roadmap for future research. Key areas for exploration include improving how vision-language models interact with GUIs, developing agent architectures for exploration and memory, addressing safety challenges in realistic environments, and expanding data and environments to fuel agent development. OSWorld represents a turning point in the pursuit of autonomous digital assistants, offering a realistic, scalable testing environment and a diverse benchmark for future research to build on.

