Introducing Pix2Act: An AI Agent That Uses the Same Interface as Humans to Interact with GUIs, via Pixel-Based Screenshots and Keyboard and Mouse Actions.

Are you curious about how artificial intelligence could revolutionize GUI-based digital assistants? Look no further than the latest research from Google DeepMind and Google, which introduces PIX2ACT, a model that takes pixel-based screenshots as input and selects actions corresponding to basic mouse and keyboard controls. This approach is changing how we think about agents that interact with graphical user interfaces.
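To make the interface concrete, here is a minimal, hypothetical sketch of the agent's input/output contract: the model sees only a raw screenshot and emits generic low-level actions such as clicks and key presses. This is not the authors' code; the action names and the text format of the model's output are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import Union

# Illustrative low-level action types matching basic mouse/keyboard controls.
@dataclass
class Click:
    x: int  # pixel coordinate on the screenshot
    y: int

@dataclass
class KeyPress:
    key: str  # e.g. "Enter", "a"

Action = Union[Click, KeyPress]

def parse_action(text: str) -> Action:
    """Decode a model's text output (e.g. "click 32 64" or "key Enter")
    into a structured low-level action. The text format is an assumption."""
    parts = text.split()
    if parts[0] == "click":
        return Click(x=int(parts[1]), y=int(parts[2]))
    if parts[0] == "key":
        return KeyPress(key=parts[1])
    raise ValueError(f"unknown action: {text}")
```

A pixel-only agent of this kind never touches the DOM or accessibility tree; everything it knows about the page comes from the screenshot, and everything it does goes through actions like these.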

PIX2ACT builds upon the success of PIX2STRUCT, a Transformer-based image-to-text model trained on large-scale web data to convert screenshots into structured, HTML-based representations. PIX2ACT is then trained on a combination of human demonstrations and interactions with the environment, using tree search to repeatedly construct new expert trajectories for training.
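The trajectory-construction idea can be illustrated with a minimal, assumed sketch (not the paper's actual algorithm): a best-first search over environment states, guided by the current policy's action preferences, that returns the first action sequence earning positive reward. That sequence can then be added to the training set as a new expert trajectory. All function names and the toy environment interface here are hypothetical.

```python
import heapq
from typing import Callable, List, Tuple

def search_trajectory(
    start: str,
    actions: Callable[[str], List[str]],            # available actions in a state
    step: Callable[[str, str], Tuple[str, float]],  # -> (next_state, reward)
    policy_score: Callable[[str, str], float],      # learned policy's preference
    max_depth: int = 5,
    beam: int = 3,
) -> List[str]:
    """Policy-guided best-first search; returns the action sequence of the
    first trajectory that earns positive reward, or [] if none is found."""
    # Each frontier entry: (negated cumulative score for the min-heap,
    #                       current state, actions taken so far).
    frontier = [(0.0, start, [])]
    while frontier:
        neg_score, state, trail = heapq.heappop(frontier)
        if len(trail) >= max_depth:
            continue
        # Expand only the policy's top-`beam` actions in this state.
        ranked = sorted(actions(state), key=lambda a: -policy_score(state, a))
        for a in ranked[:beam]:
            nxt, reward = step(state, a)
            if reward > 0:
                return trail + [a]  # a successful expert trajectory
            heapq.heappush(
                frontier,
                (neg_score - policy_score(state, a), nxt, trail + [a]),
            )
    return []
```

Iterating this loop — search with the current policy, collect successful trajectories, retrain, search again — is one standard way to bootstrap a policy beyond its initial human demonstrations.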

Despite the obstacles posed by learning from pixel-only inputs combined with generic low-level actions, this research establishes the first baseline for GUI-based instruction following with pixel-based inputs. The findings demonstrate the efficacy of PIX2STRUCT's screenshot-parsing pre-training, which raises MiniWob++ and WebShop task scores by 17.1 and 46.7 points, respectively. With this approach, PIX2ACT achieves roughly four times the score of human crowdworkers on MiniWob++.

In the world of AI, the possibilities are vast, and PIX2ACT represents an exciting advancement in user-interface technology. For full details, read the complete research paper.
