Dataset + Models for Mobile GUI World Modeling

How Mobile World Model Guides GUI Agents?

A unified study of text, image, and renderable-code world models for mobile GUI agents: what they should predict, how they help at test time, and when imagined trajectories transfer to policy training.

Dataset
xwk123/Mobile-GUI-Worldmodel-SFT
Models
hf.co/collections/xwk123/mobileworldmodel
Authors
Weikai Xu1*, Kun Huang2*, Yunren Feng3*, Jiaxing Li1, Yuhan Chen1, Yuxuan Liu4, Zhizheng Jiang3, Heng Qu5, Pengzhi Gao2, Wei Liu2, Jian Luan2, Xiaolin Hu6, Bo An1†
1 Nanyang Technological University · 2 MiLM Plus, Xiaomi Inc. · 3 University of Electronic Science and Technology of China · 4 Gaoling School of Artificial Intelligence, Renmin University of China · 5 Wuhan University · 6 Xiamen University
Paper Summary

Abstract and contribution

Recent advances in vision-language models have enabled mobile GUI agents to perceive visual interfaces and execute user instructions, but reliable prediction of action consequences remains critical for long-horizon and high-risk interactions. Existing mobile world models provide either text-based or image-based future states, yet it remains unclear which representation is useful, whether generated rollouts can replace real environments, and how test-time guidance helps agents of different strengths.

We filter and annotate mobile world-model data, then train world models across four modalities: delta text, full text, diffusion-based images, and renderable code. We evaluate their downstream utility on AITZ, AndroidControl, and AndroidWorld.
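
As a rough illustration of how the four formats differ at the interface level, here is a minimal sketch; the class and function names are hypothetical and not the released models' API:

```python
from dataclasses import dataclass
from typing import Literal, Optional

# Each world model maps (current screenshot, candidate action) to a predicted next state;
# only the prediction format differs between the four modalities.
Format = Literal["full_text", "delta_text", "diffusion_image", "renderable_code"]

@dataclass
class NextStatePrediction:
    format: Format
    full_text: Optional[str] = None    # complete textual description of the next screen
    delta_text: Optional[str] = None   # only the UI changes the action would cause
    image: Optional[bytes] = None      # pixels produced by a diffusion model
    code: Optional[str] = None         # HTML/CSS that can be rendered into a screenshot

def predict_next_state(screenshot: bytes, action: str, fmt: Format) -> NextStatePrediction:
    """Placeholder for querying a world model of the chosen modality."""
    raise NotImplementedError
```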

Finding 1 · Renderable code is strong in-distribution; text feedback is more robust for online OOD execution.
Finding 2 · Posterior self-reflection is limited by overconfident, low-entropy action policies.
Finding 3 · World-model imagination can transfer interaction experience, but does not preserve the source distribution.
Intro Figures

World-model formats and headline results

Figure 1 · Overview

Empirical map from prediction formats to test-time guidance and imagination-based fine-tuning.

Figure 2 · Text vs image world models

Input-output comparison for GUI state prediction.

World-modeling paradigms

Four prediction settings compared across judge models and MobileWorldModel.

Three Experimental Conclusions

What the experiments show

AndroidWorld SR

Online M3A success rate under text/image feedback.

Offline navigation radar

Six-dimensional summary of action-selection performance.

Prediction quality by format

Reconstruction quality across Full Text, Delta Text, Diffusion Image, and Code2Image.

Entropy vs accuracy

Entropy-conditioned behavior across GUI and non-GUI settings.

Reflection change rate

Higher entropy enables more meaningful action revision.
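
Finding 2 can be read mechanistically: when the action policy is overconfident (low entropy), a posterior self-reflection pass rarely changes the chosen action, so its benefit is capped. A minimal sketch of entropy-gated reflection; the threshold and function names are illustrative, not the paper's exact procedure:

```python
import math

def action_entropy(action_probs: dict) -> float:
    """Shannon entropy (nats) of the agent's distribution over candidate actions."""
    return -sum(p * math.log(p) for p in action_probs.values() if p > 0)

def worth_reflecting(action_probs: dict, threshold: float = 0.5) -> bool:
    """Only spend a self-reflection pass when the policy is uncertain enough
    that revising the action is likely to matter."""
    return action_entropy(action_probs) > threshold

# An overconfident policy almost never revises; a flatter one leaves room to change.
overconfident = {"CLICK(settings)": 0.97, "SCROLL(down)": 0.03}
uncertain = {"CLICK(settings)": 0.45, "SCROLL(down)": 0.35, "PRESS(back)": 0.20}
print(worth_reflecting(overconfident))  # False
print(worth_reflecting(uncertain))      # True
```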

Test-time scaling

Scaling trends on AITZ, AndroidControl, and GUI-Odyssey.
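
Test-time scaling here means sampling several candidate actions, imagining each one's outcome with the world model, and letting a judge keep the candidate whose predicted state best advances the instruction. A best-of-N sketch under that reading; the callable signatures are placeholders rather than the released code:

```python
from typing import Callable, List

def guided_action_selection(
    screenshot: bytes,
    instruction: str,
    propose: Callable[[bytes, str, int], List[str]],  # policy: sample n candidate actions
    imagine: Callable[[bytes, str], str],             # world model: predicted next state (text)
    judge: Callable[[str, str], float],               # scorer: does the predicted state help?
    n: int = 4,
) -> str:
    """Return the candidate action whose imagined outcome scores highest."""
    candidates = propose(screenshot, instruction, n)
    best_action, best_score = None, float("-inf")
    for action in candidates:
        score = judge(instruction, imagine(screenshot, action))
        if score > best_score:
            best_action, best_score = action, score
    return best_action
```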

Offline WM-SFT

Training behavior under imagined trajectories.
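
WM-SFT fine-tunes the policy on trajectories rolled out inside the world model rather than on a real device. A rough sketch of flattening such imagined rollouts into supervision pairs; the record schema is an assumption for illustration:

```python
def rollout_to_sft_examples(instruction: str, rollout: list) -> list:
    """Turn one imagined rollout (a list of {"state": ..., "action": ...} steps)
    into (observation, target action) training records."""
    return [
        {
            "instruction": instruction,
            "observation": step["state"],     # imagined screen, not a captured screenshot
            "target_action": step["action"],  # action the agent should emit in that state
        }
        for step in rollout
    ]
```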

Online evaluation

AndroidWorld gains vs AndroidControl distribution shift.

Click distribution

Behavioral changes induced by WM-generated interaction traces.

Qualitative Cases

Where the world model helps and fails

HTML world-model feedback

AITZ downstream case study.

Delta-text feedback

AndroidControl downstream example.

Diffusion image case

Visual prediction can be expressive but error-prone.

Code2Image case

Renderable-code prediction and screenshot reconstruction.
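
Code2Image predicts the next screen as renderable code and then rasterizes it so the result can be compared against a real screenshot. One way to do the rendering step, assuming the prediction is self-contained HTML and Playwright is available (not necessarily the toolchain used in the paper):

```python
from playwright.sync_api import sync_playwright

def render_predicted_html(html: str, out_path: str = "predicted_screen.png") -> str:
    """Rasterize a predicted HTML screen at a phone-like resolution."""
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page(viewport={"width": 1080, "height": 2400})
        page.set_content(html)          # load the world model's predicted code
        page.screenshot(path=out_path)  # reconstructed screenshot for evaluation
        browser.close()
    return out_path
```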

Stopwatch · real UI

Stopwatch · imagined render

Account · real UI

Account · imagined render