
Abstract and contribution
Recent advances in vision-language models let mobile GUI agents perceive visual interfaces and execute user instructions, but reliably predicting the consequences of actions remains critical for long-horizon and high-risk interactions. Existing mobile world models produce either text-based or image-based future states, yet it remains unclear which representation is most useful, whether generated rollouts can substitute for real environments, and how test-time guidance helps agents of different strengths.
We filter and annotate mobile world-model data, then train world models across four modalities: delta text, full text, diffusion-based images, and renderable code. We evaluate their downstream utility on AITZ, AndroidControl, and AndroidWorld.
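To make the setup concrete, the sketch below shows one plausible way a world model producing any of these four output formats could be used for test-time guidance: the agent proposes candidate actions, the world model imagines the resulting next state, and a scorer selects the action whose predicted outcome best advances the instruction. This is a minimal illustration of the general pattern, not the paper's actual pipeline; every class and function name here (ImaginedState, WorldModel, guided_action, score_fn) is a hypothetical placeholder.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical container for an imagined next state in one of the four formats
# (delta text, full text, diffusion-generated image, renderable code).
@dataclass
class ImaginedState:
    modality: str   # "delta_text" | "full_text" | "image" | "code"
    content: object  # text diff, full description, image array, or HTML/XML source

class WorldModel:
    """Hypothetical wrapper: predicts the next UI state for a candidate action."""
    def __init__(self, modality: str, predict_fn: Callable):
        self.modality = modality
        self.predict_fn = predict_fn  # e.g. a fine-tuned VLM or a diffusion model

    def imagine(self, screen, action) -> ImaginedState:
        return ImaginedState(self.modality, self.predict_fn(screen, action))

def guided_action(screen, instruction: str, candidates: List[dict],
                  world_model: WorldModel,
                  score_fn: Callable[[ImaginedState, str], float]) -> dict:
    """Test-time guidance as best-of-N reranking: score each candidate action
    by how well its imagined outcome matches the instruction, keep the best."""
    best_action, best_score = None, float("-inf")
    for action in candidates:
        rollout = world_model.imagine(screen, action)
        score = score_fn(rollout, instruction)  # e.g. a VLM judge or reward model
        if score > best_score:
            best_action, best_score = action, score
    return best_action
```

In this framing the world model acts as a verifier over the agent's own proposals rather than as a planner, which is one way such guidance can be layered on top of base agents of varying strength without changing their policies.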
World-model formats and headline results



What the experiments show









Where the world model helps and fails





[Figure: side-by-side comparison of real UI screenshots and world-model imagined renders for the Stopwatch and Account screens]