Abstract
Recent multimodal models for instruction-based face editing enable semantic manipulation but still struggle with precise attribute control and identity preservation. Structural facial representations such as landmarks are effective for intermediate supervision, yet most existing methods treat them as rigid geometric constraints, which can degrade identity when conditional landmarks deviate significantly from the source (e.g., large expression or pose changes, or inaccurate landmark estimates). To address these limitations, we propose LaTo, a landmark-tokenized diffusion transformer for fine-grained, identity-preserving face editing. Our key innovations include: (1) a landmark tokenizer that directly quantizes raw landmark coordinates into discrete facial tokens, obviating the need for dense pixel-wise correspondence; (2) a location-mapped positional encoding and a landmark-aware classifier-free guidance scheme that jointly facilitate flexible yet decoupled interactions among instruction, geometry, and appearance, enabling strong identity preservation; and (3) a landmark predictor that leverages vision–language models to infer target landmarks from instructions and source images, whose structured chain of thought improves estimation accuracy and interactive control. To mitigate data scarcity, we curate HFL-150K, to our knowledge the largest benchmark for this task, containing over 150K real face pairs with fine-grained instructions. Extensive experiments show that LaTo outperforms state-of-the-art methods by 7.8% in identity preservation and 4.6% in semantic consistency.
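The first innovation, quantizing raw landmark coordinates into discrete facial tokens, can be pictured with a minimal sketch. The normalization, per-axis bin count, and fused token layout below are illustrative assumptions, not the paper's actual tokenizer.

# Minimal sketch of coordinate quantization for landmark tokenization.
# Bin count, normalization, and token layout are illustrative assumptions.
import numpy as np

def tokenize_landmarks(landmarks: np.ndarray, image_size: int, num_bins: int = 256) -> np.ndarray:
    """Map raw (x, y) landmark coordinates in pixels to discrete token ids in [0, num_bins**2)."""
    norm = np.clip(landmarks / image_size, 0.0, 1.0 - 1e-6)   # normalize to [0, 1)
    bins = (norm * num_bins).astype(np.int64)                  # (N, 2) per-axis bin indices
    return bins[:, 0] * num_bins + bins[:, 1]                  # fuse both axes into one vocabulary entry

def detokenize_landmarks(tokens: np.ndarray, image_size: int, num_bins: int = 256) -> np.ndarray:
    """Recover approximate (x, y) coordinates (bin centers) from token ids."""
    x_bin, y_bin = tokens // num_bins, tokens % num_bins
    return (np.stack([x_bin, y_bin], axis=-1) + 0.5) / num_bins * image_size

Under these assumptions (e.g., a 68-point layout with 256 bins per axis), a face's geometry becomes a short sequence of integers, which is what removes the need to render a dense landmark image for pixel-wise correspondence.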
Pipeline
Overview of LaTo. The landmark predictor infers target landmarks from the source image and instruction via a structured chain of thought. A landmark tokenizer and a visual VAE encode the predicted landmarks and the source image into tokens. The location-mapped positional encoding anchors each landmark token to its physical location, ensuring unified yet flexible alignment with the instruction and visual tokens. A learned unconditional landmark token further guides the denoising process, keeping the edited image aligned with both the specified landmarks and the instruction.
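As a concrete reading of the figure, the sketch below shows one way a landmark-aware classifier-free guidance step could combine unconditional, instruction-only, and instruction-plus-landmark predictions, with the learned unconditional landmark token standing in whenever the geometry condition is dropped. The denoiser interface, guidance scales, and two-scale decomposition are assumptions for illustration, not the paper's exact formulation.

# Hedged sketch of one landmark-aware classifier-free guidance step.
# `model(x_t, t, text=..., lmk=...)` is a hypothetical denoiser interface.
import torch

@torch.no_grad()
def guided_noise(model, x_t, t, text_tokens, lmk_tokens, null_lmk_token,
                 s_text: float = 5.0, s_lmk: float = 2.0) -> torch.Tensor:
    null_lmk = null_lmk_token.expand_as(lmk_tokens)               # learned unconditional landmark token
    eps_uncond = model(x_t, t, text=None, lmk=null_lmk)           # both conditions dropped
    eps_text = model(x_t, t, text=text_tokens, lmk=null_lmk)      # instruction only
    eps_full = model(x_t, t, text=text_tokens, lmk=lmk_tokens)    # instruction + landmarks
    # Separate scales let the instruction and geometry conditions be traded off independently.
    return eps_uncond + s_text * (eps_text - eps_uncond) + s_lmk * (eps_full - eps_text)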
Experiments
• Landmark tokenization in LaTo preserves identity and produces natural results, whereas pixel-wise alignment baselines rigidly follow the rendered landmark image and often lose identity under cross-identity landmark conditions (first four columns) or when self-identity landmarks deviate substantially from the source.
• LaTo supports fine-grained facial expression editing, parametric head-pose editing, and their combination. The small insets visualize the landmarks generated by the landmark predictor, making the control signal intuitive to obtain; a hedged sketch of such a query follows this list.
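The sketch below is one hypothetical way the landmark predictor's structured chain of thought could be queried: the vision-language model is asked to describe the source geometry, reason about how the instruction changes it, and only then emit target landmarks as JSON. The prompt wording, JSON schema, and `vlm_chat` callable are assumptions for illustration, not the paper's predictor.

# Hypothetical sketch of querying a VLM for target landmarks via a structured chain of thought.
import json

PROMPT_TEMPLATE = (
    'You are given a face image and an editing instruction: "{instruction}".\n'
    "Step 1: Describe the relevant source facial geometry (eyes, brows, mouth, head pose).\n"
    "Step 2: Reason about how the instruction should change that geometry.\n"
    'Step 3: Output the target 2D landmarks as JSON: {{"landmarks": [[x, y], ...]}}.'
)

def predict_target_landmarks(vlm_chat, source_image, instruction: str):
    """Return a list of [x, y] target landmarks parsed from the model's final step."""
    reply = vlm_chat(image=source_image, prompt=PROMPT_TEMPLATE.format(instruction=instruction))
    start = reply.rfind('{"landmarks"')        # keep only the trailing JSON, ignoring the reasoning text
    if start < 0:
        raise ValueError("reply contains no landmark JSON")
    obj, _ = json.JSONDecoder().raw_decode(reply[start:])
    return obj["landmarks"]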
Comparison
• Quantitative evaluation of state-of-the-art editing methods on the HFL-150K test set and the face-attribute-editing subsets of GEdit-Bench and ICE-Bench. SC: Semantic Consistency, VQ: Visual Quality, NA: Natural Appearance, IP: Identity Preservation. † indicates models fine-tuned on the HFL-150K training set.
• Qualitative comparison with state-of-the-art image editing methods.
BibTeX
@misc{zhang2026latolandmarktokenizeddiffusiontransformer,
  title={LaTo: Landmark-tokenized Diffusion Transformer for Fine-grained Human Face Editing},
  author={Zhenghao Zhang and Ziying Zhang and Junchao Liao and Xiangyu Meng and Qiang Hu and Siyu Zhu and Xiaoyun Zhang and Long Qin and Weizhi Wang},
  year={2026},
  eprint={2509.25731},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2509.25731},
}