Tora2: Motion and Appearance Customized Diffusion Transformer for Multi-Entity Video Generation

ACM MM25

Zhenghao Zhang  Junchao Liao  Xiangyu Meng  Long Qin  Weizhi Wang

Alibaba Group

arXiv



Abstract


Recent advances in diffusion transformer models for motion-guided video generation, such as Tora, have shown significant progress. In this paper, we present Tora2, an enhanced version of Tora that introduces several design improvements to expand its capabilities in both appearance and motion customization. Specifically, we introduce a decoupled personalization extractor that generates comprehensive personalization embeddings for multiple open-set entities, better preserving fine-grained visual details than previous methods. Building on this, we design a gated self-attention mechanism to integrate trajectory, textual description, and visual information for each entity. This design significantly reduces misalignment in multimodal conditioning during training. Moreover, we introduce a contrastive loss that jointly optimizes trajectory dynamics and entity consistency through an explicit mapping between motion and personalization embeddings. Tora2 is, to the best of our knowledge, the first method to achieve simultaneous multi-entity customization of appearance and motion for video generation. Experimental results demonstrate that Tora2 achieves performance competitive with state-of-the-art customization methods while providing advanced motion control capabilities, marking a critical advancement in multi-condition video generation.
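The contrastive loss described above pairs each entity's motion embedding with its personalization embedding. The sketch below illustrates one plausible InfoNCE-style formulation; the function name, shapes, temperature, and symmetric cross-entropy form are assumptions for illustration, not the paper's released code.

```python
import torch
import torch.nn.functional as F

def motion_appearance_contrastive_loss(motion_emb, person_emb, temperature=0.07):
    """Hypothetical contrastive loss pairing motion and personalization embeddings.

    motion_emb:  (num_entities, dim) trajectory/motion embeddings
    person_emb:  (num_entities, dim) personalization (appearance) embeddings
    """
    motion_emb = F.normalize(motion_emb, dim=-1)
    person_emb = F.normalize(person_emb, dim=-1)
    # Similarity of every motion embedding to every personalization embedding.
    logits = motion_emb @ person_emb.t() / temperature
    targets = torch.arange(motion_emb.size(0), device=motion_emb.device)
    # Symmetric cross-entropy: matched (motion, appearance) pairs are positives;
    # the other entities in the same batch act as negatives.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```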

Pipeline



An overview of Tora2, which consists of a decoupled personalization extractor (DPE), a trajectory extractor, a video diffusion transformer, and a gated self-attention mechanism for entity binding. The DPE generates personalization embeddings by combining high-frequency detail information, extracted via a decoupled strategy for both human and non-human objects, with the low-frequency semantic features obtained from DINOv2. The trajectory extractor encodes the provided trajectories into motion embeddings, which are bound to visual entities through the gated self-attention mechanism (see the sketch below). The bound motion and personalization conditions are then fed into the video diffusion transformer, which employs a motion-guidance fuser and an additional cross-attention layer to achieve both motion and appearance customization for multiple entities.
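A minimal sketch of what a gated self-attention binding block could look like, assuming each entity's motion, text, and personalization embeddings are concatenated along the token dimension. The module name, zero-initialized gate, and head count are illustrative assumptions, not Tora2's released implementation.

```python
import torch
import torch.nn as nn

class GatedEntityBinding(nn.Module):
    """Hypothetical gated self-attention block that binds an entity's
    motion embedding to its text and personalization tokens."""

    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Learnable gate, initialized to zero so the binding branch starts as
        # an identity mapping and is blended in gradually during training.
        self.gate = nn.Parameter(torch.zeros(1))

    def forward(self, entity_tokens):
        # entity_tokens: (batch, tokens, dim) concatenation of one entity's
        # motion, text, and personalization embeddings.
        x = self.norm(entity_tokens)
        attn_out, _ = self.attn(x, x, x)
        return entity_tokens + torch.tanh(self.gate) * attn_out
```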



Comparison



BibTeX


@misc{zhang2025tora2motionappearancecustomized,
      title={Tora2: Motion and Appearance Customized Diffusion Transformer for Multi-Entity Video Generation}, 
      author={Zhenghao Zhang and Junchao Liao and Xiangyu Meng and Long Qin and Weizhi Wang},
      year={2025},
      eprint={2507.05963},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2507.05963}, 
}