Tora: Trajectory-oriented Diffusion Transformer for Video Generation

Zhenghao Zhang*  Junchao Liao*  Menghao Li  ZuoZhuo Dai
Bingxue Qiu  Siyu Zhu  Long Qin  Weizhi Wang

Alibaba Group
*Equal Contribution
📢 Update: We released our inference code and model weights at https://github.com/alibaba/Tora

arXiv Github



A close-up shot reveals two roses, one in vibrant purple and the other in sunny yellow, swaying gently in unison against the backdrop of a snow-covered mountain range. The video meticulously highlights the striking contrast between the vivid colors of the roses and the pristine whiteness of the mountains. The delicate movements of the roses are captured with precision, enhancing the visual harmony and captivating the viewer with the interplay of color and nature. The scene remains free of additional text or objects, focusing solely on the enchanting beauty of the roses amidst the majestic mountain landscape.



A vintage wooden sailboat glides steadily down a mist-covered river, surrounded by dense, lush green forest. The boat takes center stage, its timeless elegance highlighted by the ethereal mist that envelops the scene, creating a mystical atmosphere. The video captures the serene journey, emphasizing the untouched beauty of nature. The scene remains free of additional text or objects, allowing viewers to be fully immersed in the tranquil and almost otherworldly setting as the sailboat cuts through the still waters.
A flock of seagulls soars gracefully through a vibrant underwater world teeming with colorful coral reefs. The video captures the surreal combination of seagulls flying underwater, their smooth movements in stark contrast to the lush corals surrounding them. The background is a vivid tapestry of oceanic hues, alive with the swaying forms of corals, creating a dreamlike atmosphere. The scene is devoid of additional text or objects, allowing the viewer to fully immerse in the fantastical and harmonious blend of avian grace and underwater wonder.
A picturesque countryside scene with rolling green hills, dotted with quaint cottage houses and a charming farmhouse with a chimney. The foreground features a gently flowing stream with a wooden fence along its banks and a path winding through fields of wildflowers. Puffy white clouds float in a clear blue sky. The camera moves left to reveal more of the idyllic landscape, including additional houses and trees.
A crucian carp swims gracefully across the red, rocky surface of Mars. The video focuses on the fluid motion of the carp, contrasting sharply with the striking Martian landscape. The stark juxtaposition of the aquatic life and the desolate Martian terrain underscores the dreamlike atmosphere, drawing viewers into this otherworldly scenario. No additional text or objects are present, allowing the viewer to fully appreciate the intriguing and unexpected fusion of elements.
Two men cycle steadily along a highway under a clear, sunny sky. The video captures the rhythmic cadence of their pedaling and the long, seemingly endless stretch of the highway ahead. The background, lined with expansive farmlands, enhances the sense of movement and freedom. The scene is devoid of additional text or distractions, allowing viewers to fully immerse in the simplicity and liberation of the journey, highlighting the harmony between the cyclists and the vast, open agricultural landscape.

A captivating snowflake glass ball revolves gently, set against a static winter display shelf in the background. The camera lingers on the intricate details within the glass, highlighting the delicate, crystalline patterns of the snowflake. Surrounding the centerpiece are various winter-themed decorations that enhance the festive ambiance. The backdrop remains consistently enchanting, evoking a serene and magical winter wonderland. The video invites viewers to appreciate the artistry and seasonal charm encapsulated within the snowflake glass ball, focusing solely on the elegance of the display without additional elements.
In a sunlit, verdant meadow, a joyful brown puppy sits contentedly on the lush green grass, its tail gently wagging with happiness. The video captures the delightful moment as the puppy turns its head with curiosity and bliss, its eyes sparkling with playful energy. The simple, expansive field in the background serves to highlight the puppy's cheerful demeanor, free from any distractions. The scene is bathed in warm sunlight, enhancing the natural beauty and serene atmosphere of the moment. This video beautifully encapsulates the innocent joy and wonder of the puppy's carefree existence.
A serene and captivating close-up shot of grass swaying gently in the wind. Each blade of grass moves in a rhythmic dance, the subtle undulations highlighted by the breeze. The background is softly blurred, ensuring the focus remains on the delicate movement of the grass, enhancing the calming and tranquil ambiance of the scene. The interplay of light and shadow on the green blades contributes to a harmonious, natural aesthetic that embodies peacefulness and simplicity. There are no extraneous texts or objects within the frame, allowing the viewer to fully immerse in the serene beauty of nature.
Two luminous lanterns ascend simultaneously into the night sky, their warm, gentle glow standing out vividly against the dark backdrop. Rising in perfect harmony, they create a serene and magical ascent, their soft light painting a tranquil picture. The scene unfolds in a quiet, open area, devoid of any other distractions, allowing the viewer's attention to focus solely on the ethereal beauty of the lanterns' flight. The stillness of the night is accentuated by the absence of other elements, making the moment feel profoundly peaceful.
On a bright, sunlit day, two adorable kittens walk side by side along the golden sands of a serene beach. The camera faithfully follows their gentle movements, capturing every delicate paw print and graceful stride. In the background, gentle waves softly lap against the shore under a vast, clear blue sky, where the horizon seems to blend seamlessly with the sea. The sunlight bathes the beach in a warm, golden glow, enhancing the tranquil and harmonious atmosphere of the scene. This creates a peaceful and charming coastal tableau, perfect in its simplicity and beauty.


Trajectory Control

Tora ensures the generated movements precisely follow the specified trajectories while authentically replicating the dynamics of the physical world.

trajectory/image
sample 1
sample 2
sample 3

A single soap bubble floats gently, rising amidst a field of blooming wildflowers. The video captures the bubble's delicate and iridescent nature, set against a beautiful floral backdrop. The colorful wildflowers sway softly in the background, enhancing the bubble's ephemeral beauty. The scene remains devoid of additional text or objects, allowing viewers to fully immerse themselves in the serene and captivating moment, highlighting the fragility and transient charm of the bubble amidst the vibrant natural surroundings.
A red helium balloon floats slowly upward into the sky over a barren desert and expansive Gobi landscape. The camera tracks the balloon's ascent, capturing its vivid color against the stark, arid terrain below. The endless stretches of sand and gravel provide a striking contrast to the balloon’s bright and buoyant presence. No additional text or objects intrude upon the scene, allowing viewers to fully immerse themselves in the serene and contemplative journey of the balloon as it rises against the boundless desert backdrop.
A jellyfish floats upward near the ocean floor, its translucent body pulsating gracefully with each movement. The camera captures the mysterious underwater environment, focusing solely on the jellyfish's elegant ascent against the dark, deep-toned water. The seabed below is dimly lit, adding to the ethereal and almost otherworldly atmosphere. The scene is devoid of additional text or objects, allowing viewers to fully immerse themselves in the mesmerizing and serene dance of the jellyfish amidst the vast, shadowy depths of the ocean.

Amidst a serene twilight setting, a teddy bear sits cheerfully on a skateboard, illuminated softly against the starlit sky. The video captures the endearing motion of the teddy bear as it sways gently from side to side, creating a whimsical and heartwarming scene. The dark silhouette of forest trees in the background adds to the quaint charm, evoking a sense of peaceful solitude. The scene allows viewers to fully immerse themselves in the warm and gentle movements of the teddy bear, its quiet joy set against the tranquil, starry night.
A serene wide-angle shot captures a sunflower gently swaying in the breeze, set against the backdrop of an expansive golden wheat field under the setting sun. The scene emphasizes the delicate movement of the sunflower, which stands tall amid the endless sea of wheat, bathed in the warm, orange glow of the sunset. The golden hues of the wheat field glisten in the twilight, adding depth and richness to the landscape. As the wind rustles through the fields, the camera lingers on the sunflower, highlighting its vibrant petals and the intricate details of its seeds. The video, devoid of any text or additional objects, immerses the viewer in the tranquil beauty and the harmonious dance between nature and light.
In a serene outdoor setting, a delicate glass jar hangs, gently swaying from side to side in the soft breeze. The video captures the jar's subtle motion against a soothing backdrop of blurred, lush greenery, bathed in natural light. With a focused close-up, the camera reveals intricate details of the jar’s smooth surface and the soft reflections dancing upon it. The play of light and shadow enhances the tranquil and harmonious atmosphere. Its simplicity and elegance amplifying the peacefulness of the natural surroundings.

A solitary maple leaf descends gracefully from left to right against the backdrop of a tranquil lake, surrounded by a frame of vibrant maple trees. The video captures the leaf’s elegant fall in exquisite detail, creating a striking contrast between its gentle motion and the stillness of the water. The scene exudes a sense of calm as the maple leaf floats downward. The video focuses solely on this peaceful moment, with no additional text or objects present, creating an immersive experience that highlights the delicate balance of movement and stillness in nature.
In the tranquil serenity of a quiet meadow, a single feather descends slowly and gracefully from the sky. As it floats through the gentle breeze, the camera meticulously tracks its delicate and effortless dance against a backdrop of lush green grass. The scene captures the essence of peacefulness, with the feather's light, airy movement standing out in contrast to the vibrant foliage. The feather sways gently, as if whispering secrets to the winds, before finally coming to rest softly upon the meadow's verdant carpet. This moment of natural beauty is enriched by the stillness of the atmosphere. The camera's elegant focus on the feather, uninterrupted by any additional text or objects, allows viewers to fully immerse themselves in the grace and simplicity of the scene.
In the foreground, a delicate rose sways rhythmically in the gentle breeze, its petals captured through a soft-focus lens that imparts an ethereal quality. The background features a vibrant city park, teeming with life and energy. People bustle about, children laugh and play, and friends gather on benches and grassy areas. This scene highlights the striking contrast between the graceful, solitary movement of the rose and the dynamic, lively atmosphere of the park that stretches out behind it. The video beautifully juxtaposes the serene elegance of nature with the vibrant pulse of urban life.

A dynamic aerial shot showcases a mountain waterfall cascading down in the early morning light. The video captures the powerful flow of the waterfall, with surrounding mist and rays of sunlight piercing through, creating a dramatic and enchanting effect. The rugged landscape of the mountain, combined with the cascading water and ethereal mist, paints a picture of natural beauty and raw power. No additional text or objects intrude upon the scene, allowing viewers to fully absorb the stunning, ever-changing interplay of light, water, and mist in this breathtaking natural setting.

Abstract


Recent advancements in Diffusion Transformer (DiT) have demonstrated remarkable proficiency in producing high-quality video content. Nonetheless, the potential of transformer-based diffusion models for effectively generating videos with controllable motion remains an area of limited exploration. This paper introduces Tora, the first trajectory-oriented DiT framework that integrates textual, visual, and trajectory conditions concurrently for video generation. Specifically, Tora consists of a Trajectory Extractor (TE), a Spatial-Temporal DiT, and a Motion-guidance Fuser (MGF). The TE encodes arbitrary trajectories into hierarchical spacetime motion patches with a 3D video compression network. The MGF integrates the motion patches into the DiT blocks to generate consistent videos following trajectories. Our design aligns seamlessly with DiT’s scalability, allowing precise control of video content’s dynamics with diverse durations, aspect ratios, and resolutions. Extensive experiments demonstrate Tora’s excellence in achieving high motion fidelity, while also meticulously simulating the movement of physical world.

Method


Overview of the Tora Architecture. For achieving trajectory-controlled DiT-based video generation, we introduce two novel modules: the Trajectory Extractor and the Motion-guidance Fuser. The Trajectory Extractor employs a 3D motion VAE to embed trajectory vectors into the same latent space as video patches, effectively preserving motion information across consecutive frames. Subsequently, it uses stacked convolutional layers to extract hierarchical motion features. The Motion-guidance Fuser utilizes adaptive normalization layers to seamlessly inject these multi-level motion conditions into the corresponding DiT blocks, ensuring the generation of videos that consistently follow the defined trajectories. Our method aligns with the scalability of DiT, enabling the creation of high-resolution, motion-controllable videos with prolonged durations

BibTeX


  @misc{tora,
        title={Tora: Trajectory-oriented Diffusion Transformer for Video Generation}, 
        author={Zhenghao Zhang and Junchao Liao and Menghao Li and Long Qin and Weizhi Wang},   
        year={2024},
        eprint={2407.21705},
        archivePrefix={arXiv},
        primaryClass={cs.CV},
        url={https://arxiv.org/abs/2407.21705}, 
  }