Overall Framework
STR-Match first extracts STR scores from a T2V model, then optimizes the target latent using these scores with a negative cosine-similarity loss. A binary mask can optionally be applied to preserve specific regions.
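To make the pipeline concrete, below is a minimal sketch of one editing step. The helper `extract_str_scores(model, latent, t, prompt)` is hypothetical (it stands in for running the T2V model and reading out STR scores), and the optimizer choice, learning rate, and mask convention are our assumptions, not the released implementation.

```python
import torch
import torch.nn.functional as F

def str_match_step(model, latent_src, latent_tgt, t, src_prompt, tgt_prompt,
                   mask=None, lr=0.1, n_steps=3):
    """One denoising step with STR-score-guided latent optimization (sketch)."""
    # Reference STR scores from the source pass (no gradients needed).
    with torch.no_grad():
        s_src = extract_str_scores(model, latent_src, t, src_prompt)  # hypothetical helper

    latent_tgt = latent_tgt.detach().requires_grad_(True)
    opt = torch.optim.SGD([latent_tgt], lr=lr)
    for _ in range(n_steps):
        s_tgt = extract_str_scores(model, latent_tgt, t, tgt_prompt)
        # Negative cosine similarity pulls the target's STR scores toward
        # the source's, transferring spatiotemporal structure.
        loss = -F.cosine_similarity(
            s_tgt.flatten(start_dim=1), s_src.flatten(start_dim=1), dim=-1).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()

    latent_tgt = latent_tgt.detach()
    if mask is not None:
        # Optional binary mask (1 = editable region, an assumed convention):
        # copy source latents back into regions that should be preserved.
        latent_tgt = mask * latent_tgt + (1 - mask) * latent_src
    return latent_tgt
```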

Prior text-guided video editing methods often suffer from limited shape transformation, texture or color mismatches between foreground and background, frame inconsistency, and motion distortion. We attribute these issues to the inadequate modeling of spatiotemporal pixel relevances during the editing process.
To address this, we propose STR-Match, a training-free video editing algorithm that generates visually appealing and spatiotemporally coherent videos through latent optimization guided by our novel STR score. The STR score captures spatiotemporal pixel relevances across adjacent frames by combining the 2D spatial and 1D temporal attention modules of text-to-video (T2V) diffusion models, without relying on computationally expensive 3D attention.
Integrated into a latent optimization framework with a latent mask strategy, STR-Match generates temporally consistent and visually faithful videos, supporting flexible shape transformation while preserving key visual attributes of the source. Extensive experiments demonstrate that STR-Match consistently outperforms previous methods in terms of both visual quality and spatiotemporal consistency. We plan to release the code on GitHub in the near future.
The STR score captures spatiotemporal pixel relevance across frames using 2D spatial and 1D temporal attention, enabling flexible shape transformation while preserving key source attributes.
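One way to picture this composition is sketched below: the per-pixel temporal attention weights how strongly each pixel links frame f to frame f+1, and routing that weight through frame f+1's spatial attention yields a cross-frame relevance map without materializing full 3D attention. The shapes and the composition rule are our reading of the idea, not a verbatim reproduction of the paper's equations.

```python
import torch

def str_relevance(spatial_attn, temporal_attn, f):
    """Approximate pixel relevance between frame f and frame f+1 (sketch).

    spatial_attn:  (F, HW, HW) softmaxed 2D self-attention per frame
    temporal_attn: (HW, F, F)  softmaxed 1D temporal attention per pixel
    """
    # How strongly each pixel attends from frame f to frame f+1.
    w = temporal_attn[:, f, f + 1]                # (HW,)
    # Route that weight through frame f+1's spatial attention, giving a
    # cross-frame map without ever forming full 3D attention.
    return w.unsqueeze(1) * spatial_attn[f + 1]   # (HW, HW)
```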
We compare our method with a baseline that uses a concatenation of self- and temporal-attention maps instead of the STR score.
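For concreteness, the baseline feature can be pictured as follows (shapes follow the sketch above; this is our illustration of the ablation, not the authors' code):

```python
import torch

def baseline_feature(spatial_attn, temporal_attn):
    # spatial_attn: (F, HW, HW); temporal_attn: (HW, F, F).
    # Plain concatenation keeps the two attention maps independent,
    # so no cross-frame pixel relevance is ever modeled.
    return torch.cat([spatial_attn.flatten(), temporal_attn.flatten()])
```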
The following videos show the results of our method and the baseline on three examples. Our method changes the object's shape in a stable manner, whereas the baseline fails to do so and exhibits flickering artifacts.
While STR-Match is demonstrated using LaVie in our main paper, it is compatible with any T2V model equipped with temporal modules, such as Zeroscope.
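As a rough illustration, swapping in such a backbone with diffusers might look like the following; all STR-Match needs is access to the model's spatial and temporal attention modules. The module-name filter matches diffusers' UNet3DConditionModel, and other backbones may name their temporal blocks differently.

```python
import torch
from diffusers import TextToVideoSDPipeline

# Load Zeroscope as an alternative T2V backbone (model id from the
# Hugging Face Hub).
pipe = TextToVideoSDPipeline.from_pretrained(
    "cerspense/zeroscope_v2_576w", torch_dtype=torch.float16)

# Locate the 1D temporal attention blocks to hook for STR-score extraction.
temporal_blocks = [name for name, _ in pipe.unet.named_modules()
                   if "temp_attentions" in name]
```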