Overall Framework
STR-Match first extracts STR scores from a T2V model, then optimizes the target latent with a loss based on the (negative) cosine similarity between the source and target scores. A binary mask can optionally be used to preserve specific regions.
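To make the optimization concrete, below is a minimal PyTorch sketch of such a loop. The interface (str_fn, the tensor shapes, and the mask-blending step) reflects our assumptions for exposition, not the released implementation.

import torch
import torch.nn.functional as F

def optimize_target_latent(z_src, str_src, str_fn, mask=None, steps=50, lr=1e-2):
    """Minimal sketch of latent optimization with an STR-matching loss.

    z_src   : source latent, shape (F, C, H, W)
    str_src : STR scores extracted from the source pass through the T2V model
    str_fn  : callable returning STR scores for a given latent under the
              target prompt (hypothetical interface)
    mask    : optional binary mask, 1 where the source should be preserved
    """
    z = z_src.clone().requires_grad_(True)   # target latent, initialized from source
    opt = torch.optim.Adam([z], lr=lr)

    for _ in range(steps):
        opt.zero_grad()
        str_tgt = str_fn(z)                  # STR scores of the current target latent
        # Negative cosine similarity: maximizing similarity keeps the target's
        # spatiotemporal relevance aligned with the source's.
        loss = -F.cosine_similarity(
            str_tgt.flatten(1), str_src.flatten(1), dim=-1
        ).mean()
        loss.backward()
        opt.step()

        if mask is not None:
            # Optionally pin masked regions back to the source latent.
            with torch.no_grad():
                z.data = mask * z_src + (1 - mask) * z.data

    return z.detach()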

Previous text-guided video editing methods often struggle with limited shape transformation, texture mismatches between foreground and background, temporal inconsistency, and motion distortion. We attribute these limitations to insufficient modeling of spatiotemporal pixel relevance during the editing process.
To address this, we propose STR-Match, a training-free video editing algorithm that produces visually appealing and spatiotemporally coherent videos through latent optimization guided by our novel STR score. The score captures spatiotemporal pixel relevance across adjacent frames by leveraging 2D spatial attention and 1D temporal modules in text-to-video (T2V) diffusion models, without the overhead of computationally expensive 3D attention mechanisms.
Integrated into a latent optimization framework with a latent mask, STR-Match generates temporally consistent and visually faithful videos, and is capable of handling various editing scenarios while preserving key visual attributes of the source. Extensive experiments demonstrate that STR-Match consistently outperforms existing methods in both visual quality and spatiotemporal consistency.
The STR score captures spatiotemporal pixel relevance across frames using 2D spatial and 1D temporal attention, enabling flexible shape transformation while preserving key source attributes.
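As an illustration, one plausible way to compose per-frame 2D spatial attention with per-pixel 1D temporal attention into an adjacent-frame relevance map is sketched below; the shapes and the composition rule are assumptions for exposition, not the paper's exact definition.

import torch

def str_score(spatial_attn, temporal_attn):
    """Illustrative composition of spatial and temporal attention.

    spatial_attn  : (F, N, N)  per-frame self-attention over N pixels
    temporal_attn : (N, F, F)  per-pixel attention over F frames
    returns       : (F-1, N, N) relevance of pixel i in frame t
                    to pixel j in frame t+1 (shapes are assumptions)
    """
    num_frames = spatial_attn.shape[0]
    scores = []
    for t in range(num_frames - 1):
        # 1D temporal: each pixel's attention weight from frame t to t+1.
        w = temporal_attn[:, t, t + 1]                   # (N,)
        # 2D spatial: spread that weight over frame t+1's spatial layout.
        scores.append(w[:, None] * spatial_attn[t + 1])  # (N, N)
    return torch.stack(scores)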
We compare our method with a baseline that uses the concatenation of self- and temporal-attention maps in place of the STR score (a minimal sketch of this baseline follows the videos below).
The following videos show the results of our method and the baseline on three different examples. As observed, our method effectively changes the object’s shape in a stable manner, whereas the baseline fails to do so and exhibits flickering artifacts.
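For reference, here is a minimal sketch of the baseline's score under the same assumed shapes as above, where the two attention maps are simply concatenated rather than composed:

import torch

def baseline_score(spatial_attn, temporal_attn):
    """Hypothetical baseline: flatten and concatenate the self- and
    temporal-attention maps instead of composing them into an STR score."""
    return torch.cat([spatial_attn.flatten(), temporal_attn.flatten()])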
While STR-Match is demonstrated using LaVie in our main paper, it is compatible with any T2V model equipped with temporal modules, such as Zeroscope.
@misc{lee2025strmatchmatchingspatiotemporalrelevance,
  title={STR-Match: Matching SpatioTemporal Relevance Score for Training-Free Video Editing},
  author={Junsung Lee and Junoh Kang and Bohyung Han},
  year={2025},
  eprint={2506.22868},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2506.22868},
}