STR-Match: Matching SpatioTemporal Relevance Score for Training-Free Video Editing
Prior text-guided video editing methods often suffer from limited shape transformation, texture or color mismatches between foreground and background, frame inconsistency, and motion distortion. We attribute these issues to inadequate modeling of spatiotemporal pixel relevance during the editing process. To address this, we propose STR-Match, a training-free video editing algorithm that generates visually appealing and spatiotemporally coherent videos through latent optimization guided by our novel STR score. The STR score captures spatiotemporal pixel relevance across adjacent frames using the 2D spatial and 1D temporal attention modules of text-to-video (T2V) diffusion models, without relying on computationally expensive 3D attention. Integrated into a latent optimization framework with a latent mask strategy, STR-Match generates temporally consistent and visually faithful videos, supporting flexible shape transformation while preserving key visual attributes of the source. Extensive experiments demonstrate that STR-Match consistently outperforms previous methods in both visual quality and spatiotemporal consistency. We plan to release the code on GitHub.
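To make the core idea concrete, the following is a minimal sketch of how spatiotemporal relevance might be approximated from factorized attention, i.e., by fusing per-frame 2D spatial attention maps with 1D temporal attention linking adjacent frames instead of computing full 3D attention. The function name `str_score`, the map shapes, and the multiplicative fusion are illustrative assumptions; the paper's actual formulation may differ.

```python
import numpy as np

def str_score(spatial_attn: np.ndarray, temporal_attn: np.ndarray) -> float:
    """Illustrative (assumed) spatiotemporal relevance score.

    spatial_attn:  (F, H, W)   per-frame cross-attention map of the edited
                               concept, taken from a 2D spatial attention module
    temporal_attn: (F-1, H, W) pixel relevance between adjacent frame pairs,
                               taken from a 1D temporal attention module
    """
    # Regions relevant in both frames of each adjacent pair (factorized
    # stand-in for joint 3D spatiotemporal attention).
    pair = spatial_attn[:-1] * spatial_attn[1:]
    # Modulate by temporal relevance and aggregate to a scalar score.
    return float((pair * temporal_attn).mean())

# Example with random attention maps in [0, 1]: 8 frames at 16x16 resolution.
rng = np.random.default_rng(0)
spatial = rng.random((8, 16, 16))
temporal = rng.random((7, 16, 16))
score = str_score(spatial, temporal)
```

With all maps normalized to [0, 1], the score also lies in [0, 1]; it could then serve as a guidance signal in latent optimization, as the abstract describes. This avoids the O((FHW)^2) cost of full 3D attention while still coupling spatial and temporal relevance.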