STR-Match: Matching SpatioTemporal Relevance Score for Training-Free Video Editing
Prior text-guided video editing methods often suffer from limited shape transformation, texture or color mismatches between foreground and background, frame inconsistency, and motion distortion. We attribute these issues to inadequate modeling of spatiotemporal pixel relevance during the editing process. To address this, we propose STR-Match, a training-free video editing algorithm that generates visually appealing and spatiotemporally coherent videos through latent optimization guided by our novel STR score. The STR score captures spatiotemporal pixel relevance across adjacent frames using the 2D spatial and 1D temporal attention modules of text-to-video (T2V) diffusion models, without relying on computationally expensive 3D attention. Integrated into a latent optimization framework with a latent mask strategy, STR-Match generates temporally consistent and visually faithful videos, supporting flexible shape transformation while preserving key visual attributes of the source. Extensive experiments demonstrate that STR-Match consistently outperforms previous methods in both visual quality and spatiotemporal consistency. We plan to release the code on GitHub.
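To make the core idea concrete, the following is a minimal sketch of how spatiotemporal relevance might be approximated from factorized attention, i.e., by fusing per-frame 2D spatial attention maps with 1D temporal attention linking adjacent frames instead of computing full 3D attention. The function name `str_score`, the map shapes, and the multiplicative fusion are illustrative assumptions; the paper's actual formulation may differ.

```python
import numpy as np

def str_score(spatial_attn: np.ndarray, temporal_attn: np.ndarray) -> float:
    """Illustrative (assumed) spatiotemporal relevance score.

    spatial_attn:  (F, H, W)   per-frame cross-attention map of the edited
                               concept, taken from a 2D spatial attention module
    temporal_attn: (F-1, H, W) pixel relevance between adjacent frame pairs,
                               taken from a 1D temporal attention module
    """
    # Regions relevant in both frames of each adjacent pair (factorized
    # stand-in for joint 3D spatiotemporal attention).
    pair = spatial_attn[:-1] * spatial_attn[1:]
    # Modulate by temporal relevance and aggregate to a scalar score.
    return float((pair * temporal_attn).mean())

# Example with random attention maps in [0, 1]: 8 frames at 16x16 resolution.
rng = np.random.default_rng(0)
spatial = rng.random((8, 16, 16))
temporal = rng.random((7, 16, 16))
score = str_score(spatial, temporal)
```

With all maps normalized to [0, 1], the score also lies in [0, 1]; it could then serve as a guidance signal in latent optimization, as the abstract describes. This avoids the O((FHW)^2) cost of full 3D attention while still coupling spatial and temporal relevance.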