EditVerse: Unifying Image and Video Editing and Generation with In-Context Learning

¹Adobe Research · ²The Chinese University of Hong Kong · ³Johns Hopkins University · *Corresponding Author
EditVerse unifies a diverse range of generation and editing tasks for both images and videos within a single, powerful model.

Methodology

EditVerse is built on two core principles: enabling powerful in-context learning and ensuring maximum flexibility.

(1) Unified Representation for In-Context Learning: We represent all modalities (text, images, and videos) as a single one-dimensional token sequence, which is fed into a model built on a full self-attention architecture. This approach allows the model to jointly process and align different modalities, leading to enhanced text comprehension and improved editing quality. Most importantly, it facilitates natural knowledge transfer from the data-rich image domain to the data-scarce video domain.
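To make the token packing concrete, here is a minimal sketch of how text, image, and video tokens could be concatenated into one 1-D sequence and processed by a full self-attention block. The dimensions, sequence lengths, and module structure below are illustrative assumptions, not the released EditVerse implementation.

```python
# Minimal sketch (assumed shapes and modules, not the actual EditVerse code) of
# packing text, image, and video tokens into one 1-D sequence and running full
# self-attention over the joint sequence.
import torch
import torch.nn as nn

D = 1024  # shared token width (assumption)

class UnifiedSequenceBlock(nn.Module):
    """One transformer block with full self-attention over the joint sequence."""
    def __init__(self, dim=D, heads=16):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x):
        h = self.norm1(x)
        # Every token (text, image, or video) attends to every other token.
        x = x + self.attn(h, h, h, need_weights=False)[0]
        return x + self.mlp(self.norm2(x))

# Hypothetical per-modality tokens for one sample:
text_tokens  = torch.randn(1, 77, D)           # encoded instruction
image_tokens = torch.randn(1, 32 * 32, D)      # patchified source image
video_tokens = torch.randn(1, 8 * 16 * 16, D)  # patchified video latents (T*H*W)

# Concatenate into one 1-D sequence so all modalities share the same context.
sequence = torch.cat([text_tokens, image_tokens, video_tokens], dim=1)
out = UnifiedSequenceBlock()(sequence)
print(out.shape)  # torch.Size([1, 3149, 1024])
```

Because the image and video tokens sit in the same attended sequence, gradients from image-editing examples shape the same weights used for video editing, which is one way to read the knowledge-transfer claim above.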

Figure: text, image, and video tokens combined into a unified sequence. We design a unified framework for image and video editing and generation that processes text and vision inputs as a single sequence; the right part of the figure shows our position embedding design. The framework leverages full self-attention to facilitate robust in-context learning and effective knowledge transfer across modalities.

(2) Flexibility via Interleaved Design: Inspired by Multimodal Large Language Models (MLLMs), we use an interleaved design for text, image, and video inputs. This allows EditVerse to process inputs and outputs of arbitrary resolution, duration, and sequential position. To help the model distinguish between these different dimensions, we introduce a novel four-dimensional Rotary Position Embedding (RoPE) that incorporates sequential, temporal, height, and width information.
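As a rough illustration of the four-dimensional RoPE idea, the sketch below splits each attention head's channels into four groups and rotates each group by one coordinate: sequence index, frame, row, and column. The split sizes, frequency base, and helper names (rope_angles, apply_rotary, rope_4d) are hypothetical, not the paper's exact configuration.

```python
# Hedged sketch of a four-axis rotary position embedding: the head dimension is
# split into four groups, each rotated by one coordinate (sequence index,
# temporal frame, height, width). All sizes here are assumptions.
import torch

def rope_angles(pos, dim, base=10000.0):
    """Per-position rotation angles for one axis; `dim` must be even."""
    freqs = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
    return pos[:, None].float() * freqs[None, :]          # (N, dim/2)

def apply_rotary(x, angles):
    """Rotate channel pairs of x (N, dim) by the given angles (N, dim/2)."""
    x1, x2 = x[..., 0::2], x[..., 1::2]
    cos, sin = angles.cos(), angles.sin()
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

def rope_4d(x, seq_pos, t_pos, h_pos, w_pos, split=(32, 32, 32, 32)):
    """Apply rotary embeddings along four axes to per-token features x (N, head_dim)."""
    chunks, coords, start = [], (seq_pos, t_pos, h_pos, w_pos), 0
    for size, pos in zip(split, coords):
        chunks.append(apply_rotary(x[..., start:start + size], rope_angles(pos, size)))
        start += size
    return torch.cat(chunks, dim=-1)

# Hypothetical video item: 4 frames of 8x8 patches, all at interleaved slot 2.
T, H, W, head_dim = 4, 8, 8, 128
t, h, w = torch.meshgrid(torch.arange(T), torch.arange(H), torch.arange(W), indexing="ij")
seq = torch.full((T * H * W,), 2)   # which interleaved item in the sequence
q = torch.randn(T * H * W, head_dim)
q_rot = rope_4d(q, seq, t.flatten(), h.flatten(), w.flatten())
print(q_rot.shape)  # torch.Size([256, 128])
```

Encoding the four coordinates separately lets the same attention weights handle items of arbitrary resolution, duration, and position in the interleaved sequence, since relative offsets along each axis remain meaningful regardless of how the sequence is packed.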

Quantitative Comparison

| Method | VLM Editing Quality ↑ | Video Quality (Pick Score) ↑ | Text Alignment (Frame) ↑ | Text Alignment (Video) ↑ | Temporal Consistency (CLIP) ↑ | Temporal Consistency (DINO) ↑ |
|---|---|---|---|---|---|---|
| *Attention Manipulation (Training-free)* | | | | | | |
| TokenFlow | 5.26 | 19.73 | 25.57 | 22.70 | 98.36 | 98.09 |
| STDF | 4.41 | 19.45 | 25.24 | 22.26 | 96.04 | 95.22 |
| *First-Frame Propagation (w/ End-to-End Training)* | | | | | | |
| Señorita-2M | 6.97 | 19.71 | 26.34 | 23.24 | 98.05 | 97.99 |
| *Instruction-Guided (w/ End-to-End Training)* | | | | | | |
| InsV2V | 5.21 | 19.39 | 24.99 | 22.54 | 97.15 | 96.57 |
| Lucy Edit | 5.89 | 19.67 | 26.00 | 23.11 | 98.49 | 98.38 |
| EditVerse (Ours) | **7.65** | **20.07** | **26.73** | **23.93** | **98.56** | **98.42** |
| *Closed-Source Commercial Models* | | | | | | |
| Runway Aleph | 7.44 | 20.42 | 27.70 | 24.27 | 98.94 | 98.60 |

Quantitative comparison on EditVerseBench. For open-source research models, we compare two training-free methods (TokenFlow and STDF), one first-frame propagation method (Señorita-2M), and two instruction-guided video editing methods (InsV2V and Lucy Edit). The best results among open-source research models are highlighted in bold. We also provide the results of a commercial model, Runway Aleph. While EditVerse lags behind Runway Aleph in generation quality due to differences in the base model, it surpasses Aleph in editing faithfulness (the VLM evaluation of editing quality), a metric that aligns more closely with human judgment, as further validated by user studies (see paper).

User Study

Figure: user study win/tie/lose rates of EditVerse against STDF, TokenFlow, InsV2V, Lucy Edit Dev, and Runway Aleph (exact percentages in the paper).

Citation

If you find our work useful for your research, please consider citing our paper:

@article{ju2025editverse,
  title   = {EditVerse: Unifying Image and Video Editing and Generation with In-Context Learning},
  author  = {Xuan Ju and Tianyu Wang and Yuqian Zhou and He Zhang and Qing Liu and Nanxuan Zhao and Zhifei Zhang and Yijun Li and Yuanhao Cai and Shaoteng Liu and Daniil Pakhomov and Zhe Lin and Soo Ye Kim and Qiang Xu},
  journal = {arXiv preprint arXiv:2509.20360},
  year    = {2025},
  url     = {https://arxiv.org/abs/2509.20360}
}