Below are five recent research papers from top conferences (CVPR, NeurIPS, WACV, and EMNLP) on image editing using natural language instructions, followed by the best-performing model for this task and the datasets commonly used in this field.
-
InstructPix2Pix: Learning to Follow Image Editing Instructions (CVPR 2023) – Tim Brooks, Aleksander Holynski, Alexei A. Efros. This work introduced a diffusion-based model for image editing that follows natural language instructions. To overcome data scarcity, the authors generated a large synthetic training set by pairing a language model (GPT-3) with a text-to-image model (Stable Diffusion) to create many “before and after” image examples with instructions. The resulting model (InstructPix2Pix) is a conditional diffusion model that generalizes to real images and user-written instructions, performing edits directly at inference time without per-example inversion or fine-tuning (making it relatively fast, on the order of seconds per edit).
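A minimal inference sketch, assuming the Hugging Face diffusers library and the publicly released timbrooks/instruct-pix2pix checkpoint (verify both are available in your environment; guidance values are illustrative defaults):

```python
# Minimal sketch: instruction-based editing with the InstructPix2Pix checkpoint
# via Hugging Face diffusers. Assumes the "timbrooks/instruct-pix2pix" weights
# can be downloaded; adjust dtype/device for your hardware.
import torch
from diffusers import StableDiffusionInstructPix2PixPipeline
from PIL import Image

pipe = StableDiffusionInstructPix2PixPipeline.from_pretrained(
    "timbrooks/instruct-pix2pix", torch_dtype=torch.float16
).to("cuda")

image = Image.open("input.jpg").convert("RGB")
edited = pipe(
    "make it look like winter",   # natural-language instruction
    image=image,
    num_inference_steps=20,       # diffusion denoising steps
    image_guidance_scale=1.5,     # how closely to stay with the input image
    guidance_scale=7.5,           # how strongly to follow the instruction
).images[0]
edited.save("edited.jpg")
```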
-
Focus on Your Instruction: Fine-grained and Multi-instruction Image Editing by Attention Modulation (CVPR 2024) – Qin Guo et al. This paper proposes “FoI,” a method to handle complex image edits guided by multiple sequential instructions without additional training or optimization steps. The key idea is to modulate the cross-attention in a diffusion model so that each instruction only affects its relevant regions, preventing unwanted changes to other parts of the image. This approach targets the over-editing problem and achieves more precise, localized edits. In experiments, FoI set state-of-the-art results: it best preserved original image details while still following instructions, and human evaluators strongly preferred FoI’s results (over 80% preference) for multi-instruction editing tasks.
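The following is only a toy sketch of the general cross-attention modulation idea (damping an instruction's influence on pixels outside its estimated region), not FoI's actual implementation; the region mask and instruction-token mask are assumed to come from elsewhere (e.g. attention maps or segmentation):

```python
# Toy sketch of cross-attention modulation for localized instruction editing.
# Not the FoI implementation: region_mask and instr_token_mask are assumed inputs.
import torch

def modulated_cross_attention(q, k, v, region_mask, instr_token_mask, damp=0.1):
    """q: (pixels, d); k, v: (tokens, d);
    region_mask: (pixels,) 1 inside the instruction's target region;
    instr_token_mask: (tokens,) 1 for tokens belonging to that instruction."""
    scores = q @ k.T / q.shape[-1] ** 0.5                  # (pixels, tokens) logits
    scores = scores - scores.max(dim=-1, keepdim=True).values  # numerical stability
    # Outside the target region, scale down attention to the instruction's tokens
    # so the instruction cannot rewrite unrelated pixels.
    outside = (1.0 - region_mask).unsqueeze(1) * instr_token_mask.unsqueeze(0)
    weights = torch.exp(scores) * (1.0 - (1.0 - damp) * outside)
    attn = weights / weights.sum(dim=-1, keepdim=True)     # renormalize rows
    return attn @ v                                        # (pixels, d) attended features

# Tiny usage example with random tensors.
q, k, v = torch.randn(64, 32), torch.randn(10, 32), torch.randn(10, 32)
region = (torch.rand(64) > 0.5).float()
instr_tokens = torch.tensor([0, 1, 1, 1, 0, 0, 0, 0, 0, 0]).float()
out = modulated_cross_attention(q, k, v, region, instr_tokens)   # shape (64, 32)
```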
-
MagicBrush: A Manually Annotated Dataset for Instruction-Guided Image Editing (NeurIPS 2023) – Kai Zhang et al. This work presented the first large-scale human-annotated dataset for text-driven image editing and used it to improve model performance. MagicBrush contains over 10,000 triplets of a source image, a user instruction, and the corresponding edited image (covering single-turn edits, multi-turn dialogues, cases with provided edit masks, and mask-free cases). Using this dataset, the authors fine-tuned an instruction-following diffusion model (InstructPix2Pix) and achieved significantly better edited images according to human evaluation. Their analysis also showed existing models struggle on this benchmark, underlining the gap between current methods and real-world editing needs.
-
Interactive Image Manipulation With Complex Text Instructions (WACV 2023) – Ryugo Morita et al. This research addresses challenges in interactive text-based image editing, where users may issue detailed or sequential instructions. Prior methods often failed to preserve parts of the image unrelated to the instruction or were limited to changing simple attributes. Morita et al. propose a novel framework that separates an image into text-relevant and text-irrelevant regions to protect unmentioned content. Their system can perform complex actions like enlarging or shrinking objects, removing objects, or replacing the background based on the instruction. They achieve this with three strategies: (1) editing only the relevant region and keeping other areas intact, (2) using a super-resolution module to upscale the region of interest for more precise editing, and (3) providing a user interface to manually adjust the segmentation mask if needed. Experiments on the CUB and COCO datasets demonstrated that this method enables flexible and accurate edits in real-time, outperforming previous state-of-the-art methods in both fidelity and manipulation capability.
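The "edit only the relevant region" strategy can be illustrated with a simple compositing step (a generic sketch, not the authors' code; the mask and edited image are assumed to come from a segmentation step and an editing model, and all three files are assumed to share the same resolution):

```python
# Generic sketch: keep text-irrelevant regions untouched by compositing the
# model's output back into the original image through a region mask
# (mask is 1 inside the region the instruction refers to, 0 elsewhere).
import numpy as np
from PIL import Image

original = np.asarray(Image.open("input.jpg").convert("RGB"), dtype=np.float32)
edited = np.asarray(Image.open("model_output.jpg").convert("RGB"), dtype=np.float32)
mask = np.asarray(Image.open("region_mask.png").convert("L"), dtype=np.float32)[..., None] / 255.0

composite = mask * edited + (1.0 - mask) * original   # blend only inside the mask
Image.fromarray(composite.astype(np.uint8)).save("composited.jpg")
```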
-
Learning to Follow Object-Centric Image Editing Instructions Faithfully (Findings of EMNLP 2023) – Tuhin Chakrabarty et al. This NLP-focused work looks at improving the faithfulness and precision of image edits guided by language. The authors identify common issues: instructions can be underspecified, it may be unclear where to apply the edit (grounding), and models sometimes unintentionally alter parts of the image that should remain unchanged. To tackle this, they enhance the training process of a diffusion-based editor by generating better training pairs: they use vision segmentation and chain-of-thought prompting with a VQA model to pinpoint which object or region an instruction refers to. By training on these improved (less noisy) instruction-image pairs, their model performed more fine-grained, object-specific edits than prior baselines. In both automated metrics and human evaluations, it showed higher fidelity (preserving elements not meant to change) and could even generalize to some cases beyond its training domain (e.g. metaphorical instructions) better than state-of-the-art models at the time.
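As a simplified stand-in for the paper's segmentation-plus-VQA grounding pipeline, one can score candidate object crops against the instruction with CLIP to decide where an edit should apply (the crops are assumed to come from any off-the-shelf segmentation or detection step; this is not the authors' method):

```python
# Simplified grounding sketch (not the paper's VQA/chain-of-thought pipeline):
# rank candidate object crops by CLIP similarity to the instruction.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

instruction = "make the dog's collar red"
crops = [Image.open(p) for p in ["crop_dog.jpg", "crop_sofa.jpg", "crop_lamp.jpg"]]

inputs = processor(text=[instruction], images=crops, return_tensors="pt", padding=True)
with torch.no_grad():
    out = model(**inputs)
scores = out.logits_per_image.squeeze(1)   # similarity of each crop to the instruction
target = int(scores.argmax())              # index of the crop the edit most likely refers to
print("edit region:", target, scores.tolist())
```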
Emu Edit (CVPR 2024) – Currently, one of the best-performing models for instruction-driven image editing is Emu Edit, introduced by researchers at Meta AI. Emu Edit is a multi-task diffusion model that was trained to handle an unprecedented range of editing tasks within a single unified framework. Instead of specializing only in text-guided edits, it learns to perform 16 different tasks – including region-specific edits, global image changes, style transformation, object addition/removal, and even traditional computer vision tasks – all formulated as text-conditioned image generation tasks. A set of learned task embeddings helps the model understand the intent (e.g. whether the user wants to add an object or adjust the style) and apply the correct type of edit.
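A minimal sketch of the task-embedding idea (purely illustrative, not Meta's architecture): a learned embedding per task type is added to the conditioning that the diffusion backbone already receives, so the model knows which kind of edit is requested.

```python
# Illustrative sketch of multi-task conditioning via learned task embeddings,
# in the spirit of Emu Edit but not its actual architecture. The task id
# (e.g. 0 = object addition, 1 = object removal, ...) selects an embedding
# that is added to the text conditioning fed to a diffusion U-Net.
import torch
import torch.nn as nn

class TaskConditioner(nn.Module):
    def __init__(self, num_tasks=16, cond_dim=768):
        super().__init__()
        self.task_embed = nn.Embedding(num_tasks, cond_dim)

    def forward(self, text_cond, task_id):
        """text_cond: (batch, tokens, cond_dim); task_id: (batch,) long tensor."""
        task_vec = self.task_embed(task_id).unsqueeze(1)   # (batch, 1, cond_dim)
        return text_cond + task_vec                        # inject task intent into conditioning

cond = torch.randn(2, 77, 768)        # e.g. CLIP text embeddings
task = torch.tensor([4, 9])           # assumed task indices
fused = TaskConditioner()(cond, task)  # shape (2, 77, 768)
```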
-
Key Features: Emu Edit’s multi-task training allows it to interpret a wide variety of instructions and apply them with high precision. For example, it can take an input image and an instruction like “Dress the person in a red shirt and remove the background,” and it will reliably carry out both a clothing change and background removal in one go. Its training included both free-form instruction-based editing and localized edits with masks or bounding boxes, so it learned to localize changes when necessary and leave the rest of the image untouched. Notably, Emu Edit can also leverage knowledge from classical vision tasks – it was trained on classification, segmentation, and other recognition objectives formulated in a generative manner – which helps it better identify what and where to edit in an image.
-
Strengths: Emu Edit achieves state-of-the-art performance in accuracy of edits and preservation of original content. It significantly outperforms earlier models like InstructPix2Pix on benchmark evaluations, more faithfully executing the given instructions while making minimal unintended alterations. Users found its results more aligned with the requested edits and closer to the original image’s look when irrelevant regions should be preserved. Thanks to its comprehensive training, Emu Edit is also very flexible: it can handle tasks it wasn’t explicitly trained on (such as a combination of edits or a new type of edit) with only a few example demonstrations, showing strong generalization ability. In terms of efficiency and usability, Emu Edit integrates many functions into one model – rather than requiring separate specialized models or lengthy per-image fine-tuning, a single Emu Edit model can address diverse user needs. This makes it a convenient one-stop solution for image editing by instruction. In practice, using Emu Edit involves providing an image and a text prompt, much like previous diffusion models, and it produces the edited image in a reasonable time (comparable to other diffusion-based generators).
-
Weaknesses: Despite its advantages, Emu Edit is a large and complex model. Training it required a massive amount of data and computing (given the 16-task learning setup), which means reproducing or fine-tuning the model can be challenging for those without substantial resources. This complexity could also mean that inference is computationally heavy – each edit still runs a diffusion process, so while it’s faster than optimization-based methods, it may not be real-time on everyday hardware. In terms of accessibility, as of its publication the full model may not be openly available (Meta released the benchmark and described the model, but the weights or code might be proprietary or not yet widely released). This can limit adoption by the community until an open-source equivalent or the model itself is released. Lastly, while Emu Edit generalizes better than past models, extremely complex or out-of-distribution instructions could still pose difficulties – it is not guaranteed to handle every imaginable instruction perfectly, but it represents the current state-of-the-art in balancing accuracy, efficiency, and usability for this task.
-
MagicBrush (2023) – Source: Introduced by Zhang et al. (NeurIPS 2023). Size & Content: ~10k human-annotated examples of instruction-based edits on real images. Each sample includes an original image, an edit instruction in natural language, and the manually edited result (ground truth). Many samples also include an edit mask highlighting the region changed, since the dataset covers both scenarios where a user provides a mask and where no mask is given. Key Attributes: It spans diverse edit types – from simple one-step modifications (“make the sky brighter”) to multi-turn interactive edits and from adding objects to changing styles – making it a comprehensive training and evaluation resource. Availability: Publicly released by the authors; useful for training models (as done with InstructPix2Pix in the MagicBrush paper) and for benchmarking how well models follow real human instructions.
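A loading sketch using the Hugging Face datasets library; the hub id "osunlp/MagicBrush" and the field names are assumptions, so consult the dataset card for the authoritative schema:

```python
# Sketch: inspect MagicBrush with the Hugging Face `datasets` library.
# The hub id and field names below are assumptions; check the dataset card.
from datasets import load_dataset

ds = load_dataset("osunlp/MagicBrush", split="train")
sample = ds[0]
print(sample.keys())               # expected: source image, instruction, mask, target image
print(sample.get("instruction"))   # the natural-language edit request
```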
-
UltraEdit Dataset (2024) – Source: Created by Zhao et al. (NeurIPS 2024) as part of their “UltraEdit” study. Size & Content: Enormous automatically-generated dataset with ~4 million image editing samples. Each sample has a source image, a text instruction, and a synthesized edited image. The instructions are very diverse, thanks to a generation process that used a large language model for creative phrasing plus in-context examples vetted by human raters. The source images are mostly real photographs or artwork (rather than purely AI-generated images) to ensure diversity and realism. Key Attributes: Notably, UltraEdit provides region-level annotations for edits – it knows which part of the image was modified – enabling models to learn localized editing. It was specifically designed to address weaknesses in previous datasets (which were either smaller or noisier). Availability: It is a research dataset released alongside the NeurIPS paper; the authors report that training on it produced new state-of-the-art results on benchmarks such as MagicBrush and the Emu Edit test set. This dataset is valuable for training high-capacity models that need a vast amount of varied examples.
-
Emu-Edit Benchmark (2024) – Source: Released by Meta AI alongside the Emu Edit model (CVPR 2024). Size & Content: A comprehensive evaluation dataset that includes image-instruction pairs categorized into seven distinct edit types. The categories are: (1) background alteration (changing or replacing the background), (2) global change (a broad change affecting the whole image, e.g. time of day or weather), (3) style alteration (changing the art style or color tone), (4) object removal, (5) object addition, (6) localized edit (modifying a specific part or object, e.g. “make the shirt blue”), and (7) texture/color change (altering material or color of something). Each category has a set of examples with a reference image, an instruction, and the expected outcome description (or sometimes example result). Key Attributes: This benchmark is designed to test models on a wide range of tasks in a controlled way. By evaluating on each category, researchers can identify strengths and weaknesses of a model (for instance, one model might excel at style changes but falter at object removals). Availability: The benchmark is publicly released (e.g., via a Hugging Face dataset) to encourage consistent evaluation. It has become a standard for comparing instruction-following image editors – Emu Edit itself was shown to achieve top performance on this benchmark, surpassing prior methods.
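A sketch of per-category evaluation. The metric here is plain CLIP image-text similarity grouped by edit type, a common proxy rather than the benchmark's official protocol, and the `samples` structure (category, edited image path, target description) is an assumed local format:

```python
# Sketch: score edited images per edit-type category with CLIP text-image
# similarity. A proxy metric only; not the benchmark's official evaluation.
from collections import defaultdict
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

samples = [  # assumed format: (edit category, path to edited image, target description)
    ("style", "out_001.jpg", "the photo in watercolor style"),
    ("object_removal", "out_002.jpg", "the street without the parked car"),
]

scores = defaultdict(list)
for category, path, caption in samples:
    inputs = processor(text=[caption], images=Image.open(path), return_tensors="pt", padding=True)
    with torch.no_grad():
        sim = model(**inputs).logits_per_image.item()
    scores[category].append(sim)

for category, vals in scores.items():
    print(category, sum(vals) / len(vals))   # mean similarity per edit type
```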
-
InstructPix2Pix Synthetic Data (2023) – Source: Used by Brooks et al. for training InstructPix2Pix (CVPR 2023). Size & Content: More than 450,000 synthetic training pairs were generated (with a CLIP-filtered subset of roughly 313,000 used for training). Each pair consists of an input image, an editing instruction, and a generated output image that reflects the instruction. The images and instructions were not manually created but came from automation: a fine-tuned GPT-3 produced an edit instruction and a corresponding edited caption for each input caption, and Stable Diffusion, combined with the Prompt-to-Prompt technique, generated a matched pair of before-and-after images from the two captions. Key Attributes: The dataset covers a wide variety of edits in a synthetic manner. For example, GPT-3 might produce the instruction “turn this landscape into winter” along with an edited caption describing a snowy scene, and Stable Diffusion would then generate the summer and winter versions as an aligned image pair. While the resulting pairs can be imperfect or “noisy” (the risk with automated data), the sheer volume provided enough signal for the model to learn the task. Availability: The authors released the generated data alongside their code. The approach also set a trend – later works likewise use large-scale synthetic data generation to bootstrap instruction-based image editing models – and it highlights an important data strategy in this field when curated datasets are lacking.
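A heavily simplified sketch of the pairing idea (not the paper's pipeline, which relies on a fine-tuned GPT-3 and Prompt-to-Prompt attention sharing): here the "before" and "after" images are generated from the two captions with the same random seed so their layouts roughly match.

```python
# Heavily simplified sketch of synthetic pair generation. The real pipeline
# uses a fine-tuned GPT-3 for (instruction, edited caption) and Prompt-to-Prompt
# for tightly aligned image pairs; here we only reuse the random seed so the
# two generations share a rough layout.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

caption_before = "a photograph of a mountain landscape in summer"
caption_after = "a photograph of a mountain landscape in winter"   # edited caption
instruction = "turn this landscape into winter"                    # paired instruction

gen = torch.Generator("cuda").manual_seed(42)
img_before = pipe(caption_before, generator=gen).images[0]
gen = torch.Generator("cuda").manual_seed(42)                       # same seed, similar layout
img_after = pipe(caption_after, generator=gen).images[0]

img_before.save("before.jpg")
img_after.save("after.jpg")
# (instruction, before.jpg, after.jpg) would form one synthetic training triplet
```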
-
CUB and COCO (Caption-Based Edits) – Source: Existing vision datasets repurposed for editing tasks (commonly used in earlier research, e.g. in ManiGAN, CVPR 2020). Size & Content: CUB-200-2011 is a dataset of ~11k bird images with detailed annotations (including attributes and captions), and MS-COCO is a dataset of 120k images across 80 object categories with five captions each. These datasets do not come with “edited” versions of images; instead, researchers generate editing instructions from their annotations or captions. Use in Image Editing: For example, an original COCO caption “a man riding a brown horse on a beach” might be altered to “make the horse white”, and the model is tasked with producing an image reflecting that change. In CUB, an attribute-based instruction could be “change the bird’s wing color to blue” given an image and its attribute data. Models like ManiGAN were evaluated on how well they could modify the input image according to such text while keeping other details the same. Key Attributes: These datasets provide a wide range of real images and semantically rich captions, which researchers leveraged to test editing models’ fidelity and correctness. They are fully public and well-known. However, because they were not originally designed for editing tasks, the evaluation often required manual inspection or simple metrics (like image similarity and text similarity) to judge success. They served as important benchmarks especially before dedicated editing datasets (like MagicBrush or Emu-Edit) emerged, helping validate that a model can handle both fine-grained changes (in CUB, e.g. bird appearance) and diverse objects (in COCO). Both datasets remain useful for certain evaluation cases and can complement purpose-built editing benchmarks.
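A toy sketch of how a caption can be repurposed into an edit instruction by attribute substitution (purely illustrative; actual studies build such edits with templates, annotations, or manual curation):

```python
# Toy sketch: turning a COCO-style caption into an editing instruction by
# substituting one attribute phrase. The phrase pair below is an assumed example.
caption = "a man riding a brown horse on a beach"
old_phrase, new_phrase = "brown horse", "white horse"   # assumed attribute change

target_caption = caption.replace(old_phrase, new_phrase)
instruction = f"make the {old_phrase.split()[-1]} {new_phrase.split()[0]}"

print(instruction)      # -> make the horse white
print(target_caption)   # -> a man riding a white horse on a beach
```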