🧙‍♂️ MILD: Multi-Layer Diffusion Strategy for Multi-IP Aware Human Erasing

MILD Human Erasing Results

Interactive demonstration pairing original images with MILD results: each frame shows the input image followed by the human erasing output, illustrating our method's effectiveness across different scenarios.

Abstract

Recent years have witnessed the remarkable success of diffusion models, especially in various image-customization tasks. Prior work has made notable advances on the simple human-oriented erasing task by leveraging explicit mask guidance and a semantic-aware inpainting paradigm. Despite this progress, existing methods still struggle with the significant human-removal challenges posed by “Multi-IP Interactions” in more complex realistic scenarios, such as human-human occlusions, human-object entanglements, and human-background interferences. These approaches are typically limited by: 1) Dataset deficiency. They lack large-scale multi-IP datasets, especially ones covering complex scenarios such as dense occlusions, camouflaged or distracting backgrounds, and diverse IP interactions. 2) Lack of spatial decoupling. Most methods lack effective spatial-decoupling strategies to disentangle the different foreground instances from the background, which limits their ability to achieve targeted human removal and clean background inpainting. These limitations degrade performance when precisely erasing overlapping individuals and restoring occluded regions in real-world scenes. In this work, we first introduce the MILD dataset, a high-quality human erasing dataset capturing diverse pose variations, occlusions, and complex backgrounds in multi-IP interactive scenarios.

Building on this foundation, we propose Multi-Layer Diffusion (MILD), a novel multi-layer diffusion strategy for precise multi-IP-aware human erasing in complex scenes. Specifically, MILD first decomposes the traditional generation process into semantically separated denoising pathways, enabling independent reconstruction of each foreground instance and the background. To enhance human-centric understanding, we introduce Human Morphology Guidance, a plug-and-play guidance module that integrates pose, parsing, and spatial relations into the generation process, improving morphological awareness and promoting more effective restoration. We further present Spatially-Modulated Attention, an adaptive mechanism that leverages spatial mask priors to modulate attention across semantic regions, yielding fewer boundary artifacts and mitigating semantic leakage. Extensive experiments demonstrate that MILD consistently outperforms state-of-the-art methods on challenging human erasing benchmarks.
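To make the multi-layer idea concrete, here is a minimal numpy sketch of what "semantically separated denoising pathways" could look like at a single step. This is an illustrative assumption, not the released implementation: the function name `layered_denoise_step`, the mask-partition composition, and the toy `denoise_fn` are all hypothetical stand-ins for the actual model.

```python
import numpy as np

def layered_denoise_step(latent, masks, denoise_fn):
    """Toy sketch of one multi-layer denoising step (assumed form):
    each semantic layer -- every foreground instance plus the background --
    is denoised along its own pathway on a masked copy of the latent,
    and the partial results are recomposed with their spatial masks,
    instead of running one entangled pass over the whole latent.

    latent:     array of shape (C, H, W)
    masks:      list of (C, H, W) binary masks that partition the latent
    denoise_fn: stand-in for one denoising call of the diffusion model
    """
    out = np.zeros_like(latent)
    for mask in masks:
        # independent reconstruction of this layer's region
        out += mask * denoise_fn(latent * mask)
    return out
```

Because the masks partition the latent, each spatial location is reconstructed by exactly one pathway, which is the property that lets foreground instances and the background be handled separately.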
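Similarly, the Spatially-Modulated Attention idea can be sketched in a few lines. The sketch below is an assumed simplification, not the paper's exact mechanism: spatial mask priors are reduced to per-token region labels (`region_ids`), and cross-region attention logits are penalized by a hypothetical `leak_penalty` term so that tokens attend mostly within their own semantic region, curbing semantic leakage across boundaries.

```python
import numpy as np

def spatially_modulated_attention(q, k, v, region_ids, leak_penalty=-1e4):
    """Toy single-head attention with mask-derived spatial modulation.

    q, k, v:    (n_tokens, d) arrays
    region_ids: (n_tokens,) integer labels from spatial mask priors,
                e.g. 0 = background, 1..K = foreground IP instances
    """
    d = q.shape[-1]
    logits = q @ k.T / np.sqrt(d)                     # (n, n) attention logits
    # additive penalty on token pairs that lie in different semantic regions
    same_region = region_ids[:, None] == region_ids[None, :]
    logits = np.where(same_region, logits, logits + leak_penalty)
    # numerically stable row-wise softmax
    weights = np.exp(logits - logits.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights
```

With a large negative penalty, cross-region attention weights collapse toward zero, which is one simple way such a mechanism could reduce boundary artifacts between instances and the background.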

Keywords: Diffusion Models · Human Erasing · Multi-IP · Computer Vision · Image Inpainting · Deep Learning