Controllable Complex Image Generation without Instance Labeling
Controllable image generation methods, such as ControlNet, have demonstrated a remarkable capacity to introduce visual conditions (e.g., depth maps) to guide image generation. However, these methods often struggle with complex multi-instance scenes, frequently leading to attribute confusion among instances. While recent approaches attempt to mitigate this via manual instance labeling, such requirements are labor-intensive. In this paper, we propose InstanceControl, a novel multi-instance controllable generation method that eliminates the need for instance labeling. We identify the primary bottleneck in existing methods as the inability to accurately associate instance descriptions with their corresponding regions within visual conditions. To address this, we leverage the Vision-Language Model (VLM) to establish instance-level correspondences between text prompts and visual conditions. Specifically, the VLM automatically parses instance descriptions from the text prompts and simultaneously predicts instance masks based on the visual conditions. Furthermore, since the predicted masks may contain noise, we introduce an adaptive mask refinement strategy that dynamically refines these instance masks during the generation process. Extensive experiments demonstrate that our approach outperforms state-of-the-art methods, achieving superior fidelity and precise instance-level control.
Figure 1. Overview of InstanceControl. Our framework consists of two core stages: (1) instance-level text-visual condition association via VLM, and (2) instance-aware controllable generation with adaptive mask refinement.
| Setting | Methods | MIoU↑ | Region-wise Quality | Global-wise Quality | |||||
|---|---|---|---|---|---|---|---|---|---|
| Local CLIP↑ | Spatial↑ | Color↑ | Shape↑ | Texture↑ | IR↑ | FID↓ | |||
|
w/ Instance Labeling |
EliGen | 0.6104 | 17.49 | 91.97% | 84.98% | 85.59% | 87.93% | 0.3708 | 27.8240 |
| CreatiLayout | 0.5247 | 16.84 | 88.16% | 80.87% | 83.00% | 85.91% | 0.2979 | 21.3988 | |
| Seg2Any | 0.8316 | 18.62 | 91.71% | 87.25% | 88.39% | 90.14% | 0.0893 | 21.2305 | |
| DreamRenderer (Canny) | 0.6497 | 16.25 | 80.74% | 67.55% | 69.20% | 74.68% | 0.1877 | 14.5079 | |
| InstanceControl (Canny) | 0.8381 | 18.97 | 95.34% | 92.11% | 92.29% | 93.66% | 0.2940 | 9.9340 | |
| DreamRenderer (Depth) | 0.7738 | 17.88 | 89.52% | 80.91% | 81.45% | 86.69% | 0.3206 | 15.3637 | |
| InstanceControl (Depth) | 0.8212 | 18.73 | 95.06% | 91.35% | 92.08% | 93.47% | 0.2921 | 11.9211 | |
| DreamRenderer (HED) | 0.7060 | 15.08 | 78.93% | 60.56% | 63.18% | 68.79% | 0.0392 | 22.6739 | |
| InstanceControl (HED) | 0.8504 | 19.08 | 95.77% | 92.58% | 92.77% | 94.10% | 0.3256 | 10.2175 | |
|
w/o Instance Labeling |
FLUX ControlNet (Canny) | 0.6526 | 16.48 | 84.67% | 73.30% | 73.93% | 79.24% | 0.2558 | 14.0427 |
| InstanceControl (Canny) | 0.8250 | 18.51 | 93.54% | 87.78% | 88.19% | 90.88% | 0.2847 | 10.0264 | |
| FLUX ControlNet (Depth) | 0.7782 | 17.73 | 90.14% | 75.87% | 78.20% | 82.33% | 0.3216 | 14.1077 | |
| InstanceControl (Depth) | 0.8116 | 18.18 | 92.87% | 86.11% | 86.49% | 89.91% | 0.2961 | 11.9181 | |
| FLUX ControlNet (HED) | 0.6817 | 15.67 | 83.09% | 70.15% | 70.51% | 77.08% | 0.2314 | 20.0755 | |
| InstanceControl (HED) | 0.8472 | 18.74 | 94.64% | 88.75% | 89.55% | 91.18% | 0.3268 | 10.14 | |
Comparison with state-of-the-art methods on multi-instance controllable generation across three visual conditions (Canny, Depth, HED). Bold values indicate the best result within each group. IR denotes Instance Reward. FID measures global image quality (lower is better). InstanceControl achieves the best performance in both settings, even surpassing methods that require instance labeling.
Click or hover over any instance text or image region to highlight the corresponding object
Hover over any instance text or image region to highlight the corresponding object