
ControlNet Inpainting: use masked_image to create initial latents #5498

Closed
wants to merge 16 commits

Conversation

@yiyixuxu (Collaborator) commented Oct 23, 2023

In StableDiffusionControlNetInpaintPipeline we use the entire image (including the masked area) to create the initial latents. This causes the pixels in the masked area to influence the generation results, and is particularly a problem when the strength value is not 1.0. Some of the issues reported in #3959 (comment) may be related to this too.
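To make the difference concrete, here is a minimal, hypothetical sketch (not the actual pipeline code; vae, scheduler, and the argument names are stand-ins) of how the initial latents are built and where the masked pixels slip in when strength < 1.0:

import torch

def make_init_latents(vae, scheduler, image, mask, noise, timestep, strength, use_masked_image):
    # Hypothetical helper, for illustration only; mask is 1 where pixels should be repainted.
    if use_masked_image and strength < 1.0:
        # proposed behavior: blank out the masked pixels before encoding,
        # so they cannot steer the partially-noised starting point
        image = image * (1.0 - mask)
    init_latents = vae.encode(image).latent_dist.sample() * vae.config.scaling_factor
    if strength < 1.0:
        # partial noising: the encoded image (masked area included) survives in
        # the starting latents and influences every denoising step afterwards
        return scheduler.add_noise(init_latents, noise, timestep)
    # strength == 1.0: start from pure noise, the encoded image is discarded
    return noise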

testing results

I've attached the script I used for testing to this PR. In this particular example, we removed the masked area from the image and pasted a black background over it, so in the input image the masked area is black.

image and mask inputs for the pipeline

image mask
g_rgb g_mask

strength=0.99

Our pipeline was not able to generate a new background according to the text prompt with strength = 0.99. See outputs below; both outputs are generated with seed = 0 using the testing script provided.

main this PR
yiyi_test_1_out_0_strength_0 99_main yiyi_test_1_out_0_strength_0 99_testing

strength = 1.0

With strength=1.0, the initial latent we use is pure noise, so our pipeline was able to generate a new background. However, because of this line here:

latents = (1 - init_mask) * init_latents_proper + init_mask * latents
you can still see a lot of the black background from the input image leaking into the generated image, making the mask border very obvious. In this example we used a mask that's closely aligned to the object; I think this issue would be even more obvious if the mask were less carefully drawn.
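One contributing factor (my reading, illustrative only) is that the pixel-space mask has to be resized down to latent resolution before it can be used in the blend above, so the latent mask can only follow the object boundary at 8-pixel granularity. A small standalone illustration of how much of the border effectively changes sides after that round trip:

import torch
import torch.nn.functional as F

# 1 = repaint; the border deliberately does not line up with the 8-px latent grid
pixel_mask = torch.zeros(1, 1, 512, 512)
pixel_mask[:, :, 100:403, 100:403] = 1.0

# roughly what the inpaint pipelines do before blending in latent space
latent_mask = F.interpolate(pixel_mask, size=(64, 64))
# map it back to pixel resolution to see which pixels it effectively covers
roundtrip = F.interpolate(latent_mask, size=(512, 512))

mismatched = (roundtrip != pixel_mask).sum().item()
print(f"border pixels whose masking changes after the 8x round trip: {mismatched}")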

main this PR
yiyi_test_1_out_0_main yiyi_test_1_out_0_testing

fix unmasked area

In auto1111, the unmasked area is always pasted over the generated output as a final step: https://github.com/AUTOMATIC1111/stable-diffusion-webui/blob/5ef669de080814067961f28357256e8fe27544f4/modules/processing.py#L929C32-L929C32

We have a section in our docs about how to preserve the unmasked area, but maybe we should add it to the code since I found the results are significantly better:

## Preserve unmasked areas
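For reference, the compositing described in that doc section (and done again in the testing script further down) can be wrapped into a small helper. This is just a sketch of that post-processing step, with an illustrative function name, assuming PIL inputs of the same size and a mask where white marks the repainted region:

import numpy as np
import PIL.Image

def paste_unmasked_area(init_image, mask_image, generated_image):
    # Binarize the mask: 1 = repainted pixels, 0 = pixels to keep from the original
    mask = np.array(mask_image.convert("L")).astype(np.float32)[:, :, None] / 255.0
    mask = (mask >= 0.5).astype(np.float32)
    init = np.array(init_image.convert("RGB"), dtype=np.float32)
    generated = np.array(generated_image.convert("RGB"), dtype=np.float32)
    # Take masked pixels from the generated image, unmasked pixels from the original
    composite = (1.0 - mask) * init + mask * generated
    return PIL.Image.fromarray(composite.round().astype("uint8"))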

improved outputs with this PR

I think the outputs are consistently pretty OK (these generations are of the same quality as auto1111 with similar settings) when we also:

  1. set strength=0.99
  2. fix unmasked area

Images generated using the testing script, for seeds 0 to 9:
yiyi_test_1_out_0_strength_0 99_fix_unmask_testing
yiyi_test_1_out_1_strength_0 99_fix_unmask_testing
yiyi_test_1_out_2_strength_0 99_fix_unmask_testing
yiyi_test_1_out_3_strength_0 99_fix_unmask_testing
yiyi_test_1_out_4_strength_0 99_fix_unmask_testing
yiyi_test_1_out_5_strength_0 99_fix_unmask_testing
yiyi_test_1_out_6_strength_0 99_fix_unmask_testing
yiyi_test_1_out_7_strength_0 99_fix_unmask_testing
yiyi_test_1_out_8_strength_0 99_fix_unmask_testing
yiyi_test_1_out_9_strength_0 99_fix_unmask_testing

testing script

from diffusers import StableDiffusionControlNetInpaintPipeline, EulerAncestralDiscreteScheduler, ControlNetModel
from diffusers.utils import load_image
import torch
import PIL
from PIL import Image
import numpy as np

branch_name = "testing" # "main"/"testing"
strength = 0.95

rgba_init_image = load_image("https://huggingface.co/datasets/YiYiXu/testing-images/resolve/main/g.png")
image_mask = load_image("https://huggingface.co/datasets/YiYiXu/testing-images/resolve/main/g_mask.png")
control_image = load_image("https://huggingface.co/datasets/YiYiXu/testing-images/resolve/main/g_canny.png")

def flatten(img, bgcolor="#ffffff"):
    if img.mode == "RGBA":
        background = Image.new('RGBA', img.size, bgcolor)
        background.paste(img, mask=img)
        img = background
    return img.convert('RGB')

# create an init_image with white background
init_image = flatten(rgba_init_image)

dtype = torch.float16
controlnet_canny_model_id = "lllyasviel/sd-controlnet-canny"

controlnet = ControlNetModel.from_pretrained(
            controlnet_canny_model_id, torch_dtype=dtype).to("cuda")

controlnet_inpaint_pipe = StableDiffusionControlNetInpaintPipeline.from_pretrained(
            "runwayml/stable-diffusion-v1-5",
            controlnet=controlnet,
            torch_dtype=dtype,
            safety_checker=None,
        )
controlnet_inpaint_pipe.to("cuda")

controlnet_inpaint_pipe.scheduler = EulerAncestralDiscreteScheduler.from_config(
            controlnet_inpaint_pipe.scheduler.config)


# Convert mask to grayscale NumPy array
mask_image_arr = np.array(image_mask.convert("L"))
# Add a channel dimension to the end of the grayscale mask
mask_image_arr = mask_image_arr[:, :, None]
# Binarize the mask: 1s correspond to the pixels which are repainted
mask_image_arr = mask_image_arr.astype(np.float32) / 255.0
mask_image_arr[mask_image_arr < 0.5] = 0
mask_image_arr[mask_image_arr >= 0.5] = 1


for seed in range(10):
    generator = torch.Generator(device="cuda").manual_seed(seed)
    output = controlnet_inpaint_pipe(
        image=init_image,
        mask_image=image_mask,
        strength=strength,
        prompt="a bottle emerging from ripples in a lake surrounded by plants and flowers",
        negative_prompt="blurry, bad quality, painting, cgi, malformation",
        guidance_scale=7.,
        num_inference_steps=40,
        control_image=control_image,
        controlnet_conditioning_scale=1.0,
        control_guidance_start=0.0,
        control_guidance_end=1.0,
        masked_content="blank",
        )
    repainted_image = output.images[0]
   
    output.images[0].save(f"out_{seed}_strength_{strength}_{branch_name}.png")

    # Take the masked pixels from the repainted image and the unmasked pixels from the initial image
    unmasked_unchanged_image_arr = (1 - mask_image_arr) * init_image + mask_image_arr * repainted_image
    unmasked_unchanged_image = PIL.Image.fromarray(unmasked_unchanged_image_arr.round().astype("uint8"))
    unmasked_unchanged_image.save(f"out_{seed}_strength_{strength}_fix_unmask_{branch_name}.png")

@yiyixuxu (Collaborator Author)

relevant
#5163
#3959

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.

@patrickvonplaten (Contributor) commented Oct 23, 2023

Very nice find! I see the problem here! In some sense the problem is that the mask image doesn't exactly match the masked area of the original image. Note that usually the original image (first image on the left here) should not include a mask, but instead be the full original image.

It makes sense to give the user the possibility of not having the original influence the result at all, but at the same time we should allow it for cases where the user wants the generated content of the mask to look similar to how it was before. E.g. let's say I want to inpaint a bottle, but want the style to look mostly the same - in this case I do want the original image to influence my result. Also it doesn't really make sense

Could we maybe add a new function argument called use_masked_image_as_init?

@yiyixuxu (Collaborator Author)

@patrickvonplaten
added!
If it's OK I will make the same change to all the other inpaint pipelines.

@yiyixuxu (Collaborator Author) commented Oct 24, 2023

@patrickvonplaten
Actually, I think it might correspond to these options in auto1111 :) Should we make it a string-type variable instead?

Screenshot 2023-10-23 at 4 37 42 PM
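For context, auto1111's "masked content" setting offers fill, original, latent noise, and latent nothing. A rough sketch of how a string-typed argument could dispatch the latent-space variants (illustrative names only, not the final API; "fill" is omitted because it works in pixel space before encoding):

import torch

def init_latents_for(option, image_latents, mask_latents, noise, scheduler, timestep):
    # Hypothetical dispatch, loosely following auto1111's masked-content modes.
    # mask_latents is 1 where the image should be repainted.
    if option == "original":
        base = image_latents                                  # keep masked content as-is
    elif option == "latent nothing":
        base = image_latents * (1.0 - mask_latents)           # zero out the masked latents
    elif option == "latent noise":
        base = image_latents * (1.0 - mask_latents) + noise * mask_latents
    else:
        raise ValueError(f"unsupported masked_content option: {option!r}")
    return scheduler.add_noise(base, noise, timestep)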

@zengjie617789

The modification seems exciting, but when I use the latest code to test on my own data, the results don't seem good.
I use SAM to get the mask, and a prompt about clothes style. The result is below:
image

        image_0 = self.inpaint_pipe(
            prompt=prompt,
            negative_prompt=negative_prompt,
            image=image,
            mask_image=clothes_region,
            width=w_r,
            height=h_r,
            control_image=control_image,
            guidance_scale=7.,
            num_inference_steps=40,
            controlnet_conditioning_scale=1.0,
            control_guidance_start=0.0,
            control_guidance_end=1.0,
            use_masked_image_as_init=False,
            **kwargs,
        ).images[0]

Obviously, the results show a margin between the original and the generated image. The pipeline I used is StableDiffusionControlNetInpaintPipeline; when I switch to StableDiffusionInpaintPipeline, the results are below:
image
It seems to work well, but without ControlNet control it will sometimes generate extra hands.
I don't know how to improve this next step; any help would be appreciated.

@yiyixuxu (Collaborator Author)

@zengjie617789
Maybe this is actually a case where you want to leave use_masked_image_as_init=True?
If you provide the inputs and script, I'm happy to play around to see if we can make it work.

@zengjie617789

Thanks for your instant response. The original image and mask are below, and the script is the same as yours.
I changed use_masked_image_as_init to True, but it doesn't work.
zhoujielun_sam_clothes
zhouejielun (15)

@patrickvonplaten (Contributor)

@patrickvonplaten Actually, I think it might correspond to these options in auto1111 :) Should we make it a string-type variable instead?

Screenshot 2023-10-23 at 4 37 42 PM

Sounds good to make it a string type - good idea!

@yiyixuxu (Collaborator Author) commented Oct 26, 2023

@patrickvonplaten

I had to update tests because I swapped the order of prepare_latents and prepare_mask_latents, and therefore also the generator state used by these two methods. I.e. previously the generator was used by prepare_latents first and then by prepare_mask_latents; now it is the opposite. The outputs changed because of this.
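The reason a pure reordering changes outputs is that a torch.Generator is stateful: whichever of the two methods samples first consumes the first chunk of the random stream for a given seed. A tiny standalone illustration (the real tensors have different shapes, but the effect is the same):

import torch

g = torch.Generator().manual_seed(0)
first_draw = torch.randn(4, generator=g)      # e.g. prepare_latents sampling first
second_draw = torch.randn(4, generator=g)     # e.g. prepare_mask_latents sampling second

g = torch.Generator().manual_seed(0)
swapped_first = torch.randn(4, generator=g)   # now the other method samples first
swapped_second = torch.randn(4, generator=g)

print(torch.equal(first_draw, swapped_first))  # True: same position in the random stream
print(torch.equal(first_draw, second_draw))    # False: different positions
# Swapping the call order hands each method different noise, so the final
# images change even though the seed and everything else stay fixed.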

I ran all the doc string examples for a sanity check:

inpaint (SD)

main this PR
yiyi_test_4_out_sd_inpaint_main yiyi_test_4_out_sd_inpaint_testing

inpaint (SDXL)

main this PR
yiyi_test_4_out_sdxl_inpaint_main yiyi_test_4_out_sdxl_inpaint_testing

inpaint (SD ControlNet Inpaint)

main this PR
yiyi_test_4_out_sd_inpaint_control_main yiyi_test_4_out_sd_inpaint_control_testing

inpaint (SDXL ControlNet Inpaint)

main this PR
yiyi_test_4_out_sdxl_inpaint_control_main yiyi_test_4_out_sdxl_inpaint_control_testing

@sebi75 commented Oct 26, 2023

Hi @yiyixuxu, can you please also update the provided test script with the new argument?

Edit: Also, after testing it for generating new backgrounds from scratch using the "blank" value for the new parameter, the results are much worse in terms of details and overall image generation. Let's take this example:
Initial image
cb5739ff-095d-454-a07-2d4fa5e723de

The settings are exactly the same for both generations.
From further testing, this issue applies to any image, not only this case; the results are very poor in details. Also, in AUTOMATIC1111 this issue isn't reproducible; the generations are well controlled and rich in details. Any ideas why this happens now?
Generations before:
kleerkai-main s3 eu-central-1 amazonaws (1)
kleerkai-main s3 eu-central-1 amazonaws (2)
kleerkai-main s3 eu-central-1 amazonaws

Generations with this PR:
generation_1
generation_2

@yiyixuxu (Collaborator Author) commented Oct 26, 2023

@sebi75
I updated the test script and ran it on my machine - the output looks as expected.

Did you use the latest commit? Your outputs look like they were generated with masked_content=original. I think I messed up this parameter at some point but fixed it in a more recent commit. I hope this is the case; if it's not, please provide me with an example like what you did before (code + inputs), and I will take a look ASAP.

Thanks!
YiYi

@yiyixuxu (Collaborator Author)

@sebi75 I noticed the issue you mentioned with a lower strength value, e.g. here are some examples I generated with the test script using strength = 0.75.

I'm looking into this today!!!

yiyi_test_1_out_0_strength_0 75_fix_unmask_testing

@yiyixuxu yiyixuxu marked this pull request as draft October 27, 2023 20:23
@sebi75 commented Oct 27, 2023

@sebi75 I noticed the issue you mentioned with a lower strength value, e.g. here are some examples I generated with the test script using strength = 0.75.

I'm looking into this today!!!

yiyi_test_1_out_0_strength_0 75_fix_unmask_testing

Thanks for looking into it, although I still think there really is an issue regarding image generation details with the new parameter. It seems that for more control over the inpainted image, there is a trade-off in the background details of the result.
I couldn't get results similar to the previous version with the new parameter, no matter how many retries I did.
If anything, the overall image quality looks better without it.

For the same initial testing image, before the new parameter, I constantly get consistent results like:
alpha_result_5
alpha_result_4
alpha_result_1

With the new parameter set to blank:
result_2
result_1

@yiyixuxu (Collaborator Author)

@sebi75

Oh thanks! I definitely want to look into this too.
Can you share your script? I wasn't able to generate a background at all without this PR (unless I use strength=1.0).

@sebi75 commented Oct 27, 2023

@sebi75

Oh thanks! I definitely want to look into this too. Can you share your script? I wasn't able to generate a background at all without this PR (unless I use strength=1.0).

It's the same one specified in the issue; I was always using strength=1.0 to get background generations (it took some time to figure out that it needs to be exactly 1 in order to generate).

@patrickvonplaten (Contributor)

PR looks nice! Let me know if it's ready for a final review :-)

@yiyixuxu (Collaborator Author) commented Oct 31, 2023

I added an option from auto1111 for testing: masked_content=fill - it basically fills the masked region with colors from the image. It works with strength values < 1.0, and compared with the "blank" option it generates images of higher quality, at the cost of a more obvious "margin" around the mask.

Our inpaint pipeline does not work exactly the same as auto1111 (the main differences include how we resize the mask and apply it to the latents, e.g. auto1111 applies it to the predicted x_0 whereas we apply it to x_(t-1)). But I don't think these differences amount to a significant difference in generation quality. IMO our generations with fill have very similar quality compared to auto1111 with the same configuration (see the side-by-side comparisons below for strength = 0.9 and 0.99).

I think "blank" makes sense when you want to minimize the undesired "margin". I don't think we have to add fill, It seems that we can achieve better results with masked_content == original + strength==1.0 - at least for this example. Maybe we can just leave it as a method in the Image_processor that user can use to create input?

compare "blank", "fill" and "original"

I picked 5 examples (with seed = 0,1,2,3,4) to compare "blank" vs "fill" at strength levels 0.9 and 0.99; I also included images generated by auto1111 "fill" as a reference. I did not include the "original" option in this comparison because, for our example image and mask inputs, it's not able to generate a new background with strength < 1.

I also compared all 3 options "blank", "fill" and "original" for strength = 1.0. I did not include auto1111 here because it does not use pure noise as the initial latents for strength 1.0, so it wouldn't make much sense to compare.

strength = 0.9

blank fill fill (auto1111)
yiyi_test_6_out_0_0 9_blank_testing yiyi_test_6_out_0_0 9_fill_testing image0
yiyi_test_6_out_1_0 9_blank_testing yiyi_test_6_out_1_0 9_fill_testing image1
yiyi_test_6_out_2_0 9_blank_testing yiyi_test_6_out_2_0 9_fill_testing image2
yiyi_test_6_out_3_0 9_blank_testing yiyi_test_6_out_3_0 9_fill_testing image3
yiyi_test_6_out_4_0 9_blank_testing yiyi_test_6_out_4_0 9_fill_testing image4

strength = 0.99

blank fill fill (auto1111)
yiyi_test_6_out_0_0 99_blank_testing yiyi_test_6_out_0_0 99_fill_testing image_0
yiyi_test_6_out_1_0 99_blank_testing yiyi_test_6_out_1_0 99_fill_testing image_1
yiyi_test_6_out_2_0 99_blank_testing yiyi_test_6_out_2_0 99_fill_testing image_2
yiyi_test_6_out_3_0 99_blank_testing yiyi_test_6_out_3_0 99_fill_testing image_3
yiyi_test_6_out_4_0 99_blank_testing yiyi_test_6_out_4_0 99_fill_testing image_5

strength = 1.0

blank fill original
yiyi_test_6_out_0_1 0_blank_testing yiyi_test_6_out_0_1 0_fill_testing yiyi_test_6_out_0_1 0_original_testing
yiyi_test_6_out_1_1 0_blank_testing yiyi_test_6_out_1_1 0_fill_testing yiyi_test_6_out_1_1 0_original_testing
yiyi_test_6_out_2_1 0_blank_testing yiyi_test_6_out_2_1 0_fill_testing yiyi_test_6_out_2_1 0_original_testing
yiyi_test_6_out_3_1 0_blank_testing yiyi_test_6_out_3_1 0_fill_testing yiyi_test_6_out_3_1 0_original_testing
yiyi_test_6_out_4_1 0_blank_testing yiyi_test_6_out_4_1 0_fill_testing yiyi_test_6_out_4_1 0_original_testing

@yiyixuxu (Collaborator Author) commented Oct 31, 2023

I also observed the trade-off mentioned by @sebi75 here: #5498 (comment), and I'm struggling to understand why.

From what I understand, when strength = 1.0, and if our init_mask is "accurate", i.e. it's able to mask out the EXACT region in the latents that corresponds to the unmasked pixels, the blank and original options should work exactly the same, because the latents we calculate with the line of code below should be the same. Are we seeing vastly different generation results between these two options with strength=1.0 because of latent mask mismatch? Are the images blurry because it's trying to make the mismatched region look better?

I'm not super confident in my understanding. cc @patrickvonplaten to see if he has any insights here.

latents = (1 - init_mask) * init_latents_proper + init_mask * latents
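A toy check of that intuition (illustrative only; random tensors stand in for the real latents, a single blend step, and an idealized binary mask that exactly matches the blanked pixels):

import torch

torch.manual_seed(0)
mask = (torch.rand(1, 1, 64, 64) > 0.5).float()   # 1 = repaint
current = torch.randn(1, 4, 64, 64)                # latents at the current step
original_init = torch.randn(1, 4, 64, 64)          # noised latents of the full image
blank_init = original_init * (1 - mask)            # "blank": masked latents zeroed out

blend_original = (1 - mask) * original_init + mask * current
blend_blank = (1 - mask) * blank_init + mask * current
print(torch.allclose(blend_original, blend_blank))  # True: the two options coincide
# If the latent mask does not exactly cover the blanked pixels (which it generally
# won't, since the VAE mixes spatial information and the mask is downsampled),
# (1 - mask) keeps some of the original masked content and the two options diverge.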

@yiyixuxu yiyixuxu marked this pull request as ready for review October 31, 2023 06:10
masked_content(`str`, *optional*, defaults to `"original"`):
This option determines how the masked content on the original image would affect the generation
process. Choose from `"original"` or `"blank"`. If `"original"`, the entire image will be used to
create the initial latent, therefore the maksed content in will influence the result. If `"blank"`, the

Suggested change
create the initial latent, therefore the maksed content in will influence the result. If `"blank"`, the
create the initial latent, therefore the masked content in will influence the result. If `"blank"`, the

@@ -813,6 +814,14 @@ def __call__(
clip_skip (`int`, *optional*):
Number of layers to be skipped from CLIP while computing the prompt embeddings. A value of 1 means that
the output of the pre-final layer will be used for computing the prompt embeddings.
masked_content(`str`, *optional*, defaults to `"original"`):

Suggested change
masked_content(`str`, *optional*, defaults to `"original"`):
masked_content (`str`, *optional*, defaults to `"original"`):

@George0726

Hi author, any updates on this PR? When will it be merged? Thanks.

@sebi75 commented Nov 27, 2023

Any updates on this?

@darshats commented Dec 1, 2023

Hi @yiyixuxu - I'm a bit confused as to how the ControlNet is used here. The canny image you pointed to is a canny of the bottle, but you are trying to paint the background (water, etc.), right? There the ControlNet conditioning is fully blank. Won't that confuse the model into not drawing in that area?

@yiyixuxu (Collaborator Author) commented Dec 1, 2023

@darshats
Hey, thanks for the questions! In this case the outline of the bottle helps define the shape of the bottle and avoid the overgrowing issue that we often encounter :) The canny edges inside the bottle do not make a difference to the output.

@babyta commented Dec 2, 2023

Hello, while the experts are here, I would like to ask a question outside the topic. When I compare the same prompt under our library and sd-webui, sd-webui is often better in terms of structure and final effect. What's the reason? For example: bottle placed on concrete countertop, with background of minimalist, hoya plants, blurred background

@babyta commented Dec 4, 2023

Hello, while the experts are here, I would like to ask a question outside the topic. When I compare the same prompt under our library and sd-webui, sd-webui is often better in terms of structure and final effect. What's the reason? For example: bottle placed on concrete countertop, with background of minimalist, hoya plants, blurred background

I found the reason. It is a problem with the SD base model; changing to cyberrealistic_v33 gives relatively better results.

@kadirnar (Contributor) commented Dec 8, 2023

@yiyixuxu @patrickvonplaten
When will you merge? I need this.

@yiyixuxu (Collaborator Author) commented Dec 8, 2023

hi @kadirnar:
I think it is probably not worth adding, because the quality is worse than simply using strength = 1.0 with the current diffusers implementation. This PR (#6072) will be merged soon instead, and it will help with the mask border issue.

@kadirnar (Contributor)

hi @kadirnar: I think it is probably not worth adding, because the quality is worse than simply using strength = 1.0 with the current diffusers implementation. This PR (#6072) will be merged soon instead, and it will help with the mask border issue.

Thank you. I'm waiting for it to be merged.

github-actions bot commented Jan 3, 2024

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@github-actions github-actions bot added the stale Issues that haven't received updates label Jan 3, 2024
@yiyixuxu yiyixuxu closed this Jan 3, 2024
@yiyixuxu yiyixuxu deleted the auto1111-contronet-inpaint branch January 3, 2024 17:09