
ControlNet Inpainting: use masked_image to create initial latents #5498

Closed
wants to merge 16 commits

Conversation

@yiyixuxu (Collaborator) commented Oct 23, 2023

In StableDiffusionControlNetInpaintPipeline we use the entire image (including the masked area) to create the initial latents. This causes the pixels in the masked area to influence the generation results, and is particularly a problem when the strength value is not 1.0. Some of the issues reported in #3959 (comment) may be related to this too.
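To make the difference concrete, here is a minimal, hypothetical sketch (not the actual pipeline code; vae, scheduler, and the argument names are stand-ins) of how the initial latents are built and where the masked pixels slip in when strength < 1.0:

import torch

def make_init_latents(vae, scheduler, image, mask, noise, timestep, strength, use_masked_image):
    # Hypothetical helper, for illustration only; mask is 1 where pixels should be repainted.
    if use_masked_image and strength < 1.0:
        # proposed behavior: blank out the masked pixels before encoding,
        # so they cannot steer the partially-noised starting point
        image = image * (1.0 - mask)
    init_latents = vae.encode(image).latent_dist.sample() * vae.config.scaling_factor
    if strength < 1.0:
        # partial noising: the encoded image (masked area included) survives in
        # the starting latents and influences every denoising step afterwards
        return scheduler.add_noise(init_latents, noise, timestep)
    # strength == 1.0: start from pure noise, the encoded image is discarded
    return noise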

testing results

I've attached the script I used for testing to this PR. In this particular example, we removed the masked area from the image and pasted a black background over it, so in the input image the masked area is black.

image and mask inputs for the pipeline

image mask
g_rgb g_mask

strength=0.99

Our pipeline was not able to generate a new background according to the text prompt with strength = 0.99. See outputs below; both outputs are generated with seed = 0 using the testing script provided.

main this PR
yiyi_test_1_out_0_strength_0 99_main yiyi_test_1_out_0_strength_0 99_testing

strength = 1.0

With strength=1.0, the initial latent we use is pure noise, so our pipeline was able to generate a new background. However, because of this line here:

latents = (1 - init_mask) * init_latents_proper + init_mask * latents
you can still see a lot of the black background from the input image leaking into the generated image, making the mask border very obvious. In this example we used a mask that's closely aligned to the object; I think this issue would be even more obvious if the mask were less carefully drawn.
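One contributing factor (my reading, illustrative only) is that the pixel-space mask has to be resized down to latent resolution before it can be used in the blend above, so the latent mask can only follow the object boundary at 8-pixel granularity. A small standalone illustration of how much of the border effectively changes sides after that round trip:

import torch
import torch.nn.functional as F

# 1 = repaint; the border deliberately does not line up with the 8-px latent grid
pixel_mask = torch.zeros(1, 1, 512, 512)
pixel_mask[:, :, 100:403, 100:403] = 1.0

# roughly what the inpaint pipelines do before blending in latent space
latent_mask = F.interpolate(pixel_mask, size=(64, 64))
# map it back to pixel resolution to see which pixels it effectively covers
roundtrip = F.interpolate(latent_mask, size=(512, 512))

mismatched = (roundtrip != pixel_mask).sum().item()
print(f"border pixels whose masking changes after the 8x round trip: {mismatched}")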

main this PR
yiyi_test_1_out_0_main yiyi_test_1_out_0_testing

fix unmasked area

In auto1111, the unmasked area is always pasted over the generated output as a final step: https://github.com/AUTOMATIC1111/stable-diffusion-webui/blob/5ef669de080814067961f28357256e8fe27544f4/modules/processing.py#L929C32-L929C32

We have a section in our docs about how to preserve the unmasked area, but maybe we should add it to the code since I found the results are significantly better:

## Preserve unmasked areas
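For reference, the compositing described in that doc section (and done again in the testing script further down) can be wrapped into a small helper. This is just a sketch of that post-processing step, with an illustrative function name, assuming PIL inputs of the same size and a mask where white marks the repainted region:

import numpy as np
import PIL.Image

def paste_unmasked_area(init_image, mask_image, generated_image):
    # Binarize the mask: 1 = repainted pixels, 0 = pixels to keep from the original
    mask = np.array(mask_image.convert("L")).astype(np.float32)[:, :, None] / 255.0
    mask = (mask >= 0.5).astype(np.float32)
    init = np.array(init_image.convert("RGB"), dtype=np.float32)
    generated = np.array(generated_image.convert("RGB"), dtype=np.float32)
    # Take masked pixels from the generated image, unmasked pixels from the original
    composite = (1.0 - mask) * init + mask * generated
    return PIL.Image.fromarray(composite.round().astype("uint8"))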

improved outputs with this PR

I think the outputs are consistently pretty OK (these generations are of the same quality as auto1111 with similar settings) when we also:

  1. set strength=0.99
  2. fix unmasked area

Images generated using the testing script, for seeds 0 to 9:
yiyi_test_1_out_0_strength_0 99_fix_unmask_testing
yiyi_test_1_out_1_strength_0 99_fix_unmask_testing
yiyi_test_1_out_2_strength_0 99_fix_unmask_testing
yiyi_test_1_out_3_strength_0 99_fix_unmask_testing
yiyi_test_1_out_4_strength_0 99_fix_unmask_testing
yiyi_test_1_out_5_strength_0 99_fix_unmask_testing
yiyi_test_1_out_6_strength_0 99_fix_unmask_testing
yiyi_test_1_out_7_strength_0 99_fix_unmask_testing
yiyi_test_1_out_8_strength_0 99_fix_unmask_testing
yiyi_test_1_out_9_strength_0 99_fix_unmask_testing

testing script

from diffusers import StableDiffusionControlNetInpaintPipeline, EulerAncestralDiscreteScheduler, ControlNetModel
from diffusers.utils import load_image
import torch
import PIL
from PIL import Image
import numpy as np

branch_name = "testing" # "main"/"testing"
strength = 0.95

rgba_init_image = load_image("https://huggingface.co/datasets/YiYiXu/testing-images/resolve/main/g.png")
image_mask = load_image("https://huggingface.co/datasets/YiYiXu/testing-images/resolve/main/g_mask.png")
control_image = load_image("https://huggingface.co/datasets/YiYiXu/testing-images/resolve/main/g_canny.png")

def flatten(img, bgcolor="#ffffff"):
    if img.mode == "RGBA":
        background = Image.new('RGBA', img.size, bgcolor)
        background.paste(img, mask=img)
        img = background
    return img.convert('RGB')

# create an init_image with white background
init_image = flatten(rgba_init_image)

dtype = torch.float16
controlnet_canny_model_id = "lllyasviel/sd-controlnet-canny"

controlnet = ControlNetModel.from_pretrained(
            controlnet_canny_model_id, torch_dtype=dtype).to("cuda")

controlnet_inpaint_pipe = StableDiffusionControlNetInpaintPipeline.from_pretrained(
            "runwayml/stable-diffusion-v1-5",
            controlnet=controlnet,
            torch_dtype=dtype,
            safety_checker=None,
        )
controlnet_inpaint_pipe.to("cuda")

controlnet_inpaint_pipe.scheduler = EulerAncestralDiscreteScheduler.from_config(
            controlnet_inpaint_pipe.scheduler.config)


# Convert mask to grayscale NumPy array
mask_image_arr = np.array(image_mask.convert("L"))
# Add a channel dimension to the end of the grayscale mask
mask_image_arr = mask_image_arr[:, :, None]
# Binarize the mask: 1s correspond to the pixels which are repainted
mask_image_arr = mask_image_arr.astype(np.float32) / 255.0
mask_image_arr[mask_image_arr < 0.5] = 0
mask_image_arr[mask_image_arr >= 0.5] = 1


for seed in range(10):
    generator = torch.Generator(device="cuda").manual_seed(seed)
    output = controlnet_inpaint_pipe(
        image=init_image,
        mask_image=image_mask,
        strength=strength,
        prompt="a bottle emerging from ripples in a lake surrounded by plants and flowers",
        negative_prompt="blurry, bad quality, painting, cgi, malformation",
        guidance_scale=7.,
        num_inference_steps=40,
        control_image=control_image,
        controlnet_conditioning_scale=1.0,
        control_guidance_start=0.0,
        control_guidance_end=1.0,
        masked_content="blank",
        )
    repainted_image = output.images[0]
   
    output.images[0].save(f"out_{seed}_strength_{strength}_{branch_name}.png")

    # Take the masked pixels from the repainted image and the unmasked pixels from the initial image
    unmasked_unchanged_image_arr = (1 - mask_image_arr) * init_image + mask_image_arr * repainted_image
    unmasked_unchanged_image = PIL.Image.fromarray(unmasked_unchanged_image_arr.round().astype("uint8"))
    unmasked_unchanged_image.save(f"out_{seed}_strength_{strength}_fix_unmask_{branch_name}.png")

@yiyixuxu (Collaborator Author)

relevant
#5163
#3959

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.

@patrickvonplaten (Contributor) commented Oct 23, 2023

Very nice find! I see the problem here! In some sense the problem is that the mask image doesn't exactly match the masked area of the original image. Note that usually the original image (first image on the left here) should not include a mask, but instead be the full original image.

It makes sense to give the user the possibility of not having the original influence the result at all, but at the same time we should allow it for cases where the user wants the generated content of the mask to look similar to how it was before. E.g. let's say I want to inpaint a bottle, but want the style to look mostly the same - in this case I do want the original image to influence my result. Also it doesn't really make sense

Could we maybe add a new function argument called use_masked_image_as_init?

@yiyixuxu (Collaborator Author)

@patrickvonplaten
added!
If it's OK I will make the same change to all the other inpaint pipelines.

@yiyixuxu (Collaborator Author) commented Oct 24, 2023

@patrickvonplaten
Actually, I think it might correspond to these options in auto1111 :) Should we make it a string-type variable instead?

Screenshot 2023-10-23 at 4 37 42 PM
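For context, auto1111's "masked content" setting offers fill, original, latent noise, and latent nothing. A rough sketch of how a string-typed argument could dispatch the latent-space variants (illustrative names only, not the final API; "fill" is omitted because it works in pixel space before encoding):

import torch

def init_latents_for(option, image_latents, mask_latents, noise, scheduler, timestep):
    # Hypothetical dispatch, loosely following auto1111's masked-content modes.
    # mask_latents is 1 where the image should be repainted.
    if option == "original":
        base = image_latents                                  # keep masked content as-is
    elif option == "latent nothing":
        base = image_latents * (1.0 - mask_latents)           # zero out the masked latents
    elif option == "latent noise":
        base = image_latents * (1.0 - mask_latents) + noise * mask_latents
    else:
        raise ValueError(f"unsupported masked_content option: {option!r}")
    return scheduler.add_noise(base, noise, timestep)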

@zengjie617789

The modification seems exciting, but when I use the latest code to test on my own data, the results don't seem good.
I use SAM to get the mask, and a prompt about clothes style. The result is below:
image

        image_0 = self.inpaint_pipe(
            prompt=prompt,
            negative_prompt=negative_prompt,
            image=image,
            mask_image=clothes_region,
            width=w_r,
            height=h_r,
            control_image=control_image,
            guidance_scale=7.,
            num_inference_steps=40,
            controlnet_conditioning_scale=1.0,
            control_guidance_start=0.0,
            control_guidance_end=1.0,
            use_masked_image_as_init=False,
            **kwargs,
        ).images[0]

Obviously, the results show a margin between the original and the generated image. The pipeline I used is StableDiffusionControlNetInpaintPipeline; when I switch to StableDiffusionInpaintPipeline, the results are below:
image
It seems to work well, but without ControlNet control it will sometimes generate extra hands.
I don't know how to improve this next step; any help would be appreciated.

@yiyixuxu (Collaborator Author)

@zengjie617789
Maybe this is actually a case where you want to leave use_masked_image_as_init=True?
If you provide the inputs and script, I'm happy to play around to see if we can make it work.

@zengjie617789

Thanks for your instant response. The original image and mask are below, and the script is the same as yours.
I changed use_masked_image_as_init to True, but it doesn't work.
zhoujielun_sam_clothes
zhouejielun (15)

@patrickvonplaten (Contributor)

@patrickvonplaten Actually, I think it might correspond to these options in auto1111 :) Should we make it a string-type variable instead?

Screenshot 2023-10-23 at 4 37 42 PM

Sounds good to make it a string type - good idea!

@yiyixuxu (Collaborator Author) commented Oct 26, 2023

@patrickvonplaten

I had to update tests because I swapped the order of prepare_latents and prepare_mask_latents, and therefore also the generator state used by these two methods. I.e. previously the generator was used by prepare_latents first and then by prepare_mask_latents; now it is the opposite. The outputs changed because of this.
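The reason a pure reordering changes outputs is that a torch.Generator is stateful: whichever of the two methods samples first consumes the first chunk of the random stream for a given seed. A tiny standalone illustration (the real tensors have different shapes, but the effect is the same):

import torch

g = torch.Generator().manual_seed(0)
first_draw = torch.randn(4, generator=g)      # e.g. prepare_latents sampling first
second_draw = torch.randn(4, generator=g)     # e.g. prepare_mask_latents sampling second

g = torch.Generator().manual_seed(0)
swapped_first = torch.randn(4, generator=g)   # now the other method samples first
swapped_second = torch.randn(4, generator=g)

print(torch.equal(first_draw, swapped_first))  # True: same position in the random stream
print(torch.equal(first_draw, second_draw))    # False: different positions
# Swapping the call order hands each method different noise, so the final
# images change even though the seed and everything else stay fixed.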

I ran all the doc string examples for a sanity check:

inpaint (SD)

main this PR
yiyi_test_4_out_sd_inpaint_main yiyi_test_4_out_sd_inpaint_testing

inpaint (SDXL)

main this PR
yiyi_test_4_out_sdxl_inpaint_main yiyi_test_4_out_sdxl_inpaint_testing

inpaint (SD ControlNet Inpaint)

main this PR
yiyi_test_4_out_sd_inpaint_control_main yiyi_test_4_out_sd_inpaint_control_testing

inpaint (SDXL ControlNet Inpaint)

main this PR
yiyi_test_4_out_sdxl_inpaint_control_main yiyi_test_4_out_sdxl_inpaint_control_testing

@sebi75 commented Oct 26, 2023

Hi @yiyixuxu, can you please also update the provided test script with the new argument?

Edit: Also, after testing it for generating new backgrounds from scratch using the "blank" value for the new parameter, the results are much worse in terms of details and overall image generation. Let's take this example:
Initial image
cb5739ff-095d-454-a07-2d4fa5e723de

The settings are exactly the same for both generations.
From further testing, this issue applies to any image, not only this case; the results are very poor in details. Also, in AUTOMATIC1111 this issue isn't reproducible; the generations are well controlled and rich in details. Any ideas why this happens now?
Generations before:
kleerkai-main s3 eu-central-1 amazonaws (1)
kleerkai-main s3 eu-central-1 amazonaws (2)
kleerkai-main s3 eu-central-1 amazonaws

Generations with this PR:
generation_1
generation_2

@yiyixuxu (Collaborator Author) commented Oct 26, 2023

@sebi75
I updated the test script and ran it on my machine - the output looks as expected.

Did you use the latest commit? Your outputs look like they were generated with masked_content=original. I think I messed up this parameter at some point but fixed it in a more recent commit. I hope this is the case; if it's not, please provide me with an example like what you did before (code + inputs), and I will take a look ASAP.

Thanks!
YiYi

@yiyixuxu (Collaborator Author)

@sebi75 I noticed the issue you mentioned with a lower strength value, e.g. here are some examples I generated with the test script using strength = 0.75.

I'm looking into this today!!!

yiyi_test_1_out_0_strength_0 75_fix_unmask_testing

@yiyixuxu yiyixuxu marked this pull request as draft October 27, 2023 20:23
@sebi75 commented Oct 27, 2023

@sebi75 I noticed the issue you mentioned with a lower strength value, e.g. here are some examples I generated with the test script using strength = 0.75.

I'm looking into this today!!!

yiyi_test_1_out_0_strength_0 75_fix_unmask_testing

Thanks for looking into it, although I still think there really is an issue regarding image generation details with the new parameter. It seems that for more control over the inpainted image, there is a trade-off in the background details of the result.
I couldn't get results similar to the previous version with the new parameter, no matter how many retries I did.
If anything, the overall image quality looks better without it.

For the same initial testing image, before the new parameter, I constantly get consistent results like:
alpha_result_5
alpha_result_4
alpha_result_1

With the new parameter set to blank:
result_2
result_1

@yiyixuxu (Collaborator Author)

@sebi75

Oh thanks! I definitely want to look into this too.
Can you share your script? I wasn't able to generate a background at all without this PR (unless I use strength=1.0).

@sebi75 commented Oct 27, 2023

@sebi75

Oh thanks! I definitely want to look into this too. Can you share your script? I wasn't able to generate a background at all without this PR (unless I use strength=1.0).

It's the same one specified in the issue; I was always using strength=1.0 to get background generations (it took some time to figure out that it needs to be exactly 1 in order to generate).

@patrickvonplaten (Contributor)

PR looks nice! Let me know if it's ready for a final review :-)

@yiyixuxu (Collaborator Author) commented Oct 31, 2023

I added an option from auto1111 for testing: masked_content=fill - it basically fills the masked region with colors from the image. It works with strength values < 1.0, and compared with the "blank" option it generates images of higher quality, at the cost of a more obvious "margin" around the mask.

Our inpaint pipeline does not work exactly the same as auto1111 (the main differences include how we resize the mask and apply it to the latents, e.g. auto1111 applies it to the predicted x_0 whereas we apply it to x_(t-1)). But I don't think these differences amount to a significant difference in generation quality. IMO our generations with fill have very similar quality compared to auto1111 with the same configuration (see the side-by-side comparisons below for strength = 0.9 and 0.99).

I think "blank" makes sense when you want to minimize the undesired "margin". I don't think we have to add fill, It seems that we can achieve better results with masked_content == original + strength==1.0 - at least for this example. Maybe we can just leave it as a method in the Image_processor that user can use to create input?

compare "blank", "fill" and "original"

I picked 5 examples (with seed = 0,1,2,3,4) to compare "blank" vs "fill" at strength levels 0.9 and 0.99; I also included images generated by auto1111 "fill" as a reference. I did not include the "original" option in this comparison because, for our example image and mask inputs, it's not able to generate a new background with strength < 1.

I also compared all 3 options "blank", "fill" and "original" for strength = 1.0. I did not include auto1111 here because it does not use pure noise as the initial latents for strength 1.0, so it wouldn't make much sense to compare.

strength = 0.9

blank fill fill (auto1111)
yiyi_test_6_out_0_0 9_blank_testing yiyi_test_6_out_0_0 9_fill_testing image0
yiyi_test_6_out_1_0 9_blank_testing yiyi_test_6_out_1_0 9_fill_testing image1
yiyi_test_6_out_2_0 9_blank_testing yiyi_test_6_out_2_0 9_fill_testing image2
yiyi_test_6_out_3_0 9_blank_testing yiyi_test_6_out_3_0 9_fill_testing image3
yiyi_test_6_out_4_0 9_blank_testing yiyi_test_6_out_4_0 9_fill_testing image4

strength = 0.99

blank fill fill (auto1111)
yiyi_test_6_out_0_0 99_blank_testing yiyi_test_6_out_0_0 99_fill_testing image_0
yiyi_test_6_out_1_0 99_blank_testing yiyi_test_6_out_1_0 99_fill_testing image_1
yiyi_test_6_out_2_0 99_blank_testing yiyi_test_6_out_2_0 99_fill_testing image_2
yiyi_test_6_out_3_0 99_blank_testing yiyi_test_6_out_3_0 99_fill_testing image_3
yiyi_test_6_out_4_0 99_blank_testing yiyi_test_6_out_4_0 99_fill_testing image_5

strength = 1.0

blank fill original
yiyi_test_6_out_0_1 0_blank_testing yiyi_test_6_out_0_1 0_fill_testing yiyi_test_6_out_0_1 0_original_testing
yiyi_test_6_out_1_1 0_blank_testing yiyi_test_6_out_1_1 0_fill_testing yiyi_test_6_out_1_1 0_original_testing
yiyi_test_6_out_2_1 0_blank_testing yiyi_test_6_out_2_1 0_fill_testing yiyi_test_6_out_2_1 0_original_testing
yiyi_test_6_out_3_1 0_blank_testing yiyi_test_6_out_3_1 0_fill_testing yiyi_test_6_out_3_1 0_original_testing
yiyi_test_6_out_4_1 0_blank_testing yiyi_test_6_out_4_1 0_fill_testing yiyi_test_6_out_4_1 0_original_testing

@yiyixuxu (Collaborator Author) commented Oct 31, 2023

I also observed the trade-off mentioned by @sebi75 here: #5498 (comment), and I'm struggling to understand why.

From what I understand, when strength = 1.0, and if our init_mask is "accurate", i.e. it's able to mask out the EXACT region in the latents that corresponds to the unmasked pixels, the blank and original options should work exactly the same, because the latents we calculate with the line of code below should be the same. Are we seeing vastly different generation results between these two options with strength=1.0 because of latent mask mismatch? Are the images blurry because it's trying to make the mismatched region look better?

I'm not super confident in my understanding. cc @patrickvonplaten to see if he has any insights here.

latents = (1 - init_mask) * init_latents_proper + init_mask * latents
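A toy check of that intuition (illustrative only; random tensors stand in for the real latents, a single blend step, and an idealized binary mask that exactly matches the blanked pixels):

import torch

torch.manual_seed(0)
mask = (torch.rand(1, 1, 64, 64) > 0.5).float()   # 1 = repaint
current = torch.randn(1, 4, 64, 64)                # latents at the current step
original_init = torch.randn(1, 4, 64, 64)          # noised latents of the full image
blank_init = original_init * (1 - mask)            # "blank": masked latents zeroed out

blend_original = (1 - mask) * original_init + mask * current
blend_blank = (1 - mask) * blank_init + mask * current
print(torch.allclose(blend_original, blend_blank))  # True: the two options coincide
# If the latent mask does not exactly cover the blanked pixels (which it generally
# won't, since the VAE mixes spatial information and the mask is downsampled),
# (1 - mask) keeps some of the original masked content and the two options diverge.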

@yiyixuxu yiyixuxu marked this pull request as ready for review October 31, 2023 06:10
masked_content(`str`, *optional*, defaults to `"original"`):
This option determines how the masked content on the original image would affect the generation
process. Choose from `"original"` or `"blank"`. If `"original"`, the entire image will be used to
create the initial latent, therefore the maksed content in will influence the result. If `"blank"`, the

Suggested change
create the initial latent, therefore the maksed content in will influence the result. If `"blank"`, the
create the initial latent, therefore the masked content in will influence the result. If `"blank"`, the

@@ -813,6 +814,14 @@ def __call__(
clip_skip (`int`, *optional*):
Number of layers to be skipped from CLIP while computing the prompt embeddings. A value of 1 means that
the output of the pre-final layer will be used for computing the prompt embeddings.
masked_content(`str`, *optional*, defaults to `"original"`):

Suggested change
masked_content(`str`, *optional*, defaults to `"original"`):
masked_content (`str`, *optional*, defaults to `"original"`):

@George0726

Hi author, any updates on this PR? When will it be merged? Thanks.

@sebi75 commented Nov 27, 2023

Any updates on this?

@darshats commented Dec 1, 2023

Hi @yiyixuxu - I'm a bit confused as to how the ControlNet is used here. The canny image you pointed to is a canny of the bottle, but you are trying to paint the background (water, etc.), right? There the ControlNet conditioning is fully blank. Won't that confuse the model into not drawing in that area?

@yiyixuxu (Collaborator Author) commented Dec 1, 2023

@darshats
Hey, thanks for the questions! In this case the outline of the bottle helps define the shape of the bottle and avoid the overgrowing issue that we often encounter :) The canny edges inside the bottle do not make a difference to the output.

@babyta commented Dec 2, 2023

Hello, while the experts are here, I would like to ask a question outside the topic. When I compare the same prompt under our library and sd-webui, sd-webui is often better in terms of structure and final effect. What's the reason? For example: bottle placed on concrete countertop, with background of minimalist, hoya plants, blurred background

@babyta commented Dec 4, 2023

Hello, while the experts are here, I would like to ask a question outside the topic. When I compare the same prompt under our library and sd-webui, sd-webui is often better in terms of structure and final effect. What's the reason? For example: bottle placed on concrete countertop, with background of minimalist, hoya plants, blurred background

I found the reason. It is a problem with the SD base model; changing to cyberrealistic_v33 gives relatively better results.

@kadirnar (Contributor) commented Dec 8, 2023

@yiyixuxu @patrickvonplaten
When will you merge? I need this.

@yiyixuxu (Collaborator Author) commented Dec 8, 2023

hi @kadirnar:
I think it is probably not worth adding, because the quality is worse than simply using strength = 1.0 with the current diffusers implementation. This PR (#6072) will be merged soon instead, and it will help with the mask border issue.

@kadirnar (Contributor)

hi @kadirnar: I think it is probably not worth adding, because the quality is worse than simply using strength = 1.0 with the current diffusers implementation. This PR (#6072) will be merged soon instead, and it will help with the mask border issue.

Thank you. I'm waiting for it to be merged.

github-actions bot commented Jan 3, 2024

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@github-actions github-actions bot added the stale Issues that haven't received updates label Jan 3, 2024
@yiyixuxu yiyixuxu closed this Jan 3, 2024
@yiyixuxu yiyixuxu deleted the auto1111-contronet-inpaint branch January 3, 2024 17:09