ControlNet Inpainting: use masked_image to create initial latents #5498
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.
Very nice find! I see the problem here! In some sense the problem is that the mask image doesn't exactly match the mask of the original image. Note that usually the original image (first image on the left here) should not include a mask, but instead be the full original image. It makes sense to give the user the possibility to not have the original influence the result at all, but at the same time we should allow it for cases where the user wants the generated content of the mask to look similar to how it was before. E.g. let's say I want to inpaint a bottle, but want the style to look mostly the same - in this case I do want the original image to influence my result. Could we maybe add a new function argument, e.g. `masked_content`, to let the user choose?
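For illustration, a minimal sketch of what the two behaviors could mean when preparing the initial latents. The function name, argument layout, and mask convention here are hypothetical, not the pipeline's actual internals:

```python
import torch

def prepare_init_latents(image_latents, noise, mask, strength, scheduler, timestep,
                         masked_content="original"):
    # image_latents: VAE-encoded original image, shape (B, C, H/8, W/8)
    # mask: 1.0 inside the region to inpaint, 0.0 elsewhere (latent resolution)
    if masked_content == "blank":
        # Zero out the masked region so its original content cannot leak
        # into the generation.
        image_latents = image_latents * (1.0 - mask)
    if strength >= 1.0:
        # At strength 1.0 denoising starts from pure noise, so the image
        # latents do not matter at all.
        return noise
    # Otherwise, noise the (possibly blanked) image latents up to the
    # starting timestep implied by `strength`.
    return scheduler.add_noise(image_latents, noise, timestep)
```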
@patrickvonplaten
@patrickvonplaten
@zengjie617789
Sounds good to make it a string type - good idea!
Hi @yiyixuxu, can you please also update the test script provided with the new argument? Edit: the settings for the generations are exactly the same for both generations.
@sebi75 did you use the latest commit? Your outputs look like they were generated with an older version of this PR. Thanks!
@sebi75 I noticed the issue you mentioned with a lower `strength`. I'm looking into this today!!!
Thanks for looking into it, although I still think there really is an issue with image generation details under the new parameter. It seems that, in exchange for more control over the inpainted region, there is a trade-off in the background detail of the result image. For the same initial testing image, before the new parameter, I consistently got results like:
oh thanks! definitely want to look into this too
It's the same as specified in the issue; I was always using `strength=1.0` to get background generations (it took some time to figure out that it needs to be exactly 1.0 in order to generate).
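For context, a sketch of why `strength` must be exactly 1.0 to start from pure noise, following the usual diffusers img2img/inpaint timestep logic (simplified here, not the pipeline's exact code):

```python
num_inference_steps = 50

def get_start_step(strength: float) -> int:
    # How many denoising steps will actually run for this strength.
    init_timestep = min(int(num_inference_steps * strength), num_inference_steps)
    # Steps skipped at the start; this is 0 only when strength == 1.0,
    # which is the only case where the pipeline starts from pure noise.
    return max(num_inference_steps - init_timestep, 0)

print(get_start_step(0.99))  # 1  -> init latents are a noised image, not pure noise
print(get_start_step(1.0))   # 0  -> denoising starts from pure noise
```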
PR looks nice! Let me know if it's ready for a final review :-)
I added an option from auto1111 for testing. Our inpaint pipeline does not work exactly the same as auto1111 (the main differences include how we resize the mask and apply it on the latents), so the outputs won't match exactly, but I think comparing the options side by side is still useful:
| blank | fill | fill (auto1111) |
|---|---|---|
| *(image omitted)* | *(image omitted)* | *(image omitted)* |

strength = 0.99

| blank | fill | fill (auto1111) |
|---|---|---|
| *(image omitted)* | *(image omitted)* | *(image omitted)* |

strength = 1.0

| blank | fill | original |
|---|---|---|
| *(image omitted)* | *(image omitted)* | *(image omitted)* |
I also observed the trade-off mentioned by @sebi75 here #5498 (comment) and I'm struggling to understand why. Not super confident in my understanding. cc @patrickvonplaten to see if he has any insights here
Co-authored-by: Patrick von Platen <[email protected]>
```
masked_content(`str`, *optional*, defaults to `"original"`):
    This option determines how the masked content on the original image would affect the generation
    process. Choose from `"original"` or `"blank"`. If `"original"`, the entire image will be used to
    create the initial latent, therefore the maksed content in will influence the result. If `"blank"`, the
```
Suggested change:

```diff
- create the initial latent, therefore the maksed content in will influence the result. If `"blank"`, the
+ create the initial latent, therefore the masked content in will influence the result. If `"blank"`, the
```
```diff
@@ -813,6 +814,14 @@ def __call__(
     clip_skip (`int`, *optional*):
         Number of layers to be skipped from CLIP while computing the prompt embeddings. A value of 1 means that
         the output of the pre-final layer will be used for computing the prompt embeddings.
+    masked_content(`str`, *optional*, defaults to `"original"`):
```
Suggested change:

```diff
- masked_content(`str`, *optional*, defaults to `"original"`):
+ masked_content (`str`, *optional*, defaults to `"original"`):
```
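For illustration, a hypothetical end-to-end call using the `masked_content` argument documented above. The argument name comes from this PR; the file names are placeholders, and the exact API depends on what gets merged:

```python
import torch
from PIL import Image
from diffusers import ControlNetModel, StableDiffusionControlNetInpaintPipeline

init_image = Image.open("input.png")       # the full original image
mask = Image.open("mask.png")              # white = region to regenerate
control_image = Image.open("control.png")  # conditioning image for the ControlNet

controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/control_v11p_sd15_inpaint", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
).to("cuda")

result = pipe(
    prompt="a sunny beach in the background",
    image=init_image,
    mask_image=mask,
    control_image=control_image,
    strength=0.99,
    masked_content="blank",  # proposed in this PR: don't let masked pixels leak in
).images[0]
result.save("out.png")
```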
Hi, any updates on this PR? When will it be merged? Thanks
Any updates on this?
@darshats
Hello, while the experts are here, I would like to ask a question outside the topic. I compared the same prompt under our library and sd-webui, and sd-webui is often better in terms of structure and final effect. What's the reason? For example: bottle placed on concrete countertop,with background of minimalist,hoya plants,blurred background
I know the reason. It is a problem with the SD base model. Changing to cyberrealistic_v33 gives relatively better results.
@yiyixuxu @patrickvonplaten
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored.
In `StableDiffusionControlNetInpaintPipeline` we use the entire image (including the masked area) to create the initial `latents`. This causes the pixels in the masked area to influence our generation results, and is particularly a problem when we use a `strength` value that's not `1.0`. Some of the issues reported in #3959 (comment) may be related to this too.

testing results

I've attached the script I use for testing in this PR. In this particular example, we removed the masked area from the image and pasted black over it, so in the input image, mask area = black (see the sketch below).

image and mask inputs for the pipeline *(images omitted)*
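A minimal sketch of how such a test input can be built (the file names are placeholders for the inputs shown above):

```python
import numpy as np
from PIL import Image

image = np.asarray(Image.open("input.png").convert("RGB"), dtype=np.float32) / 255.0
mask = np.asarray(Image.open("mask.png").convert("L"), dtype=np.float32) / 255.0  # 1 = inpaint

# Paste black over the masked area, so mask area = black in the pipeline input.
masked_input = image * (1.0 - mask[..., None])
Image.fromarray((masked_input * 255).astype(np.uint8)).save("masked_input.png")
```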
strength = 0.99

Our pipeline was not able to generate a new background according to the text prompt for `strength = 0.99`. See outputs below; both outputs were generated with `seed = 0` using the testing script I provided. *(images omitted)*

strength = 1.0
With `strength = 1.0`, the initial latent we use is pure noise, so our pipeline was able to generate a new background. However, because of this line here, the unmasked area is not perfectly preserved:

diffusers/src/diffusers/pipelines/controlnet/pipeline_controlnet_inpaint.py, line 1383 in bc7a4d4
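The referenced line sits in the denoising loop. In current diffusers inpaint pipelines the relevant step looks roughly like this, paraphrased as a standalone function, so treat it as a sketch rather than the exact source at that permalink:

```python
import torch

def repaint_unmasked(latents, image_latents, init_mask, noise, scheduler, timesteps, i):
    # Re-noise the original image latents to the next timestep and paste them
    # back everywhere outside the mask (init_mask == 1 inside the inpaint region).
    init_latents_proper = image_latents
    if i < len(timesteps) - 1:
        noise_timestep = timesteps[i + 1]
        init_latents_proper = scheduler.add_noise(
            init_latents_proper, noise, torch.tensor([noise_timestep])
        )
    return (1 - init_mask) * init_latents_proper + init_mask * latents
```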
fix unmasked area

In auto1111, the unmasked area is actually always pasted over the generated output as a final step: https://github.com/AUTOMATIC1111/stable-diffusion-webui/blob/5ef669de080814067961f28357256e8fe27544f4/modules/processing.py#L929C32-L929C32

We have a section in our doc about how to preserve the unmasked area, but maybe we should add it to the code, since I found the results are significantly better (a sketch of the compositing follows below):

diffusers/docs/source/en/using-diffusers/inpaint.md, line 295 in bc7a4d4
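A minimal sketch of that final-step compositing, assuming a PIL image/mask pair where white in the mask marks the inpainted region (function name and conventions are mine, not the library's):

```python
import numpy as np
from PIL import Image

def paste_unmasked(original: Image.Image, generated: Image.Image,
                   mask: Image.Image) -> Image.Image:
    """Keep the original pixels everywhere outside the mask, as a final step."""
    w, h = original.size
    orig = np.asarray(original.convert("RGB"), dtype=np.float32)
    gen = np.asarray(generated.convert("RGB").resize((w, h)), dtype=np.float32)
    m = np.asarray(mask.convert("L").resize((w, h)), dtype=np.float32)[..., None] / 255.0
    out = orig * (1.0 - m) + gen * m
    return Image.fromarray(out.astype(np.uint8))
```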
improved outputs with this PR

I think the outputs are consistently pretty OK (I think these generations are of the same quality as auto1111 with similar settings) when we also use `strength = 0.99`.

Images generated using the testing script, for seeds 0 ~ 9 *(images omitted)*

testing script *(attached to the PR; not reproduced here)*