Is there more precise documentation on CLI options? #1635
Replies: 1 comment
-
Running with --help will list everything, but it's pretty much unsorted, so:
You don't need the quotes around anything that wouldn't normally need them on Windows (filenames with spaces, or prompts containing characters that need quoting). You don't need the = between the argument name and its value either, as far as I can tell.
Your get_schedulers issue sounds like diffusers isn't installed, it's the wrong version, or something else is broken; all that function does is pull the one GPU-based scheduler available (defined in shark) and the ones from diffusers into an array. Without knowing the exact error I have no idea what the cause is.
There's no --model_id command line option unless they've renamed something. The precision defaults to fp16, so you don't need to specify that (in fact, specifying fp32 to override it doesn't seem to work in my experience; fp16 should be ~2x as fast on AMD cards, but forcing fp32 didn't change generation speed at all). Likewise, the default device is vulkan, so you don't need that either.
Prompting
The prompt is just --prompt and the negative prompt is --negative_prompt. Note that Shark unhelpfully doesn't warn or error if a command line argument is mistyped, doesn't exist, or its value isn't formatted the way it wants.
Selecting Models
Checkpoint is --ckpt_loc [filename] OR --hf_model_id [huggingface repo ID] (if you don't specify either, it'll load the 512x512 version of SD 2.1).
Warning: User-made SD2.0 768 or SD2.1 768 models won't work right or at all (neither does the default huggingface one, but it'll pretend to let you use it). Shark forces a fallback to the base 512x512 models on huggingface for both of those and optimizes for that. It also seems to ignore model config yaml files. With the user-made models I tried, this resulted in it downloading the gigantic version of CLiP and then failing with a size mismatch. With huggingface models you can get it to download the one you actually asked for using the JSON in the base directory, but it'll apply a tuning file created for the 512x512 model to the higher-resolution version. That results in catastrophic miscompilation and some kind of nearly pure-computation loop on the GPU that runs 200MHz above max boost clock (while only using 30% of the power limit) on my machine, takes over 10 seconds per iteration, and makes the entire Windows UI stutter, and that's if you actually wait for it to complete.
LoRA
Note that it only supports original LoRA, not LyCORIS or any of the other variants, only one at a time, and the LoRA weight cannot be specified. It always uses 0.75, which is a little high for some LoRAs, WAY too high for lots of them, and not high enough for a rare few. The default being that high means you're more likely to run into LoRA / model combos that seem really broken when you'd normally just be able to turn the strength down. Shark does it this way because, the way things are done now, most parameters to the compiled programs are pre-cooked before compilation (except the prompts), so changing the LoRA requires recompiling another 3GB of flatbuffers. Changing its strength would too.
Image Count
--batch_size X controls how many images are generated simultaneously. It usually slowed things down more than expected on my 7900XTX when I tried it in the past, because it pushes the memory boost clock and GPU boost clock to full speed at the same time, the card hits the power limit, everything downclocks sharply, and the frequency hops all over (single image generation doesn't max out the memory clock, things stay just under the power limit, and computation runs at full speed). More recent versions of shark have a longer delay between images when --batch_count is used for some reason, so there might be some advantage to batch_size now, but it's broken in the UI so I haven't tested it.
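For illustration, here's a hedged example of a full txt2img run using the flags above, following the main.py invocation style used elsewhere in this thread (the checkpoint filename and prompts are placeholders, and the huggingface repo ID is just an example pointing at the base 512x512 SD 2.1 model, so adjust to whatever you actually have; quotes and = signs are optional as noted above):
.\apps\stable_diffusion\scripts\main.py --app txt2img --ckpt_loc anythingV3_fp16.ckpt --prompt "planets and stars" --negative_prompt "text"
.\apps\stable_diffusion\scripts\main.py --app txt2img --hf_model_id stabilityai/stable-diffusion-2-1-base --prompt "planets and stars" --negative_prompt "text"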
Low Memory Options
--ondemand only loads the current stage of the model into VRAM and unloads it when done; the default is to keep every stage loaded between images, which is much faster but might not work if you only have 6GB or something.
Recommendations
I'd suggest generating multiple images with --batch_count N if you're running from the command line, since startup takes longer than producing a single image on my machine, and because VMFB files are ZIP64 archives of multiple smaller files (not compressed), which prevents Windows superfetch / standby memory from working correctly (the file has to be extracted in memory again, possibly reloaded from disk depending on how badly Python implemented mmap; they don't seem to be aware of how to use the equivalent on Windows last I looked, and some files are created as uncacheable temporaries but treated as permanent files on the first run anyway for some unknown reason) and then copied to the GPU (again), so you'll incur a huge time penalty between runs. Switching models in the webUI is slow enough even when they're already compiled and should still be in standby memory (I have 512GB of RAM, so for me that's anything from the past couple of weeks in general). I suspect Python's ZIP64 implementation isn't so hot either; uncompressed / STOREd files shouldn't take much time to copy elsewhere in memory... although llvm-iree splits the constants out into one module and each layer of the neural net into another, so it may need to do something silly to all of this before loading it too.
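As a sketch of that recommendation combined with the low-memory flag (same assumptions as the example above; the count of 4 is arbitrary):
.\apps\stable_diffusion\scripts\main.py --app txt2img --prompt "planets and stars" --negative_prompt "text" --batch_count 4 --ondemand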
-
Hello everyone,
I've been trying for some time to generate an image from a prompt (txt2img). I managed to generate a few with a prompt and a negative prompt, but I can't manage to generate one with a custom LoRA or model/checkpoint.
For example, something like:
.\apps\stable_diffusion\scripts\main.py --app="txt2img" --precision="fp16" --prompt="planets and stars" --device="vulkan" --negative_prompt="text" --????="anythingV3_fp16.ckpt"
"--????" should be model_id, I think, but there is some kind of lookup via "get_schedulers" that throws an exception when it's used with a custom argument.
There is a check afterwards to make sure the name ends with ckpt or safetensors, so what am I missing?
For the LoRA or VAE, how could I use custom ones?
Some simple examples in the documentation, like the one for the prompt, would make things so much clearer.
Just a quick recap of some basic features (batch size, count, width/height, VAE, LoRA, model).
If someone has played with this from the CLI, it would be great if they could just give some examples with a quick explanation.
Good night and thanks for the work!!!