When using multi-LoRA inference, I am curious how the back-end GPU utilization works, and what the caching behavior and overhead look like.
For example, suppose that at startup no LoRA path has been loaded via the curl command and only enable_dynamic_loading is set. How does caching work in that case? Is there a pre-allocated region of GPU memory already reserved for loading LoRA modules, and how many LoRA modules are allowed?
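To make this concrete, here is the kind of request I mean; this is just a sketch assuming vLLM-style dynamic-loading endpoints, and the adapter name and path below are placeholders:

```bash
# Load an adapter at runtime via the dynamic-loading endpoint
# (endpoint name assumes vLLM's /v1/load_lora_adapter; adjust if this
#  project exposes dynamic loading differently)
curl -X POST http://localhost:8000/v1/load_lora_adapter \
  -H "Content-Type: application/json" \
  -d '{"lora_name": "sql_adapter", "lora_path": "/path/to/sql_adapter"}'
```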
Also, I am curious about the other case: suppose I have already loaded 2 LoRA modules via the initial curl command. Are they stored in GPU memory or in host RAM, and when switching between them for inference, how much overhead and computational cost does the switch incur?
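By "switching" I mean selecting a different adapter per request, e.g. as below (again a sketch with placeholder names, assuming the OpenAI-compatible API where the adapter is selected via the model field):

```bash
# Route this request through one of the two preloaded adapters;
# switching adapters between requests is just changing the "model" field
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "sql_adapter", "prompt": "Hello", "max_tokens": 32}'
```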