-
LAVIS builds on CLIP/BLIP and seems to combine several techniques. A resolution of 256x256 may look small for detailed recognition, but LAVIS likely extracts the relevant image sections and upscales them, which would let it make accurate visual assessments even with limited pixel data. The CLIP/BLIP integration may also help with contextual understanding and the overall accuracy of the results.
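To make the "extract sections and upscale them" idea concrete, here is a minimal sketch in plain Python. It is not taken from LAVIS, and I have not verified that LAVIS works this way. The function names (`split_into_tiles`, `upscale_nearest`, `sections_for_model`), the 2x2 grid, and the nearest-neighbour resize are all assumptions made for illustration; the only number from this thread is the 256x256 target size.

```python
def split_into_tiles(image, rows, cols):
    """Split a 2D image (a list of pixel rows) into rows x cols tiles.

    Hypothetical helper: assumes the image height and width divide
    evenly by rows and cols.
    """
    height, width = len(image), len(image[0])
    tile_h, tile_w = height // rows, width // cols
    tiles = []
    for r in range(rows):
        for c in range(cols):
            tile = [
                row[c * tile_w:(c + 1) * tile_w]
                for row in image[r * tile_h:(r + 1) * tile_h]
            ]
            tiles.append(tile)
    return tiles


def upscale_nearest(tile, size):
    """Resize a 2D tile to size x size with nearest-neighbour sampling."""
    src_h, src_w = len(tile), len(tile[0])
    return [
        [tile[y * src_h // size][x * src_w // size] for x in range(size)]
        for y in range(size)
    ]


def sections_for_model(image, rows=2, cols=2, model_size=256):
    """Return every tile of the image, each upscaled to the model input size.

    Assumption: each returned section would then be scored separately
    by the vision model, as the question in this thread suggests.
    """
    return [
        upscale_nearest(tile, model_size)
        for tile in split_into_tiles(image, rows, cols)
    ]


# Toy 64x64 "image" whose pixel value encodes its own position.
image = [[y * 64 + x for x in range(64)] for y in range(64)]
sections = sections_for_model(image)
print(len(sections), len(sections[0]), len(sections[0][0]))
```

Each 32x32 tile ends up as its own 256x256 input, so fine detail inside a tile gets more pixels than it would if the whole image were resized once. The cost is one model call per section instead of one per image.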
-
Hi all,
Can anyone explain how LAVIS is able to give accurate visual results when CLIP/BLIP only works at 256x256? That's really small. At this size you can hardly recognise any detail. Does it extract image sections, scale them up, and judge them individually?