Question not an issue #2
Hi, this looks amazing. I have a quick question: given that we have the encoder transformer and the predictor transformer, how would you go about the k-NN test? Say we have a 224x224 image; if you use patches, the context encoder (no masking) would output many, many embeddings for one image. Thank you!
Comments
Hello, thanks! I (mean) pool all patch embeddings into a single embedding. This is pretty standard for linear probes, but is somewhat atypical for k-NN tests (which usually use a [CLS] token, e.g. DINO and iBOT); I did not use a [CLS] token because the JEPA paper didn't use one.
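As a concrete illustration of this pooling approach (a minimal sketch, not the repository's actual evaluation code): with the common 16x16 patch size, a 224x224 image yields (224/16)^2 = 196 patch embeddings, which are averaged into one feature vector per image before the k-NN lookup. The `encoder`, `embed`, and `knn_predict` names below are illustrative.

```python
# Minimal sketch of k-NN evaluation over mean-pooled patch embeddings.
# Assumes `encoder` is the unmasked context encoder and returns a
# (batch, num_patches, dim) tensor; all names here are illustrative.
import torch
import torch.nn.functional as F

@torch.no_grad()
def embed(encoder, images):
    patches = encoder(images)               # (B, 196, D) for 224x224 images, 16x16 patches
    pooled = patches.mean(dim=1)            # (B, D): one embedding per image
    return F.normalize(pooled, dim=-1)      # unit-norm, so dot product = cosine similarity

@torch.no_grad()
def knn_predict(train_feats, train_labels, test_feats, k=20):
    sims = test_feats @ train_feats.T       # (B_test, N_train) cosine similarities
    _, idx = sims.topk(k, dim=1)            # indices of the k nearest training images
    neighbour_labels = train_labels[idx]    # (B_test, k)
    preds, _ = neighbour_labels.mode(dim=1) # majority vote among the k neighbours
    return preds
```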
Hi, thank you. Yes, pooling is the clear path, but I was wondering whether I'd missed something and wanted to check. Just a thought: what if the predictor also output a CLS token, and it's compared to the CLS token of the target (or context) encoder, as an additional loss term on top of the patch-level one? Also, I wanted to ask what your affiliation is with the main Yann LeCun paper; are you one of the authors?
That's an interesting idea. I can look into implementing a CLS token, but it might be tough given how I have structured everything so far. I am not one of the original authors; I just implemented the paper out of curiosity and interest! If you are curious about the official implementation, they actually open-sourced their model and code the other day.
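For what it's worth, here is a hypothetical sketch of the auxiliary CLS loss suggested above. It assumes both the encoder and predictor were extended to carry a [CLS] token, which neither the paper nor this implementation does; the function name, the smooth-L1 choice, and the 0.1 weighting are all assumptions.

```python
# Hypothetical sketch of the suggested auxiliary CLS loss; assumes the
# encoder and predictor both prepend a [CLS] token, which neither the
# paper nor this implementation actually does.
import torch.nn.functional as F

def jepa_loss_with_cls(pred_patches, target_patches, pred_cls, target_cls,
                       cls_weight=0.1):
    # Original patch-level term: predictor output vs. target-encoder patches.
    patch_loss = F.smooth_l1_loss(pred_patches, target_patches.detach())
    # Extra term: predicted [CLS] vs. the target encoder's [CLS] token.
    cls_loss = F.smooth_l1_loss(pred_cls, target_cls.detach())
    return patch_loss + cls_weight * cls_loss
```

The `.detach()` calls reflect the usual stop-gradient on the EMA target branch, so only the context encoder and predictor receive gradients.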