I am not getting relevant results with m2-bert-80M-32k-retrieval #38

Open

legaltextai opened this issue Aug 5, 2024 · 3 comments

legaltextai commented Aug 5, 2024

I embedded my documents using the API with togethercomputer/m2-bert-80M-32k-retrieval, and it significantly underperforms other models. In most cases, each embedding is the full text of a Supreme Court decision. I store the vectors in Postgres and use pgvector for similarity search, indexed with HNSW. Should I use a different index?

Thanks
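
For reference, a minimal sketch of the setup (the connection string, table name, file path, and the 768-dim size are illustrative placeholders, not my exact code):

```python
# Minimal sketch: embed with the Together API, store in Postgres with
# pgvector, retrieve by cosine distance over an HNSW index.
# Table name, connection string, and the 768-dim size are placeholders.
import numpy as np
import psycopg2
from pgvector.psycopg2 import register_vector
from together import Together

client = Together()  # reads TOGETHER_API_KEY from the environment

def embed(text: str) -> np.ndarray:
    resp = client.embeddings.create(
        model="togethercomputer/m2-bert-80M-32k-retrieval",
        input=text,
    )
    return np.array(resp.data[0].embedding)

conn = psycopg2.connect("dbname=caselaw")  # hypothetical database
conn.autocommit = True
cur = conn.cursor()
cur.execute("CREATE EXTENSION IF NOT EXISTS vector")
register_vector(conn)
cur.execute(
    "CREATE TABLE IF NOT EXISTS decisions ("
    " id bigserial PRIMARY KEY,"
    " body text,"
    " embedding vector(768))"
)
# HNSW index with cosine distance (pgvector >= 0.5.0)
cur.execute(
    "CREATE INDEX IF NOT EXISTS decisions_hnsw_idx "
    "ON decisions USING hnsw (embedding vector_cosine_ops)"
)

# Index: each row holds the full text of one decision.
decision_text = open("some_decision.txt").read()  # placeholder document
cur.execute(
    "INSERT INTO decisions (body, embedding) VALUES (%s, %s)",
    (decision_text, embed(decision_text)),
)

# Search: the query is embedded with the same call; <=> is cosine distance.
cur.execute(
    "SELECT id FROM decisions ORDER BY embedding <=> %s LIMIT 10",
    (embed("standard of review for qualified immunity"),),
)
print(cur.fetchall())
```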

DanFu09 (Collaborator) commented Aug 5, 2024

CC @jonsaadfalcon

Try the V1 models: https://huggingface.co/hazyresearch/M2-BERT-32K-Retrieval-Encoder-V1

Those have seen some legal data during training, so hopefully they'll work a bit better :)

If they still don't work, I'd love to hear more about your setup and see if there's a mismatch with how we trained them!
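
If you want to sanity-check the V1 encoder locally, here's a rough sketch of inference, assuming the checkpoint loads the same way the togethercomputer/m2-bert-80M-32k-retrieval model card describes (the input text is a placeholder; treat the details as unverified for the V1 card):

```python
# Rough sketch of local inference; assumes the V1 checkpoint follows the
# same loading pattern as the togethercomputer M2-BERT retrieval model cards.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

max_seq_length = 32768
model = AutoModelForSequenceClassification.from_pretrained(
    "hazyresearch/M2-BERT-32K-Retrieval-Encoder-V1",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(
    "bert-base-uncased", model_max_length=max_seq_length
)

text = "Full text of a court decision ..."  # placeholder
inputs = tokenizer(
    text,
    return_tensors="pt",
    padding="max_length",
    truncation=True,
    max_length=max_seq_length,
    return_token_type_ids=False,
)
with torch.no_grad():
    outputs = model(**inputs)
embedding = outputs["sentence_embedding"]  # 768-dim sentence embedding
```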

legaltextai (Author) commented Aug 5, 2024

Both the query and the documents use the same embedding protocol, correct? I don't need to add any extra prompt when embedding the query, like in UAE-Large, right? Is it OK to use cosine similarity and an HNSW index? There are also some smaller models with shorter context windows that I'm testing; I doubt they have seen any legal data, but you could test them and see how they perform.

DanFu09 (Collaborator) commented Aug 5, 2024

Yes, both queries and documents use the same protocol and model; there's no extra prompt.
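
To illustrate (standalone snippet, not project code; the query strings are made up):

```python
# Illustration: queries and documents go through the identical embedding
# call; no instruction prefix is prepended to the query (unlike models
# such as UAE-Large, which expect one at query time).
import numpy as np
from together import Together

client = Together()

def embed(text: str) -> np.ndarray:
    resp = client.embeddings.create(
        model="togethercomputer/m2-bert-80M-32k-retrieval", input=text
    )
    return np.array(resp.data[0].embedding)

doc_vec = embed("Full text of a Supreme Court decision ...")
query_vec = embed("qualified immunity standard of review")  # same call, no prefix

# Cosine similarity is fine; pgvector's <=> operator returns the matching
# cosine *distance* (1 - similarity), so the rankings agree.
cos_sim = float(
    doc_vec @ query_vec / (np.linalg.norm(doc_vec) * np.linalg.norm(query_vec))
)
```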
