
BERT for Patents yields a 1024-element array, but embedding_v1 is 64 elements #49

Open
sthorpe11 opened this issue Jun 30, 2021 · 5 comments

Comments

@sthorpe11

How should I generate an embedding equivalent to embedding_v1? BERT for Patents generates a 1024-element embedding, but embedding_v1 is a 64-element embedding.

@wetherbeei
Collaborator

The model to generate embedding_v1 has not been released, and we also haven't released pre-embedded patents with the BERT model in BigQuery.

You could experiment with learning a mapping from BERT to embedding_v1 with a linear layer; they should match up reasonably well, since both are derived from the patent text. embedding_v1 comes from a set-of-words unigram model.
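For illustration, a minimal sketch of that linear-mapping idea, assuming you have paired data: 1024-dim BERT-for-Patents vectors you computed yourself, aligned with 64-dim embedding_v1 vectors exported from the public BigQuery table. The file names, the ridge penalty, and the BigQuery table/column names in the comment are assumptions, not part of this repo:

```python
# Sketch: fit a linear map from 1024-dim BERT embeddings to 64-dim embedding_v1.
# Assumes X was computed locally with BERT for Patents and Y holds the matching
# embedding_v1 rows, e.g. exported with a query like:
#
#   SELECT publication_number, embedding_v1
#   FROM `patents-public-data.google_patents_research.publications`
#   WHERE ARRAY_LENGTH(embedding_v1) > 0
#
# (table/column names are my assumption about the public dataset schema).
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

X = np.load("bert_1024.npy")        # (n, 1024) BERT-for-Patents vectors, placeholder file
Y = np.load("embedding_v1_64.npy")  # (n, 64) aligned embedding_v1 vectors, placeholder file

X_tr, X_te, Y_tr, Y_te = train_test_split(X, Y, test_size=0.2, random_state=0)

# Ridge regression is a linear layer fit in closed form; alpha=1.0 is a guess.
mapper = Ridge(alpha=1.0).fit(X_tr, Y_tr)
print("held-out R^2:", mapper.score(X_te, Y_te))

# Project a new 1024-dim BERT embedding into embedding_v1 space.
y_hat = mapper.predict(X_te[:1])    # shape (1, 64)
```

How well the held-out R^2 comes out would indicate whether a single linear layer is enough or a small MLP is worth trying.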

@sanealytics

Can you give some insight into how you dealt with BERT's limited window size?
E.g., did you choose between the abstract, the full patent text, etc.? Did you pool chunk embeddings? Something else?
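For illustration, a minimal sketch of the pooling option with Hugging Face transformers: embed overlapping 512-token windows and mean-pool the per-window [CLS] vectors. The checkpoint id (a community mirror of BERT for Patents), window size, and stride are assumptions; this is not necessarily what this project did:

```python
# Sketch: pool per-window [CLS] embeddings into one document vector.
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_ID = "anferico/bert-for-patents"  # assumed Hugging Face mirror of BERT for Patents
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModel.from_pretrained(MODEL_ID).eval()

def embed_long_text(text: str) -> torch.Tensor:
    # Fast tokenizers can split long input into overlapping windows for us.
    enc = tokenizer(
        text,
        max_length=512,
        stride=64,
        truncation=True,
        return_overflowing_tokens=True,
        padding=True,
        return_tensors="pt",
    )
    with torch.no_grad():
        out = model(input_ids=enc["input_ids"], attention_mask=enc["attention_mask"])
    cls_per_window = out.last_hidden_state[:, 0]  # (num_windows, 1024)
    return cls_per_window.mean(dim=0)             # one 1024-dim doc vector

vec = embed_long_text("A method and apparatus for ...")  # placeholder text
print(vec.shape)  # torch.Size([1024])
```

Other options the question alludes to, like embedding only the abstract or the first claim, avoid pooling entirely at the cost of discarding most of the document.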

@sthorpe11
Author

sthorpe11 commented Dec 2, 2021 via email

@sanealytics

Thanks for that quick response.
This repo is a great resource.

@KNT-AJ

KNT-AJ commented Feb 16, 2024

This repo is great. Thank you! Any plans to release the model that generated embedding_v1 or the BERT pre-embedded patents?
