Encoding longer sequences #21
Comments
I have the same question. Did you get any reply?
No, unfortunately... I guess you should sum the embeddings of all the k-mers of a read to get its final embedding.
Did summing the embeddings of the different k-mers work?
Sorry for the delay! I didn't try it any further, but I think it makes sense; it would be similar to working with word2vec.
I think you need to do something like average pooling over the k-mer vectors of a sequence, maybe like this:
import numpy as np
from dna2vec.multi_k_model import MultiKModel
filepath = 'pretrained/dna2vec-20161219-0153-k3to8-100d-10c-29320Mbp-sliding-Xat.w2v'
mk_model = MultiKModel(filepath)
def get_kmer(seq, k):
    # slide a window of length k over the sequence, one base at a time
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]
seq = "AGCTACG......"  # placeholder, replace with a real sequence
# look up the embedding of each 3-mer (not of k itself)
vecs = [np.array(mk_model.vector(kmer)) for kmer in get_kmer(seq, 3)]
# get the 100-dim average vec, whatever the sequence length
vec_pool = np.mean(vecs, axis=0)
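For anyone who wants to try both ideas from this thread (summing and averaging) without loading the pretrained model, here is a minimal, self-contained sketch. The names `kmers`, `embed_read`, and `toy_lookup` are hypothetical helpers, not part of dna2vec; `toy_lookup` just counts bases so the result is easy to check by hand. With the real model you would presumably pass `mk_model.vector` as `lookup`, though k-mers missing from the vocabulary (for example ones containing `N`) would likely need to be filtered out first.

```python
import numpy as np

def kmers(seq, k):
    """Return all overlapping k-mers of seq (sliding window, stride 1)."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def embed_read(seq, k, lookup, mode="mean"):
    """Pool per-k-mer vectors into one fixed-size vector.

    lookup is any callable mapping a k-mer string to a 1-D vector
    (an assumed stand-in for something like mk_model.vector).
    """
    parts = kmers(seq, k)
    if not parts:
        raise ValueError("sequence is shorter than k")
    vecs = np.stack([np.asarray(lookup(km), dtype=float) for km in parts])
    if mode == "sum":
        return vecs.sum(axis=0)
    if mode == "mean":
        return vecs.mean(axis=0)
    raise ValueError("mode must be 'sum' or 'mean'")

def toy_lookup(kmer):
    """Toy embedding for illustration only: counts of A, C, G, T."""
    return [kmer.count(base) for base in "ACGT"]

read = "ACGT"
summed = embed_read(read, 2, toy_lookup, mode="sum")
averaged = embed_read(read, 2, toy_lookup, mode="mean")
```

The sum grows with read length while the mean is length-normalized, so the mean is probably the safer default when reads differ in length. Both discard the order of the k-mers.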
Is there already any implemented function to encode longer sequences (such as sequencing reads) using their k-mers embeddings?