Encoding longer sequences #21

Open
ashenflower opened this issue May 24, 2021 · 5 comments

Comments

@ashenflower

Is there an existing function for encoding longer sequences (such as sequencing reads) from their k-mer embeddings?

@luciabarb

I have the same question, did you get any reply?

@ashenflower
Author

> I have the same question, did you get any reply?

No, unfortunately. I guess you could sum the embeddings of all the k-mers in a read to get its final embedding.

@EspinosaLeal

Did it work to sum the embeddings of different k-mers?

@ashenflower
Author

> Did it work to sum the embeddings of different k-mers?

Sorry for the delay! I didn't pursue it any further, but I think it makes sense; it would be similar to working with word2vec.
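
For anyone weighing summing against averaging: the two differ only by a factor of the k-mer count, so cosine similarity between reads comes out the same either way, while the norm of a summed vector grows with read length. A minimal sketch, assuming random stand-in vectors rather than real dna2vec embeddings (all names here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in k-mer embeddings (random placeholders, not from a trained model):
# read A has 20 k-mers, read B has 50, each embedded in 100 dimensions.
kmers_a = rng.normal(size=(20, 100))
kmers_b = rng.normal(size=(50, 100))

def cosine(u, v):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

sum_a, sum_b = kmers_a.sum(axis=0), kmers_b.sum(axis=0)
mean_a, mean_b = kmers_a.mean(axis=0), kmers_b.mean(axis=0)

# Cosine similarity ignores scale, so sum and mean pooling agree.
same_cosine = bool(np.isclose(cosine(sum_a, sum_b), cosine(mean_a, mean_b)))

# The summed vector is exactly (number of k-mers) times the mean vector.
norm_ratio_a = float(np.linalg.norm(sum_a) / np.linalg.norm(mean_a))
```

So the choice mostly matters when a downstream step is sensitive to vector length (for example Euclidean distance), in which case averaging keeps reads of different lengths on a comparable scale.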

@eternal-bug

I think you need something like average pooling over the sequence's k-mer vectors, maybe like this:

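import numpy as np
from dna2vec.multi_k_model import MultiKModel

filepath = 'pretrained/dna2vec-20161219-0153-k3to8-100d-10c-29320Mbp-sliding-Xat.w2v'
mk_model = MultiKModel(filepath)

def get_kmer(seq, k):
    # Assumed behaviour: all overlapping k-mers, sliding one base at a time.
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

seq = "AGCTACG......"  # placeholder; use a full A/C/G/T sequence here

# Look up the embedding of every 3-mer in the sequence.
vecs = [np.array(mk_model.vector(kmer)) for kmer in get_kmer(seq, 3)]
# Average over the k-mers to get one 100-dim vector, whatever the sequence length.
vec_pool = np.mean(vecs, axis=0)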
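
Real reads often contain ambiguous bases such as N, which a k-mer model trained on A/C/G/T would probably not have a vector for. Below is a rough sketch of a pooling helper that skips such k-mers and supports both mean and sum pooling. The names `sliding_kmers`, `pool_kmer_vectors` and `toy_vector` are hypothetical and not part of dna2vec; the toy lookup only stands in for a trained model so the example runs on its own.

```python
import numpy as np

def sliding_kmers(seq, k):
    """Return all overlapping k-mers of seq (stride 1)."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def pool_kmer_vectors(seq, k, vector_fn, pool="mean"):
    """Pool the vectors of all valid k-mers in seq into one fixed-size vector.

    vector_fn maps a k-mer string to a 1-D array-like.
    K-mers containing characters other than A, C, G, T are skipped.
    """
    kmers = [km for km in sliding_kmers(seq.upper(), k) if set(km) <= set("ACGT")]
    if not kmers:
        raise ValueError("no valid k-mers in sequence")
    vecs = np.stack([np.asarray(vector_fn(km), dtype=float) for km in kmers])
    if pool == "mean":
        return vecs.mean(axis=0)
    if pool == "sum":
        return vecs.sum(axis=0)
    raise ValueError(f"unknown pool: {pool!r}")

def toy_vector(kmer):
    """Toy stand-in for a trained model: counts of A, C, G, T in the k-mer."""
    return [kmer.count(base) for base in "ACGT"]

read = "AGCTNACG"  # the N invalidates three of the six 3-mers
pooled_mean = pool_kmer_vectors(read, 3, toy_vector)
pooled_sum = pool_kmer_vectors(read, 3, toy_vector, pool="sum")
```

With the pretrained model loaded as above, passing `mk_model.vector` as `vector_fn` should give the same kind of pooled vector, assuming every remaining k-mer is in the model's vocabulary.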
