Encoding longer sequences #21

ashenflower · 2021-05-24T12:29:34Z

Is there already any implemented function to encode longer sequences (such as sequencing reads) using their k-mers embeddings?

luciabarb · 2021-06-21T15:19:09Z

I have the same question, did you get any reply?

ashenflower · 2021-07-07T16:04:30Z

> I have the same question, did you get any reply?

No, unfortunately... I guess you should sum the embeddings of all the k-mers of a read to get its final embedding.
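
A minimal sketch of what summing could look like, assuming a plain dict lookup instead of the real model. The names `kmers`, `sum_pool`, and `toy_embed` are made up for illustration, and the 2-dimensional vectors are not real embeddings:

```python
import numpy as np

def kmers(seq, k):
    """Return all overlapping k-mers of seq (sliding window, stride 1)."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def sum_pool(seq, k, embed):
    """Sum the embedding vectors of every k-mer in seq.

    `embed` is a hypothetical dict mapping k-mer -> 1-D numpy array.
    """
    return np.sum([embed[km] for km in kmers(seq, k)], axis=0)

# Toy 2-dim embeddings, invented for this example only.
toy_embed = {
    "AC": np.array([1.0, 0.0]),
    "CG": np.array([0.0, 1.0]),
    "GA": np.array([1.0, 1.0]),
}

short = sum_pool("ACG", 2, toy_embed)      # AC + CG
longer = sum_pool("ACGACG", 2, toy_embed)  # AC + CG + GA + AC + CG
```

Note that the summed vector grows with the number of k-mers, so reads of different lengths end up on different scales; dividing by the k-mer count (i.e. taking the mean) removes that dependence.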

EspinosaLeal · 2022-02-03T21:32:31Z

Did it work to sum the embeddings of different k-mers?

ashenflower · 2022-03-03T14:13:39Z

> Did it work to sum the embeddings of different k-mers?

I'm sorry for the delay! I didn't pursue it any further, but I think it makes sense; it would be similar to working with word2vec.

eternal-bug · 2023-04-23T05:09:56Z

I think you need to do something like average pooling over a sequence's k-mer vectors, maybe like this:

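import numpy as np
from dna2vec.multi_k_model import MultiKModel

filepath = 'pretrained/dna2vec-20161219-0153-k3to8-100d-10c-29320Mbp-sliding-Xat.w2v'
mk_model = MultiKModel(filepath)

def get_kmer(seq, k):
    # yield every overlapping k-mer of seq (sliding window, stride 1)
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]

seq = "AGCTACG......"  # placeholder: substitute a full A/C/G/T sequence

# look up the embedding of each 3-mer in the sequence
vecs = [np.array(mk_model.vector(kmer)) for kmer in get_kmer(seq, 3)]
# average the 100-dim k-mer vectors into one fixed-size vector,
# regardless of the sequence length
vec_pool = np.mean(vecs, axis=0)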
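
For anyone who wants to try the pooling logic without the pretrained file, here is a self-contained sketch. `ToyModel` is a hypothetical stand-in that only mimics a `vector(kmer)` lookup with random vectors, and `mean_pool` is an illustrative helper, not part of dna2vec. It also skips k-mers containing characters other than A/C/G/T (e.g. `N` in sequencing reads), on the assumption that such k-mers are not in the model's vocabulary:

```python
import numpy as np
from itertools import product

class ToyModel:
    """Hypothetical stand-in for a k-mer embedding model (random vectors)."""

    def __init__(self, k, dim, seed=0):
        rng = np.random.default_rng(seed)
        # one random vector per possible k-mer over the alphabet A/C/G/T
        self.table = {
            "".join(p): rng.normal(size=dim) for p in product("ACGT", repeat=k)
        }

    def vector(self, kmer):
        return self.table[kmer]

def mean_pool(seq, k, model, dim):
    """Average the vectors of all k-mers of seq made only of A/C/G/T.

    Returns a zero vector if the sequence contains no usable k-mer.
    """
    vecs = []
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= set("ACGT"):  # skip ambiguous bases such as N
            vecs.append(model.vector(kmer))
    if not vecs:
        return np.zeros(dim)
    return np.mean(vecs, axis=0)

model = ToyModel(k=3, dim=4)
read_vec = mean_pool("ACGTNACGTT", 3, model, dim=4)  # one 4-dim vector for the read
```

Returning a zero vector for reads with no usable k-mer is just one possible choice; raising an error or dropping the read may fit better depending on the downstream use.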
