Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

subseq enhancement - fasta ID and comment for bed, gtf and gff #177

Open
thackl opened this issue Jan 22, 2021 · 2 comments
Open

subseq enhancement - fasta ID and comment for bed, gtf and gff #177

thackl opened this issue Jan 22, 2021 · 2 comments

Comments

@thackl
Copy link

thackl commented Jan 22, 2021

Hi,

thanks so much for seqkit - it's one of the tools I use almost on a daily basis and I love it!

There is one feature though, that I feel is lacking a bit. I regularly use seqkit subseq to extract features based on annotations files. Ideally, I would like to have more control over the ID that is given to the extracted fasta sequences and possibly the comment as well. Usually, my annotations come in gff3 (which is not supported), so I convert to bed, and then I use a perl-one-liner to turn the fasta comment into the fasta ID. It works, but since I use it so often, and I guess others might too, it feels like it would be very useful if seqkit natively provided that functionality.

This issue has been raised before: #154, #89

# foo.fna
>foo
GTTTTGTTGACCAGACGATA

# foo.gff
foo     .       .       3       10      .       +       .       ID=foo_1;Name=gene-foo;product="bla bla bla";

# This is how most my calls look - it works but it feels a bit clunky :)
seqkit subseq --bed <(gff2bed foo.gff) foo.fna | perl -pe 's/>\S+\s/>/' 
[INFO] read BED file ...
Processing foo.gff
[INFO] 1 BED features loaded
[INFO] create FASTA index for foo.fna
>foo_1
TTTGTTGA

So what I would which for in terms of enhancements for seqkit would be:

  • gff3 support: gff3 seems so similar to gtf that it feels like adding support for it would not be a big issue, the parsing of tags would have to be modified slightly. But I think it would be very useful to many users.

  • ID tag: A flag like --id-tag could be added, similar to --gtf-tag. If set for gtf (/gff) it would tell which tag to use as ID instead of the default <seq_id:from-to> pattern. For bed, it could just accept a number indicating the column that should be used for IDs...

  • In a perfect world there would also be an option to add additional tags/columns into the comment section of the fasta header, i.e. something multiple --gtf-tags gene_id,transcript_id,product.

Obviously, those aren't crucial issues, but I think they would help to further improve seqkit API.
Cheers
Thomas

@shenwei356
Copy link
Owner

shenwei356 commented Jan 22, 2021

Thank you Thomas!

Hmm, seqkit does not mean to implement comprehensive gene feature files manipulations that bedtools and bedops have already provided.

gff3 support: gff3 seems so similar to gtf that it feels like adding support for it would not be a big issue, the parsing of tags would have to be modified slightly. But I think it would be very useful to many users.

GFF3 is much more complicated, I think.

ID tag: A flag like --id-tag could be added, similar to --gtf-tag. If set for gtf (/gff) it would tell which tag to use as ID instead of the default <seq_id:from-to> pattern. For bed, it could just accept a number indicating the column that should be used for IDs...

Good point.

In a perfect world there would also be an option to add additional tags/columns into the comment section of the fasta header, i.e. something multiple --gtf-tags gene_id,transcript_id,product.

I don't know, may be try later.

Sorry, I'm a little bit busy these days. I may handle this when I'm free.

@thackl
Copy link
Author

thackl commented Jan 23, 2021

Sorry, I'm a little bit busy these days. I may handle this when I'm free.

Sure, no worries! This is really just cherries on the cake. But just in case you pick it up at some point:

GFF3 is much more complicated [than GTF], I think.

Actually not at all. GFF3 derives from GTF and can store more complicated features, but when it comes to the 9th columns with the attributes, they are almost the same from a parsing perspective (they have different reserved keys but we don't care about that here)

# attributes in gtf (9th field)
gene_id "ENSG00000223972"; gene_name "DDX11L1"; gene_source "havana";...
# Same attributes in gff3 (9th field)
gene_id=ENSG00000223972;gene_name=DDX11L1;gene_source=havana;...

The differences in gff3: key and value are separated by "=" instead of a space; values are not enclosed in "". And no extra space before ";". In Perl, for example, they could be read with one and the same regexp

perl -MData::Dumper -ne 'chomp(); 
  # captures:        (key)      (value)
  %tags = m/(?:^|;) ?(\w+)[ =]"?([^;"]+)/g;
print Dumper(\%tags)' gtf-gff3-tags.txt
$VAR1 = {
          'gene_source' => 'havana',
          'gene_name' => 'DDX11L1',
          'gene_id' => 'ENSG00000223972'
        };
$VAR1 = {
          'gene_source' => 'havana',
          'gene_name' => 'DDX11L1',
          'gene_id' => 'ENSG00000223972'
        };

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants