Gene vs. transcript level expression #17

Open
fairliereese opened this issue Jul 3, 2020 · 2 comments

Comments

@fairliereese
Collaborator

fairliereese commented Jul 3, 2020

I know I keep asking a bunch of questions that are probably specific to the pipeline we use; sorry if these are mostly irrelevant for others...

In our case, we use filtered and unfiltered abundance files to quantify transcripts and genes, respectively. This is because our pipeline does not try to assign incomplete transcript reads to known transcript models and instead creates its own transcript models for them. These models rarely pass our filter, so we don't perform transcript-level quantification on them, but we still use them for gene-level quantification.

What would be the recommended course of action here? This matters especially because including the unfiltered transcripts in the expression matrix as it exists now would yield transcript entries that have no corresponding gene entry, since we exclude them from the models.gtf file.
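To illustrate the mismatch described above, here is a minimal sketch that lists expression-matrix transcript IDs with no entry in a GTF. It assumes the standard GTF attribute column (`gene_id "..."; transcript_id "...";`); the function names and the example IDs are hypothetical and not part of any submission tooling.

```python
import re

# Matches GTF attribute pairs of the form: key "value"
_ATTR = re.compile(r'(\S+) "([^"]*)"')


def transcript_to_gene_from_gtf(lines):
    """Build a transcript_id -> gene_id mapping from GTF lines (hypothetical helper)."""
    mapping = {}
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # not a complete GTF record
        attrs = dict(_ATTR.findall(fields[8]))
        if "transcript_id" in attrs and "gene_id" in attrs:
            mapping[attrs["transcript_id"]] = attrs["gene_id"]
    return mapping


def missing_from_gtf(matrix_ids, mapping):
    """Return matrix transcript IDs that the GTF does not describe."""
    return sorted(set(matrix_ids) - set(mapping))


# Made-up example records; t3 stands for an unfiltered model left out of the GTF.
example_gtf = [
    'chr1\tsrc\ttranscript\t1\t100\t.\t+\t.\tgene_id "gA"; transcript_id "t1";',
    'chr1\tsrc\ttranscript\t50\t200\t.\t+\t.\tgene_id "gA"; transcript_id "t2";',
]
example_mapping = transcript_to_gene_from_gtf(example_gtf)
orphans = missing_from_gtf(["t1", "t2", "t3"], example_mapping)
```

Any ID returned by `missing_from_gtf` is a transcript that could not be attributed to a gene when summing expression per locus.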

I also can't recall whether gene-level quantification is even one of the challenges that will be scored; if it isn't, this question is irrelevant.

@julienlag
Collaborator

According to https://github.com/diekhans/lrgasp-submissions/blob/master/docs/expression_matrix_format.md, "Gene expression will be calculated summing up the expression values of all the transcripts coming from the same locus." I think that means we don't expect participants to submit gene expression values. Instead, if I understand correctly, the evaluation pipeline will build gene models out of transcript model submissions and calculate gene expression (GE) values by summing up transcript expression (TE) values. You're raising a valid point that would be worth discussing, however.
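A minimal sketch of the summation described in the quoted sentence, assuming a transcript-to-gene mapping is available. This is not the evaluation pipeline's actual code; the function name, IDs, and values are made up for illustration.

```python
from collections import defaultdict


def gene_expression(transcript_expr, transcript_to_gene):
    """Sum transcript-level values per gene.

    Returns (gene totals, transcript IDs that have no gene assignment).
    """
    totals = defaultdict(float)
    unmapped = []
    for transcript_id, value in transcript_expr.items():
        gene_id = transcript_to_gene.get(transcript_id)
        if gene_id is None:
            unmapped.append(transcript_id)  # cannot contribute to any gene
            continue
        totals[gene_id] += value
    return dict(totals), sorted(unmapped)


# Made-up values: t4 has no gene entry, as in the case raised in this issue.
te_values = {"t1": 10.0, "t2": 5.5, "t3": 2.0, "t4": 1.0}
gene_of = {"t1": "gA", "t2": "gA", "t3": "gB"}
ge_values, unmapped_ids = gene_expression(te_values, gene_of)
```

Under this scheme, a transcript without a gene assignment is silently lost from gene-level totals unless it is reported separately, which is the gap the original question points at.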

@fairliereese
Collaborator Author

Perhaps we can allow users to optionally include a list of the transcript models they consider their actual high-confidence models? That way they can provide gene-to-transcript ID information in the GTF for models that definitely belong to a specific gene but are not good enough to be considered models on their own, and still use them for gene-level quantification.
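To make the proposal concrete, here is a rough sketch of how an optional high-confidence list might be used. It assumes every transcript in the matrix has a gene assignment in the GTF; the function name, the list format, and the example values are hypothetical and not part of the current submission spec.

```python
def split_quantification(transcript_expr, transcript_to_gene, high_confidence_ids):
    """Keep only high-confidence models at transcript level, but sum all models per gene."""
    keep = set(high_confidence_ids)
    # Transcript-level output: restricted to the submitter's high-confidence models.
    transcript_level = {t: v for t, v in transcript_expr.items() if t in keep}
    # Gene-level output: every model with a gene assignment contributes.
    gene_level = {}
    for transcript_id, value in transcript_expr.items():
        gene_id = transcript_to_gene[transcript_id]
        gene_level[gene_id] = gene_level.get(gene_id, 0.0) + value
    return transcript_level, gene_level


# Made-up example: t2 belongs to gene gA but did not pass the submitter's filter.
expr = {"t1": 8.0, "t2": 4.0, "t3": 3.0}
genes = {"t1": "gA", "t2": "gA", "t3": "gB"}
tx_level, gn_level = split_quantification(expr, genes, ["t1", "t3"])
```

Here the low-confidence model still counts toward its gene's total while being excluded from transcript-level scoring.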
