Ensembl alt_allele table does not contain all alternative allele gene groups #9
Here's a spreadsheet that lists all of the representative genes with duplicate symbols in
The second lists all representative ensembl genes with duplicate symbols:
Looking at the GP6 genes, it appears
Some stats regarding the extent of duplicated symbols:
Source code:

import pandas as pd

commit = "c87a3194704e073db841c0643f566bc5036e9f75"  # homo_sapiens_core_104_38
url = f"https://github.com/related-sciences/ensembl-genes/raw/{commit}/genes.snappy.parquet"
genes_df = pd.read_parquet(url)

# filter to representative genes, which restricts to 1 gene per alt_allele group
repr_genes_df = genes_df.query("ensembl_gene_id == ensembl_representative_gene_id")

# find representative genes with duplicate symbols
dupl_repr_genes_df = (
    repr_genes_df
    [repr_genes_df.gene_symbol.duplicated(keep=False)]
    .sort_values(["gene_symbol", "ensembl_gene_id"])
    [["ensembl_gene_id", "gene_symbol", "gene_symbol_source_db", "gene_biotype", "chromosome", "seq_region_exc_type", "seq_region", "primary_assembly", "mhc"]]
)

# summarize number of duplicates for each symbol
def summarize(df):
    return pd.Series({
        "n_genes": len(df),
        "n_primary_assembly_genes": sum(df.primary_assembly),
        "examples": ", ".join(df.ensembl_gene_id.head(3)),
    })

dupl_gene_symbol_df = (
    dupl_repr_genes_df
    .groupby("gene_symbol")
    .apply(summarize)
    .reset_index()
    .sort_values("n_genes", ascending=False)
)

# export to excel
with pd.ExcelWriter(path="ensembl-duplicate-genes-not-in-alt-allele.xlsx") as writer:
    dupl_gene_symbol_df.to_excel(writer, sheet_name="symbols", index=False, freeze_panes=(1, 0))
    dupl_repr_genes_df.to_excel(writer, sheet_name="genes", index=False, freeze_panes=(1, 0))
Unfortunately, we had trouble setting the alt_allele table. This is currently being reviewed for human and mouse by the Havana team.
Good to know. Curious whether this will be fixed in time for release 105.
I'm afraid it's too late for release 105, as it's scheduled for the next week.
Rerunning the snippet above for
@michalszpak any updates from the Havana team? Should we switch to using gene symbols to detect alternative alleles rather than the alt_allele table?
Hi @dhimmel - thank you for your patience while we looked into this in more detail. We are currently working to fix the issue with the alt_allele table in Ensembl 108, which is scheduled for later this year.
Great to hear! Thanks @Ben-Ensembl for looking into this.
Ensembl 108
Was excited to see Ensembl genes 108 released today! I reran the analysis above based on the output in 57e3c3a from
In Ensembl 104, there were 151 groups of genes with duplicate symbols that were not grouped by the
Question: should these examples also be added to the alt_allele table? What prevented them from being grouped via alt alleles in release 108?
There is a new source of duplicate gene symbols that are not grouped by the
Examples:
Question: what is the rationale for having two gene records for transcribed unprocessed pseudogenes? If a biologist mentions A2MP1, which one are they referring to? Are they conceptually different? Should a single gene be selected as representative via the alt_allele table for these cases?
There are also some additional cases with duplicate symbols that do not fall into these two classes such as the plentiful
General question: our goal is to create a gene catalog with alternative alleles collapsed onto representative genes. Ideally there would be one representative ensembl gene per symbol, since anything else tends to mess up downstream bioinformatics analyses by duplicating certain genes and confusing users. Should we switch from using the
@Ben-Ensembl are you able to help me with these questions? Thanks ahead of time.
Hi @dhimmel, thank you for your patience whilst we looked into your query. Please find below comments from the Ensembl Genebuild and HAVANA team.
Q1: When I looked into the missing alt alleles for release 108, my search was based on gene symbols and genomic locations. Because gene symbols are not always assigned correctly by the Xref pipeline (question 2 is an example of that), I set a stringent filter so that there is only one alt allele for each reference allele on each alternate region. In addition, the alt allele location on the alternate region should map to the location of the reference allele on the reference chromosome. In the examples you provided, another gene with the same symbol in the same alternate region had already been added to the alt_allele table. For instance, there is another DEFB103A alt allele gene on CHR_HG76_PATCH, and another GOLGA6L10 alt allele gene on CHR_HSCHR15_5_CTG8. The missing CCL4L2 alt allele on CHR_HSCHR17_7_CTG4 seems to be a genuine error and we will look into it.
Q2: This is an unintended consequence of having split transcribed_unprocessed_pseudogene genes into their pseudogene and lncRNA components as separate genes. Regarding ITFG2-AS1 as another example of missing alt alleles: in this case, both genes are annotated on chromosome 12 and overlap each other. It seems that they could be merged. In principle, the alt_allele search avoided calling alt_allele genes on the same reference chromosome. My understanding is that the alt_allele table should link genes between reference and alternate regions. Going forward, I think that dubious cases like this should be assessed manually by Havana.
Q3: We plan to revise the alt alleles for release 110 with the addition of new patch regions from GRCh38.p14. For instance, it is unlikely that we will add alt alleles for the Y_RNA and Metazoa_SRP symbols and other symbols for small RNA genes that can be found multiple times on the primary assembly. I don't think we will add the transcribed unprocessed pseudogenes from Q2 either. From that point of view, if you must choose a representative gene for every gene symbol, you may need to switch from relying on the alt_allele table to doing your own gene grouping by symbol.
I hope this helps,
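The fallback suggested above, grouping genes by symbol rather than relying on the alt_allele table, might look roughly like the following sketch. It is only an illustration: rows are plain dicts standing in for the `ensembl_gene_id`, `gene_symbol`, and `primary_assembly` columns used elsewhere in this thread, the gene IDs are invented, and both the tie-break rule (primary assembly first, then lowest ID) and the `exempt_symbols` parameter for symbols such as Y_RNA are assumptions rather than anything Ensembl prescribes.

```python
from collections import defaultdict


def representatives_by_symbol(genes, exempt_symbols=frozenset()):
    """Map each ensembl_gene_id to a representative chosen per gene symbol."""
    by_symbol = defaultdict(list)
    representative = {}
    for gene in genes:
        symbol = gene["gene_symbol"]
        if symbol is None or symbol in exempt_symbols:
            # No symbol, or a symbol assumed to recur for distinct genes:
            # the gene stays its own representative.
            representative[gene["ensembl_gene_id"]] = gene["ensembl_gene_id"]
        else:
            by_symbol[symbol].append(gene)
    for members in by_symbol.values():
        # Assumed tie-break: primary assembly first, then lowest gene ID.
        best = min(
            members,
            key=lambda g: (not g["primary_assembly"], g["ensembl_gene_id"]),
        )
        for gene in members:
            representative[gene["ensembl_gene_id"]] = best["ensembl_gene_id"]
    return representative


# Toy rows with invented IDs and flags (not real Ensembl records).
toy_genes = [
    {"ensembl_gene_id": "TOY0002", "gene_symbol": "GP6", "primary_assembly": False},
    {"ensembl_gene_id": "TOY0001", "gene_symbol": "GP6", "primary_assembly": True},
    {"ensembl_gene_id": "TOY0003", "gene_symbol": "Y_RNA", "primary_assembly": True},
    {"ensembl_gene_id": "TOY0004", "gene_symbol": "Y_RNA", "primary_assembly": True},
]
symbol_reps = representatives_by_symbol(toy_genes, exempt_symbols={"Y_RNA"})
```

Leaving `exempt_symbols` empty collapses every duplicated symbol onto one gene, which gives one representative per symbol at the cost of merging genes that may be conceptually distinct.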
Thanks @amushtaq102 and Ensembl Team for that nice explanation. I've updated the analysis for the human release 111:
Many of these are the transcribed_unprocessed_pseudogene / lncRNA pairs you touch on. However, there are still many examples that it seems the alt_allele table should pick up, such as symbol groups where only a single Ensembl gene is on the primary assembly.
The key question we're after is whether the ensembl genes with duplicate symbols conceptually represent a single gene or multiple. I agree the Y_RNA and Metazoa_SRP symbols could be referring to conceptually different genes. But I think the vast majority of the 507 duplicated symbols that are not resolved in the
Ideally there'd be a way to identify the valid exceptions to the one-symbol-one-gene paradigm like the small RNA genes. But absent that, I think our users will benefit more from no duplicated gene symbols among representative genes, even if that is technically incorrect for situations like Y_RNA and Metazoa_SRP.

Code:

import pandas as pd

commit = "8f756b808e7ef75eaa3e32c35b4e7bb1c594ff26"  # homo_sapiens_core_111_38
url = f"https://github.com/related-sciences/ensembl-genes/raw/{commit}/genes.snappy.parquet"
genes_df = pd.read_parquet(url)

# filter to representative genes, which restricts to 1 gene per alt_allele group
repr_genes_df = genes_df.query("ensembl_gene_id == ensembl_representative_gene_id")

# find representative genes with duplicate symbols
dupl_repr_genes_df = (
    repr_genes_df
    [repr_genes_df.gene_symbol.duplicated(keep=False)]
    .sort_values(["gene_symbol", "ensembl_gene_id"])
    [["ensembl_gene_id", "gene_symbol", "gene_symbol_source_db", "gene_biotype", "chromosome", "seq_region", "primary_assembly", "mhc"]]
)

# summarize number of duplicates for each symbol
def summarize(df):
    return pd.Series({
        "n_genes": len(df),
        "n_primary_assembly_genes": sum(df.primary_assembly),
        "examples": ", ".join(df.ensembl_gene_id.head(3)),
    })

dupl_gene_symbol_df = (
    dupl_repr_genes_df
    .groupby("gene_symbol")
    .apply(summarize)
    .reset_index()
    .sort_values("n_genes", ascending=False)
)

# export to excel
with pd.ExcelWriter(path="ensembl-duplicate-genes-not-in-alt-allele.xlsx") as writer:
    dupl_gene_symbol_df.to_excel(writer, sheet_name="symbols", index=False, freeze_panes=(1, 0))
    dupl_repr_genes_df.to_excel(writer, sheet_name="genes", index=False, freeze_panes=(1, 0))
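
One way to triage the duplicated symbols is to split them by how many of their genes sit on the primary assembly. A group with exactly one primary-assembly gene looks like a candidate missed alt allele, while several primary-assembly genes (as with Y_RNA) may be distinct loci. The sketch below assumes rows shaped like the `symbols` sheet produced by the snippet above (`gene_symbol`, `n_genes`, `n_primary_assembly_genes`). The symbols and counts are invented and the bucket names are hypothetical.

```python
def triage_symbol_groups(summary_rows):
    """Bucket duplicated symbols by their number of primary-assembly genes."""
    buckets = {"no_primary": [], "single_primary": [], "multiple_primary": []}
    for row in summary_rows:
        n_primary = row["n_primary_assembly_genes"]
        if n_primary == 0:
            name = "no_primary"
        elif n_primary == 1:
            name = "single_primary"
        else:
            name = "multiple_primary"
        buckets[name].append(row["gene_symbol"])
    return buckets


# Invented symbols and counts, for illustration only.
toy_summary = [
    {"gene_symbol": "SYMBOL_A", "n_genes": 3, "n_primary_assembly_genes": 1},
    {"gene_symbol": "SYMBOL_B", "n_genes": 5, "n_primary_assembly_genes": 5},
    {"gene_symbol": "SYMBOL_C", "n_genes": 2, "n_primary_assembly_genes": 0},
]
triage = triage_symbol_groups(toy_summary)
```

With the DataFrame from the snippet above, `triage_symbol_groups(dupl_gene_symbol_df.to_dict("records"))` should yield the same kind of buckets.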
refs #9 enables grouping genes that are not grouped in alt_allele table.
We select a single representative gene for each group of alternative allele genes. These groups are based on the upstream alt_allele table, which provides a mapping between gene_ids and alt_allele_group_ids. However, there appear to be groups of genes that are alternative alleles of each other but are not included in this table. One example is the set of human genes with the symbol GP6. Will elaborate further in subsequent comments.
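
As a rough illustration of the selection described above, the sketch below picks one representative per alt_allele group from a gene_id to alt_allele_group_id mapping. It is not this repository's implementation: the rows are toy dicts with invented IDs, and the preference order (primary assembly first, then lowest Ensembl gene ID) is an assumption made for the example.

```python
from collections import defaultdict


def representatives_from_alt_allele(genes, alt_allele_rows):
    """Map each ensembl_gene_id to the representative of its alt_allele group."""
    group_of = {row["gene_id"]: row["alt_allele_group_id"] for row in alt_allele_rows}
    groups = defaultdict(list)
    for gene in genes:
        gene_id = gene["gene_id"]
        # Genes absent from alt_allele form singleton groups of their own.
        key = ("group", group_of[gene_id]) if gene_id in group_of else ("gene", gene_id)
        groups[key].append(gene)
    representative = {}
    for members in groups.values():
        # Assumed preference: primary assembly first, then lowest gene ID.
        best = min(
            members,
            key=lambda g: (not g["primary_assembly"], g["ensembl_gene_id"]),
        )
        for gene in members:
            representative[gene["ensembl_gene_id"]] = best["ensembl_gene_id"]
    return representative


# Toy rows with invented IDs (not real Ensembl records).
toy_gene_rows = [
    {"gene_id": 1, "ensembl_gene_id": "TOY0001", "primary_assembly": True},
    {"gene_id": 2, "ensembl_gene_id": "TOY0002", "primary_assembly": False},
    {"gene_id": 3, "ensembl_gene_id": "TOY0003", "primary_assembly": True},
]
toy_alt_allele = [
    {"gene_id": 1, "alt_allele_group_id": 10},
    {"gene_id": 2, "alt_allele_group_id": 10},
]
alt_allele_reps = representatives_from_alt_allele(toy_gene_rows, toy_alt_allele)
```

Any pair of alternative alleles missing from `toy_alt_allele` would end up as two separate representatives, which is the gap this issue describes.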