Warnings on errors of protein modification and allele descriptions #2085

Open
manulera opened this issue Aug 10, 2023 · 12 comments

@manulera
Contributor

Hi @kimrutherford,

One thing that would be nice is to show warnings for alleles and modifications where we know that the sequence is wrong, so that people are aware of it when they see them on the website.

To do that, you can use the files in this directory: https://github.com/pombase/allele_qc/tree/master/results

  • *_cannot_fix_sequence_errors.tsv: all these have correct descriptions or descriptions that can be auto-fixed, but the sequence positions they indicate are wrong, so they all should be flagged with a warning of wrong sequence.

  • *_cannot_fix_other_errors.tsv: these are the ones that do not follow the patterns, and therefore cannot be checked. I will hopefully chip away at most of those. Then there are the CTD ones, which for now are not supported by the pipeline. These could be flagged as "Not checked"; they may be correct, but they do not follow our guidelines.
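To make the two categories above concrete, here is a minimal sketch of how the file-name suffix could be turned into a warning label per allele. Everything except the two suffixes is an assumption: the `allele_` file-name prefix, the column names `systematic_id` and `allele_name`, and the wording of the labels are placeholders, not what the pipeline or the website code actually uses.

```python
import csv
import io

# Hypothetical mapping from file-name suffix to the warning shown on the website.
WARNING_BY_SUFFIX = {
    "_cannot_fix_sequence_errors.tsv": "Wrong sequence position",
    "_cannot_fix_other_errors.tsv": "Not checked",
}


def warning_for_file(filename):
    """Return the warning label for a results file, or None if it is not a cannot_fix file."""
    for suffix, label in WARNING_BY_SUFFIX.items():
        if filename.endswith(suffix):
            return label
    return None


def collect_warnings(filename, handle, id_column="systematic_id", name_column="allele_name"):
    """Map (systematic_id, allele_name) to a warning label for every row of one TSV file."""
    label = warning_for_file(filename)
    if label is None:
        return {}
    reader = csv.DictReader(handle, delimiter="\t")
    return {(row[id_column], row[name_column]): label for row in reader}


# Toy input standing in for a real results file (the column names are assumed).
example = io.StringIO(
    "systematic_id\tallele_name\tallele_description\n"
    "SPAC644.14c\tloh4-1\tE344K\n"
)
warnings = collect_warnings("allele_cannot_fix_sequence_errors.tsv", example)
print(warnings)
```

The same function could be run over every file in the results directory and the dictionaries merged, so the website only has to look up one table.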

I also have this file, in which I keep notes about alleles that I tried to fix (I went to the publication) but did not manage to. This is mostly so that I do not try to fix them again, but the comments sometimes say what I think they may be.

@manulera
Contributor Author

manulera commented Aug 10, 2023

Another thing, which may have been mentioned in another issue, is that residues in histones are often indexed without counting the first methionine, both when referring to a modification and to an allele.

Here are some fixed examples:

systematic_id 	allele_description 	allele_name   	allele_type                                                               	change_description_to 	change_name_to             	change_type_to 	auto_fix_comment 	sequence_error 	solution_index 	allele_parts      	rules_applied                                                                                                           	reference
SPAC1834.04   	K9R                	hht1-K9R      	amino_acid_mutation                                                       	K10R                  	                           	               	histone_fix      	K9             	               	K9R               	amino_acid_mutation:single_aa                                                                                           	PMID:27738016
SPAC1834.04   	S10D               	hht1-S10D     	amino_acid_mutation                                                       	S11D                  	                           	               	histone_fix      	S10            	               	S10D              	amino_acid_mutation:single_aa                                                                                           	PMID:27648579
SPAC1834.04   	T3A                	hht1-T3A      	amino_acid_mutation                                                       	T4A                   	                           	               	histone_fix      	T3             	               	T3A               	amino_acid_mutation:single_aa                                                                                           	PMID:20929775
SPAC19G12.06c 	R18A               	hta2-R18A     	amino_acid_mutation                                                       	R19A                  	                           	               	histone_fix      	R18            	               	R18A              	amino_acid_mutation:single_aa                                                                                           	PMID:21633354

I am not sure where the warning makes the most sense in this case, but we should flag this on the website when people are looking at modifications / alleles of histones.

These are the genes that I am counting as histones in the pipeline, by the way:

histones = ['SPBC1105.11c', 'SPBC1105.12', 'SPAC1834.03c', 'SPAC1834.04', 'SPAC19G12.06c', 'SPBC8D2.03c', 'SPBC8D2.04', 'SPCC622.08c', 'SPCC622.09', 'SPBC11B10.10c', 'SPBC1105.17']
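For illustration, the `histone_fix` shown in the table above amounts to shifting the position by one to account for the first methionine. The sketch below is not the pipeline's code: the function name and the regular expression are assumptions, and it only handles single amino acid substitutions such as `K9R`.

```python
import re

# Genes treated as histones (copied from the list above).
histones = ['SPBC1105.11c', 'SPBC1105.12', 'SPAC1834.03c', 'SPAC1834.04',
            'SPAC19G12.06c', 'SPBC8D2.03c', 'SPBC8D2.04', 'SPCC622.08c',
            'SPCC622.09', 'SPBC11B10.10c', 'SPBC1105.17']

# Matches a single amino acid substitution such as "K9R".
SINGLE_AA = re.compile(r"([A-Z])(\d+)([A-Z])")


def histone_fix(systematic_id, description):
    """Shift the residue index by +1 for histone genes; leave everything else unchanged."""
    if systematic_id not in histones:
        return description
    match = SINGLE_AA.fullmatch(description)
    if match is None:
        # Not a single substitution: this sketch does not attempt to fix it.
        return description
    ref, position, alt = match.groups()
    return f"{ref}{int(position) + 1}{alt}"


print(histone_fix("SPAC1834.04", "K9R"))
print(histone_fix("SPAC19G12.06c", "R18A"))
```

A real fix would also need to confirm that the shifted position actually carries the expected residue before renaming anything.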

Related to pombase/allele_qc#15

@ValWood
Member

ValWood commented Nov 28, 2023

So, the histone part of this is dealt with by displaying hht3-K56R(K57R aa)
@manulera agreed?

Is the first part
*_cannot_fix_sequence_errors.tsv: all these have correct descriptions or descriptions that can be auto-fixed, but the sequence positions they indicate are wrong, so they all should be flagged with a warning of wrong sequence.

dealt with by the old coordinates in the synonyms field? Or is this referring to something else? I don't fully understand what warnings are required.

@manulera
Contributor Author

the histone part of this is dealt with by displaying

Yes

_cannot_fix_sequence_errors

These are the ones that cannot be auto-fixed by the pipeline. Their syntax is correct, in that they follow the pattern for representing the variant, but the residues they mention do not match the residues at those positions in the sequence. That's why it would be good to mark them as referring to wrong residues. The only way to fix these would be to write to the authors or go back to the publication.
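As a rough sketch of what "do not match the position in the sequence" means: the description names a residue and a 1-based position, and the check is whether the protein sequence has that residue there. The function name and the toy sequence below are made up for the example (the sequence is not taken from PomBase), and only single substitutions are handled.

```python
import re


def residue_matches(sequence, description):
    """Return True if the reference residue in e.g. "T4A" is found at that 1-based position."""
    match = re.fullmatch(r"([A-Z])(\d+)([A-Z])", description)
    if match is None:
        raise ValueError(f"not a single amino acid substitution: {description!r}")
    ref, position = match.group(1), int(match.group(2))
    # Positions beyond the end of the protein can never match.
    return 1 <= position <= len(sequence) and sequence[position - 1] == ref


toy_sequence = "MARTKQTA"  # illustrative only
print(residue_matches(toy_sequence, "T4A"))
print(residue_matches(toy_sequence, "T3A"))
```

Descriptions that pass the pattern but fail this check are the ones that would end up in the sequence-errors file.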

@ValWood
Member

ValWood commented Nov 29, 2023

OK I will take this over to the curation tracker and we will work through them
https://github.com/pombase/allele_qc/blob/master/manual_fixes_pombase/cannot_find.tsv

@ValWood
Member

ValWood commented Nov 29, 2023

Keeping open for the warning. I guess we can put the warning on the allele/genotype page.
@kimrutherford does this still need discussion?

@kimrutherford
Member

I guess we can put the warning on the allele/genotype page.
@kimrutherford does this still need discussion?

If we are just putting the warning on the allele page, I can implement something and then we can tweak it (since the allele pages aren't live yet).

If we want it on the genotype pages we'll need to decide how/where to display the warning. Especially if the genotype is multi-locus.

@ValWood
Member

ValWood commented Feb 12, 2024

Do we need this ticket?

For alleles, if the numbering has changed, the old description should be in the synonym field (@manulera, is this the case?).

For modifications we have this ticket:
#2121

@ValWood
Member

ValWood commented Feb 13, 2024

There are a few hundred modifications to fix in total from 2 lists.
These are from 89 genes.

We will hopefully fix most of these over time (soonish). To spearhead this, we will try to extract the associated publications and post the list to pombelist with
a) a warning that they are incorrect and
b) a call to authors to send the correct residue

@ValWood
Member

ValWood commented Jul 2, 2024

What is still to do?
Do we have up to date lists of the alleles and modifications that are incorrect?

@kimrutherford
Member

I haven't added the warnings yet.

I'll need to change the code for generating the website to process the cannot_fix files so the information can be included on the allele pages.

Alternatively we might want to add the information to the alleles in Chado, then change the website code to use that.

I don't know which plan is best.

@kimrutherford
Member

Do we have up to date lists of the alleles and modifications that are incorrect?

Yep, that's what's in the cannot_fix files:

To do that, you can use the files in this directory: https://github.com/pombase/allele_qc/tree/master/results

  • *_cannot_fix_sequence_errors.tsv: all these have correct descriptions or descriptions that can be auto-fixed, but the sequence positions they indicate are wrong, so they all should be flagged with a warning of wrong sequence.

  • *_cannot_fix_other_errors.tsv: these are the ones that do not follow the patterns, and therefore cannot be checked. I will hopefully chip away at most of those. Then there are the CTD ones, which for now are not supported by the pipeline. These could be flagged as "Not checked"; they may be correct, but they do not follow our guidelines.

@ValWood
Member

ValWood commented Jul 2, 2024

Actually there are only 2

Alleles cannot fix, sequence errors!
SPAC19E9.02 | SPAC19E9.02:allele-7 | fin1-KD | K33R,N165A | |N165
SPAC644.14c | SPAC644.14c:allele-4 | loh4-1 | E344K | E344
I will chase these up and move to "other" if I get no response

I double-checked that these are as reported in the publications, and then moved them to type "other" with a note that the reported residue is incorrect.

The other file *_cannot_fix_other_errors.tsv
"these are the ones that do not follow the patterns, and therefore cannot be checked. I will hopefully chip away at most of those. Then there are the CTD ones, which for now are not supported by the pipeline. These could be flagged as "Not checked"; they may be correct, but they do not follow our guidelines."

is mainly disruptions.
After removing disruptions there are 59 remaining. I will chip away at these.
