Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make test data conform to v1.0 API #110

Closed
wants to merge 1 commit into from
Closed

Conversation

buske
Copy link
Member

@buske buske commented May 12, 2015

Summary:

@buske buske added the Testing label May 12, 2015
@buske buske added this to the Before v1.1 milestone May 12, 2015
@Relequestual
Copy link
Member

Hey Orion, may have found some more problems.

https://github.com/MatchmakerExchange/mme-apis/blob/hotfix/v1.0b/testing/benchmark_patients.json#L1092-L1105

No position data given. type is set to SO:0001818 which is "protein_altering_variant" not "MULTIEXON DELETION".

https://github.com/MatchmakerExchange/mme-apis/blob/hotfix/v1.0b/testing/benchmark_patients.json#L1165-L1178

No position data given.

@Relequestual
Copy link
Member

Also, I think the labels should probably be the SO term given name. Where did you get your labels?

@Relequestual
Copy link
Member

In addition, for a single nucleotide change, the start and stop position must be the same, not one different.

For example P0000079.

variant data currently in the proposed update is...
"variant": { "alternateBases": "A", "assembly": "GRCh37", "end": 42929131, "referenceBases": "G", "referenceName": "17", "start": 42929130 },

Putting this data in VEP, produces a consequence of frameshift_variant, however looking at the response data (if using the API), it is seeing the ref Base as TG, not just G. Changing the start to also match the end (as the position data is inclusive), results in a different consequence.

It sees the position of the variant to be at 30, not 31, as you can see in the location

http://grch37.ensembl.org/Homo_sapiens/Tools/VEP/Results?db=core;tl=hrD3JGXj3hIuGNHv-858959

If the start and end match (31), the resulting VEP consequence is correct, stop_gain, as is in the data.

http://grch37.ensembl.org/Homo_sapiens/Tools/VEP/Results?db=core;tl=1dPbkyrwAugoDgCm-858973

I will hopefully make this change today.
I'm in the process of validating the vairant type / consequences. These being correct are paramount for our genomic matching algorithm.

@buske
Copy link
Member Author

buske commented May 28, 2015

Hi Ben,
Thank you very, very much for your careful validation and finding these issues. Yes, I removed the position data because the exact endpoints were not known, but I should not have included the variant entry at all, that was a mistake. I'll double-check each of them with the underlying data.

I took the effect labels from the nearest jannovar annotation class. I didn't find an SO term that was more specific for multi-exon deletion. Did I miss one? I agree, the SO label is more appropriate, though as long as it's human-readable, I don't mind too much what is stored there.

@Relequestual
Copy link
Member

I'm running this through a script to import it to our database. It's taking more work than expected because of the required fields including an ensembl transcript, which in order to get from their API, the ref and alt allels must be correct.

I figured it can't hurt to validate the variants annotations. Our matching system uses the vairnat consequence as a key component to genomic matching.

Fraid I don't know about a specific term for multi-exon deletion. A lot of variants in Decipher are large, and the start / end isn't know. We don't annotate those with a specific consequence, just mark the loss or gain.

@Relequestual
Copy link
Member

After updating the position values, two variants have alter consequences.

P0001058
CTTTTC/C CTTTTC/C
annotated as SPLICING, is now FRAMESHIFT

P0001027
G/T G/T
annotated as INTRONIC, is now 3'UTR

I'll update the file accordingly.

@buske
Copy link
Member Author

buske commented Jun 1, 2015

Hey Ben,

Had a minute to take a closer look. Sorry for not having a chance sooner. A few comments:

  • For single-base mutations, the start and stop should not be the same. As per our spec, start is 0-indexed inclusive, stop is exclusive (as in BED). So, the number of bases changed is equal to stop - start. Make sure that, if your system is used to supporting VCF-style bases, you increment the start by 1 when you perform uploading.
  • For P0001058, I believe the annotation was correct originally. Looking at the region in the genome browser shows it falling in an intron, at least for me.
  • For P0001027, it is both intronic and in the 3'UTR, so I'm fine with changing it, but I don't think it was, strictly speaking, incorrect.

@Relequestual
Copy link
Member

Hey Orion. I completly missed that in our spec. I'll revert the commit and make a note to update the Decipher code to take that into account!

Regarding the consequence changes, I wouldn't know how to look at this how you might. I put the data into Ensembl VEP, and that is the data that comes out.

@Relequestual
Copy link
Member

Hey Orion. I think I've figured out how we have arrived at different consequences.

We don't specify the transcript used, so I took the most severe as defined by Ensembl VEP.
For example

P0001058
CTTTTC/C CTTTTC/C

You took the canonical transcript http://grch37.ensembl.org/Homo_sapiens/Transcript/Summary?g=ENSG00000167468;r=19:1103936-1106778;t=ENST00000354171

I took the transcript with the most severe consequence
http://grch37.ensembl.org/Homo_sapiens/Transcript/Summary?g=ENSG00000167468;r=19:1104067-1106442;t=ENST00000593032

From my understanding, framshift is considered "more interesting" than a splice variant.

What do you think is best to do here?

@Relequestual
Copy link
Member

@buske bump! =]

@buske
Copy link
Member Author

buske commented Aug 4, 2015

Hmm, I agree that if we're specifying the effect, we should really specify the transcript as well. The splicing effect was taken from the paper. In all variant effect tools I'm aware of, frameshift variants do indeed have higher priority than splicing ones.

@buske
Copy link
Member Author

buske commented Aug 18, 2015

Moved discussion of transcript-specific effects to #120

@Relequestual
Copy link
Member

This issue / hotfix is becoming a bit stale. Maybe the ensembl transcript can be added as a "_enstrans" on the genomic feature, just for the sample / test data for now?

Issue #109: added zygosity

Issue #105: removed non-spec fields from test data
@buske
Copy link
Member Author

buske commented Nov 26, 2016

I've rebased this onto master and collapsed all the various typo/fix/revert commits. Please review. This will finally merge the test data we've been using anyway into master.

@Relequestual
Copy link
Member

I've eyeballed the changes and they look OK to me.
I can't re-run it through my previous import script without quite a bit of effort, as the script is no longer valid against our schemas.
I'm happy for this to be merged.

On an aside, I know we haven't kept strictly to gitflow, and that's OK, as long as the readme states that master may not be the same as the latest release. If we DO want master to always be the latest release, then we need to adheer to gitflow, and work on a dev branch till a release is ready.

@buske buske changed the base branch from master to release/v1.0 November 30, 2016 06:43
@buske
Copy link
Member Author

buske commented Nov 30, 2016

@Relequestual I've changed the base branch to be release/v1.0 to follow a more git flow-like approach. We can keep making bugfixes and non-substantive changes (like fixing the test data) for the v1.0 version of the API on this branch, as per git flow. Sound good?

@buske buske removed the Testing label Nov 30, 2016
@Relequestual
Copy link
Member

No... because 1.0 is released and tagged, that can't be changed.

What do you mean by the "base" branch?

@buske
Copy link
Member Author

buske commented Dec 1, 2016

Hmm, even after we tag an API release (e.g. 1.0), we still might make bugfixes and documentation clarifications for that version of the API, since people will keep using it. In git flow, I believe this is represented by creating a release branch, which you maintain even after you tag the initial release for that version. In normal semver, those bugfix releases would get tagged as a patch (e.g. 1.0.1). In our case, the API itself isn't changing in those cases... it makes sense to stay as 1.0, but the documentation and supporting code might need to be updated over time for the v1.0 version of the API. We can tag new versions, or just maintain a release/v1.0 branch. This reasonable?

By base branch I meant the branch the PR was pulling into (it was master before).

@Relequestual
Copy link
Member

Relequestual commented Dec 1, 2016

That's not quite how gitflow works, but pretty close.

This time round, the release branch isn't quite right either, but it doesn't matter a great deal, and I was going to address it after we finalised the release.

Gitflow works in the following way:

  • Development work happens on develop.
  • You can use feature branches for more complex changes, but it's unlikely we will need that in this project.
  • When you've scoped a release, and all the functional / feature changes are considered complete, you create a release branch from develop.
  • The release branch must now only be comitted to with bug fixes, where bugs are found during the pre-release process. You can think of the release branch as a release candidate if you will.
  • Once everyone (or at least someone in our case) feels that the bugs and issues are dealt with, the release branch is merged into master and develop (with merege commits) and the new commit on master is tagged.
  • If new bugs are found after the release branch was merged into master, and the new tag created, then a hotfix branch is created. Finishing the hotfix triggers the same process as finishing a release branch.

I'm sure you've seen the image, but for reference: http://nvie.com/posts/a-successful-git-branching-model/

It CAN be somewhat unclear from the "documentation" of gitflow, however I frequently use SourceTree for performing gitflow actions. For example finishing a release will do the two merges and tagging the commit on master, all in one step.

Please do say if any of this leads to further questions. Having a play about with SourceTree can help getting to grips with gitflow. The gitflow commandline tools leave some elements to be desired, last time we looked.

@fschiettecatte
Copy link
Contributor

FWIW I have adopted gitflow for all my s/w development, 5 code bases, works great.

@buske
Copy link
Member Author

buske commented Dec 6, 2016

@Relequestual You're completely right, sorry, I was mistaken. It would help me to clarify the method we will take to update documentation and non-substantive changes for fixed/released APIs. I agree with git flow in principle, but find strictness can get in the way of making changes that need to be made. We released v1.0 of the API over a year ago, and are still clarifying things in the docs. When we release v1.1 of the API, I'm sure we'll be maintaining and develop the documentation over the following year. We can use hotfixes and patch releases to handle it, if you'd like. It feels like overkill to make a patch release whenever we fix a few typos, but I'm okay with it. I was proposing maintaining release branches as an alternative.

@Relequestual
Copy link
Member

No worries @buske.

it can be difficult to see the benefits without experiencing the problems that can happen first hand.

Consider: Someone makes a grematical change... and may as well fix this abiguity over a fied... "it's not a functional change" they believe, so it's OK. Later, someone else joins the project, and implements the API at the latest version. Suddenly, the API is not being interoperable because the first person made a functional change based on their understanding, and not that of the group.

It may sound unlikely, but I've just seen this exact thing happen in another project I'm working on. With open source projects, occasionally a few people have more time, and suddenly do a lot of work and rush ahead of everyone else. Sticking strickly to gitflow avoids potential problems by forcing us to protect against them.

@buske
Copy link
Member Author

buske commented Dec 7, 2016

Fair points. Okay, so let's figure out how we proceed.

Where we are:

  • 22 commits to master since v1.0a, mostly KEYS and disclaimer changes
  • some clarifications sitting in master (including "_"-prefixing which we might want to retract)
  • these v1.0 test data to merge in somewhere
  • changes on develop that need to be put off until a 2.0 release
  • current work towards the backwards-compatible v1.1 API changes on release/v1.1

Where I think we should be:

  • we use master and develop in the way gitflow intended, where develop is only for backwards-compatible changes for the current major version, and master tracks the last stable release
  • all changes to KEYS and disclaimers get made to develop, rather than master
  • we add a develop-v2 branch for parallel development on the next major version
  • when we release v2, we add a stable-v1 branch for continued support of v1.X
  • we use semver tags for each release (v1.1.0) and patch (v1.1.1)
  • if we need to make a clarifying change for the current version (master), we use a hotfix branch a la gitflow

To clean up the current state of things, we'd then:

  • pull this PRs changes and the changes we've made on master (sans the "_"-prefixing) into one combined hotfix-v1.0b branch (I'll redo a PR)
  • reset master to v1.0a
  • pull the backwards-incompatible changes from develop into a develop-v2 branch
  • remove develop and rename release/v1.1 to develop

This okay with everyone?

@Relequestual
Copy link
Member

Relequestual commented Dec 7, 2016

I mostly agree, however suggest a few small alterations:
... and I need to come back and write them later today because of deadlines. =[

develop is where the work happens for the upcoming major release. Once the upcoming major release becomes the current release, create a develop-v[major-release-number] branch. develop then continues to be the branch the work for the upcoming major release, and backwards compatible changes happen on develop-v[previous-major-release].

Therefore, we currently need to create a develop-v1 branch for continued support of backwards compatible v1 changes. This way we don't need any stable-v[num] branches.

I feel that makes things a little easier to manage, unless I've missed some of you intensions?
Thoughts?

@buske
Copy link
Member Author

buske commented Dec 14, 2016

👍 That works for me! You're right, since we really shouldn't need branches for major releases beyond the upcoming one. :)

@buske buske closed this Jan 3, 2017
@buske buske deleted the hotfix/v1.0b branch October 20, 2017 04:11
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants