Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DC-1238: Add dummy genomic data to test OMOP datasets #1824

Open
wants to merge 8 commits into
base: develop
Choose a base branch
from

Conversation

pshapiro4broad
Copy link
Member

@pshapiro4broad pshapiro4broad commented Sep 30, 2024

Jira ticket: https://broadworkbench.atlassian.net/browse/DC-1238

Addresses

Update test OMOP data for integration / connected tests and for local dataset setup to include a sample table with dummy genomic data.

Summary of changes

  • add Sample table to schema and settings for test data
  • minor changes to setup_tdr_resources.py
  • rename a .jsonl file to .json

Testing Strategy

  • integration / connected tests
  • ran setup_tdr_resources.py locally

@pshapiro4broad pshapiro4broad marked this pull request as ready for review October 2, 2024 13:44
@pshapiro4broad pshapiro4broad requested review from a team as code owners October 2, 2024 13:44
@pshapiro4broad pshapiro4broad requested review from rushtong and okotsopoulos and removed request for a team October 2, 2024 13:44
Copy link
Contributor

@s-rubenstein s-rubenstein left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks reasonable

Copy link

sonarcloud bot commented Oct 2, 2024

Copy link
Contributor

@fboulnois fboulnois left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See below:

@@ -0,0 +1,2 @@
{"collaborator_participant_id": "77", "collaborator_sample_id": "SM-77", "contamination_rate": 0.001, "exome_gvcf_index_path": "gs://fc-00000000-0000-0000-0000-000000000000/sample.wes.hard-filtered.gvcf.gz.tbi", "exome_gvcf_md5_path": "gs://fc-00000000-0000-0000-0000-000000000000/sample.wes.hard-filtered.gvcf.gz.md5", "exome_gvcf_path": "gs://fc-00000000-0000-0000-0000-000000000000/sample.wes.hard-filtered.gvcf.gz", "genome_crai_path": "gs://fc-00000000-0000-0000-0000-000000000000/sample.cram.crai", "genome_cram_md5_path": "gs://fc-00000000-0000-0000-0000-000000000000/sample.cram.md5", "genome_cram_path": "gs://fc-00000000-0000-0000-0000-000000000000/sample.cram", "mapped_percentage": 99.86, "mean_off_target_coverage": 3.46, "mean_target_coverage": 53.8, "percent_target_bases_at_10x": 98.36, "percent_wgs_bases_at_1x": 94.77, "reblocked_gvcf": "gs://fc-00000000-0000-0000-0000-000000000000/sample.rb.g.vcf.gz", "reblocked_gvcf_index": "gs://fc-00000000-0000-0000-0000-000000000000/sample.rb.g.vcf.gz.tbi", "total_bases": 23772304683}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this looks like a jsonl?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, all the data tables are jsonl. I found a number of changes like this that would be an improvement, but decided to limit the work done in this PR, I'll make another tech debt ticket to further clean up this code

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The file extension is json, should it be jsonl?

Copy link
Contributor

@rushtong rushtong left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks reasonable to me. My nagging concern here would be the tight coupling between the test classes (integration, service, and DAO tests) to the single dataset file which is how these tests were originally written.

@pshapiro4broad
Copy link
Member Author

@rushtong

My nagging concern here would be the tight coupling between the test classes (integration, service, and DAO tests) to the single dataset file which is how these tests were originally written.

Is your concern about having multiple tests depend on the same files, or having tests depend on files outside the test itself, or something else?

For tests that require a lot of assets to set up, I think it makes sense to use external data files like this, rather than embedding the json in the test code, or adding objects that define the models for the json data. But I agree that this separation means that you have to be extra careful when updating the data files.

Copy link
Contributor

@rjohanek rjohanek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this looks good! are the datasets in dev/prod going to be updated as a part of this work? I'm excited to see the new table in action

@rushtong
Copy link
Contributor

rushtong commented Oct 4, 2024

@pshapiro4broad

Is your concern about having multiple tests depend on the same files, or having tests depend on files outside the test itself, or something else?

For tests that require a lot of assets to set up, I think it makes sense to use external data files like this, rather than embedding the json in the test code, or adding objects that define the models for the json data. But I agree that this separation means that you have to be extra careful when updating the data files.

My concern is that each of the different test classes are now dependent on the same underlying data such that any change to the underlying data affects multiple tests in potentially different ways - essentially the last point that you make.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants