DC-1238: Add dummy genomic data to test OMOP datasets #1824

pshapiro4broad · 2024-09-30T15:17:12Z

Jira ticket: https://broadworkbench.atlassian.net/browse/DC-1238

Addresses

Update test OMOP data for integration / connected tests and for local dataset setup to include a sample table with dummy genomic data.

Summary of changes

add Sample table to schema and settings for test data
minor changes to setup_tdr_resources.py
rename a .jsonl file to .json

Testing Strategy

integration / connected tests
ran setup_tdr_resources.py locally

s-rubenstein

Looks reasonable

sonarcloud · 2024-10-02T17:56:34Z

Quality Gate passed

Issues
0 New issues
0 Accepted issues

Measures
0 Security Hotspots
0.0% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarCloud

fboulnois

See below:

tools/setupResourceScripts/setup_tdr_resources.py

fboulnois · 2024-10-03T11:57:14Z

tools/setupResourceScripts/files/OMOPDataset/sample.json

@@ -0,0 +1,2 @@
+{"collaborator_participant_id": "77", "collaborator_sample_id": "SM-77", "contamination_rate": 0.001, "exome_gvcf_index_path": "gs://fc-00000000-0000-0000-0000-000000000000/sample.wes.hard-filtered.gvcf.gz.tbi", "exome_gvcf_md5_path": "gs://fc-00000000-0000-0000-0000-000000000000/sample.wes.hard-filtered.gvcf.gz.md5", "exome_gvcf_path": "gs://fc-00000000-0000-0000-0000-000000000000/sample.wes.hard-filtered.gvcf.gz", "genome_crai_path": "gs://fc-00000000-0000-0000-0000-000000000000/sample.cram.crai", "genome_cram_md5_path": "gs://fc-00000000-0000-0000-0000-000000000000/sample.cram.md5", "genome_cram_path": "gs://fc-00000000-0000-0000-0000-000000000000/sample.cram", "mapped_percentage": 99.86, "mean_off_target_coverage": 3.46, "mean_target_coverage": 53.8, "percent_target_bases_at_10x": 98.36, "percent_wgs_bases_at_1x": 94.77, "reblocked_gvcf": "gs://fc-00000000-0000-0000-0000-000000000000/sample.rb.g.vcf.gz", "reblocked_gvcf_index": "gs://fc-00000000-0000-0000-0000-000000000000/sample.rb.g.vcf.gz.tbi", "total_bases": 23772304683}


this looks like a jsonl?

pshapiro4broad · 2024-10-04

Yes, all the data tables are jsonl. I found a number of changes like this that would be an improvement, but decided to limit the work done in this PR. I'll make another tech debt ticket to further clean up this code.

fboulnois · 2024-10-04

The file extension is json; should it be jsonl?
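
For context on the json / jsonl question above: a JSON Lines file holds one JSON object per line, so parsing the whole file as a single JSON document fails as soon as it contains more than one record. The following is a minimal sketch using only Python's standard `json` module. The helper name `read_jsonl` and the second record (`SM-78`) are made up for illustration and are not taken from setup_tdr_resources.py or the PR's data files.

```python
import io
import json


def read_jsonl(stream):
    """Parse a JSON Lines stream: one JSON object per non-blank line."""
    return [json.loads(line) for line in stream if line.strip()]


# Two illustrative records shaped like rows of sample.json (fields abbreviated).
# The second record is invented purely to show the multi-line case.
text = (
    '{"collaborator_sample_id": "SM-77", "total_bases": 23772304683}\n'
    '{"collaborator_sample_id": "SM-78", "total_bases": 1}\n'
)

records = read_jsonl(io.StringIO(text))

# Parsing the same text as one JSON document raises JSONDecodeError
# because of the extra data after the first object.
try:
    json.loads(text)
    parses_as_single_json = True
except json.JSONDecodeError:
    parses_as_single_json = False
```

A file with exactly one record parses either way, which may be why the extension mismatch is easy to miss.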

rushtong

Looks reasonable to me. My nagging concern here would be the tight coupling between the test classes (integration, service, and DAO tests) to the single dataset file which is how these tests were originally written.

pshapiro4broad · 2024-10-04T13:34:12Z

@rushtong

> My nagging concern here would be the tight coupling between the test classes (integration, service, and DAO tests) to the single dataset file which is how these tests were originally written.

Is your concern about having multiple tests depend on the same files, or having tests depend on files outside the test itself, or something else?

For tests that require a lot of assets to set up, I think it makes sense to use external data files like this, rather than embedding the json in the test code, or adding objects that define the models for the json data. But I agree that this separation means that you have to be extra careful when updating the data files.

rjohanek

this looks good! are the datasets in dev/prod going to be updated as a part of this work? I'm excited to see the new table in action

rushtong · 2024-10-04T14:34:39Z

@pshapiro4broad

> Is your concern about having multiple tests depend on the same files, or having tests depend on files outside the test itself, or something else?
>
> For tests that require a lot of assets to set up, I think it makes sense to use external data files like this, rather than embedding the json in the test code, or adding objects that define the models for the json data. But I agree that this separation means that you have to be extra careful when updating the data files.

My concern is that the different test classes all now depend on the same underlying data, so any change to that data affects multiple tests in potentially different ways. That is essentially the last point you make.
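
One possible way to make that coupling more visible is a small guard that checks the shared data for the fields tests rely on, so a breaking change to the file fails in one obvious place instead of in several unrelated tests. This is only a sketch under assumptions: `REQUIRED_FIELDS`, `missing_fields`, and `check_records` are hypothetical names, the choice of required fields is a guess for illustration, and nothing here reflects how the repository's tests are actually written.

```python
import json

# Hypothetical: fields that tests are assumed to rely on.
REQUIRED_FIELDS = {"collaborator_sample_id", "genome_cram_path", "total_bases"}


def missing_fields(record, required=REQUIRED_FIELDS):
    """Return the required field names absent from one record, sorted."""
    return sorted(required - set(record))


def check_records(lines):
    """Map 1-based line number to missing fields, for incomplete records only."""
    problems = {}
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # tolerate blank lines in the data file
        missing = missing_fields(json.loads(line))
        if missing:
            problems[number] = missing
    return problems


# A complete record (abbreviated from sample.json) and a deliberately broken one.
good = (
    '{"collaborator_sample_id": "SM-77", '
    '"genome_cram_path": "gs://fc-00000000-0000-0000-0000-000000000000/sample.cram", '
    '"total_bases": 23772304683}'
)
bad = '{"collaborator_sample_id": "SM-77"}'

problems = check_records([good, bad])
```

In this sketch `problems` reports only the second line, listing the two fields it lacks. The trade-off is that the required-field list is one more thing to keep in sync with the data files.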

pshapiro4broad added 7 commits September 30, 2024 10:01

rename it-dataset-omop.jsonl since file is json, not jsonl

d66c039

add sample table to omop schema and settings for test datasets

9ba50c4

add sample table to omop schema and settings for test datasets

406c082

add sample to local setup resources; change fileref to string for mock data

4c96d1f

add mock sample data

52d3699

many misc fixes

e8c190b

Merge branch 'develop' into ps/dc-1238-sample-genomic-data

906ffe0

pshapiro4broad marked this pull request as ready for review October 2, 2024 13:44

pshapiro4broad requested review from a team as code owners October 2, 2024 13:44

pshapiro4broad requested review from rushtong and okotsopoulos and removed request for a team October 2, 2024 13:44

s-rubenstein approved these changes Oct 2, 2024

add sample to azure dataset

808914c

fboulnois reviewed Oct 3, 2024

rushtong approved these changes Oct 3, 2024

rjohanek approved these changes Oct 4, 2024

