DPL-048 fix root sample ids #528
Conversation
Codecov Report
@@ Coverage Diff @@
## develop #528 +/- ##
===========================================
+ Coverage 92.33% 92.55% +0.21%
===========================================
Files 98 106 +8
Lines 3248 3397 +149
Branches 330 343 +13
===========================================
+ Hits 2999 3144 +145
- Misses 203 206 +3
- Partials 46 47 +1
Looks good to me. Obviously this isn't actually going to be merged, and if it works correctly that's good enough.
update_query = { "Root Sample ID": original_root_sample_id }
new_value = { "$set": { "Root Sample ID": root_sample_id } }
table.update_one(update_query, new_value)
I think this should be update_many, because we want to update all entries that share the same root_sample_id.
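A minimal sketch of the suggested change, assuming a pymongo collection named table and the field names from the snippet above. The build_update helper and the example ID values are hypothetical, used only to show the filter/update documents in isolation:

```python
def build_update(original_root_sample_id, root_sample_id):
    """Build the filter and update documents for fixing one root sample ID."""
    update_query = {"Root Sample ID": original_root_sample_id}
    new_value = {"$set": {"Root Sample ID": root_sample_id}}
    return update_query, new_value

# update_many applies the $set to every matching document, whereas
# update_one would fix only the first match:
#   query, update = build_update("R00T-1", "ROOT-1")
#   table.update_many(query, update)
```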
@@ -0,0 +1,15 @@
# get the root_sample_ids, fix them, write these to a CSV to be used with the 'write_data' script (which inserts the fixed IDs into the DBs)
I would group all files related to the same data fix together, so we can add other data fixes in future while keeping things separate. What do you think about creating a subfolder dpl-048 and moving all the files in there?
Also, for future reference, it would be very useful to have a README.md file in that subfolder describing this data fix and how to run it.
Hopefully, yes, so that these scripts can be reused in future.
Just a brief question. I didn't check that all the logic makes sense, but it looks like you know what you're trying to do; I don't know the background of what needs to happen.
if __name__ == "__main__":
    save_data()
    parser = argparse.ArgumentParser()
    parser.add_argument("--input_file", required=False)
required=False? I think you later rely on there being a value, but I might be wrong.
If you don't give an input_file, it will return None when you ask for it on line 30. That then gets passed into the method, which checks whether it exists (i.e. whether or not it is None): if it isn't None, it reads the file; otherwise it goes to the DB to get the data. This line was really only for testing, so that I could check the script would fix the data correctly and save to CSV when given some dummy data. In reality it should go to the DB to get the data, because that's the data we're trying to fix.
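The behaviour described above can be sketched as follows. The parse_args wrapper and get_data are hypothetical stand-ins for the script's actual functions, showing only that argparse stores None for an omitted optional flag and that the code branches on it:

```python
import argparse


def parse_args(argv=None):
    # --input_file is optional; argparse stores None when the flag is omitted
    parser = argparse.ArgumentParser()
    parser.add_argument("--input_file", required=False)
    return parser.parse_args(argv)


def get_data(input_file):
    # Hypothetical sketch of the described branch: read the CSV if a file
    # was given, otherwise fall back to querying the database.
    if input_file is not None:
        return f"rows read from {input_file}"
    return "rows fetched from the DB"
```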
Ahhh I see. That makes sense! Good stuff.
Closed as not needed
Fixes sanger/crawler#502 for MLWH lighthouse_sample table and MongoDB.
Includes test helper files.
Not included: test data (it contains real IDs) in a folder called 'test-data', and connection variables in a file called 'constants.py'.
The workflow is:
Get data from MLWH -> fix it -> save the original IDs and their corresponding fixed versions in a CSV file -> loop through the CSV to insert the corrected data into the DB of choice.
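The CSV mapping step of that workflow can be sketched with the standard-library csv module. The column headers and the save_mapping/load_mapping names are assumptions for illustration; the actual fix logic and DB access live elsewhere in the scripts:

```python
import csv
import io


def save_mapping(pairs, fh):
    # Write (original_id, fixed_id) pairs to a CSV with a header row;
    # the pairs are assumed to come from the fix step.
    writer = csv.writer(fh)
    writer.writerow(["Root Sample ID", "Fixed Root Sample ID"])
    writer.writerows(pairs)


def load_mapping(fh):
    # Read the mapping back, ready to loop over and apply to the DB.
    reader = csv.DictReader(fh)
    return [(row["Root Sample ID"], row["Fixed Root Sample ID"]) for row in reader]


# Round-trip through an in-memory buffer instead of a real file:
buf = io.StringIO()
save_mapping([("R00T-1", "ROOT-1")], buf)
buf.seek(0)
mapping = load_mapping(buf)
```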