DPL-048 fix root sample ids #528
Closed
Changes from 2 commits
Commits (15)

Commit | Message | Author |
---|---|---|
034eeee | Actual data fix method | Jonnie-Bevan |
f62c503 | get the malformed IDs from MySQL | Jonnie-Bevan |
86a9de0 | fix the data and save it to CSV | Jonnie-Bevan |
fb602dd | write the fixed data to DB | Jonnie-Bevan |
77de455 | test locally (helper functions) | Jonnie-Bevan |
64bc2e2 | move to a new subdirectory | Jonnie-Bevan |
2560aa9 | Update .gitignore | Jonnie-Bevan |
e0c00e8 | Generalise the code | Jonnie-Bevan |
eb5bffa | add MYSQL_PORT to constants | Jonnie-Bevan |
935f3b6 | move files up one directory | Jonnie-Bevan |
36c1424 | Update data_writers.py | Jonnie-Bevan |
1940a50 | update name of data save file and combine in data getter | Jonnie-Bevan |
917cb8a | update name of data_writers | Jonnie-Bevan |
d527cd9 | user guide | Jonnie-Bevan |
52b920f | Update README.md | Jonnie-Bevan |
    @@ -1,15 +1,32 @@
    -# get the root_sample_ids, fix them, write these to a CSV to be used with the 'write_data' script (which inserts the fixed IDs into the DBs)
    +# get the data, fix it, write these to a CSV to be used with the 'write_data' script (which inserts the fixed data into the DBs)
    +import pandas as pd
    +import argparse

     from data_getters import get_data
     from data_helpers import remove_everything_after_first_underscore

    -def save_data():
    -    data = get_data()
    -    print("Editing the root_sample_ids...")
    -    data = data.rename(columns={"root_sample_id": "original_root_sample_id"})
    -    data["root_sample_id"] = data["original_root_sample_id"].apply(remove_everything_after_first_underscore)
    -    print("Adding the root_sample_ids to a CSV file.")
    -    data.to_csv('data-fixes/test-data/root_sample_ids.csv', index=False)
    +from constants import (
    +    COLUMN_NAME,
    +    ORIGINAL_COLUMN_NAME
    +)
    +
    +def save_data(input_filename, output_filename):
    +    if input_filename:
    +        data = pd.read_csv(input_filename)
    +    else:
    +        data = get_data()
    +
    +    print("Editing the data...")
    +    data = data.rename(columns={COLUMN_NAME: ORIGINAL_COLUMN_NAME})
    +    data[COLUMN_NAME] = data[ORIGINAL_COLUMN_NAME].apply(remove_everything_after_first_underscore)
    +    print("Adding the data to a CSV file.")
    +    data.to_csv(output_filename, index=False)

     if __name__ == "__main__":
    -    save_data()
    +    parser = argparse.ArgumentParser()
    +    parser.add_argument("--input_file", required=False)
    +    parser.add_argument("--output_file", required=True)
    +    args = parser.parse_args()
    +    input_filename = vars(args)["input_file"]
    +    output_filename = vars(args)["output_file"]
    +    save_data(input_filename=input_filename, output_filename=output_filename)
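The helper `remove_everything_after_first_underscore` is imported from `data_helpers`, but its body is not part of this diff. A minimal sketch of what the helper and the rename-then-fix step might look like, assuming the helper simply truncates at the first underscore. Plain dicts stand in for the pandas DataFrame so the sketch has no dependencies; `fix_rows` is a hypothetical name, and the two column-name values are taken from the removed lines rather than from `constants.py`, which is not shown.

```python
# Assumed values: these match the literals in the removed lines of the diff,
# but the real constants.py is not shown.
COLUMN_NAME = "root_sample_id"
ORIGINAL_COLUMN_NAME = "original_root_sample_id"


def remove_everything_after_first_underscore(value: str) -> str:
    """Assumed behaviour: keep only the text before the first underscore."""
    return value.split("_", 1)[0]


def fix_rows(rows: list[dict]) -> list[dict]:
    """Hypothetical dict-based mirror of the rename-then-apply step in save_data."""
    fixed = []
    for row in rows:
        new_row = dict(row)
        # Keep the malformed value under the 'original' column name...
        new_row[ORIGINAL_COLUMN_NAME] = new_row.pop(COLUMN_NAME)
        # ...and store the cleaned value under the real column name.
        new_row[COLUMN_NAME] = remove_everything_after_first_underscore(
            new_row[ORIGINAL_COLUMN_NAME]
        )
        fixed.append(new_row)
    return fixed


# Made-up sample IDs for illustration only.
rows = [{"root_sample_id": "ABC123_extra_text"}, {"root_sample_id": "XYZ789"}]
fixed_rows = fix_rows(rows)
```

Keeping the original value alongside the fixed one means the output CSV carries enough information for a later write step to match rows on the malformed ID.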
    @@ -1,25 +1,29 @@
    -# get the root_sample_ids from MLWH - any root_sample_ids containing an underscore
    +# get the data from MLWH - use the SQL_MLWH_GET_MALFORMED_DATA constant in constants.py to give the SQL call for the data
     import sqlalchemy
     import pandas as pd

    -from constants import MYSQL_DB_CONN_STRING, MLWH_DB, SQL_MLWH_GET_MALFORMED_ROOT_IDS
    +from constants import (
    +    MYSQL_DB_CONN_STRING,
    +    MLWH_DB,
    +    SQL_MLWH_GET_MALFORMED_DATA
    +)

     def get_data() -> pd.DataFrame:
    -    print("Attempting to connect to DB.")
    +    print("Attempting to connect to MLWH.")
         try:
             sql_engine = sqlalchemy.create_engine(
                 f"mysql+pymysql://{MYSQL_DB_CONN_STRING}/{MLWH_DB}", pool_recycle=3600
             )
             db_connection = sql_engine.connect()
    -        print("Connected to the DB... getting data.")
    -        data = pd.read_sql(SQL_MLWH_GET_MALFORMED_ROOT_IDS, db_connection)
    +        print("Connected to MLWH... getting data.")
    +        data = pd.read_sql(SQL_MLWH_GET_MALFORMED_DATA, db_connection)
             print("Got the data.")
         except Exception as e:
    -        print("Error while connecting to MySQL")
    +        print("Error while connecting to MLWH.")
             print(e)
             return None
         finally:
             if db_connection is not None:
    -            print("Closing DB connection.")
    +            print("Closing MLWH connection.")
                 db_connection.close()
         return data
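One thing to watch in `get_data`: if `create_engine` or `connect` raises, `db_connection` is never assigned, so the `if db_connection is not None` check in `finally` would itself fail with an `UnboundLocalError`. A sketch of a safer shape, binding the name to `None` before the `try`. It uses the standard-library `sqlite3` module purely as a stand-in for the MLWH MySQL connection; `fetch_rows`, the query, and the paths are hypothetical.

```python
import sqlite3


def fetch_rows(database: str, query: str):
    """Run `query` and return all rows, or None if connecting or querying fails."""
    db_connection = None  # bound up front so `finally` can always test it
    try:
        db_connection = sqlite3.connect(database)
        return db_connection.execute(query).fetchall()
    except sqlite3.Error as e:
        print("Error while querying the database.")
        print(e)
        return None
    finally:
        # Runs on both the success and the failure path.
        if db_connection is not None:
            db_connection.close()


# Succeeds against an in-memory database.
ok = fetch_rows(":memory:", "SELECT 'ABC123_x' AS root_sample_id")
# Fails at connect time (the directory is assumed not to exist); returns None
# instead of raising from the finally block.
failed = fetch_rows("/nonexistent-dir/db.sqlite", "SELECT 1")
```

The same change should carry over to the SQLAlchemy version by adding `db_connection = None` above the `try`.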
`required=False`? I think you later rely on there being a value. Might be wrong.
If you don't give an `input_file`, it will be `None` when you ask for it on line 30. That value then gets passed into the method, which checks whether it was given: if it was, it reads the file; if not, it goes to the DB to get the data. The argument was only for testing really, so that I could check that the script would fix the data correctly and save it to CSV when given some dummy data. In reality it should go to the DB to get the data, because that's the data we're trying to fix.
Ahhh I see. That makes sense! Good stuff.
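The behaviour discussed in this thread can be checked in isolation. A small sketch, assuming only the two arguments shown in the diff; `parse_cli`, `choose_source`, and the file names are hypothetical and exist only for illustration.

```python
import argparse


def parse_cli(argv):
    """Build the same two-argument parser as the script and parse `argv`."""
    parser = argparse.ArgumentParser()
    # Optional: when omitted, argparse stores None under "input_file".
    parser.add_argument("--input_file", required=False)
    parser.add_argument("--output_file", required=True)
    return parser.parse_args(argv)


def choose_source(input_filename):
    """Mirror the `if input_filename:` branch in save_data."""
    return "csv" if input_filename else "database"


without_input = parse_cli(["--output_file", "fixed.csv"])
with_input = parse_cli(["--input_file", "dummy.csv", "--output_file", "fixed.csv"])
```

Here `without_input.input_file` is `None`, so `choose_source` falls through to the database branch, which is the intended production path. `args.input_file` and `vars(args)["input_file"]` read the same value.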