-
Notifications
You must be signed in to change notification settings - Fork 1
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add python CLI #14
Add python CLI #14
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@marissafujimoto This is looking good!
I don't have major comments just things here and there -- many of which can be done in new PRs so this doesn't get too much bigger.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would be good to have a README that explains about this example data. What it is and how it was subsetted. But this can also be in a future PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree. I will address that in a future pr.
# TODO check can write to this path? | ||
parser.add_argument("-o", "--output", required=False, | ||
help="Output file path to populate with the counts for each paired guide and sample. If not provided the counts will be output in STDOUT.") | ||
# TODO support arbitrary trim strategies |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would probably be good to add each of these TODOs as GitHub issues so we don't lose track of them. Also could be on Basecamp. Wherever you prefer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will do! I have a lot of TODOs to go back and see if they are worth prioritizing. Often these are where I didn't want to pause at that moment to make an issue, or small enough issues that I wanted to acknowledge a change could be made by future authors without necessarily committing to it the way an issue implies.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Totally makes sense! It's good move!
# TODO support arbitrary trim strategies | ||
parser.add_argument("--trim-strategy", required=True, choices=(TWO_READ_STRATEGY, THREE_READ_STRATEGY), | ||
help="The trim strategy used to extract guides and barcodes. The two read strategy should have fastqs R1 and I1. The three read strategy should have fastqs R1, I1, and I2") # TODO extract consts | ||
parser.add_argument("--gRNA2-error", required=False, default=2, type=_check_nonnegative, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think these last ones need a "help" description.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are lines 50 and 52 not it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh so sorry! I missed that somehow!
def write_counts(output_path: Optional[str], counts: Counter[tuple[str, str, str]], barcodes: dict[str, str], id_mappings: dict[str, str]): | ||
""" | ||
Write counts to an output file tsv with a header [id, seq_1, seq2, sample_id_1, ..., sample_id_n]. If no | ||
output_path is specified, write to STDOUT. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could alternatively just have a default file name its saved to in the working directory.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But then again people could >
. Hmm...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I felt pipe support was necessary based on the expectations set by other bioinformatics tools, though this function signature could be more agnostic of that fact and just take a file opened by the caller? Then again this is just a library function exposed for convenience, and I feel path is pretty convenient and conventional.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes path I think is definitely a good call. I was just thinking about the default for it.
counts (Counter[tuple[str, str, str]]): The counts for each (gRNA1, gRNA2, barcode). | ||
id_mappings (dict[str, str]): A mapping from (gRNA1, gRNA2) to the paired guide id. | ||
""" | ||
# TODO delete file if it already exists? manually test to see what happens if you don't do this |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like to opt for an "overwrite" parameter. Basically by default overwrite = FALSE. But if it is TRUE then we can delete the existing file of the same name and write a new one.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe the convention for python libraries is to overwrite e.g. https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.savefig.html and https://numpy.org/doc/2.1/reference/generated/numpy.save.html. The expectation is the caller can check before calling this if they desire that behavior?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A lot of bioinformatics packages are used by people who don't have a lot of experience with command line tools and because we are working with huge data sometimes we want to decrease the probability that people will shoot themselves in the foot (aka accidentally overwrite a file they didn't mean to).
So I think having a default that spits out a warning that people can shut off if they agree feels better than having it automatically overwrite before people realize what they are doing.
Happy to discuss more!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm that is concerning. I think I have two users in mind. One is using pgMAP as a python library and I imagine they have the skills and expectations for this function signature. The second is the CLI user and I am concerned about them overwriting their files.
I wonder then if we should address that as part of the CLI (i.e. checking the output does not exist and erroring, and adding a flag to overwrite it if desired). Or do you still feel strongly that the library user should also have this warning?
Another concern is that taking a step back, we never really planned to support pgMAP as a python library, it more just was convenient enough to do it and it's been useful for some deep dive testing, but I wonder if it's something we say "use at your own risk". I will try to get answers about this prioritization of the library from the Berger lab folks.
"--barcode-error", "-1"]) | ||
|
||
def test_arg_parse_invalid_type_gRNA2_error(self): | ||
with self.assertRaises(argparse.ArgumentError): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Once we have more of a consensus on what the results should be (and I think we are close on that) I also like to include tests that make sure that the results are still the same and that the data is formatted because we want to make sure that not only are we not getting an error, but that the data being returned is still what we expect.
This can be added in a different PR though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense. Will do in a later PR.
|
||
for i, row in enumerate(tsv_reader): | ||
if i == 0: | ||
sample_ids = row[3:] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this always going to be index of 3 or might that change depending on the input data? I've lost track.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry this is sort of messy, basically since the header of the output is defined as "[id, seq_1, seq2, sample_id_1, ..., sample_id_n]", slicing from index 3 just gives the sample_ids. Hoping this is okay for a test helper function. I'll try to add a comment here for clarity (later PR)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah since its just for the test I think that's fine. Hard coded indices just always raise alarm bells for me but you're right that this is low stakes.
Description
Addresses #12
Type of change
How Has This Been Tested?
python3 -m pip install . && python3 -m tests
python3 src/pgmap/cli.py --fastq example-data/two-read-strategy/240123_VH01189_224_AAFGFNYM5/Undetermined_S0_R1_001_Sampled10k.fastq.gz example-data/two-read-strategy/240123_VH01189_224_AAFGFNYM5/Undetermined_S0_I1_001_Sampled10k.fastq.gz --library example-data/pgPEN-library/paralog_pgRNA_annotations.txt --barcodes example-data/two-read-strategy/240123_VH01189_224_AAFGFNYM5/barcode_ref_file_revcomp.txt --trim-strategy two-read
Checklist: