Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add python CLI #14

Merged
merged 10 commits into from
Jan 15, 2025
Merged

Add python CLI #14

merged 10 commits into from
Jan 15, 2025

Conversation

marissafujimoto
Copy link
Collaborator

Description

Addresses #12

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

How Has This Been Tested?

  • Unit tests. Run with python3 -m pip install . && python3 -m tests
  • Manual test. Run with python3 src/pgmap/cli.py --fastq example-data/two-read-strategy/240123_VH01189_224_AAFGFNYM5/Undetermined_S0_R1_001_Sampled10k.fastq.gz example-data/two-read-strategy/240123_VH01189_224_AAFGFNYM5/Undetermined_S0_I1_001_Sampled10k.fastq.gz --library example-data/pgPEN-library/paralog_pgRNA_annotations.txt --barcodes example-data/two-read-strategy/240123_VH01189_224_AAFGFNYM5/barcode_ref_file_revcomp.txt --trim-strategy two-read

Checklist:

  • My code follows the style guidelines of this project
  • I have performed a self-review of my code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes
  • Any dependent changes have been merged and published in downstream modules

@marissafujimoto marissafujimoto marked this pull request as ready for review December 18, 2024 23:40
@marissafujimoto marissafujimoto mentioned this pull request Jan 7, 2025
8 tasks
Copy link
Collaborator

@cansavvy cansavvy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@marissafujimoto This is looking good!

I don't have major comments just things here and there -- many of which can be done in new PRs so this doesn't get too much bigger.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would be good to have a README that explains about this example data. What it is and how it was subsetted. But this can also be in a future PR.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree. I will address that in a future pr.

# TODO check can write to this path?
parser.add_argument("-o", "--output", required=False,
help="Output file path to populate with the counts for each paired guide and sample. If not provided the counts will be output in STDOUT.")
# TODO support arbitrary trim strategies
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would probably be good to add each of these TODOs as GitHub issues so we don't lose track of them. Also could be on Basecamp. Wherever you prefer.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will do! I have a lot of TODOs to go back and see if they are worth prioritizing. Often these are where I didn't want to pause at that moment to make an issue, or small enough issues that I wanted to acknowledge a change could be made by future authors without necessarily committing to it the way an issue implies.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Totally makes sense! It's good move!

# TODO support arbitrary trim strategies
parser.add_argument("--trim-strategy", required=True, choices=(TWO_READ_STRATEGY, THREE_READ_STRATEGY),
help="The trim strategy used to extract guides and barcodes. The two read strategy should have fastqs R1 and I1. The three read strategy should have fastqs R1, I1, and I2") # TODO extract consts
parser.add_argument("--gRNA2-error", required=False, default=2, type=_check_nonnegative,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think these last ones need a "help" description.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are lines 50 and 52 not it?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh so sorry! I missed that somehow!

def write_counts(output_path: Optional[str], counts: Counter[tuple[str, str, str]], barcodes: dict[str, str], id_mappings: dict[str, str]):
"""
Write counts to an output file tsv with a header [id, seq_1, seq2, sample_id_1, ..., sample_id_n]. If no
output_path is specified, write to STDOUT.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could alternatively just have a default file name its saved to in the working directory.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But then again people could >. Hmm...

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I felt pipe support was necessary based on the expectations set by other bioinformatics tools, though this function signature could be more agnostic of that fact and just take a file opened by the caller? Then again this is just a library function exposed for convenience, and I feel path is pretty convenient and conventional.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes path I think is definitely a good call. I was just thinking about the default for it.

counts (Counter[tuple[str, str, str]]): The counts for each (gRNA1, gRNA2, barcode).
id_mappings (dict[str, str]): A mapping from (gRNA1, gRNA2) to the paired guide id.
"""
# TODO delete file if it already exists? manually test to see what happens if you don't do this
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like to opt for an "overwrite" parameter. Basically by default overwrite = FALSE. But if it is TRUE then we can delete the existing file of the same name and write a new one.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe the convention for python libraries is to overwrite e.g. https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.savefig.html and https://numpy.org/doc/2.1/reference/generated/numpy.save.html. The expectation is the caller can check before calling this if they desire that behavior?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A lot of bioinformatics packages are used by people who don't have a lot of experience with command line tools and because we are working with huge data sometimes we want to decrease the probability that people will shoot themselves in the foot (aka accidentally overwrite a file they didn't mean to).

So I think having a default that spits out a warning that people can shut off if they agree feels better than having it automatically overwrite before people realize what they are doing.

Happy to discuss more!

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm that is concerning. I think I have two users in mind. One is using pgMAP as a python library and I imagine they have the skills and expectations for this function signature. The second is the CLI user and I am concerned about them overwriting their files.

I wonder then if we should address that as part of the CLI (i.e. checking the output does not exist and erroring, and adding a flag to overwrite it if desired). Or do you still feel strongly that the library user should also have this warning?

Another concern is that taking a step back, we never really planned to support pgMAP as a python library, it more just was convenient enough to do it and it's been useful for some deep dive testing, but I wonder if it's something we say "use at your own risk". I will try to get answers about this prioritization of the library from the Berger lab folks.

"--barcode-error", "-1"])

def test_arg_parse_invalid_type_gRNA2_error(self):
with self.assertRaises(argparse.ArgumentError):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Once we have more of a consensus on what the results should be (and I think we are close on that) I also like to include tests that make sure that the results are still the same and that the data is formatted because we want to make sure that not only are we not getting an error, but that the data being returned is still what we expect.

This can be added in a different PR though.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense. Will do in a later PR.


for i, row in enumerate(tsv_reader):
if i == 0:
sample_ids = row[3:]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this always going to be index of 3 or might that change depending on the input data? I've lost track.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry this is sort of messy, basically since the header of the output is defined as "[id, seq_1, seq2, sample_id_1, ..., sample_id_n]", slicing from index 3 just gives the sample_ids. Hoping this is okay for a test helper function. I'll try to add a comment here for clarity (later PR)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah since its just for the test I think that's fine. Hard coded indices just always raise alarm bells for me but you're right that this is low stakes.

@cansavvy cansavvy merged commit 3ac44cd into main Jan 15, 2025
2 of 7 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants