Add python CLI #14

marissafujimoto · 2024-12-18T23:40:04Z

Description

Addresses #12

Type of change

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
This change requires a documentation update

How Has This Been Tested?

Unit tests. Run with python3 -m pip install . && python3 -m tests
Manual test. Run with python3 src/pgmap/cli.py --fastq example-data/two-read-strategy/240123_VH01189_224_AAFGFNYM5/Undetermined_S0_R1_001_Sampled10k.fastq.gz example-data/two-read-strategy/240123_VH01189_224_AAFGFNYM5/Undetermined_S0_I1_001_Sampled10k.fastq.gz --library example-data/pgPEN-library/paralog_pgRNA_annotations.txt --barcodes example-data/two-read-strategy/240123_VH01189_224_AAFGFNYM5/barcode_ref_file_revcomp.txt --trim-strategy two-read

Checklist:

My code follows the style guidelines of this project
I have performed a self-review of my code
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation
My changes generate no new warnings
I have added tests that prove my fix is effective or that my feature works
New and existing unit tests pass locally with my changes
Any dependent changes have been merged and published in downstream modules

cansavvy

@marissafujimoto This is looking good!

I don't have major comments just things here and there -- many of which can be done in new PRs so this doesn't get too much bigger.

cansavvy · 2025-01-15T18:17:42Z

example-data/pgPEN-library/paralog_pgRNA_annotations.txt

Would be good to have a README that explains about this example data. What it is and how it was subsetted. But this can also be in a future PR.

Agree. I will address that in a future pr.

cansavvy · 2025-01-15T18:19:26Z

src/pgmap/cli.py

+    # TODO check can write to this path?
+    parser.add_argument("-o", "--output", required=False,
+                        help="Output file path to populate with the counts for each paired guide and sample. If not provided the counts will be output in STDOUT.")
+    # TODO support arbitrary trim strategies


Would probably be good to add each of these TODOs as GitHub issues so we don't lose track of them. Also could be on Basecamp. Wherever you prefer.

Will do! I have a lot of TODOs to go back and see if they are worth prioritizing. Often these are where I didn't want to pause at that moment to make an issue, or small enough issues that I wanted to acknowledge a change could be made by future authors without necessarily committing to it the way an issue implies.

Totally makes sense! It's good move!

cansavvy · 2025-01-15T18:19:50Z

src/pgmap/cli.py

+    # TODO support arbitrary trim strategies
+    parser.add_argument("--trim-strategy", required=True, choices=(TWO_READ_STRATEGY, THREE_READ_STRATEGY),
+                        help="The trim strategy used to extract guides and barcodes. The two read strategy should have fastqs R1 and I1. The three read strategy should have fastqs R1, I1, and I2")  # TODO extract consts
+    parser.add_argument("--gRNA2-error", required=False, default=2, type=_check_nonnegative,


I think these last ones need a "help" description.

Are lines 50 and 52 not it?

Oh so sorry! I missed that somehow!

cansavvy · 2025-01-15T18:21:13Z

src/pgmap/io/counts_writer.py

+def write_counts(output_path: Optional[str], counts: Counter[tuple[str, str, str]], barcodes: dict[str, str], id_mappings: dict[str, str]):
+    """
+    Write counts to an output file tsv with a header [id, seq_1, seq2, sample_id_1, ..., sample_id_n]. If no
+    output_path is specified, write to STDOUT.


Could alternatively just have a default file name its saved to in the working directory.

But then again people could >. Hmm...

I felt pipe support was necessary based on the expectations set by other bioinformatics tools, though this function signature could be more agnostic of that fact and just take a file opened by the caller? Then again this is just a library function exposed for convenience, and I feel path is pretty convenient and conventional.

Yes path I think is definitely a good call. I was just thinking about the default for it.

cansavvy · 2025-01-15T18:22:39Z

src/pgmap/io/counts_writer.py

+        counts (Counter[tuple[str, str, str]]): The counts for each (gRNA1, gRNA2, barcode).
+        id_mappings (dict[str, str]): A mapping from (gRNA1, gRNA2) to the paired guide id.
+    """
+    # TODO delete file if it already exists? manually test to see what happens if you don't do this


I like to opt for an "overwrite" parameter. Basically by default overwrite = FALSE. But if it is TRUE then we can delete the existing file of the same name and write a new one.

I believe the convention for python libraries is to overwrite e.g. https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.savefig.html and https://numpy.org/doc/2.1/reference/generated/numpy.save.html. The expectation is the caller can check before calling this if they desire that behavior?

A lot of bioinformatics packages are used by people who don't have a lot of experience with command line tools and because we are working with huge data sometimes we want to decrease the probability that people will shoot themselves in the foot (aka accidentally overwrite a file they didn't mean to).

So I think having a default that spits out a warning that people can shut off if they agree feels better than having it automatically overwrite before people realize what they are doing.

Happy to discuss more!

Hmm that is concerning. I think I have two users in mind. One is using pgMAP as a python library and I imagine they have the skills and expectations for this function signature. The second is the CLI user and I am concerned about them overwriting their files.

I wonder then if we should address that as part of the CLI (i.e. checking the output does not exist and erroring, and adding a flag to overwrite it if desired). Or do you still feel strongly that the library user should also have this warning?

Another concern is that taking a step back, we never really planned to support pgMAP as a python library, it more just was convenient enough to do it and it's been useful for some deep dive testing, but I wonder if it's something we say "use at your own risk". I will try to get answers about this prioritization of the library from the Berger lab folks.

cansavvy · 2025-01-15T18:25:58Z

tests/__main__.py

+                                    "--barcode-error", "-1"])
+
+    def test_arg_parse_invalid_type_gRNA2_error(self):
+        with self.assertRaises(argparse.ArgumentError):


Once we have more of a consensus on what the results should be (and I think we are close on that) I also like to include tests that make sure that the results are still the same and that the data is formatted because we want to make sure that not only are we not getting an error, but that the data being returned is still what we expect.

This can be added in a different PR though.

Makes sense. Will do in a later PR.

cansavvy · 2025-01-15T18:28:12Z

tests/__main__.py

+
+            for i, row in enumerate(tsv_reader):
+                if i == 0:
+                    sample_ids = row[3:]


Is this always going to be index of 3 or might that change depending on the input data? I've lost track.

Sorry this is sort of messy, basically since the header of the output is defined as "[id, seq_1, seq2, sample_id_1, ..., sample_id_n]", slicing from index 3 just gives the sample_ids. Hoping this is okay for a test helper function. I'll try to add a comment here for clarity (later PR)

Yeah since its just for the test I think that's fine. Hard coded indices just always raise alarm bells for me but you're right that this is low stakes.

marissafujimoto added 10 commits December 18, 2024 10:19

Move counter to separate module

ee3c1e0

Add argparse proof of concept

1110aee

Add argument parsing

7745c57

Add loading library from annotation file

01280b2

Add libary annotation reader docs

94f98b0

Get counts in cli

32b7b11

Load id mappings from library annotation file

3e6f85b

Add file output writing

db23827

Make trim strategy required

c023589

Use hyphens instead of underscores for cli options

3ac44cd

marissafujimoto marked this pull request as ready for review December 18, 2024 23:40

marissafujimoto requested a review from cansavvy December 18, 2024 23:41

marissafujimoto mentioned this pull request Jan 7, 2025

Single threaded optimization #15

Merged

8 tasks

cansavvy approved these changes Jan 15, 2025

View reviewed changes

cansavvy merged commit 3ac44cd into main Jan 15, 2025
2 of 7 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add python CLI #14

Add python CLI #14

marissafujimoto commented Dec 18, 2024

cansavvy left a comment

cansavvy Jan 15, 2025

marissafujimoto Jan 15, 2025

cansavvy Jan 15, 2025

marissafujimoto Jan 15, 2025

cansavvy Jan 15, 2025

cansavvy Jan 15, 2025

marissafujimoto Jan 15, 2025

cansavvy Jan 15, 2025

cansavvy Jan 15, 2025

cansavvy Jan 15, 2025

marissafujimoto Jan 15, 2025

cansavvy Jan 15, 2025

cansavvy Jan 15, 2025

marissafujimoto Jan 15, 2025

cansavvy Jan 15, 2025

marissafujimoto Jan 15, 2025

cansavvy Jan 15, 2025

marissafujimoto Jan 15, 2025

cansavvy Jan 15, 2025

marissafujimoto Jan 15, 2025

cansavvy Jan 15, 2025

Add python CLI #14

Add python CLI #14

Conversation

marissafujimoto commented Dec 18, 2024

Description

Type of change

How Has This Been Tested?

Checklist:

cansavvy left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment