suggested API #5

Open
timodonnell opened this issue Jun 9, 2021 · 4 comments

@timodonnell
Contributor

@endrebak asked for suggestions on a possible API to make this a more general-purpose read-my-bam tool.

My thought is a read_bam function that returns a pandas.DataFrame (similar to what is already here) but that:

  1. Reads all alignments and all fields by default (including unmapped reads)
  2. Supports subselecting the fields (columns) being read for efficiency using a parameter, say, fields. For example fields=["Chromosome", "Start", "End", "Strand"] would only read in the specified columns and return a DataFrame with only those columns. Similar to usecols in pandas.read_csv.
  3. Supports subselecting the alignments (rows) being read to specified regions (and uses the BAM index for doing this). E.g. regions=[("chr1", 100, 10000)] would subselect to chr1:100-10000.
  4. Supports subselecting the alignments (rows) being read according to the BAM record flags. I think adding a dedicated named parameter for each of these would be the most user-friendly. E.g. only_mapped=True would be the equivalent of passing -F 4 to samtools. Named parameters are really helpful here because the user does not have to do bit arithmetic with binary flag codes.
  5. Has a max_alignments argument so the user can read just the first 10 records by passing max_alignments=10.
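
To make point 4 concrete, here is a minimal sketch of how named parameters could be turned into the required/excluded masks that samtools takes as -f and -F. The bit values come from the SAM specification; the function names and the parameter names other than only_mapped are hypothetical and are not part of bamread.

```python
from typing import Tuple

# FLAG bits as defined in the SAM specification.
FLAG_PAIRED = 0x1
FLAG_UNMAPPED = 0x4
FLAG_SECONDARY = 0x100
FLAG_QC_FAIL = 0x200
FLAG_DUPLICATE = 0x400
FLAG_SUPPLEMENTARY = 0x800


def flag_masks(
    only_mapped: bool = False,
    only_paired: bool = False,         # hypothetical parameter name
    only_primary: bool = False,        # hypothetical parameter name
    exclude_duplicates: bool = False,  # hypothetical parameter name
    exclude_qc_fail: bool = False,     # hypothetical parameter name
) -> Tuple[int, int]:
    """Return (required, excluded) masks, analogous to samtools -f / -F."""
    required = 0
    excluded = 0
    if only_mapped:
        excluded |= FLAG_UNMAPPED  # same as -F 4
    if only_paired:
        required |= FLAG_PAIRED
    if only_primary:
        excluded |= FLAG_SECONDARY | FLAG_SUPPLEMENTARY
    if exclude_duplicates:
        excluded |= FLAG_DUPLICATE
    if exclude_qc_fail:
        excluded |= FLAG_QC_FAIL
    return required, excluded


def keep(flag: int, required: int, excluded: int) -> bool:
    """True if a record's FLAG has every required bit and no excluded bit."""
    return (flag & required) == required and (flag & excluded) == 0
```

A reader could compute the masks once per call and test each record's flag with keep(), so the user never sees the bit codes.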

I think one function that implements this would handle the majority of my use cases for reading BAMs in Python, and provide a much simpler API to get started with and use than pysam.

@endrebak
Collaborator

endrebak commented Jun 9, 2021

I agree with all of these, but I'll leave 2) as an exercise for someone else, since it is a lot of work for what is likely to be a surprisingly small gain in speed (compare read_full with read_sparse to see for yourself).

I already support using flags. See here: https://github.com/pyranges/bamread/blob/master/bamread/src/bamread.pyx#L17

@endrebak
Collaborator

endrebak commented Jun 9, 2021

I should also support writing BAMs, but I use BAMs so seldom that I've never gotten around to it (and I am not completely sure how to do it either).

@endrebak
Collaborator

endrebak commented Jun 9, 2021

The current functions contain some extra checks to make sure the data is valid for pyranges, such as:

if start < 0 or end < 0: continue

I guess I should create new, similar functions for the standalone library that skip these checks, so that it behaves exactly like pysam.
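
One possible shape for this, as a sketch only: keep a single code path and make the pyranges validity check optional. The record layout (chromosome, start, end) and the drop_invalid parameter are assumptions for illustration, not the current bamread code.

```python
from typing import Iterable, Iterator, Tuple

# Hypothetical record layout: (chromosome, start, end).
Record = Tuple[str, int, int]


def iter_records(records: Iterable[Record], drop_invalid: bool = True) -> Iterator[Record]:
    """Yield records, optionally skipping those with negative coordinates.

    drop_invalid=True mirrors the existing pyranges-oriented check;
    drop_invalid=False passes every record through unchanged.
    """
    for chrom, start, end in records:
        if drop_invalid and (start < 0 or end < 0):
            continue  # not a valid interval for pyranges
        yield chrom, start, end
```

With drop_invalid=False, records that have no valid coordinates (for example unmapped reads) would still reach the caller.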

@endrebak
Collaborator

endrebak commented Jun 9, 2021

(Just writing notes to self here)

I guess we can add read_sam as an alias for read_bam in __init__.py, since I do not think pysam cares about the distinction.

read_sam = read_bam

Worst case, I think we need to change the mode here so that it is not binary: samfile = pysam.AlignmentFile(filename, "rb")
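
If the mode does turn out to matter, a small helper could pick it from the file extension. This is a sketch under that assumption: alignment_mode is a hypothetical name, while "rb", "r" and "rc" are the pysam.AlignmentFile read modes for BAM, SAM and CRAM.

```python
import os

# pysam.AlignmentFile read modes keyed by file extension.
_MODES = {".bam": "rb", ".sam": "r", ".cram": "rc"}


def alignment_mode(filename: str) -> str:
    """Return the read mode to pass to pysam.AlignmentFile for this file."""
    ext = os.path.splitext(filename)[1].lower()
    try:
        return _MODES[ext]
    except KeyError:
        raise ValueError(f"unrecognised alignment file extension: {ext!r}") from None
```

The open call would then read samfile = pysam.AlignmentFile(filename, alignment_mode(filename)), and read_sam = read_bam could stay a plain alias.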
