suggested API #5

timodonnell · 2021-06-09T17:38:34Z

@endrebak asked for suggestions on a possible API to make this a more general purpose read-my-bam tool.

My thought is a read_bam function that returns a pandas.DataFrame (similar to what is already here) but that:

1. Reads all alignments and all fields by default (including unmapped reads).
2. Supports subselecting the fields (columns) being read, for efficiency, using a parameter, say, fields. For example, fields=["Chromosome", "Start", "End", "Strand"] would only read in the specified columns and return a DataFrame with only those columns. This is similar to usecols in pandas.read_csv.
3. Supports subselecting the alignments (rows) being read to specified regions (and uses the BAM index for doing this). E.g. regions=[("chr1", 100, 10000)] would subselect to chr1:100-10000.
4. Supports subselecting the alignments (rows) being read according to the BAM record flags. I think adding a particular parameter for each of these would be the most user friendly. E.g. only_mapped=True would be the equivalent of passing -F 4 to samtools. I think it is really helpful to use named parameters here rather than making the user do bit arithmetic with binary flag codes. Basically, implement this as named arguments.
5. Has a max_alignments argument so the user can read just the first 10 records by passing max_alignments=10.
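
To make the named-flag and max_alignments ideas concrete, here is a minimal sketch in plain Python. It is not bamread's or pysam's API: the function names and every keyword except only_mapped and max_alignments (only_primary, skip_duplicates, skip_qc_fail) are hypothetical, chosen only for illustration. The bit values themselves come from the SAM specification.

```python
from itertools import islice

# Flag bits as defined by the SAM specification.
FLAG_UNMAPPED = 0x4
FLAG_SECONDARY = 0x100
FLAG_QC_FAIL = 0x200
FLAG_DUPLICATE = 0x400
FLAG_SUPPLEMENTARY = 0x800


def exclude_mask(only_mapped=False, only_primary=False,
                 skip_duplicates=False, skip_qc_fail=False):
    """Translate named filters (hypothetical names) into a `samtools -F` style mask."""
    mask = 0
    if only_mapped:
        mask |= FLAG_UNMAPPED
    if only_primary:
        mask |= FLAG_SECONDARY | FLAG_SUPPLEMENTARY
    if skip_duplicates:
        mask |= FLAG_DUPLICATE
    if skip_qc_fail:
        mask |= FLAG_QC_FAIL
    return mask


def select_flags(flags, max_alignments=None, **named_filters):
    """Keep flag values with none of the excluded bits set, up to max_alignments."""
    mask = exclude_mask(**named_filters)
    kept = (flag for flag in flags if not flag & mask)
    # islice with stop=None consumes everything, so None means "no limit".
    return list(islice(kept, max_alignments))


if __name__ == "__main__":
    example_flags = [0, 4, 16, 256, 1024]
    print(exclude_mask(only_mapped=True))
    print(select_flags(example_flags, only_mapped=True, max_alignments=2))
```

In a real reader the same mask test would be applied to each record's flag field while iterating, so the user never has to write the bit arithmetic themselves.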

I think one function that implements this would handle the majority of my use cases for reading BAMs in Python, and provide a much simpler API to get started with and use than pysam.


endrebak · 2021-06-09T17:50:08Z

I agree with all, but I'll leave 2) as an exercise for someone else since it is a lot of work for what is likely to be a surprisingly small gain in speed (try it yourself with read_full vs read_sparse to see).

I already support using flags. See here: https://github.com/pyranges/bamread/blob/master/bamread/src/bamread.pyx#L17

endrebak · 2021-06-09T17:51:14Z

I should also support writing BAMs, but I use BAMs so seldom that I've never gotten around to it (and I am not completely sure how to do it either).

endrebak · 2021-06-09T17:55:00Z

The current functions contain some extra stuff to make sure the data is valid for pyranges like

if start < 0 or end < 0: continue

I guess I should create new similar ones to use for the standalone library so that it acts completely like pysam.

endrebak · 2021-06-09T17:57:49Z

(Just writing notes to self here)

I guess we can alias read_bam to read_sam in __init__.py since I do not think pysam cares about the distinction.

read_sam = read_bam

Worst case we need to change this to not be binary I think: samfile = pysam.AlignmentFile(filename, "rb")
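
If the mode does turn out to matter, one possible approach is to pick it from the file extension. This is only a sketch: the helper name alignment_mode is hypothetical and is not part of bamread. The mode strings ("r" for SAM, "rb" for BAM, "rc" for CRAM) are the ones pysam.AlignmentFile documents.

```python
from pathlib import Path


def alignment_mode(filename):
    """Guess a pysam.AlignmentFile mode string from the file extension."""
    suffix = Path(filename).suffix.lower()
    if suffix == ".sam":
        return "r"   # plain-text SAM
    if suffix == ".cram":
        return "rc"  # CRAM
    return "rb"      # default to binary BAM, as the current code does


if __name__ == "__main__":
    for name in ("reads.bam", "reads.sam", "reads.cram"):
        print(name, alignment_mode(name))
```

The result could then be passed in place of the hard-coded "rb", e.g. pysam.AlignmentFile(filename, alignment_mode(filename)).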
