Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

How critical is the "Non-unique read group integer IDs found" warning in running ipdSummary? #98

Open
Marjan-Hosseini opened this issue Jan 24, 2024 · 3 comments

Comments

@Marjan-Hosseini
Copy link

I am trying to get the ipdRatio and modified bases information from a PacBio bam file which is sorted and indexed. The command I'm running is:

ipdSummary sorted_chr21_0_100000.bam --reference $REF_path/$REF_FILE_NAME.fasta --referenceWindow 20:0-100000 --identify m6A,m4C,m5C_TET --methylFraction --outfile sorted_chr21_0_100000;

I get the following warning message:

RuntimeWarning: Non-unique read group integer IDs found - some features may be restricted warnings.warn("Non-unique read group integer IDs found - some features may be restricted", RuntimeWarning)

My question is which types of features might be restricted if my read groups are not unique?

I am not sure how critical it could be.

@natechols
Copy link
Contributor

It's probably not critical - it was part of a fix for working with multiplexed data, and I was concerned about possible corner cases so I made it output an obnoxious warning. However, are you running a multi-sample BAM? It's unusual to see that warning unless you're combining barcodes.

@Marjan-Hosseini
Copy link
Author

@natechols Thank you for your response! This is the reason my read groups messed up:
I once split my large bam file, and because the size of the smaller bams were not small enough I had to merge the bam file again to split. This process for some reason made read groups be in the format of:

@RG     ID:5b57ecf9-62574588    PL:PACBIO       DS:READTYPE=SUBREAD;Ipd:CodecV1=ip;PulseWidth:CodecV1=pw;BINDINGKIT=101-789-500;SEQUENCINGKIT=101-789-300;BASECALLERVERSION=5.0.0;FRAMERATEHZ=100.000000
@RG     ID:f82bacd0-7B8A4FFF    PL:PACBIO       DS:READTYPE=SUBREAD;Ipd:CodecV1=ip;PulseWidth:CodecV1=pw;BINDINGKIT=101-789-500;SEQUENCINGKIT=101-789-300;BASECALLERVERSION=5.0.0;FRAMERATEHZ=100.000000
@RG     ID:9af21d85-1080A37D    PL:PACBIO       DS:READTYPE=SUBREAD;Ipd:CodecV1=ip;PulseWidth:CodecV1=pw;BINDINGKIT=101-789-500;SEQUENCINGKIT=101-789-300;BASECALLERVERSION=5.0.0;FRAMERATEHZ=100.000000

This is part of the read groups.
As a result, ipdSummary gave me this error:

ValueError: invalid literal for int() with base 16: '5b57ecf9-1A103DBB'

which was from this part of the code:

pbcore/io/align/_BamSupport.py", line 70, in rgAsInt
    return np.int32(int(rgIdString.split("/")[0], 16))

Because my read groups that now have dashes cannot be converted to int. So in my local installation, I modified line 70 to:

return np.int32(int(rgIdString.split("-")[0], 16))

and that solved the problem but I got the warning because now my read groups are not unique.
I don't know why merging bam files by 'samtools' made such read groups. The first splitted files do not have such a format of read groups.

@natechols
Copy link
Contributor

Do not use samtools to merge PacBio BAMs if you want the output to work in downstream PacBio applications. Instead, use the pbmerge utility that is included with SMRT Link, and also distributed here: https://github.com/PacificBiosciences/pbtk
(I think it's buried in the install tree in smrtlink install, I can dig up the exact path later). This will ensure that your output is compatible with our reader functions.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants