Duplicated read IDs using simulate_experiment_countmat() #50

Open
vllorens opened this issue Jan 16, 2018 · 1 comment

Comments

@vllorens
Contributor

Hello,

I started simulating an experiment with 2M reads and noticed that the resulting FASTA files contain 4000002 lines, that is, one read more than expected.

I then found out that there are two different reads with the same read ID:

read1000001/649967984 Bacsa_2669 3110994..3111788(-)(NC_015164) [Bacteroides salanitronis BL78, DSM 18170];mate1:353-452;mate2:329-428
read1000001/649967984 Bacsa_2669 3110994..3111788(-)(NC_015164) [Bacteroides salanitronis BL78, DSM 18170];mate1:393-492;mate2:566-665

The number of duplicated read IDs grows with the size of the simulation. For instance, simulating 4M reads generates 3 duplicated read IDs: read1000001, read2000001 and read3000001.
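
For anyone who wants to check their own output, here is a minimal sketch of how the duplicated IDs can be counted. It assumes each FASTA header starts with `>` and that the read ID is everything before the first `/`, as in the two records above; the function names and the file path are placeholders, not part of polyester.

```python
from collections import Counter


def read_id(header):
    """Return the 'readN' prefix of a FASTA header line (assumed format)."""
    return header.lstrip(">").split("/", 1)[0]


def duplicated_read_ids(lines):
    """Map each read ID seen on more than one header line to its count."""
    counts = Counter(read_id(line) for line in lines if line.startswith(">"))
    return {rid: n for rid, n in counts.items() if n > 1}


# Usage with a real file (the path is a placeholder):
# with open("sample_01_1.fasta") as fh:
#     print(duplicated_read_ids(fh))
```

Sequence lines are ignored, so the check only needs one pass over the file and keeps one counter entry per read ID.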

This can cause problems in downstream analysis, as some tools raise an error when they encounter the same read ID twice. Also, for comparison purposes, I'd expect the number of reads produced to match the count matrix exactly. I can remove the reads with duplicated IDs myself, but I'd rather have this fixed in polyester's output.
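
As a stopgap, the removal could look roughly like the sketch below. It keeps the first record for each read ID and drops later ones, under the same assumed header format (`>` followed by the read ID and a `/`). The function name is hypothetical, and for paired-end output the same filter would have to be applied to both mate files so they stay in sync.

```python
def drop_duplicate_reads(lines):
    """Yield FASTA lines, skipping records whose read ID was already seen."""
    seen = set()
    keep = False  # whether the record currently being read should be kept
    for line in lines:
        if line.startswith(">"):
            rid = line[1:].split("/", 1)[0]  # assumed header format
            keep = rid not in seen
            seen.add(rid)
        if keep:
            yield line


# Usage (paths are placeholders):
# with open("sample_01_1.fasta") as src, open("dedup_1.fasta", "w") as dst:
#     dst.writelines(drop_duplicate_reads(src))
```

This only hides the symptom: the read count then matches the count matrix, but which of the two duplicated reads survives is arbitrary.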

I have had a quick look at the sgseq() function, and the issue seems to be related to the offset value. I presume simulate_experiment() also yields duplicated read IDs, since it uses sgseq() too.
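
To make the suspicion concrete, here is a toy model of one way an offset error could produce this pattern. It is not polyester's code: the chunk size of 1,000,000 and the off-by-one itself are assumptions inferred only from the duplicated IDs listed above, and `toy_read_ids` is a made-up name.

```python
def toy_read_ids(n_reads, chunk_size=1_000_000):
    """Toy model only: number reads chunk by chunk with an off-by-one.

    Every chunk except the last emits one read too many, and the next
    chunk starts again at that same number, so the boundary ID repeats.
    """
    ids = []
    offset = 1
    while offset <= n_reads:
        last = min(offset + chunk_size - 1, n_reads)
        extra = 1 if last < n_reads else 0  # hypothetical off-by-one
        ids.extend(range(offset, last + extra + 1))
        offset = last + 1  # next chunk reuses the extra ID
    return ids
```

Under these assumptions, a 2M-read run yields 2,000,001 IDs with 1000001 repeated, and a 4M-read run repeats 1000001, 2000001 and 3000001, which matches what I see. Whether sgseq() actually behaves this way still needs to be confirmed in the source.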

I'll look at this in more detail in the coming days to see if I can solve it myself. In the meantime, thanks in advance for your comments on this!

@JMF47
Collaborator

JMF47 commented Jan 16, 2018

Hi @vllorens, what serendipitous timing. I JUST found this out myself as well. Thank you for already looking into it. I will continue with the search too, and we can keep each other posted here. Many thanks!
