Bug load fastq records #63

Open
miliskato opened this issue Nov 5, 2024 · 2 comments

miliskato commented Nov 5, 2024

Hi,

When running ConFindr on a specific sample we encountered a KeyError where the key could not be found in the fastq records. However, the key is present in the fastq records.

After some digging, we traced the cause to the load_fastq_records method in methods.py. Our read names contain :1: (the lane number; see https://help.basespace.illumina.com/files-used-by-basespace/fastq-files) and also end in /1. Because the :1: check comes first, /1 is appended to a record id that already ends in /1. As a result, the key it is looking for (read_name/1) does not match the FASTQ read names.

Is there a reason why you first check if :1: is present in the record before checking if the record already contains /1? Can this be swapped and can you check if the read ends with /1 instead of containing it? Also, it is documented above the first condition (if ':1:' in record.id) that you change a :1: to /1 in the record id, but you just add /1. Is this a mistake in the documentation or in the code?

Current code:

if forward:
    # Change a :1: to /1 in the record.id
    if ':1:' in record.id:
        record.id = record.id + '/1'
    # Don't worry if the record.id already has a /1
    elif '/1' in record.id:
        pass
    # If the record.id doesn't have a read direction, add /1
    else:
        record.id = record.id + '/1'
# Process reverse reads in a similar fashion to forward reads
else:
    if ':2:' in record.id:
        record.id = record.id + '/2'
    elif '/2' in record.id:
        pass
    else:
        record.id = record.id + '/2'
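
To make the effect of the branch order concrete, here is a minimal standalone sketch that mirrors only the forward-read branches quoted above. The function name `current_forward_id` and the read name are hypothetical and chosen for illustration; this is not the actual ConFindr function.

```python
def current_forward_id(record_id: str) -> str:
    """Mirror the branch order of the forward-read logic quoted above."""
    if ':1:' in record_id:
        # Checked first, so an existing trailing /1 is never noticed
        return record_id + '/1'
    elif '/1' in record_id:
        return record_id
    else:
        return record_id + '/1'


# Hypothetical read name: a ":1:" lane field plus a trailing "/1"
read_name = 'M01234:55:000000000-ABCDE:1:1101:15000:1500/1'

# The suffix is appended a second time, so the id now ends in "/1/1"
print(current_forward_id(read_name))
```

With this ordering, any id that contains :1: receives a second /1 even when it already ends in /1.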

Suggested code:

if forward:
    # Don't worry if the record.id already has a /1
    if record.id.endswith('/1'):
        pass
    # Change a :1: to /1 in the record.id
    elif ':1:' in record.id:
        record.id = record.id + '/1'
    # If the record.id doesn't have a read direction, add /1
    else:
        record.id = record.id + '/1'
# Process reverse reads in a similar fashion to forward reads
else:
    if record.id.endswith('/2'):
        pass
    elif ':2:' in record.id:
        record.id = record.id + '/2'
    else:
        record.id = record.id + '/2'
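
Since the :1: branch and the final else branch in the suggested code both append the same suffix, the logic can probably be collapsed into a single check. The sketch below assumes that reading; `add_direction_suffix` is a hypothetical helper name, not something that exists in ConFindr.

```python
def add_direction_suffix(record_id: str, forward: bool) -> str:
    """Append /1 or /2 unless the id already ends with that suffix."""
    suffix = '/1' if forward else '/2'
    if record_id.endswith(suffix):
        # Already carries the read direction, leave it untouched
        return record_id
    return record_id + suffix


# Hypothetical read names for illustration
print(add_direction_suffix('M01234:55:FC:1:1101:15000:1500/1', forward=True))
print(add_direction_suffix('M01234:55:FC:1:1101:15000:1500', forward=False))
```

Using endswith rather than a substring test also avoids skipping ids that merely contain /1 somewhere in the middle.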

Thanks in advance for your reply!

pcrxn self-assigned this Jan 24, 2025

pcrxn (Collaborator) commented Jan 24, 2025

Hi @miliskato, I'm sorry for the slow response to your issue, and thank you for including those suggested code changes.

It seems that your FASTQ headers may be in an unconventional format. In the format described at the link you provided above, a space character separates fields such as the <lane> from the <read> (read direction, 1 or 2).

When ConFindr reads the paired-end FASTQ files, it only uses the first contiguous string (no whitespace) in the FASTQ header line as the record.id, and it assumes that this is the same for both forward and reverse reads, except for instances of ":1:" and "/1" which are handled according to the code provided above in load_fastq_records().
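
As a rough illustration of the "first contiguous string" rule, the sketch below derives an id from a header line in plain Python. The function `header_to_id` and both headers are hypothetical; this only approximates the behaviour described here and does not call ConFindr or its FASTQ parser.

```python
def header_to_id(header_line: str) -> str:
    """Return the first whitespace-free token of a FASTQ header, without '@'."""
    if header_line.startswith('@'):
        header_line = header_line[1:]
    return header_line.split(None, 1)[0]


# Conventional layout: a space separates the lane fields from the read field
conventional = '@M01234:55:FC:1:1101:15000:1500 1:N:0:1'
# Layout like the one reported in this issue: no space, trailing /1
reported = '@M01234:55:FC:1:1101:15000:1500/1'

print(header_to_id(conventional))  # id carries no read direction
print(header_to_id(reported))      # id keeps the trailing /1
```

In the first case the read direction is dropped with the rest of the description; in the second it stays part of the id, which is where the :1: and /1 handling comes into play.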

When a pair of forward and reverse FASTQ files is provided where the headers do not match between mate pairs, a KeyError is raised by characterise_read() for the reverse read, as this function assumes that the key for the reverse read in the fastq_records dictionary is the record.id of the forward read + '/1'. I believe this error has the same root cause as #52, and I plan to modify #54 accordingly to track this.
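
One way to check for this situation before running anything is to compare the ids of both files once any trailing /1 or /2 is removed. The following is a sketch under that assumption; `strip_direction` and `unmatched_mates` are hypothetical helpers and not part of ConFindr.

```python
def strip_direction(record_id: str) -> str:
    """Remove a trailing /1 or /2 from a record id, if present."""
    if record_id.endswith(('/1', '/2')):
        return record_id[:-2]
    return record_id


def unmatched_mates(forward_ids, reverse_ids):
    """Return the base ids that appear in only one of the two files."""
    forward_base = {strip_direction(i) for i in forward_ids}
    reverse_base = {strip_direction(i) for i in reverse_ids}
    # Symmetric difference: ids without a mate on the other side
    return sorted(forward_base ^ reverse_base)


# Hypothetical ids for illustration
print(unmatched_mates(['a/1', 'b/1'], ['a/2', 'c/2']))
```

An empty result would suggest that the headers pair up and the lookup key for each mate can be derived from the other.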

Would you be able to provide an example of a pair of FASTQ files which are causing this issue? I would like to test the proposed changes to the code on these files before submitting a pull request to the ConFindr repository. I attempted to reproduce the issue using some modified FASTQ headers, but your suggested changes didn't prevent the KeyError on these, unfortunately.

Thanks!

miliskato (Author) commented

Hi,

Thank you for your reply. Attached you can find a subsample of the FASTQ files that caused this issue.

Thank you!

sub1_R1.fastq.gz
sub1_R2.fastq.gz
