Unable to properly link sequence_spec to library_spec #62

Open
massonix opened this issue Feb 16, 2025 · 7 comments

Comments

@massonix

Hi Sina and Pachter lab,

Thank you for developing this wonderful tool, it's going to be super helpful for our lab.

I've built a seqspec for the Flex kit, as I couldn't find it in the assays directory of this repo. I plan to submit it for review following the instructions provided here as soon as I solve this issue.

After running the following command:

seqspec print -f seqspec-html spec.yaml > spec.html

I get this html:

Image

As you can see, seqspec is not properly linking the library structure to the FASTQ files. I created the spec based on other examples I found here. Here is the spec file:

spec.txt

For reference, this is what Read 1 and Read 2 look like in my example dataset:

R1:

Image

R2:

Image

Note that the R2 should be 88bp: 25 RHS + 25 RHS + 16 constant sequence + 10 probe barcode + 12 pCS1. However, the actual read in the fastq is 90bp long, which may be problematic.
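
As a quick sanity check of the arithmetic above, here is a minimal sketch. The dictionary keys are labels chosen for illustration only, not seqspec region identifiers; the lengths are the ones listed in this comment.

```python
# Segment lengths (bp) for Read 2 as listed above.
# Key names are illustrative labels, not seqspec region_ids.
segments = {
    "probe_half_1": 25,
    "probe_half_2": 25,
    "constant_sequence": 16,
    "probe_barcode": 10,
    "pCS1": 12,
}

expected_len = sum(segments.values())  # length implied by the library design
observed_len = 90                      # read length seen in the FASTQ

# Bases the read extends past the last listed segment.
overhang = observed_len - expected_len
print(f"expected={expected_len} bp, observed={observed_len} bp, overhang={overhang} bp")
```

A positive overhang means the read continues into whatever region lies beyond the last listed segment.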

I'm running seqspec 0.3.1 and python 3.10.16

Many thanks in advance for your kind help!

Ramon

@sbooeshaghi
Collaborator

Hi Ramon, thank you for submitting this issue. It’s great to see you putting together a seqspec for data—I’m happy to help debug.

I’ll take a look at the generated HTML page. Quick question: what does the sequencer report for the final two bases? If the sequenceable part of the library is length n and the sequencer produces reads of length n+2, are random bases being added?

@sbooeshaghi
Collaborator

Ah, sorry, that was kind of a silly question. The answer is obviously that the read would run through the pCS1 and continue into the UMI. I've mocked up the HTML output for your seqspec below. I would appreciate your feedback on whether this correctly visualizes the link between your reads and the library structure.

Image

@massonix
Author

Hi Sina, thanks so much for looking into it! Exactly, the two extra bases are from the UMI.

For reference, this is the library structure

Image

(obtained from page 20 in this protocol)

Read 1 contains the cell barcode (16 bp) + UMI (12 bp), so it should look like this (from the 10x-RNA-v3 seqspec):

Image

Read 2 contains probe insert (50bp) + constant sequence + probe BC + pCS1, so they should also be under "rna-R2.fastq.gz". rna-I2.fastq.gz contains the rna-index5, and rna-I1.fastq.gz the rna-index7.

From looking at the examples, I understood that seqspec figures out these connections by combining primer_id, strand, min_len and max_len, but I likely specified something wrong.
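
To make that understanding concrete, here is a minimal sketch of how such a mapping could work. It is not seqspec's actual implementation: the `Region` and `Read` classes, the `regions_covered` helper, the toy library layout, and the `"pos"`/`"neg"` strand values are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Region:
    region_id: str
    length: int  # bp


@dataclass
class Read:
    read_id: str
    primer_id: str  # region_id of the primer the read starts from
    strand: str     # "pos" reads rightwards, "neg" reads leftwards (assumed values)
    max_len: int    # maximum read length in bp


def regions_covered(library, read):
    """Return the region_ids a read touches, walking away from its primer."""
    ids = [r.region_id for r in library]
    i = ids.index(read.primer_id)
    # Regions after the primer for "pos", regions before it (reversed) for "neg".
    candidates = library[i + 1:] if read.strand == "pos" else library[:i][::-1]
    covered, remaining = [], read.max_len
    for region in candidates:
        if remaining <= 0:
            break
        covered.append(region.region_id)
        remaining -= region.length
    return covered


# Toy layout, loosely modelled on a barcode + UMI + insert library (hypothetical).
library = [
    Region("read1_primer", 33),
    Region("cell_barcode", 16),
    Region("umi", 12),
    Region("insert", 50),
    Region("read2_primer", 34),
]
r1 = Read("R1.fastq.gz", "read1_primer", "pos", 28)
r2 = Read("R2.fastq.gz", "read2_primer", "neg", 52)

print(regions_covered(library, r1))
print(regions_covered(library, r2))
```

In this toy layout R2 is two bases longer than the insert, so it also touches the UMI, which is the same kind of overlap as the 90 bp read described above.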

Hope this helps!

@massonix
Author

Image

This indeed perfectly matches the link between reads and the library structure 👍

@sbooeshaghi
Collaborator

Hi @massonix,

I’ve updated seqspec print to display the reads on the sequence. You can test it by installing seqspec from the devel branch, formatting your spec file, and then running seqspec print:

pip install git+https://github.com/pachterlab/seqspec.git@devel
seqspec format -o spec.yaml spec.txt
seqspec print -f seqspec-html -o spec.html spec.yaml

Let me know if you run into any issues!

@massonix
Author

Hi Sina,

Apologies for the delay, I finally had time to look into this. I ran the code that you provided and obtained this html:

Image

I love the arrows added in the "Final Library" section; they make it easier to understand the connection between library structure and sequencing reads. However, the "Sequence structure" and "Library structure" sections still appear disconnected. I think the true genius of these HTML pages is the hierarchical dropdown arrows, where users can click on R1.fastq and immediately find the UMI and cell barcode contained there, which is the case for all the HTML pages in IGVF. For instance:

Image

To rule out that this is an error specific to my seqspec, I ran the same command on 10x_rna_v3.spec.yaml and obtained this:

Image

which is different from the HTML in IGVF linked above. That said, this may be overkill to implement for the edge cases where one of the reads extends into the other, so I totally understand if this feature is not prioritized. Thanks in advance, and I'm happy to discuss and test this further if needed :)

@massonix
Author

massonix commented Mar 5, 2025

Hi Sina, I'm building another spec for a new version of NTT-seq. I set the sequence_type of the gDNA region to random, with a min_len of 100 and a max_len of 500:

  - !Region
    parent_id: histone_mark
    region_id: histone_mark-gDNA
    region_type: gdna
    name: transposed gDNA next to targeted histone mark
    sequence_type: random
    sequence: XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
    min_len: 100
    max_len: 500
    onlist: null
    regions: null

After converting to html, I get this:

Image

Image

Image

See how I need to scroll all the way to the right to reach the arrows with the actual reads. I guess that seqspec is using max_len to draw the reads, but it would be better if it looked something like this:

Image

Another option could be to encode in the sequence that the number of nucleotides can vary (something like XXX...XXX), to prevent long sequences, similar to

Image
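
The abbreviation idea could be sketched roughly as follows. This is only an illustration of the suggestion, not an existing seqspec feature; the `abbreviate` helper and its `flank` and `max_display` parameters are hypothetical names.

```python
def abbreviate(sequence: str, flank: int = 3, max_display: int = 20) -> str:
    """Shorten a long placeholder sequence for display only.

    Sequences longer than max_display are collapsed to their first and last
    `flank` characters joined by "...". The spec itself is left unchanged.
    """
    if flank < 1:
        raise ValueError("flank must be at least 1")
    if len(sequence) <= max_display:
        return sequence
    return f"{sequence[:flank]}...{sequence[-flank:]}"


gdna = "X" * 100         # placeholder for a variable-length gDNA region
print(abbreviate(gdna))
print(abbreviate("ACGT"))  # short sequences are shown in full
```

With something like this applied at render time, the arrows for the reads would stay within view regardless of the region's max_len.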

In any case, I'm really happy with how seqspec works, it's truly the "lingua franca" of genomics assays ^^, my labmates are already loving it! Let me know if I can help with anything on my end.

Ramon
