Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bed file option for Ordered-Histgrowth #55

Open
OZTaekOppa opened this issue Oct 18, 2024 · 17 comments
Open

Bed file option for Ordered-Histgrowth #55

OZTaekOppa opened this issue Oct 18, 2024 · 17 comments

Comments

@OZTaekOppa
Copy link

Hi again,

I have a few more burning questions:

  1. In the example of ordered-histgrowth from this link (https://github.com/marschall-lab/panacus/blob/main/examples/ordered_pangenome_growth_statistics.md), it only uses a single chromosome. That’s why the total size shows around 175M after adding NA21309. If that’s correct, should the plot for the full genome (22 autosomes and 2 sex chromosomes) reach 3.1B after adding the final assembly?

image

  1. I noticed that the -s and -e options allow the use of a BED file.

-s, --subset Produce counts by subsetting the graph to a given list of paths (1-column list) or path coordinates (3- or 12-column BED file). If the "order" option is not used, the subset list will also indicate the order of
paths/groups in the histogram. [default: ]
-e, --exclude Exclude bp/node/edge in growth count that intersect with paths (1-column list) or path coordinates (3- or 12-column BED-file) provided by the given file [default: ]

If I want to identify which genomic regions contribute to graph growth, is there a way to investigate this via Panacus using CHM13’s BED file with the -s and -e options?

Looking forward to your insights!

Cheers,

Taek

@danydoerr
Copy link
Member

danydoerr commented Oct 18, 2024

Hi Taek,

for this plot all 24 chromosomes where taken into account. As described in the example, only the fraction of the graph were quantified that are not touched by the reference genome GRCh38. In other words, each node through which the reference genome traverses, is ignored. The subset option was used to quantify only HPRC haplotypes (this excludes CHM13).

If you want to know what genomic regions contributed to the plot, take the graph, subtract all GRCh38 paths and extract all remaining nodes that are covered by HG* and NA* paths.

@OZTaekOppa
Copy link
Author

Thank you for your reply.
If that’s the case, I’d like to clarify a few more things.

I followed these steps.
Step 1: Establish the order of samples in the growth statistics
echo 'HG01891 HG02109 HG02145 HG02257 …… HG02080 HG005' | tr ' ' '\n' > ${OUTPUT_DIR}/pggb_hprc.order.samples.txt

Step 2: Establish the exclude of samples in the growth statistics
echo 'chm13 grch38' | tr ' ' '\n' > ${OUTPUT_DIR}/pggb_hprc.exculde.samples.txt

Step 3: Exclude paths from reference genome CHM13 and GRCH38
grep '^P' ${INPUT_GFA} | cut -f2 | grep -iE 'chm13|grch38' > ${OUTPUT_DIR}/${BASEFILE}.pggb.path.txt

Step 4: panacus histgrowth to calculate coverage and pangenome growth for nodes
RUST_LOG=info panacus ordered-histgrowth -c bp -t${PBS_NCPUS} -l 1,2,4,49 -S -e ${OUTPUT_DIR}/${BASEFILE}.pggb.path.txt -O ${OUTPUT_DIR}/pggb_hprc.order.samples.txt ${INPUT_GFA} > ${OUTPUT_DIR}/${BASEFILE}.pggb.ng0.ordhstgrw.bp.tsv

Step 5: Visualize coverage histogram and pangenome growth curve with estimated growth parameters
panacus-visualize -e ${OUTPUT_DIR}/${BASEFILE}.pggb.ng0.ordhstgrw.bp.tsv > ${OUTPUT_DIR}/${BASEFILE}.pggb.ng0.ordhstgrw.bp.pdf

In Step 4, the file *pggb.ng0.ordhstgrw.bp.tsv will be used to generate the graph. Attached is the TSV file from chr1 in my trial.

image

As you can see, even a single chromosome (chr1) already shows a size comparable to the total chromosome size described in the Panacus example. I’ve cross-checked the steps and parameters with the manual in the provided link. Did I miss anything?

Regarding genomic region contributions, I’m not fully clear on your explanation. When you say "take the graph," do you mean the PGGB graph GFA file? And how exactly should I subtract or extract regions from it? Could you please share any relevant steps, scripts, or manuals I could follow?

Looking forward to your guidance.

Kind regards,

Taek

@OZTaekOppa
Copy link
Author

OZTaekOppa commented Oct 20, 2024

Hi again,

I have also tried using the same approach explained in the manual (https://github.com/marschall-lab/panacus/blob/main/examples/ordered_pangenome_growth_statistics.md).

Step 4 (followed manual): panacus histgrowth to calculate coverage and pangenome growth for nodes
RUST_LOG=info panacus ordered-histgrowth -c bp -t${PBS_NCPUS} -l 1,2,4,49 -S -e ${OUTPUT_DIR}/${BASEFILE}.pggb_hprc.exculde.samples.txt -O ${OUTPUT_DIR}/pggb_hprc.order.samples.txt ${INPUT_GFA} > ${OUTPUT_DIR}/${BASEFILE}.pggb.ng0.ordhstgrw.bp.tsv

Eventually, the -e ${OUTPUT_DIR}/${BASEFILE}.pggb_hprc.exculde.samples.txt and -e ${OUTPUT_DIR}/${BASEFILE}.pggb.path.txt generated the same TSV file and PDF from chr1 in my trial. Both approaches generated an abnormally large chromosome size even with a single chromosome (chr1).

I’ve tried both v0.2.4 from Bioconda and the binary release from GitHub. It seems likely that Step 4 didn’t generate the correct values for the TSV file.

I have also tried the -s parameter using a bed file from CHM13v2.
This attempt generated all values “0” in the TSV and PDF.

Step 4 (bed file)
RUST_LOG=info panacus ordered-histgrowth -c bp -t${PBS_NCPUS} -l 1,2,4,49 -s ${CHM13_BED} -S -e ${OUTPUT_DIR}/pggb_hprc.exculde.samples.txt -O ${OUTPUT_DIR}/pggb_hprc.order.samples.txt ${INPUT_GFA} > ${OUTPUT_DIR}/${BASEFILE}.pggb.ng0bed.ordhstgrw.bp.tsv

Looking forward to your insights.

Kind regards,

Take

@heringerp
Copy link
Collaborator

Hi Taek,

the differing base pair counts are explained by the differing graphs. The example was generated using the Minigraph-Cactus graph, while your results are from the PGGB graph.

The Minigraph-Cactus graph does not contain any heterochromatic regions, so it already starts with fewer bases than the PGGB graph. The PGGB graph does include these regions but the sequences in these regions are seriously under-aligned meaning that we will have lots of bases in the graph where there would be only one when using a single sequence. The size in bps in the last line is not the size of the chromosome but instead the size of the graph of the chromosome, which can be in reality quite different from each other. For example, the human chr1 has a length of 248,387,328 bp (CHM13) while the PGGB graph for it contains 1,117,392,094 bps.

With regards of extracting the non-GRCh38 parts of the graph: Unfortunately, there are no tools available (at least to my knowledge) that can do this for you. My approach would be to take all the nodes in the GRCh38 path and removing them from all the other paths manually (e.g. using a bash/python script).

Best regards
Peter

@OZTaekOppa
Copy link
Author

Thank you for your reply all the time!

So, if the difference between Minigraph-Cactus and PGGB is accepted, your explanation confirms that the TSV and graph generated from PGGB are correct. Is that right?

Additionally, regarding the chr1 example: A GFA file from PGGB with 56 samples (112 haplotypes) excluding CHM13 and GRCh38 was used in Panacus. If I want to retrieve or calculate the added sequence length for chr1 from the PGGB graph, which file or information should I refer to?

Kind regards,

Taek

@heringerp
Copy link
Collaborator

Yes, the tsv and the plots for the PGGB graph should be correct.

About the second question: What do you mean with added sequence length? The number of base pairs that the non-GRCh38/CHM13 sequences add to the graph?

Best regards
Peter

@OZTaekOppa
Copy link
Author

That’s correct. In Steps 2 and 3, it should reflect the number of base pairs added to the graph by non-GRCh38/CHM13 sequences.
Cheers,
Taek

@heringerp
Copy link
Collaborator

This should be in the last line of the .tsv file generated in Step 4. Using the -e-Parameter you tell panacus to not count all the bps that intersect with the paths in the file (i.e. in this case GRCh38 and CHM13). Thus, the last line contains the number of bps in the graph without all the bps of GRCh38/CHM13 or in other words all the bps added by non-GRCh38/CHM13 sequences.

Best regards
Peter

@OZTaekOppa
Copy link
Author

OZTaekOppa commented Oct 22, 2024

Thanks.

Based on your explanation and my current PGGB trial, the total accumulated base pairs added by non-GRCh38/CHM13 sequences (with the -l 1,2,4,49 option) would be 1,376,115,016 from HG005:
1086959399 + 224525351 + 63802404 + 827862.

Or is it just 1,086,959,399 from the first -l 1 information?

Cheers,

Taek

@heringerp
Copy link
Collaborator

It is just the first number, e.g. 1,086,959,399.

The other numbers apply to the other coverages/quora respectively. So for example, if you would like to only take into account nodes that appear at least in two paths (because they might be errors, etc.), you would take the second value (in this case) 224,525,351 (since it is for a coverage of 2).

Best regards
Peter

@OZTaekOppa
Copy link
Author

Thank you for your reply and clarification.

Based on the manual and my trials, I can generate the Panacus growth plot for specific assemblies using Steps 1, 2, and 3. However, I’m wondering if there’s a way to extract only specific assemblies from the PGGB-generated GFA file. For example, is it possible to extract just the paths covered by ^HH* and ^NA* to make a subset of a GFA file?

Kind regards,

Taek

@heringerp
Copy link
Collaborator

Yes, it is possible. This could be done manually or by creating a list of all the paths one wants to keep and using trim-graph on it. This will remove all node/edges not covered by the paths in the list. After that you might want to run odgi unchop on the graph to merge all the nodes that can now be merged since they are not separated anymore.

Best regards
Peter

@OZTaekOppa
Copy link
Author

Thank you so much for your help!
I’ll proceed with the programs as you suggested.

I have another question regarding the use of bed files in Panacus.

FYI,
I retrieved the bed file from CHM13 JHU RefSeqv110 + Liftoff v5.2 (https://github.com/marbl/CHM13/tree/master) and converted the GFF3 file into a six-column bed file.

Step 4 (bed file)
RUST_LOG=info panacus ordered-histgrowth -c bp -t${PBS_NCPUS} -l 1,2,4,49 -s ${CHM13_BED} -S -e ${OUTPUT_DIR}/pggb_hprc.exculde.samples.txt -O ${OUTPUT_DIR}/pggb_hprc.order.samples.txt ${INPUT_GFA} > ${OUTPUT_DIR}/${BASEFILE}.pggb.ng0bed.ordhstgrw.bp.tsv

However, the generated TSV file contains only “0” values.
Do I need to provide a separate bed file for each assembly?
Is there a more effective way to use the -s parameter in Panacus?

Looking forward to your insights.

Kind regards,

Taek 

@heringerp
Copy link
Collaborator

Yes, there is definitely an issue there.
I'm not sure about what exactly the problem is with the bed file. Can you maybe send me the first few lines of the bed file? Likely there is a mismatch between the first field of the bed file and the path names in the graph.

Best regards
Peter

@OZTaekOppa
Copy link
Author

This is the bed file I used.

chr1 6046 13941 transcript_id . -
chr1 6046 6420 transcript_id . -
chr1 12077 12982 transcript_id . -
chr1 13444 13579 transcript_id . -
chr1 13679 13941 transcript_id . -
chr1 15079 21429 transcript_id . +
chr1 15079 15564 transcript_id . +
chr1 20565 21429 transcript_id . +
chr1 20528 37628 transcript_id . -
chr1 20528 21087 transcript_id . -

cheers,

Taek

@heringerp
Copy link
Collaborator

Thank you for the file.

Okay, so the problem is that panacus cannot match the "chr1" in the first field of the bed file to any path in the graph. If you replace the "chr1" everywhere in the bed file with "chm13#chr1" (name of the corresponding path in the graph) it should work.

However, this will give you a hist/growth file having only two lines, since the graph is then subsetted to only have the chm13 path. If you wanted to subset all paths correspondingly you would have to liftover the annotations to other paths and include all of them in the bed file.

Best regards
Peter

@OZTaekOppa
Copy link
Author

OZTaekOppa commented Oct 24, 2024

Thank you for your reply.

I have tried based on your explanation. And, I have this error.

  • /tmp/panacus-0.2.4/bin/panacus ordered-histgrowth -c bp -t8 -l 1,2,4,49 -s /chm13-t2t/panacus_chm13v2.0_RefSeq_Liftoff_v5.2.bed -S -e /PGGB/ng0_bed/pggbbed.path.txt -O /PGGB/ng0_bed/pggb_hprc.order.samples.txt /chr1.pggb.out/chr1.fasta.gz.c1c4fa2.3db5bc7.5ef21f9.smooth.final.gfa
    [2024-10-25T00:18:00Z INFO panacus::cli] running panacus on 8 threads
    [2024-10-25T00:18:00Z INFO panacus::graph] constructing indexes for node/edge IDs, node lengths, and P/W lines..
    [2024-10-25T00:18:00Z INFO panacus::io] loading graph from /chr1.pggb.out/chr1.fasta.gz.c1c4fa2.3db5bc7.5ef21f9.smooth.final.gfa
    [2024-10-25T00:18:07Z INFO panacus::graph] found: 2546 paths/walks, 8589688 nodes
    [2024-10-25T00:18:07Z INFO panacus::abacus] loading coordinates from /chm13-t2t/panacus_chm13v2.0_RefSeq_Liftoff_v5.2.bed
    [2024-10-25T00:18:13Z ERROR panacus::abacus] unknown path/group chm13#2#chr1 6046 13941 transcript_id . -
    [2024-10-25T00:18:13Z ERROR panacus::abacus] unknown path/group chm13#2#chr1 6046 6420 transcript_id . -
    [2024-10-25T00:18:13Z ERROR panacus::abacus] unknown path/group chm13#2#chr1 12077 12982 transcript_id . -
    [2024-10-25T00:18:13Z ERROR panacus::abacus] unknown path/group chm13#2#chr1 13444 13579 transcript_id . -
    [2024-10-25T00:18:13Z ERROR panacus::abacus] unknown path/group chm13#2#chr1 13679 13941 transcript_id . -
    [2024-10-25T00:18:13Z ERROR panacus::abacus] unknown path/group chm13#2#chr1 15079 21429 transcript_id . +

For the single merged subset bed file, do you mean something like this?

chm13#chr1 6046 13941 transcript_id . -
chm13#chr1 12077 12982 transcript_id . -
HG01891#chr1 13444 13579 transcript_id . -
HG02109#chr1 15079 21429 transcript_id . +
.
.
.

cheers,

Taek

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants