Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

About metagenomic data #8

Closed
jxmelody opened this issue May 28, 2024 · 7 comments
Closed

About metagenomic data #8

jxmelody opened this issue May 28, 2024 · 7 comments

Comments

@jxmelody
Copy link

Hi,

Thank you for implementing this great tool.

I am trying to use this tool to get methylation profile for my metagenomic data, but I got the error:

[2024-05-26T17:07:43.378Z ERROR methbat] Error while verifying settings: hap1_bed does not exist: "ref_pbmm2_6G_len_1000.hap1.bed"

pb-CpG-tool does not generate hap1 adn hap2 file as it is metagenomic data. But I still wonder can we use this tool to ge methylation profile?

Best,
Jin

@holtjma
Copy link
Collaborator

holtjma commented May 28, 2024

Hi Jin,

In its current state, it is expecting to find combined.bed, hap1.bed, and hap2.bed. However, this is clearly a situation where it could generate combined information without any haplotype specific outputs. I'll make a note to see if we can create an option to disable this requirement (and the corresponding phased outputs). It may be a while as I'm currently on FTO.

In the interim, if you want to try to progress with the existing MethBat version, you should be able to just mock the files. I think if you create an empty file with the corresponding filenames (one for hap1, one for hap2), it will recognize it as an empty dataset and continue past the above error.

Matt

@jxmelody
Copy link
Author

jxmelody commented Jun 2, 2024

Hi Matt,

Thank you for your detailed response.
As you suggested, I tried to create these two empty hap.bed files.
Now I get a different error message, which I guess is about the region file I provided?

$ methbat profile --input-prefix ref_pbmm2_6G_len_1000 --input-regions ref_pbmm2_6G_len_1000_methbat_segment.csv --output-region-profile metabat_profile
[2024-06-02T11:55:07.223Z INFO  methbat::cli::profile] Input prefix: "ref_pbmm2_6G_len_1000"
[2024-06-02T11:55:07.223Z INFO  methbat::cli::profile] Region profiling:
[2024-06-02T11:55:07.223Z INFO  methbat::cli::profile]  Input profile regions: "ref_pbmm2_6G_len_1000_methbat_segment.csv"
[2024-06-02T11:55:07.223Z INFO  methbat::cli::profile]  Output profile file: "metabat_profile"
[2024-06-02T11:55:07.223Z INFO  methbat::cli::profile]  Profile selections: ["ALL"]
[2024-06-02T11:55:07.223Z INFO  methbat::cpg_parser] Loading "ref_pbmm2_6G_len_1000.combined.bed"...
[2024-06-02T11:55:07.223Z INFO  methbat::cpg_parser] Model mode auto-detected
[2024-06-02T11:55:08.573Z INFO  methbat::cpg_parser] Loading "ref_pbmm2_6G_len_1000.hap1.bed"...
[2024-06-02T11:55:08.573Z INFO  methbat::cpg_parser] Loading "ref_pbmm2_6G_len_1000.hap2.bed"...
[2024-06-02T11:55:08.573Z INFO  methbat::cpg_parser] Loading "ref_pbmm2_6G_len_1000_methbat_segment.csv"...
[2024-06-02T11:55:08.574Z ERROR methbat] Error while aggregating region statistics: CSV deserialize error: record 1 (line: 2, byte: 30): missing field `chrom`

The region file is generated by

$methbat segment --input-prefix ref_pbmm2_6G_len_1000 --output-segments  ref_pbmm2_6G_len_1000_methbat_segment.csv

This file looks like this:

#chrom	start	end	summary_label
581f4cc2897b459a_1	21	4898301	Unmethylated
581f4cc2897b459a_2	22	102511	Unmethylated
581f4cc2897b459a_3	26	5333	Unmethylated
MG1655	21	1299415	Unmethylated
MG1655	1300836	2955015	Unmethylated
MG1655	2955288	4641590	Unmethylated
ATCC_43504	37	148267	Unmethylated
...

Do you have any suggestions?

Also, I have a slight concern about the generated region result. I found the summary label column to be all Unmethylated. Does this suggest something abnormal in my sequencing data or possible data processing mistakes?

Many thanks.

Jin

@holtjma
Copy link
Collaborator

holtjma commented Jun 2, 2024

Jin,

To be clear, the output of methbat segment is not meant to be sent to methbat profile; those are two separate analysis paths (which is why you're getting an error). MethBat's segmentation algorithm will provide any regions it detects as methylated, unmethylated, or ASM (not applicable here); and then you will need to analyze those regions for your application.

Also, I have a slight concern about the generated region result. I found the summary label column to be all Unmethylated. Does this suggest something abnormal in my sequencing data or possible data processing mistakes?

In human data, we usually see a variety of states, not just unmethylated. I'm not sure what to expect in metagenomics, but I assume at least some would be methylated.

@dportik Do you have any expectations around what we would see if we ran a methylation segmentation algorithm on metagenomic data?

Matt

@holtjma
Copy link
Collaborator

holtjma commented Jun 6, 2024

@jxmelody,

I spoke with Dan and will relay the key parts of that conversation. The short version is we don't have data on the accuracy of methylation in metagenomics, so we don't really have a clear set of expectations to convey here. We have not performed any validations on metagenomic datasets, so you're definitely in an experimental area just with the baseline 5mC calls. pb-CpG-tools (and MethBat as a consequence) were also designed with diploid organisms in mind, so there are likely some assumptions that will not match up correctly to metagenomics context.

Sorry this probably isn't a satisfactory answer, but let me know if you have any follow up questions!

Matt

@jxmelody
Copy link
Author

jxmelody commented Jun 7, 2024

Dear Matt,

Thank you for your response! Your explanation is very helpful.

I would like to further clarify the difference between using metagenomic data and diploid data for methylation analysis. Is this difference related to the process of obtaining the methylation profile, which typically requires negative control data? I understand that in many cases, whole genome amplification (WGA) is used to generate unmethylated DNA as the control data. And due to a real control data is not always avaliable, many tools use in silico control to do prediction.

Considering this, even though I have HiFi data with kinetic information and MM/ML tags indicating the 5-methylcytosine (5mC) location and score (generated by the SMRT Link tool) in the BAM file, would I still need additional control data to accurately determine the methylation locations? Or do I need to redo the analysis(I think inside the SMRTlink, jasmine is used for 5mC methylation detection)

I also came across a tool called ipdSummary, which is part of the KineticTools suite provided by PacBio. This tool supports the detection of 6mA and 4mC modifications. However, according to an issue I found, ipdSummary may not correctly detect 5mC because it uses metagenomic data for training(from this issue).

Based on this information, my understanding is that the pipeline consisting of Jasmine, pb-CpG-tool, and MethBat is primarily designed for human genome analysis, which predominantly involves 5mC methylation. In contrast, metagenomic data often exhibits a higher prevalence of 6mA and 4mC modifications. This difference in the predominant methylation types could potentially explain why I obtained all unmethylated labels when using MethBat segmentation on my metagenomic data.

Please let me know if my understanding is correct or if there are any additional factors I should consider.

Many, many thanks!

Jin

@holtjma
Copy link
Collaborator

holtjma commented Jun 7, 2024

Jin,

I would like to further clarify the difference between using metagenomic data and diploid data for methylation analysis. Is this difference related to the process of obtaining the methylation profile, which typically requires negative control data? I understand that in many cases, whole genome amplification (WGA) is used to generate unmethylated DNA as the control data. And due to a real control data is not always avaliable, many tools use in silico control to do prediction.

My understanding is that this is how the models for methylation were trained for human datasets.

Considering this, even though I have HiFi data with kinetic information and MM/ML tags indicating the 5-methylcytosine (5mC) location and score (generated by the SMRT Link tool) in the BAM file, would I still need additional control data to accurately determine the methylation locations? Or do I need to redo the analysis(I think inside the SMRTlink, jasmine is used for 5mC methylation detection)

My apologies, but this isn't really something I can speak to. Thus far, I have only used our methylation tooling for human datasets. Given that your scope of questions is growing, I would recommend contacting your PacBio representative to get more information on how methylation might work with metagenomic datasets. What I can say for sure is that MethBat is not designed to work with non-diploid organisms.

I also came across a tool called ipdSummary, which is part of the KineticTools suite provided by PacBio. This tool supports the detection of 6mA and 4mC modifications. However, according to an PacificBiosciences/kineticsTools#97 I found, ipdSummary may not correctly detect 5mC because it uses metagenomic data for training(from PacificBiosciences/kineticsTools#87).

I'm unfamiliar with this tooling, so I cannot speak to its performance. You would likely get more information by opening an issue there or through your PacBio representative.

Based on this information, my understanding is that the pipeline consisting of Jasmine, pb-CpG-tool, and MethBat is primarily designed for human genome analysis, which predominantly involves 5mC methylation. In contrast, metagenomic data often exhibits a higher prevalence of 6mA and 4mC modifications. This difference in the predominant methylation types could potentially explain why I obtained all unmethylated labels when using MethBat segmentation on my metagenomic data.

I can't speak to the limits of Jasmine, but pb-CpG-tools and MethBat are both intrinsically tied to 5mC and diploid analysis. Anything outside that scope might work, but we are not building the tooling with it in mind.

Sorry there wasn't much I could really answer there, a lot of your questions are beyond the scope of MethBat's functionality at this time.

Matt

@jxmelody
Copy link
Author

jxmelody commented Jun 9, 2024

Matt,

Thank you so much for your kind support. It really helps :)

Jin

@jxmelody jxmelody closed this as completed Jun 9, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants