Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Missed call due to high coverage #121

Open
mbhall88 opened this issue Mar 22, 2021 · 2 comments
Open

Missed call due to high coverage #121

mbhall88 opened this issue Mar 22, 2021 · 2 comments

Comments

@mbhall88
Copy link
Member

I have an example where we miss a very clear resistant call due to LOW_GT but the depth on alt vs ref is very obvious. I appreciate it is hard when we are drawing lines based on percentiles. Anyway, I thought this might be a nice example to do some testing/debugging with.

This is a sample from Sylvianne in Madagascar. It was one of 5 samples in a multiplexed nanopore run. It has a HUGE amount of reads (fastq file is 3.3Gb - compressed!) They are testing out MGIT in the lab and were using mykrobe to compare results. The MGIT and phenotyping for this sample show resistance to Isoniazid, Rifampicin, and Ethambutol.

When I run mykrobe ok the full set of reads, it returns resistant to INH only. When I randomly subsample the file to 150x depth mykrobe says resistant to INH and RIF. When I filter out anything (from the subsampled file) that doesn't map to H37Rv I also get resistant to RIF and INH.

The variant in question is rpoB_S450L-TCG761154TTG. The information for this variant from the full and filtered runs are:

Full read set

            "rpoB_S450L-TCG761154TTG": {
                "variant": "ref-S450X?var_name=TCG761154TTG&num_alts=8&ref=NC_000962.3&enum=0&gene=rpoB&mut=S450X",
                "genotype": [
                    1,
                    1
                ],
                "genotype_likelihoods": [
                    -22153.337254734015,
                    -44.788636474841695
                ],
                "info": {
                    "coverage": {
                        "reference": {
                            "percent_coverage": 75.0,
                            "median_depth": 1,
                            "min_non_zero_depth": 1,
                            "kmer_count": 20,
                            "klen": 21
                        },
                        "alternate": {
                            "percent_coverage": 100.0,
                            "median_depth": 166,
                            "min_non_zero_depth": 132,
                            "kmer_count": 3109,
                            "klen": 20
                        }
                    },
                    "expected_depths": [
                        160
                    ],
                    "contamination_depths": [],
                    "filter": [
                        "LOW_GT_CONF"
                    ],
                    "conf": 22109
                },
                "_cls": "Call.VariantCall"

Filtered read set

            "Rifampicin": {
                "predict": "R",
                "called_by": {
                    "rpoB_S450L-TCG761154TTG": {
                        "variant": null,
                        "genotype": [
                            1,
                            1
                        ],
                        "genotype_likelihoods": [
                            -7098.776446933491,
                            -9.626747418092155
                        ],
                        "info": {
                            "coverage": {
                                "reference": {
                                    "percent_coverage": 0.0,
                                    "median_depth": 0,
                                    "min_non_zero_depth": 0,
                                    "kmer_count": 0,
                                    "klen": 21
                                },
                                "alternate": {
                                    "percent_coverage": 100.0,
                                    "median_depth": 42,
                                    "min_non_zero_depth": 37,
                                    "kmer_count": 812,
                                    "klen": 20
                                }
                            },
                            "expected_depths": [
                                41.0
                            ],
                            "contamination_depths": [],
                            "filter": [],
                            "conf": 7089
                        },
                        "_cls": "Call.VariantCall"
                    }
                }
            },

All of the data from this run is stored on noah. I'll send through the paths on Slack.

@mbhall88
Copy link
Member Author

Also, it looks like we've also missed a clear ETH call

Full read set

            "embB_M306V-ATG4247429GTG": {
                "variant": "ref-M306V?var_name=ATG4247429GTG&num_alts=1&ref=NC_000962.3&enum=0&gene=embB&mut=M306V",
                "genotype": [
                    1,
                    1
                ],
                "genotype_likelihoods": [
                    -14407.132009802384,
                    -426.5460799265036
                ],
                "info": {
                    "coverage": {
                        "reference": {
                            "percent_coverage": 100.0,
                            "median_depth": 4,
                            "min_non_zero_depth": 3,
                            "kmer_count": 82,
                            "klen": 21
                        },
                        "alternate": {
                            "percent_coverage": 100.0,
                            "median_depth": 105,
                            "min_non_zero_depth": 82,
                            "kmer_count": 2203,
                            "klen": 21
                        }
                    },
                    "expected_depths": [
                        160
                    ],
                    "contamination_depths": [],
                    "filter": [
                        "LOW_GT_CONF"
                    ],
                    "conf": 13981
                },
                "_cls": "Call.VariantCall"
            },

Filtered read set

            "embB_M306V-ATG4247429GTG": {
                "variant": "ref-M306V?var_name=ATG4247429GTG&num_alts=1&ref=NC_000962.3&enum=0&gene=embB&mut=M306V",
                "genotype": [
                    1,
                    1
                ],
                "genotype_likelihoods": [
                    -5175.289309876459,
                    -89.3627577744604
                ],
                "info": {
                    "coverage": {
                        "reference": {
                            "percent_coverage": 0.0,
                            "median_depth": 0,
                            "min_non_zero_depth": 0,
                            "kmer_count": 0,
                            "klen": 21
                        },
                        "alternate": {
                            "percent_coverage": 100.0,
                            "median_depth": 25,
                            "min_non_zero_depth": 19,
                            "kmer_count": 515,
                            "klen": 21
                        }
                    },
                    "expected_depths": [
                        41.0
                    ],
                    "contamination_depths": [],
                    "filter": [
                        "LOW_GT_CONF"
                    ],
                    "conf": 5086
                },
                "_cls": "Call.VariantCall"
            },

Maybe we need to have - in addition to the GT percentile stuff - some kind of overriding "hard-coded" rules that say something like if Ref depth is less than 2 and total depth is > 15 then ignore filters? (numbers are for illustration here)

@iqbal-lab
Copy link
Collaborator

There seem to be two issues here.

  1. Better results (sensitivity) when subsampling v high coverage sample
  2. Better results (sensitivity) when removing NON Mtb reads

1 I can believe and might be fixable by having a GCP threshold which drops when coverage is high. There is another issue related to that, can't find on my phone right now.
2 is surprising

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants