Feature request: introduce a rare allele outlier analysis in cohort mode #33

Open
berntpopp opened this issue Nov 23, 2024 · 1 comment
Labels
enhancement New feature or request

Comments

@berntpopp
Collaborator

berntpopp commented Nov 23, 2024

Currently, cohort mode in the pipeline aggregates Kestrel results from multiple samples and produces an overall summary. However, there is no dedicated mechanism to detect or highlight rare alleles across the cohort. This feature request proposes introducing a rare allele outlier analysis to identify potentially overlooked or low-frequency variants that might be clinically or biologically significant.


Motivation

  1. Identify Low-Frequency Variants:
    Rare alleles (e.g., those with a frequency below 10% in a population) can be strong indicators of possible missed diagnoses or unique genotype patterns.

  2. Improve Cohort Insight:
    By comparing variants at the cohort level, we can highlight those that do not stand out in single-sample reports but become notable when viewed across many samples.

  3. Automated Discovery:
    Aggregating Kestrel results from each sample (kestrel_pre_result.tsv) into a single database or DataFrame, then applying a cutoff, reduces manual searching and error.


Proposed Implementation

1. Data Aggregation

  • Input: Each sample’s Kestrel output, typically named kestrel_pre_result.tsv.
  • Process:
    1. Gather all kestrel_pre_result.tsv files (one per sample) during cohort mode.
    2. Parse columns such as:
      • Motifs (e.g. 5-A, 5C-A)
      • POS (position)
      • REF/ALT
      • Sample (or an added identifier)
      • Others: Estimated_Depth_AlternateVariant, Estimated_Depth_Variant_ActiveRegion, etc.
    3. Merge them into a single aggregated table or DataFrame.
      • Possibly store columns: [Motifs, POS, REF, ALT, Sample, (other fields)].
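
The aggregation step could look roughly like the sketch below. It uses only the Python standard library (`csv`), although the same steps map onto a Pandas DataFrame. The function name `aggregate_kestrel_results` and the `sample_files` mapping are hypothetical, and the sketch assumes each kestrel_pre_result.tsv is a tab-separated file whose first line is the column header.

```python
import csv


def aggregate_kestrel_results(sample_files):
    """Merge per-sample Kestrel TSVs into one list of row dicts.

    sample_files: hypothetical mapping of sample ID -> path to that
    sample's kestrel_pre_result.tsv (assumed tab-separated with a header).
    """
    rows = []
    for sample_id, path in sample_files.items():
        with open(path, newline="") as handle:
            for row in csv.DictReader(handle, delimiter="\t"):
                # Add the sample identifier only if the file lacks one.
                row.setdefault("Sample", sample_id)
                rows.append(row)
    return rows
```

Every column present in the input files is carried through unchanged, so fields such as Estimated_Depth_AlternateVariant remain available for later steps.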

2. Frequency Calculation & Filtering

  • Compute Frequency:

    • Group by (Motifs, POS, REF, ALT) to count how many times that variant appears across all samples.
    • Derive an overall frequency (e.g. count / total_sample_count).
  • Cutoff Parameter:

    • Use a user-defined threshold (e.g., rare_allele_cutoff = 0.1) to filter out variants with frequency >= 10%.
    • Only keep those “rare” variants under this threshold.
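
A minimal sketch of the frequency calculation and cutoff, assuming the aggregated data is a list of dicts with the keys Motifs, POS, REF and ALT. The function name `rare_allele_frequencies` is hypothetical. Note that this version counts every row, so a variant reported several times within one sample raises its count.

```python
from collections import Counter


def rare_allele_frequencies(rows, total_samples, rare_allele_cutoff=0.1):
    """Return {variant_key: (count, frequency)} for variants below the cutoff."""
    if total_samples <= 0:
        raise ValueError("total_samples must be positive")
    # Group by (Motifs, POS, REF, ALT) and count occurrences across all rows.
    counts = Counter(
        (row["Motifs"], row["POS"], row["REF"], row["ALT"]) for row in rows
    )
    rare = {}
    for key, count in counts.items():
        frequency = count / total_samples
        # Variants with frequency >= cutoff are filtered out.
        if frequency < rare_allele_cutoff:
            rare[key] = (count, frequency)
    return rare
```

With the default cutoff of 0.1 and 20 samples, a variant seen once (5%) is kept, while a variant seen twice (exactly 10%) is dropped because the comparison is strict.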

3. Output Table

  • Rare Variant Table:

    • Show each (Motifs, POS, REF, ALT) combination that falls under the cutoff, along with:
      • The count of samples where it appears.
      • The frequency (count / total).
      • Possibly the list of sample IDs that exhibit it.
  • Format:

    • Could be a single HTML or TSV output added to the cohort_summary.html or produced as a separate file (e.g. cohort_rare_alleles.tsv).
    • Example columns: [Motifs, POS, REF, ALT, count, frequency, sample_list, ...].
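
Writing the TSV variant of this table might look like the following sketch. The function name `write_rare_allele_table` is hypothetical, and the input is assumed to be a list of dicts keyed by the example columns above, with `sample_list` holding an iterable of sample IDs. Sorting by ascending frequency is an optional choice made here, not a requirement.

```python
import csv

COLUMNS = ["Motifs", "POS", "REF", "ALT", "count", "frequency", "sample_list"]


def write_rare_allele_table(records, handle):
    """Write rare-variant records as a tab-separated table to an open file handle."""
    writer = csv.DictWriter(
        handle,
        fieldnames=COLUMNS,
        delimiter="\t",
        lineterminator="\n",
        extrasaction="ignore",  # silently drop keys that are not table columns
    )
    writer.writeheader()
    # Rarest variants first; ties are broken by motif and position.
    ordered = sorted(records, key=lambda r: (r["frequency"], r["Motifs"], r["POS"]))
    for record in ordered:
        row = dict(record)
        row["frequency"] = f"{record['frequency']:.4f}"
        row["sample_list"] = ",".join(sorted(record["sample_list"]))
        writer.writerow(row)
```

Because the function takes an open handle, the same code can write to cohort_rare_alleles.tsv or to an in-memory buffer that is later embedded in an HTML report.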

Proposed Steps in cohort_summary.py (or a New Module)

  1. Collect all kestrel_pre_result.tsv paths.
  2. Read each file into a Pandas DataFrame (or similar).
  3. Append a Sample_ID column if not already present (extracted from the directory name or a CLI argument).
  4. Concatenate all DataFrames into one large DataFrame.
  5. Group by [Motifs, POS, REF, ALT] to get counts.
  6. Calculate the frequency = count / total_samples.
  7. Filter by a user-specified rare_allele_cutoff (default 0.1).
  8. Write out the final table of rare variants.
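
Steps 1 and 3 could be sketched as below. The helper name `collect_sample_files` is hypothetical, and the sketch assumes that the directory directly containing each kestrel_pre_result.tsv is named after the sample. If the real output layout nests the file deeper, the ID extraction would need adjusting.

```python
from pathlib import Path


def collect_sample_files(cohort_dir, filename="kestrel_pre_result.tsv"):
    """Map sample ID -> path for every Kestrel result file under cohort_dir.

    Assumption: the parent directory name of each file is the sample ID.
    """
    sample_files = {}
    for path in sorted(Path(cohort_dir).rglob(filename)):
        sample_id = path.parent.name
        if sample_id in sample_files:
            # Two files resolving to the same ID would silently skew counts.
            raise ValueError(f"duplicate sample ID {sample_id!r}: {path}")
        sample_files[sample_id] = path
    return sample_files
```

The number of entries in the returned mapping can serve as total_samples in step 6.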

Potential Enhancements

  • Configurable Cutoff:
    Let the user set the cutoff in the CLI or in the config file, e.g. --rare-allele-cutoff=0.05, with a default of 0.1 (10%).
  • Sorting the final table by ascending frequency (so the rarest variants appear first).
  • Highlighting or color-coding in HTML (if integrated into cohort_summary.html).
  • Email Alert or log highlight if certain extremely rare variants appear.
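
A configurable cutoff could be exposed through `argparse` roughly as follows. The parser, its `prog` name, and the validation range are assumptions for illustration only; the pipeline's actual CLI structure is not shown in this issue.

```python
import argparse


def cutoff_type(value):
    """Parse a frequency cutoff and require it to lie in (0, 1]."""
    cutoff = float(value)
    if not 0.0 < cutoff <= 1.0:
        raise argparse.ArgumentTypeError(
            f"cutoff must be greater than 0 and at most 1, got {value}"
        )
    return cutoff


def build_parser():
    # Hypothetical stand-alone parser; a real integration would attach the
    # option to the existing cohort subcommand instead.
    parser = argparse.ArgumentParser(prog="cohort-sketch")
    parser.add_argument(
        "--rare-allele-cutoff",
        type=cutoff_type,
        default=0.1,
        help="keep only variants with cohort frequency below this value",
    )
    return parser
```

Validating the value at parse time means a typo such as --rare-allele-cutoff=10 (intended as 10%) fails immediately instead of silently keeping every variant.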

Open Questions

  1. Where to Integrate:

    • Extend cohort_summary.py with a function like filter_rare_alleles(...)?
    • Or create a new module, e.g. rare_allele_analysis.py, and call it from the cohort subcommand?
  2. Handling Edge Cases:

    • If a variant is present in only 1 sample out of N (e.g., 1/50 = 2%) but is reported multiple times within that sample, should it be counted once or once per occurrence? One option is to consider only presence/absence per sample.
  3. Performance:

    • For extremely large cohorts, we must ensure the grouping stays efficient (likely via Pandas, or an SQL-based approach if the data no longer fits comfortably in memory).
  4. User-Friendliness:

    • Should we produce only a TSV, or also an HTML snippet integrated into the main cohort_summary.html?
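
To make questions 2 and 3 concrete, here is a sketch of one possible answer to both: an SQL-based grouping that counts presence/absence per sample via COUNT(DISTINCT ...). It uses the standard-library `sqlite3` module with an in-memory database; the table name `calls` and the function name `rare_alleles_sql` are hypothetical, and this is only one option among those the questions leave open.

```python
import sqlite3


def rare_alleles_sql(rows, total_samples, rare_allele_cutoff=0.1):
    """Return (Motifs, POS, REF, ALT, carriers) tuples below the cutoff.

    rows: iterable of (Motifs, POS, REF, ALT, Sample) tuples.
    A variant repeated within one sample counts as a single carrier.
    """
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE calls "
        "(Motifs TEXT, POS INTEGER, REF TEXT, ALT TEXT, Sample TEXT)"
    )
    con.executemany("INSERT INTO calls VALUES (?, ?, ?, ?, ?)", rows)
    query = """
        SELECT Motifs, POS, REF, ALT, COUNT(DISTINCT Sample) AS carriers
        FROM calls
        GROUP BY Motifs, POS, REF, ALT
        HAVING COUNT(DISTINCT Sample) * 1.0 / :total < :cutoff
        ORDER BY carriers, Motifs, POS
    """
    params = {"total": total_samples, "cutoff": rare_allele_cutoff}
    result = con.execute(query, params).fetchall()
    con.close()
    return result
```

In the 1/50 example above, a variant listed twice for the same sample still yields one carrier and a frequency of 2%. Pointing `sqlite3.connect` at a file instead of ":memory:" would keep large cohorts out of RAM.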

Conclusion

By aggregating all Kestrel outputs, calculating variant frequencies, and filtering by a low-frequency cutoff, we can highlight potential outlier alleles that might warrant extra attention. This enhancement would extend cohort mode from a simple summary into a detection tool for rare, possibly significant variants.

If approved, next steps are:

  1. Add a CLI parameter (e.g. --rare-allele-cutoff).
  2. Implement reading & merging Kestrel data inside cohort mode.
  3. Generate a table of rare variants under the given frequency threshold.
  4. Integrate into cohort_summary.html or a separate file.
@berntpopp
Collaborator Author

Depends on #72
Depends on #73
