Feature request: introduce a rare allele outlier analysis in cohort mode #33

berntpopp · 2024-11-23T09:33:15Z

Currently, cohort mode in the pipeline aggregates Kestrel results from multiple samples and produces an overall summary. However, there is no dedicated mechanism to detect or highlight rare alleles across the cohort. This feature request proposes introducing a rare allele outlier analysis to identify potentially overlooked or low-frequency variants that might be clinically or biologically significant.

Motivation

Identify Low-Frequency Variants:
Rare alleles (e.g., <10% in a population) can be strong indicators of possible missed diagnoses or unique genotype patterns.
Improve Cohort Insight:
By filtering at the cohort level, we can highlight variants that might be too low in frequency to stand out in single-sample reports, yet become important across many samples.
Automated Discovery:
Aggregating Kestrel results from each sample (kestrel_pre_result.tsv) into a single database or DataFrame, then applying a cutoff, reduces manual searching and error.

Proposed Implementation

1. Data Aggregation

Input: Each sample’s Kestrel output, typically named kestrel_pre_result.tsv.
Process:
1. Gather all kestrel_pre_result.tsv files (one per sample) during cohort mode.
2. Parse columns such as:
  - Motifs (e.g. 5-A, 5C-A)
  - POS (position)
  - REF/ALT
  - Sample (or an added identifier)
  - Others: Estimated_Depth_AlternateVariant, Estimated_Depth_Variant_ActiveRegion, etc.
3. Merge them into a single aggregated table or DataFrame.
  - Possibly store columns: [Motifs, POS, REF, ALT, Sample, (other fields)].
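The aggregation step could look roughly like the sketch below. It uses only the standard library to stay dependency-free; the function name `load_cohort_rows` is hypothetical, and the code assumes each `kestrel_pre_result.tsv` is tab-separated with a header row and sits in a directory named after its sample. Skipping `##` lines is a defensive guess about the file format, not a confirmed property of it.

```python
import csv
from pathlib import Path


def load_cohort_rows(tsv_paths):
    """Read every per-sample TSV and return one flat list of row dicts.

    Assumption: the sample ID is the name of the directory that
    contains the file (e.g. <cohort_dir>/<sample_id>/kestrel_pre_result.tsv).
    """
    rows = []
    for path in map(Path, tsv_paths):
        sample_id = path.parent.name
        with path.open(newline="") as handle:
            # Drop VCF-style "##" comment lines in case the file has any.
            data_lines = (line for line in handle if not line.startswith("##"))
            for row in csv.DictReader(data_lines, delimiter="\t"):
                # Keep an existing Sample value; otherwise add the ID.
                if not row.get("Sample"):
                    row["Sample"] = sample_id
                rows.append(row)
    return rows
```

With pandas, the same step would amount to reading each file into a DataFrame, adding the sample column, and concatenating the frames.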

2. Frequency Calculation & Filtering

Compute Frequency:
- Group by (Motifs, POS, REF, ALT) to count how many times that variant appears across all samples.
- Derive an overall frequency (e.g. count / total_sample_count).
Cutoff Parameter:
- Use a user-defined threshold (e.g., rare_allele_cutoff = 0.1) to filter out variants whose frequency is at or above it (>= 10% in this example).
- Only keep those “rare” variants under this threshold.
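A minimal sketch of the frequency calculation and filter, assuming the aggregated rows are dicts with the keys `Motifs`, `POS`, `REF`, `ALT`, and `Sample`. The function name `find_rare_variants` is hypothetical. This version counts each variant once per sample (presence/absence), which is one possible convention rather than a settled decision.

```python
from collections import defaultdict

KEY_COLUMNS = ("Motifs", "POS", "REF", "ALT")


def find_rare_variants(rows, total_samples, cutoff=0.1):
    """Return variants whose cohort frequency is strictly below `cutoff`."""
    # Map each variant key to the set of samples that carry it, so
    # repeated rows within one sample do not inflate the count.
    samples_by_variant = defaultdict(set)
    for row in rows:
        key = tuple(row[col] for col in KEY_COLUMNS)
        samples_by_variant[key].add(row["Sample"])

    rare = []
    for key, samples in samples_by_variant.items():
        frequency = len(samples) / total_samples
        if frequency < cutoff:
            record = dict(zip(KEY_COLUMNS, key))
            record.update(
                count=len(samples),
                frequency=frequency,
                sample_list=sorted(samples),
            )
            rare.append(record)

    # Rarest variants first.
    rare.sort(key=lambda record: record["frequency"])
    return rare
```

Note that `total_samples` is passed in explicitly: samples with no variant rows at all would otherwise be missing from the denominator.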

3. Output Table

Rare Variant Table:
- Show each (Motifs, POS, REF, ALT) combination that falls under the cutoff, along with:
  - The count of samples where it appears.
  - The frequency (count / total).
  - Possibly the list of sample IDs that exhibit it.
Format:
- Could be a single HTML or TSV output added to the cohort_summary.html or produced as a separate file (e.g. cohort_rare_alleles.tsv).
- Example columns: [Motifs, POS, REF, ALT, count, frequency, sample_list, ...].
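Writing the TSV variant of this output might look like the following sketch. The function name `write_rare_allele_table` is hypothetical, and the code assumes each record is a dict with the example columns above, where `sample_list` is a list of sample IDs and `frequency` is a float.

```python
import csv

COLUMNS = ["Motifs", "POS", "REF", "ALT", "count", "frequency", "sample_list"]


def write_rare_allele_table(records, out_path):
    """Write rare-variant records to a tab-separated file."""
    with open(out_path, "w", newline="") as handle:
        writer = csv.DictWriter(
            handle, fieldnames=COLUMNS, delimiter="\t", extrasaction="ignore"
        )
        writer.writeheader()
        for record in records:
            row = dict(record)
            # Flatten the sample list into one comma-separated cell.
            row["sample_list"] = ",".join(record["sample_list"])
            # Fixed precision keeps the column easy to scan and sort.
            row["frequency"] = f"{record['frequency']:.4f}"
            writer.writerow(row)
```

Any extra fields on a record are silently dropped here (`extrasaction="ignore"`); extending `COLUMNS` would carry them through.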

Proposed Steps in `cohort_summary.py` (or a New Module)

Collect all kestrel_pre_result.tsv paths.
Read each file into a Pandas DataFrame (or similar).
Append a Sample_ID column if not already present (extracted from the directory name or a CLI argument).
Concatenate all DataFrames into one large DataFrame.
Group by [Motifs, POS, REF, ALT] to get counts.
Calculate the frequency = count / total_samples.
Filter by a user-specified rare_allele_cutoff (default 0.1).
Write out the final table of rare variants.

Potential Enhancements

Configurable Cutoff:
Let the user set the cutoff in the CLI (e.g. --rare-allele-cutoff=0.05; default 0.1, i.e. 10%) or in the config file.
Sort the final table by ascending frequency so the rarest variants appear first.
Highlight or color-code rare variants in HTML (if integrated into cohort_summary.html).
Send an email alert or add a log highlight if certain extremely rare variants appear.
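For the configurable cutoff, a sketch with `argparse` could validate the value at parse time. The `--rare-allele-cutoff` flag and its 0.1 default come from this proposal; the helper names, the `prog` value, and the accepted range of (0, 1] are assumptions made for illustration.

```python
import argparse


def cutoff_fraction(text):
    """Parse the cutoff and require a fraction in the range (0, 1]."""
    value = float(text)  # a ValueError here is reported by argparse
    if not 0 < value <= 1:
        raise argparse.ArgumentTypeError("cutoff must be in (0, 1]")
    return value


def build_parser():
    # "cohort" is a placeholder name for the existing cohort subcommand.
    parser = argparse.ArgumentParser(prog="cohort")
    parser.add_argument(
        "--rare-allele-cutoff",
        type=cutoff_fraction,
        default=0.1,
        help="keep only variants with cohort frequency below this fraction",
    )
    return parser
```

Rejecting out-of-range values up front avoids a silent empty table when someone passes `10` meaning 10%.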

Open Questions

Where to Integrate:
- Extend cohort_summary.py with a function like filter_rare_alleles(...)?
- Or create a new module, e.g. rare_allele_analysis.py, and call it from cohort subcommand?
Handling Edge Cases:
- A variant may be present in only 1 sample out of N (e.g., 1/50 = 2%) but be reported multiple times within that sample. Should each occurrence be counted, or should the sample count only once? One option is to consider only presence/absence per sample.
Performance:
- For extremely large cohorts, we must ensure efficient grouping (likely via Pandas or an SQL-based approach if data is large).
User-Friendliness:
- Should we produce only a TSV, or also an HTML snippet integrated into the main cohort_summary.html?
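To make the edge-case question concrete, the toy example below contrasts the two counting conventions for a variant reported three times in a single sample of a 50-sample cohort. The data are invented purely for illustration.

```python
# Hypothetical toy cohort: 50 samples, one variant reported three times in S1.
calls = [("5-A", 10, "G", "A", "S1")] * 3
total_samples = 50

# Convention 1: every reported row counts.
row_frequency = len(calls) / total_samples

# Convention 2: each carrying sample counts once (presence/absence).
sample_frequency = len({call[-1] for call in calls}) / total_samples

print(row_frequency, sample_frequency)  # 0.06 0.02
```

Under row counting the variant looks three times as common as it is among samples, which could push a borderline variant over the cutoff.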

Conclusion

By aggregating all Kestrel outputs, calculating variant frequencies, and filtering by a low-frequency cutoff, we can highlight potential outlier alleles that might warrant extra attention. This enhancement turns cohort mode from a simple summarization into a detection tool for rare, possibly significant variants.

If approved, next steps are:

Add a CLI parameter (e.g. --rare-allele-cutoff).
Implement reading & merging Kestrel data inside cohort mode.
Generate a table of rare variants under the given frequency threshold.
Integrate into cohort_summary.html or a separate file.

berntpopp · 2024-12-25T12:04:15Z

Depends on #72
Depends on #73

berntpopp added the enhancement New feature or request label Nov 23, 2024

berntpopp self-assigned this Nov 23, 2024

berntpopp pinned this issue Dec 24, 2024

berntpopp mentioned this issue Dec 25, 2024

Refactor: kestrel_pre_result.tsv to Match kestrel_result.tsv Columns #72

Closed
