Currently, cohort mode in the pipeline aggregates Kestrel results from multiple samples and produces an overall summary. However, there is no dedicated mechanism to detect or highlight rare alleles across the cohort. This feature request proposes introducing a rare allele outlier analysis to identify potentially overlooked or low-frequency variants that might be clinically or biologically significant.
Motivation
Identify Low-Frequency Variants:
Rare alleles (e.g., frequency below 10% in the cohort) can indicate missed diagnoses or unusual genotype patterns.
Improve Cohort Insight:
By filtering at the cohort level, we can highlight variants that are easy to overlook in a single-sample report but stand out once their frequency is compared across many samples.
Automated Discovery:
Aggregating Kestrel results from each sample (kestrel_pre_result.tsv) into a single database or DataFrame, then applying a cutoff, reduces manual searching and error.
Proposed Implementation
1. Data Aggregation
Input: Each sample’s Kestrel output, typically named kestrel_pre_result.tsv.
Process:
Gather all kestrel_pre_result.tsv files (one per sample) during cohort mode.
Parse columns such as:
Motifs (e.g. 5-A, 5C-A)
POS (position)
REF/ALT
Sample (or an added identifier)
Others: Estimated_Depth_AlternateVariant, Estimated_Depth_Variant_ActiveRegion, etc.
Merge them into a single aggregated table or DataFrame.
Possibly store columns: [Motifs, POS, REF, ALT, Sample, (other fields)].
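The aggregation step could be sketched roughly as follows. This is only an illustration under stated assumptions: that each kestrel_pre_result.tsv is a plain tab-separated file with one header row, and that the parent directory of each file is named after the sample. The function name aggregate_kestrel_results is hypothetical.

```python
from pathlib import Path

import pandas as pd


def aggregate_kestrel_results(tsv_paths):
    """Concatenate per-sample Kestrel tables into one DataFrame."""
    frames = []
    for path in map(Path, tsv_paths):
        # Assumption: plain TSV with a single header row. Everything is read
        # as text so variant keys compare exactly; cast depth columns later.
        frame = pd.read_csv(path, sep="\t", dtype=str)
        # Assumption: the parent directory is named after the sample.
        if "Sample" not in frame.columns:
            frame["Sample"] = path.parent.name
        frames.append(frame)
    if not frames:
        # No input files: return an empty table with the expected key columns.
        return pd.DataFrame(columns=["Motifs", "POS", "REF", "ALT", "Sample"])
    return pd.concat(frames, ignore_index=True)
```

It could be called with something like aggregate_kestrel_results(Path(cohort_dir).glob("*/kestrel_pre_result.tsv")), where the one-directory-per-sample layout is again an assumption.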
2. Frequency Calculation & Filtering
Compute Frequency:
Group by (Motifs, POS, REF, ALT) to count how many times that variant appears across all samples.
Derive an overall frequency (e.g. count / total_sample_count).
Cutoff Parameter:
Use a user-defined threshold (e.g., rare_allele_cutoff = 0.1) to filter out variants whose frequency is at or above the cutoff (>= 10% in this example).
Only keep those “rare” variants under this threshold.
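A minimal sketch of the frequency calculation and cutoff, assuming the merged table has the columns Motifs, POS, REF and ALT and that each variant appears at most once per sample (otherwise the row count overstates the number of carriers). The name filter_rare_alleles is borrowed from the open question below; its signature here is an assumption.

```python
import pandas as pd

VARIANT_KEY = ["Motifs", "POS", "REF", "ALT"]


def filter_rare_alleles(merged, total_samples, rare_allele_cutoff=0.1):
    """Return variants whose cohort frequency is below the cutoff."""
    if total_samples <= 0:
        raise ValueError("total_samples must be positive")
    # Count rows per variant; assumes one row per variant per sample.
    counts = (
        merged.groupby(VARIANT_KEY, sort=False)
        .size()
        .reset_index(name="count")
    )
    counts["frequency"] = counts["count"] / total_samples
    # Strict '<': variants at exactly the cutoff are filtered out.
    rare = counts[counts["frequency"] < rare_allele_cutoff]
    return rare.reset_index(drop=True)
```

total_samples is passed in explicitly rather than derived from the table, because samples with no Kestrel calls at all would otherwise be missing from the denominator.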
3. Output Table
Rare Variant Table:
Show each (Motifs, POS, REF, ALT) combination that falls under the cutoff, along with:
The count of samples where it appears.
The frequency (count / total).
Possibly the list of sample IDs that exhibit it.
Format:
The table could either be embedded as an HTML section in cohort_summary.html or written to a separate file (e.g. cohort_rare_alleles.tsv).
Example columns: [Motifs, POS, REF, ALT, count, frequency, sample_list, ...].
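One way the output table might be built, shown as a sketch rather than a final design. It assumes a merged DataFrame with a Sample column, counts distinct samples per variant (matching "the count of samples where it appears"), and writes a TSV. Both function names are hypothetical.

```python
import pandas as pd

VARIANT_KEY = ["Motifs", "POS", "REF", "ALT"]


def build_rare_variant_table(merged, total_samples, cutoff=0.1):
    """Summarise rare variants with count, frequency and carrier samples."""
    table = (
        merged.groupby(VARIANT_KEY, sort=False)["Sample"]
        .agg(
            count="nunique",  # number of distinct samples carrying the variant
            sample_list=lambda s: ",".join(sorted(set(s))),
        )
        .reset_index()
    )
    table["frequency"] = table["count"] / total_samples
    table = table[table["frequency"] < cutoff]
    columns = VARIANT_KEY + ["count", "frequency", "sample_list"]
    # Rarest variants first; a stable sort keeps input order among ties.
    return (
        table[columns]
        .sort_values("frequency", kind="stable")
        .reset_index(drop=True)
    )


def write_rare_variant_table(table, path="cohort_rare_alleles.tsv"):
    """Write the table as tab-separated text without the index column."""
    table.to_csv(path, sep="\t", index=False)
```

Joining sample IDs with commas keeps the TSV to one row per variant; a different separator would be needed if sample IDs can themselves contain commas.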
Proposed Steps in cohort_summary.py (or a New Module)
Collect all kestrel_pre_result.tsv paths.
Read each file into a Pandas DataFrame (or similar).
Append a Sample_ID column if not already present (extracted from the directory name or a CLI argument).
Concatenate all DataFrames into one large DataFrame.
Group by [Motifs, POS, REF, ALT] to get counts.
Calculate the frequency = count / total_samples.
Filter by a user-specified rare_allele_cutoff (default 0.1).
Write out the final table of rare variants.
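For the "or similar" case, the steps listed above can also be sketched end to end with only the Python standard library. This is an illustration, not the proposed implementation: the function name rare_alleles_from_files is hypothetical, the sample ID is assumed to be the parent directory name, and each sample is counted at most once per variant.

```python
import csv
from collections import defaultdict
from pathlib import Path


def rare_alleles_from_files(tsv_paths, rare_allele_cutoff=0.1, out_path=None):
    """Collect, count, filter and optionally write rare variants."""
    carriers = defaultdict(set)  # variant key -> IDs of samples carrying it
    samples = set()
    for path in map(Path, tsv_paths):
        sample_id = path.parent.name  # assumption: directory named after sample
        samples.add(sample_id)
        with path.open(newline="") as handle:
            for row in csv.DictReader(handle, delimiter="\t"):
                key = (row["Motifs"], row["POS"], row["REF"], row["ALT"])
                carriers[key].add(sample_id)

    rows = []
    for key, ids in carriers.items():
        frequency = len(ids) / len(samples)
        if frequency < rare_allele_cutoff:
            rows.append((*key, len(ids), frequency, ",".join(sorted(ids))))
    rows.sort(key=lambda r: (r[5], r[:4]))  # rarest first, then by variant key

    if out_path is not None:
        with open(out_path, "w", newline="") as handle:
            writer = csv.writer(handle, delimiter="\t")
            writer.writerow(
                ["Motifs", "POS", "REF", "ALT", "count", "frequency", "sample_list"]
            )
            writer.writerows(rows)
    return rows
```

Here the denominator is the number of input files' samples, so samples whose file contains no variant rows still count toward the total.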
Potential Enhancements
Configurable Cutoff:
Let the user set the cutoff in the CLI or in the config file, e.g. --rare-allele-cutoff=0.05 (default 0.1, i.e. 10%).
Sort the final table by ascending frequency so the rarest variants appear first.
Highlight or color-code rare variants in HTML (if integrated into cohort_summary.html).
Send an email alert or add a log highlight when extremely rare variants appear.
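The configurable cutoff could be exposed along these lines. This is a sketch using argparse from the standard library; the parser name "cohort" and the range check (0 < cutoff <= 1) are assumptions, not existing behaviour of the pipeline.

```python
import argparse


def cutoff_type(text):
    """Parse a frequency cutoff and require 0 < value <= 1."""
    value = float(text)  # a non-numeric value raises ValueError, which argparse reports
    if not 0.0 < value <= 1.0:
        raise argparse.ArgumentTypeError(
            f"cutoff must be in (0, 1], got {text}"
        )
    return value


def build_parser():
    # Hypothetical stand-in for the real cohort subcommand parser.
    parser = argparse.ArgumentParser(prog="cohort")
    parser.add_argument(
        "--rare-allele-cutoff",
        type=cutoff_type,
        default=0.1,
        help="keep variants whose cohort frequency is below this value "
             "(default: 0.1)",
    )
    return parser
```

Validating the value at parse time means a typo such as --rare-allele-cutoff=10 fails immediately instead of silently keeping every variant.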
Open Questions
Where to Integrate:
Extend cohort_summary.py with a function like filter_rare_alleles(...)?
Or create a new module, e.g. rare_allele_analysis.py, and call it from the cohort subcommand?
Handling Edge Cases:
If a variant is present in only 1 of N samples (e.g., 1/50 = 2%) but is reported multiple times within that sample, should it be counted once or once per occurrence? One option is to consider only presence or absence per sample.
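The difference between the two counting rules can be made concrete with a small sketch. It assumes a merged pandas DataFrame with a Sample column; count_variants and its per_sample_presence switch are hypothetical names.

```python
import pandas as pd

VARIANT_KEY = ["Motifs", "POS", "REF", "ALT"]


def count_variants(merged, per_sample_presence=True):
    """Count each variant per carrying sample, or per reported row."""
    grouped = merged.groupby(VARIANT_KEY, sort=False)["Sample"]
    # nunique(): one count per sample that carries the variant.
    # size(): one count per row, so repeated calls in a sample add up.
    counts = grouped.nunique() if per_sample_presence else grouped.size()
    return counts.rename("count").reset_index()
```

For a variant reported three times in a single sample of a 50-sample cohort, presence/absence counting gives 1/50 = 2%, while row counting gives 3/50 = 6%.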
Performance:
For extremely large cohorts, the grouping step must stay efficient (likely via Pandas, or an SQL-based approach if the data is too large for that).
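As an illustration of the SQL-based option, the grouping and cutoff can be expressed in a single query. The sketch below uses sqlite3 from the standard library with an in-memory database; the table name calls, its schema and the function name rare_alleles_sql are assumptions made for the example.

```python
import sqlite3


def rare_alleles_sql(rows, total_samples, cutoff=0.1):
    """rows: iterable of (motifs, pos, ref, alt, sample) tuples."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE calls "
            "(motifs TEXT, pos INTEGER, ref TEXT, alt TEXT, sample TEXT)"
        )
        conn.executemany("INSERT INTO calls VALUES (?, ?, ?, ?, ?)", rows)
        # COUNT(DISTINCT sample) counts each sample once per variant;
        # multiplying by 1.0 forces floating-point division.
        query = """
            SELECT motifs, pos, ref, alt,
                   COUNT(DISTINCT sample) AS n,
                   COUNT(DISTINCT sample) * 1.0 / :total AS frequency
            FROM calls
            GROUP BY motifs, pos, ref, alt
            HAVING COUNT(DISTINCT sample) * 1.0 / :total < :cutoff
            ORDER BY frequency, motifs, pos
        """
        params = {"total": total_samples, "cutoff": cutoff}
        return conn.execute(query, params).fetchall()
    finally:
        conn.close()
```

Pointing sqlite3.connect at a file instead of ":memory:" would let the database do the grouping on disk, which is the main reason to consider this route for cohorts that do not fit in memory.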
User-Friendliness:
Should we produce only a TSV, or also an HTML snippet integrated into the main cohort_summary.html?
Conclusion
By aggregating all Kestrel outputs, calculating variant frequencies, and filtering against a low-frequency cutoff, we can highlight potential outlier alleles that might warrant extra attention. This enhancement extends cohort mode from simple summarization to a detection tool for rare, possibly significant variants.
If approved, next steps are:
Add a CLI parameter (e.g. --rare-allele-cutoff).
Implement reading & merging Kestrel data inside cohort mode.
Generate a table of rare variants under the given frequency threshold.
Integrate into cohort_summary.html or a separate file.