Refactor: `kestrel_pre_result.tsv` to Match `kestrel_result.tsv` Columns #72

berntpopp · 2024-12-25T11:54:23Z

Currently, kestrel_pre_result.tsv lacks several important columns that appear in kestrel_result.tsv, such as Depth_Score, Confidence, Motif_fasta, Frame_Score, Motif_left, Motif_right, and POS_fasta. This discrepancy makes the output less informative and more difficult to merge or compare with final Kestrel results.

This issue proposes refactoring how kestrel_pre_result.tsv is generated so that it includes the same columns as kestrel_result.tsv, or at least a superset that covers all relevant fields (e.g., Depth_Score, Confidence, the motif columns). Doing so would streamline downstream analysis such as cohort mode or rare allele filtering.

Motivation

Consistency:
- Having consistent columns in kestrel_pre_result.tsv and kestrel_result.tsv allows for direct comparison and merging without additional transformations or complicated parsing.
Completeness of Information:
- Fields such as Depth_Score and Confidence, which indicate variant quality and precision, are missing from the current “pre” result.
Downstream Integrations:
- Features like rare allele outlier analysis (Feature request: introduce a rare allele outlier analyis in cohort mode #33) or other cohort-level computations rely on uniform data. Including all columns from the final result eases integration.

Proposed Changes

Add Extra Columns to kestrel_pre_result.tsv:
- Depth_Score
- Confidence
- Motif_fasta (if relevant in the pre-processing step)
- Frame_Score
- Motif_left
- Motif_right
- POS_fasta
Maintain Existing Columns:
- Keep everything already present (e.g., Motifs, POS, REF, ALT, Sample, Variant, and the depth columns) to avoid breaking existing logic.
Adjust the Generation Logic:
- Update wherever kestrel_pre_result.tsv is produced (e.g., in the Kestrel postprocessing or intermediate pipeline steps) so that it includes these columns if the data is available at that stage.
Check for Data Availability:
- Some fields (e.g., Frame_Score) might be computed only in later steps. If feasible, we should compute them earlier so they are available when the “pre” result is written.
- If certain columns (like Motif_left, Motif_right) are not yet computed in the “pre” stage, consider either computing them or marking them with placeholders.
Documentation & Testing:
- Update any documentation or comments describing the contents of kestrel_pre_result.tsv.
- Add or adjust tests (unit/integration) to verify these new columns are populated correctly.
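
A minimal sketch of the "add columns, keep existing ones, use placeholders" idea, written against the standard library `csv` module rather than the actual pipeline code. The function names, the `NA` placeholder, and the column order are assumptions for illustration only; the column names themselves come from the list above.

```python
import csv
import io

# Column names taken from the issue text; their order in the real
# kestrel_result.tsv is an assumption.
EXTRA_COLUMNS = [
    "Depth_Score",
    "Confidence",
    "Motif_fasta",
    "Frame_Score",
    "Motif_left",
    "Motif_right",
    "POS_fasta",
]
PLACEHOLDER = "NA"  # hypothetical marker for "not computed at this stage"


def add_missing_columns(rows, fieldnames, extra=EXTRA_COLUMNS, placeholder=PLACEHOLDER):
    """Return (new_fieldnames, new_rows) with every extra column present.

    Existing columns and values are left untouched; only absent columns are
    appended and filled with the placeholder.
    """
    missing = [name for name in extra if name not in fieldnames]
    new_fieldnames = list(fieldnames) + missing
    new_rows = [{**{name: placeholder for name in missing}, **row} for row in rows]
    return new_fieldnames, new_rows


def rewrite_tsv(src, dst):
    """Read a TSV from file object src and write the extended TSV to dst."""
    reader = csv.DictReader(src, delimiter="\t")
    rows = list(reader)  # reading the rows also populates reader.fieldnames
    fieldnames, rows = add_missing_columns(rows, reader.fieldnames or [])
    writer = csv.DictWriter(dst, fieldnames=fieldnames, delimiter="\t", lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
```

Because columns are only appended, code that reads the existing columns by name keeps working. Whether a placeholder or an earlier computation is the better choice for a given column still has to be decided per field.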

Implementation Outline

Locate the code path that generates kestrel_pre_result.tsv.
Pull in logic (or data) that calculates Depth_Score, Confidence, Frame_Score, etc. from the final Kestrel result stage if feasible.
Append new columns to the DataFrame or CSV writer.
Maintain or incorporate existing data transformations so that Motif_left/Motif_right are computed if relevant in the “pre” stage.
Test by comparing the new kestrel_pre_result.tsv columns with those in kestrel_result.tsv to confirm consistency.
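
The final comparison step could be checked with something like the sketch below. It only compares header rows; `read_header` and `compare_headers` are hypothetical helper names, not functions from the code base.

```python
import csv
import io


def read_header(handle):
    """Return the column names from the first line of a TSV file object."""
    return next(csv.reader(handle, delimiter="\t"), [])


def compare_headers(pre_header, final_header):
    """Report columns that appear in only one of the two files."""
    return {
        "missing_in_pre": [c for c in final_header if c not in pre_header],
        "only_in_pre": [c for c in pre_header if c not in final_header],
    }


# Example with in-memory data standing in for the two TSV files.
pre = read_header(io.StringIO("Motifs\tPOS\tREF\tALT\n"))
final = read_header(io.StringIO("Motifs\tPOS\tREF\tALT\tDepth_Score\tConfidence\n"))
diff = compare_headers(pre, final)
```

In a test, asserting that `diff["missing_in_pre"]` is empty would confirm that the pre-result covers every column of the final result, while still allowing a superset.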

Benefits

Ease of Merging: Tools that read kestrel_pre_result.tsv can now seamlessly handle the same columns used in final results, improving workflow consistency.
Enhanced Debugging: Having Depth_Score or Confidence in pre-results helps debug or track variant filtering earlier in the pipeline.
Future Pipeline Development: Additional logic (like the rare allele outlier analysis) can rely on these columns without re-generating them from scratch.

Conclusion

By aligning kestrel_pre_result.tsv with kestrel_result.tsv, we ensure a consistent, flexible pipeline output that’s easier to integrate, debug, and extend for advanced analyses.

berntpopp · 2024-12-25T12:03:35Z

Depends on #72

berntpopp added the enhancement and refactor labels Dec 25, 2024

berntpopp self-assigned this Dec 25, 2024

berntpopp mentioned this issue Dec 25, 2024

Feature request: introduce a rare allele outlier analyis in cohort mode #33

Open

berntpopp mentioned this issue Jan 13, 2025

Fix missing columns in kestrel_result.tsv and implement detailed screening summary #78

Merged

hassansaei closed this as completed in #78 Jan 13, 2025
