Refactor: kestrel_pre_result.tsv to Match kestrel_result.tsv Columns #72

Closed
berntpopp opened this issue Dec 25, 2024 · 1 comment · Fixed by #78

Comments

@berntpopp
Collaborator

Currently, kestrel_pre_result.tsv lacks several important columns that appear in kestrel_result.tsv, such as Depth_Score, Confidence, Motif_fasta, Frame_Score, Motif_left, Motif_right, and POS_fasta. This discrepancy makes the output less informative and more difficult to merge or compare with final Kestrel results.

This issue proposes refactoring how kestrel_pre_result.tsv is generated so that it includes the same columns as kestrel_result.tsv, or at least a superset that covers all relevant fields (e.g., Depth_Score, Confidence, Motifs). Doing so will streamline downstream analysis (such as cohort mode or rare allele filtering).


Motivation

  1. Consistency:

    • Having consistent columns in kestrel_pre_result.tsv and kestrel_result.tsv allows for direct comparison and merging without additional transformations or complicated parsing.
  2. Completeness of Information:

    • Important fields like Depth_Score and Confidence, which offer critical insight into variant quality and precision, are missing from the current “pre” result.
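  3. Downstream Integrations:

    • Downstream analysis (such as cohort mode or rare allele filtering) can read the pre-result directly once it carries the same columns as the final result.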


Proposed Changes

  1. Add Extra Columns to kestrel_pre_result.tsv:

    • Depth_Score
    • Confidence
    • Motif_fasta (if relevant in the pre-processing step)
    • Frame_Score
    • Motif_left
    • Motif_right
    • POS_fasta
  2. Maintain Existing Columns:

    • Keep everything already present (e.g., Motifs, POS, REF, ALT, Sample, Variant, and the depth columns) to avoid breaking existing logic.
  3. Adjust the Generation Logic:

    • Update wherever kestrel_pre_result.tsv is produced (e.g., in the Kestrel postprocessing or intermediate pipeline steps) so that it includes these columns if the data is available at that stage.
  4. Check for Data Availability:

    • Some fields (e.g., Frame_Score) might be computed only in later steps. If feasible, we should compute them earlier or carry them forward.
    • If certain columns (like Motif_left, Motif_right) are not yet computed in the “pre” stage, consider either computing them or marking them with placeholders.
  5. Documentation & Testing:

    • Update any documentation or comments describing the contents of kestrel_pre_result.tsv.
    • Add or adjust tests (unit/integration) to verify these new columns are populated correctly.
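
The column additions and the placeholder idea above can be sketched as follows. This is a minimal illustration using only the standard library, not the project's actual code: the helper names `add_missing_columns` and `write_tsv`, the `"NA"` placeholder, and the sample row values are all assumptions made up for the example. Only the column names come from this issue.

```python
import csv
import io

# Column names are taken from this issue; their order is an assumption.
EXTRA_COLUMNS = [
    "Depth_Score", "Confidence", "Motif_fasta", "Frame_Score",
    "Motif_left", "Motif_right", "POS_fasta",
]
PLACEHOLDER = "NA"  # hypothetical marker for values not yet computed


def add_missing_columns(rows, fieldnames, extra=EXTRA_COLUMNS, placeholder=PLACEHOLDER):
    """Return (new_fieldnames, new_rows) with every extra column present.

    Existing columns keep their position and values; missing ones are
    appended and filled with the placeholder.
    """
    new_fields = list(fieldnames) + [c for c in extra if c not in fieldnames]
    new_rows = [{c: row.get(c, placeholder) for c in new_fields} for row in rows]
    return new_fields, new_rows


def write_tsv(rows, fieldnames):
    """Serialize rows to a tab-separated string (stand-in for writing the file)."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fieldnames, delimiter="\t", lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


# Toy pre-result row; the values are invented for illustration only.
pre_fields = ["Motifs", "POS", "REF", "ALT", "Sample", "Confidence"]
pre_rows = [dict(zip(pre_fields, ["m1", "10", "A", "AG", "s1", "c1"]))]

fields, rows = add_missing_columns(pre_rows, pre_fields)
tsv_text = write_tsv(rows, fields)
```

Appending rather than reordering keeps existing consumers of the file working, which matches the "Maintain Existing Columns" point; whether a placeholder or an early computation is the better choice for each field is left open here.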

Implementation Outline

  1. Locate the code path that generates kestrel_pre_result.tsv.
  2. Pull in logic (or data) that calculates Depth_Score, Confidence, Frame_Score, etc. from the final Kestrel result stage if feasible.
  3. Append new columns to the DataFrame or CSV writer.
  4. Maintain or incorporate existing data transformations so that Motif_left/Motif_right are computed if relevant in the “pre” stage.
  5. Test by comparing the new kestrel_pre_result.tsv columns with those in kestrel_result.tsv to confirm consistency.
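
Step 5 could be checked with a small header comparison like the one below. It is a sketch under stated assumptions: `read_header` and `missing_columns` are hypothetical helpers, and the two in-memory headers are shortened toy examples rather than the real file layouts.

```python
import csv
import io


def read_header(handle):
    """Return the column names from the first line of a tab-separated file."""
    return next(csv.reader(handle, delimiter="\t"))


def missing_columns(pre_header, final_header):
    """Columns present in the final result but absent from the pre-result.

    The returned list follows the order of the final header.
    """
    present = set(pre_header)
    return [c for c in final_header if c not in present]


# In-memory stand-ins for kestrel_pre_result.tsv and kestrel_result.tsv.
pre_tsv = io.StringIO("Motifs\tPOS\tREF\tALT\n")
final_tsv = io.StringIO("Motifs\tPOS\tREF\tALT\tDepth_Score\tConfidence\n")

gaps = missing_columns(read_header(pre_tsv), read_header(final_tsv))
print(gaps)  # ['Depth_Score', 'Confidence']
```

In a test suite, asserting that `gaps` is empty would confirm the pre-result covers every final column while still allowing it to be a superset.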

Benefits

  • Ease of Merging: Tools that read kestrel_pre_result.tsv can now seamlessly handle the same columns used in final results, improving workflow consistency.
  • Enhanced Debugging: Having Depth_Score or Confidence in pre-results helps debug or track variant filtering earlier in the pipeline.
  • Future Pipeline Development: Additional logic (like the rare allele outlier analysis) can rely on these columns without re-generating them from scratch.

Conclusion

By aligning kestrel_pre_result.tsv with kestrel_result.tsv, we ensure a consistent, flexible pipeline output that’s easier to integrate, debug, and extend for advanced analyses.

@berntpopp berntpopp added enhancement New feature or request refactor Refactor the code base labels Dec 25, 2024
@berntpopp berntpopp self-assigned this Dec 25, 2024
@berntpopp
Collaborator Author

Depends on #72
