Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Dev #61

Merged
merged 4 commits into from
Sep 17, 2024
Merged

Dev #61

merged 4 commits into from
Sep 17, 2024

Conversation

zprobot
Copy link
Collaborator

@zprobot zprobot commented Sep 17, 2024

PR Type

enhancement, documentation, tests


Description

  • Implemented new classes and methods for handling PSM data, including MzTab and PsmInMemory.
  • Defined schema for PSM fields using PyArrow.
  • Added test cases for the new PSM processing functionality.
  • Updated documentation and schema files to reflect new field names and mappings.

Changes walkthrough 📝

Relevant files
Enhancement
4 files
common.py
Define PSM mapping and additional attributes                         

quantmsio/temp_core/common.py

  • Defined PSM_MAP and PSM_USECOLS for mapping PSM data.
  • Added ADDITIONS list for additional PSM attributes.
  • +39/-0   
    format.py
    Define schema for PSM fields using PyArrow                             

    quantmsio/temp_core/format.py

  • Defined PEPTIDE_FIELDS and PSM_UNIQUE_FIELDS using PyArrow.
  • Combined fields into PSM_FIELDS.
  • +165/-0 
    mzTab.py
    Implement MzTab class for mzTab file processing                   

    quantmsio/temp_core/mzTab.py

  • Implemented MzTab class for handling mzTab files.
  • Added methods for extracting modifications and MS runs.
  • +224/-0 
    psm.py
    Create PsmInMemory class for PSM processing                           

    quantmsio/temp_core/psm.py

  • Created PsmInMemory class extending MzTab.
  • Added methods for generating reports and converting to Parquet.
  • +105/-0 
    Tests
    1 files
    test_new_psm.py
    Add test case for PsmInMemory class                                           

    tests/test_new_psm.py

  • Added test case for PsmInMemory class.
  • Tested conversion of mzTab to feature.
  • +10/-0   
    Documentation
    5 files
    README.adoc
    Update README with new PSM field descriptions                       

    docs/README.adoc

  • Updated field descriptions for peptide-based views.
  • Added mappings for new PSM fields.
  • +15/-14 
    feature.avsc
    Update field names in feature schema                                         

    docs/feature.avsc

    • Updated field names for gene accessions and names.
    +2/-2     
    peptide.avsc
    Update field names in peptide schema                                         

    docs/peptide.avsc

    • Updated field names for gene accessions and names.
    +2/-2     
    protein.avsc
    Update field names in protein schema                                         

    docs/protein.avsc

    • Updated field names for gene accessions and names.
    +2/-2     
    psm.avsc
    Update field names in PSM schema                                                 

    docs/psm.avsc

    • Updated field names for gene accessions and names.
    +2/-2     

    💡 PR-Agent usage:
    Comment /help on the PR to get a list of all available PR-Agent tools and their descriptions

    @qodo-merge-pro qodo-merge-pro bot added documentation Improvements or additions to documentation enhancement New feature or request tests labels Sep 17, 2024
    Copy link

    PR Reviewer Guide 🔍

    ⏱️ Estimated effort to review: 4 🔵🔵🔵🔵⚪
    🧪 PR contains tests
    🔒 No security concerns identified
    ⚡ Key issues to review

    Potential Performance Issue
    The generate_report method uses a for loop to process chunks of data, which might be inefficient for large datasets. Consider using a more memory-efficient approach or implementing parallel processing.

    Code Duplication
    The transform_psm method contains multiple similar operations for adding columns to the DataFrame. Consider refactoring to reduce duplication and improve maintainability.

    Error Handling
    The fetch_modifications_from_mztab_line function raises exceptions without proper error handling or logging. Consider implementing more robust error handling and providing informative error messages.

    Copy link

    PR Code Suggestions ✨

    CategorySuggestion                                                                                                                                    Score
    Performance
    Improve memory efficiency when writing large DataFrames to Parquet files

    Consider using a more memory-efficient approach when writing large DataFrames to
    Parquet files by using the pyarrow.parquet.ParquetWriter with chunked writing.

    quantmsio/temp_core/psm.py [80-83]

    -for p in self.generate_report(chunksize=chunksize,protein_str=protein_str):
    -    if not pqwriter:
    -        pqwriter = pq.ParquetWriter(output_path, p.schema)
    -    pqwriter.write_table(p)
    +with pq.ParquetWriter(output_path, PSM_SCHEMA) as pqwriter:
    +    for p in self.generate_report(chunksize=chunksize, protein_str=protein_str):
    +        pqwriter.write_table(p)
     
    • Apply this suggestion
    Suggestion importance[1-10]: 8

    Why: The suggestion provides a significant improvement in memory efficiency by using a context manager for ParquetWriter, which is a good practice for handling large data. This change is beneficial for resource management, warranting a higher score.

    8
    Vectorize the generation of 'pg_positions' for improved performance

    Use a more efficient method to generate the pg_positions column by vectorizing the
    operation instead of using apply.

    quantmsio/temp_core/psm.py [41-44]

    -df.loc[:, 'pg_positions'] = df[['start','end']].apply(
    -    lambda row: self.generate_positions(row["start"], row["end"]),
    -    axis=1
    -)
    +df['pg_positions'] = df['start'].str.split(',').combine(df['end'].str.split(','), lambda x, y: [f'{s}:{e}' for s, e in zip(x, y)])
     
    • Apply this suggestion
    Suggestion importance[1-10]: 7

    Why: The suggestion correctly identifies a potential performance improvement by vectorizing the operation, which can lead to faster execution for large DataFrames. However, the improvement is not critical, hence a moderate score.

    7
    Vectorize the generation of 'additional_scores' for improved performance

    Use a more efficient method to generate the additional_scores column by applying the
    function to the entire DataFrame at once instead of row by row.

    quantmsio/temp_core/psm.py [48]

    -df.loc[:, "additional_scores"] = df[list(self._score_names.values())].apply(self._genarate_additional_scores,axis=1)
    +df["additional_scores"] = df[list(self._score_names.values())].apply(lambda x: [{"name": name, "value": x[score]} for name, score in self._score_names.items()], axis=1)
     
    • Apply this suggestion
    Suggestion importance[1-10]: 7

    Why: The suggestion offers a valid performance enhancement by vectorizing the operation, which can be beneficial for large datasets. The change is not essential but can improve efficiency, justifying a moderate score.

    7
    Optimize the generation of 'peptidoform' for better performance

    Consider using a more efficient method to generate the peptidoform column by
    vectorizing the operation instead of using apply.

    quantmsio/temp_core/psm.py [49-52]

    -df.loc[:,'peptidoform'] = df[["modifications", "sequence"]].apply(
    -    lambda row: get_peptidoform_proforma_version_in_mztab(row["sequence"], row["modifications"], self._modifications),
    -    axis=1
    -)
    +df['peptidoform'] = df.apply(lambda row: get_peptidoform_proforma_version_in_mztab(row["sequence"], row["modifications"], self._modifications), axis=1)
     
    • Apply this suggestion
    Suggestion importance[1-10]: 5

    Why: The suggestion proposes a vectorized approach, but the improved code does not fully vectorize the operation as it still uses apply. The impact on performance is limited, resulting in a lower score.

    5

    @zprobot zprobot merged commit 0232717 into 1.0.0 Sep 17, 2024
    8 of 11 checks passed
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    documentation Improvements or additions to documentation enhancement New feature or request Review effort [1-5]: 4 tests
    Projects
    None yet
    Development

    Successfully merging this pull request may close these issues.

    1 participant