Add FragPipe PSM conversion #84
Actionable comments posted: 6
🧹 Outside diff range and nitpick comments (2)
quantmsio/core/fragpipe.py (1)
171-174: Simplify the assignment using a ternary operator
You can replace the if-else block with a ternary operator for more concise code. Apply this diff to simplify the code:
- if pos_aa == "N-term":
-     position = 0
- else:
-     position = int(pos_aa[:-1])
+ position = 0 if pos_aa == "N-term" else int(pos_aa[:-1])
🧰 Tools
🪛 Ruff (0.8.0)
171-174: Use ternary operator position = 0 if pos_aa == "N-term" else int(pos_aa[:-1]) instead of if-else block (SIM108)
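The one-line form can be sketched as a small standalone function. This is only an illustration: parse_position is a hypothetical helper name that does not appear in the PR, and the token format (either "N-term" or a residue index followed by a single amino-acid letter, such as "15M") is an assumption.

```python
def parse_position(pos_aa: str) -> int:
    """Return 0 for an N-terminal token, else the residue index.

    Assumes non-terminal tokens look like "15M": digits followed by a
    single amino-acid letter, which the slice strips off.
    """
    return 0 if pos_aa == "N-term" else int(pos_aa[:-1])


print(parse_position("N-term"))  # 0
print(parse_position("15M"))     # 15
```

Under that assumption, a token with no trailing letter (for example "15") would lose its last digit, so the input format matters more than the choice between if-else and the ternary.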
quantmsio/commands/fragpipe_command.py (1)
35-40: Ensure type annotations match the options
The convert_fragpipe_psm function parameters should match the types specified in the click options for consistency and type safety. Verify that the type annotations correspond to the option types, and adjust if necessary.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (3)
quantmsio/commands/fragpipe_command.py (1 hunk)
quantmsio/core/fragpipe.py (1 hunk)
quantmsio/quantmsioc.py (4 hunks)
🧰 Additional context used
🪛 Ruff (0.8.0)
quantmsio/core/fragpipe.py
171-174: Use ternary operator position = 0 if pos_aa == "N-term" else int(pos_aa[:-1]) instead of if-else block (SIM108)
🔇 Additional comments (5)
quantmsio/core/fragpipe.py (3)
21-37: The Parameter class implementation looks good
The class provides necessary parsing and formatting functionalities, ensuring proper handling of parameters.
42-76: The ModificationPositions class is well-designed
Methods for appending, extending, and formatting positions are appropriately implemented.
78-101: The ModificationIdentifier class handles identifiers correctly
The class initializes and validates modification identifiers effectively, ensuring they are correctly formatted.
quantmsio/quantmsioc.py (2)
35-39: Logging configuration is appropriately set
The logging configuration has been added to enhance traceability and debugging capabilities.
57-57: The convert_fragpipe_psm command is successfully integrated
The new command has been added to the CLI, making it accessible via the command line.
Actionable comments posted: 0
🧹 Outside diff range and nitpick comments (2)
quantmsio/core/fragpipe.py (2)
1-3: Remove unused import
The numbers.Number import is not used in the code.
-from numbers import Number
🧰 Tools
🪛 Ruff (0.8.0)
2-2: numbers.Number imported but unused. Remove unused import: numbers.Number (F401)
399-404: Consider performance optimizations for large datasets
The current implementation processes data in batches but could be optimized:
- Consider parallel processing for batch conversion
- Use Arrow's streaming API for memory efficiency
- Add progress tracking for long-running conversions
Would you like me to provide an example implementation with these optimizations?
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (1)
quantmsio/core/fragpipe.py (1 hunk)
🧰 Additional context used
🪛 Ruff (0.8.0)
quantmsio/core/fragpipe.py
2-2: numbers.Number imported but unused. Remove unused import: numbers.Number (F401)
170-173: Use ternary operator position = 0 if pos_aa == "N-term" else int(pos_aa[:-1]) instead of if-else block (SIM108)
🔇 Additional comments (3)
quantmsio/core/fragpipe.py (3)
147-152: Add @classmethod decorator to to_arrow method
The to_arrow method uses cls as its first parameter but is missing the @classmethod decorator.
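A minimal sketch of what the decorator changes, using a hypothetical Record class and to_rows method rather than the PR's Identification.to_arrow. With @classmethod, Python binds cls to the class itself, so the caller passes only the list of instances; without it, a call such as Record.to_rows(items) would bind items to the first parameter instead.

```python
from dataclasses import asdict, dataclass
from typing import Dict, List


@dataclass
class Record:
    """Hypothetical stand-in for a class with a cls-first conversion method."""

    name: str
    score: float

    @classmethod
    def to_rows(cls, items: List["Record"]) -> List[Dict[str, object]]:
        # With the decorator, cls is bound to Record automatically, so the
        # caller passes only the list of instances.
        return [asdict(item) for item in items]


rows = Record.to_rows([Record("a", 1.0), Record("b", 2.5)])
print(rows)
```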
399-404
:
Add error handling for file operations and data parsing
The pd.read_csv
operation should handle:
- File format issues (ParserError)
- Memory issues during batch processing
- Data type conversion errors
def convert_psms(
self,
file_path: Path,
batch_size: int = 10000,
) -> Iterator[pa.RecordBatch]:
- iterator = pd.read_csv(file_path, iterator=True, chunksize=batch_size, sep="\t")
- for batch in iterator:
- idents = Identification.from_dataframe(batch)
- table = Identification.to_arrow(idents)
- yield from table.to_batches(batch_size)
+ try:
+ iterator = pd.read_csv(file_path, iterator=True, chunksize=batch_size, sep="\t")
+ for batch in iterator:
+ try:
+ idents = Identification.from_dataframe(batch)
+ table = Identification.to_arrow(idents)
+ yield from table.to_batches(batch_size)
+ except Exception as e:
+ logger.error(f"Error processing batch: {str(e)}")
+ raise
+ except pd.errors.ParserError as e:
+ logger.error(f"Error parsing CSV file: {str(e)}")
+ raise
+ except Exception as e:
+ logger.error(f"Unexpected error: {str(e)}")
+ raise
Likely invalid or redundant comment.
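The log-and-re-raise pattern in a batch generator can be sketched with the standard library alone. This is not the PR's pandas-based reader: read_batches is a hypothetical name, the input is an in-memory tab-separated sample, and csv.Error stands in for pd.errors.ParserError.

```python
import csv
import io
import logging
from typing import Dict, Iterator, List, TextIO

logger = logging.getLogger(__name__)


def read_batches(handle: TextIO, batch_size: int) -> Iterator[List[Dict[str, str]]]:
    """Yield lists of row dicts from a tab-separated stream, logging failures."""
    try:
        reader = csv.DictReader(handle, delimiter="\t")
        batch: List[Dict[str, str]] = []
        for row in reader:
            batch.append(row)
            if len(batch) == batch_size:
                yield batch
                batch = []
        if batch:
            yield batch
    except csv.Error as err:
        # Log for context, then re-raise so the caller still sees the failure.
        logger.error("Error parsing TSV input: %s", err)
        raise


sample = io.StringIO("Peptide\tCharge\nPEPTIDEK\t2\nELVISK\t3\nLIVESK\t2\n")
batches = list(read_batches(sample, batch_size=2))
print([len(b) for b in batches])  # [2, 1]
```

Catching only the specific parser error keeps unrelated exceptions untouched, which is one reason a broad except Exception wrapper may add little beyond the traceback the caller would get anyway.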
195-220: Verify peptide modification handling edge cases
The peptidoform property handles modifications but should be verified for:
- Multiple modifications at the same position
- C-terminal modifications
- Empty modification lists
✅ Verification successful
Based on the search results, I can now provide a comprehensive verification of the peptidoform property's handling of edge cases:
Edge cases are properly handled in the peptidoform implementation
The code correctly handles:
- Multiple modifications at the same position: through the list structure mods_at_index, which can store multiple modifications
- N-terminal modifications: explicitly handled with the if pos == 0 condition
- Empty modification lists: implicitly handled, as modifications are only added when present
The implementation also properly raises NotImplementedError for unsupported cases:
- Variably localized modifications
- CV-based modifications
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Search for test cases covering edge cases in peptide modifications
rg -l "def test.*peptidoform" --type python
Length of output: 76
Script:
#!/bin/bash
# Search for test files and test cases covering peptide modifications
rg -l "test.*peptid"
Length of output: 22
Script:
#!/bin/bash
# Search for usage of peptidoform method
rg "peptidoform" -A 5 -B 5
Length of output: 67914
Script:
#!/bin/bash
# Search for test cases with modifications
rg "modification.*test|test.*modification" -A 5 -B 5
Length of output: 54
Actionable comments posted: 2
🧹 Outside diff range and nitpick comments (2)
quantmsio/core/fragpipe.py (2)
170-174: Simplify position assignment using ternary operator
The code can be more concise by using a ternary operator.
- if pos_aa == "N-term":
-     position = 0
- else:
-     position = int(pos_aa[:-1])
+ position = 0 if pos_aa == "N-term" else int(pos_aa[:-1])
🧰 Tools
🪛 Ruff (0.8.0)
170-173: Use ternary operator position = 0 if pos_aa == "N-term" else int(pos_aa[:-1]) instead of if-else block (SIM108)
1-486: Overall implementation looks good!
The code demonstrates good separation of concerns and follows a clear data processing pipeline. The data classes are well-structured, and the Arrow schema definitions are appropriate for the domain.
Consider adding comprehensive logging throughout the data processing pipeline to aid in debugging and monitoring in production.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (2)
quantmsio/commands/fragpipe_command.py (1 hunk)
quantmsio/core/fragpipe.py (1 hunk)
🚧 Files skipped from review as they are similar to previous changes (1)
- quantmsio/commands/fragpipe_command.py
🧰 Additional context used
🪛 Ruff (0.8.0)
quantmsio/core/fragpipe.py
170-173: Use ternary operator position = 0 if pos_aa == "N-term" else int(pos_aa[:-1]) instead of if-else block (SIM108)
🔇 Additional comments (2)
quantmsio/core/fragpipe.py (2)
147-152: Add @classmethod decorator to to_arrow method
The method uses cls as its first parameter but is missing the @classmethod decorator.
408-413: Add error handling for file operations and data parsing
The code needs error handling for file operations and data parsing.
def format(self, parameters=None) -> str:
    if not parameters:
        return "|".join(map(str, self.positions))
    else:
        if len(parameters) != len(self):
            raise ValueError(
                f"{self.__class__.__name__}.format requires the parameter list"
                f" length ({len(parameters)}) match its length ({len(self)})"
            )
        return "|".join(f"{i}{p.format()}" for i, p in zip(self.positions, parameters))
🛠️ Refactor suggestion
Add defensive check for empty positions list
The format method could raise an exception when parameters is None but positions exist. Consider adding a defensive check.
 def format(self, parameters=None) -> str:
+    if not self.positions:
+        return ""
     if not parameters:
         return "|".join(map(str, self.positions))
     else:
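The guard can be tried out on a simplified stand-in. The class below is a hypothetical reduction of ModificationPositions, not the PR's implementation: it stores plain integers and, as an assumption made for brevity, takes parameters as ready-made strings instead of objects with a format() method.

```python
from typing import List, Optional, Sequence


class ModificationPositions:
    """Simplified, hypothetical stand-in for the class under review."""

    def __init__(self, positions: Optional[List[int]] = None) -> None:
        self.positions: List[int] = list(positions or [])

    def __len__(self) -> int:
        return len(self.positions)

    def format(self, parameters: Optional[Sequence[str]] = None) -> str:
        # Explicit guard from the suggestion; note that "|".join([]) would
        # also return "" on its own.
        if not self.positions:
            return ""
        if not parameters:
            return "|".join(map(str, self.positions))
        if len(parameters) != len(self):
            raise ValueError("parameter list length must match position count")
        # Here parameters are plain strings; the PR's version calls p.format().
        return "|".join(f"{i}{p}" for i, p in zip(self.positions, parameters))


print(repr(ModificationPositions().format()))      # ''
print(ModificationPositions([3, 7]).format())      # 3|7
print(ModificationPositions([3]).format(["[x]"]))  # 3[x]
```

One side effect worth weighing: with the guard in place, an empty position list combined with a non-empty parameter list returns an empty string instead of raising the length-mismatch error.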
def identification_from_row(row: pd.Series) -> Identification:
    spectrum = spectrum_from_row(row)
    peptide = peptide_from_row(row)

    calculated_mz = row["Calculated M/Z"]

    posterior = row["PeptideProphet Probability"]

    cv_params = [
        Parameter(
            "number of missed cleavages",
            "PSI-MS",
            str(row["Number of Missed Cleavages"]),
            "MS:1003044",
        ),
        Parameter(
            "number of enzymatic termini",
            "PSI-MS",
            str(row["Number of Enzymatic Termini"]),
            "MS:1003048",
        ),
    ]

    additional_scores = [
        Score("MSFragger:Hyperscore", row["Hyperscore"]),
        Score("MSFragger:Expectation", row["Expectation"]),
    ]

    return Identification(
        spectrum,
        peptide,
        calculated_mz,
        cv_params=cv_params,
        additional_scores=additional_scores,
        posterior_error_probability=posterior,
    )
🛠️ Refactor suggestion
Add column validation before accessing row data
The function assumes all required columns exist in the DataFrame. Consider adding validation to prevent KeyError exceptions.
 def identification_from_row(row: pd.Series) -> Identification:
+    required_columns = {
+        "Calibrated Observed M/Z", "Charge", "Retention", "Spectrum",
+        "Protein ID", "Gene", "Protein Start", "Protein End",
+        "Peptide", "Assigned Modifications", "Calculated M/Z",
+        "PeptideProphet Probability", "Number of Missed Cleavages",
+        "Number of Enzymatic Termini", "Hyperscore", "Expectation"
+    }
+    missing_columns = required_columns - set(row.index)
+    if missing_columns:
+        raise ValueError(f"Missing required columns: {missing_columns}")
+
     spectrum = spectrum_from_row(row)
     peptide = peptide_from_row(row)
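The set-difference check can be sketched without pandas by treating a row as a plain mapping. The names check_columns and REQUIRED_COLUMNS are hypothetical, and the three column names are an illustrative subset rather than the full list from the suggestion.

```python
from typing import Iterable, Mapping

# Hypothetical subset of the required column names, for illustration only.
REQUIRED_COLUMNS = frozenset({"Peptide", "Charge", "Hyperscore"})


def check_columns(row: Mapping[str, object], required: Iterable[str] = REQUIRED_COLUMNS) -> None:
    """Raise ValueError naming every required key that the row lacks."""
    missing = sorted(set(required) - set(row.keys()))
    if missing:
        raise ValueError(f"Missing required columns: {missing}")


check_columns({"Peptide": "PEPTIDEK", "Charge": 2, "Hyperscore": 31.4})  # passes

try:
    check_columns({"Peptide": "PEPTIDEK"})
except ValueError as err:
    message = str(err)
    print(message)  # Missing required columns: ['Charge', 'Hyperscore']
```

Sorting the missing names keeps the message deterministic. Since every row of a batch shares the same columns, running a check like this once per batch rather than once per row would likely be cheaper.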
Actionable comments posted: 1
🧹 Outside diff range and nitpick comments (3)
quantmsio/core/fragpipe.py (3)
355-361: Add input validation for batch_size parameter
The batch_size parameter should be validated to ensure it's a positive integer.
Add parameter validation:
 def write_psms_to_parquet(
     self,
     file_path: Path,
     batch_size: int = 10000,
     output_prefix_file: Optional[str] = None,
     **metadata,
 ):
+    if not isinstance(batch_size, int) or batch_size <= 0:
+        raise ValueError(f"batch_size must be a positive integer, got {batch_size}")
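The same check can be factored into a small helper, sketched here under the assumption that booleans should also be rejected; validate_batch_size is a hypothetical name that does not appear in the PR.

```python
def validate_batch_size(batch_size: object) -> int:
    """Return batch_size unchanged if it is a positive int, else raise."""
    # bool is a subclass of int, so True would otherwise pass as 1.
    if isinstance(batch_size, bool) or not isinstance(batch_size, int) or batch_size <= 0:
        raise ValueError(f"batch_size must be a positive integer, got {batch_size!r}")
    return batch_size


print(validate_batch_size(10000))  # 10000
```

Using !r in the message distinguishes the string "10" from the integer 10 when the value comes from a command-line option.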
400-410: Add progress tracking for large file processing
For large files, it would be helpful to track progress and provide feedback to users.
Consider using tqdm for progress tracking:
+from tqdm import tqdm
+
 def convert_psms(
     self,
     file_path: Path,
     batch_size: int = 10000,
 ) -> Iterator[pa.RecordBatch]:
     iterator = pd.read_csv(file_path, iterator=True, chunksize=batch_size, sep="\t")
+    total_size = sum(1 for _ in open(file_path)) - 1  # subtract header
+    progress = tqdm(total=total_size, desc="Converting PSMs")
     for batch in iterator:
         idents = Identification.from_dataframe(batch)
         table = Identification.to_arrow(idents)
+        progress.update(len(batch))
         yield from table.to_batches(batch_size)
+    progress.close()
400-410: Consider memory optimization for large file processing
The current implementation loads each batch into memory entirely before converting it. For very large files, this could lead to high memory usage.
Consider these optimizations:
- Process records in smaller chunks within each batch
- Use generators for transformation steps
- Implement memory usage monitoring
- Add configuration options for memory management
Example approach:
def process_chunk(chunk: pd.DataFrame, chunk_size: int = 1000):
    for i in range(0, len(chunk), chunk_size):
        mini_chunk = chunk.iloc[i:i + chunk_size]
        idents = Identification.from_dataframe(mini_chunk)
        yield from Identification.to_arrow(idents).to_batches()
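The generator-based chunking idea can be illustrated independently of pandas and Arrow. In this sketch iter_chunks is a hypothetical helper that lazily groups any iterable into fixed-size lists, so only one chunk is held in memory at a time.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def iter_chunks(items: Iterable[T], chunk_size: int) -> Iterator[List[T]]:
    """Lazily yield lists of at most chunk_size items from any iterable."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    iterator = iter(items)
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            return
        yield chunk


chunks = list(iter_chunks(range(7), 3))
print(chunks)  # [[0, 1, 2], [3, 4, 5], [6]]
```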
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (1)
quantmsio/core/fragpipe.py (1 hunk)
🧰 Additional context used
🪛 Ruff (0.8.0)
quantmsio/core/fragpipe.py
170-173: Use ternary operator position = 0 if pos_aa == "N-term" else int(pos_aa[:-1]) instead of if-else block (SIM108)
def peptidoform(self) -> str:
    peptide = proforma.ProForma.parse(self.sequence)

    mod: MzTabModification
    for mod in self.modifications:
        if len(mod.position) > 1:
            raise NotImplementedError("Variably localized modifications")
        pos = mod.position[0]
        mod_tag = None
        if mod.is_chemmod():
            mod_tag = proforma.MassModification(mod.identifier.mass_value)
        else:
            raise NotImplementedError("CV-based modifications")
        if pos == 0:
            # n-term
            if not peptide.n_term:
                peptide.n_term = []
            peptide.n_term.append(mod_tag)
        else:
            index = pos - 1
            (aa, mods_at_index) = peptide.sequence[index]
            if not mods_at_index:
                mods_at_index = []
            mods_at_index.append(mod_tag)
            peptide.sequence[index] = (aa, mods_at_index)
    return peptide
🛠️ Refactor suggestion
Add error handling and improve error messages in peptidoform property
The peptidoform property implementation has several areas that could be improved:
- Add error handling for invalid sequences in proforma.ProForma.parse
- Add bounds checking before accessing peptide.sequence[index]
- Improve error messages for unsupported features
Consider applying these improvements:
 @property
 def peptidoform(self) -> str:
-    peptide = proforma.ProForma.parse(self.sequence)
+    try:
+        peptide = proforma.ProForma.parse(self.sequence)
+    except Exception as e:
+        raise ValueError(f"Invalid peptide sequence: {self.sequence}") from e
     mod: MzTabModification
     for mod in self.modifications:
         if len(mod.position) > 1:
-            raise NotImplementedError("Variably localized modifications")
+            raise NotImplementedError(
+                f"Variably localized modifications are not supported. Found at positions: {mod.position}"
+            )
         pos = mod.position[0]
         mod_tag = None
         if mod.is_chemmod():
             mod_tag = proforma.MassModification(mod.identifier.mass_value)
         else:
-            raise NotImplementedError("CV-based modifications")
+            raise NotImplementedError(
+                f"CV-based modifications are not supported. Found: {mod.identifier}"
+            )
         if pos == 0:
             # n-term
             if not peptide.n_term:
                 peptide.n_term = []
             peptide.n_term.append(mod_tag)
         else:
             index = pos - 1
+            if not (0 <= index < len(peptide.sequence)):
+                raise ValueError(
+                    f"Modification position {pos} is out of range for sequence length {len(peptide.sequence)}"
+                )
             (aa, mods_at_index) = peptide.sequence[index]
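The bounds check can be exercised on a simplified model of the sequence. This sketch does not use pyteomics: the Residue alias, attach_mass, and the plain float mass shifts are all hypothetical stand-ins for the ProForma objects in the PR.

```python
from typing import List, Tuple

Residue = Tuple[str, List[float]]  # (amino acid, mass shifts attached to it)


def attach_mass(sequence: List[Residue], n_term: List[float], pos: int, mass: float) -> None:
    """Attach a mass shift at a 1-based position; position 0 means N-terminus."""
    if pos == 0:
        n_term.append(mass)
        return
    index = pos - 1
    if not 0 <= index < len(sequence):
        raise ValueError(
            f"Modification position {pos} is out of range for sequence length {len(sequence)}"
        )
    sequence[index][1].append(mass)


residues: List[Residue] = [(aa, []) for aa in "PEPMK"]
n_term_mods: List[float] = []
attach_mass(residues, n_term_mods, 0, 42.0106)  # N-terminal shift
attach_mass(residues, n_term_mods, 4, 15.9949)  # shift on M at position 4
print(n_term_mods, residues[3])
```

The lower bound matters as much as the upper one: without it, a negative position would silently index from the end of a Python list instead of failing.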
Beautiful PR @mobiusklein
User description
This adds a PSM conversion for FragPipe. It mirrors the MaxQuant conversion in style, but takes a different approach to building the RecordBatch stream.
I tried to factorize the different components to be re-usable for the different views, but only built out the PSM so far. I realized near the end that I wasn't sure which was the "current" schema anymore, so I decided to halt where I'm at and get feedback.
PR Type
enhancement
Description
- convert-fragpipe-psm to convert FragPipe PSMs from psm.tsv to parquet format using the quantms.io library.
- fragpipe.py to handle the conversion of PSM data, including parsing and formatting of peptide and spectrum data.
Changes walkthrough 📝
fragpipe_command.py: Add command for FragPipe PSM conversion to parquet (quantmsio/commands/fragpipe_command.py)
- convert-fragpipe-psm for converting FragPipe PSMs.
- chunk size.
- FragPipe instance to handle conversion to parquet format.
fragpipe.py: Implement FragPipe PSM data conversion to Arrow and Parquet (quantmsio/core/fragpipe.py)
quantmsioc.py: Integrate FragPipe PSM conversion command into CLI (quantmsio/quantmsioc.py)
- convert_fragpipe_psm command.
Summary by CodeRabbit
Release Notes
New Features
- convert-fragpipe-psm for converting PSM data from TSV to Parquet format.
Improvements
These updates enhance data processing capabilities and improve user experience with the command-line interface.