Refactor LLM inputs into multiple files #752
Conversation
Force-pushed from 46b35f8 to 29db2ee
@@ -0,0 +1,51 @@
import json

CodeQL check notice: Unused import
@@ -0,0 +1,31 @@
from typing import Any, Dict, List, Union

CodeQL check notice: Unused import
from pathlib import Path
from typing import Any, Dict, List, Optional, Tuple, cast
from typing import Dict, List, Optional, cast

CodeQL check notice: Unused import
@@ -0,0 +1,258 @@
import json
import random
from typing import Any, Dict, List, Union

CodeQL check notice: Unused import (Import of 'Any' is not used.)
Force-pushed from 62b6f81 to 8872014
Force-pushed from 180e30f to 45b2b8d
@@ -40,12 +40,21 @@
    DEFAULT_COMPARE_DIR,
    OPEN_ORCA,
)
from genai_perf.llm_inputs.llm_inputs import (
    LlmInputs,
from genai_perf.llm_inputs.llm_inputs import LlmInputs

CodeQL check notice: Unused import
# if output_format == OutputFormat.RANKINGS:
#     dataset = DatasetRetriever.from_directory(input_filename)
# else:

CodeQL check notice: Commented-out code
Force-pushed from 4cd5a1a to 5b03325
Commits:
- Get all filepaths working for chat/completions
- Check invalid input type combinations
- Catch JSON
- Fix tests, add copyrights
- Fix vLLM backend, add JSON input file check
- Fix TRT-LLM backend
- Remove unused imports
- Get tests passing
- Remove unused imports
Force-pushed from 5b03325 to db95dfe
A big improvement! Looks really good so far.
class DatasetRetriever:
    @staticmethod
Add a short description about what this class does
Done! Good suggestion, done for all the classes.
src/c++/perf_analyzer/genai-perf/genai_perf/llm_inputs/dataset_retriever.py (resolved)
) | ||
if "text_input" not in item: | ||
raise GenAIPerfException( | ||
"Missing 'text_input' field in one or more items." |
I don't know how often this error would occur, but it might be nice to print the malformed item so the user doesn't have to go searching for the problem through a large JSON file.
Good suggestion
Done!
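The agreed change can be sketched like this (a minimal sketch: `GenAIPerfException` is stubbed here, and the helper name `check_item` is illustrative, not the actual PR code):

```python
# Sketch of including the malformed item in the error message.
# GenAIPerfException is stubbed; check_item is a hypothetical helper name.
class GenAIPerfException(Exception):
    pass

def check_item(item: dict) -> None:
    # Echo the offending item so users don't have to search a large JSON file.
    if "text_input" not in item:
        raise GenAIPerfException(
            f"Missing 'text_input' field in item: {item}"
        )
```

With this, the exception message carries the exact dict that failed validation rather than a generic "one or more items" notice.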
dataset: Dict = {"rows": []}
data = {}

# Check all JSONL files in the directory
I feel like each of these should be their own method. Something like `_load_data_from_file()` and `_create_dataset()`.
I think a counterpoint here is that the file so far is clean (each function is a from_* method) and this function is still short (<15 lines of code) and clean. The new helpers would only be used for the directory source and would each be ~5 lines. I can see both sides, though.
If you think it's valuable for readability to add other functions, I can add a utils section at the bottom of the file for the lower-level functions that support the from_* functions.
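For reference, the reviewer's suggested split might look roughly like this (hypothetical helper names and a simplified generic schema; a sketch, not the actual implementation):

```python
import json
from pathlib import Path
from typing import Dict, List

def _load_data_from_file(path: Path) -> List[Dict]:
    # Parse one JSONL file into a list of dicts, skipping blank lines.
    with open(path) as f:
        return [json.loads(line) for line in f if line.strip()]

def _create_dataset(records: List[Dict]) -> Dict:
    # Wrap records in a generic {"rows": [{"row": ...}]} structure
    # (simplified from the schema discussed in this PR).
    return {"rows": [{"row": record} for record in records]}

def from_directory(directory: Path) -> Dict:
    # Check all JSONL files in the directory.
    records: List[Dict] = []
    for jsonl_file in sorted(directory.glob("*.jsonl")):
        records.extend(_load_data_from_file(jsonl_file))
    return _create_dataset(records)
```

Either shape works; the trade-off discussed above is whether two ~5-line helpers used by a single from_* method pay for themselves in readability.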
def to_generic(dataset: List[Dict[str, Any]]) -> Dict:
    if isinstance(dataset, list) and len(dataset) > 0:
        if isinstance(dataset[0], dict):
            # Assume dataset is a list of dictionaries
Is this comment accurate? Aren't you already checking for this with the `isinstance()'s above?
Removed, good catch!
"rows": [{"row": {"text_input": item}} for item in dataset], | ||
} | ||
else: | ||
raise ValueError("Dataset is not in a recognized format.") |
I would print the dataset for the user in the error message
This method has always bothered me. I think it's because I need to relearn the generic format every time we add support. I don't know if printing the dataset in the message is the right call or if we need to document the schema somewhere. Thoughts?
Added the print statement to help with debugging. I agree the flow might be a bit ambiguous for users right now; that's something we can consider during the modularization, since we want to make it easier for users to add tasks. If we want to redesign the flow or document the schema, we should do that outside of this feature branch, IMO.
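The resulting error path might look like this (a simplified sketch: the real `to_generic` handles more dataset shapes than the two shown here):

```python
from typing import Any, Dict, List

def to_generic(dataset: List[Any]) -> Dict:
    # Simplified sketch: only list-of-dicts and list-of-strings are handled.
    if isinstance(dataset, list) and len(dataset) > 0:
        if isinstance(dataset[0], dict):
            return {"rows": [{"row": item} for item in dataset]}
        if isinstance(dataset[0], str):
            return {"rows": [{"row": {"text_input": item}} for item in dataset]}
    # Print the dataset in the message so users can see what was rejected.
    raise ValueError(f"Dataset is not in a recognized format: {dataset}")
```

Documenting the generic schema itself, as discussed above, would be a separate follow-up.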
model = self._select_model_name(model_name, index, model_selection_strategy)
payload = {"prompt": text_content}

if add_model_name:
This section is identical for both OpenAICompletion converters - possible refactor here
Thanks for noticing the overlap. I think we should not do that, as this is a step toward creating converters for each endpoint/task, and we want to keep them completely decoupled from one another. Since we're ultimately going in that direction, pulling out functionality would only make sense if it can be extracted into helper functions that are useful across the board for different endpoints.
src/c++/perf_analyzer/genai-perf/genai_perf/llm_inputs/output_format_converter.py (resolved)
for index, row in enumerate(generic_dataset["rows"]):
    if "query" not in row or "passages" not in row:
        raise GenAIPerfException(
Print out the malformed query in the error message
Done!
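The requested change for the rankings check might look like this (a sketch: `GenAIPerfException` is stubbed, the function name is hypothetical, and only the validation loop is shown):

```python
# Sketch of printing the malformed row in the rankings validation error.
# GenAIPerfException is stubbed; validate_rankings_rows is illustrative.
class GenAIPerfException(Exception):
    pass

def validate_rankings_rows(generic_dataset: dict) -> None:
    # Include the row index and contents so malformed rows are easy to find.
    for index, row in enumerate(generic_dataset["rows"]):
        if "query" not in row or "passages" not in row:
            raise GenAIPerfException(
                f"Row {index} is missing 'query' or 'passages': {row}"
            )
```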
@@ -0,0 +1,53 @@
# Copyright 2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
Not a fan of the filename... but I can't think of something more descriptive at the moment.
Maybe common.py or utils.py?
This is my comment too. Can we move this to something like inputs_shared_utilities or something like that?
Renamed to inputs_utils.py for now. If folks prefer and agree on a different name later, I'm happy to update it in another PR for this feature branch (or we can always update it later as well).
src/c++/perf_analyzer/genai-perf/genai_perf/llm_inputs/output_format_converter.py (resolved)
src/c++/perf_analyzer/genai-perf/genai_perf/llm_inputs/dataset_retriever.py (resolved)
src/c++/perf_analyzer/genai-perf/genai_perf/llm_inputs/output_format_converter.py (resolved)
src/c++/perf_analyzer/genai-perf/genai_perf/llm_inputs/json_writer.py (outdated, resolved)
src/c++/perf_analyzer/genai-perf/genai_perf/llm_inputs/dataset_retriever.py (resolved)
Address feedback
Force-pushed from 2a2dc70 to 1a3d97f
Commits:
- Standardize copyrights
- Print more details in exception
- Address feedback

Force-pushed from 3c23e1d to 4d033ef
I did my best to address everyone's comments. If I missed any or you have additional feedback, please let me know.
@nv-braf @nv-hwoo @debermudez @matthewkotila Are you good to approve this? I can merge it to the feature branch. My immediate next step will be to rebase on the current main in a separate PR, then address the outstanding future work items above.
I'll defer to others; I'm working on other priorities at the moment.
As a step in refactoring LLM inputs, this PR splits the monolithic file into multiple components:
This PR targets a feature branch. After this PR, all endpoints work except rankings, which needs multiple input files and pre-processing, and will likely require a new pre-processing class to fit this paradigm.
Future work:
Outside-of-feature future work: