
Refactor LLM inputs into multiple files #752

Merged
merged 3 commits into feat-inputs-refactor from dyas-inputs-refactor on Jul 24, 2024

Conversation

@dyastremsky (Contributor) commented Jul 12, 2024

As a step in refactoring LLM inputs, split up the monolithic file into multiple components:

  • Dataset Retriever: pulls the data from its source into an expected format.
  • JSON Converter: converts the data into a generic JSON format.
  • OutputFormat Converter: converts the generic JSON format into the formats needed for a specific endpoint. There is a base class, plus a class for each endpoint/format. In a future PR, this file will likely be broken up into multiple files.
  • Shared: holds the shared enums used by the LLM inputs.
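The split described above can be sketched roughly as follows. This is an illustrative sketch only; the class names match the components listed, but the method names and data shapes are assumptions, not the PR's exact code:

```python
from typing import Any, Dict, List


class DatasetRetriever:
    """Pulls the data from its source into an expected format."""

    @staticmethod
    def from_list(prompts: List[str]) -> List[Dict[str, Any]]:
        return [{"text_input": p} for p in prompts]


class JSONConverter:
    """Converts the retrieved data into a generic JSON format."""

    @staticmethod
    def to_generic(dataset: List[Dict[str, Any]]) -> Dict[str, Any]:
        return {"rows": [{"row": item} for item in dataset]}


class OutputFormatConverter:
    """Base class; one subclass per endpoint/format."""

    def convert(self, generic: Dict[str, Any]) -> List[Dict[str, Any]]:
        raise NotImplementedError


class CompletionsConverter(OutputFormatConverter):
    """Hypothetical endpoint-specific converter."""

    def convert(self, generic: Dict[str, Any]) -> List[Dict[str, Any]]:
        return [{"prompt": r["row"]["text_input"]} for r in generic["rows"]]


generic = JSONConverter.to_generic(DatasetRetriever.from_list(["hello"]))
payloads = CompletionsConverter().convert(generic)
```

Each stage only depends on the output shape of the previous one, which is what makes the monolithic file separable.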

This PR targets a feature branch. After this PR, all endpoints work except rankings (due to requiring multiple files and pre-processing that likely requires a new pre-processing class to work with this paradigm).

Future work:

  • Add generic multi-file processing
  • Add preprocessing class that allows for data to be composed differently before being translated to the generic format
  • Add ranking support
  • Add batching support for embedding and ranking models
  • Compare against old code to identify and fix any misalignment in what's supported (e.g. output token counts, etc.)
  • Get all commented out tests passing
  • Confirm all endpoints still work with all flags (including --streaming and --extra-inputs) and on all currently supported platforms (including NIM)
  • Update unit tests to comprehensively test new format

Outside-of-feature future work:

@dyastremsky dyastremsky self-assigned this Jul 12, 2024
@dyastremsky dyastremsky force-pushed the dyas-inputs-refactor branch 3 times, most recently from 46b35f8 to 29db2ee Compare July 12, 2024 21:04
@@ -0,0 +1,51 @@
import json

Code scanning / CodeQL notice: Import of 'json' is not used.
@@ -0,0 +1,31 @@
from typing import Any, Dict, List, Union

Code scanning / CodeQL notice: Import of 'Union' is not used.
from pathlib import Path
from typing import Any, Dict, List, Optional, Tuple, cast
from typing import Dict, List, Optional, cast

Code scanning / CodeQL notice: Import of 'List' is not used.
@@ -0,0 +1,258 @@
import json
import random
from typing import Any, Dict, List, Union

Code scanning / CodeQL notice: Imports of 'Union' and 'Any' are not used.
# See the License for the specific language governing permissions and
# limitations under the License.

import json

Code scanning / CodeQL notice: Import of 'json' is not used.
@@ -40,12 +40,21 @@
DEFAULT_COMPARE_DIR,
OPEN_ORCA,
)
from genai_perf.llm_inputs.llm_inputs import (
LlmInputs,
from genai_perf.llm_inputs.llm_inputs import LlmInputs

Code scanning / CodeQL notice: Import of 'LlmInputs' is not used.
Comment on lines +91 to +94
# if output_format == OutputFormat.RANKINGS:
# dataset = DatasetRetriever.from_directory(input_filename)
# else:

Code scanning / CodeQL notice: This comment appears to contain commented-out code.
@dyastremsky dyastremsky changed the base branch from main to feat-inputs-refactor July 15, 2024 22:54
@dyastremsky dyastremsky force-pushed the dyas-inputs-refactor branch 2 times, most recently from 4cd5a1a to 5b03325 Compare July 16, 2024 00:00
@@ -29,53 +29,64 @@

import pytest
from genai_perf.llm_inputs.llm_inputs import LlmInputs, ModelSelectionStrategy
from genai_perf.llm_inputs.shared import OutputFormat, PromptSource

Code scanning / CodeQL notice (test): Import of 'PromptSource' is not used.
Commits pushed:

  • Get all filepaths working for chat/completions
  • Check invalid input type combinations
  • Catch JSON
  • Fix tests, add copyrights
  • Fix vLLM backend, add JSON input file check
  • Fix TRT-LLM backend
  • Remove unused imports
  • Get tests passing
  • Remove unused imports
@dyastremsky dyastremsky marked this pull request as ready for review July 16, 2024 00:08
@nv-braf (Contributor) left a comment:

A big improvement! Looks really good so far.



class DatasetRetriever:
@staticmethod
Contributor:

Add a short description about what this class does

dyastremsky (author):

Done! Good suggestion, done for all the classes.

)
if "text_input" not in item:
raise GenAIPerfException(
"Missing 'text_input' field in one or more items."
Contributor:

Don't know how often this error would occur, but it might be nice to print the malformed item so the user doesn't have to go searching for the problem through a large JSON file.

Contributor:

Good suggestion

dyastremsky (author):

Done!
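The fix the reviewer asked for amounts to embedding the offending item in the exception text. A minimal sketch, with `ValueError` standing in for `GenAIPerfException` and the exact message wording assumed:

```python
from typing import Any, Dict, List


def validate_items(items: List[Dict[str, Any]]) -> None:
    # Include the malformed item in the message so the user doesn't
    # have to search through a large JSON file to find it.
    for item in items:
        if "text_input" not in item:
            raise ValueError(f"Missing 'text_input' field in item: {item}")
```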

dataset: Dict = {"rows": []}
data = {}

# Check all JSONL files in the directory
Contributor:

I feel like each of these should be their own method. Something like _load_data_from_file() and _create_dataset().

dyastremsky (author):

I think a counterpoint here would be that the file so far is clean (each function is a from_* method) and the function is still short (<15 lines of code) and clean. The new functions would only be used for the directory source and would each be ~5 lines. I can see both sides, though.

If you think it's valuable for readability to add other functions, I can add a utils section at the bottom of the file for the lower-level functions that help the from_* functions.
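For reference, the reviewer's proposed split might look like the sketch below. The `_load_data_from_file()` and `_create_dataset()` names come from the comment above; the JSONL-reading details and the generic "rows" shape are assumptions:

```python
import json
from pathlib import Path
from typing import Any, Dict, List


class DatasetRetriever:
    @staticmethod
    def from_directory(directory: str) -> Dict[str, Any]:
        # Check all JSONL files in the directory.
        data: Dict[str, List[str]] = {}
        for path in sorted(Path(directory).glob("*.jsonl")):
            data[path.stem] = DatasetRetriever._load_data_from_file(path)
        return DatasetRetriever._create_dataset(data)

    @staticmethod
    def _load_data_from_file(path: Path) -> List[str]:
        # One JSON object per line; skip blank lines.
        with open(path) as f:
            return [json.loads(line)["text_input"] for line in f if line.strip()]

    @staticmethod
    def _create_dataset(data: Dict[str, List[str]]) -> Dict[str, Any]:
        rows = [{"row": {"text_input": text}}
                for texts in data.values() for text in texts]
        return {"rows": rows}
```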

def to_generic(dataset: List[Dict[str, Any]]) -> Dict:
if isinstance(dataset, list) and len(dataset) > 0:
if isinstance(dataset[0], dict):
# Assume dataset is a list of dictionaries
Contributor:

Is this comment accurate? Aren't you already checking for this with the `isinstance()`s above?

dyastremsky (author):

Removed, good catch!

"rows": [{"row": {"text_input": item}} for item in dataset],
}
else:
raise ValueError("Dataset is not in a recognized format.")
Contributor:

I would print the dataset for the user in the error message

Contributor:

This method has always bothered me. I think it's because I need to relearn the generic format every time we add support. I don't know if printing the dataset in the message is the right call or if we need to document the schema somewhere. Thoughts?

dyastremsky (author):

Added the print statement to help with debugging. I agree the flow might be a bit ambiguous for the user for now, which is something we can consider in the modularization, as we want to make it easier for users to add tasks. If we want to redesign the flow or document the schema, we should do that outside of this feature branch IMO.
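With the print-the-dataset suggestion applied, the method's fallback branch echoes the rejected input back to the user. A sketch, with `ValueError` kept as in the snippet above and the branch ordering assumed:

```python
from typing import Any, Dict, List


def to_generic(dataset: List[Any]) -> Dict[str, Any]:
    if isinstance(dataset, list) and len(dataset) > 0:
        if isinstance(dataset[0], dict):
            return {"rows": [{"row": item} for item in dataset]}
        # A list of plain strings: wrap each as a text_input row.
        return {"rows": [{"row": {"text_input": item}} for item in dataset]}
    # Include the dataset in the error so the user can see what was rejected.
    raise ValueError(f"Dataset is not in a recognized format: {dataset!r}")
```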

model = self._select_model_name(model_name, index, model_selection_strategy)
payload = {"prompt": text_content}

if add_model_name:
Contributor:

This section is identical for both OpenAICompletion converters - possible refactor here

dyastremsky (author):

Thanks for noticing the overlap. I think we should not do that, as this is a step towards creating converters for each endpoint/task. We want to keep them completely decoupled from one another. We're ultimately going in that direction, so trying to pull out functionality would only make sense if it can be extracted into helper functions that are useful across the board for different endpoints.
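For context, the duplicated section has this general shape. This is a sketch: round-robin selection stands in for `_select_model_name()`, and per the author's reply, each converter intentionally keeps its own copy of this logic:

```python
from typing import Any, Dict, List


def build_payload(
    model_names: List[str],
    index: int,
    text_content: str,
    add_model_name: bool,
) -> Dict[str, Any]:
    # Round-robin model selection stands in for _select_model_name().
    model = model_names[index % len(model_names)]
    payload: Dict[str, Any] = {"prompt": text_content}
    if add_model_name:
        payload["model"] = model
    return payload
```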


for index, row in enumerate(generic_dataset["rows"]):
if "query" not in row or "passages" not in row:
raise GenAIPerfException(
Contributor:

Print out the malformed query in the error message

dyastremsky (author):

Done!
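The fix here mirrors the earlier one: include the malformed row (and its index) in the exception. A sketch with `ValueError` standing in for `GenAIPerfException` and the message wording assumed:

```python
from typing import Any, Dict


def validate_rankings_row(index: int, row: Dict[str, Any]) -> None:
    # Surface the row index and contents so the user can find the problem.
    if "query" not in row or "passages" not in row:
        raise ValueError(f"Row {index} is missing 'query' or 'passages': {row}")
```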

@@ -0,0 +1,53 @@
# Copyright 2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
Contributor:

Not a fan of the filename....but, I can't think of something that's more descriptive at the moment.

Contributor:

maybe common.py or utils.py?

Contributor:

This is my comment too. Can we move this to something like inputs_shared_utilities or something like that?

dyastremsky (author):

Renamed to inputs_utils.py for now. If folks prefer and agree to a different name later, happy to update in another PR for this feature branch (or we can always update it later as well).


Commits pushed:

  • Address feedback
  • Standardize copyrights
  • Print more details in exception
  • Address feedback
dyastremsky (author):
I did my best to address everyone's comments. If I missed any or you have additional feedback, please let me know.

@dyastremsky dyastremsky requested a review from nv-braf July 16, 2024 23:00
@dyastremsky commented Jul 24, 2024
@nv-braf @nv-hwoo @debermudez @matthewkotila Are you good to approve this? I can merge it to the feature branch. My immediate next step will be to rebase on the current main in a separate PR, then address the outstanding future work items above.

@matthewkotila (Contributor) replied:

I'll defer to others; I'm working on other priorities at the moment.

@dyastremsky dyastremsky merged commit 5a55a7e into feat-inputs-refactor Jul 24, 2024
5 checks passed
@dyastremsky dyastremsky deleted the dyas-inputs-refactor branch July 24, 2024 17:15