
Refactor LLM inputs into multiple files #752

Merged
merged 3 commits into feat-inputs-refactor from dyas-inputs-refactor on Jul 24, 2024

Conversation

@dyastremsky (Contributor) commented Jul 12, 2024

As a step in refactoring LLM inputs, split up the monolithic file into multiple components:

  • Dataset Retriever: pulls the data from its source into an expected format.
  • JSON Converter: converts the data into a generic JSON format.
  • OutputFormat Converter: converts the generic JSON format into the formats needed for a specific endpoint. There is a base class, plus a class for each endpoint/format. In a future PR, this file will likely be broken up into multiple files.
  • Shared: holds the shared enums used by the LLM inputs.
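The split described above can be sketched roughly as follows. This is an illustrative sketch only; the class names match the components listed, but the method names and data shapes are assumptions, not the PR's exact code:

```python
from typing import Any, Dict, List


class DatasetRetriever:
    """Pulls the data from its source into an expected format."""

    @staticmethod
    def from_list(prompts: List[str]) -> List[Dict[str, Any]]:
        return [{"text_input": p} for p in prompts]


class JSONConverter:
    """Converts the retrieved data into a generic JSON format."""

    @staticmethod
    def to_generic(dataset: List[Dict[str, Any]]) -> Dict[str, Any]:
        return {"rows": [{"row": item} for item in dataset]}


class OutputFormatConverter:
    """Base class; one subclass per endpoint/format."""

    def convert(self, generic: Dict[str, Any]) -> List[Dict[str, Any]]:
        raise NotImplementedError


class CompletionsConverter(OutputFormatConverter):
    """Hypothetical endpoint-specific converter."""

    def convert(self, generic: Dict[str, Any]) -> List[Dict[str, Any]]:
        return [{"prompt": r["row"]["text_input"]} for r in generic["rows"]]


generic = JSONConverter.to_generic(DatasetRetriever.from_list(["hello"]))
payloads = CompletionsConverter().convert(generic)
```

Each stage only depends on the output shape of the previous one, which is what makes the monolithic file separable.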

This PR targets a feature branch. After this PR, all endpoints work except rankings (due to requiring multiple files and pre-processing that likely requires a new pre-processing class to work with this paradigm).

Future work:

  • Add generic multi-file processing
  • Add preprocessing class that allows for data to be composed differently before being translated to the generic format
  • Add ranking support
  • Add batching support for embedding and ranking models
  • Compare against old code to identify and fix any misalignment in what's supported (e.g. output token counts, etc.)
  • Get all commented out tests passing
  • Confirm all endpoints still work with all flags (including --streaming and --extra-inputs) and on all currently supported platforms (including NIM)
  • Update unit tests to comprehensively test new format

Outside-of-feature future work:

@dyastremsky dyastremsky self-assigned this Jul 12, 2024
@dyastremsky dyastremsky force-pushed the dyas-inputs-refactor branch 3 times, most recently from 46b35f8 to 29db2ee Compare July 12, 2024 21:04
@@ -0,0 +1,51 @@
import json

Code scanning / CodeQL notice: Import of 'json' is not used.
@@ -0,0 +1,31 @@
from typing import Any, Dict, List, Union

Code scanning / CodeQL notice: Import of 'Union' is not used.
from pathlib import Path
from typing import Any, Dict, List, Optional, Tuple, cast
from typing import Dict, List, Optional, cast

Code scanning / CodeQL notice: Import of 'List' is not used.
@@ -0,0 +1,258 @@
import json
import random
from typing import Any, Dict, List, Union

Code scanning / CodeQL notice: Imports of 'Union' and 'Any' are not used.
# See the License for the specific language governing permissions and
# limitations under the License.

import json

Code scanning / CodeQL notice: Import of 'json' is not used.
@@ -40,12 +40,21 @@
DEFAULT_COMPARE_DIR,
OPEN_ORCA,
)
from genai_perf.llm_inputs.llm_inputs import (
LlmInputs,
from genai_perf.llm_inputs.llm_inputs import LlmInputs

Code scanning / CodeQL notice: Import of 'LlmInputs' is not used.
Comment on lines +91 to +94
# if output_format == OutputFormat.RANKINGS:
# dataset = DatasetRetriever.from_directory(input_filename)
# else:

Code scanning / CodeQL notice: This comment appears to contain commented-out code.
@dyastremsky dyastremsky changed the base branch from main to feat-inputs-refactor July 15, 2024 22:54
@dyastremsky dyastremsky force-pushed the dyas-inputs-refactor branch 2 times, most recently from 4cd5a1a to 5b03325 Compare July 16, 2024 00:00
@@ -29,53 +29,64 @@

import pytest
from genai_perf.llm_inputs.llm_inputs import LlmInputs, ModelSelectionStrategy
from genai_perf.llm_inputs.shared import OutputFormat, PromptSource

Code scanning / CodeQL notice (test): Import of 'PromptSource' is not used.
Commits pushed:

  • Get all filepaths working for chat/completions
  • Check invalid input type combinations
  • Catch JSON
  • Fix tests, add copyrights
  • Fix vLLM backend, add JSON input file check
  • Fix TRT-LLM backend
  • Remove unused imports
  • Get tests passing
  • Remove unused imports
@dyastremsky dyastremsky marked this pull request as ready for review July 16, 2024 00:08
@nv-braf (Contributor) left a comment:

A big improvement! Looks really good so far.



class DatasetRetriever:
@staticmethod
Contributor:

Add a short description about what this class does

dyastremsky (author):

Done! Good suggestion, done for all the classes.

)
if "text_input" not in item:
raise GenAIPerfException(
"Missing 'text_input' field in one or more items."
Contributor:

Don't know how often this error would occur, but it might be nice to print the malformed item so the user doesn't have to go searching for the problem through a large JSON file.

Contributor:

Good suggestion

dyastremsky (author):

Done!
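The fix the reviewer asked for amounts to embedding the offending item in the exception text. A minimal sketch, with `ValueError` standing in for `GenAIPerfException` and the exact message wording assumed:

```python
from typing import Any, Dict, List


def validate_items(items: List[Dict[str, Any]]) -> None:
    # Include the malformed item in the message so the user doesn't
    # have to search through a large JSON file to find it.
    for item in items:
        if "text_input" not in item:
            raise ValueError(f"Missing 'text_input' field in item: {item}")
```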

dataset: Dict = {"rows": []}
data = {}

# Check all JSONL files in the directory
Contributor:

I feel like each of these should be their own method. Something like _load_data_from_file() and _create_dataset().

dyastremsky (author):

I think a counterpoint here would be that the file so far is clean (each function is a from_* method) and the function is still short (<15 lines of code) and clean. The new functions would only be used for the directory source and would each be ~5 lines. I can see both sides, though.

If you think it's valuable for readability to add other functions, I can add a utils section at the bottom of the file for the lower-level functions that help the from_* functions.
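For reference, the reviewer's proposed split might look like the sketch below. The `_load_data_from_file()` and `_create_dataset()` names come from the comment above; the JSONL-reading details and the generic "rows" shape are assumptions:

```python
import json
from pathlib import Path
from typing import Any, Dict, List


class DatasetRetriever:
    @staticmethod
    def from_directory(directory: str) -> Dict[str, Any]:
        # Check all JSONL files in the directory.
        data: Dict[str, List[str]] = {}
        for path in sorted(Path(directory).glob("*.jsonl")):
            data[path.stem] = DatasetRetriever._load_data_from_file(path)
        return DatasetRetriever._create_dataset(data)

    @staticmethod
    def _load_data_from_file(path: Path) -> List[str]:
        # One JSON object per line; skip blank lines.
        with open(path) as f:
            return [json.loads(line)["text_input"] for line in f if line.strip()]

    @staticmethod
    def _create_dataset(data: Dict[str, List[str]]) -> Dict[str, Any]:
        rows = [{"row": {"text_input": text}}
                for texts in data.values() for text in texts]
        return {"rows": rows}
```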

def to_generic(dataset: List[Dict[str, Any]]) -> Dict:
if isinstance(dataset, list) and len(dataset) > 0:
if isinstance(dataset[0], dict):
# Assume dataset is a list of dictionaries
Contributor:

Is this comment accurate? Aren't you already checking for this with the `isinstance()`s above?

dyastremsky (author):

Removed, good catch!

"rows": [{"row": {"text_input": item}} for item in dataset],
}
else:
raise ValueError("Dataset is not in a recognized format.")
Contributor:

I would print the dataset for the user in the error message

Contributor:

This method has always bothered me. I think it's because I need to relearn the generic format every time we add support. I don't know if printing the dataset in the message is the right call or if we need to document the schema somewhere. Thoughts?

dyastremsky (author):

Added the print statement to help with debugging. I agree the flow might be a bit ambiguous for the user for now, which is something we can consider in the modularization, as we want to make it easier for users to add tasks. If we want to redesign the flow or document the schema, we should do that outside of this feature branch IMO.
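With the print-the-dataset suggestion applied, the method's fallback branch echoes the rejected input back to the user. A sketch, with `ValueError` kept as in the snippet above and the branch ordering assumed:

```python
from typing import Any, Dict, List


def to_generic(dataset: List[Any]) -> Dict[str, Any]:
    if isinstance(dataset, list) and len(dataset) > 0:
        if isinstance(dataset[0], dict):
            return {"rows": [{"row": item} for item in dataset]}
        # A list of plain strings: wrap each as a text_input row.
        return {"rows": [{"row": {"text_input": item}} for item in dataset]}
    # Include the dataset in the error so the user can see what was rejected.
    raise ValueError(f"Dataset is not in a recognized format: {dataset!r}")
```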

model = self._select_model_name(model_name, index, model_selection_strategy)
payload = {"prompt": text_content}

if add_model_name:
Contributor:

This section is identical for both OpenAICompletion converters - possible refactor here

dyastremsky (author):

Thanks for noticing the overlap. I think we should not do that, as this is a step towards creating converters for each endpoint/task. We want to keep them completely decoupled from one another. We're ultimately going in that direction, so trying to pull out functionality would only make sense if it can be extracted into helper functions that are useful across the board for different endpoints.
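For context, the duplicated section has this general shape. This is a sketch: round-robin selection stands in for `_select_model_name()`, and per the author's reply, each converter intentionally keeps its own copy of this logic:

```python
from typing import Any, Dict, List


def build_payload(
    model_names: List[str],
    index: int,
    text_content: str,
    add_model_name: bool,
) -> Dict[str, Any]:
    # Round-robin model selection stands in for _select_model_name().
    model = model_names[index % len(model_names)]
    payload: Dict[str, Any] = {"prompt": text_content}
    if add_model_name:
        payload["model"] = model
    return payload
```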


for index, row in enumerate(generic_dataset["rows"]):
if "query" not in row or "passages" not in row:
raise GenAIPerfException(
Contributor:

Print out the malformed query in the error message

dyastremsky (author):

Done!
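The fix here mirrors the earlier one: include the malformed row (and its index) in the exception. A sketch with `ValueError` standing in for `GenAIPerfException` and the message wording assumed:

```python
from typing import Any, Dict


def validate_rankings_row(index: int, row: Dict[str, Any]) -> None:
    # Surface the row index and contents so the user can find the problem.
    if "query" not in row or "passages" not in row:
        raise ValueError(f"Row {index} is missing 'query' or 'passages': {row}")
```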

@@ -0,0 +1,53 @@
# Copyright 2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
Contributor:

Not a fan of the filename....but, I can't think of something that's more descriptive at the moment.

Contributor:

maybe common.py or utils.py?

Contributor:

This is my comment too. Can we move this to something like inputs_shared_utilities or something like that?

dyastremsky (author):

Renamed to inputs_utils.py for now. If folks prefer and agree to a different name later, happy to update in another PR for this feature branch (or we can always update it later as well).


Commits pushed:

  • Address feedback
  • Standardize copyrights
  • Print more details in exception
  • Address feedback
dyastremsky (author):
I did my best to address everyone's comments. If I missed any or you have additional feedback, please let me know.

@dyastremsky dyastremsky requested a review from nv-braf July 16, 2024 23:00
@dyastremsky commented Jul 24, 2024
@nv-braf @nv-hwoo @debermudez @matthewkotila Are you good to approve this? I can merge it to the feature branch. My immediate next step will be to rebase on the current main in a separate PR, then address the outstanding future work items above.

@matthewkotila (Contributor) replied:

I'll defer to others; I'm working on other priorities at the moment.

@dyastremsky dyastremsky merged commit 5a55a7e into feat-inputs-refactor Jul 24, 2024
5 checks passed
@dyastremsky dyastremsky deleted the dyas-inputs-refactor branch July 24, 2024 17:15