Refactor LLM inputs into multiple files (#752)
dyastremsky authored Jul 24, 2024
1 parent db888f1 commit 5a55a7e
Showing 17 changed files with 1,142 additions and 2,149 deletions.
16 changes: 8 additions & 8 deletions src/c++/perf_analyzer/genai-perf/docs/embeddings.md
@@ -36,18 +36,18 @@ GenAI-Perf allows you to profile embedding models running on an
To create a sample embeddings input file, use the following command:

```bash
echo '{"text": "What was the first car ever driven?"}
{"text": "Who served as the 5th President of the United States of America?"}
{"text": "Is the Sydney Opera House located in Australia?"}
{"text": "In what state did they film Shrek 2?"}' > embeddings.jsonl
echo '{"text_input": "What was the first car ever driven?"}
{"text_input": "Who served as the 5th President of the United States of America?"}
{"text_input": "Is the Sydney Opera House located in Australia?"}
{"text_input": "In what state did they film Shrek 2?"}' > embeddings.jsonl
```

This will generate a file named embeddings.jsonl with the following content:
```jsonl
{"text": "What was the first car ever driven?"}
{"text": "Who served as the 5th President of the United States of America?"}
{"text": "Is the Sydney Opera House located in Australia?"}
{"text": "In what state did they film Shrek 2?"}
{"text_input": "What was the first car ever driven?"}
{"text_input": "Who served as the 5th President of the United States of America?"}
{"text_input": "Is the Sydney Opera House located in Australia?"}
{"text_input": "In what state did they film Shrek 2?"}
```
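
The renamed key matters because the new `DatasetRetriever.from_file` in this commit rejects any line that is not an object with exactly one `text_input` field. The following is a minimal standalone sketch of that check, not the project's code: `check_jsonl_lines` is a hypothetical helper, `json.loads` stands in for `load_json_str`, and skipping blank lines is an added assumption.

```python
import json
from typing import List


def check_jsonl_lines(lines: List[str]) -> List[str]:
    """Return the text_input values, raising ValueError on a malformed line.

    Hypothetical helper that mirrors the validation rules in from_file.
    """
    prompts = []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            # Assumption: blank lines are ignored rather than rejected.
            continue
        item = json.loads(line)
        # Each line must be a JSON object with "text_input" as its only key.
        if not isinstance(item, dict) or set(item) != {"text_input"}:
            raise ValueError(
                f"Line {number}: expected exactly one 'text_input' field, got {item!r}"
            )
        prompts.append(item["text_input"])
    return prompts


if __name__ == "__main__":
    sample = ['{"text_input": "In what state did they film Shrek 2?"}']
    print(check_jsonl_lines(sample))
```

A line that still uses the old `"text"` key fails this check, which is why the sample files in the docs were updated.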

## Start an OpenAI Embeddings-Compatible Server
16 changes: 8 additions & 8 deletions src/c++/perf_analyzer/genai-perf/docs/rankings.md
@@ -44,19 +44,19 @@ mkdir rankings_jsonl
Inside this directory, create a JSONL file named queries.jsonl with queries data:

```bash
echo '{"text": "What was the first car ever driven?"}
{"text": "Who served as the 5th President of the United States of America?"}
{"text": "Is the Sydney Opera House located in Australia?"}
{"text": "In what state did they film Shrek 2?"}' > rankings_jsonl/queries.jsonl
echo '{"text_input": "What was the first car ever driven?"}
{"text_input": "Who served as the 5th President of the United States of America?"}
{"text_input": "Is the Sydney Opera House located in Australia?"}
{"text_input": "In what state did they film Shrek 2?"}' > rankings_jsonl/queries.jsonl
```

Create another JSONL file named passages.jsonl with passages data:

```bash
echo '{"text": "Eric Anderson (born January 18, 1968) is an American sociologist and sexologist."}
{"text": "Kevin Loader is a British film and television producer."}
{"text": "Francisco Antonio Zea Juan Francisco Antonio Hilari was a Colombian journalist, botanist, diplomat, politician, and statesman who served as the 1st Vice President of Colombia."}
{"text": "Daddys Home 2 Principal photography on the film began in Massachusetts in March 2017 and it was released in the United States by Paramount Pictures on November 10, 2017. Although the film received unfavorable reviews, it has grossed over $180 million worldwide on a $69 million budget."}' > rankings_jsonl/passages.jsonl
echo '{"text_input": "Eric Anderson (born January 18, 1968) is an American sociologist and sexologist."}
{"text_input": "Kevin Loader is a British film and television producer."}
{"text_input": "Francisco Antonio Zea Juan Francisco Antonio Hilari was a Colombian journalist, botanist, diplomat, politician, and statesman who served as the 1st Vice President of Colombia."}
{"text_input": "Daddys Home 2 Principal photography on the film began in Massachusetts in March 2017 and it was released in the United States by Paramount Pictures on November 10, 2017. Although the film received unfavorable reviews, it has grossed over $180 million worldwide on a $69 million budget."}' > rankings_jsonl/passages.jsonl
```

## Start a Hugging Face Re-Ranker-Compatible Server
32 changes: 22 additions & 10 deletions src/c++/perf_analyzer/genai-perf/genai_perf/exceptions.py
@@ -1,16 +1,28 @@
# Copyright 2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions
# are met:
# * Redistributions of source code must retain the above copyright
# notice, this list of conditions and the following disclaimer.
# * Redistributions in binary form must reproduce the above copyright
# notice, this list of conditions and the following disclaimer in the
# documentation and/or other materials provided with the distribution.
# * Neither the name of NVIDIA CORPORATION nor the names of its
# contributors may be used to endorse or promote products derived
# from this software without specific prior written permission.
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY
# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR
# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.


class GenAIPerfException(Exception):
32 changes: 22 additions & 10 deletions src/c++/perf_analyzer/genai-perf/genai_perf/llm_inputs/__init__.py
@@ -1,13 +1,25 @@
# Copyright 2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions
# are met:
# * Redistributions of source code must retain the above copyright
# notice, this list of conditions and the following disclaimer.
# * Redistributions in binary form must reproduce the above copyright
# notice, this list of conditions and the following disclaimer in the
# documentation and/or other materials provided with the distribution.
# * Neither the name of NVIDIA CORPORATION nor the names of its
# contributors may be used to endorse or promote products derived
# from this software without specific prior written permission.
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY
# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR
# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
@@ -0,0 +1,115 @@
# Copyright 2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions
# are met:
# * Redistributions of source code must retain the above copyright
# notice, this list of conditions and the following disclaimer.
# * Redistributions in binary form must reproduce the above copyright
# notice, this list of conditions and the following disclaimer in the
# documentation and/or other materials provided with the distribution.
# * Neither the name of NVIDIA CORPORATION nor the names of its
# contributors may be used to endorse or promote products derived
# from this software without specific prior written permission.
#
# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY
# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR
# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

from pathlib import Path
from typing import Any, Dict, List

import requests
from genai_perf.exceptions import GenAIPerfException
from genai_perf.llm_inputs.synthetic_prompt_generator import SyntheticPromptGenerator
from genai_perf.tokenizer import Tokenizer
from genai_perf.utils import load_json_str


class DatasetRetriever:
    """
    This class retrieves the dataset from different sources and formats it into a corresponding format.
    """

    @staticmethod
    def from_url(url: str, starting_index: int, length: int) -> List[Dict[str, Any]]:
        url += f"&offset={starting_index}&length={length}"
        response = requests.get(url)
        response.raise_for_status()
        dataset = response.json()
        rows = dataset.get("rows", [])[starting_index : starting_index + length]
        formatted_rows = [
            {
                "text_input": row["row"].get("question", ""),
                "system_prompt": row["row"].get("system_prompt", ""),
                "response": row["row"].get("response", ""),
            }
            for row in rows
        ]
        return formatted_rows

    @staticmethod
    def from_file(file_path: Path) -> List[Dict[str, str]]:
        with open(file_path, "r") as file:
            data = [load_json_str(line) for line in file]

            for item in data:
                if not isinstance(item, dict):
                    raise GenAIPerfException(
                        "File content is not in the expected format."
                    )
                if "text_input" not in item:
                    raise GenAIPerfException(
                        f"Missing 'text_input' field in file item: {item}"
                    )
                if len(item) != 1:
                    raise GenAIPerfException(
                        f"Field other than 'text_input' field found in file item: {item}"
                    )

            return [{"text_input": item["text_input"]} for item in data]

    @staticmethod
    def from_directory(directory_path: Path) -> Dict:
        # TODO: Add support for an extra preprocessing step after loading the files to optionally create/modify the dataset.
        # For files calling this method (e.g. rankings), it is a must to create the dataset before converting to the generic format.
        dataset: Dict = {"rows": []}
        data = {}

        # Check all JSONL files in the directory
        for file_path in directory_path.glob("*.jsonl"):
            # Get the file name without suffix
            key = file_path.stem
            with open(file_path, "r") as file:
                data[key] = [load_json_str(line) for line in file]

        # Create rows with keys based on file names without suffix
        num_entries = len(next(iter(data.values())))
        for i in range(num_entries):
            row = {key: data[key][i] for key in data}
            dataset["rows"].append({"row": row})

        return dataset

    @staticmethod
    def from_synthetic(
        tokenizer: Tokenizer,
        prompt_tokens_mean: int,
        prompt_tokens_stddev: int,
        num_of_output_prompts: int,
    ) -> List[Dict[str, str]]:
        synthetic_prompts = []
        for _ in range(num_of_output_prompts):
            synthetic_prompt = SyntheticPromptGenerator.create_synthetic_prompt(
                tokenizer, prompt_tokens_mean, prompt_tokens_stddev
            )
            synthetic_prompts.append({"text_input": synthetic_prompt})
        return synthetic_prompts
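
To make the shape produced by `from_directory` concrete, here is a rough standalone sketch of how the rankings files (`queries.jsonl`, `passages.jsonl`) might be combined into rows keyed by file stem. It is an illustration under assumptions rather than the project's code: `rows_from_directory` and `demo` are hypothetical names, `json.loads` stands in for `load_json_str`, and the sorted iteration, blank-line skipping, and empty-directory guard are additions not present in the method above.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Dict, List


def rows_from_directory(directory_path: Path) -> Dict[str, List[Dict[str, Any]]]:
    """Combine every *.jsonl file in a directory into rows keyed by file stem."""
    data: Dict[str, List[Any]] = {}
    # Assumption: sorted() is added here only to make the order deterministic.
    for file_path in sorted(directory_path.glob("*.jsonl")):
        with open(file_path, "r") as file:
            # json.loads stands in for genai_perf.utils.load_json_str (assumption).
            data[file_path.stem] = [json.loads(line) for line in file if line.strip()]

    # Assumption: return an empty dataset when no JSONL files are found.
    if not data:
        return {"rows": []}

    # As in the method above, the first file's length sets the number of rows.
    num_entries = len(next(iter(data.values())))
    return {
        "rows": [
            {"row": {key: data[key][i] for key in data}} for i in range(num_entries)
        ]
    }


def demo() -> Dict[str, List[Dict[str, Any]]]:
    """Write two one-line JSONL files to a temporary directory and combine them."""
    with tempfile.TemporaryDirectory() as tmp:
        tmp_path = Path(tmp)
        (tmp_path / "queries.jsonl").write_text(
            '{"text_input": "Is the Sydney Opera House located in Australia?"}\n'
        )
        (tmp_path / "passages.jsonl").write_text(
            '{"text_input": "Kevin Loader is a British film and television producer."}\n'
        )
        return rows_from_directory(tmp_path)


if __name__ == "__main__":
    print(demo())
```

Each row pairs the i-th entry of every file, so files of unequal length would raise an `IndexError` in this sketch once a shorter file runs out of entries.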
@@ -0,0 +1,65 @@
# Copyright 2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions
# are met:
# * Redistributions of source code must retain the above copyright
# notice, this list of conditions and the following disclaimer.
# * Redistributions in binary form must reproduce the above copyright
# notice, this list of conditions and the following disclaimer in the
# documentation and/or other materials provided with the distribution.
# * Neither the name of NVIDIA CORPORATION nor the names of its
# contributors may be used to endorse or promote products derived
# from this software without specific prior written permission.
#
# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY
# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR
# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

from enum import Enum, auto


class ModelSelectionStrategy(Enum):
    ROUND_ROBIN = auto()
    RANDOM = auto()


class PromptSource(Enum):
    SYNTHETIC = auto()
    DATASET = auto()
    FILE = auto()

    def to_lowercase(self):
        return self.name.lower()


class OutputFormat(Enum):
    OPENAI_CHAT_COMPLETIONS = auto()
    OPENAI_COMPLETIONS = auto()
    OPENAI_EMBEDDINGS = auto()
    RANKINGS = auto()
    TENSORRTLLM = auto()
    VLLM = auto()

    def to_lowercase(self):
        return self.name.lower()


DEFAULT_STARTING_INDEX = 0
DEFAULT_LENGTH = 100
DEFAULT_TENSORRTLLM_MAX_TOKENS = 256
DEFAULT_BATCH_SIZE = 1
DEFAULT_RANDOM_SEED = 0
DEFAULT_PROMPT_TOKENS_MEAN = 550
DEFAULT_PROMPT_TOKENS_STDDEV = 0
DEFAULT_OUTPUT_TOKENS_MEAN = -1
DEFAULT_OUTPUT_TOKENS_STDDEV = 0
DEFAULT_NUM_PROMPTS = 100
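
As a small illustration of what `to_lowercase` returns, the sketch below redefines `OutputFormat` locally so that it runs on its own. `parse_output_format` is a hypothetical helper added only to show the reverse lookup by member name; it is not part of this commit.

```python
from enum import Enum, auto


class OutputFormat(Enum):
    OPENAI_CHAT_COMPLETIONS = auto()
    OPENAI_COMPLETIONS = auto()
    OPENAI_EMBEDDINGS = auto()
    RANKINGS = auto()
    TENSORRTLLM = auto()
    VLLM = auto()

    def to_lowercase(self) -> str:
        # The member name, lowercased: RANKINGS becomes "rankings".
        return self.name.lower()


def parse_output_format(name: str) -> OutputFormat:
    """Hypothetical helper: look up a member by its lowercase name."""
    try:
        return OutputFormat[name.upper()]
    except KeyError:
        raise ValueError(f"Unknown output format: {name!r}") from None


if __name__ == "__main__":
    print(OutputFormat.OPENAI_CHAT_COMPLETIONS.to_lowercase())
    print(parse_output_format("vllm"))
```

Because `to_lowercase` is derived from the member name rather than its value, the numeric values assigned by `auto()` never appear in the resulting strings.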
@@ -0,0 +1,65 @@
# Copyright 2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions
# are met:
# * Redistributions of source code must retain the above copyright
# notice, this list of conditions and the following disclaimer.
# * Redistributions in binary form must reproduce the above copyright
# notice, this list of conditions and the following disclaimer in the
# documentation and/or other materials provided with the distribution.
# * Neither the name of NVIDIA CORPORATION nor the names of its
# contributors may be used to endorse or promote products derived
# from this software without specific prior written permission.
#
# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY
# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR
# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

from typing import Any, Dict, List


class JSONConverter:
    """
    This class converts the dataset into a generic format that
    is agnostic of the data source.
    """

    @staticmethod
    def to_generic(dataset: List[Dict[str, Any]]) -> Dict:
        if isinstance(dataset, list) and len(dataset) > 0:
            if isinstance(dataset[0], dict):
                converted_data = []
                for item in dataset:
                    row_data = {
                        "text_input": item.get("text_input", ""),
                        "system_prompt": item.get("system_prompt", ""),
                        "response": item.get("response", ""),
                    }
                    converted_data.append(row_data)
                return {
                    "features": ["text_input", "system_prompt", "response"],
                    "rows": [{"row": item} for item in converted_data],
                }
            elif isinstance(dataset[0], str):
                # Assume dataset is a list of strings
                return {
                    "features": ["text_input"],
                    "rows": [{"row": {"text_input": item}} for item in dataset],
                }
            else:
                raise ValueError(
                    f"Dataset is not in a recognized format. Dataset: `{dataset}`"
                )
        else:
            raise ValueError(
                f"Dataset is empty or not in a recognized format. Dataset: `{dataset}`"
            )
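
The generic format is easiest to see with a concrete input. The sketch below is a condensed standalone rewrite of the same conversion, assuming the behavior shown above; `to_generic` here is a free function and `FEATURES` is a hypothetical constant introduced for brevity, so it should be read as an illustration rather than the class itself.

```python
from typing import Any, Dict, List, Union

# Hypothetical constant: the three fields the generic format carries for dict rows.
FEATURES = ["text_input", "system_prompt", "response"]


def to_generic(dataset: List[Union[Dict[str, Any], str]]) -> Dict[str, Any]:
    """Convert a list of dicts or a list of strings into the generic format."""
    if not isinstance(dataset, list) or len(dataset) == 0:
        raise ValueError(
            f"Dataset is empty or not in a recognized format. Dataset: `{dataset}`"
        )
    if isinstance(dataset[0], dict):
        # Missing fields default to an empty string, as in the class above.
        rows = [
            {"row": {name: item.get(name, "") for name in FEATURES}}
            for item in dataset
        ]
        return {"features": list(FEATURES), "rows": rows}
    if isinstance(dataset[0], str):
        return {
            "features": ["text_input"],
            "rows": [{"row": {"text_input": item}} for item in dataset],
        }
    raise ValueError(f"Dataset is not in a recognized format. Dataset: `{dataset}`")


if __name__ == "__main__":
    print(to_generic([{"text_input": "What was the first car ever driven?"}]))
    print(to_generic(["In what state did they film Shrek 2?"]))
```

As in the class above, only the first element decides which branch is taken, so a list that mixes dicts and strings is not detected by this check.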