Commit 7c0b49f: restructure benchmark
slobentanzer committed Feb 6, 2024 (1 parent: a197daf)
1 changed file: content/40.methods.md (5 additions, 6 deletions)

The benchmarking framework implements a matrix of component combinations using the Pytest framework.
This allows the automated evaluation of all possible combinations of components, such as LLMs, prompts, and datasets.
By default, we run each test five times to account for the stochastic nature of LLMs.
We generally set the temperature to the lowest value possible for each model to decrease fluctuation.
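
As an illustration, a minimal Pytest parametrization along these lines could look as follows; the dimension values and the test body are hypothetical placeholders, not the actual benchmark code.

```python
import pytest

MODELS = ["model-a", "model-b"]    # hypothetical matrix dimensions
PROMPTS = ["prompt-1", "prompt-2"]
N_RUNS = 5                         # each combination is repeated five times


def query_llm(model: str, prompt: str, temperature: float) -> str:
    """Placeholder for the actual LLM call."""
    return f"{model} answered '{prompt}'"


@pytest.mark.parametrize("run", range(N_RUNS))
@pytest.mark.parametrize("prompt", PROMPTS)
@pytest.mark.parametrize("model", MODELS)
def test_combination(model: str, prompt: str, run: int) -> None:
    # temperature pinned to the model's minimum to decrease fluctuation
    response = query_llm(model, prompt, temperature=0.0)
    assert isinstance(response, str)  # the real tests score the response

```
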
The results are stored in a database and displayed on the website for easy comparison.
The Pytest matrix uses a hash-based system to evaluate whether a model-dataset combination has been run before.
Briefly, the hash is calculated from the dictionary representation of the test parameters, and the test is skipped if the combination of hash and model name is already present in the database.
This allows automatic running of all tests that have been newly added or modified.
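
A sketch of this skipping logic, assuming an MD5 digest over the JSON-serialized parameter dictionary and an in-memory stand-in for the results database (both are assumptions, not the confirmed implementation):

```python
import hashlib
import json

import pytest

# hypothetical set of (hash, model) pairs loaded from the results database
already_run = {("d41d8cd98f00b204e9800998ecf8427e", "model-a")}


def params_hash(params: dict) -> str:
    """Digest of the dictionary representation of the test parameters."""
    return hashlib.md5(json.dumps(params, sort_keys=True).encode()).hexdigest()


def skip_if_seen(model: str, params: dict) -> None:
    """Skip the test if this hash/model combination is already recorded."""
    if (params_hash(params), model) in already_run:
        pytest.skip("combination already benchmarked")
```
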
The individual dimensions of the matrix are:

- **LLMs**: Testing proprietary (OpenAI) and open-source models (commonly using the Xorbits Inference API and HuggingFace models) against the same set of tasks is the primary aim of our benchmarking framework. We facilitate the automation of testing by including a programmatic way of deploying open-source models.
For instance, we test the conversion of numbers (which LLMs are notoriously bad at).

- **sentiment and behaviour**: To assess whether the models exhibit the desired behaviour patterns for each of the personas, we let a second LLM evaluate the responses based on a set of criteria, including professionalism and politeness (see the sketch below).
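
As a sketch of this second-LLM evaluation, assuming an OpenAI-style chat client, a hypothetical judge model, and illustrative prompt wording:

```python
from openai import OpenAI

CRITERIA = ["professionalism", "politeness"]  # criteria named above

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def judge_response(response_text: str, persona: str) -> str:
    """Ask a second LLM to rate a response against the persona's criteria."""
    prompt = (
        f"Given the persona '{persona}', rate the following response for "
        f"{' and '.join(CRITERIA)} on a scale from 1 to 10:\n\n{response_text}"
    )
    result = client.chat.completions.create(
        model="gpt-4",  # hypothetical choice of judge model
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,
    )
    return result.choices[0].message.content
```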

The Pytest framework is implemented at [https://github.com/biocypher/biochatter/blob/main/benchmark](https://github.com/biocypher/biochatter/blob/main/benchmark), and more information is available at [https://biocypher.github.io/biochatter/benchmarking](https://biocypher.github.io/biochatter/benchmarking).
The benchmark is updated upon the release of new models and extensions to the datasets; it is available at [https://biocypher.github.io/biochatter/benchmark](https://biocypher.github.io/biochatter/benchmark).

To prevent leakage of benchmarking data (and subsequent contamination of future LLMs), we implement an encryption routine on the benchmark datasets.
The encryption is performed using a hybrid encryption scheme, where the data are encrypted with a symmetric key, which is in turn encrypted with an asymmetric key.
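
A minimal sketch of such a hybrid scheme using the `cryptography` package; the key sizes, algorithms, and key handling shown here are assumptions, not necessarily those used for the benchmark data.

```python
from cryptography.fernet import Fernet
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric import padding, rsa

# asymmetric keypair (in practice, the private key would stay with the
# maintainers rather than being generated alongside the data)
private_key = rsa.generate_private_key(public_exponent=65537, key_size=2048)
oaep = padding.OAEP(
    mgf=padding.MGF1(algorithm=hashes.SHA256()),
    algorithm=hashes.SHA256(),
    label=None,
)

# encrypt the data with a fresh symmetric key ...
sym_key = Fernet.generate_key()
ciphertext = Fernet(sym_key).encrypt(b"benchmark question and answer")

# ... and encrypt the symmetric key with the asymmetric public key
wrapped_key = private_key.public_key().encrypt(sym_key, oaep)

# decryption reverses both steps
plaintext = Fernet(private_key.decrypt(wrapped_key, oaep)).decrypt(ciphertext)
assert plaintext == b"benchmark question and answer"
```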
