Showing 1 changed file with 7 additions and 12 deletions.
@@ -16,6 +16,7 @@
- [📜 License](#-license)
- [📚 Citation](#-citation)
- [📮 Contact](#-contact)
- [Acknowledgement](#acknowledgement)

# 🚀 Introduction
Welcome to our GitHub repository for the "Evaluating Open-QA Evaluation" [paper](https://arxiv.org/abs/2305.12421), a comprehensive study of evaluation methods for Open Question Answering (Open-QA) systems.
@@ -33,11 +34,7 @@

Each data point in our dataset is represented as a dictionary with the following keys:
```
"question": The question asked in the Open-QA task.
"golden_answer": The gold-standard answer to the question.
"answer_fid", "answer_gpt35", "answer_chatgpt", "answer_gpt4", "answer_newbing": The answers generated by different models (FiD, GPT-3.5, ChatGPT-3.5, GPT-4, and New Bing, respectively).
"judge_fid", "judge_gpt35", "judge_chatgpt", "judge_gpt4", "judge_newbing": Boolean values indicating whether the corresponding model's answer was judged correct by a human (True for correct, False for incorrect).
"improper": Boolean flag indicating whether the question was inappropriate (True for inappropriate, False for proper).
```
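As a sketch of this schema, a data point is a plain dictionary; the snippet below builds one and checks it against the documented keys. All string values here are invented placeholders for illustration, not entries from the dataset:

```python
# Hypothetical data point following the documented schema.
# Every question/answer string is an invented placeholder, not real data.
data_point = {
    "question": "Example question?",
    "golden_answer": "Example gold answer",
    # One generated answer and one human judgement per model.
    "answer_fid": "Example FiD answer",
    "answer_gpt35": "Example GPT-3.5 answer",
    "answer_chatgpt": "Example ChatGPT answer",
    "answer_gpt4": "Example GPT-4 answer",
    "answer_newbing": "Example New Bing answer",
    "judge_fid": False,
    "judge_gpt35": True,
    "judge_chatgpt": True,
    "judge_gpt4": True,
    "judge_newbing": True,
    "improper": False,  # True would mark the question as inappropriate
}

MODELS = ("fid", "gpt35", "chatgpt", "gpt4", "newbing")

def validate(point: dict) -> bool:
    """Check that a data point carries every documented key with the right type."""
    if not isinstance(point.get("question"), str):
        return False
    if not isinstance(point.get("golden_answer"), str):
        return False
    for m in MODELS:
        if not isinstance(point.get(f"answer_{m}"), str):
            return False
        if not isinstance(point.get(f"judge_{m}"), bool):
            return False
    return isinstance(point.get("improper"), bool)

print(validate(data_point))  # → True
```

The `validate` helper is not part of the repository; it only restates the key/type contract described above.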
Here is an example of a data point:
```json
@@ -54,7 +51,6 @@
|ChatGPT-4 |3610|2000|
|Bing Chat |3610|2000|

# 🏆 Evaluation & Submission
@@ -64,14 +60,13 @@
graph LR
A[🤗 Huggingface] --(Input Data)--> B[🤖Your Model]
B --(Model output)--> C[⚖️Codabench]
C --(Accuracy Score)--> D[🗳️Google Form]
D ----> E[🏆Leaderboard Website]
A[[🤗 Huggingface]] --(Input Data)--> B[[🤖Your Model]]
B --(Model output)--> C[[⚖️Codabench]]
C --(Accuracy Score)--> D[[🗳️Google Form]]
D ----> E[[🏆Leaderboard Website]]
```
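Before submitting model output to Codabench, it can be useful to sanity-check predictions against the golden answers locally. The normalized exact-match below is a common Open-QA heuristic, not necessarily the leaderboard's official metric:

```python
import re
import string

def normalize(text: str) -> str:
    """Lowercase, drop punctuation and articles, collapse whitespace.

    A common Open-QA answer normalization; assumed here, not taken from this repo.
    """
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction: str, golden: str) -> bool:
    """True when the normalized prediction equals the normalized golden answer."""
    return normalize(prediction) == normalize(golden)

print(exact_match("The Eiffel Tower!", "eiffel tower"))  # → True
print(exact_match("Paris", "eiffel tower"))              # → False
```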

# 📜 License

This dataset is released under the [Apache-2.0 License](LICENSE).
@@ -86,7 +81,7 @@ If you use this dataset in your research, please cite it as follows:
We welcome contributions to improve this dataset!
If you have any questions or feedback, please feel free to reach out at [email protected].

## Acknowledgement

This leaderboard adopts the style of [bird-bench](https://github.com/bird-bench/bird-bench.github.io).