Commit

ruoxining committed Mar 17, 2024
1 parent 68481b9 commit 2e8e093
Showing 1 changed file with 7 additions and 12 deletions.
README.md: 19 changes (7 additions, 12 deletions)
@@ -16,6 +16,7 @@
- [📜 License](#-license)
- [📚 Citation](#-citation)
- [📮 Contact](#-contact)
- [Acknowledgement](#acknowledgement)

# 🚀 Introduction
Welcome to our GitHub repository for the "Evaluating Open-QA Evaluation" [paper](https://arxiv.org/abs/2305.12421), a comprehensive study on evaluating the evaluation methods used for Open Question Answering (Open-QA) systems.
@@ -33,11 +34,7 @@

Each data point in our dataset is represented as a dictionary with the following keys:
```
"question": The question asked in the Open-QA task.
"golden_answer": The gold standard answer to the question.
"answer_fid", "answer_gpt35", "answer_chatgpt", "answer_gpt4", "answer_newbing": The answers generated by different models (FiD, GPT-3.5, ChatGPT-3.5, GPT-4, and New Bing, respectively).
"judge_fid", "judge_gpt35", "judge_chatgpt", "judge_gpt4", "judge_newbing": Boolean values indicating whether human annotators judged the corresponding model's answer to be correct (True) or incorrect (False).
"improper": Boolean flag indicating whether the question is inappropriate (True) or proper (False).
```
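The key layout above maps directly onto a per-model tally. The sketch below assumes the data points have already been loaded as a list of Python dicts (for example with `json.load`, if the file turns out to be a JSON array; the on-disk format is an assumption here). It computes, for each model, the share of its answers that human annotators marked correct. The function name `human_accuracy` and the `exclude_improper` option are hypothetical illustrations, not part of the repository.

```python
from typing import Dict, Iterable, Mapping

# Model suffixes taken from the key names above (answer_<model>, judge_<model>).
MODELS = ("fid", "gpt35", "chatgpt", "gpt4", "newbing")


def human_accuracy(
    records: Iterable[Mapping], exclude_improper: bool = False
) -> Dict[str, float]:
    """Per-model fraction of answers that human annotators judged correct.

    Hypothetical helper: assumes each record is a dict with boolean
    "judge_<model>" fields and an optional boolean "improper" flag.
    Models with no judged answers are left out of the result.
    """
    correct = {m: 0 for m in MODELS}
    total = {m: 0 for m in MODELS}
    for rec in records:
        # Optionally skip questions flagged as inappropriate.
        if exclude_improper and rec.get("improper", False):
            continue
        for m in MODELS:
            verdict = rec.get(f"judge_{m}")
            if verdict is None:  # no human judgment for this model's answer
                continue
            total[m] += 1
            correct[m] += int(bool(verdict))
    return {m: correct[m] / total[m] for m in MODELS if total[m] > 0}
```

Whether improper questions should be excluded is left to the caller; the sketch does not assume how the paper treats them.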
Here is an example of a data point:
```json
@@ -54,7 +51,6 @@
|ChatGPT-4 |3610|2000|
|Bing Chat |3610|2000|


# 🏆 Evaluation & Submission


@@ -64,14 +60,13 @@
graph LR
A[🤗 Huggingface] --(Input Data)--> B[🤖Your Model]
B --(Model output)--> C[⚖️Codabench]
C --(Accuracy Score)--> D[🗳️Google Form]
D ----> E[🏆Leaderboard Website]
A[[🤗 Huggingface]] --(Input Data)--> B[[🤖Your Model]]
B --(Model output)--> C[[⚖️Codabench]]
C --(Accuracy Score)--> D[[🗳️Google Form]]
D ----> E[[🏆Leaderboard Website]]
```
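The pipeline above has Codabench return an "Accuracy Score" for your model's output, but the scoring script is not described here. As a hypothetical local sanity check, and assuming (as the human `judge_*` labels suggest) that the score measures how often an automatic evaluator agrees with human judgments, agreement can be computed like this. The `verdicts` format (one dict per question mapping a model suffix to the evaluator's True/False call) and the name `evaluator_accuracy` are assumptions, not the Codabench submission format.

```python
from typing import Mapping, Sequence

# Model suffixes used in the "judge_<model>" keys of the dataset.
MODELS = ("fid", "gpt35", "chatgpt", "gpt4", "newbing")


def evaluator_accuracy(
    records: Sequence[Mapping],
    verdicts: Sequence[Mapping[str, bool]],
    models: Sequence[str] = MODELS,
) -> float:
    """Fraction of (question, model) pairs where an evaluator's verdict
    matches the human "judge_<model>" label.

    Hypothetical helper: verdicts[i] maps a model suffix to the evaluator's
    correct/incorrect call for that model's answer to question i.
    """
    if len(records) != len(verdicts):
        raise ValueError("records and verdicts must be aligned one-to-one")
    agree = total = 0
    for rec, pred in zip(records, verdicts):
        for m in models:
            human = rec.get(f"judge_{m}")
            if human is None or m not in pred:
                continue  # skip pairs missing either label
            total += 1
            agree += int(bool(pred[m]) == bool(human))
    if total == 0:
        raise ValueError("no (question, model) pairs to score")
    return agree / total
```

This only mirrors the idea of an accuracy score locally; the number reported by Codabench remains the reference.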


# 📜 License

This dataset is released under the [Apache-2.0 License](LICENSE).
@@ -86,7 +81,7 @@ If you use this dataset in your research, please cite it as follows:
We welcome contributions to improve this dataset!
If you have any questions or feedback, please feel free to reach out at [email protected].


## Acknowledgement

This leaderboard adopts the style of [bird-bench](https://github.com/bird-bench/bird-bench.github.io).
