This repository contains the datasets and source code for the OOPSLA 2024 paper 'Drowzee: Metamorphic Testing for Fact-Conflicting Hallucination Detection in Large Language Models'.
Set up the Python environment and install the dependencies:
```
conda create -n <your_env_name> python=3.10
pip install -r requirements.txt
```
- Java 8 or higher. The command `java -version` should complete successfully and print a line like `java version "1.8.0_92"`.
- To run the Llama2-series and Mistral-series models locally, we use (at least) two NVIDIA A100 PCIe 80GB GPUs.
- To interact with GPT models, we need an OpenAI API key (replace the `your_api_key` placeholder in the OpenAI interaction code).
- We use Ollama to run local models.
  - Start the model on the Linux server (default port: 11434):
    ```
    ollama run [MODEL_NAME]
    ```
  - Interact with it using the `requests` library:
    ```python
    response = requests.post(
        "http://your_server_ip:11434/api/generate",
        json={"model": [MODEL_NAME], "prompt": [INPUT_PROMPT], "stream": False},
    ).json()
    ```
- There are also online services for open-source LLMs, such as DeepInfra; their OpenAI-compatible APIs can be called with the OpenAI Python library to generate responses, as sketched below.
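A minimal sketch of such a call with the OpenAI Python client, assuming DeepInfra's OpenAI-compatible endpoint; the base URL, model name, and environment variable are illustrative assumptions, not values fixed by this repository:

```python
import os
from openai import OpenAI

# Assumed endpoint and credentials; adjust to your provider and account.
client = OpenAI(
    api_key=os.environ["DEEPINFRA_API_KEY"],          # hypothetical environment variable
    base_url="https://api.deepinfra.com/v1/openai",   # assumed OpenAI-compatible endpoint
)

response = client.chat.completions.create(
    model="meta-llama/Llama-2-70b-chat-hf",           # example model id
    messages=[{"role": "user", "content": "[INPUT_PROMPT]"}],
)
print(response.choices[0].message.content)
```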
- Download StanfordCoreNLP and unzip the file.
- Start the service on the Linux server:
  ```
  java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port your_port -timeout 15000
  ```
- Use the code in `Response Validation` to analyze LLM responses.
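For reference, here is a minimal sketch of sending an LLM response to the running CoreNLP server over HTTP; the host, port, and annotator list are assumptions to adapt to your setup:

```python
import json
import requests

# Assumed server address and annotators; match them to your StanfordCoreNLPServer invocation.
CORENLP_URL = "http://your_server_ip:your_port"
props = {"annotators": "tokenize,ssplit,pos,ner,openie", "outputFormat": "json"}

llm_response = "Albert Einstein was born in Ulm."
result = requests.post(
    CORENLP_URL,
    params={"properties": json.dumps(props)},
    data=llm_response.encode("utf-8"),
    timeout=15,
).json()

# Print the OpenIE triples extracted from each sentence of the response.
for sentence in result["sentences"]:
    for triple in sentence.get("openie", []):
        print(triple["subject"], triple["relation"], triple["object"])
```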
The benchmark dataset is in Drowzee Benchmark.
Each item in the dataset follows the structure below:
- `qid`: A unique identifier for each test case in the dataset.
- `category`: Knowledge domain of the entities used in this test case (culture/geography/history/people/health/math/nature/society/technology).
- `reasoning`: Reasoning strategy used when generating this test case (transitive/negation/inverse/composite).
- `subject`: The subject entity of the derived fact in the test case.
- `predicate`: The derived relation between `subject` and `object` in this test case.
- `object`: The secondary entity or concept linked to the subject by the predicate.
- `description`: The first sentence from the Wikipedia pages of `subject` and `object`.
- `question`: The question generated according to the derived fact in this test case.
- `answer`: The ground-truth answer to the question, based on the derived fact and evidence.
- `evidence`: A list of supporting triples used to derive the facts in this test case.
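As a quick illustration of consuming these fields, the sketch below assumes the benchmark is a JSON array of such items; the file name is a hypothetical placeholder for the actual file under Drowzee Benchmark:

```python
import json

# Hypothetical file name; point this at the actual benchmark file in Drowzee Benchmark.
with open("drowzee_benchmark.json", encoding="utf-8") as f:
    benchmark = json.load(f)

for item in benchmark:
    print(f"[{item['qid']}] {item['category']} / {item['reasoning']}")
    print("Derived fact:", item["subject"], item["predicate"], item["object"])
    print("Q:", item["question"])
    print("A:", item["answer"])
    for triple in item["evidence"]:
        print("  evidence:", triple)
```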
```
.
├── 0.fact_extraction
│   ├── category
│   ├── get_wiki_cat_id.py
│   ├── transitive_entity_extract.py
│   ├── transitive_pl_build.py
│   └── wiki_pl_build.py
├── 1.logical_reasoning
│   ├── prolog_inference.py
│   ├── prolog_rules
│   │   ├── composite_problem_dict.json
│   │   ├── composite_rules.pl
│   │   ├── inverse_problem_dict.json
│   │   ├── inverse_rules.pl
│   │   ├── negation_problem_dict.json
│   │   ├── negation_rules.pl
│   │   ├── transitive_problem_dict.json
│   │   └── transitive_rules.pl
│   └── rule_generation.py
├── 2.benchmark_construction
│   ├── benchmark_generation.py
│   └── question_generation.py
├── 3.llm_interaction
│   ├── local_interaction.py
│   ├── openai_interaction.py
│   └── test_vllm.py
├── 4.response_evaluation
│   └── answer_test.py
└── utils
    ├── file_ops.py
    └── wiki_property_cat_v1.xlsx
```
This project makes use of the following open-source projects:
- PySwip: a bridge between Python and SWI-Prolog that allows querying SWI-Prolog from Python programs (a small usage sketch follows this list).
- vLLM: A high-throughput and memory-efficient inference and serving engine for LLMs.
- Ollama: Get up and running with Llama 3.2, Mistral, Gemma 2, and other large language models.
- StanfordCoreNLP: A suite of NLP tools.
- Phrase-BERT: A model to embed phrases.
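As a hedged illustration of how PySwip can drive this kind of Prolog-based reasoning, the facts and rules below are made-up examples, not the repository's rule files (those live under `1.logical_reasoning/prolog_rules/`):

```python
from pyswip import Prolog

prolog = Prolog()

# Illustrative base facts and a transitive rule (not Drowzee's actual rules).
prolog.assertz("direct_in(eiffel_tower, paris)")
prolog.assertz("direct_in(paris, france)")
prolog.assertz("(located_in(X, Y) :- direct_in(X, Y))")
prolog.assertz("(located_in(X, Z) :- direct_in(X, Y), located_in(Y, Z))")

# Derive transitive facts: prints 'paris' and then 'france'.
for solution in prolog.query("located_in(eiffel_tower, Where)"):
    print(solution["Where"])
```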
```bibtex
@article{10.1145/3689776,
author = {Li, Ningke and Li, Yuekang and Liu, Yi and Shi, Ling and Wang, Kailong and Wang, Haoyu},
title = {Drowzee: Metamorphic Testing for Fact-Conflicting Hallucination Detection in Large Language Models},
year = {2024},
issue_date = {October 2024},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
volume = {8},
number = {OOPSLA2},
url = {https://doi.org/10.1145/3689776},
doi = {10.1145/3689776},
month = oct,
articleno = {336},
numpages = {30},
keywords = {Hallucination, Large Language Model, Software Testing}
}
```