From 1ec43efd529f679cc382a7216b9b11bf9347bda3 Mon Sep 17 00:00:00 2001
From: Pedro Azevedo
Date: Tue, 14 Apr 2020 19:09:42 +0100
Subject: [PATCH] Add System Structure and How to Reproduce sections

---
 README.md | 44 +++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 43 insertions(+), 1 deletion(-)

# DeFactoNLP

DeFactoNLP is an automated fact-checking system designed for the FEVER 2018 Shared Task, which was held at EMNLP 2018. It verifies claims and retrieves the Wikipedia sentences that support its assessment. This is accomplished using named entity recognition, TF-IDF vector comparison, and decomposable attention models.

### Achievements
* 5th place in the F1 Evidence Score
* 12th place in the FEVER Score

### Cite
If you use this code, please cite:
```
author={Reddy, Aniketh Janardhan and Rocha, Gil and Esteves, Diego},
year={2018}
}
```

# System Structure

![System Structure](/images/work_structure.png)

# Reproducing

To reproduce this work, an understanding of the files and scripts is needed.

### Data

All the files with the claim information are in the [data](/data) folder.
The files [train.jsonl](/data/train.jsonl), [dev.jsonl](/data/dev.jsonl) and [test.jsonl](/data/test.jsonl) were extracted from the FEVER database.

The Wikipedia corpus can be downloaded by running the script [download-raw-wiki.sh](/fever-baselines/scripts/download-raw-wiki.sh).
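The claim files are standard JSON Lines (one JSON object per line). A minimal loader sketch, independent of the repo's own code; the field names in the comment follow FEVER's published schema:

```python
import json

def load_jsonl(path):
    """Read a FEVER-style .jsonl file: one JSON object per line."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

# Example (assuming the repo's data layout):
# claims = load_jsonl("data/train.jsonl")
# print(len(claims), claims[0]["claim"], claims[0]["label"])
```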
To speed up later processing, we split every Wikipedia article into its own file using the script [split_wiki_into_indv_docs.py](split_wiki_into_indv_docs.py).

We also created a training subsample using the script [subsample_training_data.py](subsample_training_data.py).

The files [subsample_train_relevant_docs.jsonl](/data/subsample_train_relevant_docs.jsonl), [shared_task_dev_public_relevant_docs.jsonl](/data/shared_task_dev_public_relevant_docs.jsonl) and [shared_task_test_relevant_docs.jsonl](/data/shared_task_test_relevant_docs.jsonl) contain the output of the TF-IDF stage of Document Retrieval (*predicted_pages*) and Sentence Retrieval (*predicted_sentences*).

### TF-IDF (Document and Sentence Retrieval)

The TF-IDF results can be reproduced by running the scripts inside the [fever-baselines](/fever-baselines/) folder. First, download the [database](/fever-baselines#data-preparation) and then run the [TF-IDF](/fever-baselines#evidence-retrieval-evaluation) step.
The resulting files are already generated and can be found in the [data](/data) folder.

### NER and Entailment (Document and Sentence Retrieval)

The NER results and the probability of each sentence supporting, refuting, or being uninformative with respect to the claim are computed by the same script, [predict.py](predict.py).
This script loads the TF-IDF results, retrieves documents based on the named entities in the claim, and concatenates the predicted sentences from both approaches. These sentences are then passed to the RTE module, which calculates all the probabilities.
The functions for the NER-based Document Retrieval are in [doc_retrieval.py](doc_retrieval.py).
The functions for the RTE module are in the [rte](/rte) folder, which has its own README.
For every claim, [predict.py](predict.py) generates a file with the calculated probabilities for each sentence.

### Label Classification

Label classification is performed by training a Random Forest model.
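This step can be sketched as follows. The three-value feature vector per claim is an assumption made for illustration; the repo's own script builds its features from the RTE output, and the labels here are synthetic:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

LABELS = ["SUPPORTS", "REFUTES", "NOT ENOUGH INFO"]

# Hypothetical features: per-claim aggregates of the RTE sentence
# probabilities, e.g. [max support, max refute, mean not-enough-info].
rng = np.random.default_rng(0)
X = rng.random((200, 3))
# Toy labelling rule standing in for the gold FEVER labels.
y = np.where(X[:, 0] > X[:, 1], 0, np.where(X[:, 1] > X[:, 2], 1, 2))

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X, y)
print(LABELS[clf.predict([[0.9, 0.05, 0.05]])[0]])
```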
The [train_label_classifier.py](train_label_classifier.py) script trains the model and predicts each claim's label based on the probabilities from the RTE model.
The folder [entailment_predictions_train](/rte/entailment_predictions_train/) already contains the calculated probabilities for our [subsample_train.jsonl](/data/subsample_train.jsonl). The script generates a file that is ready to be submitted.
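For reference, a sketch of writing such a submission file. The claim id and evidence pair are made up, and the field names follow the FEVER shared-task convention rather than code from this repo:

```python
import json

predictions = [
    # Hypothetical entry: id and evidence [page, sentence index] are made up.
    {"id": 137334,
     "predicted_label": "SUPPORTS",
     "predicted_evidence": [["Some_Wikipedia_Page", 3]]},
]

with open("predictions.jsonl", "w", encoding="utf-8") as f:
    for p in predictions:
        f.write(json.dumps(p) + "\n")
```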