
Commit

cleaning up unused files and writing more docs
Mcilie committed Jun 9, 2024
1 parent 1909def commit 66672ab
Showing 30 changed files with 57 additions and 37,785 deletions.
1 change: 0 additions & 1 deletion Prompt_Systematic_Review_Dataset
Submodule Prompt_Systematic_Review_Dataset deleted from 7d8eb4
38 changes: 26 additions & 12 deletions README.md

after cloning, run `pip install -r requirements.txt` from root

## Setting up API keys

Make a file at root called `.env`.

For OpenAI: https://platform.openai.com/docs/quickstart
For Hugging Face: https://huggingface.co/docs/hub/security-tokens, also run `huggingface-cli login`
For Semantic Scholar: https://www.semanticscholar.org/product/api#api-key

Use the reference `example.env` file to fill in your API keys/tokens.
`OPENAI_API_KEY=sk-...`
`SEMANTIC_SCHOLAR_API_KEY=...`
`HF_TOKEN=...`
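These variables are read from the process environment at runtime. As a rough illustration of what loading a `.env` file amounts to (this stdlib-only sketch is not the project's code — the repository presumably relies on `python-dotenv`/`pytest-dotenv`, and the `EXAMPLE_API_KEY` name and value here are placeholders):

```python
import os
import tempfile

def load_env(path: str) -> None:
    """Parse KEY=VALUE lines from a .env-style file into os.environ."""
    with open(path) as f:
        for line in f:
            line = line.strip()
            # Skip blank lines, comments, and malformed lines.
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            # Don't clobber variables already set in the environment.
            os.environ.setdefault(key.strip(), value.strip())

# Demo with a throwaway file; in this repo the real keys live in `.env` at root.
with tempfile.NamedTemporaryFile("w", suffix=".env", delete=False) as f:
    f.write("EXAMPLE_API_KEY=sk-placeholder\n")
    env_path = f.name

load_env(env_path)
print(os.environ["EXAMPLE_API_KEY"])  # sk-placeholder
```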

## Setting up keys for running tests
To load the `.env` file when running tests, install `pytest-dotenv`:

`pip install pytest-dotenv`

Then, in the file where pytest is configured, add:

    env_files =
        .env
        .test.env
        .deploy.env

## Structure of the Repository
The script `main.py` calls the necessary functions to download all the papers, deduplicate and filter them, and then run all the experiments.

The core of the repository is in `src/prompt_systematic_review`. The `config_data.py` script contains configurations that are important for running experiments and saving time. You can see in `main.py` how some of these options are used.

The source folder is divided into four main parts:

- three scripts (`automated_review.py`, `collect_papers.py`, `config_data.py`) that collect the data and run the automated review,
- the `utils` folder, containing utility functions used throughout the repository,
- the `get_papers` folder, containing the scripts that download the papers, and
- the `experiments` folder, containing the scripts that run the experiments.

At the root, there is a `data` folder. It comes pre-loaded with some data that is used for the experiments; however, the bulk of the dataset must either be generated by running `main.py` or downloaded from Hugging Face. The results of the experiments are saved in `data/experiments_output`.

Notably, the keywords used in the automated review/scraping process are in `src/prompt_systematic_review/utils/keywords.py`. Anyone who wishes to run the automated review can adjust these keywords to their liking in that file.
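As a sketch of how such keywords drive the scraping and filtering step (the list contents and the `matches_keywords` helper below are hypothetical, not the actual contents of `keywords.py`):

```python
# Illustrative only: the real lists live in
# src/prompt_systematic_review/utils/keywords.py (names assumed here).
KEYWORDS = ["prompt engineering", "few-shot", "chain-of-thought"]

def matches_keywords(title: str, keywords=KEYWORDS) -> bool:
    """Crude relevance check of the kind a keyword-based scrape performs."""
    title = title.lower()
    return any(keyword in title for keyword in keywords)

print(matches_keywords("A Survey of Prompt Engineering Methods"))  # True
print(matches_keywords("Graph Databases for Beginners"))           # False
```

Tightening or loosening the lists in `keywords.py` directly changes which papers the automated review keeps.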

## Running the code
Running `main.py` will download the papers, run the automated review, and run the experiments.
However, if you wish to save time and only run the experiments, you can download the data from Hugging Face and move the papers folder into the data folder (it should look like `data/papers/*.pdf`). Adjust `main.py` accordingly.

Every experiment script has a `run_experiment` function that is called in `main.py`; it is responsible for running the experiment and saving the results. Each script can also be run individually with `python src/prompt_systematic_review/experiments/<experiment_name>.py` from root.
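A minimal sketch of that per-script pattern, with a hypothetical experiment name and an assumed result shape (the real experiment modules differ in what they compute and save):

```python
# Hypothetical sketch of the experiment-script pattern described above;
# the real modules live in src/prompt_systematic_review/experiments/.
import json
import os

OUTPUT_DIR = "data/experiments_output"  # where experiment results are saved

def run_experiment() -> dict:
    """Run the experiment and save its results (shape assumed for illustration)."""
    results = {"experiment": "example_count_papers", "paper_count": 0}
    os.makedirs(OUTPUT_DIR, exist_ok=True)
    with open(os.path.join(OUTPUT_DIR, "example_results.json"), "w") as f:
        json.dump(results, f)
    return results

if __name__ == "__main__":
    # Mirrors running a single script directly:
    # python src/prompt_systematic_review/experiments/<experiment_name>.py
    run_experiment()
```

Keeping every script behind the same `run_experiment` entry point is what lets `main.py` orchestrate all experiments uniformly.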


## blacklist.csv

Papers not to include, because they are poorly written, AI-generated, or simply irrelevant.

## Notes

- Sometimes a paper title may appear differently in the arXiv API. For example, "Visual Attention-Prompted Prediction and Learning" (arXiv:2310.08420) is titled "A visual encoding model based on deep neural networks and transfer learning" according to the arXiv API.

- When testing APIs, there may be latency and aborted connections

- Publication dates of papers from IEEE are missing the day about half the time. They may also come in any of the following formats:
- "April 1988"
- "2-4 April 2002"
- "29 Nov.-2 Dec. 2022"
29 changes: 0 additions & 29 deletions data/model_citation_counts.csv

This file was deleted.

