
Commit

cleaning up unused files and writing more docs
Mcilie committed Jun 9, 2024
1 parent 1909def commit 66672ab
Showing 30 changed files with 57 additions and 37,785 deletions.
1 change: 0 additions & 1 deletion Prompt_Systematic_Review_Dataset
Submodule Prompt_Systematic_Review_Dataset deleted from 7d8eb4
38 changes: 26 additions & 12 deletions README.md

after cloning, run `pip install -r requirements.txt` from root

## Setting up API keys

Make a file at root called `.env`.

For OpenAI: https://platform.openai.com/docs/quickstart
For Hugging Face: https://huggingface.co/docs/hub/security-tokens, also run `huggingface-cli login`
For Semantic Scholar: https://www.semanticscholar.org/product/api#api-key

Use the reference `example.env` file to fill in your API keys/tokens.
`OPENAI_API_KEY=sk-...`
`SEMANTIC_SCHOLAR_API_KEY=...`
`HF_TOKEN=...`
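These variables are read from the process environment at runtime. As a rough illustration of what loading a `.env` file amounts to (this stdlib-only sketch is not the project's code — the repository presumably relies on `python-dotenv`/`pytest-dotenv`, and the `EXAMPLE_API_KEY` name and value here are placeholders):

```python
import os
import tempfile

def load_env(path: str) -> None:
    """Parse KEY=VALUE lines from a .env-style file into os.environ."""
    with open(path) as f:
        for line in f:
            line = line.strip()
            # Skip blank lines, comments, and malformed lines.
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            # Don't clobber variables already set in the environment.
            os.environ.setdefault(key.strip(), value.strip())

# Demo with a throwaway file; in this repo the real keys live in `.env` at root.
with tempfile.NamedTemporaryFile("w", suffix=".env", delete=False) as f:
    f.write("EXAMPLE_API_KEY=sk-placeholder\n")
    env_path = f.name

load_env(env_path)
print(os.environ["EXAMPLE_API_KEY"])  # sk-placeholder
```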

## Setting up keys for running tests
To load the `.env` file when running tests, install `pytest-dotenv`:

`pip install pytest-dotenv`

Then, in the file where pytest is configured, add:

    env_files =
        .env
        .test.env
        .deploy.env

## Structure of the Repository
The script `main.py` calls the necessary functions to download all the papers, deduplicate and filter them, and then run all the experiments.

The core of the repository is in `src/prompt_systematic_review`. The `config_data.py` script contains configurations that are important for running experiments and saving time. You can see in `main.py` how some of these options are used.

The source folder is divided into four main parts:

- three scripts (`automated_review.py`, `collect_papers.py`, `config_data.py`) that collect the data and run the automated review,
- the `utils` folder, containing utility functions used throughout the repository,
- the `get_papers` folder, containing the scripts that download the papers, and
- the `experiments` folder, containing the scripts that run the experiments.

At the root, there is a `data` folder. It comes pre-loaded with some data that is used for the experiments; however, the bulk of the dataset must either be generated by running `main.py` or downloaded from Hugging Face. The results of the experiments are saved in `data/experiments_output`.

Notably, the keywords used in the automated review/scraping process are in `src/prompt_systematic_review/utils/keywords.py`. Anyone who wishes to run the automated review can adjust these keywords to their liking in that file.
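As a sketch of how such keywords drive the scraping and filtering step (the list contents and the `matches_keywords` helper below are hypothetical, not the actual contents of `keywords.py`):

```python
# Illustrative only: the real lists live in
# src/prompt_systematic_review/utils/keywords.py (names assumed here).
KEYWORDS = ["prompt engineering", "few-shot", "chain-of-thought"]

def matches_keywords(title: str, keywords=KEYWORDS) -> bool:
    """Crude relevance check of the kind a keyword-based scrape performs."""
    title = title.lower()
    return any(keyword in title for keyword in keywords)

print(matches_keywords("A Survey of Prompt Engineering Methods"))  # True
print(matches_keywords("Graph Databases for Beginners"))           # False
```

Tightening or loosening the lists in `keywords.py` directly changes which papers the automated review keeps.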

## Running the code
Running `main.py` will download the papers, run the automated review, and run the experiments.
However, if you wish to save time and only run the experiments, you can download the data from Hugging Face and move the papers folder into the data folder (it should look like `data/papers/*.pdf`). Adjust `main.py` accordingly.

Every experiment script has a `run_experiment` function that is called in `main.py`; it is responsible for running the experiment and saving the results. Each script can also be run individually with `python src/prompt_systematic_review/experiments/<experiment_name>.py` from root.
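A minimal sketch of that per-script pattern, with a hypothetical experiment name and an assumed result shape (the real experiment modules differ in what they compute and save):

```python
# Hypothetical sketch of the experiment-script pattern described above;
# the real modules live in src/prompt_systematic_review/experiments/.
import json
import os

OUTPUT_DIR = "data/experiments_output"  # where experiment results are saved

def run_experiment() -> dict:
    """Run the experiment and save its results (shape assumed for illustration)."""
    results = {"experiment": "example_count_papers", "paper_count": 0}
    os.makedirs(OUTPUT_DIR, exist_ok=True)
    with open(os.path.join(OUTPUT_DIR, "example_results.json"), "w") as f:
        json.dump(results, f)
    return results

if __name__ == "__main__":
    # Mirrors running a single script directly:
    # python src/prompt_systematic_review/experiments/<experiment_name>.py
    run_experiment()
```

Keeping every script behind the same `run_experiment` entry point is what lets `main.py` orchestrate all experiments uniformly.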


## blacklist.csv

Papers not to include, because they are poorly written, AI-generated, or simply irrelevant.

## Notes

- Sometimes a paper title may appear differently in the arXiv API. For example, "Visual Attention-Prompted Prediction and Learning" (arXiv:2310.08420) is titled "A visual encoding model based on deep neural networks and transfer learning" according to the arXiv API.

- When testing APIs, there may be latency and aborted connections

- Publication dates of papers from IEEE are missing the day about half the time. They may also come in any of the following formats:
- "April 1988"
- "2-4 April 2002"
- "29 Nov.-2 Dec. 2022"
29 changes: 0 additions & 29 deletions data/model_citation_counts.csv

This file was deleted.

