Make sure to configure the corresponding variables in the `config.yaml` file as required for each section. The specific config variables that can be modified for each script are listed alongside it below.
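The `${...}` references in the config snippets below are OmegaConf-style interpolations. As a minimal sketch (assuming the repo loads its config with OmegaConf; the file name `config.yaml` is taken from the text above), a script can resolve them like this:

```python
from omegaconf import OmegaConf

# Load the repo-level configuration; ${data.raw_dir}-style references
# are resolved by OmegaConf when the key is accessed.
cfg = OmegaConf.load("config.yaml")

# e.g. prints the fully interpolated path to the raw CIViC file
print(cfg.civic.raw_file)
```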
Note
If the fine-tuning datasets or fine-tuned models are not found locally, the script will load them from HuggingFace.
Make sure to navigate to the `llm` directory to run the scripts.
Downloading the CIViC Dataset
To download the CIViC Dataset that has the biomarkers, run the command below:
wget -P data/raw https://civicdb.org/downloads/01-Dec-2023/01-Dec-2023-VariantSummaries.tsv
CIViC Configuration
```yaml
civic:
  raw_file: ${data.raw_dir}/01-Dec-2023-VariantSummaries.tsv
  processed_file: ${data.processed_dir}/civic_processed.csv
  variant_syn_file: ${data.processed_dir}/variants_synonyms.csv
  gene_syn_file: ${data.processed_dir}/gene_synonyms.csv
```
After making sure all the configurations are correct for CIViC, run the command below to process the data:
python -m scripts.process_civic
This should generate three CSV files:
- `data/processed/civic_processed.csv`
- `data/processed/gene_synonyms.csv`
- `data/processed/variants_synonyms.csv`
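To verify that processing succeeded, a quick pandas check of the three outputs can help (a minimal sketch; it only inspects shapes and the first rows, so no assumptions about column names are needed):

```python
import pandas as pd

# Sanity-check the three files generated by scripts.process_civic
for path in [
    "data/processed/civic_processed.csv",
    "data/processed/gene_synonyms.csv",
    "data/processed/variants_synonyms.csv",
]:
    df = pd.read_csv(path)
    print(path, df.shape)
    print(df.head(2))
```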
Downloading AACR Project GENIE Data (15.1-public)
To download the AACR Project GENIE data, first register and follow the steps on Synapse. Make sure to download the correct release, 15.1-public.
AACR GENIE Configuration
```yaml
aacr:
  clinical_sample: ${data.raw_dir}/aacr_genie/data_clinical_sample.txt
  data_mutations: ${data.raw_dir}/aacr_genie/data_mutations_extended.txt
  data_cna: ${data.raw_dir}/aacr_genie/data_CNA.txt
  data_sv: ${data.raw_dir}/aacr_genie/data_sv.txt
```
1. To process the data and compute the percentage of patients with at least one biomarker found in CIViC, run the script below. It prints the percentage and generates `data/processed/patient_with_biomarkers.csv`, listing the matching patients along with their clinical and mutational profiles. (A rough sketch of the matching idea follows these steps.)
Make sure you've downloaded and processed the CIViC dataset first, since the AACR analysis depends on its list of biomarkers.
python -m scripts.aacr_analysis
2. To generate the patient cancer-type distribution, run the script below:
python -m scripts.plot_cancer_patient_distribution
This should save the plot in the `figures` directory.
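For intuition, the matching in step 1 boils down to joining the GENIE mutation table against the CIViC biomarker list and counting patients with at least one hit. Below is a rough pandas sketch of that idea, not the repo's script: `Hugo_Symbol`, `HGVSp_Short`, `Tumor_Sample_Barcode`, `SAMPLE_ID`, and `PATIENT_ID` are standard MAF/GENIE fields but should be checked against your release, and the `gene`/`variant` columns of the processed CIViC file are hypothetical names.

```python
import pandas as pd

# CIViC biomarkers produced by scripts.process_civic
civic = pd.read_csv("data/processed/civic_processed.csv")

# GENIE mutations (MAF format) and clinical sample sheet;
# GENIE clinical files start with '#' comment lines
muts = pd.read_csv("data/raw/aacr_genie/data_mutations_extended.txt",
                   sep="\t", low_memory=False)
clin = pd.read_csv("data/raw/aacr_genie/data_clinical_sample.txt",
                   sep="\t", comment="#")

# Hypothetical join: match on gene symbol + variant name
matched = muts.merge(civic,
                     left_on=["Hugo_Symbol", "HGVSp_Short"],
                     right_on=["gene", "variant"], how="inner")

# Map matched samples back to patients and compute the percentage
hit = clin["SAMPLE_ID"].isin(matched["Tumor_Sample_Barcode"])
matched_patients = clin.loc[hit, "PATIENT_ID"].nunique()
pct = 100 * matched_patients / clin["PATIENT_ID"].nunique()
print(f"{pct:.1f}% of patients carry at least one CIViC biomarker")
```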
DPO Fine-tuning Configuration
```yaml
DPO_FT:
  open_source_model: NousResearch/Hermes-2-Pro-Mistral-7B
  fine_tuned_model: Hermes-FT
  fine_tuning_train: ${data.processed_dir}/negative.jsonl
  fine_tuning_test: ${data.processed_dir}/ft_test.jsonl
  beta: 0.1
  learning_rate: 5e-5
  max_steps: 200
  warmup_steps: 100
  per_device_train_batch_size: 1
  gradient_accumulation_steps: 4
  logging_steps: 1
  max_length: 4800
  LoRA:
    r: 2
    lora_alpha: 4
    lora_dropout: 0.05
    target_modules: ['k_proj', 'gate_proj', 'v_proj', 'up_proj', 'q_proj', 'o_proj', 'down_proj']
```
Note
- To repeat the fine-tuning for Hermes-FT, fine-tuned with the manually annotated dataset only, use the training set `data/processed/negative.jsonl`.
- To repeat the fine-tuning for the model Hermes-synth-FT, change `fine_tuned_model` to Hermes-synth-FT and `fine_tuning_train` to `${data.simulated_dir}/negative.jsonl`.
After configuring the variables, run the command below to start the training:
python -m scripts.dpo_train
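For reference, the core of a DPO run with the settings above might look like the sketch below. This is a minimal sketch, not the repo's `scripts.dpo_train`; it follows the older TRL 0.7.x-style `DPOTrainer` API (newer TRL releases move `beta` and `max_length` into a `DPOConfig`), so adjust to your installed version.

```python
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from trl import DPOTrainer

model_name = "NousResearch/Hermes-2-Pro-Mistral-7B"
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# DPO expects records with "prompt", "chosen", and "rejected" fields
train = load_dataset("json", data_files="data/processed/negative.jsonl",
                     split="train")

peft_config = LoraConfig(
    r=2, lora_alpha=4, lora_dropout=0.05, task_type="CAUSAL_LM",
    target_modules=['k_proj', 'gate_proj', 'v_proj', 'up_proj',
                    'q_proj', 'o_proj', 'down_proj'],
)

args = TrainingArguments(
    output_dir="Hermes-FT", max_steps=200, warmup_steps=100,
    per_device_train_batch_size=1, gradient_accumulation_steps=4,
    learning_rate=5e-5, logging_steps=1,
)

# ref_model=None: TRL derives the frozen reference from the PEFT base model
trainer = DPOTrainer(model, ref_model=None, args=args, beta=0.1,
                     train_dataset=train, tokenizer=tokenizer,
                     peft_config=peft_config, max_length=4800)
trainer.train()
trainer.save_model("Hermes-FT")
```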
Hermes Evaluation
```yaml
HERMES_EVAL:
  open_source_model: Hermes-FT
  test_set: ${data.processed_dir}/ft_test.jsonl
  open_source_eval_file: Fine_tune_a_Mistral_7b_model_with_DPO_zero-shot_zero-shot_loong_r_2_alpha_4.json # Changes depending on what we are evaluating!
```
You can also evaluate the open-source model Hermes-synth-FT, or even the base model NousResearch/Hermes-2-Pro-Mistral-7B, by changing `open_source_model` accordingly.
To run the evaluation, use the command below:
python -m scripts.evaluate_hermes_models
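If you want to load a LoRA-tuned checkpoint manually for ad-hoc inspection, a hedged sketch using PEFT's auto class is shown below (it assumes the adapter and tokenizer were saved under `Hermes-FT`; the example prompt is a placeholder):

```python
import torch
from peft import AutoPeftModelForCausalLM
from transformers import AutoTokenizer

# Loads the base model and applies the saved LoRA adapter on top
model = AutoPeftModelForCausalLM.from_pretrained(
    "Hermes-FT", torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained("Hermes-FT")

inputs = tokenizer("Example biomarker extraction prompt",
                   return_tensors="pt")
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```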
GPT models Evaluation
```yaml
GPT_EVAL:
  n_shot: 0
  model: gpt-3.5-turbo
  test_set: ${data.processed_dir}/ft_test.jsonl
  train_set: ${data.processed_dir}/ft_train.jsonl

OUTPUT_PROMPTS:
  zero_shot: 0shot
  one_shot: 1shot
  two_shot: 2shot
  prompt_chain: 2CoP

PROMPT_FILES:
  gpt_zero_shot: prompts/zero-shot.json
  gpt_one_shot: prompts/one-shot.json
  gpt_two_shot: prompts/two-shot.json
  gpt_chain_one: prompts/chain_1.json
  gpt_chain_two: prompts/chain_2.json
```
You can configure which OpenAI model to use for evaluation and whether to use few-shot prompting (`n_shot` set to 0, 1, or 2). Select the desired `n_shot`, and the script will automatically use the corresponding prompt file.
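For orientation, sending one request against the OpenAI API might look like the sketch below. This is an illustration only, not the repo's evaluation code: it hypothetically assumes each file in `prompts/` holds a list of chat messages, and the user content is a placeholder.

```python
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Hypothetical schema: the prompt file holds a list of chat messages
with open("prompts/zero-shot.json") as f:
    messages = json.load(f)

messages.append({"role": "user",
                 "content": "Patient report text goes here..."})

resp = client.chat.completions.create(model="gpt-3.5-turbo",
                                      messages=messages)
print(resp.choices[0].message.content)
```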
Important
- Ensure you set your `OPENAI_API_KEY` environment variable. For example, you can do this by running: `export OPENAI_API_KEY="your_api_key"`
- To evaluate the model with zero-shot and few-shot prompting, run the command below:
python -m scripts.evaluate_gpt_fewshots
- To evaluate the model with prompt chaining, run the command below:
python -m scripts.evaluate_gpt_chain_of_prompts