Make sure to configure the corresponding variables in the `config.yaml` file as required for each section. The specific config variables that can be modified for each script are listed alongside it below.
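The `${...}` references in the config snippets below are OmegaConf-style interpolations. As a minimal sketch (assuming the repo loads its config with OmegaConf; the file name `config.yaml` is taken from the text above), a script can resolve them like this:

```python
from omegaconf import OmegaConf

# Load the repo-level configuration; ${data.raw_dir}-style references
# are resolved by OmegaConf when the key is accessed.
cfg = OmegaConf.load("config.yaml")

# e.g. prints the fully interpolated path to the raw CIViC file
print(cfg.civic.raw_file)
```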
Note
If the fine-tuning datasets or fine-tuned models are not found locally, the script will load them from HuggingFace.
Make sure to navigate to the `llm` directory to run the scripts.
Downloading the CIViC Dataset
To download the CIViC Dataset that has the biomarkers, run the command below:
wget -P data/raw https://civicdb.org/downloads/01-Dec-2023/01-Dec-2023-VariantSummaries.tsv
CIViC Configuration
```yaml
civic:
  raw_file: ${data.raw_dir}/01-Dec-2023-VariantSummaries.tsv
  processed_file: ${data.processed_dir}/civic_processed.csv
  variant_syn_file: ${data.processed_dir}/variants_synonyms.csv
  gene_syn_file: ${data.processed_dir}/gene_synonyms.csv
```
After making sure all the configurations are correct for CIViC, run the command below to process the data:
python -m scripts.process_civic
This should generate three CSV files:
- `data/processed/civic_processed.csv`
- `data/processed/gene_synonyms.csv`
- `data/processed/variants_synonyms.csv`
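To verify that processing succeeded, a quick pandas check of the three outputs can help (a minimal sketch; it only inspects shapes and the first rows, so no assumptions about column names are needed):

```python
import pandas as pd

# Sanity-check the three files generated by scripts.process_civic
for path in [
    "data/processed/civic_processed.csv",
    "data/processed/gene_synonyms.csv",
    "data/processed/variants_synonyms.csv",
]:
    df = pd.read_csv(path)
    print(path, df.shape)
    print(df.head(2))
```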
Downloading AACR Project GENIE Data (15.1-public)
To download the AACR Project GENIE data, first register and follow the steps on Synapse. Make sure to download the correct release, 15.1-public.
AACR GENIE Configuration
```yaml
aacr:
  clinical_sample: ${data.raw_dir}/aacr_genie/data_clinical_sample.txt
  data_mutations: ${data.raw_dir}/aacr_genie/data_mutations_extended.txt
  data_cna: ${data.raw_dir}/aacr_genie/data_CNA.txt
  data_sv: ${data.raw_dir}/aacr_genie/data_sv.txt
```
1. To process the data and compute the percentage of patients with at least one biomarker found in CIViC, run the script below. It prints the percentage and generates `data/processed/patient_with_biomarkers.csv`, listing the matching patients along with their clinical and mutational profiles. (A rough sketch of the matching idea follows these steps.)
Make sure you've downloaded and processed the CIViC dataset first, since the AACR analysis depends on its list of biomarkers.
python -m scripts.aacr_analysis
2. To generate the patient cancer-type distribution, run the script below:
python -m scripts.plot_cancer_patient_distribution
This should save the plot in the `figures` directory.
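For intuition, the matching in step 1 boils down to joining the GENIE mutation table against the CIViC biomarker list and counting patients with at least one hit. Below is a rough pandas sketch of that idea, not the repo's script: `Hugo_Symbol`, `HGVSp_Short`, `Tumor_Sample_Barcode`, `SAMPLE_ID`, and `PATIENT_ID` are standard MAF/GENIE fields but should be checked against your release, and the `gene`/`variant` columns of the processed CIViC file are hypothetical names.

```python
import pandas as pd

# CIViC biomarkers produced by scripts.process_civic
civic = pd.read_csv("data/processed/civic_processed.csv")

# GENIE mutations (MAF format) and clinical sample sheet;
# GENIE clinical files start with '#' comment lines
muts = pd.read_csv("data/raw/aacr_genie/data_mutations_extended.txt",
                   sep="\t", low_memory=False)
clin = pd.read_csv("data/raw/aacr_genie/data_clinical_sample.txt",
                   sep="\t", comment="#")

# Hypothetical join: match on gene symbol + variant name
matched = muts.merge(civic,
                     left_on=["Hugo_Symbol", "HGVSp_Short"],
                     right_on=["gene", "variant"], how="inner")

# Map matched samples back to patients and compute the percentage
hit = clin["SAMPLE_ID"].isin(matched["Tumor_Sample_Barcode"])
matched_patients = clin.loc[hit, "PATIENT_ID"].nunique()
pct = 100 * matched_patients / clin["PATIENT_ID"].nunique()
print(f"{pct:.1f}% of patients carry at least one CIViC biomarker")
```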
DPO Fine-tuning Configuration
```yaml
DPO_FT:
  open_source_model: NousResearch/Hermes-2-Pro-Mistral-7B
  fine_tuned_model: Hermes-FT
  fine_tuning_train: ${data.processed_dir}/negative.jsonl
  fine_tuning_test: ${data.processed_dir}/ft_test.jsonl
  beta: 0.1
  learning_rate: 5e-5
  max_steps: 200
  warmup_steps: 100
  per_device_train_batch_size: 1
  gradient_accumulation_steps: 4
  logging_steps: 1
  max_length: 4800
  LoRA:
    r: 2
    lora_alpha: 4
    lora_dropout: 0.05
    target_modules: ['k_proj', 'gate_proj', 'v_proj', 'up_proj', 'q_proj', 'o_proj', 'down_proj']
```
Note
- To repeat the fine-tuning for Hermes-FT, fine-tuned with the manually annotated dataset only, use the training set `data/processed/negative.jsonl`.
- To repeat the fine-tuning for the model Hermes-synth-FT, change `fine_tuned_model` to Hermes-synth-FT and `fine_tuning_train` to `${data.simulated_dir}/negative.jsonl`.
After configuring the variables, run the command below to start the training:
python -m scripts.dpo_train
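For reference, the core of a DPO run with the settings above might look like the sketch below. This is a minimal sketch, not the repo's `scripts.dpo_train`; it follows the older TRL 0.7.x-style `DPOTrainer` API (newer TRL releases move `beta` and `max_length` into a `DPOConfig`), so adjust to your installed version.

```python
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from trl import DPOTrainer

model_name = "NousResearch/Hermes-2-Pro-Mistral-7B"
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# DPO expects records with "prompt", "chosen", and "rejected" fields
train = load_dataset("json", data_files="data/processed/negative.jsonl",
                     split="train")

peft_config = LoraConfig(
    r=2, lora_alpha=4, lora_dropout=0.05, task_type="CAUSAL_LM",
    target_modules=['k_proj', 'gate_proj', 'v_proj', 'up_proj',
                    'q_proj', 'o_proj', 'down_proj'],
)

args = TrainingArguments(
    output_dir="Hermes-FT", max_steps=200, warmup_steps=100,
    per_device_train_batch_size=1, gradient_accumulation_steps=4,
    learning_rate=5e-5, logging_steps=1,
)

# ref_model=None: TRL derives the frozen reference from the PEFT base model
trainer = DPOTrainer(model, ref_model=None, args=args, beta=0.1,
                     train_dataset=train, tokenizer=tokenizer,
                     peft_config=peft_config, max_length=4800)
trainer.train()
trainer.save_model("Hermes-FT")
```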
Hermes Evaluation
```yaml
HERMES_EVAL:
  open_source_model: Hermes-FT
  test_set: ${data.processed_dir}/ft_test.jsonl
  open_source_eval_file: Fine_tune_a_Mistral_7b_model_with_DPO_zero-shot_zero-shot_loong_r_2_alpha_4.json # Changes depending on what we are evaluating!
```
You can also evaluate the open-source model Hermes-synth-FT, or even the base model NousResearch/Hermes-2-Pro-Mistral-7B, by changing `open_source_model` accordingly.
To run the evaluation, use the command below:
python -m scripts.evaluate_hermes_models
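If you want to load a LoRA-tuned checkpoint manually for ad-hoc inspection, a hedged sketch using PEFT's auto class is shown below (it assumes the adapter and tokenizer were saved under `Hermes-FT`; the example prompt is a placeholder):

```python
import torch
from peft import AutoPeftModelForCausalLM
from transformers import AutoTokenizer

# Loads the base model and applies the saved LoRA adapter on top
model = AutoPeftModelForCausalLM.from_pretrained(
    "Hermes-FT", torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained("Hermes-FT")

inputs = tokenizer("Example biomarker extraction prompt",
                   return_tensors="pt")
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```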
GPT models Evaluation
```yaml
GPT_EVAL:
  n_shot: 0
  model: gpt-3.5-turbo
  test_set: ${data.processed_dir}/ft_test.jsonl
  train_set: ${data.processed_dir}/ft_train.jsonl

OUTPUT_PROMPTS:
  zero_shot: 0shot
  one_shot: 1shot
  two_shot: 2shot
  prompt_chain: 2CoP

PROMPT_FILES:
  gpt_zero_shot: prompts/zero-shot.json
  gpt_one_shot: prompts/one-shot.json
  gpt_two_shot: prompts/two-shot.json
  gpt_chain_one: prompts/chain_1.json
  gpt_chain_two: prompts/chain_2.json
```
You can configure which OpenAI model to use for evaluation and whether to use few-shot prompting (`n_shot` set to 0, 1, or 2). Select the desired `n_shot`, and the script will automatically use the corresponding prompt file.
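For orientation, sending one request against the OpenAI API might look like the sketch below. This is an illustration only, not the repo's evaluation code: it hypothetically assumes each file in `prompts/` holds a list of chat messages, and the user content is a placeholder.

```python
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Hypothetical schema: the prompt file holds a list of chat messages
with open("prompts/zero-shot.json") as f:
    messages = json.load(f)

messages.append({"role": "user",
                 "content": "Patient report text goes here..."})

resp = client.chat.completions.create(model="gpt-3.5-turbo",
                                      messages=messages)
print(resp.choices[0].message.content)
```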
Important
- Ensure you set your `OPENAI_API_KEY` environment variable. For example, you can do this by running: `export OPENAI_API_KEY="your_api_key"`
- To evaluate the model with zero-shot and few-shot prompting, run the command below:
python -m scripts.evaluate_gpt_fewshots
- To evaluate the model with prompt chaining, run the command below:
python -m scripts.evaluate_gpt_chain_of_prompts