This project aims to develop a framework that uses Large Language Models (LLMs) to stage breast cancer based on the radiology and pathology reports. Here is the general framework:
The retrieval pipeline is based on indexing. Each merged note will be split into multiple chunks, and each chunk will then be embedded. These embeddings will be stored in a vector database. A conversational retrieval step will then use similarity search to incorporate only the relevant chunks, as shown in the following figure:

This is the repository for our recent project that uses machine learning to predict atrial fibrillation in critically ill patients. We used data from the Medical Information Mart for Intensive Care (MIMIC-IV) database. The model achieved an area under the receiver operating characteristic curve (AUC) of 0.850. A compact model using 15 features in addition to 2 newly engineered features achieved a comparable AUC of 0.820.
The newly added features were the following:
- Older septic: a categorical feature indicating if the patient is 70 years old or older and is septic.
- Cardiac risk score: a score that incorporates the patient's preexisting cardiac-related risk factors:
Based on the SHapley Additive exPlanations (SHAP) analysis, the two newly engineered features, older septic and cardiac risk score, were the second and third most influential features on the model's predictions, respectively.
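
As a rough illustration of the older septic feature defined above, the following minimal Python sketch derives the flag from a patient's age and sepsis status. The `Patient` class, its field names, and the sample records are hypothetical and are not taken from the MIMIC-IV schema or from this repository's code. How sepsis is determined is not covered here and is assumed to be available as a boolean.

```python
from dataclasses import dataclass

# Threshold taken from the feature definition: 70 years old or older.
AGE_THRESHOLD = 70


@dataclass
class Patient:
    # Hypothetical record; field names are illustrative only.
    age: float
    is_septic: bool  # assumed to be computed upstream


def older_septic(patient: Patient) -> int:
    """Return 1 if the patient is at least 70 years old and septic, else 0."""
    return int(patient.age >= AGE_THRESHOLD and patient.is_septic)


# Made-up example records covering the boundary cases.
patients = [
    Patient(age=82, is_septic=True),
    Patient(age=82, is_septic=False),
    Patient(age=70, is_septic=True),
    Patient(age=69, is_septic=True),
]
flags = [older_septic(p) for p in patients]
print(flags)  # [1, 0, 1, 0]
```

The flag is encoded as 0/1 so that it can be passed directly to most tabular models. The cardiac risk score is not sketched because its components are not listed here.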