Model Evaluation Questionnaire

Model Evaluation Pathway, MMCi Applied Data Science

Instructions

  • Please answer each question as briefly as possible in your own words. One to two sentences is ideal but not required.
  • For Assignment 1 (evaluate Khera et al.), please answer questions 1-4, 6, and 11-16 only. There are 11 questions in total. Each correct or thoughtful response (as appropriate) will receive 1 point, and the assignment will be graded on a 10-point scale.
  • For Assignment 2 (evaluate Tomašev et al.), please skip question 7 but answer all other questions. Questions 5 and 10 will be graded leniently, and we'll be exploring these topics in more detail in later course weekends. There are 15 questions in total. Each correct or thoughtful response (as appropriate) will receive 1 point, and the assignment will be graded on a 13-point scale.
  • For Assignment 3 (evaluate Esteva et al., 2017) and Assignment 4 (evaluate Taggart et al., 2018), please answer all questions. There are 16 questions in total. Each correct or thoughtful response (as appropriate) will receive 1 point, and the assignment will be graded on a 15-point scale.

Questions

Data and Source

  1. What are the predictors?
  2. What is (are) the predicted outcome(s)?
  3. Where or how were the data collected, and over what period of time? If multiple datasets were used, please answer this question for each one.
  4. Were the data filtered in any way, and if so, how (e.g. inclusion/exclusion, outlier removal)?
  5. What preprocessing or data augmentation steps were completed (e.g. feature extraction, standardization)? A minimal example of one common step appears after this list.
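
For question 5, here is a minimal sketch of one common preprocessing step, z-score standardization. The feature values are hypothetical, and the caution in the comments (fit the scaling on training data only) applies regardless of which paper you are evaluating.

```python
import numpy as np

# Hypothetical predictor matrix: rows are patients, columns are two features
# (e.g. age in years, serum creatinine in mg/dL).
X = np.array([[63.0, 1.2],
              [71.0, 0.9],
              [55.0, 1.6]])

# Z-score standardization: rescale each feature to mean 0, standard deviation 1.
# In practice the mean and standard deviation should be estimated on the
# training data only, then applied unchanged to the tuning and test data.
mean = X.mean(axis=0)
std = X.std(axis=0)
X_standardized = (X - mean) / std

print(X_standardized)
```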

Model Development

  6. What model(s) were used to predict the outcome(s) (e.g. logistic regression, CNN)? If multiple approaches were used, specify which one was chosen as the final model.
  7. Was any kind of pre-training or transfer learning used, or was the model trained from scratch on the current data?
  8. What data (or portion of the data) were used to train the model (i.e. optimize model parameters)?
  9. What data (or portion of the data), if any, were used to tune the model (i.e. select hyperparameters)?
  10. If the model included hyperparameters that were tuned, please describe these hyperparameters. Note: this should include information about regularization, if it was used. (The sketch after this list illustrates how questions 8-10 fit together.)
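
To make the train/tune/test distinction in questions 8-10 concrete, here is a minimal sketch using synthetic data and scikit-learn. The dataset, the logistic regression model, and the regularization grid are illustrative assumptions, not any paper's actual method.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic dataset: 1000 patients, 5 predictors, binary outcome.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = (X[:, 0] + rng.normal(size=1000) > 0).astype(int)

# Training set: optimize model parameters (question 8).
# Tuning set: select hyperparameters (question 9).
# Test set: held out for final evaluation (question 11).
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4, random_state=0)
X_tune, X_test, y_tune, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

# Tune one hyperparameter, the regularization strength C (question 10),
# by comparing AUROC on the tuning set.
best_c, best_auc = None, -np.inf
for c in [0.01, 0.1, 1.0, 10.0]:
    model = LogisticRegression(C=c).fit(X_train, y_train)
    auc = roc_auc_score(y_tune, model.predict_proba(X_tune)[:, 1])
    if auc > best_auc:
        best_c, best_auc = c, auc

# Refit with the selected hyperparameter; touch the test set exactly once.
final = LogisticRegression(C=best_c).fit(X_train, y_train)
print("Test AUROC:", roc_auc_score(y_test, final.predict_proba(X_test)[:, 1]))
```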

Model Evaluation

  11. What data (or portion of the data), if any, were used to evaluate the model?
  12. Based on performance metrics, please give a brief, practical summary of model performance at a particular operating point (Example: the model can achieve 90% sensitivity at 90% specificity, which would result in an expected 5 false positives and 5 false negatives for each 100 people evaluated at an expected prevalence of 50%). The sketch after this list shows the arithmetic behind this example.
  13. What measures were used to evaluate performance?
  14. Are there any other performance measures that were not used, but that you think are important to the proposed application (e.g. in a clinical scenario)? If not, which of the measure(s) reported by the authors do you believe is/are most important to the proposed application, and why?
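
The expected-error arithmetic behind question 12's example takes only a few lines. The sensitivity, specificity, and prevalence below are the values from that example, not results from any of the assigned papers.

```python
# Expected errors per 100 people at a chosen operating point.
sensitivity = 0.90   # fraction of true cases the model flags
specificity = 0.90   # fraction of non-cases the model clears
prevalence = 0.50    # expected fraction of cases in the target population
n = 100

positives = n * prevalence          # expected true cases
negatives = n * (1 - prevalence)    # expected non-cases

false_negatives = positives * (1 - sensitivity)   # missed cases
false_positives = negatives * (1 - specificity)   # false alarms

print(f"Per {n} people: {false_positives:.0f} false positives, "
      f"{false_negatives:.0f} false negatives")
```

Changing the prevalence while holding sensitivity and specificity fixed shows why the same operating point can look very different in a screening population versus a high-risk clinic.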

Model Deployment

  15. What is the primary clinical or healthcare use case for this model envisioned by the authors?
  16. Are there any other use cases that the authors could consider with this model, or that you believe would be impactful?