List of Data Science Terms

MMCi Applied Data Science

Weekend 1

During and after weekend 1, please begin to familiarize yourself with the following terms and definitions.

Intro to Predictive Models

Algorithm: A process or set of rules to be followed in calculations or other problem-solving operations, especially by a computer
Classification Model: A predictive model for which the predicted value is a categorical label
Dependent variable: Also called a label and often representing a clinical outcome, a dependent variable (often denoted y) is a value that a predictive model is trained to predict
Feature: Also called a predictor or independent variable, a feature (often denoted x) most often refers to an input to a predictive model
Independent Variable: Also called a predictor or feature, an independent variable (often denoted x) is an input to a predictive model
Label: Also called a dependent variable and often representing a clinical outcome, the label (often denoted y) is the value that a predictive model is trained to predict
Learning (in machine learning): Adjusting the parameters of a predictive model in order to improve model performance
Model: A mathematical equation defining a possible relationship between variables and typically containing parameters that can be learned from data
Parameters: Numeric values in a predictive model that define a specific relationship between the independent variables (i.e. features) and dependent variables (i.e. labels), and are updated during learning
Predictive Model: A mathematical model (see: Model) used to predict the value of an unknown (i.e. dependent) variable based on a set of known (i.e. independent) variables
Predictor: Also called a feature or independent variable, a predictor (often denoted x) most often refers to an input to a predictive model
Regression Model: A predictive model for which the predicted value is numeric
Training: Another term for the process of learning model parameters from data
Weights: In the context of machine learning, weights are another term for the parameters in a predictive model
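
To make the terms above concrete, here is a minimal sketch of a regression model whose two parameters (a weight and a bias) are learned from data. The toy data, the function names, the choice of gradient descent as the learning procedure, and the step count and learning rate are all assumptions made for illustration; they are not part of the course material.

```python
# Toy data (made up for illustration): one feature x and a numeric label y.
# The labels were generated from y = 2x + 1, so training should recover
# a weight near 2 and a bias near 1.
features = [1.0, 2.0, 3.0, 4.0]
labels = [3.0, 5.0, 7.0, 9.0]


def predict(x, weight, bias):
    """The model: a linear equation with two parameters (weight and bias)."""
    return weight * x + bias


def mean_squared_error(weight, bias):
    """Model performance: average squared gap between predictions and labels."""
    errors = [(predict(x, weight, bias) - y) ** 2 for x, y in zip(features, labels)]
    return sum(errors) / len(errors)


def train(steps=2000, learning_rate=0.05):
    """Learning (training): repeatedly adjust the parameters to reduce the error."""
    weight, bias = 0.0, 0.0
    n = len(features)
    for _ in range(steps):
        # Gradient of the mean squared error with respect to each parameter.
        residuals = [predict(x, weight, bias) - y for x, y in zip(features, labels)]
        grad_weight = sum(2 * r * x for r, x in zip(residuals, features)) / n
        grad_bias = sum(2 * r for r in residuals) / n
        # Move each parameter a small step in the direction that lowers the error.
        weight -= learning_rate * grad_weight
        bias -= learning_rate * grad_bias
    return weight, bias


weight, bias = train()
print(f"learned weight={weight:.3f}, bias={bias:.3f}")
```

Because the predicted value here is numeric, this is a regression model; a classification model would instead output a categorical label.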

Logistic Regression

Generalized Linear Model: A generalization of linear regression in which a linear model is related to the dependent variable via a link function such as the logit function.
Logistic Regression: A generalized linear model commonly used to predict the probability of a binary dependent variable or outcome
Logistic (i.e. sigmoid) function: An S-shaped function used in logistic regression and elsewhere to convert log-odds values to probabilities. It is the inverse of the logit function.
Odds: The ratio of the probability that an event of interest will occur to the probability that it will not occur
Probability: A number between 0 and 1 quantifying a belief about how likely it is that a given event will occur
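
The relationships among probability, odds, log-odds, and the logistic function can be sketched in a few lines of Python. The function names and the example weights below are hypothetical and chosen only for illustration; the formulas themselves follow directly from the definitions above.

```python
import math


def odds(probability):
    """Odds: probability the event occurs divided by probability it does not."""
    return probability / (1.0 - probability)


def logit(probability):
    """Logit: the log of the odds, mapping a probability in (0, 1) to any real number."""
    return math.log(odds(probability))


def logistic(log_odds):
    """Logistic (sigmoid) function: the inverse of logit, mapping log-odds to a probability."""
    return 1.0 / (1.0 + math.exp(-log_odds))


def predict_probability(features, weights, bias):
    """Logistic regression: a linear model produces log-odds, which the
    logistic function converts to a probability."""
    log_odds = bias + sum(w * x for w, x in zip(weights, features))
    return logistic(log_odds)


# A probability of 0.75 corresponds to odds of 3 (i.e. 3 to 1).
print(odds(0.75))
# Applying logistic after logit returns the original probability (up to rounding).
print(logistic(logit(0.2)))
# Hypothetical weights and features whose linear combination is 0 give probability 0.5.
print(predict_probability([1.0, 2.0], [0.5, -0.25], 0.0))
```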

Performance Measures

Area under the ROC Curve (AUROC): A common performance metric for binary classification that can be computed by (a) quantifying the area under the receiver operating characteristic (ROC) curve or (equivalently) the sensitivity-specificity curve; or (b) calculating the probability that a randomly selected positive example receives a higher predicted probability than a randomly selected negative example
Average Precision (AP): Also called the average positive predictive value (PPV), the average precision is an estimate of the area under the precision (i.e. PPV) versus recall (i.e. sensitivity) curve
Binary classification: A prediction task in which the label y belongs to one of two classes or categories (e.g. positive/negative, yes/no)
Confusion Matrix: A cross-tabulation of true labels versus model predictions used to summarize the performance of a classification model
False Negative: Also called a Type II error, a false negative is a positive case that is incorrectly believed or predicted (e.g. by a predictive model) to be negative
False Positive: Also called a Type I error, a false positive is a negative case that is incorrectly believed or predicted (e.g. by a predictive model) to be positive
Negative predictive value: The probability that an individual predicted to be negative by a predictive model truly is negative for the condition or outcome of interest
Operating Point: A specific threshold used to convert model-predicted probabilities into binary predictions
Positive predictive value: The probability that an individual predicted to be positive by a predictive model truly is positive for the condition or outcome of interest
Precision: Another term for positive predictive value
Sensitivity: The proportion of positive cases correctly identified (i.e. predicted) as positive by a prediction model or diagnostic test
Specificity: The proportion of negative cases correctly identified (i.e. predicted) as negative by a prediction model or diagnostic test
Recall: Another term for sensitivity
Receiver Operating Characteristic Curve (ROC curve): A graph of sensitivity (the true positive rate) against 1 minus specificity (the false positive rate) for a binary classification model across the full range of possible classification thresholds
True Negative: A negative case that is correctly believed or predicted (e.g. by a predictive model) to be negative
True Positive: A positive case that is correctly believed or predicted (e.g. by a predictive model) to be positive
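
As a rough illustration of how several of these measures relate, the sketch below applies an operating point to predicted probabilities, tallies the four confusion matrix cells, derives sensitivity, specificity, PPV, and NPV from them, and computes AUROC using definition (b) from the AUROC entry. The labels, probabilities, function names, and the 0.5 threshold are made up for this example, and treating a probability equal to the threshold as positive is an assumption.

```python
def confusion_counts(true_labels, predicted_probabilities, threshold=0.5):
    """Apply an operating point (threshold) and count TP, FP, TN, FN."""
    tp = fp = tn = fn = 0
    for truth, probability in zip(true_labels, predicted_probabilities):
        predicted = 1 if probability >= threshold else 0
        if truth == 1 and predicted == 1:
            tp += 1
        elif truth == 0 and predicted == 1:
            fp += 1
        elif truth == 0 and predicted == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def summarize(tp, fp, tn, fn):
    """Derive common performance measures from the confusion matrix cells.
    Note: this sketch does not guard against empty denominators."""
    return {
        "sensitivity": tp / (tp + fn),  # also called recall
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),          # also called precision
        "npv": tn / (tn + fn),
    }


def auroc(true_labels, predicted_probabilities):
    """Fraction of positive/negative pairs in which the positive example has
    the higher predicted probability (ties count as half)."""
    pairs = list(zip(true_labels, predicted_probabilities))
    positives = [p for t, p in pairs if t == 1]
    negatives = [p for t, p in pairs if t == 0]
    wins = 0.0
    for pos in positives:
        for neg in negatives:
            if pos > neg:
                wins += 1.0
            elif pos == neg:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Made-up labels (1 = positive, 0 = negative) and model-predicted probabilities.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_prob = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2, 0.1, 0.6]

counts = confusion_counts(y_true, y_prob, threshold=0.5)
metrics = summarize(*counts)
print(counts, metrics, auroc(y_true, y_prob))
```

Changing the threshold changes the confusion matrix and the measures derived from it, but not the AUROC, which summarizes performance across all thresholds.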

Fields and Subfields Related to Data Science

Artificial Intelligence: An interdisciplinary field focused on development of thinking machines that process information and make decisions
Deep Learning: The branch of machine learning concerned with development of deep neural networks (i.e. containing many hidden layers)
Machine Learning: A subfield of artificial intelligence that incorporates elements of statistics, computer science, and other disciplines to develop systems that learn parameters of a statistical model from data to make predictions or decisions
Supervised Learning: The branch of machine learning that focuses on learning a function that maps inputs to outputs based on example input-output pairs

Other Relevant Terms

Categorical variable: A variable that can take on one of a limited and usually fixed number of possible values. Common examples in healthcare include sex, race, diagnosis codes, procedure codes, and medications
Ordinal variable: A variable that is similar to a categorical variable, but whose categories are ordered. Examples: rating scales and Likert scales
Preprocessing: Steps taken to clean, organize, filter, or transform raw data to prepare it for another purpose, such as training a predictive model
Vector: For our purposes a vector is simply an ordered list of numbers, such as a list of input features or parameters in a predictive model
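
One way these terms fit together is sketched below: a preprocessing step that turns a raw record into a feature vector. One-hot encoding for the categorical variable and integer codes for the ordinal variable are common choices but only one of several options, and the field names, category lists, and example record are hypothetical.

```python
# Hypothetical category lists, assumed for illustration only.
ADMISSION_TYPES = ["emergency", "urgent", "elective"]    # categorical: no inherent order
PAIN_SCALE = ["none", "mild", "moderate", "severe"]      # ordinal: ordered categories


def one_hot(value, categories):
    """Encode a categorical value as a list with a 1.0 in the matching position."""
    return [1.0 if value == category else 0.0 for category in categories]


def ordinal_code(value, ordered_levels):
    """Encode an ordinal value by its position in the ordered list of levels."""
    return float(ordered_levels.index(value))


def preprocess(record):
    """Transform a raw record (a dict) into a vector: an ordered list of numbers."""
    return (
        one_hot(record["admission_type"], ADMISSION_TYPES)
        + [ordinal_code(record["pain"], PAIN_SCALE)]
        + [float(record["age"])]
    )


feature_vector = preprocess({"admission_type": "urgent", "pain": "moderate", "age": 54})
print(feature_vector)
```

The resulting vector could then serve as the input features x for a predictive model.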
