In this notebook we implement logistic regression to diagnose Parkinson's disease. The dataset is the Oxford Parkinson's Disease Detection Dataset from the UCI Machine Learning Repository (https://archive.ics.uci.edu/dataset/174/parkinsons).
This dataset is composed of a range of biomedical voice measurements from 31 people, 23 of whom have Parkinson's disease (PD). Each column in the table is a particular voice measure, and each row corresponds to one of the 195 voice recordings from these individuals ("name" column). The main aim of the data is to discriminate healthy people from those with PD, according to the "status" column, which is set to 0 for healthy and 1 for PD.
The data is in ASCII CSV format. Each row of the CSV file contains one instance, corresponding to one voice recording. There are around six recordings per patient, and the name of the patient is identified in the first column. For further information or to pass on comments, please contact Max Little (littlem '@' robots.ox.ac.uk).
Additional Variable Information
- Matrix column entries (attributes):
- name - ASCII subject name and recording number
- MDVP:Fo(Hz) - Average vocal fundamental frequency
- MDVP:Fhi(Hz) - Maximum vocal fundamental frequency
- MDVP:Flo(Hz) - Minimum vocal fundamental frequency
- MDVP:Jitter(%),MDVP:Jitter(Abs),MDVP:RAP,MDVP:PPQ,Jitter:DDP - Several measures of variation in fundamental frequency
- MDVP:Shimmer,MDVP:Shimmer(dB),Shimmer:APQ3,Shimmer:APQ5,MDVP:APQ,Shimmer:DDA - Several measures of variation in amplitude
- NHR,HNR - Two measures of ratio of noise to tonal components in the voice
- status - Health status of the subject: one for Parkinson's, zero for healthy
- RPDE,D2 - Two nonlinear dynamical complexity measures
- DFA - Signal fractal scaling exponent
- spread1,spread2,PPE - Three nonlinear measures of fundamental frequency variation
Reference:
Little, Max. (2008). Parkinsons. UCI Machine Learning Repository. https://doi.org/10.24432/C59C74.
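Before the analysis, the dataset can be loaded with pandas. A minimal sketch, assuming the UCI file has been saved locally as parkinsons.data:

```python
import pandas as pd

# Assumes the UCI file was downloaded locally as "parkinsons.data" (CSV with header)
df = pd.read_csv("parkinsons.data")
print(df.shape)           # expected: (195, 24) -- 195 recordings, 23 attributes plus "name"
print(df["status"].head())
```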
The data analysis phase aims to provide an understanding of the dataset, highlighting key relationships, distributions, and feature types. This knowledge will guide preprocessing steps and contribute to the development of an effective logistic regression model for diagnosing Parkinson's disease.
Correlation Analysis:
- Exploring the correlation between each feature and the target variable (status) and identifying features that exhibit a strong correlation, as these may play a significant role in the diagnostic process.
Distribution Analysis:
- Examining the distribution of target variable values (1 and 0) to understand the prevalence of Parkinson's disease in the dataset and identifying potential class imbalances that may impact model training and evaluation; a short sketch of both checks follows this list.
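A minimal sketch of both checks, reusing the df loaded above:

```python
# Correlation of every numeric feature with the target "status"
corr_with_status = df.drop(columns=["name"]).corr()["status"].drop("status")
print(corr_with_status.sort_values(key=abs, ascending=False).head(10))

# Class balance of the target: 1 = PD, 0 = healthy
print(df["status"].value_counts())
```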
Logistic regression (also known as the logit model) is often used for classification and predictive analytics. It estimates the probability of an event occurring; since the outcome is a probability, the dependent variable is bounded between 0 and 1, which aligns with the case we are dealing with here.
Reference:
IBM. (n.d.). Logistic Regression. https://www.ibm.com/topics/logistic-regression
Sigmoid/Logistic function
The logistic regression model is represented as

$$f_{\mathbf{w},b}(\mathbf{x}) = g(\mathbf{w} \cdot \mathbf{x} + b)$$

where function $g$ is the sigmoid (logistic) function:

$$g(z) = \frac{1}{1 + e^{-z}}$$
Here is the implementation using Python:

```python
import numpy as np

def sigmoid(z):
    """
    Compute the sigmoid of z
    """
    g = 1 / (1 + np.exp(-z))
    return g
```
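A quick sanity check of the implementation (the printed values are approximate):

```python
# sigmoid should approach 0 for large negative z, equal 0.5 at z = 0,
# and approach 1 for large positive z
print(sigmoid(np.array([-10., 0., 10.])))  # -> [4.54e-05, 0.5, 0.99995]
```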
Logistic Cost Function
The logistic regression cost function is of the form

$$J(\mathbf{w},b) = \frac{1}{m} \sum_{i=0}^{m-1} loss(f_{\mathbf{w},b}(\mathbf{x}^{(i)}), y^{(i)}) \tag{1}$$

where

- $m$ is the number of training examples in the dataset
- $loss(f_{\mathbf{w},b}(\mathbf{x}^{(i)}), y^{(i)})$ is the cost for a single data point, which is
$$loss(f_{\mathbf{w},b}(\mathbf{x}^{(i)}), y^{(i)}) = -y^{(i)} \log\left(f_{\mathbf{w},b}\left( \mathbf{x}^{(i)} \right) \right) - \left( 1 - y^{(i)}\right) \log \left( 1 - f_{\mathbf{w},b}\left( \mathbf{x}^{(i)} \right) \right) \tag{2}$$
- $f_{\mathbf{w},b}(\mathbf{x}^{(i)})$ is the model's prediction, while $y^{(i)}$ is the actual label
- $f_{\mathbf{w},b}(\mathbf{x}^{(i)}) = g(\mathbf{w} \cdot \mathbf{x}^{(i)} + b)$, where function $g$ is the sigmoid function
Here is the implementation using Python:

```python
def compute_cost(X, y, w, b, *argv):
    """
    Computes the cost over all examples
    """
    m, n = X.shape
    cost = 0.
    for i in range(m):
        z_i = np.dot(X[i], w) + b   # linear combination w . x + b
        f_wb_i = sigmoid(z_i)       # model prediction for example i
        # per-example loss from equation (2)
        cost += -y[i] * np.log(f_wb_i) - (1 - y[i]) * np.log(1 - f_wb_i)
    total_cost = cost / m
    return total_cost
```
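For larger datasets, the same cost can be computed without the explicit loop. A vectorized sketch (not part of the original notebook), relying on NumPy broadcasting:

```python
def compute_cost_vectorized(X, y, w, b):
    """Equivalent of compute_cost using whole-matrix NumPy operations."""
    f_wb = sigmoid(X @ w + b)  # predictions for all m examples at once
    # mean of the per-example losses from equation (2)
    return np.mean(-y * np.log(f_wb) - (1 - y) * np.log(1 - f_wb))
```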
Gradient Descent
The gradient descent algorithm is:

$$\begin{align*}
&\text{repeat until convergence:} \\
&\qquad b := b - \alpha \frac{\partial J(\mathbf{w},b)}{\partial b} \\
&\qquad w_j := w_j - \alpha \frac{\partial J(\mathbf{w},b)}{\partial w_j} \qquad \text{for } j = 0 \ldots n-1
\end{align*}$$

where the parameters $b$ and $w_j$ are all updated simultaneously.

The `compute_gradient` function computes $\frac{\partial J(\mathbf{w},b)}{\partial b}$ and $\frac{\partial J(\mathbf{w},b)}{\partial w_j}$:

$$\frac{\partial J(\mathbf{w},b)}{\partial b} = \frac{1}{m} \sum_{i=0}^{m-1} \left( f_{\mathbf{w},b}(\mathbf{x}^{(i)}) - y^{(i)} \right) \tag{3}$$

$$\frac{\partial J(\mathbf{w},b)}{\partial w_j} = \frac{1}{m} \sum_{i=0}^{m-1} \left( f_{\mathbf{w},b}(\mathbf{x}^{(i)}) - y^{(i)} \right) x_j^{(i)} \tag{4}$$

where

- $m$ is the number of training examples in the dataset
- $f_{\mathbf{w},b}(\mathbf{x}^{(i)})$ is the model's prediction, while $y^{(i)}$ is the actual label
Here is the implementation using Python:

```python
def compute_gradient(X, y, w, b, *argv):
    """
    Computes the gradient for logistic regression
    """
    m, n = X.shape
    dj_dw = np.zeros(w.shape)
    dj_db = 0.
    for i in range(m):
        z_wb = np.dot(X[i], w) + b
        f_wb = sigmoid(z_wb)
        dj_db_i = f_wb - y[i]              # error term for example i
        dj_db += dj_db_i                   # accumulate equation (3)
        for j in range(n):
            dj_dw[j] += X[i, j] * dj_db_i  # accumulate equation (4)
    dj_db /= m
    dj_dw /= m
    return dj_db, dj_dw
```
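The pieces above can be combined into a full training loop. A minimal sketch, where alpha, num_iters, and the thresholded predict helper are illustrative rather than the notebook's actual settings:

```python
def gradient_descent(X, y, w_in, b_in, alpha, num_iters):
    """Run batch gradient descent, updating w and b simultaneously each step."""
    w, b = w_in.copy(), b_in
    for i in range(num_iters):
        dj_db, dj_dw = compute_gradient(X, y, w, b)
        w = w - alpha * dj_dw   # simultaneous update of all w_j
        b = b - alpha * dj_db
        if i % 1000 == 0:
            print(f"Iteration {i:5d}: cost {compute_cost(X, y, w, b):.4f}")
    return w, b

def predict(X, w, b):
    """Classify each example as 1 when the predicted probability is >= 0.5."""
    return (sigmoid(X @ w + b) >= 0.5).astype(int)

# Illustrative usage (a feature matrix X_train and labels y_train are assumed):
# w, b = gradient_descent(X_train, y_train, np.zeros(X_train.shape[1]), 0., 1e-4, 10000)
# print(f"Train Accuracy: {np.mean(predict(X_train, w, b) == y_train) * 100:f}")
```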
The output gained from logistic regression using pure Python is

Train Accuracy: 76.410256

compared to using the scikit-learn library:

- Scikit-learn Logistic Regression accuracy: 89.74358974358975
- Scikit-learn MLPClassifier (neural network) accuracy: 94.87179487179486
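For reference, a minimal sketch of how such scikit-learn baselines might be produced; the split, scaling, and hyperparameters here are assumptions, and the notebook's actual setup may differ:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

# Hypothetical split and scaling; adjust to match the notebook's preprocessing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
scaler = StandardScaler().fit(X_train)
X_train_s, X_test_s = scaler.transform(X_train), scaler.transform(X_test)

log_reg = LogisticRegression(max_iter=1000).fit(X_train_s, y_train)
print("Accuracy:", log_reg.score(X_test_s, y_test) * 100)

mlp = MLPClassifier(hidden_layer_sizes=(100,), max_iter=1000, random_state=42)
mlp.fit(X_train_s, y_train)
print("Accuracy with Neural Network:", mlp.score(X_test_s, y_test) * 100)
```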
To reach optimal accuracy and precision, it is worth tweaking the pipeline through feature engineering, preprocessing, and hyperparameter tuning. Contributions are always welcome!