[MACHINE LEARNING] Data Science Techniques for Liver Disease Prediction
lucashomuniz/Project-06

✅ PROJECT-06

In this project, we address the rising incidence of liver disease, driven by factors such as excessive alcohol consumption, exposure to harmful gases, ingestion of contaminated food, and the use of drugs and anabolic steroids. Our primary objective is to develop a machine learning model that predicts whether a patient will develop liver disease, thus helping doctors, hospitals, and governments plan health budgets and implement effective prevention policies. The desired outcome is a binary classification: “yes” or “no.”

To achieve this, we employ supervised classification techniques, creating multiple versions of the model with different algorithms. We present the end-to-end pipeline, encompassing model building, training, evaluation, and selection. Throughout this project, we discuss which methods to apply, in what sequence, and with which tools. Finally, we compare five distinct algorithms to determine the optimal approach for this classification task.

Keywords: Python Language, Data Analysis, Machine Learning, Classification Model, Supervised Learning, Logistic Regression, Random Forest, KNN, Decision Tree, SVM, Liver Diseases, Health Budget.

✅ PROCESS

Based on the business problem definition and dataframe analysis alone, it is often challenging to identify the ideal algorithm for a machine learning model. Therefore, we adopt an experimentation approach, testing multiple algorithms with various hyperparameters to generate different model versions and compare their performance. Similarly, the best preprocessing techniques are not always clear at the outset, so we also experiment to determine the most effective approach. In practice, the data scientist relies on systematic experimentation to select the best algorithm, technique, and overall strategy.
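A minimal sketch of this experimentation loop, using scikit-learn's GridSearchCV on a synthetic stand-in for the liver-disease dataframe (the data and hyperparameter grid here are illustrative, not the project's actual choices):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the liver-disease dataframe
X, y = make_classification(n_samples=300, n_features=10, random_state=42)

# Generate several model versions by varying hyperparameters,
# ranking them with cross-validated AUC
grid = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid={"n_estimators": [50, 100], "max_depth": [3, None]},
    scoring="roc_auc",
    cv=5,
)
grid.fit(X, y)
print(grid.best_params_, round(grid.best_score_, 3))
```

The same loop extends naturally to multiple algorithm families: each candidate is fitted and scored the same way, and the comparison drives the final selection.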

We begin with exploratory analysis immediately after loading the data. This involves data cleaning (removing duplicates, handling missing values) and applying specific transformations where needed. The focus is on understanding the dataframe: examining numerical and categorical variables, exploring their distributions, and identifying outliers via boxplots, descriptive tables, and frequency counts. Ensuring there are no duplicate rows or columns is critical to prevent bias in the model. The goal is a generalizable solution.
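These cleaning steps can be sketched in pandas as follows (the column names and values are hypothetical, loosely modeled on clinical fields like those in the ILPD data):

```python
import pandas as pd

# Hypothetical sample; the real dataframe is loaded from the ILPD dataset
df = pd.DataFrame({
    "age": [45, 45, 60, None, 33],
    "total_bilirubin": [0.7, 0.7, 3.2, 1.1, 0.9],
    "disease": ["yes", "yes", "yes", "no", "no"],
})

df = df.drop_duplicates()                         # remove duplicate rows
df["age"] = df["age"].fillna(df["age"].median())  # impute missing values

print(df.describe())                  # descriptive table for numeric variables
print(df["disease"].value_counts())   # frequency counts for the target
```

Boxplots (e.g. `df.boxplot()`) complement the descriptive table when hunting for outliers.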

Next comes feature engineering, where deeper transformations may be performed, including creating or modifying variables. At this stage, feature selection can be used to choose the most relevant variables for the machine learning process. Additionally, a correlation table helps uncover relationships (positive or negative) between variables and detect possible multicollinearity.
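A correlation table and a simple multicollinearity flag might look like this (the variables below are synthetic; the 0.8 threshold is a common rule of thumb, not a project-specified value):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Illustrative variables; real names come from the ILPD dataframe
df = pd.DataFrame({"total_bilirubin": rng.normal(1.0, 0.5, 100)})
df["direct_bilirubin"] = df["total_bilirubin"] * 0.4 + rng.normal(0, 0.05, 100)
df["albumin"] = rng.normal(3.5, 0.4, 100)

corr = df.corr()  # pairwise Pearson correlations

# Flag strongly correlated pairs as possible multicollinearity
high = (corr.abs() > 0.8) & (corr.abs() < 1.0)
print(corr.round(2))
print(high.any().any())
```

Pairs flagged here are candidates for dropping one variable or combining the two during feature selection.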

After that, we move on to preprocessing: converting any remaining text-based variables to numeric formats and finalizing the machine learning pipeline, which includes label encoding, normalization, standardization, and data scaling. A standard practice is to split the dataframe into training and testing sets, so the model learns from the training set and is evaluated on unseen test data. Using the same data for both training and testing is inadvisable, as the model would already be familiar with it. Evaluating the model on new data with known outcomes provides an accurate measure of its performance.
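A sketch of this preprocessing stage with scikit-learn, assuming illustrative column names (`gender`, `age`, `disease`) rather than the project's exact schema:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler

# Illustrative data; the real features come from the ILPD dataframe
df = pd.DataFrame({
    "gender": ["Male", "Female", "Male", "Female"] * 25,
    "age": range(100),
    "disease": ["yes", "no"] * 50,
})

# Label-encode the remaining text variables
df["gender"] = LabelEncoder().fit_transform(df["gender"])
X = df[["gender", "age"]]
y = LabelEncoder().fit_transform(df["disease"])

# Hold out test data the model has never seen
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)

# Fit the scaler on training data only, then apply it to both splits
scaler = StandardScaler().fit(X_train)
X_train_s, X_test_s = scaler.transform(X_train), scaler.transform(X_test)
```

Fitting the scaler on the training split alone keeps test-set statistics from leaking into the model.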


✅ CONCLUSION

There are many machine learning algorithms applicable to this kind of problem. Below is an overview of Logistic Regression, Random Forest, KNN, Decision Tree, and SVM, outlining their main advantages and disadvantages.

  • Logistic Regression: a simple and fast algorithm especially suited to binary classification. It offers a probabilistic interpretation of predictions and quantifies the influence of each variable. However, it may struggle with complex decision boundaries and imbalanced classes.

  • Random Forest: an ensemble of decision trees, robust against overfitting and effective at handling unbalanced data. It also identifies the most important variables. While powerful, training a large number of trees can be computationally expensive, and the results are harder to interpret.

  • KNN (k-Nearest Neighbors): a non-parametric approach that makes no assumptions about the data distribution and can handle complex decision boundaries. However, it may degrade on high-dimensional datasets and can be slow at prediction time due to the neighbor search.

  • Decision Tree: an intuitive algorithm that is easy to interpret and visualize, handling both categorical and numeric variables efficiently. It is prone to overfitting and may require careful tuning to find an optimal tree structure.

  • SVM (Support Vector Machines): well suited to complex decision boundaries and high-dimensional data, used in both classification and regression tasks. It is less susceptible to overfitting, but choosing the right kernel and hyperparameters can be challenging, and the model is often harder to interpret than the alternatives.


The choice of algorithm depends on the context, data characteristics, and project objectives, so careful experimentation is recommended before selecting a final approach. After evaluating the five algorithms, Random Forest achieved the best performance as measured by AUC score. Even so, there are further optimization possibilities, such as incorporating more relevant variables, tuning different hyperparameters, extending training time, and revisiting data preprocessing.
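The five-way comparison by AUC can be sketched as follows; note the dataset here is synthetic and the hyperparameters are scikit-learn defaults, so the ranking it prints is illustrative rather than a reproduction of the project's result:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the preprocessed liver-disease data
X, y = make_classification(n_samples=400, n_features=10, random_state=7)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(random_state=7),
    "KNN": KNeighborsClassifier(),
    "Decision Tree": DecisionTreeClassifier(random_state=7),
    "SVM": SVC(random_state=7),
}

# Rank all five candidates by mean cross-validated AUC
auc = {name: cross_val_score(m, X, y, cv=5, scoring="roc_auc").mean()
       for name, m in models.items()}
for name, score in sorted(auc.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {score:.3f}")
```

AUC is threshold-independent, which makes it a reasonable yardstick when the "yes"/"no" classes are imbalanced, as is common in medical data.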

Machine learning is an iterative process, allowing repeated cycles of improvement starting from the initial analysis phase. Once the best model is chosen, it can generate predictions for multiple patients at once by loading their data from a .csv file. After standardization, the results can be saved to a new spreadsheet and handed to the business team for decision-making.
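That batch-scoring step might look like the sketch below; the columns, values, and output filename are all illustrative (the incoming .csv is simulated in-memory with `io.StringIO`):

```python
import io

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler

# Train on illustrative data (stand-in for the final chosen model)
train = pd.DataFrame({"age": [30, 40, 50, 60] * 10,
                      "albumin": [4.0, 3.5, 3.0, 2.5] * 10,
                      "disease": [0, 0, 1, 1] * 10})
scaler = StandardScaler().fit(train[["age", "albumin"]])
model = RandomForestClassifier(random_state=0).fit(
    scaler.transform(train[["age", "albumin"]]), train["disease"])

# New patients loaded from a .csv file (simulated here in-memory)
csv_data = io.StringIO("age,albumin\n35,3.8\n58,2.6\n")
patients = pd.read_csv(csv_data)

# Standardize with the SAME scaler used in training, then predict
patients["prediction"] = model.predict(scaler.transform(patients))

# Save the scored spreadsheet for the business team
patients.to_csv("predictions.csv", index=False)
print(patients)
```

Reusing the fitted scaler on the new patients keeps the batch predictions consistent with the conditions the model was trained under.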

✅ SOURCE

https://archive.ics.uci.edu/dataset/225/ilpd+indian+liver+patient+dataset