This project involves predicting the survival of passengers on the Titanic using a Logistic Regression model. By analyzing various features from the Titanic dataset, we aim to identify factors that influenced survival and build a predictive model.
- Introduction
- Dataset
- Libraries Used
- Data Exploration
- Data Cleaning
- Feature Engineering
- Model Building
- Evaluation
- Results
- How to Run
- References
The Titanic disaster is one of the most infamous shipwrecks in history. In this project, we use machine learning techniques to predict whether a passenger survived the Titanic disaster based on features such as age, sex, passenger class, and others. Logistic Regression is employed due to its effectiveness in binary classification tasks.
The dataset used for this project is the well-known Titanic dataset, which can be found on Kaggle's Titanic: Machine Learning from Disaster competition. It contains data about passengers, such as their age, gender, class, etc., and whether they survived the disaster.
- Total number of samples: 891
- Features: 11 (after preprocessing)
- Target variable: Survived (0 = No, 1 = Yes)
- pandas: For data manipulation and analysis.
- numpy: For numerical operations.
- seaborn: For data visualization.
- matplotlib: For plotting graphs.
- scikit-learn: For implementing the machine learning model.
We performed an initial exploration of the dataset to understand its structure and contents:
- Viewing the dataset: Used head() to get a quick look at the first few rows of the data.
- Data information: Used info() and describe() to get insights into the data types and summary statistics.
- Visualization: Used seaborn count plots to understand the distribution of survival among different categories (e.g., male vs. female).
Example visualization code:
# Plot showing survival counts
import seaborn as sns
sns.countplot(x='Survived', data=titanic_data)
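The numbers behind such a count plot can also be inspected as a table. The following is a minimal sketch, assuming a small hand-made DataFrame in place of the real Kaggle file; the eight rows are invented for illustration and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical stand-in rows; the notebook would load the Kaggle data instead.
titanic_data = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 1, 0, 1, 0],
    "Sex": ["male", "female", "female", "male",
            "female", "male", "male", "female"],
})

# Rows per (Sex, Survived) pair: the counts a plot split by sex would draw.
survival_counts = pd.crosstab(titanic_data["Sex"], titanic_data["Survived"])
print(survival_counts)

# Share of survivors within each sex.
survival_rate = titanic_data.groupby("Sex")["Survived"].mean()
print(survival_rate)
```

A table like this makes it easy to check what the plot suggests before any modelling is done.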
Data cleaning is a crucial step to handle missing values and irrelevant features:
- Missing Values: Identified missing values using isna().sum() and visualized them with a heatmap.
- Filling Missing Values: Filled missing values in the Age column with the mean age.
- Dropping Irrelevant Features: Dropped features like Cabin, which had too many missing values, and other non-numeric features that weren't essential for prediction.
# Fill missing age values with the mean age
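# Assign the result back; calling fillna(inplace=True) on a selected column is deprecated in recent pandas
titanic_data['Age'] = titanic_data['Age'].fillna(titanic_data['Age'].mean())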
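The cleaning steps above can be sketched end to end on a toy frame. This is an illustration under assumptions: the four rows below are invented, and only the Age and Cabin handling described in this section is shown.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the Titanic data, with gaps in Age and Cabin.
titanic_data = pd.DataFrame({
    "Survived": [0, 1, 1, 0],
    "Age": [22.0, np.nan, 26.0, 36.0],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
})

# Count missing values per column before cleaning.
missing_before = titanic_data.isna().sum()
print(missing_before)

# Replace missing ages with the mean of the observed ages.
titanic_data["Age"] = titanic_data["Age"].fillna(titanic_data["Age"].mean())

# Drop Cabin, which is mostly empty.
titanic_data = titanic_data.drop(columns=["Cabin"])

# Confirm that no missing values remain.
missing_after = titanic_data.isna().sum()
print(missing_after)
```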
Converted non-numeric data into numeric form for model processing:
- Used pd.get_dummies() to convert the Sex column into a numerical format.
- Dropped columns that were not useful for the prediction model.
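# Convert the 'Sex' column to numerical (1 = male, 0 = female)
# drop_first=True drops the 'female' dummy and keeps a single 'male' column
titanic_data['Gender'] = pd.get_dummies(titanic_data['Sex'], drop_first=True, dtype=int).iloc[:, 0]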
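To make the encoding concrete, here is a small sketch on invented rows. It assumes the column holds only the values 'male' and 'female', as in the Kaggle data, so that the single remaining dummy column is named 'male'.

```python
import pandas as pd

# Hypothetical rows for illustration.
titanic_data = pd.DataFrame({"Sex": ["male", "female", "female", "male"]})

# One dummy column per category; drop_first removes 'female' (first alphabetically).
dummies = pd.get_dummies(titanic_data["Sex"], drop_first=True, dtype=int)
print(list(dummies.columns))

# Keep the remaining indicator as Gender (1 = male, 0 = female), then drop the text column.
titanic_data["Gender"] = dummies["male"]
titanic_data = titanic_data.drop(columns=["Sex"])
print(titanic_data)
```

Keeping one indicator instead of two avoids feeding the model a pair of perfectly correlated columns.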
We built the model using Logistic Regression:
- Data Splitting: Split the data into training and testing sets using train_test_split().
- Model Training: Trained the Logistic Regression model using the training set.
- Prediction: Made predictions on the test set.
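from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Train test split: hold out 33% of the rows for testing
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.33, random_state=42)
# Fit Logistic Regression
lr = LogisticRegression()
lr.fit(x_train, y_train)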
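The split, fit, and predict steps can be tried without the Kaggle file. The sketch below runs them on synthetic data: the feature names, the 85% rule used to generate labels, and max_iter=1000 (set to give the solver room on unscaled features) are all assumptions made for illustration, not values from the project notebook.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n = 300

# Synthetic features loosely shaped like the cleaned Titanic columns.
x = pd.DataFrame({
    "Gender": rng.integers(0, 2, n),   # 1 = male, 0 = female
    "Pclass": rng.integers(1, 4, n),   # 1, 2 or 3
    "Age": rng.uniform(1, 80, n),
})

# Synthetic labels: in about 85% of rows women survive and men do not.
follows_rule = rng.random(n) < 0.85
y = np.where(follows_rule, 1 - x["Gender"], x["Gender"])

# Same split settings as in the snippet above.
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.33, random_state=42
)

lr = LogisticRegression(max_iter=1000)
lr.fit(x_train, y_train)

predict = lr.predict(x_test)                      # hard 0/1 labels
survival_proba = lr.predict_proba(x_test)[:, 1]   # estimated P(Survived = 1)
accuracy = lr.score(x_test, y_test)
print(f"Test accuracy on synthetic data: {accuracy:.2f}")
```

The accuracy printed here only reflects the synthetic rule, so it says nothing about performance on the real dataset.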
Evaluated the model using:
- Confusion Matrix: To understand the true positives, true negatives, false positives, and false negatives.
- Classification Report: Provided precision, recall, and F1-score for the model.
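from sklearn.metrics import classification_report
# Predict on the held-out test set, then compare against the true labels
predict = lr.predict(x_test)
print(classification_report(y_test, predict))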
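The confusion matrix mentioned above can be produced the same way. A minimal sketch, assuming two short hand-written label lists in place of the real test labels and predictions:

```python
from sklearn.metrics import classification_report, confusion_matrix

# Hypothetical true labels and predictions (1 = survived, 0 = did not survive).
y_test = [1, 0, 1, 1, 0, 0, 1, 0]
predict = [1, 0, 0, 1, 0, 1, 1, 0]

# Rows are true classes, columns are predicted classes:
# [[true negatives, false positives],
#  [false negatives, true positives]]
cm = confusion_matrix(y_test, predict)
tn, fp, fn, tp = cm.ravel()
print(cm)

report = classification_report(
    y_test, predict, target_names=["Did not survive", "Survived"]
)
print(report)
```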
The Logistic Regression model provided reasonable accuracy and insight into the factors affecting survival:
- Precision: Of the passengers the model predicted as survivors, the fraction who actually survived.
- Recall: Of the passengers who actually survived, the fraction the model correctly identified.
- F1 Score: The harmonic mean of precision and recall.
Note: Accuracy could potentially be improved by incorporating more features or using more complex models.
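For reference, the three metrics listed above follow directly from the confusion-matrix counts. The sketch below uses a hypothetical helper, precision_recall_f1, and made-up counts; they are not results from this project.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 for the positive (survived) class."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Made-up counts: 30 true positives, 10 false positives, 20 false negatives.
precision, recall, f1 = precision_recall_f1(tp=30, fp=10, fn=20)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Because F1 is a harmonic mean, it stays low whenever either precision or recall is low.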
- Clone the repository:
git clone https://github.com/yourusername/titanic-survival-logistic-regression.git
- Install the required Python libraries:
pip install pandas numpy seaborn matplotlib scikit-learn
- Run the Jupyter Notebook:
jupyter notebook
- Open the Titanic_Survival_Logistic_Regression.ipynb file and execute the cells.
- Kaggle Titanic Competition: Titanic: Machine Learning from Disaster
- Scikit-Learn Documentation: Logistic Regression
This README provides a comprehensive overview of the Titanic survival prediction project using Logistic Regression. For detailed implementation, visualizations, and insights, refer to the project notebook.