The Titanic dataset is a well-known dataset used for predictive modeling and classification tasks. The dataset includes various attributes such as passenger demographics, ticket information, and survival status. The objective of this analysis is to uncover patterns and insights related to the survival of passengers and to identify key factors influencing survival rates.
Data Cleaning and Preprocessing Initial examination of the Titanic dataset revealed several issues that required data cleaning and preprocessing:
Handling Missing Values: The dataset had missing values in the Age
, Cabin
, and Embarked
columns. The missing values in Age
were imputed using the median age. For the Embarked
column, missing values were filled with the most common port of embarkation ('S'). The Cabin
column had many missing values and was dropped from the analysis due to its sparsity.
Removing Duplicates: No duplicate records were found in the dataset.
Data Transformation: Categorical variables such as Sex
and Embarked
were converted to numerical format using label encoding.
Outlier Detection: Outliers in the Fare
column were identified and analyzed. Since extremely high fares could indicate errors or unique cases, they were handled separately in the analysis.
Descriptive statistics were calculated to summarize the main characteristics of the dataset:
- Survival Rate: Approximately 38.38% of passengers survived.
- Gender Distribution: About 64.76% of passengers were male, and 35.24% were female.
- Class Distribution: Passengers were distributed across three classes: 24.24% in first class, 20.65% in second class, and 55.11% in third class.
- Age Distribution: The median age of passengers was 28 years.
Data visualization techniques were employed to uncover patterns and relationships in the dataset: Survival Rate by Gender:
- Women had a survival rate of approximately 74.20%, while men had a survival rate of about 18.89%. Survival Rate by Class:
- First-class passengers had a survival rate of 62.96%, second-class passengers had a survival rate of 47.28%, and third-class passengers had a survival rate of 24.24%. Age Distribution and Survival:
- A histogram of age distribution revealed that children had a higher survival rate compared to adults. The survival rate for passengers under 16 years of age was around 55.56%. Survival Rate by Embarkation Point:
- Passengers who embarked at 'C' (Cherbourg) had a survival rate of 55.36%, those at 'Q' (Queenstown) had a survival rate of 38.96%, and those at 'S' (Southampton) had a survival rate of 33.70%.
Correlation analysis was conducted to assess the relationships between numerical variables:
- Correlation Matrix: A heatmap of the correlation matrix showed that
Fare
had a positive correlation with survival (0.26), indicating that passengers who paid higher fares were more likely to survive. - Age and Fare: There was a slight negative correlation between age and survival (-0.08), suggesting younger passengers had higher survival rates.
Key Insights
Gender: Gender played a significant role in survival, with women having a substantially higher survival rate than men. Class: Passengers in first class had a higher likelihood of survival compared to those in second and third class. Age: Younger passengers, especially children, had higher survival rates. Embarkation Point: Passengers who embarked from Cherbourg had the highest survival rate among the three embarkation points.
- Further Analysis: Conduct deeper analysis to explore other potential factors influencing survival, such as family size and ticket fare combinations.
- Predictive Modeling: Develop and validate predictive models using machine learning techniques to predict survival based on the identified factors.
- Historical Context: Consider the historical context and additional datasets to enhance the understanding of the survival patterns observed.
- LinkedIn: @bLevi
- Email: [email protected]
- HNG Internship: Internship
- HNG Tech: Hire