This repository contains code and documentation for analyzing COVID-19 vaccination data. The analysis includes data preprocessing, exploratory data analysis, statistical tests, and data visualization. This README provides an overview of the project's phases and instructions for running and replicating the analysis.
The analysis is divided into five phases:
- Load the vaccination data from the provided CSV file.
- Inspect the data's structure using `df.info()`, `df.tail()`, and `df.columns`.
- Remove unnecessary columns ('source_name' and 'source_website') using `df.drop()`.
- Clean and describe the data using `df.describe()`.
- Handle missing values by filling them with zero.
- Convert data types: 'total_vaccinations', 'people_vaccinated', 'people_fully_vaccinated', 'daily_vaccinations_raw', and 'daily_vaccinations' to 'int64', and 'iso_code' to 'string'.
- Save the preprocessed data to a new CSV file using `df.to_csv()`.
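
The preprocessing steps above could be sketched roughly as follows. This is a minimal illustration, not the repository's actual script: the function names `preprocess` and `run_preprocessing` are hypothetical, the file paths are left to the caller, and the zero-fill is applied to numeric columns only, which is an assumption made so that text columns are not overwritten with `0`.

```python
import pandas as pd

# Columns the steps above convert to int64.
INT_COLUMNS = [
    "total_vaccinations",
    "people_vaccinated",
    "people_fully_vaccinated",
    "daily_vaccinations_raw",
    "daily_vaccinations",
]


def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the preprocessing steps to a raw vaccination DataFrame."""
    # Drop the two source columns; errors="ignore" tolerates files that lack them.
    df = df.drop(columns=["source_name", "source_website"], errors="ignore")
    # Fill missing values with zero (assumption: numeric columns only).
    numeric_cols = df.select_dtypes("number").columns
    df = df.fillna({col: 0 for col in numeric_cols})
    # Cast the count columns to int64 and iso_code to pandas' string dtype.
    dtypes = {col: "int64" for col in INT_COLUMNS}
    dtypes["iso_code"] = "string"
    return df.astype(dtypes)


def run_preprocessing(input_csv: str, output_csv: str) -> pd.DataFrame:
    """Load, inspect, preprocess, and save the data (paths supplied by the caller)."""
    raw = pd.read_csv(input_csv)
    raw.info()            # column names, dtypes, and non-null counts
    print(raw.tail())     # last five rows
    print(raw.columns)    # column index
    clean = preprocess(raw)
    print(clean.describe())
    clean.to_csv(output_csv, index=False)
    return clean
```

Filling with zero before the cast matters here: a column that still contains `NaN` cannot be converted to 'int64'.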
- Load the preprocessed data.
- Perform exploratory analysis and visualization using Python libraries (e.g., pandas, seaborn, matplotlib).
- Calculate the mean, min, max, and correlations within the dataset.
- Explore country data, including the number of unique countries.
- Explore the minimum and maximum values of fully vaccinated people.
- Explore the minimum and maximum dates in the dataset.
- Visualize the number of daily vaccinations over time.
- Visualize the distribution of total vaccinations by vaccine.
- Visualize the relationship between total vaccinations and people vaccinated.
- Visualize the comparison between countries and the number of fully vaccinated people.
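
As a rough sketch of the exploratory steps above, the summary figures can be gathered in one helper and one of the plots drawn with matplotlib. The column names 'country', 'date', and 'daily_vaccinations' and the helper names `summarize` and `plot_daily_vaccinations` are assumptions made for illustration; adjust them to the actual dataset.

```python
import pandas as pd


def summarize(df: pd.DataFrame) -> dict:
    """Collect the summary figures listed above into one dictionary."""
    dates = pd.to_datetime(df["date"])
    return {
        "unique_countries": df["country"].nunique(),
        "fully_vaccinated_min": df["people_fully_vaccinated"].min(),
        "fully_vaccinated_max": df["people_fully_vaccinated"].max(),
        "first_date": dates.min(),
        "last_date": dates.max(),
        # Pairwise Pearson correlations between the numeric columns only.
        "correlations": df.select_dtypes("number").corr(),
    }


def plot_daily_vaccinations(df: pd.DataFrame, country: str):
    """Line plot of daily vaccinations over time for one country."""
    # Imported here so summarize() stays usable without a plotting backend.
    import matplotlib.pyplot as plt

    subset = df.loc[df["country"] == country].copy()
    subset["date"] = pd.to_datetime(subset["date"])
    subset = subset.sort_values("date")

    fig, ax = plt.subplots(figsize=(10, 4))
    ax.plot(subset["date"], subset["daily_vaccinations"])
    ax.set_xlabel("Date")
    ax.set_ylabel("Daily vaccinations")
    ax.set_title(f"Daily vaccinations over time: {country}")
    fig.autofmt_xdate()  # tilt date labels so they do not overlap
    return fig
```

The mean, minimum, and maximum of every numeric column are also available in one call through `df.describe()`.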
- Import the necessary Python libraries (pandas, numpy, scipy.stats).
- Select the 'total_vaccinations' data and define an 'expected_mean'.
- Perform a one-sample t-test to compare the selected data to the expected mean.
- Print the test statistic (t) and p-value.
- Check for statistical significance based on a chosen alpha level.
- Calculate descriptive statistics for the 'total_vaccinations' data (mean, median, standard deviation, variance).
- Conduct correlation analysis between numeric columns.
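
The statistical steps above might look like the following sketch, which uses `scipy.stats.ttest_1samp` for the one-sample t-test. The helper name `one_sample_ttest`, the default `alpha=0.05`, and the choice of sample (rather than population) standard deviation and variance are assumptions; the value of the expected mean is left to the caller.

```python
import numpy as np
from scipy import stats


def one_sample_ttest(values, expected_mean, alpha=0.05):
    """Run a one-sample t-test and report descriptive statistics for the sample."""
    sample = np.asarray(values, dtype=float)
    # Two-sided test of H0: the population mean equals expected_mean.
    t_stat, p_value = stats.ttest_1samp(sample, popmean=expected_mean)
    return {
        "t": float(t_stat),
        "p": float(p_value),
        # Reject H0 when the p-value falls below the chosen alpha level.
        "significant": bool(p_value < alpha),
        "mean": float(sample.mean()),
        "median": float(np.median(sample)),
        "std": float(sample.std(ddof=1)),       # sample standard deviation
        "variance": float(sample.var(ddof=1)),  # sample variance
    }
```

With a DataFrame `df` holding the preprocessed data, a call such as `one_sample_ttest(df["total_vaccinations"], expected_mean)` returns the statistic and p-value to print, and `df.select_dtypes("number").corr()` gives the correlations between numeric columns.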
- Split the data into training and testing datasets.
- Scale the data for better model performance.
- Choose a model (e.g., simple linear regression) based on problem formulation and dataset characteristics.
- Fit the data to the selected model.
- Evaluate model performance, focusing on the R-squared value.
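
A minimal sketch of the modeling steps above, assuming scikit-learn is used (this README does not name the library). The helper name `fit_linear_model`, the 80/20 split, and the fixed `random_state` are illustrative choices. Which column is the feature and which is the target is also left to the caller, for example predicting 'people_vaccinated' from 'total_vaccinations'.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler


def fit_linear_model(X, y, test_size=0.2, random_state=42):
    """Split, scale, fit a linear regression, and return (model, scaler, r2)."""
    y = np.asarray(y, dtype=float)
    # Accept a single feature column by reshaping it to (n_samples, n_features).
    X = np.asarray(X, dtype=float).reshape(len(y), -1)

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=random_state
    )

    # Fit the scaler on the training split only to avoid leaking test data.
    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train)
    X_test_scaled = scaler.transform(X_test)

    model = LinearRegression().fit(X_train_scaled, y_train)
    # R-squared on the held-out test split.
    r2 = r2_score(y_test, model.predict(X_test_scaled))
    return model, scaler, r2
```

For ordinary least squares, scaling changes the coefficients but not the R-squared value; it is kept here because the steps above list it and because it matters for regularized or gradient-based models.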
- Load the preprocessed dataset into IBM Cognos Analytics.
- Create and customize visualizations using the software:
  - Visualize the number of daily vaccinations over time.
  - Visualize the distribution of total vaccinations by vaccine.
  - Visualize the relationship between total vaccinations and people vaccinated.
  - Visualize the comparison between countries and the number of fully vaccinated people.