This repository contains my work for the Pandas Descriptive Statistics Script assignment in IDS 706. The script reads a dataset, generates summary statistics, and creates data visualizations. To use it, open the repository in a GitHub Codespace and wait for the devcontainer to run the Makefile, which executes the following tasks: install, format, lint, and test.
This repository includes the following components:
- `.devcontainer`: Devcontainer configuration for setting up the development environment in Codespaces.
- `Makefile`: Contains commands for installing dependencies, formatting code, running linters, and testing.
- `requirements.txt`: Lists all the Python packages required for running the script.
- `README.md`: This documentation file.
- `.githubactions`: Configuration for GitHub Actions to automate CI tasks.
- `Dockerfile`: Defines the Docker environment to ensure a consistent development setup.
The purpose of this project is to create a Python script that performs descriptive statistics on a given dataset using Pandas. The script:
- Reads a dataset (CSV file).
- Filters and processes data to focus on key variables such as age, sex, race, and income.
- Generates important summary statistics such as mean, median, and standard deviation.
- Recodes categorical variables and creates dummy variables.
- Creates bar charts to visualize the mean income by gender and by race.
The project uses `matplotlib` and `seaborn` for data visualization and produces a clear summary of the dataset through both textual and graphical outputs.
- Open GitHub Codespaces.
- Load the repository into Codespaces.
- Wait for the installation of all dependencies specified in `requirements.txt`.
- Run the Makefile, which executes the install, format, lint, and test tasks.
The dataset used in this project is based on survey data from the Integrated Public Use Microdata Series (IPUMS), affiliated with the University of Minnesota (2019). The dataset contains demographic and wage information for individuals aged between 18 and 65. It captures various attributes including sex, age, race, and income from wages.
- Sex: 1 represents male, 2 represents female.
- Age: Represents the actual age of the individual.
- Race: Categorized as:
  - 1 = White
  - 2 = Black/African American
  - 3 = Other (including Asian, American Indian, and mixed races)
- Income (INCWAGE): Annual wage income in USD. Individuals with missing income and retired individuals are excluded.
The script filters out individuals with no reported income and focuses on individuals of working age (18-65).
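The filtering and recoding described above could look roughly like the sketch below. It is not the script's actual code: the column names `SEX` and `RACE`, the helper name `prepare`, and the rule that "no reported income" means a missing or non-positive `INCWAGE` are all assumptions made for illustration, and the sample rows are made up.

```python
import pandas as pd


# Column names SEX and RACE and the helper name prepare() are assumptions;
# only INCWAGE, AGE, and the dummy column names appear in this README.
def prepare(df: pd.DataFrame) -> pd.DataFrame:
    """Keep working-age rows with positive wage income and add dummy columns."""
    # NaN > 0 evaluates to False, so rows with missing income are dropped too.
    keep = df["AGE"].between(18, 65) & (df["INCWAGE"] > 0)
    out = df.loc[keep].copy()
    out["Male"] = (out["SEX"] == 1).astype(int)  # 1 = male, 2 = female
    out["White"] = (out["RACE"] == 1).astype(int)
    out["Black"] = (out["RACE"] == 2).astype(int)
    out["Other_Races"] = (out["RACE"] == 3).astype(int)
    return out


# Made-up rows for illustration only; not taken from the IPUMS extract.
raw = pd.DataFrame(
    {
        "SEX": [1, 2, 2, 1, 2],
        "AGE": [17, 30, 45, 66, 52],
        "RACE": [1, 2, 3, 1, 1],
        "INCWAGE": [12000, 40000, 0, 55000, 61000],
    }
)
clean = prepare(raw)
print(clean[["Male", "White", "Black", "Other_Races", "INCWAGE", "AGE"]])
```

Building the dummies by direct comparison keeps the column names explicit; `pd.get_dummies` would be an alternative if the codes were first mapped to labels.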
The script calculates and displays summary statistics for numerical variables like age, income, and race proportions. The statistics include mean, standard deviation, minimum, and maximum values for different demographic groups.
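As a minimal sketch of how such statistics can be produced with Pandas, the example below aggregates a small made-up DataFrame. The rows are illustrative only, and the choice of `DataFrame.agg` and `groupby` is an assumption about one reasonable approach, not a description of the script itself.

```python
import pandas as pd

# Made-up, already-prepared rows; column names follow the summary table in this README.
df = pd.DataFrame(
    {
        "Male": [1, 0, 1, 0],
        "White": [1, 0, 1, 1],
        "INCWAGE": [30000, 50000, 70000, 90000],
        "AGE": [25, 35, 45, 55],
    }
)

# One row per statistic, one column per variable.
summary = df.agg(["mean", "std", "min", "max"]).round(2)
print(summary)

# The same statistics split by a demographic group (here: the Male dummy).
by_sex = df.groupby("Male")["INCWAGE"].agg(["mean", "std", "min", "max"]).round(2)
print(by_sex)
```

Because the dummy variables are coded 0/1, their mean is the share of the sample in that category, which is how the race proportions arise.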
The script generates two key visualizations:
- Mean Income by Gender: A bar chart that displays the average income for males and females.
- Mean Income by Gender and Race: A grouped bar chart showing the mean income for different races, split by gender.
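The two charts listed above could be drawn along the lines of the sketch below, which uses the Pandas plotting wrapper around `matplotlib`. The lowercase column names `sex` and `race`, the string labels, and the sample rows are hypothetical. The project also lists `seaborn`, whose `barplot` function could produce similar charts.

```python
import matplotlib

matplotlib.use("Agg")  # render off-screen so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Made-up rows; the string labels stand in for the recoded sex and race categories.
df = pd.DataFrame(
    {
        "sex": ["Male", "Female", "Male", "Female", "Male", "Female"],
        "race": ["White", "White", "Black", "Black", "Other", "Other"],
        "INCWAGE": [60000, 50000, 45000, 40000, 52000, 48000],
    }
)

# Mean income per gender, and per race with one column per gender.
mean_by_sex = df.groupby("sex")["INCWAGE"].mean()
mean_by_sex_race = df.groupby(["race", "sex"])["INCWAGE"].mean().unstack("sex")

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
mean_by_sex.plot.bar(ax=ax1, title="Mean Income by Gender")
mean_by_sex_race.plot.bar(ax=ax2, title="Mean Income by Gender and Race")
for ax in (ax1, ax2):
    ax.set_ylabel("Mean INCWAGE (USD)")
fig.tight_layout()
```

Unstacking the gender level turns each race into a group of adjacent bars, which gives the grouped layout described in the second bullet.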
Statistic | Male | White | Black | Other_Races | INCWAGE | AGE |
---|---|---|---|---|---|---|
mean | 0.52 | 0.77 | 0.09 | 0.14 | 55650.02 | 41.42 |
std | 0.50 | 0.42 | 0.29 | 0.09 | 67463.87 | 13.60 |
min | 0.00 | 0.00 | 0.00 | 0.00 | 5000.00 | 18.00 |
max | 1.00 | 1.00 | 1.00 | 1.00 | 250000.0 | 65.00 |