ids706_individual_project1

Breast Cancer Dataset Analysis with Pandas

Introduction

This project computes descriptive statistics for the breast cancer dataset using pandas. It demonstrates efficient data processing, shared code, testing with pytest and nbval, and continuous integration with GitLab CI/CD. If interested, you can watch a video with detailed instructions here

Project Structure

project/
├── README.md
├── Makefile
├── requirements.txt
├── lib.py
├── script.py
├── test_lib.py
├── test_script.py
├── notebook.ipynb
└── .gitlab-ci.yml

Installation Instructions

Prerequisites

Python 3.7 or higher
pip package manager

Steps

Clone the Repository

git clone https://gitlab.com/yourusername/yourproject.git
cd yourproject

Install Dependencies
```
make install
```
This command installs all the required packages listed in requirements.txt.

Usage

Running the Script

The script script.py performs the analysis and outputs the results to the console.

python script.py

Expected Output:

Descriptive Statistics for 'mean radius':
Mean: 14.127291739894563
Median: 13.37
Standard Deviation: 3.524049388059983

Exploring the Notebook

The Jupyter Notebook notebook.ipynb provides an interactive exploration of the dataset, including data loading, descriptive statistics, and optional visualizations.

To open the notebook:

jupyter notebook notebook.ipynb

Descriptive Statistics of the Breast Cancer Dataset

Overview of the Dataset

The breast cancer dataset is a classic and very easy binary classification dataset. It contains 569 samples of malignant and benign tumor cells, with 30 features computed from a digitized image of a fine needle aspirate (FNA) of a breast mass.

Features: Mean, standard error, and "worst" or largest (mean of the three largest values) of ten real-valued features computed for each cell nucleus.
Target: Binary classification (malignant = 0, benign = 1).

Statistical Summaries

1. General Statistics

Using pandas' describe() method together with the DataFrame's shape, we obtain the following general statistics for the dataset:

| Statistic | Value |
| --- | --- |
| Number of Rows | 569 |
| Number of Columns | 31 (30 features + target) |

2. Feature-wise Descriptive Statistics

Below are the descriptive statistics for selected features:

Mean Radius

Mean: 14.127
Median: 13.370
Standard Deviation: 3.524

Mean Texture

Mean: 19.289
Median: 18.840
Standard Deviation: 4.301

Mean Perimeter

Mean: 91.969
Median: 86.240
Standard Deviation: 24.299

3. Target Variable Distribution

Malignant (0): 212 samples
Benign (1): 357 samples
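
The counts above can be reproduced with a short pandas helper. The sketch below is illustrative only: `target_distribution` and `LABELS` are hypothetical names that do not come from lib.py, and the column name `target` is an assumption about how the DataFrame is laid out.

```python
import pandas as pd

# Label encoding as stated in the dataset overview: malignant = 0, benign = 1.
LABELS = {0: "malignant", 1: "benign"}


def target_distribution(df, target_column="target"):
    """Return a dict mapping each class label to its number of samples."""
    counts = df[target_column].map(LABELS).value_counts()
    return counts.to_dict()


# Toy data standing in for the real dataset.
toy = pd.DataFrame({"target": [0, 1, 1, 0, 1]})
print(target_distribution(toy))
```

Applied to the full dataset, a helper like this should report the 212 malignant and 357 benign samples listed above.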

4. Full Statistical Summary

Below is a statistical summary for all numerical features:

| Feature | Mean | Median | Std Dev | Min | Max |
| --- | --- | --- | --- | --- | --- |
| mean radius | 14.127 | 13.370 | 3.524 | 6.981 | 28.110 |
| mean texture | 19.289 | 18.840 | 4.301 | 9.710 | 39.280 |
| mean perimeter | 91.969 | 86.240 | 24.299 | 43.790 | 188.500 |
| mean area | 654.889 | 551.100 | 351.914 | 143.500 | 2501.000 |
| mean smoothness | 0.096 | 0.095 | 0.014 | 0.053 | 0.163 |
| ... | ... | ... | ... | ... | ... |

Total Samples: 569

(Note: For brevity, only a few features are listed. The full summary includes all features.)
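
A summary table of this shape can be built in one pass with `DataFrame.agg`. This is a minimal sketch, not the code used in the notebook: `summarize` is a hypothetical helper, and it assumes every numeric column should be included (which would also include a numeric target column unless it is dropped first).

```python
import pandas as pd


def summarize(df, decimals=3):
    """Return one row per numeric column with mean, median, std, min and max."""
    numeric = df.select_dtypes(include="number")
    # agg() yields one row per statistic; transpose so features become rows.
    summary = numeric.agg(["mean", "median", "std", "min", "max"]).T
    summary.columns = ["Mean", "Median", "Std Dev", "Min", "Max"]
    return summary.round(decimals)


# Toy data standing in for the real dataset.
toy = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [10.0, 20.0, 60.0]})
print(summarize(toy))
```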

Files Description

README.md: This file. Provides an overview of the project, instructions, and detailed descriptions.
Makefile: Contains commands for installation, testing, linting, and formatting.
requirements.txt: Lists all the required Python packages with pinned versions.
lib.py: Contains shared functions for data loading and calculations.
script.py: A script that performs the analysis using functions from lib.py.
test_lib.py: Contains tests for the functions in lib.py.
test_script.py: Contains tests for script.py.
notebook.ipynb: Jupyter Notebook demonstrating the analysis with detailed explanations.
.gitlab-ci.yml: Configuration file for GitLab CI/CD pipeline.

Testing

Run All Tests

To run all tests, including the notebook tests:

make test

This command executes:

pytest --nbval notebook.ipynb
pytest test_script.py
pytest test_lib.py

Test Coverage

The tests cover:

Data Loading: Ensures the dataset is loaded correctly.
Calculations: Validates mean, median, and standard deviation calculations.
Script Output: Checks that script.py produces the expected output.

Makefile Commands

The Makefile simplifies common tasks:

Install Dependencies:
```
make install
```
Format Code with Black:
```
make format
```
Lint Code with Ruff:
```
make lint
```
Run Tests:
```
make test
```

Continuous Integration

The project uses GitLab CI/CD for continuous integration. The pipeline runs the following stages:

Install: Installs dependencies using make install.
Lint: Checks code style and potential errors using make lint.
Format: Formats code using make format.
Test: Runs all tests using make test.

License

This project is licensed under the MIT License.

Detailed Steps and Explanations

1. Data Loading

Function: load_breast_cancer_data() in lib.py
Description: Loads the breast cancer dataset from scikit-learn and converts it into a pandas DataFrame.
Usage:
```
df = load_breast_cancer_data()
```

2. Data Processing

Pandas DataFrame Creation:
- Data and feature names are used to create the DataFrame.
- The target variable is added as a new column.
Advantages of pandas:
- High performance for data manipulation.
- Efficient handling of large datasets.
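
The DataFrame construction described above can be sketched as follows. This is an assumption about what lib.py might look like, not a copy of it: `frame_from_arrays` is a hypothetical helper introduced for illustration, and the column name `target` is assumed. The scikit-learn import sits inside the loader so the helper can be tried on toy data without scikit-learn installed.

```python
import pandas as pd


def frame_from_arrays(data, feature_names, target):
    """Build a DataFrame from a 2-D feature array and append the target column."""
    df = pd.DataFrame(data, columns=list(feature_names))
    df["target"] = list(target)
    return df


def load_breast_cancer_data():
    """Hypothetical version of the loader in lib.py (the real one may differ)."""
    # Imported lazily so frame_from_arrays works without scikit-learn.
    from sklearn.datasets import load_breast_cancer

    bunch = load_breast_cancer()
    return frame_from_arrays(bunch.data, bunch.feature_names, bunch.target)


# Toy usage of the helper with two samples and two features.
toy = frame_from_arrays([[1.0, 2.0], [3.0, 4.0]], ["a", "b"], [0, 1])
print(toy.shape)
```

With the real dataset, the resulting DataFrame should have the 569 rows and 31 columns reported in the statistical summaries.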

3. Descriptive Statistics Calculations

Functions in lib.py:
- calculate_mean(df, column_name)
- calculate_median(df, column_name)
- calculate_std(df, column_name)

Usage:

mean_value = calculate_mean(df, 'mean radius')
median_value = calculate_median(df, 'mean radius')
std_value = calculate_std(df, 'mean radius')
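
For readers without the repository at hand, the three functions could be as small as the sketch below. The bodies are assumptions; only the function names and signatures come from this README.

```python
import pandas as pd


def calculate_mean(df, column_name):
    """Arithmetic mean of one column."""
    return df[column_name].mean()


def calculate_median(df, column_name):
    """Median of one column."""
    return df[column_name].median()


def calculate_std(df, column_name):
    """Sample standard deviation (pandas divides by n - 1 by default)."""
    return df[column_name].std()


# Toy data standing in for the real dataset.
toy = pd.DataFrame({"mean radius": [1.0, 2.0, 3.0, 4.0]})
print(calculate_mean(toy, "mean radius"))    # 2.5
print(calculate_median(toy, "mean radius"))  # 2.5
```

Note that pandas' `std()` uses `ddof=1` by default, whereas NumPy's `std()` uses `ddof=0`, so the two can give slightly different values on the same column.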

4. Script Execution

Script: script.py
Purpose: Performs the analysis and prints results.
Execution Flow:
1. Loads data using load_breast_cancer_data().
2. Specifies the column to analyze ('mean radius').
3. Calculates mean, median, and standard deviation.
4. Prints the results.
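
The execution flow above might look roughly like this. The sketch is self-contained for illustration, so it takes the DataFrame as an argument instead of importing from lib.py; `build_report` and the signature of `main` are hypothetical, and only the output format follows the "Expected Output" shown earlier.

```python
import pandas as pd


def build_report(df, column_name):
    """Return the report text for one column, one statistic per line."""
    series = df[column_name]
    lines = [
        f"Descriptive Statistics for '{column_name}':",
        f"Mean: {series.mean()}",
        f"Median: {series.median()}",
        f"Standard Deviation: {series.std()}",
    ]
    return "\n".join(lines)


def main(df, column_name="mean radius"):
    """Print the report; in script.py the data would come from lib.py."""
    print(build_report(df, column_name))


# Toy data standing in for the real dataset.
main(pd.DataFrame({"mean radius": [1.0, 2.0, 3.0]}))
```

Keeping the formatting in a function that returns a string, separate from the `print` call, makes the output easier to check in tests.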

5. Jupyter Notebook Exploration

Notebook: notebook.ipynb
Contents:
- Data loading and initial exploration.
- Descriptive statistics using pandas.
- Visualizations (optional).
Features:
- Interactive cells to explore data.
- Use of shared functions from lib.py.
- Detailed explanations and markdown cells.

6. Testing

Tests for lib.py:
- Validates data loading and correctness of calculations.
- Ensures functions handle data correctly.
Tests for script.py:
- Captures console output.
- Asserts that the output matches expected results.
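
One way to capture console output in a test is shown below. It uses only the standard library (`contextlib.redirect_stdout`); the actual tests in test_script.py may instead rely on pytest's `capsys` fixture or another approach. `print_mean` is a hypothetical stand-in for the script's entry point.

```python
import contextlib
import io
import statistics


def print_mean(values):
    """Stand-in for the script's entry point; prints a single statistic."""
    print(f"Mean: {statistics.mean(values)}")


def test_print_mean_output():
    """Capture stdout while the function runs and compare it to the expected text."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        print_mean([1.0, 2.0, 3.0])
    assert buffer.getvalue() == "Mean: 2.0\n"
```

Because the function name starts with `test_`, pytest would collect it automatically when the file is included in a test run.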

7. Linting and Formatting

Linting with Ruff:
- Checks for code style issues and potential errors.
- Flags deviations from PEP 8 style conventions.
Formatting with Black:
- Formats code for consistency.
- Improves readability.

8. Continuous Integration

Configuration: .gitlab-ci.yml
Stages:
- Install: Sets up the environment.
- Lint: Runs make lint.
- Format: Runs make format.
- Test: Runs make test.
Benefits:
- Automated checks on every push.
- Early detection of issues.
- Maintains code quality.

Conclusion

This project demonstrates the use of pandas for efficient data analysis on the breast cancer dataset. Structuring the project around shared code, thorough testing, and continuous integration supports reliability and maintainability. The detailed steps and explanations provided aim to make it easy for others to understand and replicate the analysis.

Contact Information

For any questions or suggestions, please feel free to contact:

Name: Eleanor Jiang
GitLab: aoaow
