## Overview

This project is the Week 3 assignment for IDS706. It analyzes a dataset of soccer player statistics with the Polars library, generating descriptive statistics and visualizations, and uses GitHub Actions for continuous integration and continuous deployment (CI/CD).

The dataset contains information such as player names, nationalities, appearances, goals, assists, and clean sheets. The project demonstrates how to summarize key statistics, visualize distributions, and automate these tasks through a CI/CD pipeline.

Source of data: player statistics from a Kaggle dataset, used here to practice data manipulation with Polars: https://www.kaggle.com/datasets/indrajuliansyahputra/premier-league-player-stats-2324?resource=download&select=player_stats.csv
## Project Structure

- `Makefile`: Contains build automation commands. In this project, the commands cover installing dependencies, formatting code, linting, testing, and generating markdown reports.
- `requirements.txt`: Specifies the Python dependencies needed for this project. The key packages include Polars and Matplotlib.
- `main.py`: Main application script that reads the dataset, generates summary statistics, creates visualizations (e.g., logarithmic histograms), and saves the analysis results to markdown.
- `test_main.py`: Test script to ensure that the functions defined in `main.py` work correctly, including loading data, summarizing statistics, and generating visualizations.
- `README.md`: Provides an overview of the project, its structure, and the purpose of the repository.
- `.github/workflows/cicd.yml`: Defines the CI/CD pipeline for GitHub Actions. This workflow includes steps such as installing dependencies, linting code, running tests, and pushing generated markdown files.
- devcontainer: Sets up a development environment in GitHub Codespaces, using a Dockerfile to define the base environment (such as tools and libraries).

## Descriptive Statistics and Visualizations

The project uses Polars to compute key summary statistics (mean, median, standard deviation, and count) for numeric variables such as player appearances, goals, and assists.
### Logarithmic Histogram

One of the visualizations generated by the project is a logarithmic histogram. The log scale keeps the wide range of player statistics readable: most players score fewer than 20 goals, but a few score over 60.
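The following sketch shows one way such a histogram could be drawn with Matplotlib. It assumes that "logarithmic" refers to a log-scaled count axis; the goal values, the output file name `goals_log_hist.png`, and the helper `plot_log_histogram` are hypothetical and not taken from the repository.

```python
import matplotlib

matplotlib.use("Agg")  # render to a file without needing a display
import matplotlib.pyplot as plt

# Hypothetical goal counts: many low scorers and a few high outliers.
goals = [0] * 40 + [1] * 25 + [3] * 15 + [8] * 8 + [15] * 5 + [30] * 2 + [65]


def plot_log_histogram(values, path, bins=20):
    """Save a histogram of `values` with a logarithmic count axis."""
    fig, ax = plt.subplots(figsize=(8, 5))
    # log=True puts the count (y) axis on a log scale.
    counts, edges, _ = ax.hist(values, bins=bins, log=True, edgecolor="black")
    ax.set_xlabel("Goals")
    ax.set_ylabel("Number of players (log scale)")
    ax.set_title("Distribution of goals")
    fig.savefig(path)
    plt.close(fig)
    return counts, edges


counts, edges = plot_log_histogram(goals, "goals_log_hist.png")
```

With a linear count axis the single high-scoring bin would be nearly invisible next to the large low-scoring bins; the log scale keeps both readable in one plot.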
### Summary of Categorical Variables

The dataset includes categorical data such as player nationality and position, which are summarized using Polars' grouping functions.
### Example Visualizations

Goals Logarithmic Histogram: The histogram shows the distribution of player goals and makes the skew caused by a few high outliers visible.

### Profiler Comparison

Polars can also be profiled against other Python libraries (such as pandas) to compare its performance on large datasets.
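One simple way to run such a comparison is to time equivalent operations with the standard library. The helpers `time_call` and `compare` below are hypothetical and not part of the repository, and the two workloads are stand-ins; in practice they could be replaced with calls such as `pl.read_csv(...)` and `pd.read_csv(...)` on the same file.

```python
import time
from statistics import mean


def time_call(func, repeats=5):
    """Run `func` several times and return (best, average) wall-clock seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        durations.append(time.perf_counter() - start)
    return min(durations), mean(durations)


def compare(candidates, repeats=5):
    """Time each named callable and return {name: best_seconds}."""
    return {name: time_call(func, repeats)[0] for name, func in candidates.items()}


# Stand-in workloads; swap in the Polars and pandas operations to compare.
results = compare(
    {
        "sum_builtin": lambda: sum(range(100_000)),
        "list_build": lambda: [x for x in range(100_000)],
    }
)
for name, seconds in sorted(results.items(), key=lambda item: item[1]):
    print(f"{name}: {seconds:.6f} s")
```

Taking the best of several runs reduces noise from other processes; absolute timings will vary by machine, so only the relative ordering is meaningful.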
## CI/CD Integration

This project uses GitHub Actions to automate key tasks:

- Install: Installs required dependencies as defined in `requirements.txt`.
- Format: Ensures the code is formatted according to standards using `black`.
- Lint: Lints the Python files using `pylint` to catch code quality issues.
- Test: Runs the tests defined in `test_main.py` to ensure that all functions in `main.py` behave as expected.
- Generate and Push Reports: Automatically generates markdown files with the analysis and pushes them back to the repository.

## How to Use This Project

### Run Locally

Clone the repository:

    git clone https://github.com/jayliu1016/ids_de_polars.git
    cd ids_de_polars

Install the dependencies:

    make install

Run the analysis:

    python main.py

Run tests:

    make test

### CI/CD Pipeline

This repository is integrated with GitHub Actions to run the following workflows automatically:
- Build and Test: Automatically installs dependencies, lints, and tests the code on every push.
- Generate Reports: After generating analysis reports (like markdown files), it commits and pushes the changes to the repository.

## Conclusion

This project shows how Polars can be used to analyze a dataset efficiently, with continuous testing and deployment through GitHub Actions. It covers both numerical and categorical data analysis, making it a compact example of working with sports statistics data.