GitHub Repository Analysis

This project collects data about the repositories of selected GitHub organizations and analyzes it.

Installation

To run this project, you will need to install the following Python libraries:

PyGithub
python-dotenv
pandas
tqdm
matplotlib
seaborn

Configuration

Create a .env file in the root directory of the project and add your GitHub token to it as GITHUB_TOKEN (a line of the form GITHUB_TOKEN=<your token>).
Update the list of organizations and excluded repositories as needed.
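A minimal sketch of reading the token once it is available in the environment. The helper name `read_github_token` is hypothetical and not part of this project; python-dotenv's `load_dotenv()` is assumed to have been called beforehand so that the entries of the .env file are visible in `os.environ`.

```python
import os


def read_github_token(environ=os.environ):
    """Return the GitHub token stored under GITHUB_TOKEN, or raise if missing."""
    # Hypothetical helper: python-dotenv's load_dotenv() is assumed to have
    # been called already, so the .env entries are visible in os.environ.
    token = environ.get("GITHUB_TOKEN", "").strip()
    if not token:
        raise RuntimeError("GITHUB_TOKEN is not set; add it to the .env file")
    return token
```

Failing early with a clear message avoids sending unauthenticated requests, which GitHub rate-limits much more strictly.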

Features

Main Functions

Repository Data Collection
- fetch_organization_repositories(organization_name, language=None)
  - Collects data about repositories from a specific organization on GitHub. Filters out excluded repositories and applies specific validity criteria based on the organization.
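The exclusion and language filtering described above could look roughly like the sketch below. It is an illustration only: repositories are represented as plain dictionaries, the function name `filter_repositories` and the `example-org` names are hypothetical, and the organization-specific validity criteria are not described in this README, so they are left out.

```python
def filter_repositories(repos, excluded, language=None):
    """Keep repositories that are not excluded and match the optional language."""
    excluded_names = {name.lower() for name in excluded}
    kept = []
    for repo in repos:
        if repo["full_name"].lower() in excluded_names:
            continue  # explicitly excluded by configuration
        primary = (repo.get("language") or "").lower()
        if language is not None and primary != language.lower():
            continue  # primary language does not match the requested one
        kept.append(repo)
    return kept


# Hypothetical sample data, not taken from any real organization.
repos = [
    {"full_name": "example-org/samples", "language": "Python"},
    {"full_name": "example-org/docs", "language": None},
    {"full_name": "example-org/legacy", "language": "Java"},
]
selected = filter_repositories(repos, excluded=["example-org/legacy"], language="python")
print([r["full_name"] for r in selected])  # ['example-org/samples']
```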

CSV File Generation with Repository Metrics
- generate_metrics_csv()
  - Generates a CSV file containing detailed data on the analyzed repositories, including information such as name, URL, primary language, creation and update dates, size, star count, watchers count, forks count, open issues count, subscribers count, and total lines of code.
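As a rough illustration of the CSV step, the sketch below writes one row per repository using only the standard library. The project itself lists pandas as a dependency, so this is a stand-in rather than the actual implementation; `write_metrics_csv`, the shortened column list, and the sample row are assumptions made for the example.

```python
import csv
import io

# A subset of the columns documented for codesamples.csv.
COLUMNS = ["full_name", "language", "stargazers_count", "forks_count", "total_lines"]


def write_metrics_csv(rows, stream):
    """Write a header and one CSV row per repository, ignoring unknown keys."""
    writer = csv.DictWriter(stream, fieldnames=COLUMNS, extrasaction="ignore")
    writer.writeheader()
    writer.writerows(rows)


# Hypothetical sample row; "note" is not a known column and is dropped.
buffer = io.StringIO()
write_metrics_csv(
    [{
        "full_name": "example-org/samples",
        "language": "Python",
        "stargazers_count": 12,
        "forks_count": 3,
        "total_lines": 4200,
        "note": "ignored",
    }],
    buffer,
)
print(buffer.getvalue())
```

To write to disk instead of a buffer, pass a file opened with `open(path, "w", newline="")`.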

Programming Language Usage Calculation
- calculate_language_usage(dataframe)
  - Calculates the usage of programming languages in the analyzed repositories based on the number of lines in each language, generating a CSV with the percentage distribution of each language and the total lines of code per language.
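The percentage calculation can be sketched as follows. This is a standard-library illustration, not the project's pandas code: the input shape (one dictionary of lines per language for each repository) and the name `language_usage` are assumptions.

```python
from collections import Counter


def language_usage(per_repo_lines):
    """Aggregate lines per language; return (language, percentage, lines) rows."""
    totals = Counter()
    for lines_by_language in per_repo_lines:
        totals.update(lines_by_language)  # adds the counts language by language
    grand_total = sum(totals.values())
    if grand_total == 0:
        return []  # nothing to report, and avoids dividing by zero
    return [
        (language, round(100 * lines / grand_total, 2), lines)
        for language, lines in totals.most_common()
    ]


# Hypothetical line counts for two repositories.
usage = language_usage([
    {"Python": 600, "Shell": 100},
    {"Python": 200, "Java": 100},
])
for language, percentage, lines in usage:
    print(language, percentage, lines)
```

Each tuple corresponds to one row of languages_usage.csv: Language, Usage Percentage, and Lines.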

Graph Generation
- plot_metrics_distribution()
  - Generates distribution charts for each of the analyzed metrics, showing frequency and central tendency (mean and median).
- plot_boxplot_metrics()
  - Generates boxplots to visualize the dispersion and possible outliers in key repository metrics.
- plot_metrics_statistics()
  - Generates charts with descriptive statistics for key repository metrics (e.g., stars, watchers, forks, open issues, subscribers, network size).

Statistics Generation

- generate_statistics(dataframe)
  - Generates descriptive statistics for key repository metrics, such as mean, median, mode, and standard deviation. Results are saved in a CSV file.
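For a single metric, the four statistics named above can be computed as in the sketch below. It uses Python's `statistics` module instead of pandas, and the function name `describe` and the sample values are assumptions made for the example.

```python
import statistics


def describe(values):
    """Return mean, median, mode and sample standard deviation for one metric."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "mode": statistics.mode(values),
        "std": statistics.stdev(values),  # sample form (n - 1); needs >= 2 values
    }


# Hypothetical star counts for four repositories.
stars = [2, 4, 4, 10]
summary = describe(stars)
print(summary)
```

`statistics.stdev` uses the sample form with an n - 1 denominator, which matches the default of pandas' `std` (ddof=1). Median and mode are reported alongside the mean because metrics such as star counts tend to be heavily skewed.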

Table Structure

Repository Table (codesamples.csv)

full_name: Full name of the repository (e.g., organization/repo_name)
name: Repository name
owner: Repository owner
html_url: Repository URL on GitHub
description: Repository description
language: Primary language of the repository
created_at: Repository creation date
updated_at: Last update date of the repository
pushed_at: Date of the last push to the repository
size: Repository size in KB
stargazers_count: Number of stars on the repository
watchers_count: Number of watchers of the repository
forks_count: Number of forks of the repository
open_issues_count: Number of open issues in the repository
subscribers_count: Number of subscribers to the repository
network_count: Number of repositories in the network (total forks)
archived: Indicates if the repository is archived
total_lines: Total number of lines of code in the repository
langs_percentage: Percentage distribution of programming languages in the repository

Language Usage Table (languages_usage.csv)

Language: Programming language name
Usage Percentage: Share of the total lines of code written in the language, as a percentage
Lines: Total number of lines of code written in the language

Statistics Table (statistics.csv)

mean: Mean of each metric
median: Median of each metric
mode: Mode of each metric
std: Standard deviation of each metric

Results

The analysis outputs are saved in the results folder:

codesamples.csv: Repository data.
languages_usage.csv: Programming language usage.
statistics.csv: Descriptive statistics of repository metrics.
language_usage_charts.png: Charts of programming language usage.
distribution_metrics.png: Charts of metric distributions.
boxplot_metrics.png: Boxplots of metrics.
desc_stats.png: Charts of descriptive statistics.

