GitHub Archive Explorer

PySpark exploration/learning project - utilizing Spark to explore data made available by the GitHub Archive project

Getting Started

These instructions will get you a copy of the project up and running on your local machine for development and testing purposes. See the Deployment section for notes on how to deploy the project on a live system.

Prerequisites

Python

This code was written for Python 3. Ensure that the machine this is run on has Python 3 installed.

Spark

As of the current iteration, Spark runs locally with no additional configuration. This will change in future iterations.

To install Spark on macOS, install it via Homebrew:

$ brew install apache-spark

On Ubuntu, wget the appropriate version of Spark from the downloads section of the Spark site, and untar the file into a directory of your choice.

NOTE: Spark has a number of requirements, including Java. For more information, refer to the official Spark docs.

PySpark

Spark tasks are coded in PySpark and sent to a Spark master node, which kicks off the appropriate tasks.
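
As a rough illustration (this sketch is not taken from the repo's task files, and the bucket path is an assumption), a minimal PySpark task looks like the following. When submitted to a cluster, the session is created against whatever master the cluster provides:

from pyspark.sql import SparkSession

# When run via spark-submit or Dataproc, the master is supplied by the cluster,
# so nothing needs to be hard-coded here.
spark = SparkSession.builder.appName("github-archive-explorer").getOrCreate()

# Read newline-delimited JSON event logs; Spark reads .json.gz files transparently
events = spark.read.json("gs://githubarchivedata/2015-01-01-*.json.gz")  # hypothetical path
events.printSchema()

spark.stop()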

Deployment

This particular project was launched on Google Cloud, leveraging the following:

  • Compute Engine
    • To run extract and trigger PySpark tasks
  • Cloud Storage Standard
    • To store `.json.gz` log files, as well as results from PySpark tasks
  • Cloud Dataproc
    • One master, and a number of worker nodes, with Spark pre-installed

Compute Engine

  1. Install the gcloud command-line tool on the instance
  2. Run gcloud init to set up authentication and connections to the project's other resources
  3. git clone this repo
  4. Install Python 3
  5. Install repo dependencies:
    githubArchiveExplorer$ pip install -r requirements.txt
    
  6. Update the various project, cluster, and bucket variables in config.py if necessary, based on your project setup (a sketch of these variables follows below)
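
The exact contents of config.py are project-specific; as a sketch, the variables referenced above might look like the following (all values, and some names, are assumptions):

# config.py -- illustrative sketch only; the actual variable names may differ
PROJECT_ID = "my-gcp-project"        # GCP project id (assumption)
REGION = "us-central1"               # Dataproc region (assumption)
CLUSTER_NAME = "spark-cluster"       # matches the --cluster flag in the submit commands below
BUCKET_NAME = "githubarchivedata"    # matches the bucket seen in the gsutil examples below
RESULTS_PREFIX = "Results"           # sub-directory for PySpark task output (assumption)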

Usage

extract

This set of functions is used to wget the gzipped .json files containing the events data for each hour of each day, within a specified date range, from the GitHub Archive.

The functions can be triggered by running the get_files() function in the module. If no date range is passed, the function grabs data for the entire 2011 to 2016 range.

To get all data for Jan 2011:

>>> import extract
>>> from datetime import date
>>> extract.get_files(date(2011, 1, 1), date(2011, 2, 1))

To get all data from Jan 2011 to Dec 2016 (the default range):

>>> import extract
>>> from datetime import date
>>> extract.get_files()

The grabber currently wgets the file for each hour to a local directory, then, using the google-cloud module, copies it to a remote bucket. After each day's worth of files has been copied over, the local directory is wiped to keep storage usage on the Compute Engine low.
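
A minimal sketch of that loop (the signature mirrors get_files() above, but the internals, helper names, and URL template here are illustrative rather than the actual extract code):

import os
import urllib.request
from datetime import date, timedelta

from google.cloud import storage

# Hourly archive files follow this naming pattern (hour is not zero-padded)
ARCHIVE_URL = "https://data.gharchive.org/{day}-{hour}.json.gz"

def get_files(start=date(2011, 1, 1), end=date(2016, 12, 31),
              bucket_name="githubarchivedata", local_dir="/tmp/gharchive"):
    bucket = storage.Client().bucket(bucket_name)
    os.makedirs(local_dir, exist_ok=True)
    day = start
    while day <= end:
        for hour in range(24):
            fname = "{}-{}.json.gz".format(day.isoformat(), hour)
            local_path = os.path.join(local_dir, fname)
            url = ARCHIVE_URL.format(day=day.isoformat(), hour=hour)
            urllib.request.urlretrieve(url, local_path)           # wget equivalent
            bucket.blob(fname).upload_from_filename(local_path)   # copy to the remote bucket
        # wipe the day's files to keep Compute Engine storage usage low
        for f in os.listdir(local_dir):
            os.remove(os.path.join(local_dir, f))
        day += timedelta(days=1)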

Running Spark tasks

Spark tasks are currently submitted from the Compute Engine to the Spark master node via the command line, and may be moved into a script in the future.

Leveraging the gcloud tool, a request is sent to the Spark master node by submitting the appropriate task .py file, along with the additional .py files containing the modules it imports. These task .py files are broken out by research question, though some have been consolidated to reduce overhead when calling the appropriate tasks.

Questions 1 and 2

On the Compute Engine, run the following:

githubArchiveExplorer$ gcloud dataproc jobs submit pyspark --cluster spark-cluster --py-files config.py,pyspark_jobs/monthly_stats.py pyspark_jobs/q1_and_q2.py

Question 3

On the Compute Engine, run the following:

githubArchiveExplorer$ gcloud dataproc jobs submit pyspark --cluster spark-cluster --py-files config.py,pyspark_jobs/agg_stats.py pyspark_jobs/q3.py

Accessing results

Spark task status can be monitored via the cluster details tab in the Cloud Dataproc section of the Google Cloud Console.

Results can be found within the Cloud Storage bucket, under the Results sub-directory.

Coalescing parts

Natively, Spark exports its results in parts, so as to take advantage of the different worker nodes and the partitioning of the stored data.

To combine data spread across these parts into a single file, we can leverage gsutil to cat and combine the files:

githubarchiveexplorer$ gsutil cat gs://githubarchivedata/Results/q1.csv/* > ./q1_results.csv
githubarchiveexplorer$ gsutil cat gs://githubarchivedata/Results/q2a.csv/* > ./q2a_results.csv
githubarchiveexplorer$ gsutil cat gs://githubarchivedata/Results/q2b.csv/* > ./q2b_results.csv
githubarchiveexplorer$ gsutil cat gs://githubarchivedata/Results/q3a.csv/* > ./q3a_results.csv
githubarchiveexplorer$ gsutil cp q1_results.csv gs://githubarchivedata/Results/q1.csv/
Copying file://q1_results.csv [Content-Type=text/csv]...
/ [1 files][  5.5 KiB/  5.5 KiB]
Operation completed over 1 objects/5.5 KiB.
githubarchiveexplorer$ gsutil cp q2a_results.csv gs://githubarchivedata/Results/q2a.csv/
Copying file://q2a_results.csv [Content-Type=text/csv]...
/ [1 files][  2.2 KiB/  2.2 KiB]
Operation completed over 1 objects/2.2 KiB.
githubarchiveexplorer$ gsutil cp q2b_results.csv gs://githubarchivedata/Results/q2b.csv/
Copying file://q2b_results.csv [Content-Type=text/csv]...
/ [1 files][  371.0 B/  371.0 B]
Operation completed over 1 objects/371.0 B.
githubarchiveexplorer$ gsutil cp q3a_results.csv gs://githubarchivedata/Results/q3a.csv/
Copying file://q3a_results.csv [Content-Type=text/csv]...
/ [1 files][  371.0 B/  371.0 B]
Operation completed over 1 objects/371.0 B.
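
Alternatively, a task could coalesce its output down to a single part before writing, trading the gsutil step for a single-worker final write. A sketch (results_df stands in for whatever DataFrame a task produces; this is not necessarily how this project's tasks are written):

# coalesce(1) funnels all partitions through one worker for the final write,
# which removes the need for gsutil cat but gives up write parallelism.
results_df.coalesce(1).write.mode("overwrite").csv(
    "gs://githubarchivedata/Results/q1.csv", header=True
)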

Schema

GitHub events schema

The schema for GitHub events can be found here.

Given the schema change that occurred starting with the 2015 data, ingestion rules are different for pre- and post-2015 events data.

While working on this project, however, it appears that githubarchive's pre-2015 data no longer abides by the old timeline schema, in which repository (and as a result, repository.language) was accessible at the base event level. In fact, apart from some keys now being missing, the schema for pre-2015 events data now resembles that of post-2015 data. As a result, we have chosen to exclude pre-2015 events data until the appropriate keys are populated or can be found.

Given the loss of repository language information in events other than PullRequestEvents, all post-2015 data is limited to that event type for this project.

To standardize for this, and to answer the research questions outlined for this project, the following two schemas have been put together to appropriately capture statistics for repos across time:

Monthly stats schema

This schema rolls up stats at the month level, and also captures the rank of the repo's engagement over the course of the month.

column           type    description
year_month       string  year and month of the stats (e.g. '2014-04')
year_month_rank  int     repo's rank by n_events, within the year and month
repo_id          string  repo's id. For pre-2015 events this is a string; for post-2015, an int
repo_name        string  repo's name
n_events         int     number of qualifying events for the repo, within the year and month
n_actors         int     number of unique actors across the qualifying events for the repo, within the year and month
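
As a sketch of how such a roll-up can be computed in PySpark (the raw event column names below, e.g. repo.id and actor.login, follow the post-2015 event schema; events stands in for a DataFrame of raw events):

from pyspark.sql import functions as F, Window

monthly = (
    events
    .filter(F.col("type") == "PullRequestEvent")                 # post-2015 data is limited to this type
    .withColumn("year_month", F.substring("created_at", 1, 7))   # e.g. '2014-04'
    .groupBy("year_month",
             F.col("repo.id").alias("repo_id"),
             F.col("repo.name").alias("repo_name"))
    .agg(F.count("*").alias("n_events"),
         F.countDistinct("actor.login").alias("n_actors"))
)

# Rank repos by activity within each month
rank_window = Window.partitionBy("year_month").orderBy(F.desc("n_events"))
monthly = monthly.withColumn("year_month_rank", F.rank().over(rank_window))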

Aggregate stats schema

This schema rolls up stats across all time.

column       type    description
repo_id      string  repo's id. For pre-2015 events this is a string; for post-2015, an int
repo_name    string  repo's name
n_events     int     number of qualifying events for the repo
n_additions  int     number of lines of code added over time
n_deletions  int     number of lines of code removed over time
edit_size    int     number of lines of code modified over time (added plus removed)
n_actors     int     number of unique actors across the qualifying events for the repo
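
A corresponding sketch for the all-time roll-up (the payload.pull_request.additions / .deletions paths assume the PullRequestEvent payload shape; actual field names may vary):

from pyspark.sql import functions as F

agg = (
    events
    .filter(F.col("type") == "PullRequestEvent")
    .groupBy(F.col("repo.id").alias("repo_id"),
             F.col("repo.name").alias("repo_name"))
    .agg(F.count("*").alias("n_events"),
         F.sum("payload.pull_request.additions").alias("n_additions"),
         F.sum("payload.pull_request.deletions").alias("n_deletions"),
         F.countDistinct("actor.login").alias("n_actors"))
    # edit_size = total lines touched (added plus removed)
    .withColumn("edit_size", F.col("n_additions") + F.col("n_deletions"))
)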

Research Questions

This project is designed to answer the following research questions:

  1. Over the years, on a monthly basis, what are the top 10 JavaScript repos that people interact with (commit to, star, fork, etc.)?
  2. For repos that ever made the monthly top-10 list, can we track how active each repo is relative to its peak number of interactions? This way, we can learn how alive these repos are.
  3. If we look into how people communicate on GitHub through commits, reviews, comments, etc. in these repos, can we find the distribution of member cluster sizes and how cluster size relates to the number of lines each member merges?

Authors

Acknowledgements
