yelp-dataset-challenge

Brought to you by Dr. Mahsa Mirzargar's independent study trio (in no particular order): Nathan Michaels, Devin Grossman, and David Michaels.

Downloading the repository

To download our code, do a git clone on this bad boy. To download our data, first get access to Yelp's dataset: https://www.yelp.com/dataset/. Then, please reach out to the maintainers of this repository to receive access to our engineered features.
At the moment, the plan is for each collaborator to keep a local sqlite3 database on their own machine. When we want to add to that database, we will share a JSON file among the collaborators, who can then run a simple script to load that data into their database.

Notes on The Dataset

Data brought to you by Yelp

We have access to data from 12 "metropolitan areas", 4.7 million reviews of 156,000 businesses. We also have data on 1.1 million users and 1 million "tips" from these users.
Business Features: For each business, we have its location, average rating, information on its food category, its business hours, and data on when people went to the restaurant.
User Features: We have data on a User's friends, a User's reviews, and a User's review count.

Exploratory Data Analysis

In order to gain a good understanding of our dataset, we are exploring simple correlations between variables. Here are some correlations we're looking to explore:

David

Business rating vs. (# Reviews or # tips)
photos vs. Average Rating
location (city/state) vs # reviews
Business number of reviews vs. Business check-ins
Number of businesses by city
Rating distribution by city
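
One way to check a pair such as business rating vs. number of reviews is a Pearson correlation computed straight from the sqlite3 database. The sketch below is illustrative only: the table name `business` and the columns `stars` and `review_count` are assumptions about the schema, not something confirmed by this README.

```python
import math
import sqlite3


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant column")
    return cov / math.sqrt(var_x * var_y)


def rating_vs_review_count(conn):
    """Correlate business rating with review count.

    The table name `business` and the columns `stars` and `review_count`
    are assumed; adjust them to the real schema.
    """
    rows = conn.execute(
        "SELECT stars, review_count FROM business "
        "WHERE stars IS NOT NULL AND review_count IS NOT NULL"
    ).fetchall()
    stars = [row[0] for row in rows]
    counts = [row[1] for row in rows]
    return pearson(stars, counts)
```

Calling `rating_vs_review_count` with a connection to the local sqlite3 database returns a value between -1 and 1. The same `pearson` helper can be reused for the other pairs in this list once the two columns have been pulled out.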

Devin

explore “business competition” (normalize by average of zip/city/state)
1. Clustering of similarly rated businesses
  1. By radius
  2. By avg rating with circle vs. business merge
2. patterns of clustered businesses (good competition? bad competition?)
  1. displayed in reviews
  2. consolidation between competing restaurants ratings
  3. geographically close or not close?
reviews vs avg length vs rating of that review vs user rating
Business Rating vs. Business Review Count
Business Rating vs. Total Check-ins (need to engineer this feature)
User’s number of friends vs. User review count
Business rating vs. Total review count of all users that reviewed that Business
Business rating vs. user’s average rating weighted by number of friends
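
For the "by radius" clustering idea in the list above, a possible starting point is to find every business within a fixed distance of a given business and compare its rating to the mean rating of those neighbors. This is a radius-based variant of the zip/city/state normalization, not the method the notebooks actually use. The dictionary layout (business id mapped to latitude, longitude, stars) is hypothetical and chosen only for this sketch.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    # Clamp to 1.0 to guard against floating-point overshoot.
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, a)))


def neighbors_within(businesses, center_id, radius_km):
    """Return ids of businesses within radius_km of the business center_id.

    `businesses` maps a business id to a (latitude, longitude, stars) tuple;
    this layout is an assumption made for the sketch.
    """
    lat0, lon0, _ = businesses[center_id]
    return [
        bid
        for bid, (lat, lon, _stars) in businesses.items()
        if bid != center_id and haversine_km(lat0, lon0, lat, lon) <= radius_km
    ]


def normalized_rating(businesses, center_id, radius_km):
    """Rating of a business minus the mean rating of its neighbors.

    Returns None when the business has no neighbors inside the radius.
    """
    ids = neighbors_within(businesses, center_id, radius_km)
    if not ids:
        return None
    mean_neighbors = sum(businesses[b][2] for b in ids) / len(ids)
    return businesses[center_id][2] - mean_neighbors
```

Checking every pair this way is quadratic in the number of businesses, so for the full dataset it would make sense to restrict the comparison to one city or zip code at a time.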

Nathan

Business rating vs. whether the Users who reviewed it were elite
Number of reviews that are on a business we have
Number of reviews that are on Users we have
Number of friends that we don’t have data on
user number of ratings vs. User number of friends
General patterns of users
1. How many users are there?
2. How many have reviewed more than 5 times?
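
The two user-pattern questions above can be answered with a pair of SQL aggregates against the local sqlite3 database. This is a minimal sketch that assumes a `review` table with a `user_id` column; both names are guesses at the schema.

```python
import sqlite3


def user_review_summary(conn, min_reviews=5):
    """Return (total_users, users_with_more_than_min_reviews).

    Assumes a `review` table with a `user_id` column; both names are guesses.
    Only users who appear in the review table are counted.
    """
    total_users = conn.execute(
        "SELECT COUNT(DISTINCT user_id) FROM review"
    ).fetchone()[0]
    frequent = conn.execute(
        "SELECT COUNT(*) FROM ("
        "  SELECT user_id FROM review GROUP BY user_id HAVING COUNT(*) > ?"
        ")",
        (min_reviews,),
    ).fetchone()[0]
    return total_users, frequent
```

Because this counts users from the review table, the total can differ from the number of rows in the user data, which ties into the questions above about reviews on Users we do or do not have.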

Single Variables to Explore

David

Geographic data: Businesses by city
1. Number of businesses
2. Which cities
3. Distribution of ratings

Nathan

Time data of tips and reviews
Check in data
Usefulness of compliments, fans, cool, funny, etc for users

Devin

Yelping since (for Users)
Look into some individual users

Big Questions to Explore

Define a User Rating
1. What makes a user valuable? In what context?
  1. How do you find "valuable users"
2. Variance of User's Reviews
3. User Influentiality
  1. Graphs/Interconnectedness
  2. How to __ "Influential Users"
    1. ID
    2. Use
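
As a first pass at the graph/interconnectedness question, one simple proxy for influence is the number of connections a user has in the friend graph. The sketch below is only one possible definition, not the definition this project has settled on. The input format (a user id mapped to a list of friend ids) is an assumption about how the friends data would be extracted.

```python
def build_friend_graph(friend_lists):
    """Build an undirected adjacency map from {user_id: [friend_id, ...]}.

    Friends that have no entry of their own in `friend_lists` are still
    added as nodes, since they are referenced by someone we do have.
    """
    graph = {}
    for user, friends in friend_lists.items():
        graph.setdefault(user, set())  # keep users with no friends
        for friend in friends:
            if friend != user:  # ignore self-links
                graph[user].add(friend)
                graph.setdefault(friend, set()).add(user)
    return graph


def top_by_degree(graph, k=10):
    """Return the k (user_id, degree) pairs with the most connections.

    Ties are broken by user id so the result is deterministic.
    """
    ranked = sorted(graph.items(), key=lambda item: (-len(item[1]), item[0]))
    return [(user, len(neighbors)) for user, neighbors in ranked[:k]]
```

Degree ignores how active or well-connected a user's friends are, so it is best treated as a baseline to compare against richer measures.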

How to share and store data

It's difficult. Our data is in a mid-range size: we use it for read-heavy operations and only write new data once in a while (when we add features). We can't and shouldn't share this data via GitHub, since the files are far bigger than what GitHub is meant to hold. We tried creating a module, juicy, that would partition any JSON file into many <100MB files that we could share via GitHub. That turned out poorly: it took forever to upload and download, and Git isn't meant for data. Also, loading JSON into Python objects is pretty honking memory intensive. After a handful of optimizations, the loaded data still takes up at least double the file's size in memory (for a 4GB file, that's at least 8GB, and we can't assume RAM sizes of over 8GB).
One path we haven't explored for the JSON route is the ijson module.
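
Short of ijson, a lower-memory option that needs only the standard library is to stream records one at a time instead of loading the whole file. The sketch below assumes the file stores one JSON object per line. If a file is instead a single large JSON array, this will not work and an incremental parser such as ijson would be needed.

```python
import json


def iter_json_lines(path):
    """Yield one parsed record at a time from a one-object-per-line JSON file."""
    with open(path, "r", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines
                yield json.loads(line)


def count_records(path):
    """Count records without holding more than one in memory at a time."""
    return sum(1 for _ in iter_json_lines(path))
```

Because `iter_json_lines` is a generator, memory use depends on the size of the largest single record rather than on the size of the file.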
The plan as of November 9th, 2017 is to give everyone a copy of a sqlite3 version of the Yelp SQL database (the conversion process relied on the mysql2sqlite.sh script that you can find on GitHub). When someone wants to add a feature, they can share that feature via a JSON file (which should usually be relatively small), and each person can run a script to add that JSON file to their sqlite3 database.
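
A loading script along those lines might look like the following. This is a hedged sketch, not the project's actual script. It assumes the shared JSON file is a flat object mapping a row key (for example a business id) to the new feature's value, and the function name and its parameters are hypothetical.

```python
import json
import sqlite3


def load_feature(conn, json_path, table, key_column, feature_column):
    """Add one feature column to `table` from a JSON file of {key: value} pairs.

    The file layout and all argument names are assumptions for this sketch.
    """
    with open(json_path, "r", encoding="utf-8") as handle:
        feature = json.load(handle)

    # Identifiers cannot be bound as SQL parameters, so validate them first.
    for name in (table, key_column, feature_column):
        if not name.isidentifier():
            raise ValueError(f"unsafe identifier: {name!r}")

    # Create the column only if it is not already there, so reruns are safe.
    existing = {row[1] for row in conn.execute(f"PRAGMA table_info({table})")}
    if feature_column not in existing:
        conn.execute(f"ALTER TABLE {table} ADD COLUMN {feature_column}")

    with conn:  # commit on success, roll back on error
        conn.executemany(
            f"UPDATE {table} SET {feature_column} = ? WHERE {key_column} = ?",
            [(value, key) for key, value in feature.items()],
        )
```

Rows whose key is missing from the JSON file keep a NULL in the new column, which makes it easy to spot businesses or users the feature does not cover.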
