SDS 410 Capstone in Statistical & Data Sciences, Spring 2021, Smith College
Semester-long project in partnership with StreetEasy on predicting New York real estate prices using natural language processing and machine learning algorithms
- Team: Lauren Low, Dayana Meza, Emma Scott, Xian (Elaine) Ye, Yanwan Zhu
- Project Partner: Yipeng Lai @ StreetEasy
- Faculty Mentor: Prof. Ben Baumer
- data - Create this folder within working directory to store .csv files of data
- paper_MDPI - Folder containing final paper paper_MDPI.Rmd and related files like figures.Rmd and mybibfile.bib
- pre-processing.R - Data cleaning script to prepare sale_listings for modeling: filter out unreasonable values, join with zipcodeR data, remove duplicate listings, impute NA values, split data into training and test sets
- random_forest.Rmd - Random forest machine learning model using existing variables and text-based variables created from listing_description; also includes visualizations of model error
- text_processing.R - Natural language processing script that creates binary keyword variables and performs AFINN sentiment analysis
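As a rough illustration of what "binary keyword variables" can look like, here is a minimal base R sketch. It is not the project's actual code: the toy rows, the keyword list, and the kw_ column prefix are all assumptions made for the example; only the listing_description column name comes from the descriptions above.

```r
# Toy listings; the rows are made up for illustration.
listings <- data.frame(
  id = 1:3,
  listing_description = c(
    "Sunny loft with a doorman and roof deck",
    "Renovated studio near the park",
    "Doorman building, washer/dryer in unit"
  ),
  stringsAsFactors = FALSE
)

# Hypothetical keyword list (not taken from the project).
keywords <- c("doorman", "renovated", "roof deck")

# Lowercase once so the fixed-string match below is case-insensitive.
descriptions <- tolower(listings$listing_description)

# Add one 0/1 column per keyword: 1 if the description contains it.
for (kw in keywords) {
  col_name <- paste0("kw_", gsub(" ", "_", kw, fixed = TRUE))
  listings[[col_name]] <- as.integer(grepl(kw, descriptions, fixed = TRUE))
}

print(listings[, c("id", "kw_doorman", "kw_renovated", "kw_roof_deck")])
```

Columns built this way can sit alongside the existing numeric variables as model inputs. AFINN sentiment scoring is left out of the sketch because it depends on an external lexicon.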
- In working directory, create data folder containing amenities.csv, documentation - amenities.csv, documentation - sale_listings.csv, sale_listings.csv
- Run pre-processing.R to load script functions into global environment
- Run text_preprocessing.R to load script functions into global environment
- Run random_forest.Rmd to generate random forest model and corresponding model error visualizations
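The steps above could be scripted roughly as follows. This is a sketch under assumptions: it presumes the scripts sit in the working directory under the names used in the steps, that the rmarkdown package is installed, and that rendering the .Rmd is an acceptable way to "run" it. The helper missing_data_files() is hypothetical and not part of the repository.

```r
# Data files the steps above expect inside the data folder.
required_files <- c(
  "amenities.csv",
  "documentation - amenities.csv",
  "documentation - sale_listings.csv",
  "sale_listings.csv"
)

# Hypothetical helper: returns the expected files that are not yet in data_dir.
missing_data_files <- function(data_dir = "data") {
  required_files[!file.exists(file.path(data_dir, required_files))]
}

missing <- missing_data_files("data")

if (length(missing) == 0) {
  # Load the cleaning and text-processing functions into the global environment.
  source("pre-processing.R")
  source("text_preprocessing.R")
  # Knit the model notebook; assumes the rmarkdown package is available.
  rmarkdown::render("random_forest.Rmd")
} else {
  message("Missing from data/: ", paste(missing, collapse = ", "))
}
```

Checking for the data files first avoids a failure partway through the cleaning script when a .csv is absent.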