This project grew out of the Kaggle competition found here: Kaggle College Scorecard
Quoted from the Kaggle description: "While it's understood that students from elite colleges tend to earn more than graduates from less prestigious universities, the finer relationships between future income and university attendance are quite murky."
If university attendance is viewed as a financial investment, then it makes sense to properly understand the expected risk and return of that investment. However, its return is among the more difficult to anticipate, so I hope to find relationships in this dataset that predict return on investment.
My guiding question is: "What features of a university education correlate with a better return on investment?"
Who is your client and why do they care about this problem? In other words, what will your client DO or DECIDE based on your analysis that they wouldn’t have otherwise?
The client for this problem is future or returning university students. The former consists of newly graduated high school students and adults attending university for the first time. The latter can be classified as adults returning to restart and finish a degree, or to further their education.
The client would be able to rank their university choices by return on investment and make a better-informed decision about which university best suits them.
For now, I will use the data provided by the Kaggle competition. The Kaggle webpage describes this data set as follows: "...the US Department of Education has matched information from the student financial aid system with federal tax returns to create the College Scorecard dataset."
Further data sets may be used if the need arises.
My approach will start with becoming familiar with the data and making sure that it is ready for analysis. Some simple graphs will facilitate this, and perhaps some clustering algorithms, once I understand how to implement those.
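As a rough illustration of the data-readiness check described above, the sketch below measures how much of a column is unusable before any graphing. It is written in Python purely for illustration and rests on several assumptions: the column names (INSTNM, MD_EARN_WNE_P10, GRAD_DEBT_MDN), the missing-value markers ("NULL", "PrivacySuppressed"), and the four sample rows are all placeholders, not values taken from the actual Kaggle files.

```python
import csv
import io

# Hypothetical extract: the column names, sentinel strings, and rows
# are assumptions made for illustration, not real Scorecard records.
SAMPLE = """INSTNM,MD_EARN_WNE_P10,GRAD_DEBT_MDN
College A,41000,21500
College B,PrivacySuppressed,18000
College C,35500,NULL
College D,52000,27000
"""

# Strings assumed to mark a missing or withheld value.
MISSING = {"", "NULL", "PrivacySuppressed"}


def to_number(raw):
    """Return a float, or None when the cell holds a missing-value marker."""
    raw = raw.strip()
    if raw in MISSING:
        return None
    return float(raw)


def missing_share(rows, column):
    """Fraction of rows whose value in `column` is missing."""
    values = [to_number(row[column]) for row in rows]
    return sum(v is None for v in values) / len(values)


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
for col in ("MD_EARN_WNE_P10", "GRAD_DEBT_MDN"):
    print(col, missing_share(rows, col))
```

A column with a large missing share is a candidate to drop or treat separately before plotting or clustering.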
After that, I'll start to ask questions of the dataset and develop simple hypotheses to test. As I grow more comfortable with the data, those hypotheses will grow in complexity. I hope to end with some null hypothesis tests of whether particular features predict a higher likelihood of a good return on investment.
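One possible shape for such a null hypothesis test is a permutation test on the difference in mean earnings between two groups of schools. The sketch below is only an illustration in Python: the function name, the grouping by completion rate, and the earnings figures are all hypothetical and synthetic, and the proposal does not commit to this particular test.

```python
import random
from statistics import mean


def permutation_p_value(group_a, group_b, n_resamples=10_000, seed=0):
    """Two-sided permutation test for a difference in group means.

    Null hypothesis: the group labels are unrelated to the values.
    """
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_resamples):
        # Reassign labels at random and recompute the difference.
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            extreme += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (extreme + 1) / (n_resamples + 1)


# Synthetic median earnings in USD, invented for illustration only.
high_completion = [48000, 52000, 45500, 50500, 47000, 53000]
low_completion = [39000, 41500, 37000, 43000, 40000, 38500]

p = permutation_p_value(high_completion, low_completion)
print(f"p-value: {p:.4f}")
```

A small p-value would suggest that the grouping feature is associated with earnings, though not that it causes the difference.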
What are your deliverables? Typically, this would include code, along with a paper and/or a slide deck.
The deliverable will be an R Markdown document detailing any discoveries along with the code that led to those discoveries.