by Chris Adu
I built a recommender system that suggests movies to watch based on item similarity. It uses the MovieLens dataset, which has over one hundred thousand film ratings between 0 and 5. I focused on two films, Star Wars (the 1977 original) and Liar Liar, to see whether the model could provide good recommendations based on a selected film.
I initially analysed the database to understand two key things:
- What films had the highest ratings
- What films had the most ratings
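Both questions come down to grouping the ratings by film and computing a mean and a count for each. Here is a minimal sketch in plain Python, assuming the data is a list of (user, title, rating) rows. The sample rows and the name `summarise_ratings` are made up for illustration and are not taken from the dataset or the original analysis. With pandas, the same numbers would typically come from a `groupby` on the title column.

```python
from collections import defaultdict

# Hypothetical sample rows: (user_id, title, rating). Not real MovieLens data.
ratings = [
    (1, "Film A", 5), (2, "Film A", 4), (3, "Film A", 5),
    (1, "Film B", 3), (2, "Film B", 2),
    (3, "Film C", 5),
]

def summarise_ratings(rows):
    """Return {title: (average rating, number of ratings)}."""
    by_title = defaultdict(list)
    for _user, title, rating in rows:
        by_title[title].append(rating)
    return {t: (sum(r) / len(r), len(r)) for t, r in by_title.items()}

summary = summarise_ratings(ratings)

# Films ordered by average rating, highest first.
top_rated = sorted(summary, key=lambda t: summary[t][0], reverse=True)
# Films ordered by number of ratings, highest first.
most_rated = sorted(summary, key=lambda t: summary[t][1], reverse=True)
```

In this toy data, "Film C" tops the average-rating list on the strength of a single rating, which is why the rating count is worth looking at alongside the mean.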
Distribution of Number of Ratings
- This shows that only a small number of films have more than 200 ratings.
- Most movies have between 0 and 1 ratings. This seems plausible: the most popular blockbusters attract the most ratings, but they are a minority of films. Less successful films, which make up the majority, are watched by fewer people and therefore receive fewer ratings.
Distribution of Ratings
- The distribution of the ratings themselves has peaks at the whole numbers (1, 2, 3, 4, 5), possibly because only a few people watched those movies and each gave a whole-number rating.
- Most ratings fall between 3 and 4 and are roughly normally distributed.
- There are a few outlier ratings at 1 and 5. These movies were most likely watched by only a few people, who gave them a very bad or very good rating. However, they could also be very popular blockbusters that the majority of viewers really liked or disliked.
Joint Plot of Average Rating vs Number of Ratings
- We see a positive correlation between a film's average rating and its number of ratings.
- There are again a few outliers at 1 and 5.
Using the correlation between movies' ratings, and focusing on films with more than 100 ratings (to remove potential outliers), I found recommendations that were convincingly similar to the selected film.
Below are the results for correlation with Star Wars:
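The approach described here can be sketched as follows: for each candidate film, correlate its ratings with the target film's ratings over the users who rated both, after dropping films with too few ratings. This sketch assumes Pearson correlation, which the write-up does not state explicitly. The function names `pearson` and `similar_films`, and the toy rows in the usage example, are illustrative rather than the original code. In pandas, a user-by-film pivot table combined with `DataFrame.corrwith` would be the usual route.

```python
from collections import defaultdict
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation of two equal-length lists; None if undefined."""
    n = len(xs)
    if n < 2:
        return None
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return None  # constant ratings: correlation is not defined
    return cov / sqrt(vx * vy)

def similar_films(rows, target, min_ratings=100):
    """Rank other films by rating correlation with `target`.

    rows: iterable of (user_id, title, rating).
    Only films with more than `min_ratings` ratings are kept.
    """
    by_film = defaultdict(dict)  # title -> {user_id: rating}
    for user, title, rating in rows:
        by_film[title][user] = rating

    target_ratings = by_film[target]
    scores = {}
    for title, user_ratings in by_film.items():
        if title == target or len(user_ratings) <= min_ratings:
            continue
        # Compare only the users who rated both films.
        shared = [u for u in user_ratings if u in target_ratings]
        r = pearson([target_ratings[u] for u in shared],
                    [user_ratings[u] for u in shared])
        if r is not None:
            scores[title] = r
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Toy usage with made-up rows and a lowered threshold so the example is small.
rows = [
    (1, "Target", 5), (2, "Target", 3), (3, "Target", 1),
    (1, "Alike", 4), (2, "Alike", 3), (3, "Alike", 2),
    (1, "Opposite", 1), (2, "Opposite", 3), (3, "Opposite", 5),
    (1, "Rare", 5),
]
ranked = similar_films(rows, "Target", min_ratings=2)
```

In the toy data, "Rare" is excluded by the threshold. Without that filter, a film rated by only a handful of users could produce an extreme correlation by chance.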
- We can see that Empire Strikes Back and Return of the Jedi are the most highly correlated, at 0.749 and 0.673 respectively. They are followed by Raiders of the Lost Ark, which is a similar style of movie. The correlation drops off with Austin Powers, at 0.377. This could be because the movie is a comedy rather than sci-fi like Star Wars; it potentially made the recommendations because of its popularity.
Below are the results for correlation with Liar Liar:
- The model doesn't show as strong a correlation for Liar Liar, with the highest correlation being 0.517.
- This is possibly because the movie's top recommendations come from a variety of genres. However, all of the top three movies share the same lead actor (Jim Carrey).
- There could be a link here: a popular actor in one blockbuster may lead to other popular blockbusters with the same actor getting similar ratings.
- Seaborn
- Matplotlib
- Correlations
- MovieLens Dataset