Logbook

The thesis template was set up in an Overleaf document before week 1, and the parts of the thesis design were placed in their corresponding sections.

Week 1

  • 1/04: EDA based on the comments of the thesis design. Specifically, investigating the data types of individual variables, plotting distributions, and selecting the features that are suitable for the research. A start was made on extracting topics from subtitles. Also, the introduction was written.
  • 3/04: Worked on extracting subtitles out of POMS data, especially the cleaning of the subtitles (Dutch POS-tagging, stemming, stop word & punctuation removal, tokenization). Started on performing LDA to model topics (a rough sketch of this pipeline follows this list).
  • 4/04: Performed LDA on test and POMS data. Investigated other types of topic modeling methods (LDA was determined to be the most fitting). EDA on the enriched dataset.
  • 5/04: Worked on improving the topic modeling with bigrams and trigrams. Added EDA on subtitles. Investigated the richness of descriptions per broadcaster in POMS.
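
A rough sketch of the cleaning + LDA step, using gensim and NLTK. The example subtitles and all parameter values are illustrative only, not the settings actually used:

```python
# Sketch of the subtitle cleaning + LDA topic modeling pipeline (illustrative).
import re

from gensim import corpora, models
from nltk.corpus import stopwords          # nltk.download("stopwords") once
from nltk.stem.snowball import SnowballStemmer

subtitles = [  # hypothetical raw subtitle texts, one per item
    "de politie onderzoekt de zaak in amsterdam",
    "het weerbericht voor morgen is zonnig",
    "de verdachte werd vandaag aangehouden door de politie",
]

dutch_stopwords = set(stopwords.words("dutch"))
stemmer = SnowballStemmer("dutch")

def clean_subtitle(text):
    """Tokenize, lowercase, drop stop words, and stem."""
    tokens = re.findall(r"\w+", text.lower())
    return [stemmer.stem(t) for t in tokens if t not in dutch_stopwords]

docs = [clean_subtitle(s) for s in subtitles]

dictionary = corpora.Dictionary(docs)
dictionary.filter_extremes(no_below=1, no_above=0.9)  # stricter thresholds for a real corpus
corpus = [dictionary.doc2bow(doc) for doc in docs]

lda = models.LdaModel(corpus, id2word=dictionary, num_topics=3, passes=10)
print(lda.print_topics(num_words=5))
```

Bigrams/trigrams (5/04) can be added with gensim.models.Phrases before building the dictionary.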

Week 2

  • 8/04: Investigated Dutch lemmatizers for subtitle processing (no good working candidates were found). Investigated literature on hybrid recommender systems based on RecSys conferences. Looked a bit into the genres of items. More investigation into NaN values of both the POMS and the POMS-enriched data.
  • 10/04: New EDA on the enriched POMS dataset. Worked on getting the right stream data.
  • 11/04: Reading literature on hybrid recommender systems. Improving the data preparation on the new stream data, and selecting features for further investigation. Familiarized myself a bit with the current recommendation algorithm. EDA almost finished on the right POMS streams data (only subtitles).
  • 12/04: Working on improved subtitle extraction using the API (partitioning). Looked into PySpark topic modeling.

Week 3

  • 15/04: Selected possible content features, investigating how to represent them best for each item. Some more preprocessing based on new data with new stream specifications. EDA and topic modeling on the new data. Looked a bit into related work.
  • 16/04: Work on literature.
  • 18/04: More work on literature. Complete EDA on final form of data.
  • 19/04: More literature.

Week 4

  • 23/04: Finish literature.
  • 24/04: Work on progress presentation for the thesis meeting. Tried out several recommender algorithms with the MovieLens dataset. Work on data section.
  • 25/04: Look at event data. Work on data section. Have thesis meeting.
  • 26/04: Finished the data section; some small details may still need to be added. Processed event data. Worked on one-hot encoding the genres and credits features (a sketch follows this list). Will work on encoding the subtitles using TF-IDF at series level on the most frequent words. Investigated the evaluation metrics and came to the conclusion to use CTR and precision.
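
A minimal sketch of how the multi-valued genre (and, analogously, credit) features could be one-hot encoded with scikit-learn. The example DataFrame and its values are made up:

```python
# Sketch of one-hot encoding a multi-valued feature (genres) per series.
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

items = pd.DataFrame({
    "seriesRef": ["s1", "s2", "s3"],                       # hypothetical ids
    "genres": [["drama", "crime"], ["news"], ["drama", "comedy"]],
})

mlb = MultiLabelBinarizer()
genre_matrix = mlb.fit_transform(items["genres"])           # one column per genre
genre_df = pd.DataFrame(genre_matrix, columns=mlb.classes_, index=items["seriesRef"])
print(genre_df)
```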

Week 5

  • 29/04: Finish evaluation section with the chosen metrics: CTR and NDCG (a small NDCG sketch follows this list). Worked on one-hot encoding the credits. Investigate the event data some more. Write a bit about the current recommendation system.
  • 30/04: Improve EDA with data on series level. Aggregated subtitles on series level and performed TF-IDF (using HashingTF and CountVectorizer). Finished pre-processing the event data and investigated making an offline train/test set.
  • 01/05: Work on preparing notebooks for thesis meeting. Working on making offline train/test set. Finished investigating the performance of the current recommendation system (RQ1).
  • 02/05: Work on offline evaluation set (RQ2+3). Also, look at content features (RQ3). Thesis meeting.
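
For reference, a minimal NDCG@k computation with binary relevance; this is a generic sketch, not the exact evaluation code used in the experiments:

```python
# Sketch of NDCG@k for one user with binary relevance.
# `recommended` is a ranked list of item ids, `relevant` the items actually watched.
import numpy as np

def ndcg_at_k(recommended, relevant, k=10):
    gains = np.array([1.0 if item in relevant else 0.0 for item in recommended[:k]])
    discounts = 1.0 / np.log2(np.arange(2, gains.size + 2))
    dcg = float(np.sum(gains * discounts))
    ideal_gains = np.ones(min(len(relevant), k))
    idcg = float(np.sum(ideal_gains / np.log2(np.arange(2, ideal_gains.size + 2))))
    return dcg / idcg if idcg > 0 else 0.0

print(ndcg_at_k(["a", "b", "c", "d"], relevant={"b", "d"}, k=4))  # ~0.65
```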

Week 6

  • 6/05: Working on the train and test set of the March data. Ran a hybrid recommender system using the LightFM model on the pre-processed event data, using only the broadcaster as a feature (a minimal sketch follows this list).
  • 7/05: Improving the offline dataset. Ran the LightFM model over the improved dataset. New literature.
  • 8/05: Finished mid-report with literature, improved data and improved evaluation section. A skeleton for a method section was made.
  • 9/05: Perform the LightFM model with no content features, title as feature and broadcast + title as feature. Thesis meeting.
  • 10/05: Test LightFM model with more event data. Look a bit at text feature selection. Investigate the workings of LightFM some more.
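
A minimal sketch of such a LightFM run with the broadcaster as the only item feature. The tiny `events` and `items` DataFrames, their column names, and the hyperparameters are assumptions for illustration:

```python
# Sketch of a LightFM hybrid model with only the broadcaster as item feature.
import pandas as pd
from lightfm import LightFM
from lightfm.data import Dataset
from lightfm.evaluation import precision_at_k

# Hypothetical interaction and item data (the real data comes from the NPO event logs).
events = pd.DataFrame({"userId": ["u1", "u1", "u2", "u3"],
                       "seriesRef": ["s1", "s2", "s2", "s3"]})
items = pd.DataFrame({"seriesRef": ["s1", "s2", "s3"],
                      "broadcaster": ["NOS", "VPRO", "NOS"]})

dataset = Dataset()
dataset.fit(users=events["userId"].unique(),
            items=items["seriesRef"].unique(),
            item_features=items["broadcaster"].unique())

interactions, _ = dataset.build_interactions(
    (row.userId, row.seriesRef) for row in events.itertuples())
item_features = dataset.build_item_features(
    (row.seriesRef, [row.broadcaster]) for row in items.itertuples())

model = LightFM(loss="warp", no_components=30)
model.fit(interactions, item_features=item_features, epochs=20, num_threads=2)

# Evaluated on the training interactions here, purely for illustration.
print(precision_at_k(model, interactions, item_features=item_features, k=2).mean())
```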

Week 7

  • 13/05: Tested LightFM with pre-processed event data (removing duplicates and thresholding likes). Tried out different settings of LightFM. Looked at TF-IDF of titles.
  • 14/05: Ran the LightFM model with the standard deviation included in the results.
  • 15/05: Received the Go/No-Go feedback. Ran the LightFM model on pre-processed titles (title words, and title words without stop words). Gave the progress presentation for the NPO team.
  • 16/05: Implemented the Go/No-Go feedback in the RQ3 notebook. Looked at the encoded vector length of the content features. Thesis meeting.
  • 19/05: Working on TF-IDF (getting the top TF-IDF weights per row; see the sketch after this list).
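
A sketch of getting the top TF-IDF terms per row (e.g. per title) with scikit-learn; the example titles are made up:

```python
# Sketch: top-n TF-IDF terms per document (e.g. per series title).
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

titles = ["het journaal", "journaal laat", "boer zoekt vrouw"]  # hypothetical titles

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(titles)               # one row per title
terms = np.array(vectorizer.get_feature_names_out())

top_n = 2
for i, title in enumerate(titles):
    row = tfidf.getrow(i).toarray().ravel()
    top_idx = row.argsort()[::-1][:top_n]
    print(title, "->", list(zip(terms[top_idx], row[top_idx].round(3))))
```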

Week 8

  • 20/05: Got TF-IDF working and am now applying it to the text content features (figuring out parameters). Prepared broadcaster, credits and genres as side information. Got TF-IDF for the titles; only the descriptions and subtitles are left. These features are all on series level.
  • 21/05: Got TF-IDF for the descriptions, however there is not enough memory for the subtitles, so working on TF-IDF in PySpark. Also working on predicting/producing recommendation lists for users.
  • 22/05: TF-IDF for all text features in PySpark (series level; a sketch follows this list). Added all seriesRefs (mids), not only those of the chosen series. Refactored the data pre-processing, which is now in PySpark.
  • 23/05: Had the thesis meeting (without Marx). Worked on the content feature combinations. Worked on preparing the setup for running multiple models.
  • 24/05: Filtered the user interaction data based on the length of a stream (at least 50% of a broadcast should be watched). Fixed inconsistencies in the features to ensure good combinations.
  • 26/05: Fixed the train/test split into 21 days of training and 1 day of testing (it is now chronological, like the real setting). Worked on running all the combinations smoothly (in order to compare the models).
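
A sketch of the series-level TF-IDF in PySpark, with CountVectorizer + IDF (HashingTF could be swapped in for CountVectorizer). The `series_text` DataFrame and its contents are made up for illustration:

```python
# Sketch of series-level TF-IDF on subtitle/description text in PySpark.
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import RegexTokenizer, StopWordsRemover, CountVectorizer, IDF

spark = SparkSession.builder.appName("series-tfidf").getOrCreate()

# Hypothetical aggregated text per series.
series_text = spark.createDataFrame(
    [("s1", "politie onderzoekt zaak"), ("s2", "weerbericht voor morgen")],
    ["seriesRef", "text"],
)

pipeline = Pipeline(stages=[
    RegexTokenizer(inputCol="text", outputCol="tokens", pattern="\\W+"),
    StopWordsRemover(inputCol="tokens", outputCol="filtered",
                     stopWords=StopWordsRemover.loadDefaultStopWords("dutch")),
    CountVectorizer(inputCol="filtered", outputCol="tf", vocabSize=10000),
    IDF(inputCol="tf", outputCol="tfidf"),
])

model = pipeline.fit(series_text)
model.transform(series_text).select("seriesRef", "tfidf").show(truncate=False)
```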

Week 9

  • 27/05: Figured out how to produce recommendation lists for each user. First tried running a few combinations -> ran 1/3 of all combinations. Worked a bit on the related work section (added a metadata subsection).
  • 28/05: Finished running all combinations. Looked at the initial results and started interpreting them. Also ran the best combination through the hyperparameter grid search (see the sketch after this list).
  • 29/05: Worked on today's deadline from Maarten Marx: finished describing the thesis skeleton.
  • 31/05: Running all the combinations twice, and optimizing the best model once.
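
A sketch of a simple hyperparameter grid search over LightFM. Here the MovieLens data from lightfm.datasets stands in for the NPO train/test sets, and the grid values are illustrative, not the ones actually searched:

```python
# Sketch of a small grid search over LightFM hyperparameters.
from itertools import product

from lightfm import LightFM
from lightfm.datasets import fetch_movielens        # stand-in for the NPO data
from lightfm.evaluation import precision_at_k

data = fetch_movielens(min_rating=4.0)
train, test = data["train"], data["test"]
item_features = data["item_features"]

grid = {"no_components": [10, 30], "learning_rate": [0.01, 0.05], "loss": ["warp", "bpr"]}

best_score, best_params = -1.0, None
for no_components, learning_rate, loss in product(*grid.values()):
    model = LightFM(no_components=no_components, learning_rate=learning_rate, loss=loss)
    model.fit(train, item_features=item_features, epochs=10, num_threads=2)
    score = precision_at_k(model, test, train_interactions=train,
                           item_features=item_features, k=10).mean()
    if score > best_score:
        best_score, best_params = score, (no_components, learning_rate, loss)

print(best_params, best_score)
```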

Week 10

  • 03/06: Working with my mentor to get the model set up for production (online). Optimizing the best model again. Cleaning notebooks and the bucket. Working a bit on the feedback of May 29.
  • 04/06: Work on understanding the results better. Tweak the model some more (thresholding). Prepare event data for more days. Prepare the progress presentation for the NPO (tomorrow).
  • 05/06: Finish and give the progress presentation for the NPO. Work on the reflection and results for the thesis meeting tomorrow. Ran the model a bit on the new train/test set with only user thresholding.
  • 06/06: Prepare the reflection and results for the thesis meeting. Improved the evaluation of the current recommendation system. Improved the research questions notebook with statistics (t-tests; a quick sketch follows this list).
  • 07/06: Investigate the results with and without thresholding of users (without thresholding gives better results). Removed features that are unique (i.e. that do not appear for more than one series).
  • 09/06: Make a new evaluation set: one that contains everything that is recommended and then watched (not only chosen or only watched).
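
The t-tests (and the two-sample t-test calculator linked at the bottom of this page) boil down to comparing two samples of scores; a quick scipy equivalent with made-up numbers:

```python
# Sketch of a two-sample t-test between two sets of model scores (made-up values).
from scipy import stats

ndcg_model_a = [0.31, 0.28, 0.35, 0.30, 0.29]   # hypothetical per-run scores
ndcg_model_b = [0.27, 0.26, 0.30, 0.25, 0.28]

t_stat, p_value = stats.ttest_ind(ndcg_model_a, ndcg_model_b, equal_var=True)
print(f"t = {t_stat:.3f}, p = {p_value:.3f}")
```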

Week 11

  • 10/06: Fixed the train and test set. Changed the testing to use the new train and test sets. Ran all combinations with the new train/test set (a split sketch follows this list).
  • 11/06: Optimize the best combinations several times. Fix EDA a bit on series level. Write methodology a bit.
  • 12/06: Work on writing the methodology. Look into the results of the combinations a bit. Work on the evaluation of RQ1. The global method section is finished; however, I'm not happy with the flow and may change it up a bit. Also, looked a bit into the data section.
  • 13/06: Running RQ2 experimental setup.
  • 14/06: Running RQ2 experimental setup.
  • 15/06: Writing out experimental setup.
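
A sketch of the chronological split on the event data: 21 days of training followed by 1 day of testing. The `events` DataFrame and its columns are made-up examples:

```python
# Sketch of the chronological 21-day train / 1-day test split on event data.
import pandas as pd

events = pd.DataFrame({
    "userId": ["u1", "u1", "u2"],
    "seriesRef": ["s1", "s2", "s1"],
    "timestamp": pd.to_datetime(["2019-03-01", "2019-03-15", "2019-03-22"]),
})

test_day = events["timestamp"].max().normalize()        # last day in the data
train_start = test_day - pd.Timedelta(days=21)

train = events[(events["timestamp"] >= train_start) & (events["timestamp"] < test_day)]
test = events[events["timestamp"] >= test_day]
print(len(train), "train events,", len(test), "test events")
```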

Week 12

  • 17/06 - 21/06: Writing and fixing thesis sections. Cleaning notebooks for GitHub.


Two-sample t-test calculator: http://www.statskingdom.com/140MeanT2eq.html
