This is a semester project carried out at the Distributed Information Systems Laboratory (LSIR) @ EPFL in spring 2020.
Data analysis and model training can be performed on an ordinary computer, as well as in a parallel manner if the underlying architecture supports it.
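As a minimal sketch of what "ordinary computer vs. parallel" means in practice (this is illustrative only and not the project's actual pipeline; the cleaning rules and function names are placeholders), the same per-tweet preprocessing step can run serially or be spread across CPU cores with the standard library:

```python
# Illustrative sketch, not the repository's code: run the same tweet-cleaning
# step serially or in parallel across all available CPU cores.
import re
from multiprocessing import Pool, cpu_count

def clean_tweet(text: str) -> str:
    """Lower-case a tweet and strip URLs, mentions and extra whitespace."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)   # drop links
    text = re.sub(r"@\w+", " ", text)           # drop mentions
    return re.sub(r"\s+", " ", text).strip()

def clean_all(tweets, parallel=True):
    """Clean tweets on one core or across all available cores."""
    if parallel:
        with Pool(cpu_count()) as pool:
            return pool.map(clean_tweet, tweets)
    return [clean_tweet(t) for t in tweets]

if __name__ == "__main__":
    sample = ["Check this out https://t.co/x @user  #news", "Hello   world"]
    print(clean_all(sample, parallel=False))
```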
This work strove to figure out what people talk about on Twitter each day, at the level of high-level topic groups. We collected and analysed a large dataset of tweets along with daily trending topics, clustered them, and categorised the trending topics into conventional media categories using LDA (Latent Dirichlet Allocation). As a result, we created a clean dataset of tweets matched with their trending topics and their general categories. Interestingly, the keywords naturally obtained during LDA can also be used for summarization, search, or description purposes for a particular category of news/texts in the future.
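The actual pipeline lives in the notebooks, scripts, and report. As a hedged sketch of the core step (assuming gensim as the LDA backend, which the source does not specify; the tokenised tweets and category hints below are toy placeholders), fitting LDA and reading off each topic's top keywords looks roughly like this:

```python
# Hedged sketch, not the project's code: fit LDA on tokenised tweets with gensim
# (an assumed library) and inspect each topic's top keywords, the kind of output
# that can then be mapped to media categories such as "sports" or "politics".
from gensim.corpora import Dictionary
from gensim.models import LdaModel

# Toy tokenised tweets grouped around hypothetical trending topics.
docs = [
    ["champions", "league", "goal", "match", "fans"],
    ["election", "vote", "parliament", "policy", "debate"],
    ["goal", "penalty", "coach", "league", "win"],
    ["minister", "vote", "campaign", "policy", "poll"],
]

dictionary = Dictionary(docs)
corpus = [dictionary.doc2bow(doc) for doc in docs]

lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2,
               passes=10, random_state=0)

# The per-topic keywords are what would be matched against media categories.
for topic_id in range(lda.num_topics):
    print(topic_id, lda.show_topic(topic_id, topn=5))
```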
For security reasons, the data is not published.
Only the best model is published under LDA, along with model-checking scripts.
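If the published checkpoint was saved with gensim (an assumption; adjust the loader to whatever format the model actually uses, and note that the path below is a placeholder, not the repository's file name), a quick model check amounts to loading it and printing the topics:

```python
# Hedged sketch: load a saved gensim LDA model (assumed format) and list its topics.
# "path/to/best_lda_model" is a placeholder path.
from gensim.models import LdaModel

lda = LdaModel.load("path/to/best_lda_model")
for topic_id, words in lda.print_topics(num_topics=-1, num_words=10):
    print(topic_id, words)
```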
Notebooks are for data preprocessing and cleanup as well as reporting; they are numbered step by step.
Scripts contains all the code written in the Notebooks, with proper comments.
The Papers folder contains the articles read during the project; not all of them appear as references in the report.