This is a project work of Xccelerate Data Science Bootcamp Cohort 3.
-
Find a document and auto-summarize it. It can be a blog/news article/research paper. Use a machine based approach for this.
2 versions of text summarization are built:
- Version 1: Do word count of the paragraph. The score of each word will be its occurrence.
- Version 2: Using TF-IDF for summarization.
Due to time-constraint, the machine learning part is still to be developed. You can go to the notebook to find out more.
Skills: BeautifulSoup, nltk, TF-IDF (without using tfidf vectorizer)
-
Find a list of blogs from any website and find a theme. Use this learning to test if new articles can have be classified into themes as well.
2 models are built:
- Version 1: K-means to classify TF-IDF matrix of the news articles
- Version 2: Supervised learning for news articles categorization
Skills: nltk, K-means, elbow method, Silhouette Coefficient, tfidf_vectorizer, randomforestclassifier, xgboost, LinearSVC, KNN