machineLearning_sentimentAnalysis

Project 2 for COMP90049 Knowledge Technologies. The main goal of this assignment is to identify people's sentiments by analysing their tweets.

Datasets

Three kinds of datasets are given for different purposes. The ‘train’ tweet datasets are used to train classifier models with various machine learning algorithms, while the ‘eval’ datasets are used to evaluate the performance of the trained models. The third kind, the ‘test’ dataset, is unlabeled; the trained classifier assigns labels to it.

Scikit-Learn

Scikit-Learn is used in this project to build sentiment analysis models and to predict labels for the unseen dataset. It is a popular machine learning library for Python that provides a robust set of algorithms, including Naïve Bayes and Decision Trees. The main process is:

First, extract and select feature words from the tweets after tokenizing, counting and normalizing.
Second, train the classifier model on the given training dataset, using a built-in method based on a chosen machine learning algorithm.
Third, evaluate the model on the given evaluation dataset and calculate evaluation metrics from its output.
Finally, use the trained classifier to assign labels to the given unlabeled test data.
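The process above can be sketched with a few Scikit-Learn calls. This is a minimal illustration, not the code in twitterSentiment.py: the toy tweets, the label names and the choice of CountVectorizer with MultinomialNB and accuracy as the metric are all assumptions made for the example, and the normalizing step is left out for brevity.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB

# Toy data standing in for the train / eval / test files (assumed, not real data).
train_tweets = [
    "great day, love it",
    "love this happy song",
    "terrible awful service",
    "hate this awful traffic",
]
train_labels = ["positive", "positive", "negative", "negative"]
eval_tweets = ["love this great song", "awful terrible traffic"]
eval_labels = ["positive", "negative"]
test_tweets = ["happy great day"]  # unlabeled

# Step 1: tokenize the tweets and count word occurrences.
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_tweets)

# Step 2: train a classifier on the training features.
clf = MultinomialNB()
clf.fit(X_train, train_labels)

# Step 3: evaluate on the labeled evaluation set.
eval_pred = clf.predict(vectorizer.transform(eval_tweets))
accuracy = accuracy_score(eval_labels, eval_pred)

# Step 4: assign labels to the unlabeled test tweets.
test_pred = clf.predict(vectorizer.transform(test_tweets))

print("eval accuracy:", accuracy)
print("test predictions:", list(test_pred))
```

The vectorizer is fitted on the training tweets only and then reused for the evaluation and test tweets, so all three sets share the same feature columns.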

Feature Engineering

To improve the performance of the analysis model, feature engineering is needed. In this assignment, the feature selection tool provided by Scikit-Learn is used to gather more feature words to identify emotions. The main steps are:

Tokenizing. Convert each tweet text into a sequence of tokens.
Data cleaning. Remove all punctuation and stop words. Stop words are commonly used words such as “a”, “the” and “in”, which do not help to train a model.
Feature selection. After feature extraction, we may obtain a huge number of unigram feature words. Only the 5000 most frequently occurring words are used in this project.
Training. Train the classifier with the selected features.
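One way to carry out the tokenizing, cleaning and selection steps is with Scikit-Learn's CountVectorizer. The sketch below is an assumption about how this could be done, not necessarily what extractFeatures.py does: the sample tweets are invented, the built-in English stop word list is used, and max_features is set to 3 so the effect is visible on a tiny corpus (the project keeps 5000).

```python
from sklearn.feature_extraction.text import CountVectorizer

# Invented sample tweets, for illustration only.
tweets = [
    "I love the sunny weather, love it!",
    "The weather is awful and I hate it.",
    "Love the new phone, hate the price.",
]

vectorizer = CountVectorizer(
    lowercase=True,        # fold case before tokenizing
    stop_words="english",  # drop common words such as "the" and "it"
    max_features=3,        # keep only the most frequent words (5000 in the project)
)

# The default tokenizer splits on punctuation and ignores single-character tokens.
counts = vectorizer.fit_transform(tweets)

# Words that survived cleaning and selection, in column order.
selected_words = sorted(vectorizer.vocabulary_)
print(selected_words)
print(counts.toarray())
```

Each row of the resulting matrix is one tweet and each column is one selected unigram, which is the form the classifier is trained on.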

