- Abstract
- Data Preprocessing
- Generative and Discriminative Classifiers
- Logistic Regression
- The Cross-Entropy Loss Function
- Gradient Descent
- Naive Bayes
- Conditional Probability
- Reference
This article explores how discriminative and generative algorithms work as natural language sentiment classifiers and how they make their classification decisions, backing the discussion up with a little algebra, probability, and calculus. Topics covered include Maximum Likelihood Estimation, the Cross-Entropy Loss Function, Gradient Descent, and Conditional Probability.
This article uses the twitter_samples dataset from nltk, which contains 5,000 positive tweets and 5,000 negative tweets collected from the real Twitter platform, 10,000 tweets in total. Below I use two very different classification approaches, the Sigmoid Function of Logistic Regression and Naive Bayes, to build a tweet sentiment classifier through Natural Language Processing.
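For reference, the corpus can be loaded and split along these lines; the 4,000/1,000 per-class train/test split and the variable names `train_x`, `train_y`, `test_x`, `test_y` are assumptions chosen to match how they are used later in this post.

```python
# Minimal sketch of loading and splitting the twitter_samples corpus
# (split sizes and variable names are assumptions).
import numpy as np
import nltk
from nltk.corpus import twitter_samples

nltk.download('twitter_samples')
nltk.download('stopwords')

all_positive_tweets = twitter_samples.strings('positive_tweets.json')
all_negative_tweets = twitter_samples.strings('negative_tweets.json')

# 4000 tweets of each class for training, the remaining 1000 of each for testing
train_x = all_positive_tweets[:4000] + all_negative_tweets[:4000]
test_x = all_positive_tweets[4000:] + all_negative_tweets[4000:]
train_y = np.append(np.ones((4000, 1)), np.zeros((4000, 1)), axis=0)
test_y = np.append(np.ones((1000, 1)), np.zeros((1000, 1)), axis=0)
```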
import re
import string
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import TweetTokenizer

def process_tweet(tweet):
    """
    Input:
        tweet: a string containing a tweet
    Output:
        tweet_stem: a list of words containing the processed tweet
    """
    # Cleaning
    tweet = re.sub(r'^RT[\s]+', '', tweet)
    tweet = re.sub(r'https?://[^\s\n\r]+', '', tweet)
    tweet = re.sub(r'#', '', tweet)
    tweet = re.sub(r'@', '', tweet)
    # Tokenization
    token = TweetTokenizer(preserve_case=False, strip_handles=True, reduce_len=True)
    tweet_tokenized = token.tokenize(tweet)
    # Stop words & Punctuation
    stopwords_english = stopwords.words('english')
    tweet_processed = []
    for word in tweet_tokenized:
        if (word not in stopwords_english and
                word not in string.punctuation):
            tweet_processed.append(word)
    # Stemming & Lowercasing
    tweet_stem = []
    stem = PorterStemmer()
    for word in tweet_processed:
        stem_word = stem.stem(word)
        tweet_stem.append(stem_word)
    return tweet_stem
Text cleaning consists of four steps: basic Cleaning, Tokenization, Removing Stop Words and Punctuation, and Stemming and Lowercasing.
Cleaning removes common Twitter artifacts such as RT (retweet), https links, #, and @.
Tokenization splits each sentence into individual tokens, for example:
>>> from nltk.tokenize import TweetTokenizer
>>> tknzr = TweetTokenizer()
>>> s0 = "This is a cooool #dummysmiley: :-) :-P <3 and some arrows < > -> <--"
>>> tknzr.tokenize(s0)
['This', 'is', 'a', 'cooool', '#dummysmiley', ':', ':-)', ':-P', '<3',
 'and', 'some', 'arrows', '<', '>', '->', '<--']
Removing Stop Words & Punctuation drops every token that appears in nltk.corpus.stopwords.words('english') or in string.punctuation.
Stemming & Lowercasing strips the affixes from every token to reduce it to its stem and converts everything to lowercase.
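As a quick sanity check, running `process_tweet` on a made-up tweet gives something like the following (the expected output in the comment is approximate):

```python
# Example usage of process_tweet; the sample tweet is invented and the
# output shown in the comment is approximate.
sample = "RT Loving the #sunshine today!!! :) https://t.co/xyz"
print(process_tweet(sample))
# -> something like ['love', 'sunshin', 'today', ':)']
```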
def build_freqs(tweets, ys):
    yslist = np.squeeze(ys).tolist()
    freqs = {}
    for y, tweet in zip(yslist, tweets):
        for word in process_tweet(tweet):
            pair = (word, y)
            if pair in freqs:
                freqs[pair] += 1
            else:
                freqs[pair] = 1
    return freqs
Input
- `tweets`: a list of unprocessed tweets
- `ys`: the sentiment label (1, 0) of each tweet, shape (m, 1)

Return
- `freqs`: a dictionary mapping every word to its sentiment frequency
  - `freqs.keys()`: (word, sentiment) pairs, e.g. `('pleas', 1.0)`
  - `freqs.values()`: the frequency, e.g. `81`, meaning that 'pleas' appears 81 times in positive tweets
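Assuming the training split sketched earlier, the dictionary is built with a single call; an excerpt of the resulting (word, sentiment) pairs follows below.

```python
# Build the (word, sentiment) -> count dictionary from the training split
# (train_x, train_y are assumed from the loading sketch above).
freqs = build_freqs(train_x, train_y)
print(f'{len(freqs)} (word, sentiment) pairs counted')
```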
{('followfriday', 1.0): 23,
('france_int', 1.0): 1,
('pkuchli', 1.0): 1,
('57', 1.0): 2,
('milipol_pari', 1.0): 1,
('top', 1.0): 30,
('engag', 1.0): 7,
('member', 1.0): 14,
('commun', 1.0): 27,
('week', 1.0): 72,
(':)', 1.0): 2960,
('lamb', 1.0): 1,
('2ja', 1.0): 1,
('hey', 1.0): 60,
('jame', 1.0): 7,
('odd', 1.0): 2,
(':/', 1.0): 5,
('pleas', 1.0): 81}
keys = ['words of interest']  # replace with the words you want to inspect
data = []
for word in keys:
    pos = 0
    neg = 0
    if (word, 1) in freqs:
        pos = freqs[(word, 1)]
    if (word, 0) in freqs:
        neg = freqs[(word, 0)]
    data.append([word, pos, neg])
data
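The counts in `data` are easier to read on a log–log scatter plot; below is a minimal sketch (the axis choices and the neutral reference line are my assumptions):

```python
# Sketch of a log-log scatter plot of positive vs. negative counts for the
# selected words; styling choices are assumptions.
import matplotlib.pyplot as plt
import numpy as np

fig, ax = plt.subplots(figsize=(8, 8))
x = np.log([row[1] + 1 for row in data])  # log(positive count + 1)
y = np.log([row[2] + 1 for row in data])  # log(negative count + 1)
ax.scatter(x, y)
for row, xi, yi in zip(data, x, y):
    ax.annotate(row[0], (xi, yi))          # label each point with its word
ax.plot([0, 9], [0, 9], color='red')       # neutral line: equal pos/neg frequency
ax.set_xlabel('log positive count')
ax.set_ylabel('log negative count')
plt.show()
```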
The biggest difference between Logistic Regression and Naive Bayes is that Logistic Regression is a Discriminative Classifier while Naive Bayes is a Generative Classifier [1].
A discriminative classifier focuses on the classification itself: it draws the boundary between the two classes, so unlike a generative classifier it does not make assumptions about the data or compute conditional probabilities of it. A generative classifier, as the name suggests, tries to find a model that could generate new data points resembling the training set, so it focuses on the class-conditional distribution of the training data and learns its shape and characteristics.
Take a cat-vs-dog classifier as an example: a generative classifier tries to learn what cats and dogs each look like, while a discriminative classifier only wants to know how to tell the two apart. The difference can also be seen in eq. 1 and 2. A generative classifier picks the class $c$ that maximizes the likelihood of the document $d$ together with the class prior,

$$\hat{c} = \arg\max_{c} P(d \mid c)\,P(c) \tag{1}$$

while a discriminative classifier directly models and maximizes the posterior probability of the class given the document,

$$\hat{c} = \arg\max_{c} P(c \mid d) \tag{2}$$
In NLP, Logistic Regression is one of the most basic supervised classification algorithms; a neural network can be viewed as a stack of logistic regression classifiers.
In the Logistic Regression model, every tweet first goes through the preprocessing above, and all the resulting words (say m distinct words) are collected into a Vocabulary.
After counting how often each word appears in positive tweets and in negative tweets, every tweet is turned into a (1x3) feature vector: a bias term, the summed positive frequency of its words, and the summed negative frequency of its words.
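The `extract_features` helper that turns a tweet into this (1x3) row is used throughout the rest of the post but is not listed; a minimal sketch consistent with that description (the implementation details are assumptions) could look like this:

```python
def extract_features(tweet, freqs):
    """Map a raw tweet to a (1, 3) feature row:
    [bias, summed positive frequency, summed negative frequency]
    (minimal sketch, not the original listing)."""
    word_l = process_tweet(tweet)
    x = np.zeros((1, 3))
    x[0, 0] = 1  # bias term
    for word in word_l:
        x[0, 1] += freqs.get((word, 1.0), 0)  # positive frequency
        x[0, 2] += freqs.get((word, 0.0), 0)  # negative frequency
    return x
```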
def sigmoid(z):
    h = 1/(1 + np.exp(-z))
    return h
In Logistic Regression, model performance is quantified by comparing the classifier output $\hat{y} = \mathrm{sigmoid}(\theta^{T}x)$ with the true label $y$ through the cross-entropy loss

$$L(\hat{y}, y) = -\big[\,y\log\hat{y} + (1-y)\log(1-\hat{y})\,\big]$$

The cost $J(\theta)$ used in the code below is simply the average of this loss over all m training examples. The loss is close to 0 when the prediction agrees with the label and blows up when the model is confidently wrong:

| $L(\hat{y}, y)$ | y = 0 | y = 1 |
|---|---|---|
| $\hat{y} \to 1$ | $\to \infty$ | 0 |
| $\hat{y} \to 0$ | 0 | $\to \infty$ |
Breaking the loss function down further, we can see that it is the sum of two contributions: the term $y\log\hat{y}$, which is only active when $y = 1$, and the term $(1-y)\log(1-\hat{y})$, which is only active when $y = 0$:

| y | $y\log\hat{y}$ | $(1-y)\log(1-\hat{y})$ |
|---|---|---|
| 0 | 0 | any |
| 1 | any | 0 |

When $y = 1$, the first term rewards predictions close to 1 and heavily punishes predictions close to 0:

| y | $\hat{y}$ | $-y\log\hat{y}$ |
|---|---|---|
| 0 | any | 0 |
| 1 | 0.99 | ~0 |
| 1 | ~0 | $\to \infty$ |

Symmetrically, when $y = 0$ the second term rewards predictions close to 0:

| y | $\hat{y}$ | $-(1-y)\log(1-\hat{y})$ |
|---|---|---|
| 1 | any | 0 |
| 0 | 0.01 | ~0 |
| 0 | ~1 | $\to \infty$ |
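The behaviour summarized in these tables can be checked numerically with a short sketch:

```python
# Quick numerical check of the cross-entropy loss for a few (y, y_hat) pairs.
import numpy as np

def cross_entropy(y, y_hat):
    return float(-(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat)))

for y, y_hat in [(1, 0.99), (1, 0.01), (0, 0.01), (0, 0.99)]:
    print(f'y = {y}, y_hat = {y_hat:4.2f} -> loss = {cross_entropy(y, y_hat):6.3f}')
# y = 1, y_hat = 0.99 -> loss ~ 0.010  (correct and confident: tiny loss)
# y = 1, y_hat = 0.01 -> loss ~ 4.605  (wrong and confident: large loss)
# y = 0, y_hat = 0.01 -> loss ~ 0.010
# y = 0, y_hat = 0.99 -> loss ~ 4.605
```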
The purpose of Gradient Descent is to find an ideal weight vector $\theta$ that minimizes the cost $J(\theta)$.
During gradient descent, we first compute the gradient of the cost with respect to the current weights $\theta$ [4]; for the cross-entropy cost above it works out to $\nabla J(\theta) = \frac{1}{m}X^{T}(h - y)$, and the weights are then moved a small step in the opposite direction of the gradient: $\theta := \theta - \alpha\,\nabla J(\theta)$.
The other parameter of gradient descent is the learning rate $\alpha$, which controls how large each update step is.
def gradientDescent(x, y, theta, alpha, num_iters):
    m = len(x)
    for i in range(0, num_iters):
        z = np.dot(x, theta)
        h = sigmoid(z)
        # Loss function
        J = -1/m * (np.dot(y.T, np.log(h)) + np.dot((1 - y).T, np.log(1 - h)))
        # Gradient Descent
        theta = theta - alpha/m * (np.dot(x.T, (h - y)))
    J = float(J)
    return J, theta
Input
- `x`: input matrix, shape (m, n+1) (training x)
- `y`: corresponding label matrix, shape (m, 1) (training y)
- `theta`: initial weight vector, shape (n+1, 1)
- `alpha`: learning rate
- `num_iters`: maximum number of iterations

Return
- `J`: cost after training
- `theta`: trained weight vector
import time
import matplotlib.pyplot as plt

X = np.zeros((len(train_x), 3))
for i in range(len(train_x)):
    X[i, :] = extract_features(train_x[i], freqs)  # fill the feature matrix before tuning
Y = train_y
alpha_values = [1e-9, 1e-10, 1e-11, 1e-12]
num_iters_values = [1000, 5000, 10000, 50000, 1000000]
cost_values = np.empty((len(alpha_values), len(num_iters_values)))
for i, alpha in enumerate(alpha_values):
    for j, num_iters in enumerate(num_iters_values):
        start_time = time.time()
        J, _ = gradientDescent(X, Y, np.zeros((3, 1)), alpha, int(num_iters))
        end_time = time.time()
        time_consume = end_time - start_time
        cost_values[i, j] = J
        print(f'alpha = {alpha}, iter = {num_iters} Calculated, Cost = {J:.4f}, Time elapsed: {time_consume:.2f} sec')
    print('---------------------------------------------')
plt.figure(figsize = (10, 6))
# contourf expects Z with shape (len(y), len(x)), hence the transpose
contour = plt.contourf(np.log10(alpha_values), num_iters_values, cost_values.T, levels = 20, cmap = 'viridis')
plt.colorbar(contour, label = 'Cost (J)')
plt.xlabel('log10(Learning Rate alpha)')
plt.ylabel('Number of Iterations (num_iters)')
plt.title('Cost vs. Learning Rate and Number of Iterations')
plt.show()
alpha = 1e-09, iter = 1000 Calculated, Cost = 0.2773, Time elapsed: 0.63 sec
alpha = 1e-09, iter = 5000 Calculated, Cost = 0.1286, Time elapsed: 2.89 sec
alpha = 1e-09, iter = 10000 Calculated, Cost = 0.1013, Time elapsed: 5.81 sec
alpha = 1e-09, iter = 50000 Calculated, Cost = nan, Time elapsed: 34.93 sec
alpha = 1e-09, iter = 1000000 Calculated, Cost = nan, Time elapsed: 670.09 sec
---------------------------------------------
alpha = 1e-10, iter = 1000 Calculated, Cost = 0.5952, Time elapsed: 0.58 sec
alpha = 1e-10, iter = 5000 Calculated, Cost = 0.3847, Time elapsed: 2.94 sec
alpha = 1e-10, iter = 10000 Calculated, Cost = 0.2773, Time elapsed: 7.85 sec
alpha = 1e-10, iter = 50000 Calculated, Cost = 0.1286, Time elapsed: 33.51 sec
alpha = 1e-10, iter = 1000000 Calculated, Cost = nan, Time elapsed: 700.17 sec
---------------------------------------------
alpha = 1e-11, iter = 1000 Calculated, Cost = 0.6820, Time elapsed: 0.62 sec
alpha = 1e-11, iter = 5000 Calculated, Cost = 0.6404, Time elapsed: 2.93 sec
alpha = 1e-11, iter = 10000 Calculated, Cost = 0.5951, Time elapsed: 7.89 sec
alpha = 1e-11, iter = 50000 Calculated, Cost = 0.3847, Time elapsed: 33.50 sec
alpha = 1e-11, iter = 1000000 Calculated, Cost = 0.1013, Time elapsed: 682.91 sec
---------------------------------------------
alpha = 1e-12, iter = 1000 Calculated, Cost = 0.6920, Time elapsed: 0.57 sec
alpha = 1e-12, iter = 5000 Calculated, Cost = 0.6875, Time elapsed: 3.23 sec
alpha = 1e-12, iter = 10000 Calculated, Cost = 0.6819, Time elapsed: 7.58 sec
alpha = 1e-12, iter = 50000 Calculated, Cost = 0.6404, Time elapsed: 33.33 sec
alpha = 1e-12, iter = 1000000 Calculated, Cost = 0.2772, Time elapsed: 683.42 sec
The tuning runs show a trade-off between cost and elapsed time: the cost keeps dropping as the number of iterations grows, while the runtime grows roughly in proportion to the iteration count. Two configurations reach the lowest cost of 0.1013: `alpha = 1e-09, iter = 10000` (5.81 sec) and `alpha = 1e-11, iter = 1000000` (682.91 sec). Considering the time cost, I picked `alpha = 1e-09, iter = 10000` for the model training below. The `alpha = 1e-12` group is worth a closer look: for `iter < 50000` the cost is still very large because the learning rate is simply too small and gradient descent crawls; only at `iter = 1000000` does the cost come down to an average level.
for i in range(len(train_x)):
    X[i, :] = extract_features(train_x[i], freqs)
J, theta = gradientDescent(X, Y, np.zeros((3, 1)), 1e-9, int(1e4))
t = []
for i in np.squeeze(theta):
    t.append(i)
print(f"The cost after training is {J:.8f}.")
print(f"The resulting vector of weights is {t}")
The cost after training is 0.10133100
The resulting vector of weights is [3e-07, 0.00127474, -0.0011083]
def predict_tweet(tweet, freqs, theta):
    x = extract_features(tweet, freqs)
    y_pred = sigmoid(np.dot(x, theta))
    return y_pred
The validation sentences below were generated with the help of ChatGPT lol
vali_tweet = [
"Another day, another opportunity.",
"Do the right things, do things right.",
"Celebrate the journey, not just the destination.",
"Every sunset is an opportunity to reset.",
"Stars can't shine without darkness.",
"Inhale courage, exhale fear.",
"Radiate kindness like sunshine.",
"Find beauty in the ordinary.",
"Chase your wildest dreams with the heart of a lion.",
"Life is a canvas; make it a masterpiece.",
"Let your soul sparkle.",
"Create your own sunshine.",
"This summer would not be perfect without you." ]
for tweet in vali_tweet:
    print('Stem:', process_tweet(tweet))
    print('%s -> %f' % (tweet, predict_tweet(tweet, freqs, theta)))
    print('\n')
Stem: ['anoth', 'day', 'anoth', 'opportun']
Another day, another opportunity. -> 0.533046
Stem: ['right', 'thing', 'thing', 'right']
Do the right things, do things right. -> 0.508265
Stem: ['celebr', 'journey', 'destin']
Celebrate the journey, not just the destination. -> 0.500568
Stem: ['everi', 'sunset', 'opportun', 'reset']
Every sunset is an opportunity to reset. -> 0.509145
Stem: ['star', 'shine', 'without', 'dark']
Stars can not shine without darkness. -> 0.499932
Stem: ['inhal', 'courag', 'exhal', 'fear']
Inhale courage, exhale fear. -> 0.500083
Stem: ['radiat', 'kind', 'like', 'sunshin']
Radiate kindness like sunshine. -> 0.515079
Stem: ['find', 'beauti', 'ordinari']
Find beauty in the ordinary. -> 0.506431
Stem: ['chase', 'wildest', 'dream', 'heart', 'lion']
Chase your wildest dreams with the heart of a lion. -> 0.496330
Stem: ['life', 'canva', 'make', 'masterpiec']
Life is a canvas; make it a masterpiece. -> 0.501446
Stem: ['let', 'soul', 'sparkl']
Let your soul sparkle. -> 0.514518
Stem: ['creat', 'sunshin']
Create your own sunshine. -> 0.502758
Stem: ['summer', 'would', 'perfect', 'without']
This summer would not be perfect without you. -> 0.509757
def test_logistic_regression(test_x, test_y, freqs, theta):
    y_hat = []
    for tweet in test_x:
        x = extract_features(tweet, freqs)
        y_pred = sigmoid(np.dot(x, theta))
        if y_pred > 0.5:
            y_hat.append(1)
        else:
            y_hat.append(0)
    accuracy = (y_hat == np.squeeze(test_y)).sum()/len(test_x)
    return accuracy
Logistic regression model's accuracy = 0.9950
Unlike Logistic Regression, which classifies by weighting the features (equivalent to drawing a decision boundary between the classes), the Naive Bayes model goes word by word and computes the conditional probability that each word carries positive or negative sentiment.
The Naive Bayes model rests on two main assumptions:
- The position of a word does not matter: the model only records a word's properties (its conditional probabilities), not where it appears in the document.
- The conditional probabilities of the words are independent of each other, so the likelihood of a document factorizes as in eq. 13, $P(d \mid c) = \prod_{i} P(w_i \mid c)$.
Before any prediction is made, the Naive Bayes model gathers the vocabulary into a bag of words and counts how many times each word appears in positive and in negative sentiment tweets. The likelihood of a word given a class is then

$$P(w_i \mid \text{class}) = \frac{\text{freq}(w_i, \text{class})}{N_{\text{class}}}$$

and, with Laplacian smoothing so that unseen words do not get a probability of exactly zero,

$$P(w_i \mid \text{class}) = \frac{\text{freq}(w_i, \text{class}) + 1}{N_{\text{class}} + V}$$

where $N_{\text{class}}$ is the total number of word occurrences in that class and $V$ is the number of unique words in the vocabulary.
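A toy example (with made-up counts) shows why the smoothing matters: without the +1 in the numerator, a word that never occurs in one class would force the whole likelihood product to zero.

```python
# Toy illustration of Laplacian smoothing; all numbers are made up.
freq_pos = 0          # the word never appears in positive tweets
N_pos = 27_000        # total count of word occurrences in positive tweets
V = 9_000             # number of unique words in the vocabulary

p_w_pos = (freq_pos + 1) / (N_pos + V)
print(p_w_pos)        # small but non-zero, so the product / log-sum never collapses to 0
```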
To summarize, the Naive Bayes model can be broken down into roughly four steps:
- Text preprocessing (cleaning, word frequency counting) (see the Data Preprocessing part above)
- Data processing
  - Conditional Probability
  - Laplacian Smoothing
  - Log Likelihood & Log Prior
- Computing and predicting the sentiment
Data Processing
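The `lookup` helper used inside `train_naive_bayes` below is not listed in this post; it is assumed to simply read a (word, label) count out of `freqs`:

```python
def lookup(freqs, word, label):
    """Return how many times (word, label) was counted in freqs
    (minimal sketch of the helper used below; 0 if the pair was never seen)."""
    return freqs.get((word, label), 0)
```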
def train_naive_bayes(freqs, train_x, train_y):
    data = {'word': [], 'positive': [], 'negative': [], 'sentiment': []}
    loglikelihood = {}
    logprior = 0
    # Calculate V, the number of unique words in the vocabulary
    vocab = set([pair[0] for pair in freqs.keys()])
    V = len(vocab)
    # N_pos / N_neg: total count of word occurrences in each class
    N_pos = N_neg = 0
    for pair in freqs.keys():
        if pair[1] > 0:
            N_pos += freqs[pair]
        else:
            N_neg += freqs[pair]
    # Calculate Prior
    D = len(train_y)
    D_pos = len(list(filter(lambda x: x > 0, train_y)))
    D_neg = len(list(filter(lambda x: x <= 0, train_y)))
    logprior = np.log(D_pos) - np.log(D_neg)
    # Calculate Likelihood
    for word in vocab:
        freq_pos = lookup(freqs, word, 1)
        freq_neg = lookup(freqs, word, 0)
        # Laplacian Smoothing
        p_w_pos = (freq_pos + 1) / (N_pos + V)
        p_w_neg = (freq_neg + 1) / (N_neg + V)
        loglikelihood[word] = np.log(p_w_pos) - np.log(p_w_neg)
        if p_w_pos > p_w_neg:
            sentiment = 1
        else:
            sentiment = 0
        data['word'].append(word)
        data['positive'].append(np.log(p_w_pos))
        data['negative'].append(np.log(p_w_neg))
        data['sentiment'].append(sentiment)
    return logprior, loglikelihood, data
LogPrior: 0.0  # zero because the dataset is balanced (equal numbers of positive and negative tweets)
Loglikelihood (excerpt):
{'easili': -0.452940736126882,
'melodi': 0.6456715525412289,
'ohstylesss': 0.6456715525412289,
'steelseri': -0.7406228085786619,
'harsh': -0.7406228085786619,
'weapon': -0.452940736126882,
'maxdjur': -0.7406228085786619,
'thalaivar': 0.6456715525412289,
'theroyalfactor': 0.6456715525412289,
'fought': 0.6456715525412289,
'louisemensch': -0.7406228085786619,
'hayli': 0.6456715525412289}
Compute and Predict Sentiment
Validate again with the ChatGPT-generated tweets:
def naive_bayes_predict(tweet, logprior, loglikelihood):
    word_l = process_tweet(tweet)
    p = 0
    p += logprior
    for word in word_l:
        if word in loglikelihood:
            p += loglikelihood[word]
    return p
Tweets: Another day, another opportunity.
Stem: ['anoth', 'day', 'anoth', 'opportun']
Another day, another opportunity. -> 2.267723
Tweets: Do the right things, do things right.
Stem: ['right', 'thing', 'thing', 'right']
Do the right things, do things right. -> -0.122857
Tweets: Celebrate the journey, not just the destination.
Stem: ['celebr', 'journey', 'destin']
Celebrate the journey, not just the destination. -> -0.324748
Tweets: Every sunset is an opportunity to reset.
Stem: ['everi', 'sunset', 'opportun', 'reset']
Every sunset is an opportunity to reset. -> 2.054798
Tweets: Stars can not shine without darkness.
Stem: ['star', 'shine', 'without', 'dark']
Stars can not shine without darkness. -> 0.572238
Tweets: Inhale courage, exhale fear.
Stem: ['inhal', 'courag', 'exhal', 'fear']
Inhale courage, exhale fear. -> -0.142427
Tweets: Radiate kindness like sunshine.
Stem: ['radiat', 'kind', 'like', 'sunshin']
Radiate kindness like sunshine. -> 1.410585
Tweets: Find beauty in the ordinary.
Stem: ['find', 'beauti', 'ordinari']
Find beauty in the ordinary. -> 1.288319
Tweets: Chase your wildest dreams with the heart of a lion.
Stem: ['chase', 'wildest', 'dream', 'heart', 'lion']
Chase your wildest dreams with the heart of a lion. -> -1.379487
Tweets: Life is a canvas; make it a masterpiece.
Stem: ['life', 'canva', 'make', 'masterpiec']
Life is a canvas; make it a masterpiece. -> 0.917726
Tweets: Let your soul sparkle.
Stem: ['let', 'soul', 'sparkl']
Let your soul sparkle. -> 1.488666
Tweets: Create your own sunshine.
Stem: ['creat', 'sunshin']
Create your own sunshine. -> 1.445494
Tweets: This summer would not be perfect without you.
Stem: ['summer', 'would', 'perfect', 'without']
This summer would not be perfect without you. -> 1.041158
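The accuracy quoted below was presumably computed with a helper analogous to `test_logistic_regression`; a minimal sketch:

```python
# Minimal sketch of a Naive Bayes accuracy check, mirroring
# test_logistic_regression above (not part of the original listing).
def test_naive_bayes(test_x, test_y, logprior, loglikelihood):
    y_hat = []
    for tweet in test_x:
        # positive if the log-prior plus summed log-likelihood ratios is > 0
        y_hat.append(1 if naive_bayes_predict(tweet, logprior, loglikelihood) > 0 else 0)
    accuracy = (np.asarray(y_hat) == np.squeeze(test_y)).mean()
    return accuracy
```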
Naive Bayes model's accuracy = 0.9950
[1] Dan Jurafsky and James H. Martin, Speech and Language Processing (3rd ed. draft), January 7, 2023.
[2] Natural Language Processing with Classification and Vector Spaces, DeepLearning.AI.
[3] Chirag Goyal, "Decoding Generative and Discriminative Models," published July 19, 2021, last modified September 13, 2023.
[4] "Gradient," Wikipedia.