# coding: utf-8
# # Word Mover's distance
#
# In this notebook we will see an application of Optimal Transport to the problem of computing similarities between sentences and texts. The method under the lens is called 'Word Mover's Distance', in reference to 'Earth Mover's Distance', another name for the Wasserstein-$1$ distance that is mostly used in computer vision.
#
# Traditionally, portions of text are compared with the Cosine similarity between bag-of-words vectors, i.e. histograms of word occurrences in a text. This captures exact word overlap, but two closely related sentences can be orthogonal if they use different words with the same meaning. Such a semantic distance can be obtained by using *word embeddings*, which embed words in a Euclidean space (of potentially large dimension) where the Euclidean distance has a semantic meaning: two related words will be close in such an embedding. A popular embedding is *word2vec*, obtained with neural networks. A study of those mechanisms is beyond the scope of this notebook, but the interested reader can find more information on [the corresponding Wikipedia page](https://en.wikipedia.org/wiki/Word2vec). Throughout the rest of this tutorial, we will use a subset of the [GloVe](https://nlp.stanford.edu/projects/glove/) embedding.
#
# The key observation made by Kusner and colleagues [1] is that, when comparing sentences or documents, the optimal transport distance can be computed between the histograms of occurring words, using a ground metric derived from word embeddings. This way, related words are matched together, and the resulting distance expresses the semantic relatedness of the contents.
#
# [1] Kusner, M., Sun, Y., Kolkin, N., & Weinberger, K. (2015, June). From word embeddings to document distances. In International Conference on Machine Learning (pp. 957-966). http://proceedings.mlr.press/v37/kusnerb15.pdf
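#
# Formally, for two documents represented by normalized word histograms $d$ and $d'$ over embedded words $\{x_i\}$ and $\{x'_j\}$, the WMD is the optimal transport cost
#
# $$\mathrm{WMD}(d, d') = \min_{T \geq 0,\; T\mathbf{1} = d,\; T^\top \mathbf{1} = d'} \sum_{i,j} T_{ij}\, \lVert x_i - x'_j \rVert,$$
#
# i.e. the Wasserstein distance between the two histograms, with the embedding distances as ground cost.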
#
# ## A basic example
#
# We will start by reproducing Figure 1 of the original paper
#
# <img src="https://remi.flamary.com/cours/otml/fig1.png">
# Two sentences are considered: 'Obama speaks to the media in Illinois' and 'The president greets the press in Chicago'. It is clear from this example that the Cosine similarity between their bag-of-words vectors indicates that the two sentences are completely unrelated, since they have no word in common. We start with some imports and define the two sentences as lists of words, with stopwords (which are not relevant for our analysis) removed.
# In[1]:
import os
import numpy as np
import matplotlib.pylab as pl
import ot
s1 = ['Obama','speaks','media','Illinois']
s2 = ['President','greets','press','Chicago']
# We will use a subset of the GloVe word embedding, stored as a dictionary mapping words to embedding vectors, which you can load as follows
# In[2]:
model=dict(np.load('data/model.npz'))
# Then the embedded representation of the sentences can be obtained by
# In[3]:
s1_embed = np.array([model[w] for w in s1])
s2_embed = np.array([model[w] for w in s2])
# Using the multidimensional scaling (MDS) method from scikit-learn, try to visualize the corresponding word embeddings in 2D.
# In[4]:
from sklearn import manifold
C = ot.dist(np.vstack((s1_embed,s2_embed)))
nmds = manifold.MDS(
    n_components=2,
    eps=1e-9,
    dissimilarity="precomputed",
    n_init=1)
npos = nmds.fit_transform(C)
pl.figure(figsize=(6,6))
pl.scatter(npos[:4,0],npos[:4,1],c='r',s=50, edgecolor = 'k')
for i, txt in enumerate(s1):
    pl.annotate(txt, (npos[i,0]-4,npos[i,1]+2),fontsize=20)
pl.scatter(npos[4:,0],npos[4:,1],c='b',s=50, edgecolor = 'k')
for i, txt in enumerate(s2):
    pl.annotate(txt, (npos[i+4,0]-4,npos[i+4,1]+2),fontsize=20)
pl.axis('off')
pl.tight_layout()
pl.show()
# Let's now compute the coupling between those two distributions and visualize the corresponding result
#
# In[5]:
C2= ot.dist(s1_embed,s2_embed)
G=ot.emd(ot.unif(4),ot.unif(4),C2)
pl.figure(figsize=(6,6))
pl.scatter(npos[:4,0],npos[:4,1],c='r',s=50, edgecolor = 'k')
for i, txt in enumerate(s1):
    pl.annotate(txt, (npos[i,0]-4,npos[i,1]+2),fontsize=20)
pl.scatter(npos[4:,0],npos[4:,1],c='b',s=50, edgecolor = 'k')
for i, txt in enumerate(s2):
    pl.annotate(txt, (npos[i+4,0]-4,npos[i+4,1]+2),fontsize=20)
for i in range(G.shape[0]):
    for j in range(G.shape[1]):
        if G[i,j]>1e-5:
            pl.plot([npos[i,0],npos[j+4,0]],[npos[i,1],npos[j+4,1]],'k',alpha=G[i,j]/np.max(G))
pl.title('Word embedding and coupling with OT')
pl.axis('off')
pl.tight_layout()
pl.show()
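# The corresponding transport cost (the WMD value for this ground metric) can also be obtained directly, for instance with ot.emd2. A minimal sketch using the quantities defined above (the name wmd_value is only illustrative):
wmd_value = ot.emd2(ot.unif(4), ot.unif(4), C2)
print('WMD between the two sentences:', wmd_value)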
# ## Sentence similarity
# We will now explore the advantage of this Word Mover's Distance (WMD) in a regression context, where our goal is to estimate the similarity (or relatedness) of two sentences on a scale from 0 to 5 (5 being the most similar). Given a set of sentence pairs with human-annotated relatedness scores, our goal is to predict the relatedness of a new pair of sentences.
#
# We will use the [SICK (Sentences Involving Compositional Knowledge) dataset](http://clic.cimec.unitn.it/composes/sick.html) for this purpose.
#
# We first load it.
# In[6]:
data=np.load('data/data_text.npz')
setA=data['setA']
setB=data['setB']
scores=data['scores']
print (setA[0])
print (setB[0])
print(scores[0])
# We keep only 200 sentence pairs for learning our regression model and use the rest for testing
# In[7]:
n=200
testA=setA[n:]
trainA=setA[:n]
testB=setB[n:]
trainB=setB[:n]
scores_train=scores[:n]
scores_test=scores[n:]
# Using the CountVectorizer from scikit-learn, compute the bag-of-words representations of all the sentences
# In[8]:
from sklearn.feature_extraction.text import CountVectorizer
vect = # TO BE FILLED
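# A possible way to fill this in (a sketch, not necessarily the intended solution; the name vect_sketch is only illustrative): fit the vectorizer on all sentences of the dataset so that its vocabulary covers every word we will need, then use it as `vect`.
vect_sketch = CountVectorizer()
vect_sketch.fit(np.concatenate((setA, setB)))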
# Build a matrix containing the word embeddings of all the words present in the dataset
#
# In[9]:
all_feat = # TO BE FILLED
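# A possible sketch, assuming the vectorizer was fitted on the whole dataset as above and that every vocabulary word has an entry in the GloVe subset `model` (missing or differently-cased words would need special handling); the names are only illustrative:
vocab = vect_sketch.get_feature_names_out()  # on older scikit-learn versions: get_feature_names()
all_feat_sketch = np.array([model[w] for w in vocab])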
# Compute the matrix of all pairwise distances between these word embeddings using the dist() function of POT
# In[10]:
D = ot.dist(all_feat)
# Now you can write code that computes the Cosine and WMD dissimilarities for all the pairs of the training set (a possible sketch for the two dissimilarities is given after the loop below)
# In[11]:
X_cos=[]
X_wmd=[]
Y=[]
for i in range(len(trainA)):
    s1 = vect.transform([trainA[i]]).toarray().ravel()
    s2 = vect.transform([trainB[i]]).toarray().ravel()
    # Cosine similarity between bags of words
    d_cos = # TO BE FILLED
    X_cos.append(d_cos)
    # WMD
    d_wmd = # TO BE FILLED
    X_wmd.append(d_wmd)
    Y.append(scores_train[i])
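# A possible way to fill in the two dissimilarities above (a sketch under the assumptions made so far: `vect` has been fitted on the whole dataset and `D` is indexed consistently with its vocabulary; the helper names are only illustrative):
def cosine_bow(u, v):
    # Cosine similarity between two bag-of-words count vectors.
    return u.dot(v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12)

def wmd(u, v, D):
    # OT cost between the normalized word histograms of the two sentences,
    # with the ground cost D restricted to the words that actually occur.
    iu, iv = np.nonzero(u)[0], np.nonzero(v)[0]
    a = u[iu] / u[iu].sum()
    b = v[iv] / v[iv].sum()
    return ot.emd2(a, b, D[np.ix_(iu, iv)])
# Inside the loop one could then set, e.g., d_cos = cosine_bow(s1, s2) and d_wmd = wmd(s1, s2, D).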
# Visualize the dissimilarities against the gold-standard similarity scores of the training set. This gives a first appreciation of how much better WMD captures this similarity.
# In[12]:
pl.figure()
pl.scatter(X_cos,Y)
pl.title('Cosine Similarity VS golden score')
pl.show()
pl.figure()
pl.scatter(X_wmd,Y)
pl.title('WMD Similarity VS golden score')
pl.show()
# You can now learn a simple regression model between those two quantities. Use a polynomial of degree 2 for the regression model.
# In[13]:
import numpy.polynomial.polynomial as poly
k_cos = # TO BE FILLED
k_wmd = # TO BE FILLED
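# A possible sketch using numpy's polynomial module: polyfit returns the coefficients of a degree-2 polynomial mapping each dissimilarity to the gold score (the _sketch names are only illustrative).
k_cos_sketch = poly.polyfit(X_cos, Y, 2)
k_wmd_sketch = poly.polyfit(X_wmd, Y, 2)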
# Now use your regression model to compute the estimated relatedness for all the pairs of the test set.
# In[14]:
X_cos=[]
X_wmd=[]
Y_test=[]
for i in range(len(testA)):
    s1 = vect.transform([testA[i]]).toarray().ravel()
    s2 = vect.transform([testB[i]]).toarray().ravel()
    # Cosine similarity between bags of words
    d_cos = # TO BE FILLED
    X_cos.append(d_cos)
    # WMD
    d_wmd = # TO BE FILLED
    X_wmd.append(d_wmd)
    Y_test.append(scores_test[i])
# Final regression scores
Y_cos = # TO BE FILLED
Y_wmd = # TO BE FILLED
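# The two placeholders in the loop above can be filled exactly as for the training set (for instance with the cosine_bow and wmd sketches), and the predictions can then be obtained with polyval, e.g. (illustrative names):
Y_cos_sketch = poly.polyval(X_cos, k_cos)
Y_wmd_sketch = poly.polyval(X_wmd, k_wmd)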
# We will use the MSE, Spearman's rho and the Pearson correlation coefficient to measure the quality of our regression model
# In[15]:
from sklearn.metrics import mean_squared_error as mse
from scipy.stats import pearsonr
from scipy.stats import spearmanr
# Estimate the quality of your regression model for both Cosine and WMD dissimilarities
# In[16]:
print('-------- Cosine')
pr = pearsonr(Y_cos, Y_test)[0]
sr = spearmanr(Y_cos,Y_test)[0]
se = mse(Y_cos,Y_test)
print('Pearson (test): ' + str(pr))
print('Spearman (test): ' + str(sr))
print('MSE (test): ' + str(se))
print('-------- WMD')
pr = pearsonr(Y_wmd, Y_test)[0]
sr = spearmanr(Y_wmd,Y_test)[0]
se = mse(Y_wmd,Y_test)
print('Pearson (test): ' + str(pr))
print('Spearman (test): ' + str(sr))
print('MSE (test): ' + str(se))
# Not bad, isn't it?