
Commit

Corrected some typos
cif2cif authored Nov 2, 2017
1 parent 904b905 commit d86effb
Showing 1 changed file with 2 additions and 2 deletions.
docs/sources/evaluation.md: 4 changes (2 additions & 2 deletions)
Camacho-Collados, José, Mohammad Taher Pilehvar, and Roberto Navigli. "A Framework for the Construction of Monolingual and Cross-lingual Word Similarity Datasets."

When developing new similarity metrics, proper evaluation is important, yet it is often tedious. Sematch helps to save such effort by providing an evaluation framework in which similarity methods are evaluated on common word similarity datasets and can be compared with each other.

The most established methodology for evaluating the performance of semantic similarity methods on word similarity datasets is to measure the Spearman correlation between the similarity scores generated by the similarity methods and the scores assessed by humans. Note that both Spearman's and Pearson's correlation coefficients have been commonly used in the literature. They are equivalent if the rating scores are ordered, and we use the Spearman correlation coefficient as the default. A similarity method is acknowledged to perform better if it has a higher correlation with human judgements (the closer to 1.0 the better), while it is acknowledged to be unrelated to human assessment if the correlation is 0. Since the Spearman's rank correlation coefficients produced by different similarity methods depend on the human ratings for each dataset, we need to conduct statistical significance tests on two dependent (overlapping) correlations. Thus, Steiger's Z significance test is used to assess the significance of the difference between the dependent correlation coefficients produced by different similarity methods, using a one-tailed hypothesis test on two paired correlations. To illustrate the evaluation process, we demonstrate the evaluation of a novel similarity method, WPath, which we evaluate on several datasets and compare with the Lin method.

```python
from sematch.evaluation import WordSimEvaluation
from sematch.semantic.similarity import WordNetSimilarity

evaluation = WordSimEvaluation()
wns = WordNetSimilarity()
# define a cross-lingual English-Spanish wpath metric
# (the exact definition used in the full example is assumed here)
wpath_en_es = lambda x, y: wns.crossl_word_similarity(x, y, 'eng', 'spa', 'wpath')
# evaluate it on the cross-lingual RG65 English-Spanish dataset
print evaluation.evaluate_metric('wpath_en_es', wpath_en_es, 'rg65_EN-ES')
```
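
To make the statistical test concrete, below is a minimal, illustrative sketch of Steiger's Z test for two dependent (overlapping) correlations, assuming SciPy is available. The helper names `steiger_z` and `compare_methods` are illustrative, not part of the Sematch API.

```python
from math import atanh, sqrt
from scipy.stats import norm, spearmanr

def steiger_z(r12, r13, r23, n):
    """r12, r13: correlation of each method with human ratings;
    r23: correlation between the two methods; n: number of word pairs."""
    z12, z13 = atanh(r12), atanh(r13)
    rm2 = (r12 ** 2 + r13 ** 2) / 2.0
    f = min((1.0 - r23) / (2.0 * (1.0 - rm2)), 1.0)
    h = (1.0 - f * rm2) / (1.0 - rm2)
    z = (z12 - z13) * sqrt((n - 3.0) / (2.0 * (1.0 - r23) * h))
    p = 1.0 - norm.cdf(z)  # one-tailed p-value for the hypothesis r12 > r13
    return z, p

def compare_methods(scores_a, scores_b, human_scores):
    # Spearman correlations of each method with the human ratings and with each other
    r12 = spearmanr(scores_a, human_scores)[0]
    r13 = spearmanr(scores_b, human_scores)[0]
    r23 = spearmanr(scores_a, scores_b)[0]
    return steiger_z(r12, r13, r23, len(human_scores))
```

In practice this comparison is carried out through the Sematch evaluation framework itself; the sketch only shows what the test computes.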

## Category Classification Evaluation

Although the word similarity correlation measure is the standard way to evaluate semantic similarity metrics, it relies on human judgements over word pairs, which may not reflect performance in real applications. Therefore, apart from word similarity evaluation, the Sematch evaluation framework also includes a simple aspect category classification task for Aspect-Based Sentiment Analysis. We use the datasets from the SemEval2015 and SemEval2016 sentence-level Aspect-Based Sentiment Analysis tasks. The original datasets can be found at [Aspect Based Sentiment Analysis 15](http://alt.qcri.org/semeval2015/task5/) and [Aspect Based Sentiment Analysis 16](http://alt.qcri.org/semeval2016/task5/).

To evaluate the model, you first need to define a word similarity measurement function, and then train and evaluate the classification model.
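
As a rough sketch of these two steps (the class and method names below, such as `AspectEvaluation` and `SimClassifier`, are assumed from the Sematch examples and should be verified against the library):

```python
from sematch.evaluation import AspectEvaluation
from sematch.application import SimClassifier
from sematch.semantic.similarity import WordNetSimilarity

# load the SemEval aspect category classification dataset (sentences and labels)
evaluation = AspectEvaluation()
X, y = evaluation.load_dataset()

# step 1: define a word similarity measurement function
wns = WordNetSimilarity()
word_sim = lambda x, y: wns.word_similarity(x, y, 'wpath')

# step 2: train the similarity-based classifier and evaluate it on the dataset
classifier = SimClassifier.train(zip(X, y), word_sim)
evaluation.evaluate(X, y, classifier)
```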

