diff --git a/docs/en/stack/ml/nlp/images/ml-nlp-elser-average-ndcg.png b/docs/en/stack/ml/nlp/images/ml-nlp-elser-average-ndcg.png
new file mode 100644
index 000000000..d0b07440d
Binary files /dev/null and b/docs/en/stack/ml/nlp/images/ml-nlp-elser-average-ndcg.png differ
diff --git a/docs/en/stack/ml/nlp/images/ml-nlp-elser-ndcg10-beir.png b/docs/en/stack/ml/nlp/images/ml-nlp-elser-ndcg10-beir.png
new file mode 100644
index 000000000..1befc5df5
Binary files /dev/null and b/docs/en/stack/ml/nlp/images/ml-nlp-elser-ndcg10-beir.png differ
diff --git a/docs/en/stack/ml/nlp/ml-nlp-elser.asciidoc b/docs/en/stack/ml/nlp/ml-nlp-elser.asciidoc
index 1b6f94fff..e1b78d513 100644
--- a/docs/en/stack/ml/nlp/ml-nlp-elser.asciidoc
+++ b/docs/en/stack/ml/nlp/ml-nlp-elser.asciidoc
@@ -45,10 +45,18 @@ more allocations or more threads per allocation, which requires bigger ML
 nodes. Autoscaling provides bigger nodes when required. If autoscaling is
 turned off, you must provide suitably sized nodes yourself.
 
+[discrete]
+[[elser-benchmarks]]
+== Benchmarks
+
+The following sections provide information about how ELSER performs on
+different hardware configurations and how its performance compares to {es}
+BM25 and other strong baselines such as SPLADE or OpenAI.
+
 [discrete]
 [[elser-hw-benchamrks]]
-== Hardware benchmarks
+=== Hardware benchmarks
 
 Two data sets were utilized to evaluate the performance of ELSER in different
 hardware configurations: `msmarco-long-light` and `arguana`.
 
@@ -83,6 +91,35 @@ configurations.
 
 |==================================================================================================================================================================================
 
+[discrete]
+[[elser-qualitative-benchmarks]]
+=== Qualitative benchmarks
+
+The metric used to evaluate ELSER's ranking ability is the Normalized
+Discounted Cumulative Gain (NDCG), which can handle multiple relevant
+documents and fine-grained document ratings. The metric is applied to a
+fixed-size list of retrieved documents, in this case the top 10 (NDCG@10).
+
+The table below shows the performance of ELSER compared to {es} BM25 with an
+English analyzer, broken down by the 12 data sets used for the evaluation.
+ELSER has 10 wins, 1 draw, 1 loss, and an average improvement in NDCG@10 of 17%.
+
+image::images/ml-nlp-elser-ndcg10-beir.png[alt="ELSER benchmarks",align="center"]
+_NDCG@10 for BEIR data sets for BM25 and ELSER (higher values are better)_
+
+The following table compares the average performance of ELSER to some other
+strong baselines. The OpenAI results are reported separately because they use
+a different subset of the BEIR suite.
+
+image::images/ml-nlp-elser-average-ndcg.png[alt="ELSER average performance compared to other baselines",align="center"]
+_Average NDCG@10 for BEIR data sets vs. various high-quality baselines_
+_(higher is better). OpenAI chose a different subset; ELSER results on that_
+_subset are reported separately._
+
+To read more about the evaluation details, refer to
+https://www.elastic.co/blog/may-2023-launch-information-retrieval-elasticsearch-ai-model[this blog post].
+
+
 [discrete]
 [[download-deploy-elser]]
 == Download and deploy ELSER
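For reference on the scaling note in the hunk context above ("more allocations
or more threads per allocation"): a minimal sketch of how those two settings
are supplied when starting an ELSER deployment via the Python {es} client. The
endpoint URL, model ID, and sizing values below are illustrative assumptions,
not recommendations.

[source,python]
----
from elasticsearch import Elasticsearch

# Placeholder connection details; point this at your own cluster.
client = Elasticsearch("http://localhost:9200")

# More allocations serve more inference requests in parallel (throughput);
# more threads per allocation speed up each request (latency). Both consume
# ML node capacity, which is why bigger values need bigger ML nodes.
client.ml.start_trained_model_deployment(
    model_id=".elser_model_1",   # assumed ELSER v1 model ID
    number_of_allocations=1,     # illustrative value
    threads_per_allocation=2,    # illustrative value
    wait_for="started",
)
----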
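The qualitative benchmarks rank systems by NDCG@10 without showing the
computation. Below is a minimal sketch of the metric, assuming graded
relevance judgments and the linear-gain DCG formulation commonly used in
BEIR-style evaluation; the relevance grades in the example are made up.

[source,python]
----
import math

def dcg(relevances):
    """Discounted Cumulative Gain: each graded gain is discounted by the
    log2 of its 1-based rank plus one."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances))

def ndcg_at_k(retrieved, judged, k=10):
    """NDCG@k: DCG of the top-k retrieved grades, normalized by the DCG of
    an ideal ordering of all judged grades for the query."""
    ideal = dcg(sorted(judged, reverse=True)[:k])
    return dcg(retrieved[:k]) / ideal if ideal > 0 else 0.0

# Made-up grades for the top 10 documents retrieved for one query (higher is
# more relevant). A perfect ordering of the same grades would score 1.0.
grades = [3, 2, 3, 0, 1, 2, 0, 0, 1, 0]
print(round(ndcg_at_k(grades, grades, k=10), 3))  # < 1.0: ranking is imperfect
----

Per-query NDCG@10 values are then averaged over all queries in a data set,
which is the figure reported in the benchmark tables.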