
ArcticEmbedLEncoder #2694

Merged: 15 commits from upstream-master into castorini:master, Feb 1, 2025
Conversation

vincent-4 (Member) commented Jan 26, 2025

Starting PR for @b8zhong
Mostly looking for a sanity check of the initial implementation of src/main/java/io/anserini/encoder/dense/ArcticEmbedLEncoder.java.

  • Bindings generally follow the Splade pattern

Not implemented yet:

  • Component test for output verification
  • L2-norm check
  • End-to-end test with HNSW retrieval
  • L1 closeness check between cached and ONNX model outputs (see the sketch below)

Q: Am I good to convert Arctic-embed-l to ONNX and then run it with src/main/java/io/anserini/encoder/dense/ArcticEmbedLEncoder.java now?
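
As a point of reference for the L1/L2 checks listed above, here is a minimal sketch of the kind of cached-vs-ONNX closeness comparison that could be added; the class name and placeholder vectors are illustrative and not part of this PR:

// Illustrative sketch only: the kind of cached-vs-ONNX closeness check listed above.
// The vector sources are placeholders; a real test would load an actual cached query
// embedding and encode the same query with ArcticEmbedLEncoder.
public class EmbeddingClosenessSketch {
  // L2 distance between two vectors of equal length; should be near 0 if the encoders agree.
  static double l2Distance(float[] a, float[] b) {
    double sum = 0.0;
    for (int i = 0; i < a.length; i++) {
      double d = a[i] - b[i];
      sum += d * d;
    }
    return Math.sqrt(sum);
  }

  // L1 distance (sum of absolute differences), matching the "L1 check" above.
  static double l1Distance(float[] a, float[] b) {
    double sum = 0.0;
    for (int i = 0; i < a.length; i++) {
      sum += Math.abs(a[i] - b[i]);
    }
    return sum;
  }

  public static void main(String[] args) {
    float[] cached = new float[1024];  // placeholder: cached query embedding
    float[] onnx = new float[1024];    // placeholder: ArcticEmbedLEncoder output
    java.util.Arrays.fill(cached, 0.03125f);
    java.util.Arrays.fill(onnx, 0.03125f);
    System.out.printf("L1: %.6f  L2: %.6f%n", l1Distance(cached, onnx), l2Distance(cached, onnx));
  }
}

In a real test the two vectors would come from the cached encoding and from ArcticEmbedLEncoder run on the same query, and both distances should be close to zero.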

vincent-4 changed the title from "Upstream master" to "ArcticEmbedLEncoder" on Jan 30, 2025
vincent-4 (Member, Author) commented Jan 31, 2025

Here is the full command used, for posterity:

SHARDS=(00 01 02 03 04 05 06 07 08 09)
for shard in "${SHARDS[@]}"; do
    bin/run.sh io.anserini.search.SearchHnswDenseVectors \
        -index msmarco-v2.1-doc-segmented-shard${shard}.arctic-embed-l.hnsw-int8 \
        -topics tools/topics-and-qrels/topics.rag24.test.txt \
        -topicReader TsvString \
        -topicField title \
        -encoder ArcticEmbedL \
        -output runs/run.rag24.test.arctic-l-msv2.1.shard${shard}.txt \
        -hits 250 \
        -threads 32 \
        -efSearch 1000 \
        > logs/log.run.rag24.test.arctic-l-msv2.1.shard${shard}.txt 2>&1
done

Technically, the end-to-end run is done, pending the L2-norm (or similar) verification for closeness.
vincent-4 (Member, Author) commented Jan 31, 2025

Results using [2cc3e6f](https://github.com/castorini/anserini/pull/2694/commits/2cc3e6f3ee1887b6a1a35d84445e8e3fef98bb52)
# cat the non-cached (ONNX) per-shard runs
$ cat runs/run.rag24.test.arctic-l-msv2.1.shard0* > runs/run.rag24.test.arctic-l-msv2.1.txt

# cat the cached per-shard runs
$ cat runs/run.rag24.test.CACHEDarctic-l-msv2.1.shard0* > runs/run.rag24.test.CACHEDarctic-l-msv2.1.txt

# eval: non-cached (ONNX)
$ tools/eval/trec_eval.9.0.4/trec_eval -c -m ndcg_cut.10 \
    tools/topics-and-qrels/qrels.rag24.test-umbrela-all.txt \
    runs/run.rag24.test.arctic-l-msv2.1.txt
ndcg_cut_10             all     0.5162

# eval: cached
$ tools/eval/trec_eval.9.0.4/trec_eval -c -m ndcg_cut.10 \
    tools/topics-and-qrels/qrels.rag24.test-umbrela-all.txt \
    runs/run.rag24.test.CACHEDarctic-l-msv2.1.txt
ndcg_cut_10             all     0.5603

# Summary of Results:
# Non-cached NDCG@10: 0.5162
# Cached NDCG@10:     0.5603

The mean pooling strategy appears to be causing the drop: a 0.0441 gap in nDCG@10 between the cached and non-cached runs, which is clearly too large.

vincent-4 (Member, Author) commented:

(base) v4zhong@orca:/store/scratch/v4zhong/github/anserini$ SHARDS=(00 01 02 03 04 05 06 07 08 09)
for shard in "${SHARDS[@]}"; do
    bin/run.sh io.anserini.search.SearchHnswDenseVectors \
        -index msmarco-v2.1-doc-segmented-shard${shard}.arctic-embed-l.hnsw-int8 \
        -topics tools/topics-and-qrels/topics.rag24.test.txt \
        -topicReader TsvString \
        -topicField title \
        -encoder ArcticEmbedL \
        -output runs/run.rag24.test.arctic-l-msv2.1.shard${shard}.txt \
        -hits 250 \
        -threads 32 \
        -efSearch 1000 \
        > logs/log.run.rag24.test.arctic-l-msv2.1.shard${shard}.txt 2>&1
done
(base) v4zhong@orca:/store/scratch/v4zhong/github/anserini$ cat runs/run.rag24.test.arctic-l-msv2.1.shard0* > runs/run.rag24.test.arctic-l-msv2.1.txt
(base) v4zhong@orca:/store/scratch/v4zhong/github/anserini$ tools/eval/trec_eval.9.0.4/trec_eval -c -m ndcg_cut.10 \
    tools/topics-and-qrels/qrels.rag24.test-umbrela-all.txt \
    runs/run.rag24.test.arctic-l-msv2.1.txt
ndcg_cut_10             all     0.5603
(base) v4zhong@orca:/store/scratch/v4zhong/github/anserini$ 

After aligning the pooling with how Snowflake does it, the ONNX results match the cached run; results attached above.
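
For reference, a minimal sketch of the two pooling strategies being compared, assuming the ONNX model's last hidden state has shape [seqLen][hidden] for a single query; the Snowflake arctic-embed models are understood to use CLS (first-token) pooling followed by L2 normalization, while mean pooling averages over all tokens (a faithful version would also mask padding). This is illustrative, not the actual ArcticEmbedLEncoder code:

// Illustrative pooling sketch, not the actual ArcticEmbedLEncoder implementation.
public class PoolingSketch {
  // CLS pooling: take the first token's vector, then L2-normalize.
  static float[] clsPool(float[][] lastHiddenState) {
    return l2Normalize(lastHiddenState[0].clone());
  }

  // Mean pooling: average over all token vectors, then L2-normalize.
  // (A faithful version would weight by the attention mask to skip padding tokens.)
  static float[] meanPool(float[][] lastHiddenState) {
    int hidden = lastHiddenState[0].length;
    float[] out = new float[hidden];
    for (float[] token : lastHiddenState) {
      for (int i = 0; i < hidden; i++) {
        out[i] += token[i];
      }
    }
    for (int i = 0; i < hidden; i++) {
      out[i] /= lastHiddenState.length;
    }
    return l2Normalize(out);
  }

  // In-place L2 normalization so embeddings are comparable by inner product.
  static float[] l2Normalize(float[] v) {
    double norm = 0.0;
    for (float x : v) {
      norm += x * x;
    }
    norm = Math.sqrt(norm);
    for (int i = 0; i < v.length; i++) {
      v[i] /= (float) norm;
    }
    return v;
  }

  public static void main(String[] args) {
    // Tiny dummy hidden state (3 tokens, hidden size 4) to show the two outputs differ.
    float[][] hidden = { {1f, 0f, 0f, 0f}, {0f, 1f, 0f, 0f}, {0f, 0f, 1f, 0f} };
    System.out.println(java.util.Arrays.toString(clsPool(hidden)));
    System.out.println(java.util.Arrays.toString(meanPool(hidden)));
  }
}

If the cached vectors were produced with CLS-style pooling, encoding queries with mean pooling would explain the gap between the two runs above.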

static private final String MODEL_URL = "https://huggingface.co/Snowflake/snowflake-arctic-embed-l/blob/main/onnx/model.onnx";
static private final String VOCAB_URL = "https://huggingface.co/Snowflake/snowflake-arctic-embed-l/blob/main/vocab.txt";

static private final String MODEL_NAME = "snowflake-arctic-embed-l-official.onnx";
vincent-4 (Member, Author) commented on this code:

Left the official URLs in for now, just because I haven't updated the Hugging Face link to point to the UW server yet.

* ArcticEmbedL
*/
public class ArcticEmbedLEncoder extends DenseEncoder {
static private final String MODEL_URL = "https://huggingface.co/Snowflake/snowflake-arctic-embed-l/blob/main/onnx/model.onnx";
vincent-4 (Member, Author) commented on this code:

And of course both URLs will have to change at some point.

vincent-4 marked this pull request as ready for review, January 31, 2025 21:03
vincent-4 (Member, Author) commented:

Adding tokenization and inference tests

static private final int EMBEDDING_DIM = 1024;

static private final Object[][] EXAMPLES = new Object[][] {
{ new long[] { 101, 5050, 2023, 6251, 2005, 6575, 7882, 13768, 1024, 2339,
Reviewer (Member) commented:

nit: could this be made wider to take up fewer lines?

vincent-4 (Member, Author) replied:

Done!

lintool self-requested a review, February 1, 2025 13:37
lintool (Member) commented Feb 1, 2025

@vincent-4 we're good to merge?

vincent-4 (Member, Author) replied:

> @vincent-4 we're good to merge?

Yes. Thanks!

lintool merged commit 8f70672 into castorini:master, Feb 1, 2025
1 check passed
b8zhong deleted the upstream-master branch, February 1, 2025 15:37