This repository has been archived by the owner on Apr 20, 2020. It is now read-only.

Explain how sample vectors should be selected in README #2

Open
henrywallace opened this issue Jun 20, 2018 · 3 comments

Comments

@henrywallace

I see that they're initialized from Twitter images in the demo, but in scratch I see the use of random projections: https://github.com/alexklibisz/elastik-nearest-neighbors/blob/025f7291512a5f5d56fd34d1cf1c10efc7c83df7/scratch/es-lsh-glove/glove_lsh_es_index.py.

How can I learn further?

@alexklibisz
Owner

Hi @henrywallace

For all of my demos and experiments I took a random sample, somewhat biased toward the first vectors. For example, the benchmarks have a parameter that sets the probability of selecting each vector as a sample while iterating over the GloVe vectors. If you set it to 0.3, it picks roughly 3 of every 10 vectors as sample vectors until it has collected enough for the sample.
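
A minimal sketch of that selection loop, for illustration only. The names `sample_vectors`, `sample_size`, and `sample_prob` are assumptions and are not taken from the benchmark script; the sketch just shows why the result leans toward the earliest vectors: iteration stops as soon as the sample is full.

```python
import random


def sample_vectors(vectors, sample_size, sample_prob, seed=None):
    """Pick up to `sample_size` items from `vectors`, keeping each one
    with probability `sample_prob` while iterating in order.

    Hypothetical helper: names and signature are illustrative.
    """
    rng = random.Random(seed)
    sample = []
    for vec in vectors:
        # Stop early once enough samples are collected; this is what
        # biases the sample toward the beginning of the input.
        if len(sample) >= sample_size:
            break
        if rng.random() < sample_prob:
            sample.append(vec)
    return sample
```

With `sample_prob=1.0` this simply takes the first `sample_size` vectors; lower values spread the sample further into the input, at the cost of reading more of it before the sample fills up.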

@alexklibisz
Owner

In general, the best place to see full usage of the plugin in its current state is the benchmarking script: https://github.com/alexklibisz/elastik-nearest-neighbors/blob/master/elasticsearch-aknn/benchmark/aknn.py

It uses GloVe vectors, which are just text files: https://nlp.stanford.edu/projects/glove/
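
As a rough sketch of what reading those text files looks like, assuming the usual GloVe layout of one token per line followed by space-separated float components. The helper name `iter_glove` is hypothetical and is not part of the benchmark script.

```python
import io


def iter_glove(lines):
    """Yield (token, vector) pairs from an iterable of GloVe-style lines.

    Assumes each line is: token, then space-separated float components.
    Hypothetical helper for illustration.
    """
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        yield parts[0], [float(x) for x in parts[1:]]


# Usage with an in-memory stand-in for a GloVe file:
demo = io.StringIO("the 0.1 -0.2 0.3\ncat 1.0 2.0 3.0\n")
pairs = list(iter_glove(demo))
```

Because the function accepts any iterable of lines, an open file handle can be passed in directly, so the vectors are streamed rather than loaded all at once.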

Here is where the sampling happens: https://github.com/alexklibisz/elastik-nearest-neighbors/blob/master/elasticsearch-aknn/benchmark/aknn.py#L79-L84

For what it's worth, I never saw a meaningful difference between sampling strategies or sample seeds. There would, however, be a meaningful difference if you started populating many vectors that are very different from the sampled vectors used to build the LSH model, e.g. if you sampled exclusively from images of cats and then started populating vectors for images of fine artwork or some other completely different domain.
