[Enhancement] [Build-Time V1] Reduce Memory Footprint during Native Index Creation #1600
Comments
Good catch! This seems to have the potential to cut down a large memory footprint. Agreed, let's prioritize this for Faiss. I am assuming this feature will be similar to incremental graph construction in the Lucene engine?
It's a little different, as per my understanding. In Lucene we create graphs incrementally during indexing (while docs are still in the Lucene buffer), but the incremental graph creation above happens during segment creation (merges and refreshes).
@vamshin I have updated the proposal to fix the memory footprint of both nmslib and faiss during indexing. @jmazanec15, can you add the solution for fixing the footprint during the saveIndex call of nmslib?
@navneet1v sure. Right now, nmslib duplicates the memory during indexing to build the optimized index: https://github.com/nmslib/nmslib/blob/v2.1.1/similarity_search/src/method/hnsw.cc#L347-L470. Because we do not search the index in memory at index time (i.e. we create the index, write it to disk, load it from disk, and then search it), we can avoid this duplication by stopping index creation before we make the additional allocations (https://github.com/nmslib/nmslib/blob/v2.1.1/similarity_search/src/method/hnsw.cc#L347-L351) and then creating a new function that writes the non-optimized index to disk in the optimized format. I played around with this some time ago; I will share the code.
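To make the idea concrete, here is a minimal, hypothetical sketch. This is not the code referenced above; `HnswNode` and the record layout are simplified stand-ins for nmslib's internals. The point is to stream each node's optimized record straight to disk instead of first building the optimized layout in RAM alongside the non-optimized graph.

```cpp
#include <cstdint>
#include <cstdio>
#include <vector>

// Simplified stand-in for nmslib's internal (non-optimized) node representation.
struct HnswNode {
    std::vector<float> vector;                   // raw vector data
    std::vector<std::vector<uint32_t>> friends;  // neighbor lists per level
};

// Write each node in the optimized on-disk layout directly, so the
// non-optimized graph remains the only full copy held in memory.
void saveOptimizedWithoutDuplicate(const std::vector<HnswNode>& nodes,
                                   const char* path) {
    std::FILE* out = std::fopen(path, "wb");
    if (!out) return;
    for (const HnswNode& node : nodes) {
        // Level-0 neighbors followed by the vector, mirroring the kind of
        // flat "optimized" record nmslib otherwise builds in RAM first.
        uint32_t n0 = static_cast<uint32_t>(node.friends[0].size());
        std::fwrite(&n0, sizeof(n0), 1, out);
        std::fwrite(node.friends[0].data(), sizeof(uint32_t), n0, out);
        std::fwrite(node.vector.data(), sizeof(float), node.vector.size(), out);
    }
    std::fclose(out);
}
```

The loader can then mmap or read this file back as the optimized index, so peak memory during save stays at one copy of the graph.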
@vamshin, what are your thoughts on the overall solution here?
LGTM! Thanks for the deep dive.
In my understanding, we should be able to reduce the total memory footprint by half. Currently, if the graph takes 1GB, then during indexing we need at least 2GB of off-heap space; with the proposed change we should be pretty close to 1GB. I did a small experiment with the current state, and I can see that we need double the memory compared to what the ideal memory usage should be. With the above proposal we should be close to the Ideal line I have added in this graph. Jemalloc was present but was not actually being used; I will add new experiment results with jemalloc. Thanks to @jmazanec15 for helping set up jemalloc correctly.
Completed the POC code for the Faiss library. Ref: navneet1v@7165079. Here is the general idea:
Next steps: |
So I did some testing with the POC code (navneet1v@7165079). Here are the observations:
Experiment Configuration
Ideal memory usage includes heap + graph + some extra memory that can come from other processes. Raw data: This provides strong evidence that we should build the graphs in a progressive fashion. Further items to be looked at:
Index Directory
The code for setting up the environment can be found here: https://github.com/navneet1v/opensearch-vectorsearch-sample/tree/main/constraint-env-benchmarks cc: @vamshin, @jmazanec15
After doing one more experiment with the MemoryFix version and higher RAM, I can see the same memory consumption as 2.13 and 2.14. A possible explanation is that because memory was available, the operating system kept using it. I also confirmed this by killing the containers, bringing them back up, and doing only search; the memory was < 2GB, which matches the expectation.
I'm wondering if we can identify which part of the memory is from the page cache and which is anonymous RSS. We can get metrics from the metric collector script by finding the PID outside of Docker and checking the proc filesystem. Also, did this set of experiments use jemalloc? I don't see it in https://github.com/navneet1v/opensearch-vectorsearch-sample.
@jmazanec15 it uses jemalloc: https://github.com/navneet1v/opensearch-vectorsearch-sample/blob/main/constraint-env-benchmarks/Dockerfile_Memory_Fix#L3. I think I have configured it correctly.
Yeah, I want to do that. Let me talk to you about this.
So I have updated the code to capture RSS metrics too, here: https://github.com/navneet1v/opensearch-vectorsearch-sample/blob/main/constraint-env-benchmarks/capture_memory_metrics.sh#L42-L43. I will run experiments to give more granular details on RSS.
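For the page-cache vs. anonymous split @jmazanec15 asked about, here is a minimal sketch, assuming Linux 4.5+ where /proc/<pid>/status breaks VmRSS into RssAnon, RssFile, and RssShmem (the PID argument here is whatever the collector script finds outside the container):

```cpp
#include <fstream>
#include <iostream>
#include <string>

int main(int argc, char** argv) {
    // Default to the current process if no PID is supplied.
    std::string pid = argc > 1 ? argv[1] : "self";
    std::ifstream status("/proc/" + pid + "/status");
    std::string line;
    while (std::getline(status, line)) {
        // VmRSS = RssAnon + RssFile + RssShmem; RssFile is the
        // file-backed portion that is resident via the page cache.
        if (line.rfind("VmRSS", 0) == 0 || line.rfind("RssAnon", 0) == 0 ||
            line.rfind("RssFile", 0) == 0 || line.rfind("RssShmem", 0) == 0) {
            std::cout << line << '\n';
        }
    }
    return 0;
}
```

Sampling these fields over time during indexing would show whether the extra memory is anonymous allocations (the graph build) or page cache from the index files.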
With the help of @jmazanec15 I was able to set up jemalloc in the experiments. I am seeing some positive results with the Memory Fix version after that. I will add more details once I complete the experiments.
I completed the experiments and here are the results. Jemalloc helped remove a lot of the fragmentation that was happening earlier. Experiment Configuration
Memory Fix Version
2.14 Version
Raw values: cc: @jmazanec15, @vamshin
@navneet1v exciting results. Did we verify that recall is intact, and what is the observation on the build times with and without the fix?
Yes, the recall was the same for both experiments.
@vamshin, can we plan this for a release?
I want to talk about the memory-reduction scenarios. If we just write the vectors into a file in an IndexFlat-like format in the Java layer, and do not transfer the memory to the JNI layer, we can let Faiss read the memory from the file as needed. This can reduce the memory to even less than 1GB in the write and merge scenarios, like we discussed in #1693.
@luyuncheng yes, that is one possibility, but it assumes that while creating the index we are reading the file partially in an efficient fashion, because during index creation you also need to calculate distances between vectors. And as we were discussing on #1693, reading from a file adds a lot of IOPS; in the case of indexing this can really slow down the whole process.
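For reference, a minimal sketch of what "let Faiss read from the file as needed" can look like with Faiss's existing IO path. The file path is a placeholder; note that IO_FLAG_MMAP memory-maps IVF inverted lists from disk rather than copying them into anonymous memory, and it does not cover HNSW graph structures:

```cpp
#include <faiss/Index.h>
#include <faiss/index_io.h>
#include <memory>

int main() {
    // Pages are faulted in on demand and can be evicted by the OS,
    // so resident memory stays well below the full index size.
    std::unique_ptr<faiss::Index> index(
        faiss::read_index("/tmp/my_ivf.index", faiss::IO_FLAG_MMAP));
    // ... run searches; vector data is read from the file as needed ...
    return 0;
}
```

This helps the search-time footprint; the IOPS concern above applies when distance computations during index construction would have to hit the file repeatedly.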
Sure, we need to take the IOPS into consideration. BTW, I really like this: navneet1v@7165079.
@navneet1v One idea I came across: I am wondering if we can avoid the page cache by passing O_DIRECT for reading/writing the file. O_DIRECT hints to the OS to avoid using the page cache if possible. On read, it prevents an additional copy: instead of disk -> page cache -> anonymous memory, it becomes disk -> anonymous memory. Thus, it may speed up loads as well as reduce memory usage and page cache thrashing. On write, I think it's a little trickier with synchronization, but it would avoid copying to the page cache before writing to disk. Downsides would be:
@fddattal and I had a discussion about this some time ago, I believe. Do you have any thoughts on this?
@jmazanec15 Thanks for looping me into the discussion! I recall from when we discussed this with Kiran that one of the issues with O_DIRECT is that its behavior is actually not standardized. Here's the best article I have found explaining that: https://ext4.wiki.kernel.org/index.php/Clarifying_Direct_IO%27s_Semantics As far as I am aware, the ScyllaDB folks use O_DIRECT. They have a pretty good engineering blog which may be of help to us. At the time we considered implementing it, we didn't due to time constraints. In theory it can outperform typical reads/writes, but it is a bit of a can of worms, as you will need to direct the OS more regarding how to behave.
Thanks @fddattal. I was thinking that we already cache in-memory graphs. So the fact that there is an intermediate stage where we read the file into the page cache and then copy it seems inefficient; if we just remove it, we wouldn't have to do anything else to get the benefit.
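For illustration, a minimal sketch of the O_DIRECT read path discussed above (Linux-specific; the file path is a placeholder, and real code must handle alignment of the buffer, offset, and length, which is part of the non-standardized behavior @fddattal mentions):

```cpp
#ifndef _GNU_SOURCE
#define _GNU_SOURCE  // exposes O_DIRECT on Linux
#endif
#include <fcntl.h>
#include <unistd.h>
#include <cstdio>
#include <cstdlib>

int main() {
    const size_t kAlign = 4096;   // O_DIRECT needs block-aligned buffers
    const size_t kLen = 1 << 20;  // read 1 MiB

    int fd = open("/tmp/graph.bin", O_RDONLY | O_DIRECT);
    if (fd < 0) { std::perror("open"); return 1; }

    // O_DIRECT requires an aligned buffer; posix_memalign provides one.
    void* buf = nullptr;
    if (posix_memalign(&buf, kAlign, kLen) != 0) { close(fd); return 1; }

    // Data flows disk -> buf (anonymous memory), skipping the page cache.
    ssize_t n = read(fd, buf, kLen);
    std::printf("read %zd bytes directly from disk\n", n);

    std::free(buf);
    close(fd);
    return 0;
}
```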
The RFC for the solution has been added here: #1938
Closing this issue, as the feature is merged into the 2.17 branch and will be released in version 2.17 of OpenSearch.
Description
While running some experiments, I am able to see that we need more memory than the final graph which gets created in a segment.
Based on some research, here is what I can see for Faiss:
We give Faiss all the vectors and ids for which the graph needs to be created, here: https://github.com/navneet1v/k-NN/blob/badbb1d3438a37536ce6de29c4085cded0ef2723/jni/src/faiss_wrapper.cpp#L159.
After this, Faiss internally calls add on the index, here: https://github.com/facebookresearch/faiss/blob/8898eabe9f16e7bbebdc19667c993f7dc55a6a0c/faiss/IndexIDMap.cpp#L80
This will internally call the add function of IndexHNSW, which will first store the vectors in a flat index (which is nothing but a binary representation of the vectors) and then start creating the HNSW graph.
So based on the above flow, the vectors which are stored in native memory in step 1 and passed to Faiss are first copied into Faiss's flat index, doubling the native memory. The sketch below illustrates this duplication.
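A minimal sketch of the duplication using Faiss's public API (the dimensions and counts are made up for illustration; exact type names such as faiss::idx_t may vary slightly by Faiss version):

```cpp
#include <faiss/IndexHNSW.h>
#include <faiss/IndexIDMap.h>
#include <vector>

int main() {
    const int d = 128;
    const size_t n = 1000000;

    // Copy 1: the full set of vectors buffered in native memory by the
    // JNI layer (n * d * 4 bytes, ~512 MB for these numbers).
    std::vector<float> vectors(n * d);
    std::vector<faiss::idx_t> ids(n);

    faiss::IndexHNSWFlat hnsw(d, 16);   // 16 = HNSW M parameter
    faiss::IndexIDMap index(&hnsw);

    // Copy 2: add_with_ids stores every vector again in the IndexFlat
    // underlying IndexHNSWFlat before building the graph, so peak native
    // memory is roughly 2x the raw vector data.
    index.add_with_ids(n, vectors.data(), ids.data());
    return 0;
}
```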
For Nmslib:
Issue
Due to this, we need more memory to create a graph than the output Faiss/nmslib index at a segment level. This really throws off the memory estimations while force-merging the segments.
Resolution
With Streaming Vectors from Java to the JNI Layer, we already send a limited number of vectors at a time from Java to the JNI layer; we can use the same approach to build the graph incrementally by not storing the vectors in a separate memory location but directly calling the Faiss add_with_ids API. This ensures that we create the Faiss index for the vectors as they are streamed from Java to the JNI layer, so no extra memory (apart from the streamed vectors) is required. The API is available for HNSW, IVF, IVFPQ and HNSWPQ. A sketch of the incremental flow follows.
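A minimal sketch of the proposed incremental build (real Faiss calls; the batch size is illustrative and the batch-fill step is elided):

```cpp
#include <faiss/IndexHNSW.h>
#include <faiss/IndexIDMap.h>
#include <vector>

int main() {
    const int d = 128;
    const size_t batchSize = 1000;   // vectors streamed per JNI call
    const size_t numBatches = 1000;

    faiss::IndexHNSWFlat hnsw(d, 16);
    faiss::IndexIDMap index(&hnsw);

    std::vector<float> batch(batchSize * d);
    std::vector<faiss::idx_t> ids(batchSize);

    for (size_t b = 0; b < numBatches; ++b) {
        // ... fill `batch` and `ids` from the stream for batch b ...
        // Only one batch lives outside the index at any time, so peak
        // extra memory is one batch rather than the whole dataset.
        index.add_with_ids(batchSize, batch.data(), ids.data());
    }
    return 0;
}
```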
For nmslib, we can start creating the object which is required to build the nmslib index as the vectors are streamed. This should at least avoid an increased memory footprint. @jmazanec15 added the resolution here: #1600 (comment)
Limitation
k-NN also provides nmslib as a KNNEngine, but based on my deep-dive I didn't find a similar feature for nmslib. Overall, I still believe we should implement this feature for the Faiss engine.