-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Main keystore #2
base: main
Are you sure you want to change the base?
Conversation
* int values. The internal representations may have collisions. Example transformations include a modulo | ||
* or -abs(value), or some combination. | ||
*/ | ||
public interface IntKeyLookupStore { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You don't need to define interface specifically for int.
You can use Generic instead and define KeyLookupStore, and accordingly parameterize the methods.
Then IntKeyLookupStore can be one of the implementation.
* Check if the implementing class supports safe removals. If it doesn't, remove() will always return false. | ||
* @return true if the class supports safe removals, false if it doesn't. | ||
*/ | ||
boolean supportsRemoval(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we need this in the first place? We ideally should not expose any implementation which can't support removals as it is a basic/mandatory requirement.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was thinking that because of collisions either in the hash function or the modulo, removing values might also remove the keys which collided with those values, and we don't want to have false negatives because it might mean we have to recompute the query. In the RemovableIntKeyLookupStore we have a set of values which have had collisions which lets us safely remove un-collided values, which should be most of them. I also added a forceRemove function so you can get rid of any value for either class, at the cost of no longer guaranteeing no false negatives.
* Check if the object currently guarantees having no false negatives when running contains(). | ||
* @return false if there will not be false negatives, true if there could be false negatives. | ||
*/ | ||
boolean canHaveFalseNegatives(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Again, I think we should disallow false negatives at all cost? As it should be a basic requirement, and can be costly depending on the use case(like computing query again though it was present in cache)
* If it uses a RBM without a modulo or doesn't use an RBM, returns 0. | ||
* @return The modulo. | ||
*/ | ||
int getModulo(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm don't think we need to expose this. If we use modulo 2^22 for example, the concrete class can itself can be called RoaringBitMapModule22 or something like that. I don't see a use case for this function.
/** | ||
* Returns whether the store is at memory capacity | ||
*/ | ||
boolean getIsAtCapacity(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rename to isAtCapacity() ?
* Returns true if the transformation involves taking -abs(), simplifying int[] access and sorting | ||
* @return Whether transformed values are always negative. | ||
*/ | ||
boolean isUsingNegativeOnly(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Lets keep the interface simple for now and remove this if not a strong requirement.
|
||
/** | ||
* Returns an estimate of the store's memory usage. | ||
* @return The memory usage, in MB |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use bytes instead of MB. And long instead of double?
Also explicitly name function like getMemorySizeInBytes() to be clear.
* int values. The internal representations may have collisions. Example transformations include a modulo | ||
* or -abs(value), or some combination. | ||
*/ | ||
public interface IntKeyLookupStore { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Lets also add a clear() function to clear all the keys.
… unneeded functions, added clear(), changed MB to bytes, misc renamings)
…for an RBM-only store
…s track of collided values, see comment for justification for code reuse
* Returns the current internal data structure. | ||
* @return A string representing the currently used internal data structure. | ||
*/ | ||
String getCurrentStructure() throws Exception; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure about it's use case, but if we want to have it, create a enum for it instead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It was mostly only useful in testing, but you're right, we can just have it return the enum it uses internally instead of a string.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed it from the interface, just kept it in the hybrid class and changed it to return an enum. The RBM only class doesn't need it
* Check if the implementing class supports safe removals. If it doesn't, remove() will always return false. | ||
* @return true if the class supports safe removals, false if it doesn't. | ||
*/ | ||
boolean supportsRemoval(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can go ahead with just Removable implementation, so I guess this will not be needed as such.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed, the additional overhead for the Removable implementation is very small so it's worth it
* Check if the object currently guarantees having no false negatives when running contains(). | ||
* @return false if there will not be false negatives, true if there could be false negatives. | ||
*/ | ||
boolean canHaveFalseNegatives(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Again, lets not have a implementation which can lead to false negatives.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we get rid of forceRemove()? I think if we keep it, it'd still be useful to have a function keeping track of whether it's been called. But if we don't keep it no need for this function
* Returns the number of times add() has been run, including unsuccessful attempts. | ||
* @return The number of adding attempts. | ||
*/ | ||
int getNumAddAttempts(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rename it to getTotalAdds() ?
* Returns the number of times add() has returned false due to a collision. | ||
* @return The number of collisions. | ||
*/ | ||
int getNumCollisions(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rename it to getCollisions(), as it is a int anyways, so "Num" looks redundant.
return false; | ||
} | ||
int transformedValue = transform(value); | ||
boolean alreadyContained = contains(transformedValue); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems wrong. We are passing transformedValue to contains, which internally again transforms it further. Shouldn't we be just passing contains(value) instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also considering this will mostly return false than true, as collisions would be less frequent, better to just call rbm.add() directly and avoid extra seek?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're right it's redundant, although transform() should be idempotent. I'll change that. But I think we need to keep the call to contains(), otherwise we won't know if there is a collision, and then we can't maintain our set of values which have had collisions, which we need to safely remove entries.
protected int size; | ||
protected long memSizeCapInBytes; | ||
protected int numAddAttempts; | ||
protected int numCollisions; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As mentioned elsewhere, lets move this to separate KeyStoreStats class.
Also use CounterMetric or LongAdder to avoid concurrency related issues.
return true; | ||
} | ||
handleCollisions(transformedValue); | ||
return false; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess we can make this function add as void itself, as client calling add() might not find this value useful.
|
||
@Override | ||
public String getCurrentStructure() throws Exception { | ||
return "RBM"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As mentioned earlier, define an enum for this if needed.
import org.roaringbitmap.RoaringBitmap; | ||
import java.util.HashSet; | ||
|
||
public class RemovableRBMIntKeyLookupStore extends RBMIntKeyLookupStore implements KeyLookupStore<Integer> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As discussed, lets only keep this implementation and avoid false negatives at all cost.. And rename it to RBMIntKeyLookupStore.
…earch into main-keystore
…ch is renamed as RBMIntKeyLookupStore
Description
This change adds an RBM implementation of KeyLookupStore, designed to store integer hashcodes of keys for a future disk cache in a memory-efficient way. It stores values in the RBM with a modulo, as very sparse RBMs are not memory-efficient. This comes with a tradeoff of some rare collisions (~0.3% of values for the recommended modulo 2^29 in a store with 10^7 values). To handle this, we also maintain a hash set of values which have had collisions. Values with no collisions can be safely removed without risking any false negatives in contains().
The implementation is generic and can be used in other places as well.
Testing
The implementing class was unit tested. The tests include basic functionality for all the different underlying data structures that the classes use, ability to set a memory cap to prevent the store from growing too large, and making sure the implementations are thread-safe. See RBMIntKeyLookupStoreTests.java for these tests.
Related Issues
N/A, part of larger tiered caching feature.
Check List
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.