Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Known limitations of KDTree? #44

Open
robert-dodier opened this issue Dec 29, 2018 · 6 comments
Open

Known limitations of KDTree? #44

robert-dodier opened this issue Dec 29, 2018 · 6 comments

Comments

@robert-dodier
Copy link

robert-dodier commented Dec 29, 2018

Hi, I'm working with about 13,000 points in 34 dimensions. I'm loading the data from a file and then creating a KDTree via KDTree(values: a) where a is my array of 34-element vectors (instances of a class which I created).

Is there any reason to think that shouldn't work? I am getting the following after reading the data and then trying to construct the KDTree:

Execution interrupted. Enter code to recover and continue.
Enter LLDB commands to investigate (type :help for assistance.)
Process 26088 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=2, address=0x7ffeef3ffff8)
    frame #0: 0x00007fff5562dcf8 libcorecrypto.dylib`ccaes_vng_ctr_crypt + 4
libcorecrypto.dylib`ccaes_vng_ctr_crypt:
->  0x7fff5562dcf8 <+4>:  pushq  %r15
    0x7fff5562dcfa <+6>:  pushq  %r14
    0x7fff5562dcfc <+8>:  pushq  %r13
    0x7fff5562dcfe <+10>: pushq  %r12
Target 0: (repl_swift) stopped.

It's not clear to me what's going on -- I assume that it's an out of memory error? or something else? (How would I figure out what kind of error it is?)

What is the most expensive part of constructing a KDTree -- where is the bottleneck? Does KDTree create any auxiliary indices or other data structures which might be large?

Thanks in advance for any insights you can offer.

Robert Dodier

PS. I am working in the Swift repl with Swift 5.0 (current snapshot build). I disabled printing out the values of items declared via :set set print-decls false (so it isn't the case that the KDTree gets constructed and then the repl fails while trying to print a representation of it).

@Bersaelor
Copy link
Owner

Bersaelor commented Dec 29, 2018

Mhmm, I haven't seen this error before. Unfortunately it is a rather cryptic one and KDTree doesn't even have a dependency on libcorecrypto.
Is it possible that REPL fails due to heavy load in KDTree code and then the error occurs during some regular execution in libcrypto?

That said there are the LoadTest and PerformanceTest files where we routinely run the code on test sets of 500000 points to make sure it's still working.

Could you try the same thing you are doing in Swift4 ? Maybe what you discovered is a general bug in Swift 5 , since it's not stable yet?

@robert-dodier
Copy link
Author

Hi, thanks for your reply. Looks like the error is not in the crypto library, that was just incidental. The problem appears to be that the program runs out of memory when constructing the KDTree instance.

I ran my example program in Xcode 10.0 + Swift 4.2 and after loading the data (about 14,000 items with 34 dimensions each) the program was using 60 MB. After entering KDTree.init, memory use increased greatly, and Xcode stopped the program when it reach about 1000 MB. At the moment it stopped, KDTree.init was in partitionLomuto, at the place where randomIndex is created -- the random number generator is apparently in libcrypto, so that's connection with the crypto stuff. Of course that's just incidental. It appears that the real problem is that building the KDTree requires a very large amount of memory.

I wonder, what is known about how memory use varies with the number of cases and the number of dimensions in each case? Is there a strategy to limit the amount of memory used? (E.g. limiting the depth of the tree or something like that.)

@Bersaelor
Copy link
Owner

That is very interesting indeed @robert-dodier , I haven't tested it that much with dim > 10 thus far.

Would you be able to distill your example into a small test-case, that we could add as a new test file to the test suite? Then we can play around with the memory usage?

@hsnetzer implemented the partitionLomuto, maybe we can look at the code to see whether there are cases where we could nudge ARC to release some intermediate variables earlier?

Also, @robert-dodier , could you run a test with the algorithm the way it was before we merged #38 . That PR by @hsnetzer improved performance. In my experience CPU usage and memory usage optimizations are often tradeoffs between each other.

@hsnetzer
Copy link
Contributor

hsnetzer commented Jan 1, 2019

Certain data sets cause very deep recursion. Maybe you have lots of duplicate points. If that’s the problem then the earlier sorting algorithm should also fail. All I can think of at the moment. Sorry this is happening in my code.

@robert-dodier
Copy link
Author

Hi, thanks for your replies. I will try to post a condensed version of the problem using some made-up data. The data are a little funny in the sense that for many vectors, there are a number of elements in the vector which are equal to 0 -- it would be interesting to know if that changes things. Or perhaps it's just the number of dimensions (i.e., 34). It might be a couple of days before I can return to this topic.

@robert-dodier
Copy link
Author

I've attached 3 files that might help with testing the KDTree code to see what is going on. It seems like there are some different possibilities: maybe the number of cases + dimensions is the problem? maybe the large number of identical (zero) values? maybe the small number of distinct nonzero values? I've created these data sets to embody these different characteristics.

Unfortunately I can't devote more time to these questions right now, I might be able to come back to it next month. Perhaps this is useful in some way all the same.

generated_different.csv: same number of cases, same number of dimensions, same min and max values, random values between min and max

generated_similar.csv: same number of cases, same number of dimensions, same min and max values, same proportion of 0 values, random values between min and max

generated_very_similar.csv: same number of cases, same number of dimensions, same min and max values, same proportion of 0 values, random values chosen from the distinct values of each dimension

generated_different.csv.gz
generated_similar.csv.gz
generated_very_similar.csv.gz

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants