inbloom.error: internal initialization failed -- Parameter bounds? #4

Open
jbrockmendel opened this issue Sep 8, 2015 · 4 comments

Comments

@jbrockmendel

Initializing large filters fails with memory errors (see below). The most likely explanation is that the filter is not intended to be used with these parameters. If so, what limits apply in practice?

>>> import inbloom
>>> bf = inbloom.Filter(10**8, 10**-4)
>>> bf = inbloom.Filter(10**8, 10**-5)
Python(83086,0x7fff7288e310) malloc: *** mach_vm_map(size=18446744073441116160) failed (error code=3)
*** error: can't allocate region
*** set a breakpoint in malloc_error_break to debug
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
inbloom.error: internal initialization failed
>>> 
>>> bf = inbloom.Filter(10**8, 10**-4)
>>> bf = inbloom.Filter(10**9, 10**-4)
Python(83086,0x7fff7288e310) malloc: *** mach_vm_map(size=18446744073441116160) failed (error code=3)
*** error: can't allocate region
*** set a breakpoint in malloc_error_break to debug
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
inbloom.error: internal initialization failed
>>> 
>>> bf = inbloom.Filter(10**9, 10**-3)
Python(83086,0x7fff7288e310) malloc: *** mach_vm_map(size=18446744073441116160) failed (error code=3)
*** error: can't allocate region
*** set a breakpoint in malloc_error_break to debug
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
inbloom.error: internal initialization failed
>>> 
>>> bf = inbloom.Filter(10**9, 10**-2)
Python(83086,0x7fff7288e310) malloc: *** mach_vm_map(size=18446744073441116160) failed (error code=3)
*** error: can't allocate region
*** set a breakpoint in malloc_error_break to debug
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
inbloom.error: internal initialization failed
>>> 
>>> bf = inbloom.Filter(5*10**8, 10**-2)
Python(83086,0x7fff7288e310) malloc: *** mach_vm_map(size=18446744073441116160) failed (error code=3)
*** error: can't allocate region
*** set a breakpoint in malloc_error_break to debug
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
inbloom.error: internal initialization failed
>>> 
>>> bf = inbloom.Filter(5*10**8, 10**-1)
Python(83086,0x7fff7288e310) malloc: *** mach_vm_map(size=18446744073441116160) failed (error code=3)
*** error: can't allocate region
*** set a breakpoint in malloc_error_break to debug
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
inbloom.error: internal initialization failed
>>> bf = inbloom.Filter(4*10**8, 10**-1)
>>> 

This output was produced on OS X with Python 2.7.9 and inbloom 0.2.2. I got identical results on Ubuntu 14 with Python 2.7.10, except that the error messages were uniformly "inbloom.error: internal initialization failed".
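
One way to read the `mach_vm_map(size=...)` value in the transcript above is as a negative number that was reinterpreted as an unsigned 64-bit size. The sketch below is only arithmetic on the printed number, assuming a 64-bit `size_t`; it does not show where in the library such a value would arise.

```python
# Size reported by malloc in the transcript above.
reported = 18446744073441116160

# Reinterpret the unsigned 64-bit value as a signed 64-bit integer
# (assumption: the allocation size is a 64-bit size_t).
as_signed = reported - 2**64 if reported >= 2**63 else reported

print(as_signed)               # -268435456
print(as_signed == -(2**28))   # True
```

A negative size of this kind would be consistent with a signed integer overflowing before the allocation call.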

@bergundy
Collaborator

Thanks for the report, I'll take a look in the upcoming week.

@jbrockmendel
Author

Following up, I noticed in the libbloom docs that it defaults to a 32-bit build.

I then poked at the parameters and found that with an error rate of 10**-4, the largest number of elements it will take is 112022460. With an error rate of 10**-5, the limit is 89617968. Going to 10**-6 reduces the limit to exactly 5/6 of that, 74681640. That log-linear pattern continues through at least 10**-7.

Ideally, each factor-of-ten reduction in the error rate should cost about 4.8 bits per element. Using that estimate, the maximal filter at 10**-4 would take 112022460 * (4*4.8) bits, which is just a smidge above 2**31. I have no idea where to get an extra factor of two, but it looks like the 32-bit build may be at the root of this.
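
The estimate above can be checked numerically. This is a minimal sketch that assumes the textbook Bloom filter sizing formula, bits = -n * ln(p) / (ln 2)**2, rather than libbloom's actual code; `ideal_bits` and `observed` are names made up for illustration, and the entry counts are the limits reported in this comment.

```python
import math

# Bits per element for each factor-of-ten reduction in the error rate,
# under the textbook formula (about 4.79, close to the 4.8 estimate).
BITS_PER_DECADE = math.log(10) / math.log(2) ** 2

def ideal_bits(entries, error_rate):
    """Ideal filter size in bits for the given capacity and error rate."""
    return -entries * math.log(error_rate) / math.log(2) ** 2

# (largest accepted entry count, error rate) pairs reported above.
observed = [(112022460, 1e-4), (89617968, 1e-5), (74681640, 1e-6)]

for entries, error_rate in observed:
    bits = ideal_bits(entries, error_rate)
    # The last column is the size relative to 2**31.
    print(entries, error_rate, round(bits), bits / 2**31)
```

Under that assumption, all three limits land within a fraction of a percent of 2**31 bits, and the entry count times the exponent is the same integer in each case (112022460 * 4 == 89617968 * 5 == 74681640 * 6).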

@bergundy
Collaborator

Hi, this is definitely an issue with libbloom: it doesn't support filters larger than 2 ** 31. This can be fixed by changing libbloom to use size_t instead of int for the entries, bits and bytes fields of the bloom struct. I have a patch that does this and will submit it upstream.
I think we should support up to UINT32_MAX (2 ** 32 - 1), which is the maximum size allowed in the serialization protocol. I can also make libbloom raise a more specific error when it is passed values that are too large.
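
For callers who want to check parameters before constructing a filter, the largest entry count can be estimated up front. The sketch below assumes the textbook sizing formula and a cap expressed in bits; `max_entries` and `bit_limit` are hypothetical names, not part of inbloom or libbloom, and the result is an estimate that may differ from the library's real limit by a few entries.

```python
import math

def max_entries(error_rate, bit_limit=2**31):
    """Estimate the largest capacity whose ideal size stays within bit_limit bits."""
    bits_per_entry = -math.log(error_rate) / math.log(2) ** 2
    return int(bit_limit / bits_per_entry)

# Estimates under the current 2**31 cap, for error rates 10**-4 .. 10**-6.
for exponent in (4, 5, 6):
    print(exponent, max_entries(10.0 ** -exponent))

# The same estimate if the cap were raised to UINT32_MAX.
print(max_entries(1e-4, bit_limit=2**32 - 1))
```

With the default cap, the estimates come out close to the limits found empirically earlier in this thread; raising the cap to UINT32_MAX roughly doubles them.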

Does this solution seem reasonable to you?
cc: @dvirsky

@bergundy
Collaborator

Here's the commit for libbloom: EverythingMe/libbloom@87c929a. (I started a fork since the library has changed a lot in the past month, and picking up those changes will need some modifications to inbloom.)
