inbloom.error: invalid data length when trying to load bloom filter #11

klingerko opened this issue Dec 10, 2020 · 5 comments

klingerko commented Dec 10, 2020

Hi,

I would like to use the inbloom library to build a bloom filter from the Alexa top 1 million domain list. When I try to dump the filter to a file and load it back, I always get the following error:

```
$ python test_inbloom.py
Traceback (most recent call last):
  File "test_inbloom.py", line 34, in <module>
    bf = inbloom.load(base64.b64decode(data))
inbloom.error: invalid data length
```

It seems like I'm running into this error clause: https://github.com/EverythingMe/inbloom/blob/master/py/inbloom/inbloom.c#L221

My test script looks like this:

```python
import requests
import sys
import csv
import base64
import zipfile
import inbloom
from io import BytesIO, TextIOWrapper

ALEXA_URL = "http://s3.amazonaws.com/alexa-static/top-1m.csv.zip"
FP_RATIO = 0.00001  # 0.0001 -> 2.3MB bloom filter file, 0.00001 -> 2.9MB bloom filter file

if __name__ == "__main__":
    alexa_inbloom = None
    response = requests.get(ALEXA_URL)
    if not response or response.status_code != 200:
        sys.exit(-1)

archive = zipfile.ZipFile(BytesIO(response.content))
file = archive.open("top-1m.csv")
with TextIOWrapper(file, encoding="utf-8") as text_file:
    reader = csv.reader(text_file)
    alexa_inbloom = inbloom.Filter(entries=1000000, error=FP_RATIO)
    for row in reader:
        alexa_inbloom.add(row[1].lower())

assert alexa_inbloom.contains("youtube.com")

with open("alexa.inbloom", "wb") as f:
    data = base64.b64encode(inbloom.dump(alexa_inbloom))
    f.write(data)

with open("alexa.inbloom", "rb") as f:
    data = f.read()
    bf = inbloom.load(base64.b64decode(data))

assert bf.contains("youtube.com")
```

May I ask you to have a look please?

Thanks,
Konstantin

@ankit-nassa

Hi,

I am also facing a similar issue, but in Go. During unmarshalling (using inbloom.Unmarshal) I get the following error:

```
Expected 1148277 bytes, got 1258401
```

Can you please look at this issue?


dvirsky commented Jun 11, 2021

Hi, my suspicion is that it's something with the filter size. TBH we didn't test it with filters that big, IIRC (though it's been many years and I might be wrong). @bergundy do you have some time to try to recreate this?

@bergundy

Hi,
It should be easy to confirm if the issue is the size of the filter by just reducing the filter size in the example.
The golang implementation is completely separate from the Python one so I find it strange that the 2 issues would be related.
@ankitnassa can you post a sample that reproduces the issue?
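If it helps, a quick way to test that theory is to run the same round trip with a much smaller filter; a minimal sketch (the parameters here are arbitrary small values, not taken from the report):

```python
import base64
import inbloom

# Same dump/load round trip as in the original script, but with a small
# filter, to check whether the error only appears at larger sizes.
small = inbloom.Filter(entries=1000, error=0.01)
small.add("youtube.com")

data = base64.b64encode(inbloom.dump(small))
restored = inbloom.load(base64.b64decode(data))
assert restored.contains("youtube.com")
```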

@ankit-nassa

> Hi,
> It should be easy to confirm if the issue is the size of the filter by just reducing the filter size in the example.
> The golang implementation is completely separate from the Python one so I find it strange that the 2 issues would be related.
> @ankitnassa can you post a sample that reproduces the issue?

@bergundy So sorry to bother you. There was an issue with the API response that was sending us the bloom filter stream; it seems the inbloom library did not have any issues. Thanks for the help.

@bergundy

No worries, I'm glad it worked out.
