
MARISA_SIZE_ERROR: buf.size() > MARISA_UINT32_MAX #1

Open
caseybrown89 opened this issue Feb 12, 2016 · 8 comments

Hello Susumu,

We are using the Python marisa-trie wrapper (https://github.com/kmike/marisa-trie), which wraps your library. The amount of data we've been placing in the trie has been growing over time, and the most recent trie build raised the following size error:

File "marisa_trie.pyx", line 422, in marisa_trie.BytesTrie.init (src/marisa_trie.cpp:7670)
File "marisa_trie.pyx", line 127, in marisa_trie.Trie._build (src/marisa_trie.cpp:2768)
RuntimeError: lib/marisa/grimoire/trie/tail.cc:192: MARISA_SIZE_ERROR: buf.size() > MARISA_UINT32_MAX

If there's any more info you need please let me know!

s-yata self-assigned this Feb 12, 2016
s-yata (Owner) commented Feb 12, 2016

It seems that your data has reached the limit of marisa-trie.

If you set num_tries to a value greater than 3 (DEFAULT_NUM_TRIES), you might be able to avoid the limit. Please note that this is an ad hoc workaround even if it works.
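
For the Python wrapper, num_tries can be passed when the trie is built. Below is a minimal sketch, assuming a version of kmike/marisa-trie that exposes num_tries as a constructor keyword (the exact signature may vary between releases):

    import marisa_trie

    # BytesTrie takes (unicode key, bytes value) pairs.
    data = [(u"key1", b"value1"), (u"key2", b"value2")]

    # Raising num_tries above the default of 3 lets more key material be
    # absorbed by the later tries instead of the TAIL buffer, which can
    # postpone MARISA_SIZE_ERROR; it does not remove the UInt32 limit.
    trie = marisa_trie.BytesTrie(data, num_tries=8)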

caseybrown89 (Author) commented:

Thanks for the quick response, I'll give that a shot. Do you think it is possible for this library to scale out to support bigger data sets? My naive thought is that I could try moving things from the 32-bit limit to a 64-bit limit. Do you think that would work? Thanks again.

mikepb (Contributor) commented Mar 24, 2016

The library is hard-coded to use UInt32 for lengths. May I suggest accepting template arguments for the stored value and size types as an enhancement? It would be a significant undertaking...

s-yata (Owner) commented Nov 7, 2018

This limitation should be removed in the future...

dkoslicki commented:

@s-yata Any chance this ancient issue will be addressed? I'm running into the same problem.

lacerda commented Jul 15, 2020

At the risk of being an echo, I would add that as datasets grow larger, more and more people will run into this issue. Marisa Trie is really great for my work, but I've encountered this error on my latest project.

erpic commented Feb 12, 2022

I have encountered this issue as well (for example, when trying to build a trie of around 100 million elements of about 100 bytes each). I have also noticed that the library is capable of creating files larger than 4 GB (2^32 bytes).

Data has become bigger since 2016. This data structure is a real gem.

Does anyone have ideas or suggestions on how to fix this UInt32 limitation? Is it a few hours or days of work, or more? What needs to be done, really? I have not done anything in C++ for a very long time (I use the Python bindings), but I would be happy to try to help with this issue. My end goal is to be able to create tries of 10 to 100 GB from Python.

Thanks in advance for any help/pointers, and congratulations to the author for an amazing library.
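
In the meantime, a possible application-level workaround (not a fix to the library itself) is to shard the keys across several tries so that each individual trie stays under the UInt32 limit. Here is a minimal sketch in Python; the shard count, hash choice, and helper names are illustrative assumptions, not part of marisa-trie:

    import hashlib
    import marisa_trie

    NUM_SHARDS = 16  # pick so that each shard stays well below the limit

    def shard_of(key):
        # Stable hash, so lookups route to the shard that stored the key.
        return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16) % NUM_SHARDS

    def build_sharded(items):
        # items: iterable of (unicode key, bytes value) pairs
        buckets = [[] for _ in range(NUM_SHARDS)]
        for key, value in items:
            buckets[shard_of(key)].append((key, value))
        return [marisa_trie.BytesTrie(bucket) for bucket in buckets]

    def lookup(tries, key):
        trie = tries[shard_of(key)]
        return trie[key] if key in trie else []

The obvious cost is that prefix queries have to be fanned out to every shard, so this mostly helps when exact lookups dominate.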

KevinEdry commented:

Datasets are even larger in 2024; I've been running into this issue myself.
