As already noted in (closed) issue #18, marisa::Keyset imposes an input-size (memory) bottleneck on trie construction.
dawgdic has a DawgBuilder class which accepts input in sorted order (LC_ALL=C, i.e. memcmp order) and builds its data structure directly from that, avoiding the need for an in-memory input record buffer (which is what I understand Marisa's marisa::Keyset to be).
Since we have good tools for scalable sorting (even out-of-core; e.g. GNU sort), is it possible that a similar MarisaBuilder could accept input in some preprocessed (e.g. sorted) order, and avoid the need for a marisa::Keyset? I think some of us would be willing to externally preprocess our input data if so.