We want to be able to do 1. vector-based operations and 2. model-based ML and NLP-adjacent stuff. xan is probably not the right place for this (although point 1 is negotiable, or could still remain in Rust in a different tool).
Available names for those projects could be:
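yog for the Rust part (a reference to Yog-Sothoth, the Lovecraftian entity, which actually resonates with LLM communities, cf. shoggoths)
xzar for the python part (another Baldur's Gate 1 chaotic evil wizard)
Bert & transformers explorations in Rust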
rust-bert: very nice. It also implements sentence transformers, but it is limited to Bert-based models and does not support all Bert models from huggingface out-of-the-box. For instance https://huggingface.co/dangvantuan/sentence-camembert-large, often used by @bmaz, can probably work but requires work on the user's end. What's more, it relies on the C++ version of Torch (libtorch), which could be a hassle to cross-compile on some setups.
candle: a new framework by huggingface written in Rust, less sprawling and more focused than Torch (it is geared towards inference rather than training, and inference is the only thing we actually need in the scope of this endeavor). It does not support as many of the models found on huggingface as its python counterpart, transformers. What's more, my early tests seem to indicate it is slower than Torch, albeit easier to parallelize.
Takeaway: it is probably better to stay in python to benefit from its large ecosystem (transformers, spacy etc.). This also means the pip install will be heavy, because it will pull in very large dependencies, but that's already the case if you install those dependencies on your own anyway. It will probably also be impossible to create a standalone executable using pyinstaller, but I think the target audience will not care in this instance.
Python also enables us to expose high-level helpers as part of a library that people can use more easily than a Rust counterpart.
What I don't like with python is the slow startup time, which is not ideal for CLI tools. We can still do some lazy loading to alleviate this, but since models need to be loaded in RAM anyway, commands that use them will be slow to start regardless.
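To make the lazy loading idea concrete, here is a minimal sketch, assuming a CLI with one cheap command and one heavy command. The names `lazy_import`, `cmd_version` and `cmd_embed` are hypothetical, and the stdlib `statistics` module is only a stand-in for a heavy dependency such as transformers.

```python
import importlib
from functools import lru_cache


@lru_cache(maxsize=None)
def lazy_import(name):
    """Import `name` on first use only, then reuse the cached module."""
    return importlib.import_module(name)


def cmd_version():
    # Cheap command: never touches the heavy dependency.
    return "0.0.0"


def cmd_embed(text):
    # Heavy command: the import cost is paid here, not at CLI startup.
    # `statistics` is a stand-in for something like `transformers`.
    heavy = lazy_import("statistics")
    return int(heavy.mean(len(word) for word in text.split()))
```

With this layout, only the commands that actually need a model pay for the heavy import; the model loading time itself remains.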
Parallelization notes
Most python tools rely on pytorch, which does its own parallelization in the C++ layer. The number of threads used by Torch is customizable at runtime. This means Rust does not have an edge here, because the heavy work already happens in a low-level language.
pytorch can work on small batches of texts at once, and this is often required if one wants to fully leverage the C++ multithreading
on very small texts (typically tweets), the interaction between python, Torch and the infamous GIL makes the parallelization subpar. This means multithreading or multiprocessing can sometimes help. But pytorch is not threadsafe at all, so the model data must be duplicated by each worker. This is fine with a light Bert model, but it is not realistic with LLMs, even more so if GPU resources are required.
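The two points above (batching, and one model copy per worker) could be combined along these lines. This is only a sketch using the stdlib `multiprocessing.Pool`: `load_model` is a hypothetical stand-in that counts words instead of loading a real Bert model, and `batched`, `init_worker`, `embed_batch` and `embed_all` are illustrative names.

```python
from itertools import islice
from multiprocessing import Pool

_MODEL = None  # one private copy per worker process


def batched(items, size):
    """Yield successive lists of at most `size` items."""
    iterator = iter(items)
    while batch := list(islice(iterator, size)):
        yield batch


def load_model():
    # Hypothetical stand-in: a real worker would load a Bert-like model here.
    return lambda texts: [len(text.split()) for text in texts]


def init_worker():
    # Runs once in each worker, so the model is duplicated per process.
    global _MODEL
    _MODEL = load_model()


def embed_batch(batch):
    # Each call processes a whole batch with the worker's own model copy.
    return _MODEL(batch)


def embed_all(texts, workers=2, batch_size=32):
    # imap preserves input order, so results line up with the input texts.
    with Pool(workers, initializer=init_worker) as pool:
        for result in pool.imap(embed_batch, batched(texts, batch_size)):
            yield from result
```

`embed_all` should be called from under an `if __name__ == "__main__":` guard. The memory cost is `workers` times the model size, which is exactly why this does not scale to LLMs.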
Envisioned features
Named Entity Recognition (through spacy or Bert), CSV with text as input, multiplexed CSV with one row per found entity as output
Spacy-based alternative to xan tokenize so we can get access to lemmas, POS tags, dependency parsing, noun chunks etc. This could interface very well with xan vocab downstream. E.g. I want to tokenize my text but only keep nouns and lemmatize them, targeting the French language.
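The "multiplexed" output of the NER feature could look like the sketch below, which uses only the stdlib `csv` module. `find_entities` is a toy stand-in (capitalized words, with a made-up `UNKNOWN` label) for a real spacy or Bert model, and the `entity` / `entity_label` column names are assumptions, not a settled format.

```python
import csv
import io
import re


def find_entities(text):
    # Toy stand-in for a real NER model (spacy or Bert): every capitalized
    # word is reported with a made-up "UNKNOWN" label.
    return [(m.group(), "UNKNOWN") for m in re.finditer(r"\b[A-Z][a-z]+\b", text)]


def explode_entities(reader, text_column):
    """Yield one output row per entity, repeating the source row's cells."""
    for row in reader:
        for entity, label in find_entities(row[text_column]):
            yield {**row, "entity": entity, "entity_label": label}


def run(source, target, text_column="text"):
    reader = csv.DictReader(source)
    fieldnames = list(reader.fieldnames) + ["entity", "entity_label"]
    writer = csv.DictWriter(target, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(explode_entities(reader, text_column))


# Small in-memory demo; real usage would pass file handles instead.
source = io.StringIO("id,text\n1,Paris and Berlin\n2,nothing here\n")
target = io.StringIO()
run(source, target)
```

Rows without any entity produce no output here; whether they should be kept with empty entity cells is an open design choice.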
Vector-related stuff
Most indices do not support online insertion into out-of-RAM storage. They all expect you to build the index in memory, then dump the result to a file. After that, they are able to query without loading everything in RAM. But this is still an issue if your dataset exceeds your RAM. This is the case with FAISS and HNSW.
This could also be useful in Rust: https://github.com/meilisearch/arroy but it seems to be based on Annoy and I have not heard good things about Annoy (annoying, right?).
Things we want:
K-ANN querying and graph building
clustering
embedding
streamable storage (.npy is not well suited for this, although it can be "hacked"; CSV cells of stringified floats are a bit costly; CSV cells of base64-encoded binary are somewhat outside the comfort zone of CSV and not standard at all)
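As a point of comparison for the storage question, here is a minimal sketch of an append-only format made of raw little-endian float32 records, using only the stdlib `struct` module. This is an illustration rather than a proposal: it has no header, so the dimension must be known by the reader, and `append_vector` / `iter_vectors` are hypothetical names.

```python
import io
import struct


def append_vector(handle, vector):
    """Append one vector as raw little-endian float32 values."""
    handle.write(struct.pack(f"<{len(vector)}f", *vector))


def iter_vectors(handle, dim):
    """Stream vectors back one at a time, never loading the whole file."""
    size = 4 * dim  # 4 bytes per float32
    while len(chunk := handle.read(size)) == size:
        yield struct.unpack(f"<{dim}f", chunk)


# In-memory demo; a binary file opened in append mode works the same way.
buffer = io.BytesIO()
append_vector(buffer, [1.0, 2.0])
append_vector(buffer, [0.5, -3.0])
buffer.seek(0)
vectors = list(iter_vectors(buffer, dim=2))
```

Each vector costs exactly 4 bytes per dimension, and writers can keep appending without rewriting anything, which is the property .npy lacks.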
Stuff "easily" implementable in Rust to suit our needs
n-dimensional ForceAtlas2 layout (using k-means for repulsion)
non-simd k-means (this lib seems very nice but does not work without SIMD and seems to be very hard to compile)
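For reference, a scalar (non-SIMD) k-means boils down to the two loops below. The sketch is in Python to stay consistent with the other examples here, and a Rust port would follow the same structure. Taking the first k points as initial centroids is a simplification chosen for determinism, not a recommendation.

```python
import math


def kmeans(points, k, iterations=10):
    """Plain Lloyd iterations; centroids start as the first k points."""
    centroids = [list(p) for p in points[:k]]
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest centroid by Euclidean distance.
        for i, p in enumerate(points):
            assignments[i] = min(range(k), key=lambda c: math.dist(p, centroids[c]))
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, assignments
```

The assignment step is the hot loop, and it is where a Rust version would gain the most even without SIMD.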