The yog/xzar controversy #446

Open
Yomguithereal opened this issue Jan 16, 2025 · 0 comments

We want to be able to do 1. vector-based operations and 2. model-based ML and NLP-adjacent stuff. xan is probably not the right place for this (although point 1 is negotiable, or could still remain in Rust in a different tool).

Available names for those projects could be:

  1. yog for the Rust part (a reference to Yog-Sothoth, the Lovecraftian ancient one, a reference that actually vibes with LLM communities, cf. shoggoths)
  2. xzar for the python part (another Baldur's Gate 1 chaotic evil wizard)

Bert & transformers explorations in Rust

  • rust-bert: very nice. Also implements sentence transformers. But it is limited to Bert-based models and does not support all Bert models from huggingface out-of-the-box. For instance, https://huggingface.co/dangvantuan/sentence-camembert-large, often used by @bmaz, can probably be made to work but requires effort on the user's end. What's more, it relies on the C++ version of Torch, which could be a hassle to cross-compile on some setups.
  • candle: a new framework by huggingface written in Rust, less sprawling and more focused than Torch (it does not support training, only inference, which is the only thing we actually need in the scope of this endeavor). It does not support as many of the models found on huggingface as the python transformers counterpart does. What's more, my early tests seem to indicate it is slower than Torch, albeit easier to parallelize.

Takeaway: it is probably better to stay in python to benefit from the large ecosystem (transformers, spacy etc.). This also means the pip install will be hardcore, because it will need to pull very heavy dependencies, but that's already the case if you install those dependencies on your own anyway. It will also probably be impossible to create a standalone executable using pyinstaller, but I think the target audience will not care in this instance.
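For reference, here is how little code the python ecosystem requires to run the camembert model mentioned above (a minimal sketch, assuming sentence-transformers is installed alongside torch):

```python
from sentence_transformers import SentenceTransformer

# Downloads the model from huggingface on first use, out-of-the-box.
model = SentenceTransformer("dangvantuan/sentence-camembert-large")

embeddings = model.encode(["premier texte", "second texte"])
print(embeddings.shape)  # (2, embedding dimension)
```

This is precisely the out-of-the-box experience rust-bert cannot match for arbitrary huggingface models.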

Python also enables us to expose high-level helpers as part of a library that is more easily usable by people than a Rust counterpart would be.

What I don't like about python is its slow startup time, which is not ideal for CLI tools. We can still do some lazy loading to alleviate this (see the sketch below), but since you will need to load models in RAM anyway, startup will be slow regardless.
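A minimal sketch of the lazy-loading idea (the embed subcommand and the xzar entry point are hypothetical, just for illustration): heavy imports are deferred into the subcommand that needs them, so that --help and argument parsing stay fast.

```python
import argparse

def run_embed(args):
    # Heavy dependency imported here, not at module level,
    # so it is only paid for when the subcommand actually runs.
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(args.model)
    for line in args.file:
        vector = model.encode(line.strip())
        print(",".join(f"{x:.6f}" for x in vector))

def main():
    parser = argparse.ArgumentParser(prog="xzar")
    subparsers = parser.add_subparsers(required=True)

    embed = subparsers.add_parser("embed")
    embed.add_argument("file", type=argparse.FileType("r"))
    embed.add_argument("--model", default="dangvantuan/sentence-camembert-large")
    embed.set_defaults(func=run_embed)

    args = parser.parse_args()
    args.func(args)

if __name__ == "__main__":
    main()
```

Model loading will still dominate startup, of course; lazy imports only shave off the interpreter and import overhead.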

Parallelization notes

  • Most python tools rely on pytorch, which handles parallelization on its own in the C++ layer. The number of threads used by Torch is customizable at runtime. This means Rust does not have an edge here, because the heavy work already happens in a low-level language.
  • pytorch can work on small batches of texts at once; this is often required if one wants to fully leverage the C++ multithreading (see the sketch after this list)
  • on very small texts (tweets, typically), the interaction between python and Torch and the infamous GIL makes the parallelization subpar. This means multithreading or multiprocessing can sometimes help. But pytorch is not threadsafe at all, which means the model data must be duplicated by each worker. While this is fine with light Bert models, it is not realistic with LLMs, even more so if GPU resources are required.
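To make the two levers above concrete, a sketch of the runtime thread knob and of batching (assuming torch and sentence-transformers are installed; the actual numbers are placeholders to tune):

```python
import torch
from sentence_transformers import SentenceTransformer

# The C++ thread pool used by Torch is configurable at runtime.
torch.set_num_threads(4)

model = SentenceTransformer("dangvantuan/sentence-camembert-large")

texts = ["some short text"] * 1_000  # placeholder corpus

# Feeding batches, rather than one text at a time, is what lets
# the C++ layer actually use its threads.
embeddings = model.encode(texts, batch_size=64)
```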

Envisioned features

  • Named Entity Recognition (through spacy or Bert): CSV with text as input, multiplexed CSV with one row per found entity as output
  • Spacy-based alternative to xan tokenize, so we can get access to lemmas, POS tags, dependency parsing, noun chunks etc. This could interface very well with xan vocab downstream. E.g. I want to tokenize my text but only keep nouns and lemmatize them, targeting the French language (see the spacy sketch after this list).
  • lang detection (https://github.com/pemistahl/lingua-rs with python bindings)
  • text translation
  • topic detection (LDA, BERTopic)
  • keyphrase extraction
  • summarization
  • one-shot classification, sentiment analysis etc.
  • various embeddings (word2vec, sentence embeddings)
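A sketch of what the first two items could look like with spacy (assuming the fr_core_news_md French model has been downloaded beforehand with python -m spacy download fr_core_news_md):

```python
import spacy

nlp = spacy.load("fr_core_news_md")
doc = nlp("Emmanuel Macron a visité le musée du Louvre à Paris.")

# NER: would become one CSV row per found entity.
for ent in doc.ents:
    print(ent.text, ent.label_)

# Tokenizing while only keeping nouns, lemmatized, which could
# then feed into xan vocab downstream.
noun_lemmas = [token.lemma_ for token in doc if token.pos_ == "NOUN"]
print(noun_lemmas)
```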

Vector-related stuff

Most indices do not support online insertion into out-of-RAM storage. They all expect you to build the index in memory, then dump the result to a file, after which they are able to query without loading everything in RAM. But this is still an issue if your dataset exceeds your RAM. This is the case with both FAISS and HNSW.
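To make the limitation concrete, this is the typical FAISS lifecycle (a sketch; note that the whole dataset must fit in RAM during the add phase):

```python
import faiss
import numpy as np

dimension = 768
vectors = np.random.random((10_000, dimension)).astype("float32")

index = faiss.IndexFlatL2(dimension)  # lives entirely in RAM
index.add(vectors)                    # the dataset must fit in memory here

faiss.write_index(index, "vectors.faiss")

# Later, possibly in another process: reload and query without
# rebuilding (some index types can even be memory-mapped).
index = faiss.read_index("vectors.faiss")
distances, ids = index.search(vectors[:5], 10)  # 10 nearest neighbors
```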

This could be useful in Rust too: https://github.com/meilisearch/arroy. But it seems to be based on Annoy, and I have not heard good things about Annoy (annoying, right?).

Things we want:

  • K-ANN querying and graph building
  • clustering
  • embedding
  • streamable storage (.npy is not very suited for this, although it can be "hacked"; CSV cells of stringified floats are a bit costly; CSV cells of base64-encoded binary are somewhat of a step outside the comfort zone of CSV and not standard at all; see the sketch after this list)
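A sketch of the base64 variant mentioned above (non-standard, but it keeps the file streamable row by row; the helper names are just for illustration):

```python
import base64
import csv
import sys

import numpy as np

def encode_vector(vector: np.ndarray) -> str:
    # Raw float32 bytes, base64-encoded to survive inside a CSV cell.
    return base64.b64encode(vector.astype(np.float32).tobytes()).decode("ascii")

def decode_vector(cell: str) -> np.ndarray:
    return np.frombuffer(base64.b64decode(cell), dtype=np.float32)

writer = csv.writer(sys.stdout)
writer.writerow(["id", "embedding"])
writer.writerow(["doc-1", encode_vector(np.random.random(768))])
```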

Stuff "easily" implementable in Rust to suit our needs
