The yog/xzar controversy #446

Open
Yomguithereal opened this issue Jan 16, 2025 · 0 comments

We want to be able to do 1. vector-based operations and 2. model-based ML and NLP-adjacent stuff. xan is probably not the right place for this (although point 1 is negotiable, or could still remain in Rust in a different tool).

Available names for those projects could be:

  1. yog for the Rust part (a reference to Yog-Sothoth, the Lovecraftian ancient one, a reference that actually vibes with LLM communities, cf. shoggoths)
  2. xzar for the python part (another Baldur's Gate 1 chaotic evil wizard)

Bert & transformers explorations in Rust

  • rust-bert: very nice. Also implements sentence transformers. But it is limited to Bert-based models and does not support all Bert models from huggingface out-of-the-box. For instance, https://huggingface.co/dangvantuan/sentence-camembert-large, often used by @bmaz, can probably be made to work but requires effort on the user's end. What's more, it relies on the C++ version of Torch, which could be a hassle to cross-compile on some setups.
  • candle: a new framework by huggingface written in Rust, less sprawling and more focused than Torch (it does not support training, only inference, which is the only thing we actually need in the scope of this endeavor). It does not support as many of the models found on huggingface as the python transformers counterpart does. What's more, my early tests seem to indicate it is slower than Torch, albeit easier to parallelize.

Takeaway: it is probably better to stay in python to benefit from the large ecosystem (transformers, spacy etc.). This also means the pip install will be hardcore, because it will need to pull very heavy dependencies, but that's already the case if you install those dependencies on your own anyway. It will also probably be impossible to create a standalone executable using pyinstaller, but I think the target audience will not care in this instance.
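For reference, here is how little code the python ecosystem requires to run the camembert model mentioned above (a minimal sketch, assuming sentence-transformers is installed alongside torch):

```python
from sentence_transformers import SentenceTransformer

# Downloads the model from huggingface on first use, out-of-the-box.
model = SentenceTransformer("dangvantuan/sentence-camembert-large")

embeddings = model.encode(["premier texte", "second texte"])
print(embeddings.shape)  # (2, embedding dimension)
```

This is precisely the out-of-the-box experience rust-bert cannot match for arbitrary huggingface models.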

Python also enables us to expose high-level helpers as part of a library that is more easily usable by people than a Rust counterpart would be.

What I don't like about python is its slow startup time, which is not ideal for CLI tools. We can still do some lazy loading to alleviate this (see the sketch below), but since you will need to load models in RAM anyway, startup will be slow regardless.
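A minimal sketch of the lazy-loading idea (the embed subcommand and the xzar entry point are hypothetical, just for illustration): heavy imports are deferred into the subcommand that needs them, so that --help and argument parsing stay fast.

```python
import argparse

def run_embed(args):
    # Heavy dependency imported here, not at module level,
    # so it is only paid for when the subcommand actually runs.
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(args.model)
    for line in args.file:
        vector = model.encode(line.strip())
        print(",".join(f"{x:.6f}" for x in vector))

def main():
    parser = argparse.ArgumentParser(prog="xzar")
    subparsers = parser.add_subparsers(required=True)

    embed = subparsers.add_parser("embed")
    embed.add_argument("file", type=argparse.FileType("r"))
    embed.add_argument("--model", default="dangvantuan/sentence-camembert-large")
    embed.set_defaults(func=run_embed)

    args = parser.parse_args()
    args.func(args)

if __name__ == "__main__":
    main()
```

Model loading will still dominate startup, of course; lazy imports only shave off the interpreter and import overhead.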

Parallelization notes

  • Most python tools rely on pytorch, which handles parallelization on its own in the C++ layer. The number of threads used by Torch is customizable at runtime. This means Rust does not have an edge here, because the heavy work already happens in a low-level language.
  • pytorch can work on small batches of texts at once; this is often required if one wants to fully leverage the C++ multithreading (see the sketch after this list)
  • on very small texts (tweets, typically), the interaction between python and Torch and the infamous GIL makes the parallelization subpar. This means multithreading or multiprocessing can sometimes help. But pytorch is not threadsafe at all, which means the model data must be duplicated by each worker. While this is fine with light Bert models, it is not realistic with LLMs, even more so if GPU resources are required.
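To make the two levers above concrete, a sketch of the runtime thread knob and of batching (assuming torch and sentence-transformers are installed; the actual numbers are placeholders to tune):

```python
import torch
from sentence_transformers import SentenceTransformer

# The C++ thread pool used by Torch is configurable at runtime.
torch.set_num_threads(4)

model = SentenceTransformer("dangvantuan/sentence-camembert-large")

texts = ["some short text"] * 1_000  # placeholder corpus

# Feeding batches, rather than one text at a time, is what lets
# the C++ layer actually use its threads.
embeddings = model.encode(texts, batch_size=64)
```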

Envisioned features

  • Named Entity Recognition (through spacy or Bert): CSV with text as input, multiplexed CSV with one row per found entity as output
  • Spacy-based alternative to xan tokenize, so we can get access to lemmas, POS tags, dependency parsing, noun chunks etc. This could interface very well with xan vocab downstream. E.g. I want to tokenize my text but only keep nouns and lemmatize them, targeting the French language (see the spacy sketch after this list).
  • lang detection (https://github.com/pemistahl/lingua-rs with python bindings)
  • text translation
  • topic detection (LDA, BERTopic)
  • keyphrase extraction
  • summarization
  • one-shot classification, sentiment analysis etc.
  • various embeddings (word2vec, sentence embeddings)
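A sketch of what the first two items could look like with spacy (assuming the fr_core_news_md French model has been downloaded beforehand with python -m spacy download fr_core_news_md):

```python
import spacy

nlp = spacy.load("fr_core_news_md")
doc = nlp("Emmanuel Macron a visité le musée du Louvre à Paris.")

# NER: would become one CSV row per found entity.
for ent in doc.ents:
    print(ent.text, ent.label_)

# Tokenizing while only keeping nouns, lemmatized, which could
# then feed into xan vocab downstream.
noun_lemmas = [token.lemma_ for token in doc if token.pos_ == "NOUN"]
print(noun_lemmas)
```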

Vector-related stuff

Most indices do not support online insertion into out-of-RAM storage. They all expect you to build the index in memory, then dump the result to a file, after which they are able to query without loading everything in RAM. But this is still an issue if your dataset exceeds your RAM. This is the case with both FAISS and HNSW.
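To make the limitation concrete, this is the typical FAISS lifecycle (a sketch; note that the whole dataset must fit in RAM during the add phase):

```python
import faiss
import numpy as np

dimension = 768
vectors = np.random.random((10_000, dimension)).astype("float32")

index = faiss.IndexFlatL2(dimension)  # lives entirely in RAM
index.add(vectors)                    # the dataset must fit in memory here

faiss.write_index(index, "vectors.faiss")

# Later, possibly in another process: reload and query without
# rebuilding (some index types can even be memory-mapped).
index = faiss.read_index("vectors.faiss")
distances, ids = index.search(vectors[:5], 10)  # 10 nearest neighbors
```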

This could be useful in Rust too: https://github.com/meilisearch/arroy. But it seems to be based on Annoy, and I have not heard good things about Annoy (annoying, right?).

Things we want:

  • K-ANN querying and graph building
  • clustering
  • embedding
  • streamable storage (.npy is not very suited for this, although it can be "hacked"; CSV cells of stringified floats are a bit costly; CSV cells of base64-encoded binary are somewhat of a step outside the comfort zone of CSV and not standard at all; see the sketch after this list)
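A sketch of the base64 variant mentioned above (non-standard, but it keeps the file streamable row by row; the helper names are just for illustration):

```python
import base64
import csv
import sys

import numpy as np

def encode_vector(vector: np.ndarray) -> str:
    # Raw float32 bytes, base64-encoded to survive inside a CSV cell.
    return base64.b64encode(vector.astype(np.float32).tobytes()).decode("ascii")

def decode_vector(cell: str) -> np.ndarray:
    return np.frombuffer(base64.b64decode(cell), dtype=np.float32)

writer = csv.writer(sys.stdout)
writer.writerow(["id", "embedding"])
writer.writerow(["doc-1", encode_vector(np.random.random(768))])
```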

Stuff "easily" implementable in Rust to suit our needs
