Run an LLM with RAG from a Gaming PC

The goal of this repo is to play with natural language processing using relatively limited resources. The specs it has been built with are:

  • a linux/amd64 platform;
  • with git, docker and python3;
  • an Nvidia GPU with CUDA.

To make use of the latter, the Nvidia container toolkit is needed.

It may work with different specs, but make sure that the total amount of VRAM + RAM available is comfortably larger than the size of the model you intend to use.

The cluster encompasses an Open-WebUI + Ollama stack, as well as a Jupyter Notebook for experimentation.

🚧 Next step: connect the WebUI database with the dedicated ChromaDB service (work in progress) in order to efficiently set up the RAG pipeline.

Deploy

The project is containerized with docker. However, some preliminary steps are required in order to prepare the LLM on the host machine.

git clone https://github.com/Almarch/NLP-from-a-PC
cd NLP-from-a-PC
docker compose build
docker compose up -d
docker ps

A frugal DeepSeek model has been picked for illustration. Access the running Ollama container, say container 123:

docker exec -it 123 bash
ollama pull deepseek-r1:8b

See the Ollama collections.

Fill the Vector DB

A Chroma vector DB is included in the stack. In order to fill it from PDFs, the following pipeline is under development (./services/jupyter/notebook/Resource.py):

[Pipeline diagram: OCR → text splitting → language detection & translation → encoding → ChromaDB]

OCR

PDFs are imported as images and read using pytesseract.

It works for some resources, though the output sometimes includes typos. However, it completely fails to read this paper.
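
For reference, a minimal sketch of that step, assuming the pdf2image and pytesseract packages (with the poppler and tesseract binaries available); the actual implementation lives in Resource.py:

# Minimal OCR sketch: render each PDF page to an image, then run Tesseract on it.
# "paper.pdf" is a placeholder path; the OCR language defaults to English.
from pdf2image import convert_from_path
import pytesseract

def ocr_pdf(path: str, lang: str = "eng") -> str:
    pages = convert_from_path(path, dpi=300)  # one PIL image per page
    return "\n".join(pytesseract.image_to_string(p, lang=lang) for p in pages)

print(ocr_pdf("paper.pdf")[:500])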

Text splitting

Text splitting is done with an arbitrary rule of 1000 words per chunk with a 100-word overlap, which is the default configuration of the WebUI RAG pipeline using the all-MiniLM-L6-v2 encoder. With a different encoder, the chunk size should be adjusted.

Roughly half of the chunks reach the maximum number of tokens with this configuration, as explored in ./services/jupyter/notebook/fill_chroma.ipynb.
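
As an illustration only (the WebUI applies its own implementation), such a word-based splitter could look like:

# Sketch of the 1000-word / 100-word-overlap splitting rule (illustrative only).
def split_words(text: str, chunk_size: int = 1000, overlap: int = 100):
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break
    return chunks

print(len(split_words("word " * 2500)))  # 3 chunks for a 2500-word text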

Language detection & translation

Most free encoders are language-specific, as clearly stated in their documentation and further explored in ./services/jupyter/notebook/encoding.ipynb with all-MiniLM-L6-v2:

"Les chiens sont fidèles" is an exact translations of "Dogs are loyal". I also checked all-mpnet-base-v2 with a similar result.

The approach I explored was to use the LLM to detect the language, then to perform the translation. Using Mistral 7B, the translations were pretty good but the language detection was difficult to industrialize. My prompts (./services/jupyter/notebook/prompts.py) can very likely be improved, and their outputs could be post-processed further as well.

Notably, my limited resources make the translation step extremely slow.
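
As a sketch of that detection-then-translation loop through the Ollama REST API (the model name and prompts below are placeholders, not the ones from prompts.py; inside the compose network the host would be the Ollama service name rather than localhost):

# Detect the language of a chunk with the LLM, then translate it to English.
# Model name and prompts are illustrative placeholders.
import requests

OLLAMA_URL = "http://localhost:11434/api/generate"

def ask(model: str, prompt: str) -> str:
    r = requests.post(OLLAMA_URL, json={"model": model, "prompt": prompt, "stream": False})
    r.raise_for_status()
    return r.json()["response"].strip()

def translate_to_english(chunk: str, model: str = "mistral") -> str:
    lang = ask(model, f"Answer with a single word: what language is this text written in?\n{chunk}")
    if lang.lower().startswith("english"):
        return chunk
    return ask(model, f"Translate the following text to English. Answer with the translation only.\n{chunk}")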

Encoding

We already mentioned the importance of the encoder, especially with regards to:

  • its language, which raises the need for translation;
  • its number of input tokens, which impacts the chunk size.

Let's add to the list:

  • its number of output dimensions, i.e. the embedding vector size: longer vectors carry more information and hence allow more nuance;
  • its relevance to the topic.

Regarding this last point in particular, an encoder intended for a very specific topic should be considered for fine-tuning. Otherwise, only a very limited fraction of the embedding space would actually be occupied by the topic resources, making it difficult to sort and associate them efficiently. Fine-tuning the encoder is an order of magnitude easier than fine-tuning the LLM itself.
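
For the record, the input and output sizes of a given encoder can be checked directly with sentence-transformers (a quick sketch):

# Inspect the encoder's input and output sizes.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
print(model.max_seq_length)                      # maximum input tokens per chunk (256)
print(model.get_sentence_embedding_dimension())  # embedding vector size (384)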

For a French RAG project, I keep that one in my back pocket:

ChromaDB

The next step will be to use the ChromaDB in a RAG pipeline from Open-WebUI.
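
A minimal sketch of what filling that collection could look like from the notebook; the host, port and collection names are assumptions to adapt to the compose configuration:

# Push chunks and their embeddings into the Chroma service (sketch).
import chromadb
from sentence_transformers import SentenceTransformer

client = chromadb.HttpClient(host="chromadb", port=8000)    # assumed service name and port
collection = client.get_or_create_collection("documents")   # assumed collection name

chunks = ["First chunk of text...", "Second chunk of text..."]  # in practice, the chunks produced above
encoder = SentenceTransformer("all-MiniLM-L6-v2")

collection.add(
    ids=[f"doc-{i}" for i in range(len(chunks))],
    documents=chunks,
    embeddings=encoder.encode(chunks).tolist(),
)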

That's all

Once launched with docker, the WebUI is available at http://localhost:8080 and the Jupyter Notebook at http://localhost:8888.

Resource consumption may be monitored with:

nvtop

for the VRAM and GPU; and:

htop

for the RAM and CPU.

Tunneling

It is sometimes easier to rent a virtual private server (VPS) than to obtain a fixed IP from the Internet provider. We want some services of the gaming machine, let's call it A, to be accessible from anywhere, including from machine C. In the middle, B is the VPS used as a tunnel.

Name        | A                | B                | C
Description | Gaming machine   | VPS              | Client
Role        | Hosts the models | Hosts the tunnel | Plays with NLP
User        | userA            | userB            | doesn't matter
IP          | doesn't matter   | 11.22.33.44      | doesn't matter

The services we need are:

  • The web UI and the notebook, available at ports 8080 and 8888 respectively.
  • An SSH endpoint. Port 22 of the gaming machine (A) will be exposed through port 2222 of the VPS (B).

From A) the gaming machine

The ports are pushed to the VPS:

ssh -N -R 8888:localhost:8888 -R 8080:localhost:8080 -R 2222:localhost:22 userB@11.22.33.44

From B) the VPS

The SSH port 2222 has to be opened.

sudo ufw allow 2222
sudo ufw reload

From C) the client

The Jupyter notebook and the web UI are pulled from the VPS:

ssh -N -L 8888:localhost:8888 -L 8080:localhost:8080 userB@11.22.33.44

And the VPS is a direct tunnel to the gaming machine A:

ssh -p 2222 userA@11.22.33.44

Note that userA, not userB, is required for authentication; the same goes for the password.

Other branches

There are several branches in this repo, corresponding to exploratory steps.

  • in laptop I attempted to run deepseek-llm-7b-chat from HF on a laptop. It actually "worked", at about 5 minutes per token.
  • with fastapi_everywhere, I downloaded all models (LLM, OCR, encoder) into a distinct service exposed through FastAPI. It still used HF for the LLM.
  • I switched to the Ollama + WebUI framework in the from_hugging_face branch. There, I still downloaded an HF model and converted it to Ollama, which was unnecessarily complicated.
