An extended version of the LlamaEdge API Server with web search. In development.

LlamaEdge-Search API Server


LlamaEdge-Search API server is an extension of the LlamaEdge API Server with search capabilities. The server runs every query against search results obtained from the internet. It is implemented in WebAssembly (Wasm) and runs on WasmEdge Runtime.


/v1/models endpoint

LlamaEdge-Search API server provides a POST API /v1/models that lists the currently available models.


You can use curl to test it in a new terminal:

curl -X POST http://localhost:8080/v1/models -H 'accept:application/json'

If the command runs successfully, you should see output similar to the following in your terminal:

{
  "object": "list",
  "data": [
    {
      "id": "Llama-2-7b-chat-hf-Q5_K_M",
      "created": 1721824510,
      "object": "model",
      "owned_by": "Not specified"
    }
  ]
}

/v1/chat/completions endpoint

Ask a question using OpenAI's JSON message format.

curl -X POST http://localhost:8080/v1/chat/completions \
    -H 'accept:application/json' \
    -H 'Content-Type: application/json' \
    -d '{"messages":[{"role":"system", "content": "You are a helpful assistant."}, {"role":"user", "content": "Who is Robert Oppenheimer?"}], "model":"Llama-2-7b-chat-hf-Q5_K_M"}'

Here is the response.

                "content":"Ah, a most excellent question! Robert Oppenheimer (1904-1967) was an American theoretical physicist and director of the Manhattan Project, the secret research and development project that produced the atomic bomb during World War II. He is widely regarded as one of the most important physicists of the 20th century.\n\nOppenheimer was born in New York City and grew up in a family of intellectuals. He studied physics at Harvard University, where he earned his undergraduate degree, and later at Cambridge University, where he earned his PhD. After completing his education, he worked at several universities and research institutions, including the University of California, Berkeley, and Princeton University.\n\nOppenheimer's most significant contribution to physics was his work on quantum mechanics, particularly his development of the theory of quantum field theory. He also made important contributions to the study of nuclear physics and was one of the leaders of the Manhattan Project, which produced the atomic bomb during World War II.\n\nDespite his many accomplishments in physics, Oppenheimer is perhaps best known for his role in the development of the atomic bomb. He was a strong advocate for international cooperation on nuclear weapons and later became a vocal critic of the arms race between the United States and the Soviet Union.\n\nOppenheimer's life was marked by both personal and professional struggles. He was openly gay, which was illegal at the time, and he struggled with alcoholism and depression throughout his life. Despite these challenges, he remained a brilliant physicist and a passionate advocate for peaceful uses of nuclear energy until his death in 1967.\n\nToday, Oppenheimer is remembered as one of the most influential scientists of the 20th century, and his legacy continues to inspire new generations of physicists and peace activists around the world."

/v1/files endpoint

Upload files to chunk them and compute their embeddings.


The following command uploads a text file paris.txt to the API server via the /v1/files endpoint:

curl -X POST http://localhost:8080/v1/files -F "file=@paris.txt"

If the command is successful, you should see output similar to the following in your terminal:

{
    "id": "file_4bc24593-2a57-4646-af16-028855e7802e",
    "bytes": 2161,
    "created_at": 1711611801,
    "filename": "paris.txt",
    "object": "file",
    "purpose": "assistants"
}

The id and filename fields are needed in the next step, which segments the uploaded file into chunks for computing embeddings.

/v1/chunks endpoint

To segment the uploaded file into chunks for computing embeddings, use the /v1/chunks API.


The following command sends the uploaded file ID and filename to the API server and gets the chunks:

curl -X POST http://localhost:8080/v1/chunks \
    -H 'accept:application/json' \
    -H 'Content-Type: application/json' \
    -d '{"id":"file_4bc24593-2a57-4646-af16-028855e7802e", "filename":"paris.txt", "chunk_capacity":100}'

The following is an example response with the generated chunks:

{
    "id": "file_4bc24593-2a57-4646-af16-028855e7802e",
    "filename": "paris.txt",
    "chunks": [
        "Paris, city and capital of France, ..., for Paris has retained its importance as a centre for education and intellectual pursuits.",
        "Paris’s site at a crossroads ..., drawing to itself much of the talent and vitality of the provinces."
    ]
}

/v1/embeddings endpoint

To compute embeddings for a user query or file chunks, use the /v1/embeddings API.


The following command sends two text chunks to the API server and returns their embeddings:

curl -X POST http://localhost:8080/v1/embeddings \
    -H 'accept:application/json' \
    -H 'Content-Type: application/json' \
    -d '{"model": "e5-mistral-7b-instruct-Q5_K_M", "input":["Paris, city and capital of France, ..., for Paris has retained its importance as a centre for education and intellectual pursuits.", "Paris’s site at a crossroads ..., drawing to itself much of the talent and vitality of the provinces."]}'

The returned embeddings look like the following (the embedding values are elided):

{
    "object": "list",
    "data": [
        {
            "index": 0,
            "object": "embedding",
            "embedding": [...]
        },
        {
            "index": 1,
            "object": "embedding",
            "embedding": [...]
        }
    ],
    "model": "e5-mistral-7b-instruct-Q5_K_M",
    "usage": {
        "prompt_tokens": 491,
        "completion_tokens": 0,
        "total_tokens": 491
    }
}

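Embeddings for a query and for file chunks are typically compared to find the chunks most relevant to the query. The sketch below shows one common comparison, cosine similarity, over plain f32 slices such as the embedding arrays in the response above. The function name cosine_similarity is an assumption for illustration; this README does not state how the server itself compares embeddings.

```rust
/// Cosine similarity between two embedding vectors.
/// Returns None if the lengths differ or either vector has zero norm.
/// Hypothetical helper for illustration, not part of the server.
fn cosine_similarity(a: &[f32], b: &[f32]) -> Option<f32> {
    if a.len() != b.len() {
        return None;
    }
    // Dot product of the two vectors.
    let dot: f32 = a.iter().zip(b.iter()).map(|(x, y)| x * y).sum();
    // Euclidean norms of each vector.
    let norm_a = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        return None;
    }
    Some(dot / (norm_a * norm_b))
}
```

A value close to 1.0 means the two texts point in nearly the same direction in embedding space, while a value near 0.0 means they are unrelated.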
/v1/info endpoint

The /v1/info endpoint provides information about the API server, including the server version and the model parameters.


You can use curl to test it in a new terminal:

curl -X POST http://localhost:8080/v1/info -H 'accept:application/json'

If the command runs successfully, you should see output similar to the following in your terminal:

{
  "api_server": {
    "type": "llama",
    "version": "0.1.0",
    "ggml_plugin_version": "b3405 (commit 5e116e8d)",
    "port": "8080"
  },
  "chat_model": {
    "name": "Llama-2-7b-chat-hf-Q5_K_M",
    "type": "chat",
    "ctx_size": 4096,
    "batch_size": 512,
    "prompt_template": "Llama2Chat",
    "n_predict": 1024,
    "n_gpu_layers": 100,
    "temperature": 1.0,
    "top_p": 1.0,
    "repeat_penalty": 1.1,
    "presence_penalty": 0.0,
    "frequency_penalty": 0.0
  },
  "embedding_model": {
    "name": "all-MiniLM-L6-v2-ggml-model-f16",
    "type": "embedding",
    "ctx_size": 384,
    "batch_size": 512
  },
  "extras": {}
}


LlamaEdge-Search API server runs on WasmEdge Runtime. Choose the installation command for your operating system:

For macOS (apple silicon)
# install WasmEdge-0.13.4 with wasi-nn-ggml plugin
curl -sSf | bash -s -- --plugin wasi_nn-ggml

# Assuming you use zsh (the default shell on macOS), run the following command to activate the environment
source $HOME/.zshenv

For Ubuntu (>= 20.04)
# install libopenblas-dev
apt update && apt install -y libopenblas-dev

# install WasmEdge-0.13.4 with wasi-nn-ggml plugin
curl -sSf | bash -s -- --plugin wasi_nn-ggml

# Assuming you use bash (the default shell on Ubuntu), run the following command to activate the environment
source $HOME/.bashrc

For General Linux
# install WasmEdge-0.13.4 with wasi-nn-ggml plugin
curl -sSf | bash -s -- --plugin wasi_nn-ggml

# Assuming you use bash, run the following command to activate the environment
source $HOME/.bashrc


# Clone the repository (dev branch)
git clone --branch dev

# Change the working directory
cd search-api-server

# Build `search-api-server.wasm` with `http` support only
cargo build --target wasm32-wasi --release

# Copy the `search-api-server.wasm` to the root directory
cp target/wasm32-wasi/release/search-api-server.wasm .

To check the CLI options of the search-api-server wasm app, you can run the following command:

$ wasmedge ./target/wasm32-wasi/release/search-api-server.wasm -h
LlamaEdge-Search API Server

Usage: search-api-server.wasm [OPTIONS] --prompt-template <PROMPT_TEMPLATE>

  -m, --model-name <MODEL_NAME>
          Sets names for chat and/or embedding models. To run both chat and embedding models, the names should be separated by comma without space, for example, '--model-name Llama-2-7b,all-minilm'. The first value is for the chat model, and the second is for the embedding model [default: default]
  -a, --model-alias <MODEL_ALIAS>
          Model aliases for chat and embedding models [default: default,embedding]
  -c, --ctx-size <CTX_SIZE>
          Sets context sizes for chat and/or embedding models. To run both chat and embedding models, the sizes should be separated by comma without space, for example, '--ctx-size 4096,384'. The first value is for the chat model, and the second is for the embedding model [default: 4096,384]
  -b, --batch-size <BATCH_SIZE>
          Sets batch sizes for chat and/or embedding models. To run both chat and embedding models, the sizes should be separated by comma without space, for example, '--batch-size 128,64'. The first value is for the chat model, and the second is for the embedding model [default: 512,512]
  -p, --prompt-template <PROMPT_TEMPLATE>
          Sets prompt templates for chat and/or embedding models, respectively. To run both chat and embedding models, the prompt templates should be separated by comma without space, for example, '--prompt-template llama-2-chat,embedding'. The first value is for the chat model, and the second is for the embedding model [possible values: llama-2-chat, llama-3-chat, llama-3-tool, mistral-instruct, mistral-tool, mistrallite, openchat, codellama-instruct, codellama-super-instruct, human-assistant, vicuna-1.0-chat, vicuna-1.1-chat, vicuna-llava, chatml, chatml-tool, internlm-2-tool, baichuan-2, wizard-coder, zephyr, stablelm-zephyr, intel-neural, deepseek-chat, deepseek-coder, deepseek-chat-2, solar-instruct, phi-2-chat, phi-2-instruct, phi-3-chat, phi-3-instruct, gemma-instruct, octopus, glm-4-chat, groq-llama3-tool, mediatek-breeze, embedding, none]
  -r, --reverse-prompt <REVERSE_PROMPT>
          Halt generation at PROMPT, return control
  -n, --n-predict <N_PREDICT>
          Number of tokens to predict [default: 1024]
  -g, --n-gpu-layers <N_GPU_LAYERS>
          Number of layers to run on the GPU [default: 100]
      --no-mmap <NO_MMAP>
          Disable memory mapping for file access of chat models [possible values: true, false]
      --temp <TEMP>
          Temperature for sampling [default: 1.0]
      --top-p <TOP_P>
          An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. 1.0 = disabled [default: 1.0]
      --repeat-penalty <REPEAT_PENALTY>
          Penalize repeat sequence of tokens [default: 1.1]
      --presence-penalty <PRESENCE_PENALTY>
          Repeat alpha presence penalty. 0.0 = disabled [default: 0.0]
      --frequency-penalty <FREQUENCY_PENALTY>
          Repeat alpha frequency penalty. 0.0 = disabled [default: 0.0]
      --llava-mmproj <LLAVA_MMPROJ>
          Path to the multimodal projector file
      --socket-addr <SOCKET_ADDR>
          Socket address of LlamaEdge API Server instance [default:]
      --web-ui <WEB_UI>
          Root path for the Web UI files [default: chatbot-ui]
          Deprecated. Print prompt strings to stdout
          Deprecated. Print statistics to stdout
          Deprecated. Print all log information to stdout
      --max-search-results <MAX_SEARCH_RESULTS>
          Maximum number search results to use [default: 5]
      --size-limit-per-result <SIZE_LIMIT_PER_RESULT>
          Size to clip every result to [default: 300]
      --api-key <API_KEY>
          API key to be supplied to the endpoint, if supported [default: ]
      --search-prompt <SEARCH_PROMPT>
          System prompt explaining to the LLM how to interpret search results [default: "You found the following search results on the internet. Use them to answer the user's query.\n\n"]
          Whether to summarize search results before passing them onto the LLM, as opposed to passing the raw results themselves
  -h, --help
          Print help
  -V, --version
          Print version
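
Several options above (--model-name, --ctx-size, --batch-size, --prompt-template) accept either one value or two values separated by a comma without a space, where the first value is for the chat model and the second is for the embedding model. The sketch below shows how a numeric value such as '4096,384' could be split under that rule. The helper name parse_pair is hypothetical and is not taken from the server's source.

```rust
/// Splits a comma-separated option value such as "4096,384" into the chat
/// value and an optional embedding value. `parse_pair` is a hypothetical
/// helper for illustration, not a function from the server.
fn parse_pair(raw: &str) -> Result<(u64, Option<u64>), String> {
    let parts: Vec<&str> = raw.split(',').collect();
    if parts.len() > 2 {
        return Err(format!("expected at most two values, got {}", parts.len()));
    }
    // No trimming: the help text asks for a comma "without space".
    let parse = |s: &str| {
        s.parse::<u64>()
            .map_err(|e| format!("invalid number '{}': {}", s, e))
    };
    let chat = parse(parts[0])?;
    let embedding = match parts.get(1) {
        Some(&s) => Some(parse(s)?),
        None => None,
    };
    Ok((chat, embedding))
}
```

With this sketch, "4096" yields only a chat value, "4096,384" yields both, and "4096, 384" is rejected because of the space.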


LlamaEdge-Search API server supports two kinds of models: chat and embedding. The chat model generates responses to user queries, while the embedding model computes embeddings for user queries or file chunks. The Search API Server requires at least a chat model.

For the purpose of demonstration, we use the Llama-2-7b-chat-hf-Q5_K_M.gguf chat model as an example. Download this model and place it in the root directory of the repository.

The Search API Server uses Tavily as its default search API. Remember to supply a Tavily API key with the --api-key CLI parameter.

  • Start an instance of LlamaEdge-Search API server

    wasmedge --dir .:.  --env LLAMA_LOG="info" \
      --nn-preload default:GGML:AUTO:Llama-2-7b-chat-hf-Q5_K_M.gguf \
      search-api-server.wasm \
      --ctx-size 4096,384 \
      --prompt-template llama-2-chat \
      --model-name Llama-2-7b-chat-hf-Q5_K_M \
      --api-key <YOUR_API_KEY> # if required by an endpoint. The server uses Tavily by default, hence requires a Tavily API Key to function.

Usage Example

  • Execute the server

  • Ask a question. Search results from the backend in use will be automatically fetched and used.

    curl -X POST http://localhost:8080/v1/chat/completions \
        -H 'accept:application/json' \
        -H 'Content-Type: application/json' \
        -d '{"messages":[{"role":"system", "content": "You are a helpful assistant."}, {"role":"user", "content": "What'\''s the current news?"}], "model":"Llama-2-7b-chat-hf-Q5_K_M"}'
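
The --max-search-results, --size-limit-per-result and --search-prompt options control how fetched results reach the LLM. The sketch below illustrates one way those three settings could combine into a block of context text. The Hit struct, the build_search_context function, the output layout, and the choice of characters as the unit for the size limit are all assumptions for illustration, not the server's actual implementation.

```rust
/// Hypothetical stand-in for one search result.
struct Hit {
    site_name: String,
    url: String,
    text_content: String,
}

/// Keeps at most `max_results` hits, clips each text to `size_limit`
/// characters (an assumed unit), and prepends `search_prompt`.
fn build_search_context(
    search_prompt: &str,
    hits: &[Hit],
    max_results: usize,
    size_limit: usize,
) -> String {
    let mut context = String::from(search_prompt);
    for hit in hits.iter().take(max_results) {
        // Clip on character boundaries so multi-byte text is never split.
        let clipped: String = hit.text_content.chars().take(size_limit).collect();
        context.push_str(&format!("{} ({}):\n{}\n\n", hit.site_name, hit.url, clipped));
    }
    context
}
```

Calling build_search_context with the default search prompt, a limit of 5 results, and a size limit of 300 would mirror the defaults listed in the CLI help.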


The core of search-api-server is the SearchConfig struct, which the llama-core crate provides when compiled with the search feature.

To plug in a different search API, follow these steps:

  1. Decide the search API (JSON based, supports HTTP) to use. There are many out there to choose from. We currently use Tavily by default.

  2. Create a new file for the search endpoint under src/search/. Don't forget to declare it with mod <filename> in src/search/

  3. Next, write a fn in the new file that converts the raw JSON output of the search endpoint for a given query into a SearchOutput struct.

pub fn custom_search_parser(
    raw_results: &serde_json::Value,
) -> Result<SearchOutput, Box<dyn std::error::Error>> {
  // Conversion logic that turns the raw results from the server into a SearchOutput.
  // Replace each <assign> with a value extracted from raw_results.
  let search_output = SearchOutput {
    url: <assign>,
    site_name: <assign>,
    text_content: <assign>,
  };

  Ok(search_output)
}

  4. Next, define a struct CustomSearchInput. This struct must implement Serialize, because its fields are converted directly to JSON. Later, we'll pass an instance of this struct to the perform_search() function of SearchConfig to actually perform the search:

#[derive(serde::Serialize)]
struct CustomSearchInput {
  // Sample fields. Change according to your search endpoint.
  term: String,
  max_search_results: u8,
  depth: String,
  api_key: String,
}

  5. To ensure the search results are included in the conversation, create a CustomSearchInput instance in the insert_search_results() function in src/search/ and make sure it is passed to every call to perform_search() and summarize_search().

let search_input = CustomSearchInput {
  // assign fields
};
// ...
  match search_config.summarize_search(&search_input).await {
// ...
  match search_config.perform_search(&search_input).await {

  6. Then replace the SearchConfig object in the setup_search() function inside of src/ with our own, and update the global SEARCH_CONFIG. Note that the parser is passed as a function, not called.

let search_config = SearchConfig {
   // fields
   parser: custom_search_parser,
};
// ...
  .map_err(|_| ServerError::Operation("Failed to set `SEARCH_CONFIG`.".to_owned()))?;

  7. That's it! Recompile the server, run it, and try asking the LLM a question.

  8. (Optional) If your custom endpoint requires extra fields in the search input that the server does not already provide, you can edit the struct SearchArguments to include them, and access them through SEARCH_ARGUMENTS.
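
The steps above hinge on handing a parser function to the search configuration. The sketch below models that wiring with the standard library only, so it can be compiled on its own. The SearchOutput and SearchConfig types here are simplified stand-ins, not the real llama-core types, and the "url|site_name|text" input format and the parse_with helper are made up for illustration in place of the JSON body a real endpoint would return.

```rust
use std::error::Error;

/// Simplified stand-in for llama-core's `SearchOutput` (illustration only).
#[derive(Debug, PartialEq)]
struct SearchOutput {
    url: String,
    site_name: String,
    text_content: String,
}

/// The parser is stored as a plain function pointer.
type Parser = fn(&str) -> Result<SearchOutput, Box<dyn Error>>;

/// Simplified stand-in for `SearchConfig`; only the `parser` field is modeled.
struct SearchConfig {
    parser: Parser,
}

/// Hypothetical parser for a made-up "url|site_name|text" format that stands
/// in for the JSON body a real search endpoint would return.
fn custom_search_parser(raw: &str) -> Result<SearchOutput, Box<dyn Error>> {
    let fields: Vec<&str> = raw.splitn(3, '|').collect();
    if fields.len() != 3 {
        return Err("expected three '|'-separated fields".into());
    }
    Ok(SearchOutput {
        url: fields[0].to_owned(),
        site_name: fields[1].to_owned(),
        text_content: fields[2].to_owned(),
    })
}

/// Runs whatever parser the config carries (hypothetical helper).
fn parse_with(config: &SearchConfig, raw: &str) -> Result<SearchOutput, Box<dyn Error>> {
    (config.parser)(raw)
}
```

Building the config as SearchConfig { parser: custom_search_parser } stores the function itself; writing custom_search_parser() would try to call it immediately and would not type-check. Swapping search backends then only requires a different parser function and input struct.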

