letta unable to query ollama - [ReadTimeout: timed out] #2357

Open
ar5entum opened this issue Jan 17, 2025 · 1 comment


letta is unable to query the local ollama server. I am using the letta-python SDK with ollama, both running on the same local machine. The ollama server becomes unresponsive after a request made through letta, and I have to restart the ollama server.

from letta_client import Letta
client = Letta(base_url="http://localhost:8283")

letta sees ollama when listing models using
client.models.list_llms()

[LlmConfig(model='letta-free', model_endpoint_type='openai', model_endpoint='https://inference.memgpt.ai/', model_wrapper=None, context_window=16384, put_inner_thoughts_in_kwargs=True, handle='letta/letta-free'),
 LlmConfig(model='llama3.2:latest', model_endpoint_type='ollama', model_endpoint='http://localhost:11434/', model_wrapper='chatml', context_window=131072, put_inner_thoughts_in_kwargs=True, handle='ollama/llama3.2:latest')]

I then created my agent using the following code:

from letta_client import LlmConfig, EmbeddingConfig

llm_cfg = LlmConfig(
    model="llama3.2",
    model_endpoint_type="ollama",
    model_endpoint="http://localhost:11434",
    context_window=131072
)

embedding_cfg = EmbeddingConfig(
    embedding_endpoint_type="ollama",
    # embedding_endpoint=None,
    embedding_model="llama3.2",
    embedding_dim=3072,
    # embedding_chunk_size=300
)
agent = client.agents.create(name='test', memory_blocks=[], llm_config=llm_cfg, embedding_config=embedding_cfg)
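
(Side note: list_llms() above already returns a fully populated LlmConfig for ollama, including model_wrapper='chatml'. An untested alternative is to reuse that entry instead of constructing the config by hand:)

# untested sketch: pick the ollama entry that list_llms() reported above,
# identified by its handle
llm_cfg = next(
    cfg for cfg in client.models.list_llms()
    if cfg.handle == "ollama/llama3.2:latest"
)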

and then sent a message by doing:

from letta_client import MessageCreate
client.agents.messages.send(
    agent_id=agent.id,
    messages=[
        MessageCreate(
            role="user",
            text="why is the sky blue?",
        )  
    ],
)

The error trace I get is as follows:

{
	"name": "ReadTimeout",
	"message": "timed out",
	"stack": "---------------------------------------------------------------------------
ReadTimeout                               Traceback (most recent call last)
File ~/anaconda3/lib/python3.11/site-packages/httpx/_transports/default.py:101, in map_httpcore_exceptions()
    100 try:
--> 101     yield
    102 except Exception as exc:

File ~/anaconda3/lib/python3.11/site-packages/httpx/_transports/default.py:250, in HTTPTransport.handle_request(self, request)
    249 with map_httpcore_exceptions():
--> 250     resp = self._pool.handle_request(req)
    252 assert isinstance(resp.stream, typing.Iterable)

File ~/anaconda3/lib/python3.11/site-packages/httpcore/_sync/connection_pool.py:256, in ConnectionPool.handle_request(self, request)
    255     self._close_connections(closing)
--> 256     raise exc from None
    258 # Return the response. Note that in this case we still have to manage
    259 # the point at which the response is closed.

File ~/anaconda3/lib/python3.11/site-packages/httpcore/_sync/connection_pool.py:236, in ConnectionPool.handle_request(self, request)
    234 try:
    235     # Send the request on the assigned connection.
--> 236     response = connection.handle_request(
    237         pool_request.request
    238     )
    239 except ConnectionNotAvailable:
    240     # In some cases a connection may initially be available to
    241     # handle a request, but then become unavailable.
    242     #
    243     # In this case we clear the connection and try again.

File ~/anaconda3/lib/python3.11/site-packages/httpcore/_sync/connection.py:103, in HTTPConnection.handle_request(self, request)
    101     raise exc
--> 103 return self._connection.handle_request(request)

File ~/anaconda3/lib/python3.11/site-packages/httpcore/_sync/http11.py:136, in HTTP11Connection.handle_request(self, request)
    135         self._response_closed()
--> 136 raise exc

File ~/anaconda3/lib/python3.11/site-packages/httpcore/_sync/http11.py:106, in HTTP11Connection.handle_request(self, request)
     97 with Trace(
     98     \"receive_response_headers\", logger, request, kwargs
     99 ) as trace:
    100     (
    101         http_version,
    102         status,
    103         reason_phrase,
    104         headers,
    105         trailing_data,
--> 106     ) = self._receive_response_headers(**kwargs)
    107     trace.return_value = (
    108         http_version,
    109         status,
    110         reason_phrase,
    111         headers,
    112     )

File ~/anaconda3/lib/python3.11/site-packages/httpcore/_sync/http11.py:177, in HTTP11Connection._receive_response_headers(self, request)
    176 while True:
--> 177     event = self._receive_event(timeout=timeout)
    178     if isinstance(event, h11.Response):

File ~/anaconda3/lib/python3.11/site-packages/httpcore/_sync/http11.py:217, in HTTP11Connection._receive_event(self, timeout)
    216 if event is h11.NEED_DATA:
--> 217     data = self._network_stream.read(
    218         self.READ_NUM_BYTES, timeout=timeout
    219     )
    221     # If we feed this case through h11 we'll raise an exception like:
    222     #
    223     #     httpcore.RemoteProtocolError: can't handle event type
   (...)
    227     # perspective. Instead we handle this case distinctly and treat
    228     # it as a ConnectError.

File ~/anaconda3/lib/python3.11/site-packages/httpcore/_backends/sync.py:126, in SyncStream.read(self, max_bytes, timeout)
    125 exc_map: ExceptionMapping = {socket.timeout: ReadTimeout, OSError: ReadError}
--> 126 with map_exceptions(exc_map):
    127     self._sock.settimeout(timeout)

File ~/anaconda3/lib/python3.11/contextlib.py:155, in _GeneratorContextManager.__exit__(self, typ, value, traceback)
    154 try:
--> 155     self.gen.throw(typ, value, traceback)
    156 except StopIteration as exc:
    157     # Suppress StopIteration *unless* it's the same exception that
    158     # was passed to throw().  This prevents a StopIteration
    159     # raised inside the \"with\" statement from being suppressed.

File ~/anaconda3/lib/python3.11/site-packages/httpcore/_exceptions.py:14, in map_exceptions(map)
     13     if isinstance(exc, from_exc):
---> 14         raise to_exc(exc) from exc
     15 raise

ReadTimeout: timed out

The above exception was the direct cause of the following exception:

ReadTimeout                               Traceback (most recent call last)
Cell In[100], line 2
      1 from letta_client import MessageCreate
----> 2 client.agents.messages.send(
      3     agent_id=agent.id,
      4     messages=[
      5         MessageCreate(
      6             role=\"user\",
      7             text=\"hello\",
      8         )  
      9     ],
     10 )

File ~/anaconda3/lib/python3.11/site-packages/letta_client/agents/messages/client.py:171, in MessagesClient.send(self, agent_id, messages, config, request_options)
    124 def send(
    125     self,
    126     agent_id: str,
   (...)
    130     request_options: typing.Optional[RequestOptions] = None,
    131 ) -> LettaResponse:
    132     \"\"\"
    133     Process a user message and return the agent's response.
    134     This endpoint accepts a message from a user and processes it through the agent.
   (...)
    169     )
    170     \"\"\"
--> 171     _response = self._client_wrapper.httpx_client.request(
    172         f\"v1/agents/{jsonable_encoder(agent_id)}/messages\",
    173         method=\"POST\",
    174         json={
    175             \"messages\": convert_and_respect_annotation_metadata(
    176                 object_=messages, annotation=typing.Sequence[MessageCreate], direction=\"write\"
    177             ),
    178             \"config\": convert_and_respect_annotation_metadata(
    179                 object_=config, annotation=LettaRequestConfig, direction=\"write\"
    180             ),
    181         },
    182         request_options=request_options,
    183         omit=OMIT,
    184     )
    185     try:
    186         if 200 <= _response.status_code < 300:

File ~/anaconda3/lib/python3.11/site-packages/letta_client/core/http_client.py:198, in HttpClient.request(self, path, method, base_url, params, json, data, content, files, headers, request_options, retries, omit)
    190 timeout = (
    191     request_options.get(\"timeout_in_seconds\")
    192     if request_options is not None and request_options.get(\"timeout_in_seconds\") is not None
    193     else self.base_timeout()
    194 )
    196 json_body, data_body = get_request_body(json=json, data=data, request_options=request_options, omit=omit)
--> 198 response = self.httpx_client.request(
    199     method=method,
    200     url=urllib.parse.urljoin(f\"{base_url}/\", path),
    201     headers=jsonable_encoder(
    202         remove_none_from_dict(
    203             {
    204                 **self.base_headers(),
    205                 **(headers if headers is not None else {}),
    206                 **(request_options.get(\"additional_headers\", {}) or {} if request_options is not None else {}),
    207             }
    208         )
    209     ),
    210     params=encode_query(
    211         jsonable_encoder(
    212             remove_none_from_dict(
    213                 remove_omit_from_dict(
    214                     {
    215                         **(params if params is not None else {}),
    216                         **(
    217                             request_options.get(\"additional_query_parameters\", {}) or {}
    218                             if request_options is not None
    219                             else {}
    220                         ),
    221                     },
    222                     omit,
    223                 )
    224             )
    225         )
    226     ),
    227     json=json_body,
    228     data=data_body,
    229     content=content,
    230     files=(
    231         convert_file_dict_to_httpx_tuples(remove_omit_from_dict(remove_none_from_dict(files), omit))
    232         if (files is not None and files is not omit)
    233         else None
    234     ),
    235     timeout=timeout,
    236 )
    238 max_retries: int = request_options.get(\"max_retries\", 0) if request_options is not None else 0
    239 if _should_retry(response=response):

File ~/anaconda3/lib/python3.11/site-packages/httpx/_client.py:825, in Client.request(self, method, url, content, data, files, json, params, headers, cookies, auth, follow_redirects, timeout, extensions)
    810     warnings.warn(message, DeprecationWarning, stacklevel=2)
    812 request = self.build_request(
    813     method=method,
    814     url=url,
   (...)
    823     extensions=extensions,
    824 )
--> 825 return self.send(request, auth=auth, follow_redirects=follow_redirects)

File ~/anaconda3/lib/python3.11/site-packages/httpx/_client.py:914, in Client.send(self, request, stream, auth, follow_redirects)
    910 self._set_timeout(request)
    912 auth = self._build_request_auth(request, auth)
--> 914 response = self._send_handling_auth(
    915     request,
    916     auth=auth,
    917     follow_redirects=follow_redirects,
    918     history=[],
    919 )
    920 try:
    921     if not stream:

File ~/anaconda3/lib/python3.11/site-packages/httpx/_client.py:942, in Client._send_handling_auth(self, request, auth, follow_redirects, history)
    939 request = next(auth_flow)
    941 while True:
--> 942     response = self._send_handling_redirects(
    943         request,
    944         follow_redirects=follow_redirects,
    945         history=history,
    946     )
    947     try:
    948         try:

File ~/anaconda3/lib/python3.11/site-packages/httpx/_client.py:979, in Client._send_handling_redirects(self, request, follow_redirects, history)
    976 for hook in self._event_hooks[\"request\"]:
    977     hook(request)
--> 979 response = self._send_single_request(request)
    980 try:
    981     for hook in self._event_hooks[\"response\"]:

File ~/anaconda3/lib/python3.11/site-packages/httpx/_client.py:1014, in Client._send_single_request(self, request)
   1009     raise RuntimeError(
   1010         \"Attempted to send an async request with a sync Client instance.\"
   1011     )
   1013 with request_context(request=request):
-> 1014     response = transport.handle_request(request)
   1016 assert isinstance(response.stream, SyncByteStream)
   1018 response.request = request

File ~/anaconda3/lib/python3.11/site-packages/httpx/_transports/default.py:249, in HTTPTransport.handle_request(self, request)
    235 import httpcore
    237 req = httpcore.Request(
    238     method=request.method,
    239     url=httpcore.URL(
   (...)
    247     extensions=request.extensions,
    248 )
--> 249 with map_httpcore_exceptions():
    250     resp = self._pool.handle_request(req)
    252 assert isinstance(resp.stream, typing.Iterable)

File ~/anaconda3/lib/python3.11/contextlib.py:155, in _GeneratorContextManager.__exit__(self, typ, value, traceback)
    153     value = typ()
    154 try:
--> 155     self.gen.throw(typ, value, traceback)
    156 except StopIteration as exc:
    157     # Suppress StopIteration *unless* it's the same exception that
    158     # was passed to throw().  This prevents a StopIteration
    159     # raised inside the \"with\" statement from being suppressed.
    160     return exc is not value

File ~/anaconda3/lib/python3.11/site-packages/httpx/_transports/default.py:118, in map_httpcore_exceptions()
    115     raise
    117 message = str(exc)
--> 118 raise mapped_exc(message) from exc

ReadTimeout: timed out"
}
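
For completeness, the trace goes through letta_client/core/http_client.py, which only uses a longer timeout when a "timeout_in_seconds" value is supplied via request_options and otherwise falls back to the client's base timeout. A minimal, untested sketch to rule out a plain client-side timeout while the model loads (600 is an arbitrary test value, not a recommended setting):

from letta_client import MessageCreate

client.agents.messages.send(
    agent_id=agent.id,
    messages=[MessageCreate(role="user", text="why is the sky blue?")],
    # the http_client.py frame in the trace reads this key before falling
    # back to the client's default timeout
    request_options={"timeout_in_seconds": 600},
)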

My ollama is serving on port 11434, and on a freshly started server I can test it with:

curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2",
  "prompt":"Why is the sky blue?"
}'

But when I make the request through letta I don't get any response back, and nothing is shown in the letta logs. It seems the process fails somewhere in the ollama request; I'm guessing it's because of some issue in message formatting. The ollama log output at that point is:

llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Llama 3.2 3B Instruct
llama_model_loader: - kv   3:                           general.finetune str              = Instruct
llama_model_loader: - kv   4:                           general.basename str              = Llama-3.2
llama_model_loader: - kv   5:                         general.size_label str              = 3B
llama_model_loader: - kv   6:                               general.tags arr[str,6]       = ["facebook", "meta", "pytorch", "llam...
llama_model_loader: - kv   7:                          general.languages arr[str,8]       = ["en", "de", "fr", "it", "pt", "hi", ...
llama_model_loader: - kv   8:                          llama.block_count u32              = 28
llama_model_loader: - kv   9:                       llama.context_length u32              = 131072
llama_model_loader: - kv  10:                     llama.embedding_length u32              = 3072
llama_model_loader: - kv  11:                  llama.feed_forward_length u32              = 8192
llama_model_loader: - kv  12:                 llama.attention.head_count u32              = 24
llama_model_loader: - kv  13:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv  14:                       llama.rope.freq_base f32              = 500000.000000
llama_model_loader: - kv  15:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  16:                 llama.attention.key_length u32              = 128
llama_model_loader: - kv  17:               llama.attention.value_length u32              = 128
llama_model_loader: - kv  18:                          general.file_type u32              = 15
llama_model_loader: - kv  19:                           llama.vocab_size u32              = 128256
llama_model_loader: - kv  20:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv  21:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  22:                         tokenizer.ggml.pre str              = llama-bpe
llama_model_loader: - kv  23:                      tokenizer.ggml.tokens arr[str,128256]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  24:                  tokenizer.ggml.token_type arr[i32,128256]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
time=2025-01-17T12:45:08.841+05:30 level=INFO source=server.go:589 msg="waiting for server to become available" status="llm server loading model"
llama_model_loader: - kv  25:                      tokenizer.ggml.merges arr[str,280147]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv  26:                tokenizer.ggml.bos_token_id u32              = 128000
llama_model_loader: - kv  27:                tokenizer.ggml.eos_token_id u32              = 128009
llama_model_loader: - kv  28:                    tokenizer.chat_template str              = {{- bos_token }}\n{%- if custom_tools ...
llama_model_loader: - kv  29:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   58 tensors
llama_model_loader: - type q4_K:  168 tensors
llama_model_loader: - type q6_K:   29 tensors
llm_load_vocab: special tokens cache size = 256
llm_load_vocab: token to piece cache size = 0.7999 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 128256
llm_load_print_meta: n_merges         = 280147
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 131072
llm_load_print_meta: n_embd           = 3072
llm_load_print_meta: n_layer          = 28
llm_load_print_meta: n_head           = 24
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 3
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 8192
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 500000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 131072
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: ssm_dt_b_c_rms   = 0
llm_load_print_meta: model type       = 3B
llm_load_print_meta: model ftype      = Q4_K - Medium
llm_load_print_meta: model params     = 3.21 B
llm_load_print_meta: model size       = 1.87 GiB (5.01 BPW) 
llm_load_print_meta: general.name     = Llama 3.2 3B Instruct
llm_load_print_meta: BOS token        = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token        = 128009 '<|eot_id|>'
llm_load_print_meta: EOT token        = 128009 '<|eot_id|>'
llm_load_print_meta: EOM token        = 128008 '<|eom_id|>'
llm_load_print_meta: LF token         = 128 'Ä'
llm_load_print_meta: EOG token        = 128008 '<|eom_id|>'
llm_load_print_meta: EOG token        = 128009 '<|eot_id|>'
llm_load_print_meta: max token length = 256
llm_load_tensors: offloading 19 repeating layers to GPU
llm_load_tensors: offloaded 19/29 layers to GPU
llm_load_tensors:   CPU_Mapped model buffer size =  1918.35 MiB
llm_load_tensors:        CUDA0 model buffer size =  1096.05 MiB
llama_new_context_with_model: n_seq_max     = 1
llama_new_context_with_model: n_ctx         = 131072
llama_new_context_with_model: n_ctx_per_seq = 131072
llama_new_context_with_model: n_batch       = 512
llama_new_context_with_model: n_ubatch      = 512
llama_new_context_with_model: flash_attn    = 0
llama_new_context_with_model: freq_base     = 500000.0
llama_new_context_with_model: freq_scale    = 1
llama_kv_cache_init: kv_size = 131072, offload = 1, type_k = 'f16', type_v = 'f16', n_layer = 28, can_shift = 1
llama_kv_cache_init:        CPU KV buffer size =  4608.00 MiB
llama_kv_cache_init:      CUDA0 KV buffer size =  9728.00 MiB
llama_new_context_with_model: KV self size  = 14336.00 MiB, K (f16): 7168.00 MiB, V (f16): 7168.00 MiB
llama_new_context_with_model:        CPU  output buffer size =     0.50 MiB
llama_new_context_with_model:      CUDA0 compute buffer size =  7197.06 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =   262.01 MiB
llama_new_context_with_model: graph nodes  = 902
llama_new_context_with_model: graph splits = 104 (with bs=512), 3 (with bs=1)
time=2025-01-17T12:45:11.604+05:30 level=INFO source=server.go:594 msg="llama runner started in 3.02 seconds"

It freezes at this point and I can't call ollama anymore. If I kill the ollama process with Ctrl+C, I get:

[GIN] 2025/01/17 - 12:48:03 | 500 |         2m55s |       127.0.0.1 | POST     "/api/generate"
Some additional context:
  • OS is an Ubuntu server
  • Letta is running with the command sudo docker run --network=host -v ~/.letta/.persist/pgdata:/var/lib/postgresql/data -e OLLAMA_BASE_URL="http://localhost:11434" letta/letta:latest, which worked best for me since I wasn't able to link ollama to letta otherwise.
@ar5entum (Author)

Inside https://github.com/letta-ai/letta/blob/main/letta/local_llm/ollama/api.py I found that letta was making the request below. It seemed to me that the "options" parameter was causing the trouble. I commented that part out and the request went through, but llama3.2 was unable to produce any responses (it was stuck in perpetual thought). Changing the model to Gemma solved the issue; I then reverted the code and it kept working fine.

request = {
    ## base parameters
    "model": model,
    "prompt": prompt,
    # "images": [],  # TODO eventually support
    ## advanced parameters
    # "format": "json",  # TODO eventually support
    "stream": False,
    "options": settings,
    "raw": True,  # no prompt formatting
    # "raw mode does not support template, system, or context"
    # "system": "",  # no prompt formatting
    # "template": "{{ .Prompt }}",  # no prompt formatting
    # "context": None,  # no memory via prompt formatting
}
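
To check whether this "options"/"raw" payload on its own is what hangs ollama, a request of the same shape can be replayed directly against the API. This is an untested sketch; the options value below is a placeholder rather than the exact settings letta sends, although the context_window of 131072 from the LlmConfig presumably ends up here as num_ctx, which matches the 131072-token KV cache ollama allocates in the log above:

curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2",
  "prompt": "Why is the sky blue?",
  "stream": false,
  "raw": true,
  "options": {"num_ctx": 131072}
}'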
