letta unable to query ollama - [ReadTimeout: timed out] #2357

Open
ar5entum opened this issue Jan 17, 2025 · 1 comment


letta is unable to query the local ollama server. I am using the letta-python SDK with ollama, both running on the same local machine. The ollama server becomes unresponsive after a request made through letta, and I have to restart the ollama server.

from letta_client import Letta
client = Letta(base_url="http://localhost:8283")

letta sees ollama when listing models using
client.models.list_llms()

[LlmConfig(model='letta-free', model_endpoint_type='openai', model_endpoint='https://inference.memgpt.ai/', model_wrapper=None, context_window=16384, put_inner_thoughts_in_kwargs=True, handle='letta/letta-free'),
 LlmConfig(model='llama3.2:latest', model_endpoint_type='ollama', model_endpoint='http://localhost:11434/', model_wrapper='chatml', context_window=131072, put_inner_thoughts_in_kwargs=True, handle='ollama/llama3.2:latest')]

I then created my agent using the following code:

from letta_client import LlmConfig, EmbeddingConfig

llm_cfg = LlmConfig(
    model="llama3.2",
    model_endpoint_type="ollama",
    model_endpoint="http://localhost:11434",
    context_window=131072
)

embedding_cfg = EmbeddingConfig(
    embedding_endpoint_type="ollama",
    # embedding_endpoint=None,
    embedding_model="llama3.2",
    embedding_dim=3072,
    # embedding_chunk_size=300
)
agent = client.agents.create(name='test', memory_blocks=[], llm_config=llm_cfg, embedding_config=embedding_cfg)
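
(Side note: list_llms() above already returns a fully populated LlmConfig for ollama, including model_wrapper='chatml'. An untested alternative is to reuse that entry instead of constructing the config by hand:)

# untested sketch: pick the ollama entry that list_llms() reported above,
# identified by its handle
llm_cfg = next(
    cfg for cfg in client.models.list_llms()
    if cfg.handle == "ollama/llama3.2:latest"
)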

and then sent a message by doing:

from letta_client import MessageCreate
client.agents.messages.send(
    agent_id=agent.id,
    messages=[
        MessageCreate(
            role="user",
            text="why is the sky blue?",
        )  
    ],
)

The error trace I get is as follows:

{
	"name": "ReadTimeout",
	"message": "timed out",
	"stack": "---------------------------------------------------------------------------
ReadTimeout                               Traceback (most recent call last)
File ~/anaconda3/lib/python3.11/site-packages/httpx/_transports/default.py:101, in map_httpcore_exceptions()
    100 try:
--> 101     yield
    102 except Exception as exc:

File ~/anaconda3/lib/python3.11/site-packages/httpx/_transports/default.py:250, in HTTPTransport.handle_request(self, request)
    249 with map_httpcore_exceptions():
--> 250     resp = self._pool.handle_request(req)
    252 assert isinstance(resp.stream, typing.Iterable)

File ~/anaconda3/lib/python3.11/site-packages/httpcore/_sync/connection_pool.py:256, in ConnectionPool.handle_request(self, request)
    255     self._close_connections(closing)
--> 256     raise exc from None
    258 # Return the response. Note that in this case we still have to manage
    259 # the point at which the response is closed.

File ~/anaconda3/lib/python3.11/site-packages/httpcore/_sync/connection_pool.py:236, in ConnectionPool.handle_request(self, request)
    234 try:
    235     # Send the request on the assigned connection.
--> 236     response = connection.handle_request(
    237         pool_request.request
    238     )
    239 except ConnectionNotAvailable:
    240     # In some cases a connection may initially be available to
    241     # handle a request, but then become unavailable.
    242     #
    243     # In this case we clear the connection and try again.

File ~/anaconda3/lib/python3.11/site-packages/httpcore/_sync/connection.py:103, in HTTPConnection.handle_request(self, request)
    101     raise exc
--> 103 return self._connection.handle_request(request)

File ~/anaconda3/lib/python3.11/site-packages/httpcore/_sync/http11.py:136, in HTTP11Connection.handle_request(self, request)
    135         self._response_closed()
--> 136 raise exc

File ~/anaconda3/lib/python3.11/site-packages/httpcore/_sync/http11.py:106, in HTTP11Connection.handle_request(self, request)
     97 with Trace(
     98     \"receive_response_headers\", logger, request, kwargs
     99 ) as trace:
    100     (
    101         http_version,
    102         status,
    103         reason_phrase,
    104         headers,
    105         trailing_data,
--> 106     ) = self._receive_response_headers(**kwargs)
    107     trace.return_value = (
    108         http_version,
    109         status,
    110         reason_phrase,
    111         headers,
    112     )

File ~/anaconda3/lib/python3.11/site-packages/httpcore/_sync/http11.py:177, in HTTP11Connection._receive_response_headers(self, request)
    176 while True:
--> 177     event = self._receive_event(timeout=timeout)
    178     if isinstance(event, h11.Response):

File ~/anaconda3/lib/python3.11/site-packages/httpcore/_sync/http11.py:217, in HTTP11Connection._receive_event(self, timeout)
    216 if event is h11.NEED_DATA:
--> 217     data = self._network_stream.read(
    218         self.READ_NUM_BYTES, timeout=timeout
    219     )
    221     # If we feed this case through h11 we'll raise an exception like:
    222     #
    223     #     httpcore.RemoteProtocolError: can't handle event type
   (...)
    227     # perspective. Instead we handle this case distinctly and treat
    228     # it as a ConnectError.

File ~/anaconda3/lib/python3.11/site-packages/httpcore/_backends/sync.py:126, in SyncStream.read(self, max_bytes, timeout)
    125 exc_map: ExceptionMapping = {socket.timeout: ReadTimeout, OSError: ReadError}
--> 126 with map_exceptions(exc_map):
    127     self._sock.settimeout(timeout)

File ~/anaconda3/lib/python3.11/contextlib.py:155, in _GeneratorContextManager.__exit__(self, typ, value, traceback)
    154 try:
--> 155     self.gen.throw(typ, value, traceback)
    156 except StopIteration as exc:
    157     # Suppress StopIteration *unless* it's the same exception that
    158     # was passed to throw().  This prevents a StopIteration
    159     # raised inside the \"with\" statement from being suppressed.

File ~/anaconda3/lib/python3.11/site-packages/httpcore/_exceptions.py:14, in map_exceptions(map)
     13     if isinstance(exc, from_exc):
---> 14         raise to_exc(exc) from exc
     15 raise

ReadTimeout: timed out

The above exception was the direct cause of the following exception:

ReadTimeout                               Traceback (most recent call last)
Cell In[100], line 2
      1 from letta_client import MessageCreate
----> 2 client.agents.messages.send(
      3     agent_id=agent.id,
      4     messages=[
      5         MessageCreate(
      6             role=\"user\",
      7             text=\"hello\",
      8         )  
      9     ],
     10 )

File ~/anaconda3/lib/python3.11/site-packages/letta_client/agents/messages/client.py:171, in MessagesClient.send(self, agent_id, messages, config, request_options)
    124 def send(
    125     self,
    126     agent_id: str,
   (...)
    130     request_options: typing.Optional[RequestOptions] = None,
    131 ) -> LettaResponse:
    132     \"\"\"
    133     Process a user message and return the agent's response.
    134     This endpoint accepts a message from a user and processes it through the agent.
   (...)
    169     )
    170     \"\"\"
--> 171     _response = self._client_wrapper.httpx_client.request(
    172         f\"v1/agents/{jsonable_encoder(agent_id)}/messages\",
    173         method=\"POST\",
    174         json={
    175             \"messages\": convert_and_respect_annotation_metadata(
    176                 object_=messages, annotation=typing.Sequence[MessageCreate], direction=\"write\"
    177             ),
    178             \"config\": convert_and_respect_annotation_metadata(
    179                 object_=config, annotation=LettaRequestConfig, direction=\"write\"
    180             ),
    181         },
    182         request_options=request_options,
    183         omit=OMIT,
    184     )
    185     try:
    186         if 200 <= _response.status_code < 300:

File ~/anaconda3/lib/python3.11/site-packages/letta_client/core/http_client.py:198, in HttpClient.request(self, path, method, base_url, params, json, data, content, files, headers, request_options, retries, omit)
    190 timeout = (
    191     request_options.get(\"timeout_in_seconds\")
    192     if request_options is not None and request_options.get(\"timeout_in_seconds\") is not None
    193     else self.base_timeout()
    194 )
    196 json_body, data_body = get_request_body(json=json, data=data, request_options=request_options, omit=omit)
--> 198 response = self.httpx_client.request(
    199     method=method,
    200     url=urllib.parse.urljoin(f\"{base_url}/\", path),
    201     headers=jsonable_encoder(
    202         remove_none_from_dict(
    203             {
    204                 **self.base_headers(),
    205                 **(headers if headers is not None else {}),
    206                 **(request_options.get(\"additional_headers\", {}) or {} if request_options is not None else {}),
    207             }
    208         )
    209     ),
    210     params=encode_query(
    211         jsonable_encoder(
    212             remove_none_from_dict(
    213                 remove_omit_from_dict(
    214                     {
    215                         **(params if params is not None else {}),
    216                         **(
    217                             request_options.get(\"additional_query_parameters\", {}) or {}
    218                             if request_options is not None
    219                             else {}
    220                         ),
    221                     },
    222                     omit,
    223                 )
    224             )
    225         )
    226     ),
    227     json=json_body,
    228     data=data_body,
    229     content=content,
    230     files=(
    231         convert_file_dict_to_httpx_tuples(remove_omit_from_dict(remove_none_from_dict(files), omit))
    232         if (files is not None and files is not omit)
    233         else None
    234     ),
    235     timeout=timeout,
    236 )
    238 max_retries: int = request_options.get(\"max_retries\", 0) if request_options is not None else 0
    239 if _should_retry(response=response):

File ~/anaconda3/lib/python3.11/site-packages/httpx/_client.py:825, in Client.request(self, method, url, content, data, files, json, params, headers, cookies, auth, follow_redirects, timeout, extensions)
    810     warnings.warn(message, DeprecationWarning, stacklevel=2)
    812 request = self.build_request(
    813     method=method,
    814     url=url,
   (...)
    823     extensions=extensions,
    824 )
--> 825 return self.send(request, auth=auth, follow_redirects=follow_redirects)

File ~/anaconda3/lib/python3.11/site-packages/httpx/_client.py:914, in Client.send(self, request, stream, auth, follow_redirects)
    910 self._set_timeout(request)
    912 auth = self._build_request_auth(request, auth)
--> 914 response = self._send_handling_auth(
    915     request,
    916     auth=auth,
    917     follow_redirects=follow_redirects,
    918     history=[],
    919 )
    920 try:
    921     if not stream:

File ~/anaconda3/lib/python3.11/site-packages/httpx/_client.py:942, in Client._send_handling_auth(self, request, auth, follow_redirects, history)
    939 request = next(auth_flow)
    941 while True:
--> 942     response = self._send_handling_redirects(
    943         request,
    944         follow_redirects=follow_redirects,
    945         history=history,
    946     )
    947     try:
    948         try:

File ~/anaconda3/lib/python3.11/site-packages/httpx/_client.py:979, in Client._send_handling_redirects(self, request, follow_redirects, history)
    976 for hook in self._event_hooks[\"request\"]:
    977     hook(request)
--> 979 response = self._send_single_request(request)
    980 try:
    981     for hook in self._event_hooks[\"response\"]:

File ~/anaconda3/lib/python3.11/site-packages/httpx/_client.py:1014, in Client._send_single_request(self, request)
   1009     raise RuntimeError(
   1010         \"Attempted to send an async request with a sync Client instance.\"
   1011     )
   1013 with request_context(request=request):
-> 1014     response = transport.handle_request(request)
   1016 assert isinstance(response.stream, SyncByteStream)
   1018 response.request = request

File ~/anaconda3/lib/python3.11/site-packages/httpx/_transports/default.py:249, in HTTPTransport.handle_request(self, request)
    235 import httpcore
    237 req = httpcore.Request(
    238     method=request.method,
    239     url=httpcore.URL(
   (...)
    247     extensions=request.extensions,
    248 )
--> 249 with map_httpcore_exceptions():
    250     resp = self._pool.handle_request(req)
    252 assert isinstance(resp.stream, typing.Iterable)

File ~/anaconda3/lib/python3.11/contextlib.py:155, in _GeneratorContextManager.__exit__(self, typ, value, traceback)
    153     value = typ()
    154 try:
--> 155     self.gen.throw(typ, value, traceback)
    156 except StopIteration as exc:
    157     # Suppress StopIteration *unless* it's the same exception that
    158     # was passed to throw().  This prevents a StopIteration
    159     # raised inside the \"with\" statement from being suppressed.
    160     return exc is not value

File ~/anaconda3/lib/python3.11/site-packages/httpx/_transports/default.py:118, in map_httpcore_exceptions()
    115     raise
    117 message = str(exc)
--> 118 raise mapped_exc(message) from exc

ReadTimeout: timed out"
}
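
For completeness, the trace goes through letta_client/core/http_client.py, which only uses a longer timeout when a "timeout_in_seconds" value is supplied via request_options and otherwise falls back to the client's base timeout. A minimal, untested sketch to rule out a plain client-side timeout while the model loads (600 is an arbitrary test value, not a recommended setting):

from letta_client import MessageCreate

client.agents.messages.send(
    agent_id=agent.id,
    messages=[MessageCreate(role="user", text="why is the sky blue?")],
    # the http_client.py frame in the trace reads this key before falling
    # back to the client's default timeout
    request_options={"timeout_in_seconds": 600},
)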

My ollama is serving on port 11434, and on a freshly started server I can test it with:

curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2",
  "prompt":"Why is the sky blue?"
}'

But when I make the request through letta I don't get any response back, and nothing is shown in the letta logs. It seems the process fails somewhere in the ollama request; I'm guessing it's because of some issue in message formatting. The ollama log output at that point is:

llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Llama 3.2 3B Instruct
llama_model_loader: - kv   3:                           general.finetune str              = Instruct
llama_model_loader: - kv   4:                           general.basename str              = Llama-3.2
llama_model_loader: - kv   5:                         general.size_label str              = 3B
llama_model_loader: - kv   6:                               general.tags arr[str,6]       = ["facebook", "meta", "pytorch", "llam...
llama_model_loader: - kv   7:                          general.languages arr[str,8]       = ["en", "de", "fr", "it", "pt", "hi", ...
llama_model_loader: - kv   8:                          llama.block_count u32              = 28
llama_model_loader: - kv   9:                       llama.context_length u32              = 131072
llama_model_loader: - kv  10:                     llama.embedding_length u32              = 3072
llama_model_loader: - kv  11:                  llama.feed_forward_length u32              = 8192
llama_model_loader: - kv  12:                 llama.attention.head_count u32              = 24
llama_model_loader: - kv  13:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv  14:                       llama.rope.freq_base f32              = 500000.000000
llama_model_loader: - kv  15:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  16:                 llama.attention.key_length u32              = 128
llama_model_loader: - kv  17:               llama.attention.value_length u32              = 128
llama_model_loader: - kv  18:                          general.file_type u32              = 15
llama_model_loader: - kv  19:                           llama.vocab_size u32              = 128256
llama_model_loader: - kv  20:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv  21:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  22:                         tokenizer.ggml.pre str              = llama-bpe
llama_model_loader: - kv  23:                      tokenizer.ggml.tokens arr[str,128256]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  24:                  tokenizer.ggml.token_type arr[i32,128256]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
time=2025-01-17T12:45:08.841+05:30 level=INFO source=server.go:589 msg="waiting for server to become available" status="llm server loading model"
llama_model_loader: - kv  25:                      tokenizer.ggml.merges arr[str,280147]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv  26:                tokenizer.ggml.bos_token_id u32              = 128000
llama_model_loader: - kv  27:                tokenizer.ggml.eos_token_id u32              = 128009
llama_model_loader: - kv  28:                    tokenizer.chat_template str              = {{- bos_token }}\n{%- if custom_tools ...
llama_model_loader: - kv  29:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   58 tensors
llama_model_loader: - type q4_K:  168 tensors
llama_model_loader: - type q6_K:   29 tensors
llm_load_vocab: special tokens cache size = 256
llm_load_vocab: token to piece cache size = 0.7999 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 128256
llm_load_print_meta: n_merges         = 280147
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 131072
llm_load_print_meta: n_embd           = 3072
llm_load_print_meta: n_layer          = 28
llm_load_print_meta: n_head           = 24
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 3
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 8192
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 500000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 131072
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: ssm_dt_b_c_rms   = 0
llm_load_print_meta: model type       = 3B
llm_load_print_meta: model ftype      = Q4_K - Medium
llm_load_print_meta: model params     = 3.21 B
llm_load_print_meta: model size       = 1.87 GiB (5.01 BPW) 
llm_load_print_meta: general.name     = Llama 3.2 3B Instruct
llm_load_print_meta: BOS token        = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token        = 128009 '<|eot_id|>'
llm_load_print_meta: EOT token        = 128009 '<|eot_id|>'
llm_load_print_meta: EOM token        = 128008 '<|eom_id|>'
llm_load_print_meta: LF token         = 128 'Ä'
llm_load_print_meta: EOG token        = 128008 '<|eom_id|>'
llm_load_print_meta: EOG token        = 128009 '<|eot_id|>'
llm_load_print_meta: max token length = 256
llm_load_tensors: offloading 19 repeating layers to GPU
llm_load_tensors: offloaded 19/29 layers to GPU
llm_load_tensors:   CPU_Mapped model buffer size =  1918.35 MiB
llm_load_tensors:        CUDA0 model buffer size =  1096.05 MiB
llama_new_context_with_model: n_seq_max     = 1
llama_new_context_with_model: n_ctx         = 131072
llama_new_context_with_model: n_ctx_per_seq = 131072
llama_new_context_with_model: n_batch       = 512
llama_new_context_with_model: n_ubatch      = 512
llama_new_context_with_model: flash_attn    = 0
llama_new_context_with_model: freq_base     = 500000.0
llama_new_context_with_model: freq_scale    = 1
llama_kv_cache_init: kv_size = 131072, offload = 1, type_k = 'f16', type_v = 'f16', n_layer = 28, can_shift = 1
llama_kv_cache_init:        CPU KV buffer size =  4608.00 MiB
llama_kv_cache_init:      CUDA0 KV buffer size =  9728.00 MiB
llama_new_context_with_model: KV self size  = 14336.00 MiB, K (f16): 7168.00 MiB, V (f16): 7168.00 MiB
llama_new_context_with_model:        CPU  output buffer size =     0.50 MiB
llama_new_context_with_model:      CUDA0 compute buffer size =  7197.06 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =   262.01 MiB
llama_new_context_with_model: graph nodes  = 902
llama_new_context_with_model: graph splits = 104 (with bs=512), 3 (with bs=1)
time=2025-01-17T12:45:11.604+05:30 level=INFO source=server.go:594 msg="llama runner started in 3.02 seconds"

It freezes at this point and I can't call ollama anymore. If I kill the ollama process with Ctrl+C, I get:

[GIN] 2025/01/17 - 12:48:03 | 500 |         2m55s |       127.0.0.1 | POST     "/api/generate"
Some additional context:
  • OS is an Ubuntu server
  • Letta is running with the command sudo docker run --network=host -v ~/.letta/.persist/pgdata:/var/lib/postgresql/data -e OLLAMA_BASE_URL="http://localhost:11434" letta/letta:latest, which worked best for me since I wasn't able to link ollama to letta otherwise.
@ar5entum (Author)

Inside https://github.com/letta-ai/letta/blob/main/letta/local_llm/ollama/api.py I found that letta was making the request below. It seemed to me that the "options" parameter was causing the trouble. I commented that part out and the request went through, but llama3.2 was unable to produce any responses (it was stuck in perpetual thought). Changing the model to Gemma solved the issue; I then reverted the code and it kept working fine.

request = {
    ## base parameters
    "model": model,
    "prompt": prompt,
    # "images": [],  # TODO eventually support
    ## advanced parameters
    # "format": "json",  # TODO eventually support
    "stream": False,
    "options": settings,
    "raw": True,  # no prompt formatting
    # "raw mode does not support template, system, or context"
    # "system": "",  # no prompt formatting
    # "template": "{{ .Prompt }}",  # no prompt formatting
    # "context": None,  # no memory via prompt formatting
}
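
To check whether this "options"/"raw" payload on its own is what hangs ollama, a request of the same shape can be replayed directly against the API. This is an untested sketch; the options value below is a placeholder rather than the exact settings letta sends, although the context_window of 131072 from the LlmConfig presumably ends up here as num_ctx, which matches the 131072-token KV cache ollama allocates in the log above:

curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2",
  "prompt": "Why is the sky blue?",
  "stream": false,
  "raw": true,
  "options": {"num_ctx": 131072}
}'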
