
Conversation


@wbtek wbtek commented Dec 22, 2025

llama-server has a hardcoded HTTP timeout of 300 seconds (the library default). Time to First Token (TTFT) frequently exceeds this on consumer hardware or when processing attached files. It is basically impossible to load any reasonable amount of code for analysis with CPU-only hardware and a model large enough to be useful.

When this happens, the connection is dropped, the WebUI gets out of sync, etc.

This PR extends the timeout to 90 days, effectively hiding the problem and allowing llama-server to actually be used.

I would call this a stopgap, but it is an urgently needed one.

Note: Manually dropping a model using the WebUI can get stuck if the model is busy. Perhaps llama-server's model-dropping code should be made more forceful, with or without this simple fix in place.

Collaborator

@ngxson ngxson left a comment


the fix is correct but the comment is wrong: TTFT refers to the time from sending the request to the first token being generated. it is made up of:

  • time to read the HTTP body
  • time to parse the request and do any pre-processing, like tokenization
  • time to wait for the LLM inference to start, stop, etc.
  • time to send back the HTTP response

set_read_timeout is simply the timeout for reading the HTTP body; it has nothing to do with the LLM
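
(For context, a minimal sketch of where this timeout lives, assuming the cpp-httplib API that llama-server's HTTP layer builds on; the 300-second figure is the value discussed above, not copied from the actual source.)

#include "httplib.h"  // cpp-httplib

int main() {
    httplib::Server srv;

    // Transport-level timeouts: how long a single read/write on the socket
    // may block, not how long the model may take to produce a token.
    srv.set_read_timeout (300, 0);   // seconds, microseconds
    srv.set_write_timeout(300, 0);

    srv.Get("/health", [](const httplib::Request &, httplib::Response & res) {
        res.set_content("ok", "text/plain");
    });

    srv.listen("0.0.0.0", 8080);
    return 0;
}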

Collaborator

@ngxson ngxson left a comment


this won't work as-is because there is also a timeout imposed by the model instance in server-http.cpp

instead, please set set_read_timeout and set_write_timeout according to the input params (do NOT hard-code them to 90 days):

srv->set_read_timeout (params.timeout_read);
srv->set_write_timeout(params.timeout_write);

Edit: for the client cli, set_write_timeout should be set to params.timeout_read (reversed relative to the server)
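
(A sketch of how this could look at both call sites, assuming the common_params fields used in the snippet above, timeout_read / timeout_write in seconds; the client-side read line is the symmetric counterpart and is an assumption, not stated in the comment.)

// Server side (server-http.cpp): honour the CLI-provided timeouts instead of
// the library's 300 s default or any hard-coded value.
srv->set_read_timeout (params.timeout_read);
srv->set_write_timeout(params.timeout_write);

// Client side: the roles are reversed -- the client's write corresponds to the
// server's read and vice versa, so the parameters swap.
cli->set_read_timeout (params.timeout_write);  // assumption: counterpart of the line below
cli->set_write_timeout(params.timeout_read);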

Collaborator

@ngxson ngxson left a comment


keep the code vertically aligned

Author

wbtek commented Dec 22, 2025

@ngxson There ya go! :)

A 90-day timeout is ugly whether hardcoded or passed on the command line. It is a band-aid.

That timeout is for communications; it shouldn't be waiting on the LLM, but in practice it is. If the LLM doesn't send its first response token before the timeout expires, the request fails.

Maybe:

  1. If the timeout fires, check that everything is still OK and keep waiting,
     or
  2. Send a "continue" message periodically to reset the timer (a rough sketch of this follows below).
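
(Purely as an illustration of option 2, not code from this PR: a keep-alive could be an SSE comment line that clients ignore but that keeps bytes flowing on the socket, assuming the streaming path writes through an httplib::DataSink.)

#include <string>
#include "httplib.h"  // for httplib::DataSink

// Illustrative sketch only -- not part of this PR.
static bool send_keepalive(httplib::DataSink & sink) {
    // SSE comment lines (leading ':') are ignored by clients, but writing them
    // resets the peer's idle timer while the model is still prefilling.
    static const std::string ping = ": keep-alive\n\n";
    return sink.write(ping.data(), ping.size());
}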

Also, model unload won't complete until cli->send() finishes. I think model unload needs to try harder.

Author

wbtek commented Dec 22, 2025

keep the code vertically aligned

Yeah, that's ugly. Sorry about that; it looked fine before it got to GitHub. Spacing fixed.

Author

wbtek commented Dec 23, 2025

--- Testing ---
Model: Qwen3-Next-80B-A3B-Instruct-Q5_K_L.gguf (57 GB)
Attachments: 58 kB of llama-server C++ source code
Hardware: CPU only, Intel, 64 GB RAM (it runs!)
Prompt: "What is 1+1?" (a short answer is wanted; the test only requires one character to succeed)

  1. Shorten the timeout with -to 77 to prove the code uses the command-line argument. Fails correctly at 77 seconds.

  2. Prove the original and related problems are still corrected with a prompt plus attachments requiring >600 seconds before the first response. The default timeout used to come from the HTTP library (300 s) but now comes from llama-server (600 s), so set the timeout to a practical infinity with -to 7777777 (~90 days). The model responds after 780 seconds with no failure.

  3. Show that the default now comes from llama-server (600 s). Same as test 2 but with -to removed entirely. The model fails correctly at exactly 600 seconds (10 minutes).

@wbtek wbtek requested a review from ngxson December 23, 2025 10:01