Prevent crash if TTFT >300sec, boosted to 90 days #18279
Conversation
the fix is correct but the comment is wrong: TTFT refers to the time it takes from sending the request to the first token being generated. It is made up of:
- time to read the HTTP body
- time to parse the request and do any pre-processing, like tokenization
- time waiting for the LLM inference to start, stop, etc.
- time to send back the HTTP response

set_read_timeout is simply the timeout for reading the HTTP body; it has nothing to do with the LLM.
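To make that scope concrete, here is a minimal sketch using cpp-httplib (the HTTP library llama-server is built on); the endpoint, timeout value, and handler are illustrative, not llama.cpp's actual code:

```cpp
#include "httplib.h"

int main() {
    httplib::Server srv;

    // Applies only while the server is reading the HTTP request body
    // from the socket; it does not bound anything the handler does.
    srv.set_read_timeout(300, 0); // seconds, microseconds

    srv.Post("/completion", [](const httplib::Request &, httplib::Response & res) {
        // Tokenization and LLM inference happen here, *after* the body
        // has been fully read, so set_read_timeout does not cover them.
        res.set_content("{}", "application/json");
    });

    srv.listen("127.0.0.1", 8080);
}
```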
this won't work as-is because there is also a timeout imposed by the model instance in server-http.cpp
instead, please set set_read_timeout and set_write_timeout according to the input params (do NOT hard-code them to 90 days):
llama.cpp/tools/server/server-http.cpp, lines 105 to 106 in 147a521:
```cpp
srv->set_read_timeout (params.timeout_read);
srv->set_write_timeout(params.timeout_write);
```
Edit: for the client CLI, set_write_timeout should be set to params.timeout_read (reversed relative to the server), as sketched below.
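A sketch of the suggested change (the `params` fields come from the snippet above; the `srv`/`cli` variables and the read-side assignment for the client are assumptions, not part of the review):

```cpp
// server (tools/server/server-http.cpp already does this):
srv->set_read_timeout (params.timeout_read);
srv->set_write_timeout(params.timeout_write);

// client CLI: roles are reversed relative to the server, so the
// client's write timeout takes the server-read parameter ...
cli->set_write_timeout(params.timeout_read);
// ... and, by the same logic (assumed, not stated in the review),
// its read timeout takes the server-write parameter.
cli->set_read_timeout (params.timeout_write);
```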
ngxson left a comment:
keep the code vertically aligned
@ngxson There ya go! :) A 90-day timeout is ugly whether hardcoded or on the command line. It is a band-aid. That timeout is for communications; it shouldn't be waiting for the LLM, but it is. If the LLM doesn't send its first response token before that timeout, the request fails.
Also, model unload won't complete until cli->send() finishes. I think model unload needs to try harder.
Yeah, that sux. Sry 'bout that. It looked ok before it got to GitHub. Spacing fixed.
force-pushed from d49cd26 to 5c037f5
--- Testing ---
llama-server has a hardcoded HTTP timeout of 300 seconds (the library default). Time to First Token (TTFT) frequently exceeds this on consumer hardware or when processing attached files. It is basically impossible to load any reasonable amount of code for analysis with CPU-only hardware and a model large enough to be useful.
When this happens, the connection is dropped and the WebUI gets out of sync.
This PR extends the timeout to 90 days, effectively hiding the problem and allowing llama-server to actually be used.
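For illustration, a minimal sketch of the band-aid approach, assuming cpp-httplib's timeout setters (the helper name and client variable are hypothetical; this is not the exact diff):

```cpp
#include "httplib.h"

// 90 days, expressed in seconds.
constexpr time_t LONG_TIMEOUT_SEC = 90 * 24 * 60 * 60;

// Raise both timeouts far beyond cpp-httplib's 300 s default so a slow
// Time to First Token no longer drops the connection.
static void apply_long_timeouts(httplib::Client & cli) {
    cli.set_read_timeout (LONG_TIMEOUT_SEC, 0);
    cli.set_write_timeout(LONG_TIMEOUT_SEC, 0);
}
```

Per the review above, a cleaner version would take these values from the existing timeout parameters rather than a hard-coded constant.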
I would call this a stopgap, but it is urgently needed.
Note: Manually dropping a model from the WebUI can get stuck if the model is busy. Perhaps llama-server's model-dropping code should be made more forceful, with or without this simple fix in place.