
Agent search node timeout #3937

Closed · wants to merge 10 commits

Conversation

@pablonyx (Contributor) commented on Feb 7, 2025

Description

Adds timeouts for Agent Search. The goals are to (i) avoid hanging on a stuck validation step and (ii) keep the user informed, rather than left wondering whether the process is broken, when an LLM call hangs.

Strategy:

  • added optional timeouts to all LLM calls (.invoke and .stream) in Agent Search
  • modified all call sites accordingly, threading the timeout down to the litellm call (see the sketch below)
  • rely on litellm to enforce the timeout against the LLM provider
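
A minimal sketch of how such an optional timeout can be threaded down to litellm (the timeout_override parameter name follows the review discussion below; the wrapper class and default value are illustrative, not the PR's actual code):

import litellm

# Illustrative default; the PR defines per-step constants for this.
DEFAULT_LLM_TIMEOUT = 30  # in seconds


class SimpleLLM:
    def __init__(self, model: str, timeout: int = DEFAULT_LLM_TIMEOUT):
        self.model = model
        self.timeout = timeout

    def invoke(self, prompt: str, timeout_override: int | None = None) -> str:
        # litellm.completion accepts a `timeout` kwarg and raises a timeout
        # error if the provider does not respond in time.
        response = litellm.completion(
            model=self.model,
            messages=[{"role": "user", "content": prompt}],
            timeout=timeout_override or self.timeout,
        )
        return response.choices[0].message.content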

Behavior:

  • if a validation LLM call times out, assume 'True' and continue
  • if answering a subquestion times out, that subquestion is effectively ignored downstream
  • if the initial/refined answer generation step fails, show an error message but keep the results produced so far

How Has This Been Tested?

  • locally, by artificially setting timeouts to 0.2 seconds

Still in the works: behavior when question generation hangs.


Backporting (check the box to trigger backport action)

Note: You have to check that the action passes, otherwise resolve the conflicts manually and tag the patches.

  • This PR should be backported (make sure to check that the backport attempt succeeds)
  • [Optional] Override Linear Check

vercel bot commented Feb 7, 2025

internal-search: ✅ Ready (preview updated Feb 11, 2025 1:18am UTC)

@yuhongsun96 (Contributor) left a comment

Two main points, the rest are nits:

  • We should have a single point where we catch timeouts, instead of handling openai and other provider-specific exceptions directly up the stack
  • We should fail loudly in other exception cases to avoid hidden errors

quality_str: str = merge_message_runs(response, chunk_separator="")[0].content
answer_quality = "yes" in quality_str.lower()

We should not be using string matches for logic like this, or we should wrap it in a utility function that is well encapsulated.


Agreed, created a function called binary_string_test(text, positive_value="yes") for that now.
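
A plausible shape for that helper (illustrative sketch; the PR's actual implementation may differ):

def binary_string_test(text: str, positive_value: str = "yes") -> bool:
    """Check whether a yes/no-style LLM answer contains the positive value.

    Encapsulates the string matching so call sites don't scatter ad-hoc
    '"yes" in text.lower()' checks.
    """
    return positive_value.lower() in text.lower()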


    quality_str: str = merge_message_runs(response, chunk_separator="")[0].content
    answer_quality = "yes" in quality_str.lower()
except openai.APITimeoutError:

Since we support a lot of different LLMs and they may raise different errors, should there be a central place to catch these library specific timeout errors (like around the LLM layer) and raise our own general LLMTimeoutError?

That way the rest of the code can use the centralized exception.


Done. LiteLLM errors are caught in chat_llm.py and re-raised as our own errors, LLMTimeoutError and LLMRateLimitError.
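
A minimal sketch of that translation layer, assuming litellm's exception types (litellm.exceptions.Timeout and litellm.exceptions.RateLimitError are real litellm exceptions; the surrounding function is illustrative, not the PR's actual code):

import litellm


class LLMTimeoutError(Exception):
    """Raised when the underlying LLM call times out, regardless of provider."""


class LLMRateLimitError(Exception):
    """Raised when the underlying LLM provider rate-limits the call."""


def completion_with_error_translation(**kwargs) -> litellm.ModelResponse:
    # Catch litellm's provider-agnostic errors here, once, so the rest of
    # the codebase only ever deals with our own exception types.
    try:
        return litellm.completion(**kwargs)
    except litellm.exceptions.Timeout as e:
        raise LLMTimeoutError("LLM call timed out") from e
    except litellm.exceptions.RateLimitError as e:
        raise LLMRateLimitError("LLM call was rate limited") from e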

agent_error: AgentError | None = None
response: list | None = None
try:
    response = list(

If we don't need to capture individual tokens, there's an invoke function that gathers all of the tokens. That should be more suitable for this.
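
Illustrative contrast, assuming the LLM interface described in this PR:

# Before: materializing a token stream just to get the full response.
response = list(llm.stream(prompt))

# After: invoke gathers the tokens internally and returns the full message.
response = llm.invoke(prompt)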


Done

@@ -1,6 +1,7 @@
from datetime import datetime
from typing import cast

import openai

See comment below; it would allow us to avoid importing openai here, or LLM specifics in other modules, once this is correctly centralized.


done

cited_documents = [
    context_docs[id] for id in answer_citation_ids if id < len(context_docs)
]
log_results = ""

An explicit null probably makes more sense here.


Just for logging. But done.
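
i.e., something like (illustrative):

log_results: str | None = None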

)
except openai.APITimeoutError:
    return (
        history  # this is what is done at this point anyway, so wwe default to this

nit: typo here and below


done

        history  # this is what is done at this point anyway, so wwe default to this
    )

except Exception:

I wonder if it's better to fail loudly and propagate up the stack when it's not an exception we are explicitly handling, here and in other places. It prevents silent failures from lingering for a long time.


In general, I agree. Here, however, it is only a nice-to-have.
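
The pattern the reviewer is suggesting, sketched (illustrative; LLMTimeoutError stands in for whatever exception is explicitly handled):

try:
    response = llm.invoke(prompt)
except LLMTimeoutError:
    # Expected failure mode: fall back to the documented default behavior.
    return history
# Anything else propagates up the stack and fails loudly, instead of
# being swallowed by a bare `except Exception`.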


fwiw except Exception makes me very sad :'(


DONE!!

@@ -13,6 +13,20 @@
AGENT_DEFAULT_MAX_ANSWER_CONTEXT_DOCS = 10
AGENT_DEFAULT_MAX_STATIC_HISTORY_WORD_LENGTH = 2000

AGENT_DEFAULT_TIMEOUT_OVERWRITE_LLM_GENERAL_GENERATION = 1 # in seconds

I think in these cases it's actually better to just use the number directly in the code below 😰. But whatever, doesn't matter.


I prefer it as is, to see all the parameters in one place. In fact, I think we should do that in general.
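
The two styles under discussion, for concreteness (illustrative; the constant name is the one from the diff above, pre-rename):

# As in the PR: the tunable lives with the other agent parameters.
AGENT_DEFAULT_TIMEOUT_OVERWRITE_LLM_GENERAL_GENERATION = 1  # in seconds

# The alternative: inline the literal at the single call site.
response = llm.invoke(prompt, timeout_overwrite=1)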

) -> BaseMessage:
    if LOG_DANSWER_MODEL_INTERACTIONS:
        self.log_model_configs()

    response = cast(
        litellm.ModelResponse,
        self._completion(
-           prompt, tools, tool_choice, False, structured_response_format
+           prompt,

Not that this matters, and I'll stop pointing these out, but slight preference for named args in cases where there are many.


It was like that beforehand. But done.
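
i.e., roughly (illustrative; assumes the positional False is a stream flag, and that the timeout parameter is the one this PR adds):

response = cast(
    litellm.ModelResponse,
    self._completion(
        prompt=prompt,
        tools=tools,
        tool_choice=tool_choice,
        stream=False,
        structured_response_format=structured_response_format,
        timeout_overwrite=timeout_overwrite,
    ),
)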

@@ -5,8 +5,9 @@
NO_RECOVERED_DOCS = "No relevant information recovered"
YES = "yes"
NO = "no"


AGENT_LLM_TIMEOUT_MESSAGE = "The agent timed out. Please try again."

These aren't prompts though; prefer a constants.py file under the agents directory.


done
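
For reference, the move might look like this (hypothetical path, based on the suggestion above):

# backend/onyx/agents/constants.py
AGENT_LLM_TIMEOUT_MESSAGE = "The agent timed out. Please try again."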

@evan-danswer (Contributor) left a comment

A few small streaming things need changes

prompt=msg,
timeout_overwrite=AGENT_TIMEOUT_OVERWRITE_LLM_SUBANSWER_CHECK,

prefer timeout_override


backend/onyx/llm/chat_llm.py (thread resolved)
tools,
tool_choice,
structured_response_format,
timeout_overwrite,

override


done

<></>
) : isComplete ? (
error && (
<p className="mt-2 mx-4 text-red-700 text-sm my-auto">

maybe could have a cute lil error component here? Would make me happier by slightly reducing indentation :')


 - overwrite -> override
 - enums for error types
 - some nits
@evan-danswer (Contributor):
Rolled into #3994
