
Agent search node timeout #3937

Closed · wants to merge 10 commits

Conversation

@pablonyx (Contributor) commented on Feb 7, 2025

Description

Adds timeouts for Agent Search. The goals are to (i) avoid hanging on a stuck validation step and (ii) keep the user informed, rather than left wondering whether the process is broken, when an LLM call hangs.

Strategy:

  • added optional timeouts to all LLM calls (.invoke and .stream) in Agent Search
  • modified all call sites accordingly, threading the timeout down to the litellm call (see the sketch below)
  • rely on litellm to enforce the timeout against the LLM provider
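
A minimal sketch of how such an optional timeout can be threaded down to litellm (the timeout_override parameter name follows the review discussion below; the wrapper class and default value are illustrative, not the PR's actual code):

import litellm

# Illustrative default; the PR defines per-step constants for this.
DEFAULT_LLM_TIMEOUT = 30  # in seconds


class SimpleLLM:
    def __init__(self, model: str, timeout: int = DEFAULT_LLM_TIMEOUT):
        self.model = model
        self.timeout = timeout

    def invoke(self, prompt: str, timeout_override: int | None = None) -> str:
        # litellm.completion accepts a `timeout` kwarg and raises a timeout
        # error if the provider does not respond in time.
        response = litellm.completion(
            model=self.model,
            messages=[{"role": "user", "content": prompt}],
            timeout=timeout_override or self.timeout,
        )
        return response.choices[0].message.content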

Behavior:

  • if a validation LLM call times out, assume 'True' and continue
  • if answering a subquestion times out, that subquestion is effectively ignored downstream
  • if the initial/refined answer generation step fails, show an error message but keep the results produced so far

How Has This Been Tested?

  • locally, by artificially setting timeouts to 0.2 seconds

Still in the works: behavior when question generation hangs.


Backporting (check the box to trigger backport action)

Note: You have to check that the action passes, otherwise resolve the conflicts manually and tag the patches.

  • This PR should be backported (make sure to check that the backport attempt succeeds)
  • [Optional] Override Linear Check

vercel bot commented Feb 7, 2025

internal-search: ✅ Ready (preview updated Feb 11, 2025 1:18am UTC)

@yuhongsun96 (Contributor) left a comment

Two main points, the rest are nits:

  • We should have a single point where we catch timeouts, instead of handling openai and other provider-specific exceptions directly up the stack
  • We should fail loudly in other exception cases to avoid hidden errors

quality_str: str = merge_message_runs(response, chunk_separator="")[0].content
answer_quality = "yes" in quality_str.lower()

We should not be using string matches for logic like this, or we should wrap it in a utility function that is well encapsulated.


Agreed, created a function called binary_string_test(text, positive_value="yes") for that now.
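
A plausible shape for that helper (illustrative sketch; the PR's actual implementation may differ):

def binary_string_test(text: str, positive_value: str = "yes") -> bool:
    """Check whether a yes/no-style LLM answer contains the positive value.

    Encapsulates the string matching so call sites don't scatter ad-hoc
    '"yes" in text.lower()' checks.
    """
    return positive_value.lower() in text.lower()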


    quality_str: str = merge_message_runs(response, chunk_separator="")[0].content
    answer_quality = "yes" in quality_str.lower()
except openai.APITimeoutError:

Since we support a lot of different LLMs and they may raise different errors, should there be a central place to catch these library specific timeout errors (like around the LLM layer) and raise our own general LLMTimeoutError?

That way the rest of the code can use the centralized exception.


Done. LiteLLM errors are caught in chat_llm.py and re-raised as our own errors, LLMTimeoutError and LLMRateLimitError.
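
A minimal sketch of that translation layer, assuming litellm's exception types (litellm.exceptions.Timeout and litellm.exceptions.RateLimitError are real litellm exceptions; the surrounding function is illustrative, not the PR's actual code):

import litellm


class LLMTimeoutError(Exception):
    """Raised when the underlying LLM call times out, regardless of provider."""


class LLMRateLimitError(Exception):
    """Raised when the underlying LLM provider rate-limits the call."""


def completion_with_error_translation(**kwargs) -> litellm.ModelResponse:
    # Catch litellm's provider-agnostic errors here, once, so the rest of
    # the codebase only ever deals with our own exception types.
    try:
        return litellm.completion(**kwargs)
    except litellm.exceptions.Timeout as e:
        raise LLMTimeoutError("LLM call timed out") from e
    except litellm.exceptions.RateLimitError as e:
        raise LLMRateLimitError("LLM call was rate limited") from e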

agent_error: AgentError | None = None
response: list | None = None
try:
    response = list(

If we don't need to capture individual tokens, there's an invoke function that gathers all of the tokens. That should be more suitable for this.
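
Illustrative contrast, assuming the LLM interface described in this PR:

# Before: materializing a token stream just to get the full response.
response = list(llm.stream(prompt))

# After: invoke gathers the tokens internally and returns the full message.
response = llm.invoke(prompt)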


Done

@@ -1,6 +1,7 @@
from datetime import datetime
from typing import cast

import openai

See comment below; it would allow us to avoid importing openai here, or LLM specifics in other modules, once this is correctly centralized.


done

cited_documents = [
    context_docs[id] for id in answer_citation_ids if id < len(context_docs)
]
log_results = ""

An explicit null probably makes more sense here.


Just for logging. But done.
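
i.e., something like (illustrative):

log_results: str | None = None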

)
except openai.APITimeoutError:
    return (
        history  # this is what is done at this point anyway, so wwe default to this

nit: typo here and below


done

        history  # this is what is done at this point anyway, so wwe default to this
    )

except Exception:

I wonder if it's better to fail loudly and propagate up the stack when it's not an exception we are explicitly handling, here and in other places. It prevents silent failures from lingering for a long time.


In general, I agree. Here, however, it is only a nice-to-have.
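
The pattern the reviewer is suggesting, sketched (illustrative; LLMTimeoutError stands in for whatever exception is explicitly handled):

try:
    response = llm.invoke(prompt)
except LLMTimeoutError:
    # Expected failure mode: fall back to the documented default behavior.
    return history
# Anything else propagates up the stack and fails loudly, instead of
# being swallowed by a bare `except Exception`.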


fwiw except Exception makes me very sad :'(


DONE!!

@@ -13,6 +13,20 @@
AGENT_DEFAULT_MAX_ANSWER_CONTEXT_DOCS = 10
AGENT_DEFAULT_MAX_STATIC_HISTORY_WORD_LENGTH = 2000

AGENT_DEFAULT_TIMEOUT_OVERWRITE_LLM_GENERAL_GENERATION = 1 # in seconds

I think in these cases it's actually better to just use the number directly in the code below 😰. But whatever, doesn't matter.


I prefer it as is, to see all the parameters in one place. In fact, I think we should do that in general.
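
The two styles under discussion, for concreteness (illustrative; the constant name is the one from the diff above, pre-rename):

# As in the PR: the tunable lives with the other agent parameters.
AGENT_DEFAULT_TIMEOUT_OVERWRITE_LLM_GENERAL_GENERATION = 1  # in seconds

# The alternative: inline the literal at the single call site.
response = llm.invoke(prompt, timeout_overwrite=1)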

) -> BaseMessage:
    if LOG_DANSWER_MODEL_INTERACTIONS:
        self.log_model_configs()

    response = cast(
        litellm.ModelResponse,
        self._completion(
-           prompt, tools, tool_choice, False, structured_response_format
+           prompt,

Not that this matters, and I'll stop pointing these out, but slight preference for named args in cases where there are many.


It was like that beforehand. But done.
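
i.e., roughly (illustrative; assumes the positional False is a stream flag, and that the timeout parameter is the one this PR adds):

response = cast(
    litellm.ModelResponse,
    self._completion(
        prompt=prompt,
        tools=tools,
        tool_choice=tool_choice,
        stream=False,
        structured_response_format=structured_response_format,
        timeout_overwrite=timeout_overwrite,
    ),
)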

@@ -5,8 +5,9 @@
NO_RECOVERED_DOCS = "No relevant information recovered"
YES = "yes"
NO = "no"


AGENT_LLM_TIMEOUT_MESSAGE = "The agent timed out. Please try again."

These aren't prompts though; prefer a constants.py file under the agents directory.


done
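
For reference, the move might look like this (hypothetical path, based on the suggestion above):

# backend/onyx/agents/constants.py
AGENT_LLM_TIMEOUT_MESSAGE = "The agent timed out. Please try again."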

@evan-danswer (Contributor) left a comment

A few small streaming things need changes

prompt=msg,
timeout_overwrite=AGENT_TIMEOUT_OVERWRITE_LLM_SUBANSWER_CHECK,

prefer timeout_override


backend/onyx/llm/chat_llm.py (thread resolved)
tools,
tool_choice,
structured_response_format,
timeout_overwrite,

override


done

<></>
) : isComplete ? (
error && (
<p className="mt-2 mx-4 text-red-700 text-sm my-auto">

maybe could have a cute lil error component here? Would make me happier by slightly reducing indentation :')


 - overwrite -> override
 - enums for error types
 - some nits
@evan-danswer (Contributor):
Rolled into #3994
