Poor robustness in handling <links> output. #534

Open
OverrideTuring opened this issue Dec 26, 2024 · 3 comments
Labels
bug Something isn't working

Comments

@OverrideTuring

Describe the bug

I frequently encounter the error message: "Error at generating documents from links: Invalid URL" in the console. Although it doesn't happen every time, it occurs frequently enough. After debugging the relevant source code, I traced the issue to how the system handles LLM outputs when rephrasing questions.

To Reproduce

I called the search API with a fine-tuned Llama-70B model as the LLM, using the following prompt:

Briefly introduce the publication named Pattern Recognition within one paragraph, indicating whether it is a journal or conference, its organizing body or publisher, its primary focus or fields of research, and its commonly used abbreviation (if any).

Steps to reproduce the issue:

  1. Insert debug code in search.metaSearchAgent.ts, as shown here:
    [image: insert_debug]
  2. Fill in the above prompt in the request body.
  3. Send a POST request to "http://localhost:3001/api/search" (i.e., the server API).
  4. Check the console for errors. Occasionally, you will see something like this:
    [image: console_info]

Expected behavior

The system should process the query correctly, send it to SearXNG, and return the desired result. Instead, the issue occurs because the output parser mistakenly interprets explanatory text, such as "no <links> block included," as the beginning of a <links> tag. This leads to invalid parsing and, ultimately, failure in URL validation.
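
The failure mode described above can be reproduced in isolation. The following is a minimal sketch, not the project's actual parser: `naiveExtract` and `isValidUrl` are hypothetical names, and the fallback behavior when one tag is missing is an assumption. It only illustrates how an explanatory mention of `<links>` with no closing tag can slip past a check that bails out only when both tags are absent.

```typescript
// Hypothetical, simplified parser. NOT the project's actual code.
function naiveExtract(output: string, key: string): string | null {
  const openTag = `<${key}>`;
  const startKeyIndex = output.indexOf(openTag);
  const endKeyIndex = output.indexOf(`</${key}>`);

  // Bails out only when BOTH tags are missing (the && condition from this issue).
  if (startKeyIndex === -1 && endKeyIndex === -1) {
    return null;
  }

  // Assumed fallback: a missing tag defaults to the start or end of the output.
  const start = startKeyIndex === -1 ? 0 : startKeyIndex + openTag.length;
  const end = endKeyIndex === -1 ? output.length : endKeyIndex;
  return output.slice(start, end).trim();
}

// Returns true only if the WHATWG URL parser accepts the string.
function isValidUrl(candidate: string): boolean {
  try {
    new URL(candidate);
    return true;
  } catch {
    return false;
  }
}

// The model explains itself instead of emitting a real <links> block.
const llmOutput =
  "<question>\nIntroduce Pattern Recognition\n</question>\n" +
  "(no <links> block included, since the query has no URLs)";

// The explanatory text after "<links>" is treated as link content.
const links = naiveExtract(llmOutput, "links");
```

Here `links` ends up holding leftover prose rather than a URL, so any later `new URL(...)` call on it throws, which matches the "Invalid URL" symptom.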

Additional context

Suggestions for Improvements:

  1. Replace the logical operator && in the condition (startKeyIndex === -1 && endKeyIndex === -1) with ||, so that a <links> block is only accepted when both its opening and closing tags are present. Alternatively, you could add post-checks that validate the generated <links> block and its content.
  2. I also recommend modifying some details of the webSearchRetrieverPrompt:
  • The <question> of the first example should align with the others: "What is the capital of France?", rather than a simple "Capital of France".
  • The format of the second example should align with the others: add "Follow up question: " before "Hi, how are you?".
  • Add a question mark "?" after every real question, for example: "What is Docker?" instead of "What is Docker" in the third example.
  3. In search.metaSearchAgent.ts, I suggest switching the order of these two tags:
            <query>
            ${question}
            </query>

            <text>
            ${doc.pageContent}
            </text>

Put <text> before <query>, as the other examples do.
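
Suggestion 1 above could look roughly like the sketch below. It assumes the links arrive one per line inside the block; `extractClosedTag` and `extractValidLinks` are hypothetical helper names, not functions from the Perplexica codebase. The first function uses || so that a missing opening or closing tag rejects the block, and the second adds the post-check by keeping only lines that parse as http(s) URLs.

```typescript
// Hypothetical helpers: a sketch, not the project's actual parser.
function extractClosedTag(output: string, key: string): string | null {
  const openTag = `<${key}>`;
  const closeTag = `</${key}>`;
  const startKeyIndex = output.indexOf(openTag);
  // Search for the closing tag only after the opening tag, if there is one.
  const searchFrom = startKeyIndex === -1 ? 0 : startKeyIndex + openTag.length;
  const endKeyIndex = output.indexOf(closeTag, searchFrom);

  // || instead of &&: reject the block if EITHER tag is missing.
  if (startKeyIndex === -1 || endKeyIndex === -1) {
    return null;
  }
  return output.slice(startKeyIndex + openTag.length, endKeyIndex).trim();
}

function extractValidLinks(output: string): string[] {
  const block = extractClosedTag(output, "links");
  if (block === null) {
    return [];
  }
  return block
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => line.length > 0)
    .filter((line) => {
      // Post-check: keep only strings the URL parser accepts as http(s).
      try {
        const url = new URL(line);
        return url.protocol === "http:" || url.protocol === "https:";
      } catch {
        return false;
      }
    });
}
```

With this shape, an output such as "(no <links> block included)" yields an empty list instead of reaching URL validation with leftover prose, and a well-formed block containing a stray non-URL line drops only that line.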

OverrideTuring added the bug label Dec 26, 2024
@ItzCrazyKns
Owner

Hi, I've been working on the prompt, and with what I have so far, Perplexica is able to work correctly with a 3B model as well. This prompt will be released pretty soon after some final touches. Stay tuned for it!
[image]

@OverrideTuring
Author

> Hi, I've been working on the prompt, and with what I have so far, Perplexica is able to work correctly with a 3B model as well. This prompt will be released pretty soon after some final touches. Stay tuned for it!

Try this complete prompt:

Briefly introduce the publication named Pattern Recognition within one paragraph, indicating whether it is a journal or conference, its organizing body or publisher, its primary focus or fields of research, and its commonly used abbreviation (if any). Return the answer in the following JSON format:
```json
{"introduction": "<concise description of the publication>"}
```

For example, the introduction of IEEE Conference on Computer Vision and Pattern Recognition should be:
```json
{"introduction": "A premier annual conference in the field of computer vision and pattern recognition. It is organized by the IEEE and the Computer Vision Foundation (CVF). It is widely known as CVPR."}
```

Don't add any reference information (e.g. [1], [2], etc.) in the JSON-format answer. You can do it after the answer.

Here I tell the LLM to return a JSON-format answer, which triggers the problem more often. However, my fine-tuned LLM plays a part, too.

@thefux

thefux commented Jan 1, 2025

Hi there, any news on this? I'm running into the same issue as in #533, and quite often.
