
Poor content extraction from certain domains #4

Open
wwjCMP opened this issue Feb 6, 2025 · 8 comments
Labels: enhancement (New feature or request)

Comments

wwjCMP commented Feb 6, 2025

Ollama log

level=WARN source=server.go:762 msg="format is neither a schema or "json"" format=""""

Will this affect the final result? Because in my actual usage, the response effectiveness is very poor.

benhaotang (Owner) commented Feb 6, 2025

Sorry, may I get more context on:

  • What models are you using? What context lengths did you set for them?
  • At what stage do you see this warning: planning, searching, deciding whether a URL is useful, or writing?
  • What does "response effectiveness" mean?

This warning only appears if you request via the REST API without setting an output format. Since we are using ollama-python instead of cURL here, it will not affect the returned message.
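To illustrate, here is a hypothetical sketch (build_chat_payload is my name, not the repo's code) of how a /api/chat request body would be assembled so that the format field is only sent when one is actually wanted; sending an empty format string is what trips the warning, while omitting the key entirely, as ollama-python does by default, is harmless:

```python
import json

def build_chat_payload(model, prompt, fmt=None, num_ctx=None):
    """Assemble a /api/chat request body; `fmt` and `num_ctx` are optional."""
    payload = {"model": model,
               "messages": [{"role": "user", "content": prompt}]}
    if fmt is not None:
        payload["format"] = fmt              # "json" or a JSON schema dict
    if num_ctx is not None:
        payload["options"] = {"num_ctx": num_ctx}
    return payload

# Omitting `format` entirely leaves free-form text output and no warning.
print(json.dumps(build_chat_payload("mistral-small", "hi")))
```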

If by "effectiveness" you mean the final result is not good, I suspect your context length is too small. With planning, at 5 iterations per 4 searches per query, that is 80 searches; if 50% are useful, the final writing prompt can easily exceed 100K tokens. Yes, we can do some pre-summarization to reduce token count (and we already do on a per-result basis), but if we do it at the iteration level, small models will hallucinate hard without full context.
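As a rough sketch of that budget (the ~2.5K tokens per kept result is my assumption, not a measured number from the repo):

```python
searches = 80                  # 5 iterations x 4 searches per query, as above
useful = int(searches * 0.5)   # assume half the results are worth keeping
tokens_per_result = 2_500      # assumed size of one pre-summarized result

writing_prompt_tokens = useful * tokens_per_result
print(writing_prompt_tokens)   # 100000 -- far beyond a 25K or 32K num_ctx
```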

That is also why I cannot personally test this on my setup; I don't have a GPU that can fit all of that. Since I am using mistral-small with max ctx at 32K on my RX 7800 XT, I only use 2 iterations at 3 searches without planning. The result I get seems decent, and the citations are proper without much hallucination.

benhaotang (Owner) commented Feb 6, 2025

Well, maybe this is also a good reminder for me to add RAG support :) I will do that after I add tool-calling support.

benhaotang self-assigned this Feb 6, 2025
benhaotang added the enhancement (New feature or request) label Feb 6, 2025
wwjCMP (Author) commented Feb 7, 2025


DEFAULT_MODEL = "mistral-small:22b-instruct-2409-q5_K_M"
REASON_MODEL = "deepseek-r1:14b-qwen-distill-q4_K_M"
TEMP_PDF_DIR = Path("./temp_pdf")  # Directory for storing downloaded PDFs
BROWSE_LITE = 1  # Whether to parse webpages with reader-lm and PDFs with docling, or not
PDF_MAX_PAGES = 1
PDF_MAX_FILESIZE = 20971520  # 20 MiB
TIMEOUT_PDF = 2
MAX_HTML_LENGTH = 10120
MAX_EVAL_TIME = 15  # Maximum seconds for JavaScript execution to clean HTML

async def call_ollama_async(session, messages, model=DEFAULT_MODEL, max_tokens=20000):
call_ollama_async(session, messages, model="reader-lm", max_tokens=int(4*MAX_HTML_LENGTH))
{"role": "user", "content": f"User Query: {user_query}\n\nWeb URL: {page_url}\n\nWebpage Content (first 20000 characters):\n{page_text[:20000]}\n\n{prompt}"}

benhaotang (Owner) commented Feb 7, 2025

Looks like it is during parsing? I added some verbose output and tested myself; the parsing result seems reasonable to me. Of course, results may degrade on some ads-packed websites, but it shouldn't lose much context.

Also, what you show me is not the context length; num_predict is just the max output. That's why I set it to 1.25×, because the output should not be much longer than the input HTML source code.

For Ollama, either you set num_ctx when requesting, or you set a default in the Modelfile when importing; otherwise it defaults to 2K. If you have never changed num_ctx, that might be the problem. I will also add an option to change it on a per-model basis over the weekend.
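For example, a minimal Modelfile that bakes in a larger default context (model name assumed from the config above):

```
FROM mistral-small:22b-instruct-2409-q5_K_M
PARAMETER num_ctx 25000
```

Then import it with `ollama create mistral-small-25k -f Modelfile`, and the larger context applies without passing num_ctx on every request.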

wwjCMP (Author) commented Feb 7, 2025


I set num_ctx to 25000.

benhaotang (Owner) commented Feb 7, 2025 via email

benhaotang changed the title from "format is neither a schema or \"json\"" to "Poor content extraction from certain domains" Feb 7, 2025
benhaotang (Owner) commented Feb 8, 2025

I am trying to do a rewrite with DSPy for structured output, and RAG to reduce token usage and improve data-flow efficiency. I think that will improve the current situation and better leverage Ollama's abilities.

Hope to have things delivered before next weekend.

benhaotang added this to the Rewrite with DSPy milestone Feb 8, 2025
benhaotang (Owner) commented Feb 15, 2025

@wwjCMP I think I found the root cause. Currently the code tries to sanitize the HTML before giving it to reader-lm to parse (otherwise it would take even longer to extract useful info). The current implementation loses context if the website embeds its content in an iframe or in the same div as advertisements, or has a tricky style sheet.
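A minimal stdlib sketch of why that happens (naive_extract is a hypothetical stand-in for the sanitize step, not the repo's code): dropping script/style/iframe subtrees before stripping tags discards everything the site renders through a frame, since an iframe's document lives at its src URL.

```python
import re

def naive_extract(html: str) -> str:
    # Drop script/style/iframe subtrees, then strip the remaining tags.
    html = re.sub(r"<(script|style|iframe)[^>]*>.*?</\1\s*>", " ", html,
                  flags=re.S | re.I)
    text = re.sub(r"<[^>]+>", " ", html)
    return re.sub(r"\s+", " ", text).strip()

page = '<div>Intro</div><iframe src="/article"></iframe>'
print(naive_extract(page))  # "Intro" -- the framed article body is simply gone
```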

I think the ultimate solution is to have an agent that, based on the website, picks a parse method: ranging from screenshotting the website and running OCR (if it tries to hide content in the HTML), to jina html2md, newspaper3k, or bs4. This may take some extra time to implement.
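The dispatch could look something like this sketch (the heuristics and thresholds are mine, purely illustrative):

```python
import re

def pick_parser(html: str) -> str:
    # Ratio of visible text to total markup as a cheap "is content hidden?" signal.
    text = re.sub(r"<[^>]+>", "", html)
    ratio = len(text.strip()) / max(len(html), 1)
    if ratio < 0.01:
        return "screenshot+ocr"   # content invisible in the HTML itself
    if "<iframe" in html.lower():
        return "jina-html2md"     # needs a renderer that resolves frames
    if "<article" in html.lower():
        return "newspaper3k"      # article-shaped page
    return "bs4"                  # plain fallback
```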

There is a quick fix if you want to try it: use BROWSE_LITE=true and remove all the HTML tags with a regex. But I will try to implement DSPy and the dynamic parsing method described above before falling back to this last resort.
