
Poor content extraction from certain domains #4

Open
wwjCMP opened this issue Feb 6, 2025 · 8 comments
Labels: enhancement (New feature or request)

Comments

wwjCMP commented Feb 6, 2025

Ollama log

level=WARN source=server.go:762 msg="format is neither a schema or "json"" format=""""

Will this affect the final result? Because in my actual usage, the response effectiveness is very poor.

benhaotang (Owner) commented Feb 6, 2025

Sorry, may I get more context on:

  • What models are you using? What context lengths did you set for them?
  • At what stage do you see this warning: planning, searching, deciding whether a URL is useful, or writing?
  • What does "response effectiveness" mean?

This warning only appears if you request via the REST API without setting an output format. Since we are using ollama-python instead of cURL here, it will not affect the returned message.
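To illustrate, here is a hypothetical sketch (build_chat_payload is my name, not the repo's code) of how a /api/chat request body would be assembled so that the format field is only sent when one is actually wanted; sending an empty format string is what trips the warning, while omitting the key entirely, as ollama-python does by default, is harmless:

```python
import json

def build_chat_payload(model, prompt, fmt=None, num_ctx=None):
    """Assemble a /api/chat request body; `fmt` and `num_ctx` are optional."""
    payload = {"model": model,
               "messages": [{"role": "user", "content": prompt}]}
    if fmt is not None:
        payload["format"] = fmt              # "json" or a JSON schema dict
    if num_ctx is not None:
        payload["options"] = {"num_ctx": num_ctx}
    return payload

# Omitting `format` entirely leaves free-form text output and no warning.
print(json.dumps(build_chat_payload("mistral-small", "hi")))
```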

If by "effectiveness" you mean the final result is not good, I suspect your context length is too small. With planning, at 5 iterations per 4 searches per query, that is 80 searches; if 50% are useful, the final writing prompt can easily exceed 100K tokens. Yes, we can do some pre-summarization to reduce token count (and we already do on a per-result basis), but if we do it at the iteration level, small models will hallucinate hard without full context.
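As a rough sketch of that budget (the ~2.5K tokens per kept result is my assumption, not a measured number from the repo):

```python
searches = 80                  # 5 iterations x 4 searches per query, as above
useful = int(searches * 0.5)   # assume half the results are worth keeping
tokens_per_result = 2_500      # assumed size of one pre-summarized result

writing_prompt_tokens = useful * tokens_per_result
print(writing_prompt_tokens)   # 100000 -- far beyond a 25K or 32K num_ctx
```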

That is also why I cannot personally test this on my setup; I don't have a GPU that can fit all of that. Since I am using mistral-small with max ctx at 32K on my RX 7800 XT, I only use 2 iterations at 3 searches without planning. The result I get seems decent, and the citations are proper without much hallucination.

benhaotang (Owner) commented Feb 6, 2025

Well, maybe this is also a good reminder for me to add RAG support :) I will do that after I add tool-calling support.

benhaotang self-assigned this Feb 6, 2025
benhaotang added the enhancement (New feature or request) label Feb 6, 2025
wwjCMP (Author) commented Feb 7, 2025


DEFAULT_MODEL = "mistral-small:22b-instruct-2409-q5_K_M"
REASON_MODEL = "deepseek-r1:14b-qwen-distill-q4_K_M"
TEMP_PDF_DIR = Path("./temp_pdf")  # Directory for storing downloaded PDFs
BROWSE_LITE = 1  # Whether to parse webpages with reader-lm and PDFs with docling, or not
PDF_MAX_PAGES = 1
PDF_MAX_FILESIZE = 20971520  # 20 MiB
TIMEOUT_PDF = 2
MAX_HTML_LENGTH = 10120
MAX_EVAL_TIME = 15  # Maximum seconds for JavaScript execution to clean HTML

async def call_ollama_async(session, messages, model=DEFAULT_MODEL, max_tokens=20000):
call_ollama_async(session, messages, model="reader-lm", max_tokens=int(4*MAX_HTML_LENGTH))
{"role": "user", "content": f"User Query: {user_query}\n\nWeb URL: {page_url}\n\nWebpage Content (first 20000 characters):\n{page_text[:20000]}\n\n{prompt}"}

benhaotang (Owner) commented Feb 7, 2025

Looks like it is during parsing? I added some verbose output and tested myself; the parsing result seems reasonable to me. Of course, results may degrade on some ads-packed websites, but it shouldn't lose much context.

Also, what you show me is not the context length; num_predict is just the max output. That's why I set it to 1.25×, because the output should not be much longer than the input HTML source code.

For Ollama, either you set num_ctx when requesting, or you set a default in the Modelfile when importing; otherwise it defaults to 2K. If you have never changed num_ctx, that might be the problem. I will also add an option to change it on a per-model basis over the weekend.
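For example, a minimal Modelfile that bakes in a larger default context (model name assumed from the config above):

```
FROM mistral-small:22b-instruct-2409-q5_K_M
PARAMETER num_ctx 25000
```

Then import it with `ollama create mistral-small-25k -f Modelfile`, and the larger context applies without passing num_ctx on every request.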

wwjCMP (Author) commented Feb 7, 2025


I set num_ctx to 25000.

benhaotang (Owner) commented Feb 7, 2025 via email

benhaotang changed the title from "format is neither a schema or \"json\"" to "Poor content extraction from certain domains" Feb 7, 2025
benhaotang (Owner) commented Feb 8, 2025

I am trying to do a rewrite with DSPy for structured output, and RAG to reduce token usage and improve data-flow efficiency. I think that will improve the current situation and better leverage Ollama's abilities.

Hope to have things delivered before next weekend.

benhaotang added this to the Rewrite with DSPy milestone Feb 8, 2025
benhaotang (Owner) commented Feb 15, 2025

@wwjCMP I think I found the root cause. Currently the code tries to sanitize the HTML before giving it to reader-lm to parse (otherwise it would take even longer to extract useful info). The current implementation loses context if the website embeds its content in an iframe or in the same div as advertisements, or has a tricky style sheet.
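A minimal stdlib sketch of why that happens (naive_extract is a hypothetical stand-in for the sanitize step, not the repo's code): dropping script/style/iframe subtrees before stripping tags discards everything the site renders through a frame, since an iframe's document lives at its src URL.

```python
import re

def naive_extract(html: str) -> str:
    # Drop script/style/iframe subtrees, then strip the remaining tags.
    html = re.sub(r"<(script|style|iframe)[^>]*>.*?</\1\s*>", " ", html,
                  flags=re.S | re.I)
    text = re.sub(r"<[^>]+>", " ", html)
    return re.sub(r"\s+", " ", text).strip()

page = '<div>Intro</div><iframe src="/article"></iframe>'
print(naive_extract(page))  # "Intro" -- the framed article body is simply gone
```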

I think the ultimate solution is to have an agent that, based on the website, picks a parse method: ranging from screenshotting the website and running OCR (if it tries to hide content in the HTML), to jina html2md, newspaper3k, or bs4. This may take some extra time to implement.
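The dispatch could look something like this sketch (the heuristics and thresholds are mine, purely illustrative):

```python
import re

def pick_parser(html: str) -> str:
    # Ratio of visible text to total markup as a cheap "is content hidden?" signal.
    text = re.sub(r"<[^>]+>", "", html)
    ratio = len(text.strip()) / max(len(html), 1)
    if ratio < 0.01:
        return "screenshot+ocr"   # content invisible in the HTML itself
    if "<iframe" in html.lower():
        return "jina-html2md"     # needs a renderer that resolves frames
    if "<article" in html.lower():
        return "newspaper3k"      # article-shaped page
    return "bs4"                  # plain fallback
```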

There is a quick fix if you want to try it: use BROWSE_LITE=true and remove all the HTML tags with a regex. But I will try to implement DSPy and the dynamic parsing method described above before falling back to this last resort.
