
Token indices sequence length #856

Closed
saboor2632 opened this issue Dec 29, 2024 · 9 comments · Fixed by #877
Labels
bug Something isn't working

Comments

@saboor2632

I am facing this issue whenever I run my code for scraping BBC or any other site.
Error:
Token indices sequence length is longer than the specified maximum sequence length for this model (5102 > 1024). Running this sequence through the model will result in indexing errors

It does not give me complete results
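
For reference, a minimal setup along these lines triggers the same warning; the prompt, URL, and model choice below are illustrative guesses, not the exact code from the report:

from scrapegraphai.graphs import SmartScraperGraph

graph_config = {
    "llm": {
        "model": "ollama/llama3.2",  # assumed: a local Ollama model, as in the later comments
        "temperature": 0,
    },
    "verbose": True,
    "headless": False
}

smart_scraper_graph = SmartScraperGraph(
    prompt="List me all the article titles and links",  # illustrative prompt
    source="https://www.bbc.com/news",                  # illustrative URL
    config=graph_config
)

result = smart_scraper_graph.run()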

@Qunlexie

Qunlexie commented Jan 2, 2025

I have the same issue as well, especially when working with Ollama Llama models.

@VinciGit00
Collaborator

For big websites you should use OpenAI.

github-actions bot pushed a commit that referenced this issue Jan 6, 2025
## [1.35.0-beta.4](v1.35.0-beta.3...v1.35.0-beta.4) (2025-01-06)

### Features

* ⏰added graph timeout and fixed model_tokens param ([#810](#810) [#856](#856) [#853](#853)) ([01a331a](01a331a))
github-actions bot pushed a commit that referenced this issue Jan 6, 2025
## [1.35.0](v1.34.2...v1.35.0) (2025-01-06)

### Features

* ⏰added graph timeout and fixed model_tokens param ([#810](#810) [#856](#856) [#853](#853)) ([01a331a](01a331a))
* ⛏️ enhanced contribution and precommit added ([fcbfe78](fcbfe78))
* add codequality workflow ([4380afb](4380afb))
* add timeout and retry_limit in loader_kwargs ([#865](#865) [#831](#831)) ([21147c4](21147c4))
* serper api search ([1c0141f](1c0141f))

### Bug Fixes

* browserbase integration ([752a885](752a885))
* local html handling ([2a15581](2a15581))

### CI

* **release:** 1.34.2-beta.1 [skip ci] ([f383e72](f383e72)), closes [#861](#861) [#861](#861)
* **release:** 1.34.2-beta.2 [skip ci] ([93fd9d2](93fd9d2))
* **release:** 1.34.3-beta.1 [skip ci] ([013a196](013a196)), closes [#861](#861) [#861](#861)
* **release:** 1.35.0-beta.1 [skip ci] ([c5630ce](c5630ce)), closes [#865](#865) [#831](#831)
* **release:** 1.35.0-beta.2 [skip ci] ([f21c586](f21c586))
* **release:** 1.35.0-beta.3 [skip ci] ([cb54d5b](cb54d5b))
* **release:** 1.35.0-beta.4 [skip ci] ([6e375f5](6e375f5)), closes [#810](#810) [#856](#856) [#853](#853)
@PeriniM
Collaborator

PeriniM commented Jan 6, 2025

@Qunlexie @saboor2632 there is indeed an issue with the method that calculates chunks for Ollama models, in tokenizer_ollama.py. We are using get_num_tokens from the LangChain library, but it always limits the model to 1024 tokens: even if we change the number of tokens from the graph config using the model_tokens param, we can only set it to 1024. It would be better to calculate the number of chunks ourselves, for example using the OpenAI tokenizer from tiktoken to approximate the number of tokens per chunk. Wdyt?
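
For illustration, a rough sketch of that idea (the function name and chunking logic here are hypothetical, not the actual tokenizer_ollama.py code), using tiktoken's cl100k_base encoding as an approximation:

import tiktoken

def chunk_by_tokens(text: str, max_tokens: int = 4096) -> list[str]:
    # Approximate token counts with an OpenAI tokenizer instead of
    # langchain's get_num_tokens, then split the page text into chunks
    # that fit the configured model_tokens value.
    enc = tiktoken.get_encoding("cl100k_base")
    tokens = enc.encode(text)
    return [
        enc.decode(tokens[i:i + max_tokens])
        for i in range(0, len(tokens), max_tokens)
    ]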

@Qunlexie

Qunlexie commented Jan 10, 2025

> We are using get_num_tokens from the LangChain library, but it always limits the model to 1024 tokens: even if we change the number of tokens from the graph config using the model_tokens param, we can only set it to 1024.

Is this a bug that should be raised to LangChain? I believe the limit on the tokens is the cause of the issue where only part of the web page is retrieved rather than all of it.

> It would be better to calculate the number of chunks ourselves, for example using the OpenAI tokenizer from tiktoken to approximate the number of tokens per chunk. Wdyt?

How would this work? Do you have a practical example of how this would work with ScrapeGraphAI? I really believe that getting Ollama to work properly is key for open source.

Happy to get your thoughts.

@Pal-dont-want-to-work

I face the same problem; hoping the author can fix it:
Token indices sequence length is longer than the specified maximum sequence length for this model (1548 > 1024). Running this sequence through the model will result in indexing errors

@PeriniM
Collaborator

PeriniM commented Jan 12, 2025

Hey @saboor2632 @Qunlexie @Pal-dont-want-to-work the issue is now fixed in v1.36!
You can try with:

graph_config = {
    "llm": {
        "model": "ollama/llama3.2",
        "temperature": 0,
        "model_tokens": 4096
    },
    "verbose": True,
    "headless": False
}

@Qunlexie

Qunlexie commented Jan 13, 2025

Thanks @PeriniM for the fix. The import error seems to have been resolved.

> It does not give me complete results

However, this part of the original question still holds. I believe this is because the parser runs into several issues when parsing Ollama model output into JSON (which is more likely at higher model_tokens).

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/langchain_core/output_parsers/json.py", line 83, in parse_result
    return parse_json_markdown(text)
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "<string>", line 20, in <module>
  File "/usr/local/lib/python3.10/dist-packages/scrapegraphai/graphs/smart_scraper_graph.py", line 292, in run
    self.final_state, self.execution_info = self.graph.execute(inputs)
  .... 
  File "/usr/local/lib/python3.10/dist-packages/langchain_core/output_parsers/json.py", line 86, in parse_result
    raise OutputParserException(msg, llm_output=text) from e
langchain_core.exceptions.OutputParserException: Invalid json output: Here is the text reformatted to be more readable:

You can recreate this by:

graph_config = {
   'llm': {
      'model': 'ollama/llama3.2',
      'temperature': 0.0,
      'model_tokens': 4096,
      'base_url': 'http://localhost:11434'
   },
   'verbose': True,
   'headless': False
}

smart_scraper_graph = SmartScraperGraph(
   prompt='List me all the titles and link to the story',
   source='https://www.vanguardngr.com/2023/page/1',
   config=graph_config
)

result = smart_scraper_graph.run()

I wonder if the new structured output from Ollama could solve this? Wdyt?
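
For context, this is roughly how Ollama's structured outputs look when called directly (ollama-python >= 0.4); the schema and prompt are illustrative and independent of ScrapeGraphAI's internals:

from pydantic import BaseModel
import ollama

class Story(BaseModel):
    title: str
    link: str

class Stories(BaseModel):
    stories: list[Story]

response = ollama.chat(
    model="llama3.2",
    messages=[{"role": "user", "content": "List the titles and links in this page text: ..."}],
    format=Stories.model_json_schema(),  # constrains the model to emit JSON matching the schema
)

stories = Stories.model_validate_json(response.message.content)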

@PeriniM
Collaborator

PeriniM commented Jan 13, 2025

Hey @Qunlexie, the new structured output for Ollama models is already present in the new version. I guess merging many chunks (big websites) into valid JSON is difficult for a 3B model. Can you try with a bigger model and let me know? Or increase the model_tokens in the graph config.
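
For example (the model tag and token count below are just placeholders for "a larger model with a bigger context window", not specific recommendations):

graph_config = {
    "llm": {
        "model": "ollama/llama3.1:8b",  # any larger Ollama model you have pulled locally
        "temperature": 0,
        "model_tokens": 8192
    },
    "verbose": True,
    "headless": False
}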

@VinciGit00
Collaborator

Try with OpenAI; if it works with that, the problem is the model.
