Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: stackoverflow caused by textspliter #165

Merged
merged 7 commits into from
Feb 16, 2024
Merged

fix: stackoverflow caused by textspliter #165

merged 7 commits into from
Feb 16, 2024

Conversation

whyiug
Copy link

@whyiug whyiug commented Feb 14, 2024

Pull Request Template

Description

When separators do not include "", textspliter may result in infinite recursive calls, i.e., stack overflow.
Here's an example:
The input parameters are:

rc := RecursiveCharacter{
    Separators: []string{"\n", "$"},
    ChunkSize:  20,
    ChunkOverlap: 1,
}
text := "Hi, Harrison. \nI am glad to meet you"

error like this:

runtime: goroutine stack exceeds 1000000000-byte limit
runtime: sp=0xc02081e330 stack=[0xc02081e000, 0xc04081e000]
fatal error: stack overflow

runtime stack:
runtime.throw({0x358d8ae?, 0x0?})
/usr/local/go/src/runtime/panic.go:1023 +0x5c fp=0x7ff7bdc6e270 sp=0x7ff7bdc6e240 pc=0x32cab7c
runtime.newstack()
/usr/local/go/src/runtime/stack.go:1103 +0x5bd fp=0x7ff7bdc6e420 sp=0x7ff7bdc6e270 pc=0x32e6a1d
runtime.morestack()
/usr/local/go/src/runtime/asm_amd64.s:616 +0x7a fp=0x7ff7bdc6e428 sp=0x7ff7bdc6e420 pc=0x3302dda

Type of change

  • Bug fix

Add the newSeparators array to avoid entering an infinite loop call.
Inspired by https://github.com/langchain-ai/langchain/blob/00a09e1b7117f3bde14a44748510fcccc95f9de5/libs/langchain/langchain/text_splitter.py.

@henomis henomis changed the base branch from main to release-v0.1.1 February 14, 2024 08:36
@henomis
Copy link
Owner

henomis commented Feb 15, 2024

  • Please provide a real example. RecursiveCharacter doesn't exist, did you mean RecursiveCharacterTextSplitter?
  • RecursiveCharacterTextSplitter.separators is a private attribute and is intended to be initialized using the NewRecursiveCharacterTextSplitter where separators will be initialized with defaultSeparators
  • However IMO the problem should be fixed in WithSeparators method checking/adding the "" separator by default.

@whyiug
Copy link
Author

whyiug commented Feb 15, 2024

Sorry for the incorrect example. It's too causal.

I just want to provide an example, actually users will initialize throughNewRecursiveCharacterTextSplitter and modify the value using theWithSeparators method. One thing that is certain is that users can modify the value of separators.

As for how to fix this problem, I don't want to avoid this problem by simply checking. Instead, allow the user's input to determine how to split it, start on line 871 in the https://github.com/langchain-ai/langchain/blob/00a09e1b7117f3bde14a44748510fcccc95f9de5/libs/langchain/langchain/text_splitter.py.

@henomis
Copy link
Owner

henomis commented Feb 16, 2024

Thank you for your contribution! 🚀

@henomis henomis merged commit 16dd8b9 into henomis:release-v0.1.1 Feb 16, 2024
2 checks passed
@henomis henomis mentioned this pull request Feb 16, 2024
7 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants