Adding maxSubstringLength field #42
Conversation
I don't think I understand how this new parameter is supposed to work. I think arbitrarily splitting tokens into chunks of a certain length could cause things to match that shouldn't. I think the test you've added actually shows this happening. The …
You're right, it looks like we tokenize the search query as well. The goal of this new parameter is to prevent super large tokens from being indexed. Do you think only doing the chunking in the …
Honestly, as the commit log shows, I haven't worked on (or thought about) this library much in a couple of years. 😄 I don't have much of a suggestion here off the top of my head...but I also don't feel good about merging the PR as it is now.
@bvaughn I think I've made something that should be acceptable. It should limit the memory impact to O(k^2) if …
expect(await searchUtility.search("verylong")).toEqual([8]);
expect(await searchUtility.search("without")).toEqual([8]);
expect(await searchUtility.search("delimiter")).toEqual([8]);
expect((await searchUtility.search("withoutdelimiter")).length).toBe(0);
I think this change is better... I'm still unsure about how maxSubstringLength affects all substrings mode though.
Given a token "Verylongstringwithoutdelimiter", what's the value of being able to search and match on e.g. "without" and "delimiter" rather than just "verylong"?
I think I would have expected the indexer to take a word like "Verylongstringwithoutdelimiter", see there's a max substring length of 8, truncate it to "Verylong", and index that only (discarding the rest, because the rest is arbitrary).
What this is essentially doing is putting an upper limit on the input search value.
Internally in our product, a user might have some input like:
"1000-3000-1400-11100...." (assume very long input)
We caused a sev (a production incident) by trying to calculate all substrings of this. With a max substring length of 10, all the individual numbers here would still be searchable.
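For a sense of the memory blow-up being described, here is a minimal sketch (plain JavaScript, not part of js-worker-search) of how the all-substrings count grows quadratically with token length:

```js
// Minimal sketch: count the substrings the default all-substrings strategy
// would generate for a single token. A token of length n has n * (n + 1) / 2
// non-empty substrings, i.e. O(n^2) entries to index.
function countSubstrings(token) {
  let count = 0;
  for (let start = 0; start < token.length; start++) {
    for (let end = start + 1; end <= token.length; end++) {
      count++;
    }
  }
  return count;
}

console.log(countSubstrings("1000-3000")); // 45
console.log(countSubstrings("x".repeat(1000))); // 500500 for one 1000-character token
```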
Right. Just to be clear, I understand the memory usage concerns of trying to pre-calculate all substring indexes for a document with longer strings. So I understand the motivation for putting some kind of cap on that.
My concern is that the way the cap is implemented, the resulting search becomes difficult to predict. For example, if we had three documents that contained the following strings:
- A: cat
- B: concatenate
- C: sophisticated
And we passed a maxSubstringLength of 3. You'd be able to match A and B with the search string "cat" but not C, because "sophisticated" would get broken down into separate chunks to be indexed: "sop", "his", "tic", "ate", "d". This seems too arbitrary to me to be a good user experience. So if you typed "t" and saw all three matches, then refined to "at" and saw all three, then refined to "cat" and now only see two, that's unexpected behavior I think. At least if we only indexed the first maxSubstringLength characters, then finding A would seem more consistent.
I wonder if what you really want here is the ability to specify a custom tokenizer so that you can split your example string above ("1000-3000-1400-11100....") into more meaningful chunks of "1000", "3000", "1400", "11100", etc. (which might not all even be the same length).
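If the library exposes a way to customize tokenization, that direction could look roughly like the sketch below. The tokenizePattern option and the splitting regex are assumptions here, not something this PR adds, so the exact option name should be checked against the js-worker-search README:

```js
import SearchApi from "js-worker-search";

// Assumed option: tokenizePattern (a RegExp used to split text into tokens).
// Splitting on whitespace and dashes turns "1000-3000-1400-11100" into the
// tokens "1000", "3000", "1400", "11100" instead of one very long token.
const searchApi = new SearchApi({
  tokenizePattern: /[\s-]+/,
});

searchApi.indexDocument("doc-1", "1000-3000-1400-11100");

// Each number is still searchable, but no huge all-substrings set is built
// for the full undelimited string.
searchApi.search("1400").then((uids) => {
  console.log(uids); // ["doc-1"]
});
```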
I see your confusion. This implementation is not changing the tokens, so 'sophisticated' gets passed into SearchIndex.
SearchIndex is actually breaking down sophisticated into:
[s, so, sop, o, op, oph, h, hi, his, ...., c, ca, cat, a, at, ate, ...., ted]
i.e. all substrings no longer than maxSubstringLength.
So 'cat' would actually match 'sophisticated'.
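For concreteness, a minimal sketch of the capped all-substrings behavior being described (not the library's actual implementation):

```js
// Sketch (not the library's actual code): generate every substring of a token
// whose length is at most maxSubstringLength.
function substringsUpTo(token, maxSubstringLength) {
  const substrings = new Set();
  for (let start = 0; start < token.length; start++) {
    const maxEnd = Math.min(token.length, start + maxSubstringLength);
    for (let end = start + 1; end <= maxEnd; end++) {
      substrings.add(token.slice(start, end));
    }
  }
  return substrings;
}

const indexed = substringsUpTo("sophisticated", 3);
console.log(indexed.has("cat"));  // true: "cat" is a contiguous substring of length <= 3
console.log(indexed.has("cate")); // false: longer than maxSubstringLength
```

In this sketch a token of length n produces on the order of n * k substrings (where k is the cap) rather than roughly n^2 / 2, which is where the memory savings come from.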
Agree about the custom tokenizer, but this is also a fallback because our users can enter any values.
Gotcha.
Hm. I need to think about this some more.
I understand your use case but I'm not sure if I want to take on supporting this workaround.
js-worker-search's default indexing strategy is to create all substrings of a token. This takes O(n^2) memory, which can cause the browser to crash. To prevent this, I added a new parameter called maxLength. It ensures there are no tokens longer than this length by splitting them at that length.
Some additional notes in this PR:
- Array.flat
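For illustration, a hypothetical helper matching the splitting described in this summary (not the PR's actual code); note that the discussion above later moves toward capping the length of generated substrings rather than splitting tokens:

```js
// Hypothetical helper: break a token into chunks no longer than maxLength
// before indexing, so a single long token cannot produce O(n^2) substrings.
function splitToMaxLength(token, maxLength) {
  const chunks = [];
  for (let start = 0; start < token.length; start += maxLength) {
    chunks.push(token.slice(start, start + maxLength));
  }
  return chunks;
}

console.log(splitToMaxLength("1000-3000-1400-11100", 10));
// ["1000-3000-", "1400-11100"]
```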