
Adding maxSubstringLength field #42

Closed
wants to merge 3 commits

Conversation

KamranAsif

js-worker-search's default indexing strategy is to create every substring of a token. This takes O(n^2) memory, which can cause the browser to crash.

To prevent this, I added a new parameter called maxLength. It ensures no indexed token is longer than this value by splitting tokens at that length.
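For illustration, here is a minimal standalone sketch (helper names are mine, not the library's actual code) of the default all-substrings indexing and of the chunking that maxLength would apply:

```js
// Default strategy: index every substring of a token.
// A token of length n produces ~n(n+1)/2 substrings, hence O(n^2) memory.
function allSubstrings(token) {
  const substrings = [];
  for (let start = 0; start < token.length; start++) {
    for (let end = start + 1; end <= token.length; end++) {
      substrings.push(token.substring(start, end));
    }
  }
  return substrings;
}

// Proposed workaround: split a token into chunks of at most maxLength
// characters before indexing, so no single token explodes quadratically.
function splitIntoChunks(token, maxLength) {
  const chunks = [];
  for (let i = 0; i < token.length; i += maxLength) {
    chunks.push(token.substring(i, i + maxLength));
  }
  return chunks;
}

console.log(allSubstrings("abcd").length); // 10
console.log(splitIntoChunks("Verylongstringwithoutdelimiter", 8));
// ["Verylong", "stringwi", "thoutdel", "imiter"]
```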

Some additional notes in this PR:

bvaughn (Owner) commented Jan 26, 2021

I don't think I understand how this new parameter is supposed to work. I think arbitrarily splitting tokens into chunks of a certain length could cause things to match that shouldn't.

I think the test you've added actually shows this happening. The INDEX_MODES.EXACT_WORDS index mode should only return results that match exactly. So "Verylongst" should not match "Verylongstringwithout", but it does after this arbitrary truncation.
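To make the concern concrete, here is a hedged standalone sketch (my own illustration, not code from the PR) of how chunking the indexed word lets a non-equal query match under exact-words semantics:

```js
// Same illustrative chunking helper as in the sketch above.
function splitIntoChunks(token, maxLength) {
  const chunks = [];
  for (let i = 0; i < token.length; i += maxLength) {
    chunks.push(token.substring(i, i + maxLength));
  }
  return chunks;
}

// The indexed word is split into chunks of 10 characters.
const indexedChunks = new Set(splitIntoChunks("verylongstringwithout", 10));

// Under exact-words semantics the query should only match a whole word,
// but "verylongst" happens to equal one of the chunks, so it matches.
console.log(indexedChunks.has("verylongst")); // true
```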

KamranAsif (Author)

You're right, it looks like we tokenize the search query as well. The goal of this new parameter is to prevent super-large tokens from being indexed.

Do you think only doing the chunking in the indexDocument call would be acceptable?

bvaughn (Owner) commented Jan 26, 2021

Honestly, as the commit log shows, I haven't worked on (or thought about) this library much in a couple of years. 😄 I don't have much of a suggestion here off the top of my head...but I also don't feel good about merging the PR as it is now.

KamranAsif (Author)

@bvaughn I think I've made something that should be acceptable. It should limit the memory impact to O(k^2) if maxSubstringLength is defined.

Comment on lines +301 to +304
expect(await searchUtility.search("verylong")).toEqual([8]);
expect(await searchUtility.search("without")).toEqual([8]);
expect(await searchUtility.search("delimiter")).toEqual([8]);
expect((await searchUtility.search("withoutdelimiter")).length).toBe(0);
bvaughn (Owner)


I think this change is better... I'm still unsure about how maxSubstringLength affects all substrings mode though.

Given a token "Verylongstringwithoutdelimiter", what's the value of being able to search and match on e.g. "without" and "delimiter" rather than just "verylong"?

I think I would have expected the indexer to take a word like "Verylongstringwithoutdelimiter", see there's a max substring length of 8, truncate it to "Verylong", and index only that (discarding the rest, because the rest is arbitrary).

KamranAsif (Author)


What this is essentially doing is putting an upper limit on the input search value.
Internally in our product, a user might have some input like:

"1000-3000-1400-11100...." (assume very long input)

We caused a sev (a production incident) by trying to calculate all substrings of this. With a max substring length of 10, all the individual numbers here would still be searchable.
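For a rough sense of scale (my numbers, not from the PR): a single 1,000-character token has about 500,500 substrings, while capping substring length at 10 leaves roughly 10,000:

```js
// Back-of-the-envelope count of indexed substrings for one token of
// length n: uncapped vs. capped at k characters (illustrative only).
const n = 1000;
const k = 10;

const uncapped = (n * (n + 1)) / 2; // 500500

let capped = 0;
for (let len = 1; len <= k; len++) {
  capped += n - len + 1; // substrings of exactly `len` characters
}
console.log({ uncapped, capped }); // { uncapped: 500500, capped: 9955 }
```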

bvaughn (Owner)


Right. Just to be clear, I understand the memory usage concerns of trying to pre-calculate all substring indexes for a document with longer strings. So I understand the motivation for putting some kind of cap on that.

My concern is that the way the cap is implemented, the resulting search becomes difficult to predict. For example, if we had three documents that contained the following strings:

  • A: cat
  • B: concatenate
  • C: sophisticated

And we passed a maxSubstringLength of 3. You'd be able to match A and B with the search string "cat" but not C, because "sophisticated" would get broken down into separate chunks to be indexed: "sop", "his", "tic", "ate", "d". This seems too arbitrary to me to be a good user experience. So if you typed "t" and saw all three matches, then refined to "at" and saw all three, then refined to "cat" and now only see two, that's unexpected behavior I think. At least if we only indexed the first maxSubstringLength characters, then finding A would seem more consistent.

I wonder if what you really want here is the ability to specify a custom tokenizer so that you can split your example string above ("1000-3000-1400-11100....") into more meaningful chunks of "1000", "3000", "1400", "11100", etc. (which might not all even be the same length).
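For example, a custom tokenizer along these lines (a standalone sketch, not a claim about js-worker-search's current API) would split on delimiters instead of capping substring length:

```js
// Split on runs of non-alphanumeric characters and drop empty tokens.
function tokenize(text) {
  return text.split(/[^a-zA-Z0-9]+/).filter(token => token.length > 0);
}

console.log(tokenize("1000-3000-1400-11100"));
// ["1000", "3000", "1400", "11100"]
```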

KamranAsif (Author) commented Jan 27, 2021


I see your confusion. This implementation is not changing the tokens, so 'sophisticated' gets passed into SearchIndex.
SearchIndex is actually breaking down sophisticated into:
[s, so, sop, o, op, oph, h, hi, his, ..., c, ca, cat, a, at, ate, ..., ted], i.e. all substrings no longer than maxSubstringLength.

So 'cat' would actually match 'sophisticated'.

Agree about the custom tokenizer, but this is also a fallback because our users can enter any values.
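A minimal sketch of the behavior being described (illustrative only, not the PR's actual implementation): generate every substring of a token up to maxSubstringLength characters, so short needles like "cat" still match long tokens:

```js
// All substrings of `token` no longer than `maxSubstringLength`.
function substringsUpTo(token, maxSubstringLength) {
  const results = [];
  for (let start = 0; start < token.length; start++) {
    const maxEnd = Math.min(token.length, start + maxSubstringLength);
    for (let end = start + 1; end <= maxEnd; end++) {
      results.push(token.substring(start, end));
    }
  }
  return results;
}

console.log(substringsUpTo("sophisticated", 3).includes("cat")); // true
```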

bvaughn (Owner)


Gotcha.

Hm. I need to think about this some more.

I understand your use case but I'm not sure if I want to take on supporting this workaround.

@KamranAsif KamranAsif changed the title Adding maxLength field Adding maxSubstringLength field Feb 2, 2021
@KamranAsif KamranAsif closed this Feb 4, 2021