Adding maxSubstringLength field #42
Conversation
I don't think I understand how this new parameter is supposed to work. I think arbitrarily splitting tokens into chunks of a certain length could cause things to match that shouldn't. I think the test you've added actually shows this happening. The …
You're right, it looks like we tokenize the search query as well. The goal of this new parameter is to prevent super large tokens from being indexed. Do you think only doing the chunking in the …
Honestly, as the commit log shows, I haven't worked on (or thought about) this library much in a couple of years. 😄 I don't have much of a suggestion here off the top of my head...but I also don't feel good about merging the PR as it is now.
@bvaughn I think I've made something that should be acceptable. It should limit the memory impact to O(k^2) if …
expect(await searchUtility.search("verylong")).toEqual([8]);
expect(await searchUtility.search("without")).toEqual([8]);
expect(await searchUtility.search("delimiter")).toEqual([8]);
expect((await searchUtility.search("withoutdelimiter")).length).toBe(0);
I think this change is better... I'm still unsure about how maxSubstringLength affects all substrings mode though.
Given a token "Verylongstringwithoutdelimiter", what's the value of being able to search and match on e.g. "without" and "delimiter" rather than just "verylong"?
I think I would have expected the indexer to take a word like "Verylongstringwithoutdelimiter", see there's a max substring length of 8, truncate it to "Verylong", and index that only (discarding the rest, because the rest is arbitrary).
What this is essentially doing is putting an upper limit on the input search value.
Internally in our product, a user might have some input like:
"1000-3000-1400-11100...." (assume very long input)
We caused a sev (a production incident) by trying to calculate all substrings of this. With a max substring length of 10, all the individual numbers here would still be searchable.
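For a sense of the memory blow-up being described, here is a minimal sketch (plain JavaScript, not part of js-worker-search) of how the all-substrings count grows quadratically with token length:

```js
// Minimal sketch: count the substrings the default all-substrings strategy
// would generate for a single token. A token of length n has n * (n + 1) / 2
// non-empty substrings, i.e. O(n^2) entries to index.
function countSubstrings(token) {
  let count = 0;
  for (let start = 0; start < token.length; start++) {
    for (let end = start + 1; end <= token.length; end++) {
      count++;
    }
  }
  return count;
}

console.log(countSubstrings("1000-3000")); // 45
console.log(countSubstrings("x".repeat(1000))); // 500500 for one 1000-character token
```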
Right. Just to be clear, I understand the memory usage concerns of trying to pre-calculate all substring indexes for a document with longer strings. So I understand the motivation for putting some kind of cap on that.
My concern is that the way the cap is implemented, the resulting search becomes difficult to predict. For example, if we had three documents that contained the following strings:
- A: cat
- B: concatenate
- C: sophisticated
And we passed a maxSubstringLength of 3. You'd be able to match A and B with the search string "cat" but not C, because "sophisticated" would get broken down into separate chunks to be indexed: "sop", "his", "tic", "ate", "d". This seems too arbitrary to me to be a good user experience. So if you typed "t" and saw all three matches, then refined to "at" and saw all three, then refined to "cat" and now only see two, that's unexpected behavior I think. At least if we only indexed the first maxSubstringLength characters, then finding A would seem more consistent.
I wonder if what you really want here is the ability to specify a custom tokenizer so that you can split your example string above ("1000-3000-1400-11100....") into more meaningful chunks of "1000", "3000", "1400", "11100", etc. (which might not all even be the same length).
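If the library exposes a way to customize tokenization, that direction could look roughly like the sketch below. The tokenizePattern option and the splitting regex are assumptions here, not something this PR adds, so the exact option name should be checked against the js-worker-search README:

```js
import SearchApi from "js-worker-search";

// Assumed option: tokenizePattern (a RegExp used to split text into tokens).
// Splitting on whitespace and dashes turns "1000-3000-1400-11100" into the
// tokens "1000", "3000", "1400", "11100" instead of one very long token.
const searchApi = new SearchApi({
  tokenizePattern: /[\s-]+/,
});

searchApi.indexDocument("doc-1", "1000-3000-1400-11100");

// Each number is still searchable, but no huge all-substrings set is built
// for the full undelimited string.
searchApi.search("1400").then((uids) => {
  console.log(uids); // ["doc-1"]
});
```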
I see your confusion. This implementation is not changing the tokens, so 'sophisticated' gets passed into SearchIndex.
SearchIndex is actually breaking down sophisticated into:
[s, so, sop, o, op, oph, h, hi, his, ...., c, ca, cat, a, at, ate, ...., ted]
i.e. all substrings no longer than maxSubstringLength.
So 'cat' would actually match 'sophisticated'.
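For concreteness, a minimal sketch of the capped all-substrings behavior being described (not the library's actual implementation):

```js
// Sketch (not the library's actual code): generate every substring of a token
// whose length is at most maxSubstringLength.
function substringsUpTo(token, maxSubstringLength) {
  const substrings = new Set();
  for (let start = 0; start < token.length; start++) {
    const maxEnd = Math.min(token.length, start + maxSubstringLength);
    for (let end = start + 1; end <= maxEnd; end++) {
      substrings.add(token.slice(start, end));
    }
  }
  return substrings;
}

const indexed = substringsUpTo("sophisticated", 3);
console.log(indexed.has("cat"));  // true: "cat" is a contiguous substring of length <= 3
console.log(indexed.has("cate")); // false: longer than maxSubstringLength
```

In this sketch a token of length n produces on the order of n * k substrings (where k is the cap) rather than roughly n^2 / 2, which is where the memory savings come from.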
Agree about the custom tokenizer, but this is also a fallback because our users can enter any values.
Gotcha.
Hm. I need to think about this some more.
I understand your use case but I'm not sure if I want to take on supporting this workaround.
js-worker-search's default indexing strategy is to create all substrings of a token. This takes O(n^2) memory, which can cause the browser to crash. To prevent this, I added a new parameter called maxLength. It ensures there are no tokens longer than this length by splitting them at that length.
Some additional notes in this PR:
- Array.flat
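For illustration, a hypothetical helper matching the splitting described in this summary (not the PR's actual code); note that the discussion above later moves toward capping the length of generated substrings rather than splitting tokens:

```js
// Hypothetical helper: break a token into chunks no longer than maxLength
// before indexing, so a single long token cannot produce O(n^2) substrings.
function splitToMaxLength(token, maxLength) {
  const chunks = [];
  for (let start = 0; start < token.length; start += maxLength) {
    chunks.push(token.slice(start, start + maxLength));
  }
  return chunks;
}

console.log(splitToMaxLength("1000-3000-1400-11100", 10));
// ["1000-3000-", "1400-11100"]
```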