feat: implement text chunking processor with fixed token length and delimiter algorithm #607
For now, this PR is a POC for the RFC. I will mark this PR as ready once we finalize the high-level design and add the corresponding unit tests and integration tests.
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@             Coverage Diff              @@
##               main     #607      +/-   ##
============================================
+ Coverage     82.62%   84.19%   +1.56%
- Complexity      666      743      +77
============================================
  Files            52       59       +7
  Lines          2072     2309     +237
  Branches        334      370      +36
============================================
+ Hits           1712     1944     +232
- Misses          212      214       +2
- Partials        148      151       +3
30fd0eb to 57a4a20
Hi @zane-neo! I have modified the PR according to your comments. Feel free to review my code.
Thank you for the draft, @yuye-aws. I would like us to follow the upcoming new feature release process.
- Let's make sure all feature spec feedback is collected in the RFC: [RFC] Text chunking design #548
- Let's create a meta issue with the design (I can create one and link it)
- We will move forward with the changes
Do you mean the high-level design of the document chunking processor? Is the Interface Design section in the RFC what you are looking for?
Signed-off-by: yuye-aws <[email protected]>
src/main/java/org/opensearch/neuralsearch/processor/TextChunkingProcessor.java
private static final Set<String> WORD_TOKENIZERS = Set.of(
    "standard",
    "letter",
    "lowercase",
    "whitespace",
    "uax_url_email",
    "classic",
    "thai"
);
For now, let's not support any customized tokenizers here, to avoid ones whose tokens overlap. We can add an intelligent checker for tokenizers later.
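A minimal sketch of how the allow-list above might be enforced. The `validateTokenizer` helper and its error wording are hypothetical and not taken from the PR; the idea is only that any tokenizer name outside `WORD_TOKENIZERS` is rejected up front.

```java
import java.util.Locale;
import java.util.Set;
import java.util.TreeSet;

final class TokenizerAllowList {
    // Same names as the set in the diff above.
    private static final Set<String> WORD_TOKENIZERS = Set.of(
        "standard", "letter", "lowercase", "whitespace", "uax_url_email", "classic", "thai");

    // Hypothetical helper: the name and message are illustrative, not from the PR.
    static void validateTokenizer(String tokenizer) {
        // Null check first, because Set.of(...).contains(null) throws.
        if (tokenizer == null || !WORD_TOKENIZERS.contains(tokenizer)) {
            throw new IllegalArgumentException(String.format(Locale.ROOT,
                "Tokenizer [%s] is not supported. Supported word tokenizers: %s",
                tokenizer, new TreeSet<>(WORD_TOKENIZERS))); // sorted for a stable message
        }
    }
}
```

Rejecting unknown names at configuration time keeps the failure close to the user's mistake instead of surfacing later during ingestion.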
throw new IllegalStateException(
    String.format(Locale.ROOT, "%s algorithm encounters exception in tokenization: %s", ALGORITHM_NAME, e.getMessage()),
It is ok to include the original message, but the wording is too simple. We need to explain why this is happening.
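One possible shape for a more explanatory message, as a sketch only. The `wrap` helper, the value of `ALGORITHM_NAME`, and the suggested wording are assumptions; the diff shows only the constant's name and the original format string.

```java
import java.util.Locale;

final class TokenizationFailure {
    // Assumed value; the diff above only shows the constant's name.
    static final String ALGORITHM_NAME = "fixed_token_length";

    // Hypothetical helper: keeps the original message but adds context about what failed.
    static IllegalStateException wrap(String tokenizer, Exception cause) {
        String message = String.format(Locale.ROOT,
            "%s algorithm failed to tokenize the input text with tokenizer [%s]. "
                + "Check that the tokenizer is available and that the text is valid for it. Cause: %s",
            ALGORITHM_NAME, tokenizer, cause.getMessage());
        // Preserve the original exception so the stack trace is not lost.
        return new IllegalStateException(message, cause);
    }
}
```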
…elimiter algorithm (#607)
* implement chunking processor and fixed token length Signed-off-by: yuye-aws <[email protected]>
* initialize node client for document chunking processor Signed-off-by: yuye-aws <[email protected]>
* initialize document chunking processor with analysis registry Signed-off-by: yuye-aws <[email protected]>
* chunker factory create with analysis registry Signed-off-by: yuye-aws <[email protected]>
* implement tokenizer in fixed token length algorithm with analysis registry Signed-off-by: yuye-aws <[email protected]>
* add max token count parsing logic Signed-off-by: yuye-aws <[email protected]>
* bug fix for non-existing index Signed-off-by: yuye-aws <[email protected]>
* change error log Signed-off-by: yuye-aws <[email protected]>
* implement evenly chunk Signed-off-by: yuye-aws <[email protected]>
* unit tests for chunker factory Signed-off-by: yuye-aws <[email protected]>
* unit tests for chunker factory Signed-off-by: yuye-aws <[email protected]>
* add error message for chunker factory tests Signed-off-by: yuye-aws <[email protected]>
* resolve comments Signed-off-by: yuye-aws <[email protected]>
* Revert "implement evenly chunk" This reverts commit 93dd2f4. Signed-off-by: yuye-aws <[email protected]>
* add default value logic back Signed-off-by: yuye-aws <[email protected]>
* implement unit test for fixed token length chunker Signed-off-by: yuye-aws <[email protected]>
* add test cases in unit test for fixed token length chunker Signed-off-by: yuye-aws <[email protected]>
* support map type as an input Signed-off-by: yuye-aws <[email protected]>
* support map type as an input Signed-off-by: yuye-aws <[email protected]>
* bug fix for map type Signed-off-by: yuye-aws <[email protected]>
* bug fix for map type Signed-off-by: yuye-aws <[email protected]>
* bug fix for map type in document chunking processor Signed-off-by: yuye-aws <[email protected]>
* remove system out println Signed-off-by: yuye-aws <[email protected]>
* add delimiter chunker Signed-off-by: xinyual <[email protected]> Signed-off-by: yuye-aws <[email protected]>
* add UT for delimiter chunker Signed-off-by: xinyual <[email protected]> Signed-off-by: yuye-aws <[email protected]>
* add delimiter chunker processor Signed-off-by: xinyual <[email protected]> Signed-off-by: yuye-aws <[email protected]>
* add more UTs Signed-off-by: xinyual <[email protected]> Signed-off-by: yuye-aws <[email protected]>
* add more UTs Signed-off-by: xinyual <[email protected]> Signed-off-by: yuye-aws <[email protected]>
* basic unit tests for document chunking processor Signed-off-by: yuye-aws <[email protected]>
* fix tests for getProcessors in neural search Signed-off-by: yuye-aws <[email protected]>
* add unit tests with string, map and nested map type for document chunking processor Signed-off-by: yuye-aws <[email protected]>
* add unit tests for parameter valdiation in document chunking processor Signed-off-by: yuye-aws <[email protected]>
* add back deleted xml file Signed-off-by: yuye-aws <[email protected]>
* restore xml file Signed-off-by: yuye-aws <[email protected]>
* integration tests for document chunking processor Signed-off-by: yuye-aws <[email protected]>
* add back Run_Neural_Search.xml Signed-off-by: yuye-aws <[email protected]>
* restore Run_Neural_Search.xml Signed-off-by: yuye-aws <[email protected]>
* add changelog Signed-off-by: yuye-aws <[email protected]>
* update integration test for cascade processor Signed-off-by: yuye-aws <[email protected]>
* add max chunk limit Signed-off-by: xinyual <[email protected]> Signed-off-by: yuye-aws <[email protected]>
* remove useless and apply spotless Signed-off-by: xinyual <[email protected]> Signed-off-by: yuye-aws <[email protected]>
* update error message Signed-off-by: yuye-aws <[email protected]>
* change field UT Signed-off-by: xinyual <[email protected]> Signed-off-by: yuye-aws <[email protected]>
* remove useless and apply spotless Signed-off-by: xinyual <[email protected]> Signed-off-by: yuye-aws <[email protected]>
* change logic of max chunk number Signed-off-by: xinyual <[email protected]> Signed-off-by: yuye-aws <[email protected]>
* add max chunk limit into fixed token length algorithm Signed-off-by: yuye-aws <[email protected]>
* Support list<list<string>> type in embedding and extract validation logic to common class Signed-off-by: zane-neo <[email protected]> Signed-off-by: yuye-aws <[email protected]>
* fix unit tests for inference processor Signed-off-by: yuye-aws <[email protected]>
* implement unit tests for unit tests with max_chunk_limit in fixed token length Signed-off-by: yuye-aws <[email protected]>
* constructor for inference processor Signed-off-by: yuye-aws <[email protected]>
* use inference processor Signed-off-by: xinyual <[email protected]> Signed-off-by: yuye-aws <[email protected]>
* draft code for extending inference processor with document chunking processor Signed-off-by: yuye-aws <[email protected]>
* api refactor for document chunking processor Signed-off-by: yuye-aws <[email protected]>
* remove nested list key for chunking processor Signed-off-by: yuye-aws <[email protected]>
* remove unused function Signed-off-by: yuye-aws <[email protected]>
* remove processor validator Signed-off-by: yuye-aws <[email protected]>
* remove processor validator Signed-off-by: yuye-aws <[email protected]>
* Revert InferenceProcessor.java Signed-off-by: Yuye Zhu <[email protected]> Signed-off-by: yuye-aws <[email protected]>
* revert changes in text embedding and sparse encoding processor Signed-off-by: yuye-aws <[email protected]>
* implement chunk with map in document chunking processor Signed-off-by: yuye-aws <[email protected]>
* add default delimiter value Signed-off-by: Lu <[email protected]> Signed-off-by: yuye-aws <[email protected]>
* implement max chunk logic in document chunking processor Signed-off-by: yuye-aws <[email protected]>
* add initial value for max chunk limit in document chunking processor Signed-off-by: yuye-aws <[email protected]>
* bug fix in chunking processor: allow 0 max_chunk_limit Signed-off-by: yuye-aws <[email protected]>
* implement overlap rate with big decimal Signed-off-by: yuye-aws <[email protected]>
* update max chunk limit in delimiter Signed-off-by: yuye-aws <[email protected]>
* update parameter setting for fixed token length algorithm Signed-off-by: yuye-aws <[email protected]>
* update max chunk limit implementation in chunking processor Signed-off-by: yuye-aws <[email protected]>
* fix unit tests for fixed token length algorithm Signed-off-by: yuye-aws <[email protected]>
* spotless apply for document chunking processor Signed-off-by: yuye-aws <[email protected]>
* initialize current chunk count Signed-off-by: yuye-aws <[email protected]>
* parameter validation for max chunk limit Signed-off-by: yuye-aws <[email protected]>
* fix integration tests Signed-off-by: yuye-aws <[email protected]>
* fix current UT Signed-off-by: xinyual <[email protected]> Signed-off-by: yuye-aws <[email protected]>
* change delimiter UT Signed-off-by: xinyual <[email protected]> Signed-off-by: yuye-aws <[email protected]>
* remove delimiter useless code Signed-off-by: xinyual <[email protected]> Signed-off-by: yuye-aws <[email protected]>
* add more UT Signed-off-by: xinyual <[email protected]> Signed-off-by: yuye-aws <[email protected]>
* add UT for list inside map Signed-off-by: xinyual <[email protected]> Signed-off-by: yuye-aws <[email protected]>
* add UT for list inside map Signed-off-by: xinyual <[email protected]> Signed-off-by: yuye-aws <[email protected]>
* update unit tests for chunking processor Signed-off-by: yuye-aws <[email protected]>
* add more unit tests for chunking processor Signed-off-by: yuye-aws <[email protected]>
* resolve code review comments Signed-off-by: yuye-aws <[email protected]>
* add java doc Signed-off-by: yuye-aws <[email protected]>
* update java doc Signed-off-by: yuye-aws <[email protected]>
* update java doc Signed-off-by: yuye-aws <[email protected]>
* fix import order Signed-off-by: yuye-aws <[email protected]>
* update java doc Signed-off-by: yuye-aws <[email protected]>
* fix java doc error Signed-off-by: yuye-aws <[email protected]>
* fix update ut for fixed token length chunker Signed-off-by: yuye-aws <[email protected]>
* resolve code review comments Signed-off-by: yuye-aws <[email protected]>
* resolve code review comments Signed-off-by: yuye-aws <[email protected]>
* resolve code review comments Signed-off-by: yuye-aws <[email protected]>
* resolve code review comments Signed-off-by: yuye-aws <[email protected]>
* implement chunk count wrapper for max chunk limit Signed-off-by: yuye-aws <[email protected]>
* rename variable end to nextDelimiterPosition Signed-off-by: yuye-aws <[email protected]>
* adjust method place Signed-off-by: yuye-aws <[email protected]>
* update java doc for fixed token length algorithm Signed-off-by: yuye-aws <[email protected]>
* reanme interface name and fixed token length algorithm name Signed-off-by: yuye-aws <[email protected]>
* update fixed token length algorithm configuration for integration tests Signed-off-by: yuye-aws <[email protected]>
* make delimiter member variables static Signed-off-by: yuye-aws <[email protected]>
* remove redundant set field value in execute method Signed-off-by: yuye-aws <[email protected]>
* resolve code review comments Signed-off-by: yuye-aws <[email protected]>
* add integration tests with more tokenizers Signed-off-by: yuye-aws <[email protected]>
* bug fix: unit test failure due to invalid tokenizer Signed-off-by: yuye-aws <[email protected]>
* bug fix: token concatenation in fixed token length algorithm Signed-off-by: yuye-aws <[email protected]>
* update chunker interface Signed-off-by: yuye-aws <[email protected]>
* track chunkCount within function Signed-off-by: yuye-aws <[email protected]>
* bug fix: allow white space as the delimiter Signed-off-by: yuye-aws <[email protected]>
* fix fixed length chunker Signed-off-by: xinyual <[email protected]>
* fix delimiter chunker Signed-off-by: xinyual <[email protected]>
* fix chunker factory Signed-off-by: xinyual <[email protected]>
* fix UTs Signed-off-by: xinyual <[email protected]>
* fix UT and chunker factory Signed-off-by: xinyual <[email protected]>
* move analysis_registry to non-runtime parameters Signed-off-by: xinyual <[email protected]>
* fix Uts Signed-off-by: xinyual <[email protected]>
* avoid java doc change Signed-off-by: xinyual <[email protected]>
* move validate to commonUtlis Signed-off-by: xinyual <[email protected]>
* remove useless function Signed-off-by: xinyual <[email protected]>
* change java doc Signed-off-by: xinyual <[email protected]>
* fix Document process ut Signed-off-by: xinyual <[email protected]>
* fixed token length: re-implement with start and end offset Signed-off-by: yuye-aws <[email protected]>
* update exception message Signed-off-by: yuye-aws <[email protected]>
* fix document chunking processor IT Signed-off-by: yuye-aws <[email protected]>
* bug fix: adjust start, end content position in fixed token length algorithm Signed-off-by: yuye-aws <[email protected]>
* update changelog for 2.x release Signed-off-by: yuye-aws <[email protected]>
* rename processor Signed-off-by: yuye-aws <[email protected]>
* update default delimiter to be \n\n Signed-off-by: yuye-aws <[email protected]>
* remove change log in 3.0 unreleased Signed-off-by: yuye-aws <[email protected]>
* fix IT failure due to chunking processor rename Signed-off-by: yuye-aws <[email protected]>
* update javadoc for text chunking processor factory Signed-off-by: yuye-aws <[email protected]>
* adjust functions in chunker interface Signed-off-by: yuye-aws <[email protected]>
* move algorithm name definition to concrete chunker class Signed-off-by: yuye-aws <[email protected]>
* update string formatted message for text chunking processor Signed-off-by: yuye-aws <[email protected]>
* update string formatted message for chunker factory Signed-off-by: yuye-aws <[email protected]>
* update string formatted message for chunker parameter validator Signed-off-by: yuye-aws <[email protected]>
* update java doc for delimiter algorithm Signed-off-by: yuye-aws <[email protected]>
* support range double in chunker parameter validator Signed-off-by: yuye-aws <[email protected]>
* update string formatted message for fixed token length algorithm Signed-off-by: yuye-aws <[email protected]>
* update sneaky throw with text chunking processor it Signed-off-by: yuye-aws <[email protected]>
* add word tokenizer restriction for fixed token length algorithm Signed-off-by: yuye-aws <[email protected]>
* update error message for multiple algorithms in text chunking processor Signed-off-by: yuye-aws <[email protected]>
* add comment in text chunking processor Signed-off-by: yuye-aws <[email protected]>
* validate max chunk limit with util parameter class Signed-off-by: yuye-aws <[email protected]>
* update comments Signed-off-by: yuye-aws <[email protected]>
* update comments Signed-off-by: yuye-aws <[email protected]>
* update java doc Signed-off-by: yuye-aws <[email protected]>
* update java doc Signed-off-by: yuye-aws <[email protected]>
* make parameter final Signed-off-by: yuye-aws <[email protected]>
* implement a map from chunker name to constuctor function in chunker factory Signed-off-by: yuye-aws <[email protected]>
* bug fix in chunker factory Signed-off-by: yuye-aws <[email protected]>
* remove get all chunkers in chunker factory Signed-off-by: yuye-aws <[email protected]>
* remove type check for parameter check for max token count Signed-off-by: yuye-aws <[email protected]>
* remove type check for parameter check for analysis registry Signed-off-by: yuye-aws <[email protected]>
* implement parser and validator Signed-off-by: yuye-aws <[email protected]>
* update comment Signed-off-by: yuye-aws <[email protected]>
* provide fixed token length as the default algorithm Signed-off-by: yuye-aws <[email protected]>
* adjust exception message Signed-off-by: yuye-aws <[email protected]>
* adjust exception message Signed-off-by: yuye-aws <[email protected]>
* use object nonnull and require nonnull Signed-off-by: yuye-aws <[email protected]>
* apply final to ingest document and chunk count Signed-off-by: yuye-aws <[email protected]>
* merge parameter validator into the parser Signed-off-by: yuye-aws <[email protected]>
* assign positive default value for max chunk limit Signed-off-by: yuye-aws <[email protected]>
* validate supported chunker algorithm in text chunking processor Signed-off-by: yuye-aws <[email protected]>
* update parameter setting of max chunk limit Signed-off-by: yuye-aws <[email protected]>
* add unit test with non list of string Signed-off-by: yuye-aws <[email protected]>
* add unit test with null input Signed-off-by: yuye-aws <[email protected]>
* add unit test for tokenization excpetion in fixed token length algorithm Signed-off-by: yuye-aws <[email protected]>
* tune method name in text chunking processor unit test Signed-off-by: yuye-aws <[email protected]>
* tune method name in delimiter algorithm unit test Signed-off-by: yuye-aws <[email protected]>
* add unit test for overlap rate too small in fixed token length algorithm Signed-off-by: yuye-aws <[email protected]>
* tune method modifier for all classes Signed-off-by: yuye-aws <[email protected]>
* tune code Signed-off-by: yuye-aws <[email protected]>
* tune code Signed-off-by: yuye-aws <[email protected]>
* tune exception type in parameter parser Signed-off-by: yuye-aws <[email protected]>
* tune comment Signed-off-by: yuye-aws <[email protected]>
* tune comment Signed-off-by: yuye-aws <[email protected]>
* include max chunk limit in both algorithms Signed-off-by: yuye-aws <[email protected]>
* tune comment Signed-off-by: yuye-aws <[email protected]>
* allow 0 for max chunk limit Signed-off-by: yuye-aws <[email protected]>
* update runtime max chunk limit in text chunking processor Signed-off-by: yuye-aws <[email protected]>
* tune code for chunker Signed-off-by: yuye-aws <[email protected]>
* implement test for multiple field max chunk limit exceed Signed-off-by: yuye-aws <[email protected]>
* tune methods name in text chunking proceesor unit tests Signed-off-by: yuye-aws <[email protected]>
* add unit tests for both algorithms with max chunk limit Signed-off-by: yuye-aws <[email protected]>
* optimize code Signed-off-by: yuye-aws <[email protected]>
* extract max chunk limit check to util class Signed-off-by: yuye-aws <[email protected]>
* resolve code review comments Signed-off-by: yuye-aws <[email protected]>
* fix unit tests Signed-off-by: yuye-aws <[email protected]>
* bug fix: only update runtime max chunk limit when enabled Signed-off-by: yuye-aws <[email protected]>
---------
Signed-off-by: yuye-aws <[email protected]>
Signed-off-by: xinyual <[email protected]>
Signed-off-by: zane-neo <[email protected]>
Signed-off-by: Yuye Zhu <[email protected]>
Signed-off-by: Lu <[email protected]>
Co-authored-by: xinyual <[email protected]>
Co-authored-by: zane-neo <[email protected]>
Co-authored-by: Lu <[email protected]>
(cherry picked from commit eea53aa)
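Several commits above concern the fixed token length algorithm (token limit, overlap rate computed with `BigDecimal`). The following is a rough sketch of that idea only, not the PR's implementation: it splits on whitespace as a stand-in for the analysis-registry tokenizer, and the names `chunk`, `tokenLimit`, and `overlapRate` are hypothetical.

```java
import java.math.BigDecimal;
import java.math.RoundingMode;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class FixedTokenLengthSketch {
    // Hypothetical signature; whitespace splitting stands in for a real tokenizer.
    static List<String> chunk(String content, int tokenLimit, double overlapRate) {
        if (tokenLimit <= 0 || overlapRate < 0 || overlapRate >= 1) {
            throw new IllegalArgumentException("tokenLimit must be positive and overlapRate in [0, 1)");
        }
        String trimmed = content.trim();
        String[] tokens = trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");

        // BigDecimal avoids floating point surprises such as 0.3 * 10 = 2.9999...
        int overlap = BigDecimal.valueOf(overlapRate)
            .multiply(BigDecimal.valueOf(tokenLimit))
            .setScale(0, RoundingMode.DOWN)
            .intValue();
        int step = tokenLimit - overlap; // at least 1 because overlapRate < 1

        List<String> chunks = new ArrayList<>();
        for (int start = 0; start < tokens.length; start += step) {
            int end = Math.min(start + tokenLimit, tokens.length);
            chunks.add(String.join(" ", Arrays.copyOfRange(tokens, start, end)));
            if (end == tokens.length) {
                break; // the last window already reached the end of the text
            }
        }
        return chunks;
    }
}
```

With `tokenLimit = 4` and `overlapRate = 0.5`, each chunk shares its last two tokens with the start of the next one. Note that this sketch rejoins tokens with single spaces, whereas the commit list mentions a re-implementation based on start and end offsets into the original text.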
…en length and delimiter algorithm (#644) * feat: implement text chunking processor with fixed token length and delimiter algorithm (#607)
add unit test for overlap rate too small in fixed token length algorithm Signed-off-by: yuye-aws <[email protected]> * tune method modifier for all classes Signed-off-by: yuye-aws <[email protected]> * tune code Signed-off-by: yuye-aws <[email protected]> * tune code Signed-off-by: yuye-aws <[email protected]> * tune exception type in parameter parser Signed-off-by: yuye-aws <[email protected]> * tune comment Signed-off-by: yuye-aws <[email protected]> * tune comment Signed-off-by: yuye-aws <[email protected]> * include max chunk limit in both algorithms Signed-off-by: yuye-aws <[email protected]> * tune comment Signed-off-by: yuye-aws <[email protected]> * allow 0 for max chunk limit Signed-off-by: yuye-aws <[email protected]> * update runtime max chunk limit in text chunking processor Signed-off-by: yuye-aws <[email protected]> * tune code for chunker Signed-off-by: yuye-aws <[email protected]> * implement test for multiple field max chunk limit exceed Signed-off-by: yuye-aws <[email protected]> * tune methods name in text chunking proceesor unit tests Signed-off-by: yuye-aws <[email protected]> * add unit tests for both algorithms with max chunk limit Signed-off-by: yuye-aws <[email protected]> * optimize code Signed-off-by: yuye-aws <[email protected]> * extract max chunk limit check to util class Signed-off-by: yuye-aws <[email protected]> * resolve code review comments Signed-off-by: yuye-aws <[email protected]> * fix unit tests Signed-off-by: yuye-aws <[email protected]> * bug fix: only update runtime max chunk limit when enabled Signed-off-by: yuye-aws <[email protected]> --------- Signed-off-by: yuye-aws <[email protected]> Signed-off-by: xinyual <[email protected]> Signed-off-by: zane-neo <[email protected]> Signed-off-by: Yuye Zhu <[email protected]> Signed-off-by: Lu <[email protected]> Co-authored-by: xinyual <[email protected]> Co-authored-by: zane-neo <[email protected]> Co-authored-by: Lu <[email protected]> (cherry picked from commit eea53aa) * bug 
fix: fix compile error in integration test (#645) Signed-off-by: yuye-aws <[email protected]> --------- Signed-off-by: yuye-aws <[email protected]> Co-authored-by: Yuye Zhu <[email protected]>
// chunk the object when target key is of leaf type (null, string and list of string)
Object chunkObject = sourceAndMetadataMap.get(originalKey);
List<String> chunkedResult = chunkLeafType(chunkObject, runtimeParameters);
sourceAndMetadataMap.put(String.valueOf(targetKey), chunkedResult);
sourceAndMetadataMap contains metadata fields such as _index, _routing, and _id. If targetKey equals the name of one of these metadata fields, the put call above would overwrite it, which may cause unexpected behavior.
A simple solution is to prohibit targetKey values that start with "_".
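A minimal sketch of such a check, assuming it runs when the processor's field map is parsed. The class and method names (TargetKeyCheck, validateTargetKey, isValidTargetKey) are hypothetical and are not part of this PR:

```java
class TargetKeyCheck {
    // Hypothetical helper: rejects target keys that start with "_" so they cannot
    // collide with metadata fields such as _index, _routing, and _id.
    static void validateTargetKey(String targetKey) {
        if (targetKey == null || targetKey.isEmpty()) {
            throw new IllegalArgumentException("target key must not be null or empty");
        }
        if (targetKey.startsWith("_")) {
            throw new IllegalArgumentException(
                String.format("target key [%s] must not start with '_'", targetKey)
            );
        }
    }

    // Convenience wrapper that reports validity instead of throwing.
    static boolean isValidTargetKey(String targetKey) {
        try {
            validateTargetKey(targetKey);
            return true;
        } catch (IllegalArgumentException e) {
            return false;
        }
    }
}
```

A blanket prefix rule is stricter than checking against an explicit list of metadata field names, but it does not need updating if new metadata fields are added.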
Let me check the behavior of other ingestion processors.
Description
This PR implements the text chunking processor proposed in the RFC. It provides two algorithms: the fixed token length algorithm and the delimiter algorithm. Users can use the chunking ingest processor as follows:
And then obtain the response:
You can refer to the RFC for detailed parameter description.
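To illustrate the idea behind the fixed token length algorithm, here is a simplified sketch. It is not the code in this PR: it assumes whitespace tokenization instead of an OpenSearch tokenizer, and the names FixedTokenLengthSketch, tokenLimit, and overlapRate, as well as the 0 to 0.5 bound on the overlap rate, are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class FixedTokenLengthSketch {
    // Splits text into chunks of at most tokenLimit whitespace-separated tokens.
    // Consecutive chunks share floor(tokenLimit * overlapRate) tokens.
    static List<String> chunk(String text, int tokenLimit, double overlapRate) {
        if (tokenLimit <= 0) {
            throw new IllegalArgumentException("tokenLimit must be positive");
        }
        if (overlapRate < 0.0 || overlapRate > 0.5) {
            // Assumed bound; it guarantees that the step below is at least 1.
            throw new IllegalArgumentException("overlapRate must be between 0 and 0.5");
        }
        List<String> chunks = new ArrayList<>();
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return chunks;
        }
        String[] tokens = trimmed.split("\\s+");
        int overlap = (int) Math.floor(tokenLimit * overlapRate);
        int step = tokenLimit - overlap;
        int start = 0;
        while (true) {
            int end = Math.min(start + tokenLimit, tokens.length);
            chunks.add(String.join(" ", Arrays.copyOfRange(tokens, start, end)));
            if (end == tokens.length) {
                break; // the last chunk reached the end of the text
            }
            start += step;
        }
        return chunks;
    }
}
```

For example, chunking "a b c d e f g" with a token limit of 4 and an overlap rate of 0.5 yields "a b c d", "c d e f", and "e f g".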
Use Cases
Text Embedding
After configuring the text_embedding processor and obtaining the model ID, we can chain the chunking processor with the text_embedding processor to obtain an embedding vector for each chunked passage. Here is an example:
And we obtain the following results:
Cascaded Chunking Processors
Users can chain multiple chunking processors together. For example, to split documents by paragraph, a user can apply the Delimiter algorithm and set the delimiter to "\n\n". If a paragraph may still exceed the token limit, the user can append a second chunking processor that uses the Fixed Token Length algorithm. The ingestion pipeline in this example should be configured like:
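The effect of cascading can be sketched in plain Java. This is an illustration only, not the processor code: it assumes whitespace tokenization, no overlap in the second step, and that the delimiter step keeps the delimiter at the end of each chunk. All class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class CascadedChunkingSketch {
    // Step 1: split on the delimiter, keeping it at the end of each chunk (assumption).
    static List<String> splitByDelimiter(String text, String delimiter) {
        if (delimiter.isEmpty()) {
            throw new IllegalArgumentException("delimiter must not be empty");
        }
        List<String> chunks = new ArrayList<>();
        int start = 0;
        int next = text.indexOf(delimiter, start);
        while (next != -1) {
            chunks.add(text.substring(start, next + delimiter.length()));
            start = next + delimiter.length();
            next = text.indexOf(delimiter, start);
        }
        if (start < text.length()) {
            chunks.add(text.substring(start)); // trailing text without a delimiter
        }
        return chunks;
    }

    // Step 2: at most tokenLimit whitespace-separated tokens per chunk, no overlap.
    static List<String> splitByTokenLimit(String text, int tokenLimit) {
        if (tokenLimit <= 0) {
            throw new IllegalArgumentException("tokenLimit must be positive");
        }
        List<String> chunks = new ArrayList<>();
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return chunks;
        }
        String[] tokens = trimmed.split("\\s+");
        for (int start = 0; start < tokens.length; start += tokenLimit) {
            int end = Math.min(start + tokenLimit, tokens.length);
            chunks.add(String.join(" ", Arrays.copyOfRange(tokens, start, end)));
        }
        return chunks;
    }

    // Cascade: every paragraph from step 1 is re-chunked by step 2 and the results are flattened.
    static List<String> cascade(String document, String delimiter, int tokenLimit) {
        List<String> result = new ArrayList<>();
        for (String paragraph : splitByDelimiter(document, delimiter)) {
            result.addAll(splitByTokenLimit(paragraph, tokenLimit));
        }
        return result;
    }
}
```

Paragraphs that already fit within the token limit pass through the second step as a single chunk, so only long paragraphs are split further. In this sketch the second step also discards the trailing delimiter because it re-tokenizes on whitespace.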
Issues Resolved
Implement document chunking processor and fixed token length algorithm
Check List
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.