feat: implement text chunking processor with fixed token length and delimiter algorithm #607

yuye-aws · 2024-02-18T12:31:55Z

Description

This PR implements the text chunking processor proposed in the RFC. It provides two algorithms: fixed token length and delimiter. The chunking ingest processor can be used as follows:

POST _ingest/pipeline/_simulate
{
  "pipeline": {
    "processors": [
      {
        "text_chunking": {
          "algorithm": {
            "fixed_token_length": {
              "token_limit": 10,
              "overlap_rate": 0.2,
              "tokenizer": "standard"
            }
          },
          "field_map": {
            "body": "body_chunk"
          }
        }
      }
    ]
  },
  "docs": [
    {
      "_source": {
        "body": "This is an example document to be chunked. The document contains a single paragraph, two sentences and 24 tokens by standard tokenizer in OpenSearch."
      }
    }
  ]
}

And then obtain the response:

{
  "docs": [
    {
      "doc": {
        "_index": "_index",
        "_id": "_id",
        "_source": {
          "body_chunk": [
            "This is an example document to be chunked The document",
            "The document contains a single paragraph two sentences and 24",
            "and 24 tokens by standard tokenizer in OpenSearch"
          ],
          "body": "This is an example document to be chunked. The document contains a single paragraph, two sentences and 24 tokens by standard tokenizer in OpenSearch."
        },
        "_ingest": {
          "timestamp": "2024-03-05T09:49:37.131255Z"
        }
      }
    }
  ]
}

Refer to the RFC for detailed parameter descriptions.
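To make the effect of token_limit and overlap_rate concrete, here is a minimal Java sketch of a sliding-window chunker. It is not the code in this PR: the class and method names are hypothetical, the regex split is only a stand-in for the OpenSearch standard tokenizer, and joining tokens with a single space is an assumption taken from the sample response above.

```java
import java.math.BigDecimal;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration, not the FixedTokenLengthChunker from this PR.
class FixedTokenLengthSketch {

    // Stand-in tokenizer (assumption): split on anything that is not a letter or digit.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String token : text.split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    // Emit windows of at most tokenLimit tokens; consecutive windows share
    // floor(tokenLimit * overlapRate) tokens.
    static List<String> chunk(String text, int tokenLimit, double overlapRate) {
        if (tokenLimit <= 0 || overlapRate < 0 || overlapRate >= 1) {
            throw new IllegalArgumentException("tokenLimit must be positive and overlapRate must be in [0, 1)");
        }
        // BigDecimal avoids floating-point surprises when multiplying the rate.
        int overlap = BigDecimal.valueOf(overlapRate).multiply(BigDecimal.valueOf((long) tokenLimit)).intValue();
        int step = tokenLimit - overlap; // at least 1 because overlapRate < 1

        List<String> tokens = tokenize(text);
        List<String> chunks = new ArrayList<>();
        int start = 0;
        while (start < tokens.size()) {
            int end = Math.min(start + tokenLimit, tokens.size());
            chunks.add(String.join(" ", tokens.subList(start, end)));
            if (end == tokens.size()) {
                break; // the last window reached the end of the text
            }
            start += step;
        }
        return chunks;
    }
}
```

With token_limit 10 and overlap_rate 0.2, each window advances by 8 tokens, so the 24-token sample document yields windows over tokens 1-10, 9-18 and 17-24, which matches the three chunks in the response above.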

Use Cases

Text Embedding

After configuring the text_embedding processor and obtaining the model ID, we can chain the chunking processor with the text_embedding processor to obtain an embedding vector for each chunked passage. Here is an example:

POST _ingest/pipeline/_simulate
{
  "pipeline": {
    "processors": [
      {
        "text_chunking": {
          "algorithm": {
            "fixed_token_length": {
              "token_limit": 10,
              "overlap_rate": 0.2,
              "tokenizer": "standard"
            }
          },
          "field_map": {
            "body": "body_chunk"
          }
        }
      },
      {
        "text_embedding": {
          "model_id": "IYMBDo4BwlxmLrDqUr0a",
          "field_map": {
            "body_chunk": "body_chunk_embedding"
          }
        }
      }
    ]
  },
  "docs": [
    {
      "_source": {
        "body": "This is an example document to be chunked. The document contains a single paragraph, two sentences and 24 tokens by standard tokenizer in OpenSearch."
      }
    }
  ]
}

And we obtain the following results:

{
  "docs": [
    {
      "doc": {
        "_index": "_index",
        "_id": "_id",
        "_source": {
          "body_chunk": [
            "This is an example document to be chunked The document",
            "The document contains a single paragraph two sentences and 24",
            "and 24 tokens by standard tokenizer in OpenSearch"
          ],
          "body_chunk_embedding": [
            {
              "knn": [...]
            },
            {
              "knn": [...]
            },
            {
              "knn": [...]
            }
          ],
          "body": "This is an example document to be chunked. The document contains a single paragraph, two sentences and 24 tokens by standard tokenizer in OpenSearch."
        },
        "_ingest": {
          "timestamp": "2024-03-05T09:49:37.131255Z"
        }
      }
    }
  ]
}

Cascaded Chunking Processors

Users can chain multiple chunking processors together. For example, if a user wishes to split documents by paragraph, they can apply the Delimiter algorithm and set the delimiter parameter to "\n\n". If a paragraph exceeds the token limit, the user can then append another chunking processor that uses the Fixed Token Length algorithm. The ingest pipeline for this example can be configured as follows:

PUT _ingest/pipeline/chunking-pipeline
{
  "processors": [
    {
      "text_chunking": {
        "algorithm": {
          "delimiter": {
            "delimiter": "\n\n"
          }
        },
        "field_map": {
          "body": "body_chunk1"
        }
      }
    },
    {
      "text_chunking": {
        "algorithm": {
          "fixed_token_length": {
            "token_limit": 500,
            "overlap_rate": 0.2,
            "tokenizer": "standard"
          }
        },
        "field_map": {
          "body_chunk1": "body_chunk2"
        }
      }
    }
  ]
}
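
The two-stage pipeline above can be pictured with the following Java sketch. It is an illustration only: the class and method names are hypothetical, whitespace splitting stands in for the standard tokenizer, the overlap is passed as a token count rather than a rate, and keeping the delimiter attached to the end of each first-stage chunk is an assumption of this sketch rather than confirmed behaviour of the processor.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of cascading a delimiter split with a token-limit split.
class CascadedChunkingSketch {

    // Stage 1: cut the text after every occurrence of the delimiter.
    // Assumption: the delimiter stays at the end of the chunk it terminates.
    static List<String> splitByDelimiter(String text, String delimiter) {
        List<String> chunks = new ArrayList<>();
        int start = 0;
        int next = text.indexOf(delimiter, start);
        while (next != -1) {
            chunks.add(text.substring(start, next + delimiter.length()));
            start = next + delimiter.length();
            next = text.indexOf(delimiter, start);
        }
        if (start < text.length()) {
            chunks.add(text.substring(start));
        }
        return chunks;
    }

    // Stage 2: sliding window over whitespace-separated tokens (stand-in tokenizer).
    static List<String> splitByTokenLimit(String passage, int tokenLimit, int overlapTokens) {
        if (tokenLimit <= 0 || overlapTokens < 0 || overlapTokens >= tokenLimit) {
            throw new IllegalArgumentException("need 0 <= overlapTokens < tokenLimit");
        }
        List<String> chunks = new ArrayList<>();
        String trimmed = passage.trim();
        if (trimmed.isEmpty()) {
            return chunks;
        }
        String[] tokens = trimmed.split("\\s+");
        int step = tokenLimit - overlapTokens;
        for (int start = 0; start < tokens.length; start += step) {
            int end = Math.min(start + tokenLimit, tokens.length);
            chunks.add(String.join(" ", List.of(tokens).subList(start, end)));
            if (end == tokens.length) {
                break;
            }
        }
        return chunks;
    }

    // Feed every stage-1 chunk through stage 2 and flatten the result,
    // mirroring body -> body_chunk1 -> body_chunk2 in the pipeline above.
    static List<String> cascade(String body, String delimiter, int tokenLimit, int overlapTokens) {
        List<String> result = new ArrayList<>();
        for (String paragraph : splitByDelimiter(body, delimiter)) {
            result.addAll(splitByTokenLimit(paragraph, tokenLimit, overlapTokens));
        }
        return result;
    }
}
```

Short paragraphs pass through the second stage unchanged apart from whitespace normalisation, while long paragraphs are cut into overlapping windows, so no window ever spans a paragraph boundary.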

Issues Resolved

Implement document chunking processor and fixed token length algorithm

Check List

New functionality includes testing.
- All tests pass
New functionality has been documented.
- New functionality has javadoc added
Commits are signed as per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

yuye-aws · 2024-02-18T12:40:28Z

For now, this PR is a POC for the RFC. I will mark this PR as ready when we finalize the high level design and add corresponding unit tests and integration tests.

codecov · 2024-02-18T12:41:11Z

Codecov Report

Attention: Patch coverage is 97.89916%, with 5 lines in your changes missing coverage. Please review.

Project coverage is 84.19%. Comparing base (e41fba7) to head (68fef4f).
Report is 2 commits behind head on main.

Files	Patch %	Lines
.../neuralsearch/processor/TextChunkingProcessor.java	96.03%	2 Missing and 3 partials ⚠️

Additional details and impacted files

@@             Coverage Diff              @@
##               main     #607      +/-   ##
============================================
+ Coverage     82.62%   84.19%   +1.56%     
- Complexity      666      743      +77     
============================================
  Files            52       59       +7     
  Lines          2072     2309     +237     
  Branches        334      370      +36     
============================================
+ Hits           1712     1944     +232     
- Misses          212      214       +2     
- Partials        148      151       +3


yuye-aws · 2024-02-22T15:25:08Z

Hi @zane-neo! I have modified the PR according to your comments. Feel free to review my code.

sam-herman

Thank you for the draft, @yuye-aws. I would like us to follow the upcoming new feature release process.

Let's make sure all feature spec feedback is collected in the RFC: [RFC] Text chunking design #548
Let's create a meta issue with the design (I can create one and link it)
We will then move forward with the changes

yuye-aws · 2024-02-26T01:42:32Z

Let's create a meta issue with design (I can create one and link it)

Do you mean the high-level design of the document chunking processor? Is the Interface Design section in the RFC what you are looking for?

model-collapse · 2024-03-18T00:03:37Z

src/main/java/org/opensearch/neuralsearch/processor/chunker/FixedTokenLengthChunker.java

+    private static final Set<String> WORD_TOKENIZERS = Set.of(
+        "standard",
+        "letter",
+        "lowercase",
+        "whitespace",
+        "uax_url_email",
+        "classic",
+        "thai"
+    );


For now, let's not support any customized tokenizers here, to avoid ones that produce overlapping tokens. We can add a smarter tokenizer checker later.
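
As an illustration of how such an allow-list might be enforced, here is a small Java sketch. The set is copied from the diff hunk above; the class name, the validateTokenizer helper and the error wording are hypothetical and are not taken from this PR.

```java
import java.util.Locale;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical illustration of rejecting tokenizers outside the allow-list.
class TokenizerValidationSketch {

    // Same entries as WORD_TOKENIZERS in the diff hunk above.
    private static final Set<String> WORD_TOKENIZERS = Set.of(
        "standard",
        "letter",
        "lowercase",
        "whitespace",
        "uax_url_email",
        "classic",
        "thai"
    );

    // Returns the tokenizer name unchanged if it is allowed, otherwise throws.
    // The null check comes first because Set.of(...).contains(null) throws a NullPointerException.
    static String validateTokenizer(String tokenizer) {
        if (tokenizer == null || !WORD_TOKENIZERS.contains(tokenizer)) {
            throw new IllegalArgumentException(
                String.format(
                    Locale.ROOT,
                    "Tokenizer [%s] is not supported. Supported tokenizers are %s",
                    tokenizer,
                    new TreeSet<>(WORD_TOKENIZERS) // sorted for a stable message
                )
            );
        }
        return tokenizer;
    }
}
```

Rejecting unknown names up front means a custom tokenizer whose tokens overlap never reaches the chunking logic.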

model-collapse · 2024-03-18T00:07:17Z

src/main/java/org/opensearch/neuralsearch/processor/chunker/FixedTokenLengthChunker.java

+            throw new IllegalStateException(
+                String.format(Locale.ROOT, "%s algorithm encounters exception in tokenization: %s", ALGORITHM_NAME, e.getMessage()),


It is ok to include the original message, but the wording is too simple. We need to explain why this is happening.

…elimiter algorithm (#607) * implement chunking processor and fixed token length Signed-off-by: yuye-aws <[email protected]> * initialize node client for document chunking processor Signed-off-by: yuye-aws <[email protected]> * initialize document chunking processor with analysis registry Signed-off-by: yuye-aws <[email protected]> * chunker factory create with analysis registry Signed-off-by: yuye-aws <[email protected]> * implement tokenizer in fixed token length algorithm with analysis registry Signed-off-by: yuye-aws <[email protected]> * add max token count parsing logic Signed-off-by: yuye-aws <[email protected]> * bug fix for non-existing index Signed-off-by: yuye-aws <[email protected]> * change error log Signed-off-by: yuye-aws <[email protected]> * implement evenly chunk Signed-off-by: yuye-aws <[email protected]> * unit tests for chunker factory Signed-off-by: yuye-aws <[email protected]> * unit tests for chunker factory Signed-off-by: yuye-aws <[email protected]> * add error message for chunker factory tests Signed-off-by: yuye-aws <[email protected]> * resolve comments Signed-off-by: yuye-aws <[email protected]> * Revert "implement evenly chunk" This reverts commit 93dd2f4. 
Signed-off-by: yuye-aws <[email protected]> * add default value logic back Signed-off-by: yuye-aws <[email protected]> * implement unit test for fixed token length chunker Signed-off-by: yuye-aws <[email protected]> * add test cases in unit test for fixed token length chunker Signed-off-by: yuye-aws <[email protected]> * support map type as an input Signed-off-by: yuye-aws <[email protected]> * support map type as an input Signed-off-by: yuye-aws <[email protected]> * bug fix for map type Signed-off-by: yuye-aws <[email protected]> * bug fix for map type Signed-off-by: yuye-aws <[email protected]> * bug fix for map type in document chunking processor Signed-off-by: yuye-aws <[email protected]> * remove system out println Signed-off-by: yuye-aws <[email protected]> * add delimiter chunker Signed-off-by: xinyual <[email protected]> Signed-off-by: yuye-aws <[email protected]> * add UT for delimiter chunker Signed-off-by: xinyual <[email protected]> Signed-off-by: yuye-aws <[email protected]> * add delimiter chunker processor Signed-off-by: xinyual <[email protected]> Signed-off-by: yuye-aws <[email protected]> * add more UTs Signed-off-by: xinyual <[email protected]> Signed-off-by: yuye-aws <[email protected]> * add more UTs Signed-off-by: xinyual <[email protected]> Signed-off-by: yuye-aws <[email protected]> * basic unit tests for document chunking processor Signed-off-by: yuye-aws <[email protected]> * fix tests for getProcessors in neural search Signed-off-by: yuye-aws <[email protected]> * add unit tests with string, map and nested map type for document chunking processor Signed-off-by: yuye-aws <[email protected]> * add unit tests for parameter valdiation in document chunking processor Signed-off-by: yuye-aws <[email protected]> * add back deleted xml file Signed-off-by: yuye-aws <[email protected]> * restore xml file Signed-off-by: yuye-aws <[email protected]> * integration tests for document chunking processor Signed-off-by: yuye-aws <[email protected]> * add 
back Run_Neural_Search.xml Signed-off-by: yuye-aws <[email protected]> * restore Run_Neural_Search.xml Signed-off-by: yuye-aws <[email protected]> * add changelog Signed-off-by: yuye-aws <[email protected]> * update integration test for cascade processor Signed-off-by: yuye-aws <[email protected]> * add max chunk limit Signed-off-by: xinyual <[email protected]> Signed-off-by: yuye-aws <[email protected]> * remove useless and apply spotless Signed-off-by: xinyual <[email protected]> Signed-off-by: yuye-aws <[email protected]> * update error message Signed-off-by: yuye-aws <[email protected]> * change field UT Signed-off-by: xinyual <[email protected]> Signed-off-by: yuye-aws <[email protected]> * remove useless and apply spotless Signed-off-by: xinyual <[email protected]> Signed-off-by: yuye-aws <[email protected]> * change logic of max chunk number Signed-off-by: xinyual <[email protected]> Signed-off-by: yuye-aws <[email protected]> * add max chunk limit into fixed token length algorithm Signed-off-by: yuye-aws <[email protected]> * Support list<list<string>> type in embedding and extract validation logic to common class Signed-off-by: zane-neo <[email protected]> Signed-off-by: yuye-aws <[email protected]> * fix unit tests for inference processor Signed-off-by: yuye-aws <[email protected]> * implement unit tests for unit tests with max_chunk_limit in fixed token length Signed-off-by: yuye-aws <[email protected]> * constructor for inference processor Signed-off-by: yuye-aws <[email protected]> * use inference processor Signed-off-by: xinyual <[email protected]> Signed-off-by: yuye-aws <[email protected]> * draft code for extending inference processor with document chunking processor Signed-off-by: yuye-aws <[email protected]> * api refactor for document chunking processor Signed-off-by: yuye-aws <[email protected]> * remove nested list key for chunking processor Signed-off-by: yuye-aws <[email protected]> * remove unused function Signed-off-by: yuye-aws <[email 
protected]> * remove processor validator Signed-off-by: yuye-aws <[email protected]> * remove processor validator Signed-off-by: yuye-aws <[email protected]> * Revert InferenceProcessor.java Signed-off-by: Yuye Zhu <[email protected]> Signed-off-by: yuye-aws <[email protected]> * revert changes in text embedding and sparse encoding processor Signed-off-by: yuye-aws <[email protected]> * implement chunk with map in document chunking processor Signed-off-by: yuye-aws <[email protected]> * add default delimiter value Signed-off-by: Lu <[email protected]> Signed-off-by: yuye-aws <[email protected]> * implement max chunk logic in document chunking processor Signed-off-by: yuye-aws <[email protected]> * add initial value for max chunk limit in document chunking processor Signed-off-by: yuye-aws <[email protected]> * bug fix in chunking processor: allow 0 max_chunk_limit Signed-off-by: yuye-aws <[email protected]> * implement overlap rate with big decimal Signed-off-by: yuye-aws <[email protected]> * update max chunk limit in delimiter Signed-off-by: yuye-aws <[email protected]> * update parameter setting for fixed token length algorithm Signed-off-by: yuye-aws <[email protected]> * update max chunk limit implementation in chunking processor Signed-off-by: yuye-aws <[email protected]> * fix unit tests for fixed token length algorithm Signed-off-by: yuye-aws <[email protected]> * spotless apply for document chunking processor Signed-off-by: yuye-aws <[email protected]> * initialize current chunk count Signed-off-by: yuye-aws <[email protected]> * parameter validation for max chunk limit Signed-off-by: yuye-aws <[email protected]> * fix integration tests Signed-off-by: yuye-aws <[email protected]> * fix current UT Signed-off-by: xinyual <[email protected]> Signed-off-by: yuye-aws <[email protected]> * change delimiter UT Signed-off-by: xinyual <[email protected]> Signed-off-by: yuye-aws <[email protected]> * remove delimiter useless code Signed-off-by: xinyual <[email 
protected]> Signed-off-by: yuye-aws <[email protected]> * add more UT Signed-off-by: xinyual <[email protected]> Signed-off-by: yuye-aws <[email protected]> * add UT for list inside map Signed-off-by: xinyual <[email protected]> Signed-off-by: yuye-aws <[email protected]> * add UT for list inside map Signed-off-by: xinyual <[email protected]> Signed-off-by: yuye-aws <[email protected]> * update unit tests for chunking processor Signed-off-by: yuye-aws <[email protected]> * add more unit tests for chunking processor Signed-off-by: yuye-aws <[email protected]> * resolve code review comments Signed-off-by: yuye-aws <[email protected]> * add java doc Signed-off-by: yuye-aws <[email protected]> * update java doc Signed-off-by: yuye-aws <[email protected]> * update java doc Signed-off-by: yuye-aws <[email protected]> * fix import order Signed-off-by: yuye-aws <[email protected]> * update java doc Signed-off-by: yuye-aws <[email protected]> * fix java doc error Signed-off-by: yuye-aws <[email protected]> * fix update ut for fixed token length chunker Signed-off-by: yuye-aws <[email protected]> * resolve code review comments Signed-off-by: yuye-aws <[email protected]> * resolve code review comments Signed-off-by: yuye-aws <[email protected]> * resolve code review comments Signed-off-by: yuye-aws <[email protected]> * resolve code review comments Signed-off-by: yuye-aws <[email protected]> * implement chunk count wrapper for max chunk limit Signed-off-by: yuye-aws <[email protected]> * rename variable end to nextDelimiterPosition Signed-off-by: yuye-aws <[email protected]> * adjust method place Signed-off-by: yuye-aws <[email protected]> * update java doc for fixed token length algorithm Signed-off-by: yuye-aws <[email protected]> * reanme interface name and fixed token length algorithm name Signed-off-by: yuye-aws <[email protected]> * update fixed token length algorithm configuration for integration tests Signed-off-by: yuye-aws <[email protected]> * make delimiter member 
variables static Signed-off-by: yuye-aws <[email protected]> * remove redundant set field value in execute method Signed-off-by: yuye-aws <[email protected]> * resolve code review comments Signed-off-by: yuye-aws <[email protected]> * add integration tests with more tokenizers Signed-off-by: yuye-aws <[email protected]> * bug fix: unit test failure due to invalid tokenizer Signed-off-by: yuye-aws <[email protected]> * bug fix: token concatenation in fixed token length algorithm Signed-off-by: yuye-aws <[email protected]> * update chunker interface Signed-off-by: yuye-aws <[email protected]> * track chunkCount within function Signed-off-by: yuye-aws <[email protected]> * bug fix: allow white space as the delimiter Signed-off-by: yuye-aws <[email protected]> * fix fixed length chunker Signed-off-by: xinyual <[email protected]> * fix delimiter chunker Signed-off-by: xinyual <[email protected]> * fix chunker factory Signed-off-by: xinyual <[email protected]> * fix UTs Signed-off-by: xinyual <[email protected]> * fix UT and chunker factory Signed-off-by: xinyual <[email protected]> * move analysis_registry to non-runtime parameters Signed-off-by: xinyual <[email protected]> * fix Uts Signed-off-by: xinyual <[email protected]> * avoid java doc change Signed-off-by: xinyual <[email protected]> * move validate to commonUtlis Signed-off-by: xinyual <[email protected]> * remove useless function Signed-off-by: xinyual <[email protected]> * change java doc Signed-off-by: xinyual <[email protected]> * fix Document process ut Signed-off-by: xinyual <[email protected]> * fixed token length: re-implement with start and end offset Signed-off-by: yuye-aws <[email protected]> * update exception message Signed-off-by: yuye-aws <[email protected]> * fix document chunking processor IT Signed-off-by: yuye-aws <[email protected]> * bug fix: adjust start, end content position in fixed token length algorithm Signed-off-by: yuye-aws <[email protected]> * update changelog for 2.x release 
Signed-off-by: yuye-aws <[email protected]> * rename processor Signed-off-by: yuye-aws <[email protected]> * update default delimiter to be \n\n Signed-off-by: yuye-aws <[email protected]> * remove change log in 3.0 unreleased Signed-off-by: yuye-aws <[email protected]> * fix IT failure due to chunking processor rename Signed-off-by: yuye-aws <[email protected]> * update javadoc for text chunking processor factory Signed-off-by: yuye-aws <[email protected]> * adjust functions in chunker interface Signed-off-by: yuye-aws <[email protected]> * move algorithm name definition to concrete chunker class Signed-off-by: yuye-aws <[email protected]> * update string formatted message for text chunking processor Signed-off-by: yuye-aws <[email protected]> * update string formatted message for chunker factory Signed-off-by: yuye-aws <[email protected]> * update string formatted message for chunker parameter validator Signed-off-by: yuye-aws <[email protected]> * update java doc for delimiter algorithm Signed-off-by: yuye-aws <[email protected]> * support range double in chunker parameter validator Signed-off-by: yuye-aws <[email protected]> * update string formatted message for fixed token length algorithm Signed-off-by: yuye-aws <[email protected]> * update sneaky throw with text chunking processor it Signed-off-by: yuye-aws <[email protected]> * add word tokenizer restriction for fixed token length algorithm Signed-off-by: yuye-aws <[email protected]> * update error message for multiple algorithms in text chunking processor Signed-off-by: yuye-aws <[email protected]> * add comment in text chunking processor Signed-off-by: yuye-aws <[email protected]> * validate max chunk limit with util parameter class Signed-off-by: yuye-aws <[email protected]> * update comments Signed-off-by: yuye-aws <[email protected]> * update comments Signed-off-by: yuye-aws <[email protected]> * update java doc Signed-off-by: yuye-aws <[email protected]> * update java doc Signed-off-by: yuye-aws 
<[email protected]> * make parameter final Signed-off-by: yuye-aws <[email protected]> * implement a map from chunker name to constuctor function in chunker factory Signed-off-by: yuye-aws <[email protected]> * bug fix in chunker factory Signed-off-by: yuye-aws <[email protected]> * remove get all chunkers in chunker factory Signed-off-by: yuye-aws <[email protected]> * remove type check for parameter check for max token count Signed-off-by: yuye-aws <[email protected]> * remove type check for parameter check for analysis registry Signed-off-by: yuye-aws <[email protected]> * implement parser and validator Signed-off-by: yuye-aws <[email protected]> * update comment Signed-off-by: yuye-aws <[email protected]> * provide fixed token length as the default algorithm Signed-off-by: yuye-aws <[email protected]> * adjust exception message Signed-off-by: yuye-aws <[email protected]> * adjust exception message Signed-off-by: yuye-aws <[email protected]> * use object nonnull and require nonnull Signed-off-by: yuye-aws <[email protected]> * apply final to ingest document and chunk count Signed-off-by: yuye-aws <[email protected]> * merge parameter validator into the parser Signed-off-by: yuye-aws <[email protected]> * assign positive default value for max chunk limit Signed-off-by: yuye-aws <[email protected]> * validate supported chunker algorithm in text chunking processor Signed-off-by: yuye-aws <[email protected]> * update parameter setting of max chunk limit Signed-off-by: yuye-aws <[email protected]> * add unit test with non list of string Signed-off-by: yuye-aws <[email protected]> * add unit test with null input Signed-off-by: yuye-aws <[email protected]> * add unit test for tokenization excpetion in fixed token length algorithm Signed-off-by: yuye-aws <[email protected]> * tune method name in text chunking processor unit test Signed-off-by: yuye-aws <[email protected]> * tune method name in delimiter algorithm unit test Signed-off-by: yuye-aws <[email protected]> * 
add unit test for overlap rate too small in fixed token length algorithm Signed-off-by: yuye-aws <[email protected]> * tune method modifier for all classes Signed-off-by: yuye-aws <[email protected]> * tune code Signed-off-by: yuye-aws <[email protected]> * tune code Signed-off-by: yuye-aws <[email protected]> * tune exception type in parameter parser Signed-off-by: yuye-aws <[email protected]> * tune comment Signed-off-by: yuye-aws <[email protected]> * tune comment Signed-off-by: yuye-aws <[email protected]> * include max chunk limit in both algorithms Signed-off-by: yuye-aws <[email protected]> * tune comment Signed-off-by: yuye-aws <[email protected]> * allow 0 for max chunk limit Signed-off-by: yuye-aws <[email protected]> * update runtime max chunk limit in text chunking processor Signed-off-by: yuye-aws <[email protected]> * tune code for chunker Signed-off-by: yuye-aws <[email protected]> * implement test for multiple field max chunk limit exceed Signed-off-by: yuye-aws <[email protected]> * tune methods name in text chunking proceesor unit tests Signed-off-by: yuye-aws <[email protected]> * add unit tests for both algorithms with max chunk limit Signed-off-by: yuye-aws <[email protected]> * optimize code Signed-off-by: yuye-aws <[email protected]> * extract max chunk limit check to util class Signed-off-by: yuye-aws <[email protected]> * resolve code review comments Signed-off-by: yuye-aws <[email protected]> * fix unit tests Signed-off-by: yuye-aws <[email protected]> * bug fix: only update runtime max chunk limit when enabled Signed-off-by: yuye-aws <[email protected]> --------- Signed-off-by: yuye-aws <[email protected]> Signed-off-by: xinyual <[email protected]> Signed-off-by: zane-neo <[email protected]> Signed-off-by: Yuye Zhu <[email protected]> Signed-off-by: Lu <[email protected]> Co-authored-by: xinyual <[email protected]> Co-authored-by: zane-neo <[email protected]> Co-authored-by: Lu <[email protected]> (cherry picked from commit eea53aa)

…en length and delimiter algorithm (#644) * feat: implement text chunking processor with fixed token length and delimiter algorithm (#607) * implement chunking processor and fixed token length Signed-off-by: yuye-aws <[email protected]> * initialize node client for document chunking processor Signed-off-by: yuye-aws <[email protected]> * initialize document chunking processor with analysis registry Signed-off-by: yuye-aws <[email protected]> * chunker factory create with analysis registry Signed-off-by: yuye-aws <[email protected]> * implement tokenizer in fixed token length algorithm with analysis registry Signed-off-by: yuye-aws <[email protected]> * add max token count parsing logic Signed-off-by: yuye-aws <[email protected]> * bug fix for non-existing index Signed-off-by: yuye-aws <[email protected]> * change error log Signed-off-by: yuye-aws <[email protected]> * implement evenly chunk Signed-off-by: yuye-aws <[email protected]> * unit tests for chunker factory Signed-off-by: yuye-aws <[email protected]> * unit tests for chunker factory Signed-off-by: yuye-aws <[email protected]> * add error message for chunker factory tests Signed-off-by: yuye-aws <[email protected]> * resolve comments Signed-off-by: yuye-aws <[email protected]> * Revert "implement evenly chunk" This reverts commit 93dd2f4. 
Signed-off-by: yuye-aws <[email protected]> * add default value logic back Signed-off-by: yuye-aws <[email protected]> * implement unit test for fixed token length chunker Signed-off-by: yuye-aws <[email protected]> * add test cases in unit test for fixed token length chunker Signed-off-by: yuye-aws <[email protected]> * support map type as an input Signed-off-by: yuye-aws <[email protected]> * support map type as an input Signed-off-by: yuye-aws <[email protected]> * bug fix for map type Signed-off-by: yuye-aws <[email protected]> * bug fix for map type Signed-off-by: yuye-aws <[email protected]> * bug fix for map type in document chunking processor Signed-off-by: yuye-aws <[email protected]> * remove system out println Signed-off-by: yuye-aws <[email protected]> * add delimiter chunker Signed-off-by: xinyual <[email protected]> Signed-off-by: yuye-aws <[email protected]> * add UT for delimiter chunker Signed-off-by: xinyual <[email protected]> Signed-off-by: yuye-aws <[email protected]> * add delimiter chunker processor Signed-off-by: xinyual <[email protected]> Signed-off-by: yuye-aws <[email protected]> * add more UTs Signed-off-by: xinyual <[email protected]> Signed-off-by: yuye-aws <[email protected]> * add more UTs Signed-off-by: xinyual <[email protected]> Signed-off-by: yuye-aws <[email protected]> * basic unit tests for document chunking processor Signed-off-by: yuye-aws <[email protected]> * fix tests for getProcessors in neural search Signed-off-by: yuye-aws <[email protected]> * add unit tests with string, map and nested map type for document chunking processor Signed-off-by: yuye-aws <[email protected]> * add unit tests for parameter valdiation in document chunking processor Signed-off-by: yuye-aws <[email protected]> * add back deleted xml file Signed-off-by: yuye-aws <[email protected]> * restore xml file Signed-off-by: yuye-aws <[email protected]> * integration tests for document chunking processor Signed-off-by: yuye-aws <[email protected]> * add 
back Run_Neural_Search.xml Signed-off-by: yuye-aws <[email protected]> * restore Run_Neural_Search.xml Signed-off-by: yuye-aws <[email protected]> * add changelog Signed-off-by: yuye-aws <[email protected]> * update integration test for cascade processor Signed-off-by: yuye-aws <[email protected]> * add max chunk limit Signed-off-by: xinyual <[email protected]> Signed-off-by: yuye-aws <[email protected]> * remove useless and apply spotless Signed-off-by: xinyual <[email protected]> Signed-off-by: yuye-aws <[email protected]> * update error message Signed-off-by: yuye-aws <[email protected]> * change field UT Signed-off-by: xinyual <[email protected]> Signed-off-by: yuye-aws <[email protected]> * remove useless and apply spotless Signed-off-by: xinyual <[email protected]> Signed-off-by: yuye-aws <[email protected]> * change logic of max chunk number Signed-off-by: xinyual <[email protected]> Signed-off-by: yuye-aws <[email protected]> * add max chunk limit into fixed token length algorithm Signed-off-by: yuye-aws <[email protected]> * Support list<list<string>> type in embedding and extract validation logic to common class Signed-off-by: zane-neo <[email protected]> Signed-off-by: yuye-aws <[email protected]> * fix unit tests for inference processor Signed-off-by: yuye-aws <[email protected]> * implement unit tests for unit tests with max_chunk_limit in fixed token length Signed-off-by: yuye-aws <[email protected]> * constructor for inference processor Signed-off-by: yuye-aws <[email protected]> * use inference processor Signed-off-by: xinyual <[email protected]> Signed-off-by: yuye-aws <[email protected]> * draft code for extending inference processor with document chunking processor Signed-off-by: yuye-aws <[email protected]> * api refactor for document chunking processor Signed-off-by: yuye-aws <[email protected]> * remove nested list key for chunking processor Signed-off-by: yuye-aws <[email protected]> * remove unused function Signed-off-by: yuye-aws <[email 
protected]> * remove processor validator Signed-off-by: yuye-aws <[email protected]> * remove processor validator Signed-off-by: yuye-aws <[email protected]> * Revert InferenceProcessor.java Signed-off-by: Yuye Zhu <[email protected]> Signed-off-by: yuye-aws <[email protected]> * revert changes in text embedding and sparse encoding processor Signed-off-by: yuye-aws <[email protected]> * implement chunk with map in document chunking processor Signed-off-by: yuye-aws <[email protected]> * add default delimiter value Signed-off-by: Lu <[email protected]> Signed-off-by: yuye-aws <[email protected]> * implement max chunk logic in document chunking processor Signed-off-by: yuye-aws <[email protected]> * add initial value for max chunk limit in document chunking processor Signed-off-by: yuye-aws <[email protected]> * bug fix in chunking processor: allow 0 max_chunk_limit Signed-off-by: yuye-aws <[email protected]> * implement overlap rate with big decimal Signed-off-by: yuye-aws <[email protected]> * update max chunk limit in delimiter Signed-off-by: yuye-aws <[email protected]> * update parameter setting for fixed token length algorithm Signed-off-by: yuye-aws <[email protected]> * update max chunk limit implementation in chunking processor Signed-off-by: yuye-aws <[email protected]> * fix unit tests for fixed token length algorithm Signed-off-by: yuye-aws <[email protected]> * spotless apply for document chunking processor Signed-off-by: yuye-aws <[email protected]> * initialize current chunk count Signed-off-by: yuye-aws <[email protected]> * parameter validation for max chunk limit Signed-off-by: yuye-aws <[email protected]> * fix integration tests Signed-off-by: yuye-aws <[email protected]> * fix current UT Signed-off-by: xinyual <[email protected]> Signed-off-by: yuye-aws <[email protected]> * change delimiter UT Signed-off-by: xinyual <[email protected]> Signed-off-by: yuye-aws <[email protected]> * remove delimiter useless code Signed-off-by: xinyual <[email 
protected]> Signed-off-by: yuye-aws <[email protected]> * add more UT Signed-off-by: xinyual <[email protected]> Signed-off-by: yuye-aws <[email protected]> * add UT for list inside map Signed-off-by: xinyual <[email protected]> Signed-off-by: yuye-aws <[email protected]> * add UT for list inside map Signed-off-by: xinyual <[email protected]> Signed-off-by: yuye-aws <[email protected]> * update unit tests for chunking processor Signed-off-by: yuye-aws <[email protected]> * add more unit tests for chunking processor Signed-off-by: yuye-aws <[email protected]> * resolve code review comments Signed-off-by: yuye-aws <[email protected]> * add java doc Signed-off-by: yuye-aws <[email protected]> * update java doc Signed-off-by: yuye-aws <[email protected]> * update java doc Signed-off-by: yuye-aws <[email protected]> * fix import order Signed-off-by: yuye-aws <[email protected]> * update java doc Signed-off-by: yuye-aws <[email protected]> * fix java doc error Signed-off-by: yuye-aws <[email protected]> * fix update ut for fixed token length chunker Signed-off-by: yuye-aws <[email protected]> * resolve code review comments Signed-off-by: yuye-aws <[email protected]> * resolve code review comments Signed-off-by: yuye-aws <[email protected]> * resolve code review comments Signed-off-by: yuye-aws <[email protected]> * resolve code review comments Signed-off-by: yuye-aws <[email protected]> * implement chunk count wrapper for max chunk limit Signed-off-by: yuye-aws <[email protected]> * rename variable end to nextDelimiterPosition Signed-off-by: yuye-aws <[email protected]> * adjust method place Signed-off-by: yuye-aws <[email protected]> * update java doc for fixed token length algorithm Signed-off-by: yuye-aws <[email protected]> * reanme interface name and fixed token length algorithm name Signed-off-by: yuye-aws <[email protected]> * update fixed token length algorithm configuration for integration tests Signed-off-by: yuye-aws <[email protected]> * make delimiter member 
variables static Signed-off-by: yuye-aws <[email protected]> * remove redundant set field value in execute method Signed-off-by: yuye-aws <[email protected]> * resolve code review comments Signed-off-by: yuye-aws <[email protected]> * add integration tests with more tokenizers Signed-off-by: yuye-aws <[email protected]> * bug fix: unit test failure due to invalid tokenizer Signed-off-by: yuye-aws <[email protected]> * bug fix: token concatenation in fixed token length algorithm Signed-off-by: yuye-aws <[email protected]> * update chunker interface Signed-off-by: yuye-aws <[email protected]> * track chunkCount within function Signed-off-by: yuye-aws <[email protected]> * bug fix: allow white space as the delimiter Signed-off-by: yuye-aws <[email protected]> * fix fixed length chunker Signed-off-by: xinyual <[email protected]> * fix delimiter chunker Signed-off-by: xinyual <[email protected]> * fix chunker factory Signed-off-by: xinyual <[email protected]> * fix UTs Signed-off-by: xinyual <[email protected]> * fix UT and chunker factory Signed-off-by: xinyual <[email protected]> * move analysis_registry to non-runtime parameters Signed-off-by: xinyual <[email protected]> * fix Uts Signed-off-by: xinyual <[email protected]> * avoid java doc change Signed-off-by: xinyual <[email protected]> * move validate to commonUtlis Signed-off-by: xinyual <[email protected]> * remove useless function Signed-off-by: xinyual <[email protected]> * change java doc Signed-off-by: xinyual <[email protected]> * fix Document process ut Signed-off-by: xinyual <[email protected]> * fixed token length: re-implement with start and end offset Signed-off-by: yuye-aws <[email protected]> * update exception message Signed-off-by: yuye-aws <[email protected]> * fix document chunking processor IT Signed-off-by: yuye-aws <[email protected]> * bug fix: adjust start, end content position in fixed token length algorithm Signed-off-by: yuye-aws <[email protected]> * update changelog for 2.x release 
Signed-off-by: yuye-aws <[email protected]> * rename processor Signed-off-by: yuye-aws <[email protected]> * update default delimiter to be \n\n Signed-off-by: yuye-aws <[email protected]> * remove change log in 3.0 unreleased Signed-off-by: yuye-aws <[email protected]> * fix IT failure due to chunking processor rename Signed-off-by: yuye-aws <[email protected]> * update javadoc for text chunking processor factory Signed-off-by: yuye-aws <[email protected]> * adjust functions in chunker interface Signed-off-by: yuye-aws <[email protected]> * move algorithm name definition to concrete chunker class Signed-off-by: yuye-aws <[email protected]> * update string formatted message for text chunking processor Signed-off-by: yuye-aws <[email protected]> * update string formatted message for chunker factory Signed-off-by: yuye-aws <[email protected]> * update string formatted message for chunker parameter validator Signed-off-by: yuye-aws <[email protected]> * update java doc for delimiter algorithm Signed-off-by: yuye-aws <[email protected]> * support range double in chunker parameter validator Signed-off-by: yuye-aws <[email protected]> * update string formatted message for fixed token length algorithm Signed-off-by: yuye-aws <[email protected]> * update sneaky throw with text chunking processor it Signed-off-by: yuye-aws <[email protected]> * add word tokenizer restriction for fixed token length algorithm Signed-off-by: yuye-aws <[email protected]> * update error message for multiple algorithms in text chunking processor Signed-off-by: yuye-aws <[email protected]> * add comment in text chunking processor Signed-off-by: yuye-aws <[email protected]> * validate max chunk limit with util parameter class Signed-off-by: yuye-aws <[email protected]> * update comments Signed-off-by: yuye-aws <[email protected]> * update comments Signed-off-by: yuye-aws <[email protected]> * update java doc Signed-off-by: yuye-aws <[email protected]> * update java doc Signed-off-by: yuye-aws 
<[email protected]> * make parameter final Signed-off-by: yuye-aws <[email protected]> * implement a map from chunker name to constuctor function in chunker factory Signed-off-by: yuye-aws <[email protected]> * bug fix in chunker factory Signed-off-by: yuye-aws <[email protected]> * remove get all chunkers in chunker factory Signed-off-by: yuye-aws <[email protected]> * remove type check for parameter check for max token count Signed-off-by: yuye-aws <[email protected]> * remove type check for parameter check for analysis registry Signed-off-by: yuye-aws <[email protected]> * implement parser and validator Signed-off-by: yuye-aws <[email protected]> * update comment Signed-off-by: yuye-aws <[email protected]> * provide fixed token length as the default algorithm Signed-off-by: yuye-aws <[email protected]> * adjust exception message Signed-off-by: yuye-aws <[email protected]> * adjust exception message Signed-off-by: yuye-aws <[email protected]> * use object nonnull and require nonnull Signed-off-by: yuye-aws <[email protected]> * apply final to ingest document and chunk count Signed-off-by: yuye-aws <[email protected]> * merge parameter validator into the parser Signed-off-by: yuye-aws <[email protected]> * assign positive default value for max chunk limit Signed-off-by: yuye-aws <[email protected]> * validate supported chunker algorithm in text chunking processor Signed-off-by: yuye-aws <[email protected]> * update parameter setting of max chunk limit Signed-off-by: yuye-aws <[email protected]> * add unit test with non list of string Signed-off-by: yuye-aws <[email protected]> * add unit test with null input Signed-off-by: yuye-aws <[email protected]> * add unit test for tokenization excpetion in fixed token length algorithm Signed-off-by: yuye-aws <[email protected]> * tune method name in text chunking processor unit test Signed-off-by: yuye-aws <[email protected]> * tune method name in delimiter algorithm unit test Signed-off-by: yuye-aws <[email protected]> * 
add unit test for overlap rate too small in fixed token length algorithm Signed-off-by: yuye-aws <[email protected]> * tune method modifier for all classes Signed-off-by: yuye-aws <[email protected]> * tune code Signed-off-by: yuye-aws <[email protected]> * tune code Signed-off-by: yuye-aws <[email protected]> * tune exception type in parameter parser Signed-off-by: yuye-aws <[email protected]> * tune comment Signed-off-by: yuye-aws <[email protected]> * tune comment Signed-off-by: yuye-aws <[email protected]> * include max chunk limit in both algorithms Signed-off-by: yuye-aws <[email protected]> * tune comment Signed-off-by: yuye-aws <[email protected]> * allow 0 for max chunk limit Signed-off-by: yuye-aws <[email protected]> * update runtime max chunk limit in text chunking processor Signed-off-by: yuye-aws <[email protected]> * tune code for chunker Signed-off-by: yuye-aws <[email protected]> * implement test for multiple field max chunk limit exceed Signed-off-by: yuye-aws <[email protected]> * tune methods name in text chunking proceesor unit tests Signed-off-by: yuye-aws <[email protected]> * add unit tests for both algorithms with max chunk limit Signed-off-by: yuye-aws <[email protected]> * optimize code Signed-off-by: yuye-aws <[email protected]> * extract max chunk limit check to util class Signed-off-by: yuye-aws <[email protected]> * resolve code review comments Signed-off-by: yuye-aws <[email protected]> * fix unit tests Signed-off-by: yuye-aws <[email protected]> * bug fix: only update runtime max chunk limit when enabled Signed-off-by: yuye-aws <[email protected]> --------- Signed-off-by: yuye-aws <[email protected]> Signed-off-by: xinyual <[email protected]> Signed-off-by: zane-neo <[email protected]> Signed-off-by: Yuye Zhu <[email protected]> Signed-off-by: Lu <[email protected]> Co-authored-by: xinyual <[email protected]> Co-authored-by: zane-neo <[email protected]> Co-authored-by: Lu <[email protected]> (cherry picked from commit eea53aa) * bug 
fix: fix compile error in integration test (#645) Signed-off-by: yuye-aws <[email protected]> --------- Signed-off-by: yuye-aws <[email protected]> Co-authored-by: Yuye Zhu <[email protected]>

gaobinlong · 2024-04-26T10:51:20Z

src/main/java/org/opensearch/neuralsearch/processor/TextChunkingProcessor.java

+                // chunk the object when target key is of leaf type (null, string and list of string)
+                Object chunkObject = sourceAndMetadataMap.get(originalKey);
+                List<String> chunkedResult = chunkLeafType(chunkObject, runtimeParameters);
+                sourceAndMetadataMap.put(String.valueOf(targetKey), chunkedResult);


sourceAndMetadataMap contains metadata fields such as _index, _routing and _id. If targetKey equals the name of one of these metadata fields, the put call above overwrites that field with the chunked result, which may cause unexpected behavior.

A simple solution is to prohibit any targetKey that starts with "_".

Let me check the behavior of other ingestion processors.
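
The check suggested in this thread could look roughly like the sketch below. It is only an illustration, not code from this PR: the class name `TargetKeyValidator`, the method `validateTargetKeys`, and the error message are hypothetical. It also assumes that field_map values are either output field names or nested maps.

```java
import java.util.Locale;
import java.util.Map;

// Hypothetical helper, not part of the PR: rejects output field names that
// could collide with metadata fields such as _index, _routing and _id.
final class TargetKeyValidator {

    private TargetKeyValidator() {}

    // Walks the field_map and throws if any target key starts with "_".
    // Assumption: a value is either a target field name or a nested map.
    static void validateTargetKeys(Map<?, ?> fieldMap) {
        for (Object target : fieldMap.values()) {
            if (target instanceof Map) {
                // nested mapping: check its target keys recursively
                validateTargetKeys((Map<?, ?>) target);
            } else if (String.valueOf(target).startsWith("_")) {
                throw new IllegalArgumentException(
                    String.format(Locale.ROOT, "target field [%s] must not start with '_'", target)
                );
            }
        }
    }

    public static void main(String[] args) {
        // "body_chunk" does not start with "_", so this mapping passes
        validateTargetKeys(Map.of("body", "body_chunk"));
        System.out.println("field_map accepted");
    }
}
```

Running such a check when the processor is created, rather than per document, would surface a bad field_map at pipeline definition time. Whether that matches the behavior of other ingest processors is the open question in this thread.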

yuye-aws requested review from heemin32, navneet1v, VijayanB, vamshin, jmazanec15, naveentatikonda, junqiu-lei, martin-gaievski, sean-zheng-amazon, model-collapse, zane-neo, ylwu-amzn, jngz-es and vibrantvarun as code owners February 18, 2024 12:31

yuye-aws marked this pull request as draft February 18, 2024 12:32

zane-neo reviewed Feb 19, 2024

src/main/java/org/opensearch/neuralsearch/processor/chunker/FixedTokenLengthChunker.java (outdated, resolved)

zane-neo reviewed Feb 19, 2024

src/main/java/org/opensearch/neuralsearch/processor/chunker/FixedTokenLengthChunker.java (outdated, resolved)

zane-neo reviewed Feb 19, 2024

src/main/java/org/opensearch/neuralsearch/processor/chunker/FixedTokenLengthChunker.java (outdated, resolved)

zane-neo reviewed Feb 19, 2024

src/main/java/org/opensearch/neuralsearch/processor/chunker/FixedTokenLengthChunker.java (resolved)

yuye-aws force-pushed the feature/documentChunkingProcessor branch from 30fd0eb to 57a4a20 Compare February 22, 2024 15:21

yuye-aws requested a review from zane-neo February 22, 2024 15:25

sam-herman suggested changes Feb 23, 2024

yuye-aws mentioned this pull request Feb 26, 2024

[META] Chunking and querying of long passages for vector search #612

Closed

zane-neo reviewed Feb 26, 2024

src/main/java/org/opensearch/neuralsearch/processor/DocumentChunkingProcessor.java (outdated, resolved)

zane-neo reviewed Feb 26, 2024

src/main/java/org/opensearch/neuralsearch/processor/DocumentChunkingProcessor.java (outdated, resolved)

zane-neo reviewed Feb 26, 2024

src/main/java/org/opensearch/neuralsearch/processor/DocumentChunkingProcessor.java (outdated, resolved)

yuye-aws added 12 commits March 15, 2024 16:36

tune exception type in parameter parser

63bbae9

Signed-off-by: yuye-aws <[email protected]>

tune comment

aaee028

Signed-off-by: yuye-aws <[email protected]>

tune comment

ab2a151

Signed-off-by: yuye-aws <[email protected]>

include max chunk limit in both algorithms

1eb12aa

Signed-off-by: yuye-aws <[email protected]>

tune comment

40991a3

Signed-off-by: yuye-aws <[email protected]>

allow 0 for max chunk limit

ea4bbb8

Signed-off-by: yuye-aws <[email protected]>

update runtime max chunk limit in text chunking processor

f0dfb57

Signed-off-by: yuye-aws <[email protected]>

tune code for chunker

cb4b39b

Signed-off-by: yuye-aws <[email protected]>

implement test for multiple field max chunk limit exceed

98dd886

Signed-off-by: yuye-aws <[email protected]>

tune methods name in text chunking proceesor unit tests

d245a04

Signed-off-by: yuye-aws <[email protected]>

add unit tests for both algorithms with max chunk limit

ad7ba25

Signed-off-by: yuye-aws <[email protected]>

optimize code

9702168

Signed-off-by: yuye-aws <[email protected]>

yuye-aws requested review from navneet1v and zane-neo March 16, 2024 01:26

yuye-aws added 4 commits March 17, 2024 13:05

extract max chunk limit check to util class

3d8c030

Signed-off-by: yuye-aws <[email protected]>

resolve code review comments

9931fae

Signed-off-by: yuye-aws <[email protected]>

fix unit tests

fb6a961

Signed-off-by: yuye-aws <[email protected]>

bug fix: only update runtime max chunk limit when enabled

68fef4f

Signed-off-by: yuye-aws <[email protected]>

zane-neo approved these changes Mar 18, 2024

model-collapse approved these changes Mar 18, 2024

model-collapse merged commit eea53aa into opensearch-project:main Mar 18, 2024
60 checks passed

model-collapse added the backport 2.x label Mar 18, 2024

model-collapse assigned yuye-aws Mar 18, 2024

opensearch-trigger-bot bot mentioned this pull request Mar 18, 2024

[Backport 2.x] feat: implement text chunking processor with fixed token length and delimiter algorithm #644

Merged

vibrantvarun mentioned this pull request Mar 18, 2024

[Infrastructure] BWC tests for Chunking Processor #647

Closed

yuye-aws deleted the feature/documentChunkingProcessor branch March 26, 2024 02:19

yuye-aws mentioned this pull request Apr 2, 2024

Test: bwc test for text chunking processor #661

Merged

5 tasks

gaobinlong reviewed Apr 26, 2024

		throw new IllegalStateException(
		String.format(Locale.ROOT, "%s algorithm encounters exception in tokenization: %s", ALGORITHM_NAME, e.getMessage()),
