Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: implement text chunking processor with fixed token length and delimiter algorithm #607
feat: implement text chunking processor with fixed token length and delimiter algorithm #607
Changes from 176 commits
cbc5423
3e2d365
89584a9
596fbf7
636f907
2ffd6b0
2195353
bdd418e
458420b
02420d7
f8f60a1
ff0587c
afc3189
159e426
2405952
b930222
ecb8297
d6d31fa
fafae93
d23c1fb
39c6162
5714d1e
2f23c30
41cff0c
b0fda97
b16e7c4
11e6a4b
81000f3
f3b468f
eea6fc8
ec6bf49
3ae94e4
c8dc66c
1e1ce1b
b425122
31bf921
11d8f53
0662278
5e75e04
962ed32
9487de5
04043ca
08bf2d1
c7cc59f
0721f7a
d2bc576
120fae8
0af3024
e69bbe1
f21f40f
4babd4d
24f4980
0b4036a
9ff6645
0e464fe
d6b68ed
a7a9260
39e8df5
2ee1923
ca534ab
eedd58d
b9bf3ef
2ac2f60
6067044
98d1ab3
79a637c
6da6395
105d4a0
cd4eda7
ceaa7d2
715c145
75663e1
2e5dc00
d711390
98124ee
353e88e
de554e6
2453a79
5f00107
fc94955
388fd43
bb35c79
b4d5fda
453dd35
8c8fbaf
b588983
23dd769
3ad78da
abb9bde
82aa219
3158e28
38d6e60
584bc59
cbea5df
c3c8ff2
da055e7
d32840c
830f665
1275bd6
4e2f5d4
5c20b9b
addd37e
7469153
ad00b88
d4673d4
7a589c6
e1f6c79
bb372e6
2538ab3
9c9172d
d05b246
04fc7d3
7fe93c0
cefb0a6
16038af
d1d88dc
eb439bd
bc7f70c
bb941cd
77d4101
92f587f
94b1967
98944d1
8799fd0
5cda870
c942b17
6461b32
2a0a879
fbb4edb
050f163
e61f295
4f87008
c651b3e
0f45782
3d1b792
5600b36
3d962ca
42de900
6d4fe8c
e666f17
958cc3b
183e928
09fccc1
8ad1e51
489fe7b
d67880e
666e7b9
9161c93
0f9c140
a574980
87679ad
08dcd19
f16882d
a969a60
34348b3
4153988
06ca1c7
3cf671d
5fe5eef
f3decb4
3b8a3af
89c465c
e7dffe0
463de71
0a04012
a524954
10f6568
3f41f37
e4bdabc
9e37171
18ba1b1
2ce9840
2aea7a5
63bbae9
aaee028
ab2a151
1eb12aa
40991a3
ea4bbb8
f0dfb57
cb4b39b
98dd886
d245a04
ad7ba25
9702168
3d8c030
9931fae
fb6a961
68fef4f
File filter
Filter by extension
Conversations
Jump to
There are no files selected for viewing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we add check on algorithmKey here instead of check/get null constructor in the ChunkerFactory?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Already updated.
Check warning on line 164 in src/main/java/org/opensearch/neuralsearch/processor/TextChunkingProcessor.java
Codecov / codecov/patch
src/main/java/org/opensearch/neuralsearch/processor/TextChunkingProcessor.java#L162-L164
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this
else
is not needed, we can just do assignmentThere was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for not getting your point. This
else
here can make the processor more efficiently. When executing on a document, the processor either get the environment setting or the index setting.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I mean it can be something like:
not a blocker though, but if you're going to push more commits please address this one too
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the sample code. This part will be updated according to your comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this check of chunker type damages the abstraction you've built with the CunckerFactory. Why can't you set this parameters in all cases and then decide on how to use it later, in a chunker implementation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm afraid there would be some parameter conflict from other chunking algorithms. I need to recover this check.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, I guess for say delimiter checking algo it may erase correct value of max token count. can we pass source and map data to a factory so this
if
became part of the factory?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMO, the factory should be an independent component, which is only responsible for creating an instance of chunking algorithm and return the instance to the text chunking processor. That's why we are validating parameters in the text chunking processor,
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sourceAndMetadataMap
contains some metadata fields such as_index
,_routing
and_id
, if thetargetKey
equals the name of the metadata field, may cause accident.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A simple solution is to prohibiting targetKey starting with "_".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let me check the behavior of other ingestion processors.