[DRAFT] custom partial upsert row merger #11584

rohityadav1993 · 2023-09-13T17:35:05Z

feature
release-notes

Problem: This PR addresses a feature gap in partial upsert: conditional column merging. With it, a table can be configured with a Groovy script or a custom class implementation that merges the previous and new rows' column values according to user-specified logic.

Solution:

Adds a new row abstraction class, LazyRow, which reads a column value of the previous row from disk only when getValue() is called on the LazyRow, and caches the value once read.
Defines a new interface, PartialUpsertRowMergeEvaluator, whose merge() method takes the previous and new rows and generates the merged columns as a Map<column_name, value>.
Moves creation of PartialUpsertHandler from per table to per partition, to avoid concurrent modification of the LazyRow held by PartialUpsertHandler when multiple partitions are consuming.
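
The lazy-read-and-cache behavior described for LazyRow can be illustrated with a minimal sketch. This is not the code from the PR: `ColumnReader`, `LazyRowSketch` and `init` are hypothetical names, and `ColumnReader` stands in for whatever actually reads a column value for a docId from the segment.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for reading one column value of one document from a segment. */
interface ColumnReader {
  Object read(int docId, String column);
}

/** Sketch of a lazily populated row: a column is read on first access, then served from the cache. */
class LazyRowSketch {
  private final ColumnReader _reader;
  private final Map<String, Object> _fieldToValueMap = new HashMap<>();
  private int _docId;

  LazyRowSketch(ColumnReader reader) {
    _reader = reader;
  }

  /** Points the row at another document and drops the values cached for the previous one. */
  void init(int docId) {
    _docId = docId;
    _fieldToValueMap.clear();
  }

  Object getValue(String column) {
    // containsKey is checked explicitly so that null values are cached too.
    if (!_fieldToValueMap.containsKey(column)) {
      _fieldToValueMap.put(column, _reader.read(_docId, column));
    }
    return _fieldToValueMap.get(column);
  }
}

class LazyRowSketchDemo {
  public static void main(String[] args) {
    int[] reads = {0};
    LazyRowSketch row = new LazyRowSketch((docId, column) -> {
      reads[0]++;
      return column + "@" + docId;
    });
    row.init(7);
    row.getValue("city");
    row.getValue("city"); // second access is served from the cache
    System.out.println("reads = " + reads[0]);
  }
}
```

Because the cache is a plain HashMap that is cleared and refilled per document, an instance like this is only safe when used from one thread at a time.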

TODO:

Add an implementation of PartialUpsertRowMergeEvaluator for the Groovy merger.
Add Groovy row merger configs.
Add a test plan.

Design doc: https://docs.google.com/document/d/1bBTCYZFP2stvzc6xZUOEh-XweVgC9WfD7uiSPbKtaZY/edit

deemoliu · 2023-09-14T00:09:01Z

pinot-segment-local/src/main/java/org/apache/pinot/segment/local/segment/readers/LazyRow.java

+    private int _docId;
+    private GenericRow _row;
+
+    private HashMap<String, Object> _fieldToValueMap = new HashMap<>();


ConcurrentHashMap?

rohityadav1993 · 2023-09-19

At least in the current usage, I don't see a need for a ConcurrentHashMap, since the merge happens per row, per segment, on a single thread. Do you see any usage where we would want to use LazyRow with multiple threads referencing it?
I am following GenericRow here, which also uses a HashMap.

deemoliu · 2023-09-14T00:48:42Z

...ocal/src/main/java/org/apache/pinot/segment/local/upsert/BaseTableUpsertMetadataManager.java

+        if (upsertConfig.getRowMergerCustomImplementation() != null) {
+          // In case of custom merger, the PUH is dependent on LazyRow object to be reused. LazyRow is stateful and
+          // cause concurrent modification issues, hence a new PUH is created per partition
+          return new PartialUpsertHandler(_schema, upsertConfig, _comparisonColumns);


Got it. Currently the partial upsert handler is initialized in the upsertTableManager and shared by the partitionUpsertManagers. Once a LazyRow is passed into the PartialUpsertHandler, the handler becomes stateful, and because multiple partitions can access the same handler, this leads to concurrent modification issues. So we have to avoid sharing the partial upsert handler among partitions.

@Jackie-Jiang do you see any way to improve this approach?
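
The idea of giving each partition its own handler, rather than sharing one stateful instance, can be sketched as follows. All names here (`StatefulHandlerSketch`, `TableManagerSketch`, `handlerFor`) are hypothetical and not taken from Pinot, and the sketch assumes each partition is consumed by a single thread.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical stateful handler: it reuses a scratch buffer, so it must not be shared across threads. */
class StatefulHandlerSketch {
  private final StringBuilder _scratch = new StringBuilder();

  String merge(String previous, String current) {
    _scratch.setLength(0);
    return _scratch.append(previous).append('|').append(current).toString();
  }
}

/** Hypothetical table-level manager that hands out one handler per partition instead of one shared instance. */
class TableManagerSketch {
  private final Map<Integer, StatefulHandlerSketch> _handlers = new ConcurrentHashMap<>();

  StatefulHandlerSketch handlerFor(int partitionId) {
    // computeIfAbsent creates the handler at most once per partition id.
    return _handlers.computeIfAbsent(partitionId, id -> new StatefulHandlerSketch());
  }
}
```

The trade-off is one handler object per partition instead of one per table, in exchange for not having to synchronize access to the handler's internal state.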

...-segment-local/src/main/java/org/apache/pinot/segment/local/upsert/PartialUpsertHandler.java

...java/org/apache/pinot/segment/local/upsert/merger/PartialUpsertRowMergeEvaluatorFactory.java

codecov-commenter · 2023-10-05T22:12:12Z

Codecov Report

Merging #11584 (8951cc9) into master (d1021df) will increase coverage by 0.03%.
Report is 1 commits behind head on master.
The diff coverage is 54.94%.

@@             Coverage Diff              @@
##             master   #11584      +/-   ##
============================================
+ Coverage     63.09%   63.13%   +0.03%     
  Complexity     1117     1117              
============================================
  Files          2342     2344       +2     
  Lines        125900   125986      +86     
  Branches      19362    19375      +13     
============================================
+ Hits          79437    79536      +99     
+ Misses        40817    40790      -27     
- Partials       5646     5660      +14

Flag	Coverage Δ
integration	`<0.01% <0.00%> (ø)`
integration1	`<0.01% <0.00%> (ø)`
integration2	`0.00% <0.00%> (ø)`
java-11	`63.06% <54.94%> (+12.92%)`	⬆️
java-17	`62.97% <54.94%> (+0.01%)`	⬆️
java-20	`62.96% <54.94%> (+0.02%)`	⬆️
temurin	`63.13% <54.94%> (+0.03%)`	⬆️
unittests	`63.12% <54.94%> (+0.03%)`	⬆️
unittests1	`67.25% <54.94%> (+<0.01%)`	⬆️
unittests2	`14.43% <0.00%> (+0.02%)`	⬆️

Flags with carried forward coverage won't be shown.

Files	Coverage Δ
...psert/ConcurrentMapTableUpsertMetadataManager.java	`38.46% <100.00%> (+5.12%)`	⬆️
...rg/apache/pinot/spi/config/table/UpsertConfig.java	`82.69% <33.33%> (-3.03%)`	⬇️
...t/local/upsert/BaseTableUpsertMetadataManager.java	`73.14% <41.66%> (-3.39%)`	⬇️
...not/segment/local/upsert/PartialUpsertHandler.java	`78.12% <73.33%> (-5.66%)`	⬇️
.../merger/PartialUpsertRowMergeEvaluatorFactory.java	`0.00% <0.00%> (ø)`
...e/pinot/segment/local/segment/readers/LazyRow.java	`60.00% <60.00%> (ø)`

... and 15 files with indirect coverage changes

Jackie-Jiang

I don't think this is a clean abstraction. We should not mix value based merger and row based merger in the same class because it will be very hard to maintain, and the sequence of applying the mergers can be very confusing.

PartialUpsertHandler already provides the row-level method void merge(IndexSegment indexSegment, int docId, GenericRow newRecord), but PartialUpsertMerger has the limitation that it can only work on a single value.
So we have 2 options:

Make PartialUpsertHandler pluggable and make the current one the default implementation
Modify PartialUpsertMerger to take row (something like Object merge(IndexSegment indexSegment, int docId, GenericRow newRecord)) and make it pluggable. Since each column can reference the value from other columns, the contract should be to not change the column value before all the mergers are applied

I'm leaning towards the second approach because that allows us to reuse the existing mergers.

Jackie-Jiang · 2023-10-06T22:07:12Z

pinot-segment-local/src/main/java/org/apache/pinot/segment/local/segment/readers/LazyRow.java

+ * There isn't any advantage to have a LazyRow wrap a GenericRow but has been kept for syntactic sugar.
+ */
+public class LazyRow {
+    private IndexSegment _segment;


(code format) Please follow Pinot Style

rohityadav1993 · 2023-10-09T09:04:12Z

Thanks for the feedback @Jackie-Jiang

> We should not mix value based merger and row based merger in the same class because it will be very hard to maintain, and the sequence of applying the mergers can be very confusing.

I agree from the maintenance perspective. There could be a requirement to have both a row merger and column-level mergers for values that were not computed by the row merger; that was the motivation for keeping them in the same class.
In the current implementation, PUH is responsible for running the row merger (if specified), followed by the column-level mergers for the columns that are not part of the row merger's result.

> I'm leaning towards the second approach because that allows us to reuse the existing mergers.

This is a cleaner approach if we can stick to a good contract for merge. Following Approach 2 in the design doc, we could define merge as merge(GenericRow previousRecord, GenericRow newRecord, Map<String, Object> reuseMergerResult) and pass in an instance of LazyRow extends GenericRow.

> Since each column can reference the value from other columns, the contract should be to not change the column value before all the mergers are applied

reuseMergerResult will be used to store the intermediate merger results and avoid modification to newRecord until all mergers are applied. We can ensure that reuseMergerResult is applied to newRecord after column-level mergers are executed.

If we don't want to have Map<String, Object> reuseMergerResult in the merge method then we have to initialise PartialUpsertHandler with reuseMergerResult for each PartitionUpsertMetadataManager instead of TableUpsertMetadataManager to avoid concurrent modification.
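
A minimal sketch of the contract discussed above: every merger writes into a staging map, and the new record is only updated after all mergers have run. Rows are modeled as plain maps here, and `ColumnMergerSketch`, `StagedMergeSketch` and `mergeAll` are hypothetical names, not Pinot APIs.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical column-level merger: reads both rows and writes its output into the shared result map. */
interface ColumnMergerSketch {
  void merge(Map<String, Object> previousRow, Map<String, Object> newRow, Map<String, Object> reuseMergerResult);
}

class StagedMergeSketch {
  /** Runs every merger against the unmodified new row, then copies the staged results over in one step. */
  static void mergeAll(List<ColumnMergerSketch> mergers, Map<String, Object> previousRow,
      Map<String, Object> newRow, Map<String, Object> reuseMergerResult) {
    reuseMergerResult.clear();
    for (ColumnMergerSketch merger : mergers) {
      merger.merge(previousRow, newRow, reuseMergerResult);
    }
    newRow.putAll(reuseMergerResult);
  }

  public static void main(String[] args) {
    Map<String, Object> previousRow = new HashMap<>();
    previousRow.put("clicks", 10);
    Map<String, Object> newRow = new HashMap<>();
    newRow.put("clicks", 5);

    // The first merger sums the column; the second must still see the incoming value 5.
    ColumnMergerSketch sumClicks = (prev, cur, out) ->
        out.put("clicks", (Integer) prev.get("clicks") + (Integer) cur.get("clicks"));
    ColumnMergerSketch recordDelta = (prev, cur, out) -> out.put("delta", cur.get("clicks"));

    mergeAll(Arrays.asList(sumClicks, recordDelta), previousRow, newRow, new HashMap<>());
    System.out.println(newRow.get("clicks") + " " + newRow.get("delta")); // 15 5
  }
}
```

If the first merger had written directly into the new row, the second one would have recorded 15 instead of 5, which is the ordering confusion the staging map avoids.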

Jackie-Jiang · 2023-10-10T17:06:27Z

IIUC, LazyRow is introduced to avoid reading the same value from the segment multiple times. I think it is a good abstraction (I suggest making it a wrapper over IndexSegment and allowing a docId to be set so that the internal fieldToValueMap can be reused; there is no need to allow initializing it with a GenericRow), and we can make the PartialUpsertMerger merge API: void merge(LazyRow previousRow, GenericRow currentRow, Map<String, Object> mergedValues). This way we can even support generating multiple values in one merger.
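
To illustrate how a merger with this shape could generate multiple values in one call, here is a hedged sketch. The previous row is modeled as a simple column lookup and the current row as a map; `RowMergerSketch`, `OrderStatsMergerSketch` and the column names are made up for the example and do not come from Pinot.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical row-level merger mirroring the proposed shape; the previous row is modeled as a column lookup. */
interface RowMergerSketch {
  void merge(Function<String, Object> previousRow, Map<String, Object> currentRow, Map<String, Object> mergedValues);
}

/** Emits two merged columns from a single merge call. */
class OrderStatsMergerSketch implements RowMergerSketch {
  @Override
  public void merge(Function<String, Object> previousRow, Map<String, Object> currentRow,
      Map<String, Object> mergedValues) {
    long previousTotal = (Long) previousRow.apply("totalAmount");
    long previousCount = (Long) previousRow.apply("orderCount");
    long amount = (Long) currentRow.get("amount");
    // Only the output map is written; the current row itself is left untouched.
    mergedValues.put("totalAmount", previousTotal + amount);
    mergedValues.put("orderCount", previousCount + 1);
  }
}
```

A caller would pass an empty map as mergedValues and apply its entries to the current row once the merge has finished.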

rohityadav1993 · 2023-10-10T18:28:50Z

> IIUC, LazyRow is introduced to avoid reading the same value from the segment multiple times. I think it is a good abstraction (I suggest making it a wrapper over IndexSegment and allowing a docId to be set so that the internal fieldToValueMap can be reused; there is no need to allow initializing it with a GenericRow), and we can make the PartialUpsertMerger merge API: void merge(LazyRow previousRow, GenericRow currentRow, Map<String, Object> mergedValues). This way we can even support generating multiple values in one merger.

LazyRow is also used to define a new interface:

public interface PartialUpsertRowMergeEvaluator {
    void evaluate(LazyRow previousRow, LazyRow newRow, Map<String, Object> result);
}

This interface represents the row merge logic, and it requires LazyRow to be able to wrap a GenericRow.
So this contract should change as well, to something similar to void merge(LazyRow previousRow, GenericRow currentRow, Map<String, Object> mergedValues).

Jackie-Jiang · 2023-10-11T23:32:40Z

My suggestion is to not add a new interface, but modify the existing PartialUpsertMerger interface

deemoliu reviewed Sep 14, 2023

Jackie-Jiang added the feature, release-notes, Configuration and upsert labels Sep 17, 2023

rohityadav1993 force-pushed the masterry branch 2 times, most recently from eb129c0 to 6ae06a9 October 3, 2023 07:11

rohityadav1993 marked this pull request as ready for review October 3, 2023 07:50

custom partial upsert row merger

8951cc9

rohityadav1993 force-pushed the masterry branch from 6ae06a9 to 8951cc9 October 5, 2023 19:51

Jackie-Jiang reviewed Oct 6, 2023

rohityadav1993 mentioned this pull request Oct 18, 2023

add LazyRow abstraction for previously indexed record #11826

Merged

rohityadav1993 changed the title ~~custom partial upsert row merger~~ [DRAFT] custom partial upsert row merger Oct 23, 2023

rohityadav1993 mentioned this pull request Mar 9, 2024

pluggable partial upsert merger #11983

Merged

rohityadav1993 closed this Sep 6, 2024
