
2.25.0.0-b56

@jaki tagged this 28 Sep 07:12
Summary:
At a high level, INSERT ON CONFLICT works as follows:

- For each value:
  - For each index:
    - If the value being inserted conflicts with a value in the index,
      run the ON CONFLICT part (either DO NOTHING or DO UPDATE).  Move
      on to the next value.
    - Else, continue.
  - Since no index conflicts, INSERT normally.  (In upstream PG, if the
    INSERT fails due to concurrent changes, it retries; YB does not have
    that logic yet.)

This performs poorly on YB because, for each value, it does index reads
and then a write (unless DO NOTHING is hit).  The alternating reads and
writes prevent buffering of write requests, so a lot of back-and-forth
RPCs are made.
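
For reference, the two ON CONFLICT forms discussed above look like the
following (the table and column names are illustrative only):

    CREATE TABLE t (k int PRIMARY KEY, v text);

    -- DO NOTHING: conflicting rows are silently skipped.
    INSERT INTO t VALUES (1, 'a'), (2, 'b')
        ON CONFLICT (k) DO NOTHING;

    -- DO UPDATE: a conflicting row updates the existing row instead.
    INSERT INTO t VALUES (1, 'a2'), (3, 'c')
        ON CONFLICT (k) DO UPDATE SET v = EXCLUDED.v;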

Solve this by batching the index reads during INSERT ON CONFLICT so that
writes can buffer up.  Add a GUC yb_insert_on_conflict_read_batch_size
to control how many rows to buffer for each table.  The default of 1
disables batching.  Note that for partitioned tables, this is the batch
size for each partition.
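
For example, assuming the GUC is settable at the session level, batching
could be toggled like this (the value 1024 is arbitrary):

    -- Buffer up to 1024 rows of index reads per table (per partition for
    -- partitioned tables) during INSERT ... ON CONFLICT.
    SET yb_insert_on_conflict_read_batch_size = 1024;

    -- Back to the default of 1, which disables the batching.
    SET yb_insert_on_conflict_read_batch_size = 1;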

Largely borrow from upstream PG's foreign table INSERT batching
implementation.

The flow goes as follows:

- For each value:
  - If batch size is reached, trigger batch flush
  - Store slot into in-memory list resultRelInfo->ri_Slots and similar
- Trigger batch flush for the remaining slots

Batch flush goes as follows:

- For each index:
  - Read RPC to get all values matching the slots in this batch.  Store
    into in-memory map resultRelInfo->ri_YbConflictMaps[i].
- For each slot:
  - For each index:
    - If this slot matches something in the map, run the ON CONFLICT
      part.  Move on to the next slot.
    - Else, continue.
  - Since no index conflicts, INSERT normally.
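
Conceptually (this is an illustration, not the actual implementation),
the per-index read that populates resultRelInfo->ri_YbConflictMaps[i]
behaves like a single lookup covering every key in the batch rather than
one point read per inserted row, roughly:

    -- Hypothetical sketch of what one batched read accomplishes for a
    -- unique index on the illustrative table t(k): one request covering
    -- the whole batch of keys instead of a separate read per row.
    SELECT k, v FROM t WHERE k IN (1, 2, 3, 4);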

The map needs to be updated on ON CONFLICT DO UPDATE or normal INSERT
cases.  This involves changes to ExecInsertIndexTuples and
ExecDeleteIndexTuples, particularly to support the map updates for
primary key indexes.  Also, the map tracks rows that were just inserted
so that a double-insert error can be thrown, similar to upstream PG.
This is only done when the duplicate is detected within the same batch.
Otherwise, the behavior matches the non-batched YB behavior of silently
succeeding.

This feature is currently disabled in the following cases:

- non-YB relations
- catalog relations
- row triggers
- RETURNING clause
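
For instance, per the list above, a statement like the following (again
using the illustrative table t) falls back to the non-batched path
because of its RETURNING clause:

    -- Batched index reads are not used here: RETURNING disables the feature.
    INSERT INTO t VALUES (1, 'a')
        ON CONFLICT (k) DO UPDATE SET v = EXCLUDED.v
        RETURNING k, v;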

Detailed flow:

- ExecModifyTable
  - for (;;)
    - ExecProcNode (get a slot from input)
    - ExecInsert
      - switch to this slot's child resultRelInfo (for partitioned
        tables)
      - calculate generated columns
      - check permissions
      - if there's an ON CONFLICT clause
        - YbAddSlotToBatch
          - if batch is full
            - YbFlushSlotsFromBatch
          - add slot to ri_Slots, ri_PlanSlots
  - ExecPendingInserts
    - for each
      es_insert_pending_result_relations/es_insert_pending_modifytables
      - YbFlushSlotsFromBatch

YbFlushSlotsFromBatch, which is called from the places above, goes as follows:

- YbFlushSlotsFromBatch
  - if we just entered flushing mode
    - YbBatchFetchConflictingRows
      - ExecCheckIndexConstraints
        - for each index
          - if the index is not applicable (e.g. invalid, not part of
            arbiterIndexes)
            - continue
          - yb_batch_fetch_conflicting_rows
            - build map resultRelInfo->ri_YbConflictMaps[i]
  - while there are still slots to flush
    - YbExecCheckIndexConstraints
      - for each index
        - lookup map resultRelInfo->ri_YbConflictMaps[i]
        - if no match
          - continue
        - if match with just-inserted row
          - error
        - if match with existing row
          - return that there's a conflict
    - if the above check says there's a conflict
      - if DO UPDATE
        - ExecOnConflictUpdate
          - ExecUpdate
            - YBExecUpdateAct
              - (ExecCrossPartitionUpdate is disallowed)
              - YBCExecuteUpdateReplace/YBCExecuteUpdate
            - ExecUpdateEpilogue
              - ExecDeleteIndexTuples
                - for each index
                  - if the index is not applicable
                    - continue
                  - yb_index_delete (except PK index)
                  - update map resultRelInfo->ri_YbConflictMaps[i]
              - ExecInsertIndexTuples
                - for each index
                  - if the index is not applicable
                    - continue
                  - update map resultRelInfo->ri_YbConflictMaps[i]
                  - index_insert (except PK index)
              - AR triggers
      - else (DO NOTHING)
        - (nothing)
      - continue
    - YBCHeapInsert
    - ExecInsertIndexTuples
      - for each index
        - if the index is not applicable
          - continue
        - update map resultRelInfo->ri_YbConflictMaps[i]
        - index_insert (except PK index)
    - AR triggers
  - exit flushing mode
  - destroy all maps resultRelInfo->ri_YbConflictMaps

There are some behavior differences with batching enabled.  Within a
batch, when two rows map to the same key, we follow the PG semantics of
throwing an error.  Across batches, we follow the YB semantics of
silently applying both changes.  Moreover, for WITH statements that
modify the same table both inside and outside the WITH clause, the ON
CONFLICT decisions can vary depending on the batch size (see the regress
tests).
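
A sketch of the difference, again using the illustrative table t and
assuming the GUC can be set per session and that rows are batched in
statement order:

    SET yb_insert_on_conflict_read_batch_size = 1024;

    -- Both rows map to key 10 and land in the same batch: per the above,
    -- PG semantics apply and an error is raised.
    INSERT INTO t VALUES (10, 'x'), (10, 'y')
        ON CONFLICT (k) DO UPDATE SET v = EXCLUDED.v;

    SET yb_insert_on_conflict_read_batch_size = 2;

    -- With a batch size of 2, the two rows with key 10 fall into
    -- different batches: per the above, YB semantics apply and the
    -- statement succeeds silently.
    INSERT INTO t VALUES (10, 'x'), (11, 'y'), (10, 'z')
        ON CONFLICT (k) DO UPDATE SET v = EXCLUDED.v;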

This was originally authored on the pg15 branch, so some of the
dependencies there are copied here, not in the cleanest way, but at
least in a way consistent with the pg15 branch.  For example, code that
calls either YbFlushSlotsFromBatch or ExecBatchInsert is still
structured the same way, but ExecBatchInsert is turned into a no-op
since it relates to the FDW batching of PG 15, which we don't need in
master.  Also, ModifyTableContext is partially taken from PG 15, keeping
only the relevant fields, because that reduces differences between this
change and the original one on pg15.

A few other things:

- ExecPendingInserts is taken
- Callers of ExecPendingInserts from before row-trigger-related code are
  not taken because we don't support this batching in that case anyway,
  and not dealing with it here means not having to resolve conflicts.
- ResultRelInfo fields ri_Slot and similar are taken
- EState fields es_insert_pending_result_relations and similar are taken
- forboth and yb_forboth_delete_current are replaced with a while loop
  due to API differences
- Some hash function code is imported partially
- The original change mostly copies the latter half of ExecInsert to
  YbFlushSlotsFromBatch.  Since that same code is slightly different
  between master and pg15, recopy portions.

Note that an existing line

    oldtuple = ExecMaterializeSlot(ybConflictSlot);

was changed to

    oldtuple = ExecCopySlotTuple(ybConflictSlot);

since oldtuple is tied to ybConflictSlot, which is tied to a slot in the
in-memory map: when the map deletes that entry, the slot is dropped,
which frees the tuple, but oldtuple may still be used in other places
afterwards.  pg15 doesn't have this issue since it doesn't use
ExecMaterializeSlot, so just fix it here.
Jira: DB-13064

Test Plan:
On AlmaLinux 8:

    #!/usr/bin/env bash
    set -euo pipefail
    ./yb_build.sh fastdebug --gcc11
    find java/yb-pgsql/src/test/java/org/yb/pgsql -name 'TestPgRegressInsertOnConflict*' \
    | grep -oE 'TestPgRegress\w+' \
    | while read -r testname; do
      ./yb_build.sh fastdebug --gcc11 --java-test "$testname" --sj
    done

Reviewers: amartsinchyk, kramanathan

Reviewed By: amartsinchyk, kramanathan

Subscribers: smishra, yql

Differential Revision: https://phorge.dev.yugabyte.com/D38354