*: support only deserialize necessary rows #9678

Lloyd-Pottiger · 2024-11-28T07:32:17Z

What problem does this PR solve?

Issue Number: ref #9699

Problem Summary:

What is changed and how it works?

Check List

Tests

Unit test
Integration test
Manual test (add detailed scripts or steps below)
No code

Side effects

Performance regression: Consumes more CPU
Performance regression: Consumes more Memory
Breaking backward compatibility

Documentation

Release note

None

ti-chi-bot · 2024-11-28T07:32:22Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from lloyd-pottiger, ensuring that each of them provides their approval before proceeding. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

JaySon-Huang · 2024-12-04T07:27:24Z

dbms/src/DataTypes/DataTypeDecimal.cpp

@@ -79,13 +80,50 @@ void DataTypeDecimal<T>::deserializeBinaryBulk(
    IColumn & column,
    ReadBuffer & istr,
    size_t limit,
-    double /*avg_value_size_hint*/) const
+    double /*avg_value_size_hint*/,
+    const IColumn::Filter * filter) const


Please add unit tests for deserializeBinaryBulk(..., filter) to verify correctness for DataTypeDecimal, DataTypeEnum, DataTypeNumberBase, and DataTypeString.
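
For readers following the thread, here is a minimal sketch of what a filter-aware bulk deserialization of fixed-width values could look like, together with the kind of check such a unit test might make. It is not the code from this PR: `deserializeFixedWidth`, the `Filter` alias, and the raw byte-pointer input are hypothetical stand-ins for `deserializeBinaryBulk`, `IColumn::Filter`, and `ReadBuffer`, and clamping the row count to the filter length is an assumption.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical stand-in for IColumn::Filter: one byte per row, non-zero means "keep".
using Filter = std::vector<std::uint8_t>;

// Reads up to `limit` fixed-width values from `src` and appends to `out` only the
// rows whose filter byte is non-zero. A null filter keeps every row.
// Returns the number of rows consumed from the input, kept or not.
template <typename T>
std::size_t deserializeFixedWidth(
    std::vector<T> & out,
    const unsigned char * src,
    std::size_t src_bytes,
    std::size_t limit,
    const Filter * filter)
{
    const std::size_t available = src_bytes / sizeof(T);
    std::size_t rows = limit < available ? limit : available;
    // Assumption: never read more rows than the filter describes.
    if (filter && filter->size() < rows)
        rows = filter->size();

    for (std::size_t i = 0; i < rows; ++i)
    {
        if (filter && !(*filter)[i])
            continue; // row filtered out: skip its bytes without materializing it
        T value;
        std::memcpy(&value, src + i * sizeof(T), sizeof(T));
        out.push_back(value);
    }
    return rows;
}
```

A test along these lines would compare the filtered result against the rows selected by hand, and confirm that passing no filter still returns every row.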

Signed-off-by: Lloyd-Pottiger <[email protected]>

JaySon-Huang · 2024-12-05T09:26:02Z

dbms/src/Storages/DeltaMerge/ColumnFile/ColumnFileSetInputStream.cpp

+        if (block)
+        {
+            block.setStartOffset(read_rows);
+            read_rows += filter.size();


Would read_rows += block.rows() be more reasonable?

No. block.rows() equals passed_count, which can be smaller than filter.size().
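
The point of the reply can be illustrated with a small sketch. It assumes that each read scans `filter.size()` source rows but returns only the rows that pass, so the start offset of the next block has to advance by the filter length. `FilteredBlock`, `countPassed`, and `readAll` are hypothetical names, and unlike the diff above the sketch advances the offset even when a read returns no rows.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for IColumn::Filter: one byte per source row.
using Filter = std::vector<std::uint8_t>;

// Hypothetical summary of a block returned by a filtered read.
struct FilteredBlock
{
    std::size_t start_offset; // offset of the block's first source row in the stream
    std::size_t rows;         // number of rows that passed the filter
};

std::size_t countPassed(const Filter & filter)
{
    std::size_t passed = 0;
    for (std::uint8_t keep : filter)
        passed += (keep != 0);
    return passed;
}

// Simulates consecutive filtered reads, one per filter.
std::vector<FilteredBlock> readAll(const std::vector<Filter> & filters)
{
    std::vector<FilteredBlock> blocks;
    std::size_t read_rows = 0;
    for (const Filter & filter : filters)
    {
        const std::size_t passed = countPassed(filter);
        if (passed > 0)
            blocks.push_back({read_rows, passed});
        // Advance by the rows scanned, not the rows returned; using `passed`
        // here would give every later block a start offset that is too small.
        read_rows += filter.size();
    }
    return blocks;
}
```

With filters `{1,0,1,0}` and `{0,1,1}`, the second block starts at source row 4. Advancing by the returned row count instead would place it at row 2.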

JaySon-Huang · 2024-12-05T10:15:41Z

dbms/src/Storages/DeltaMerge/File/DMFileReader.cpp

        DMFileReaderPool::instance().set(*this, cd.id, start_pack_id, pack_count, column);
        // Delete column from local cache since it is not used anymore.
        data_sharing_col_data_cache->delColumn(cd.id, next_pack_id);
+        return column;


Do we need to apply column->filter(filter) here?

JaySon-Huang · 2024-12-05T10:19:56Z

dbms/src/Storages/DeltaMerge/File/DMFileReader.cpp

+            Block block = read(&block_filter);
+            size_t passed_count = countBytesInFilter(block_filter);
+            for (size_t i = 0; i < block.columns(); ++i)
            {
-                std::vector<size_t> positions;
-                positions.reserve(passed_count);
-                for (size_t p = offset; p < offset + rows; ++p)
-                {
-                    if (filter[p])
-                        positions.push_back(p - offset);
-                }
-                for (size_t i = 0; i < block.columns(); ++i)
-                {
-                    columns[i]->insertDisjunctFrom(*block.getByPosition(i).column, positions);
-                }
+                auto col = block.getByPosition(i).column;
+                // Some columns may only deserialize the passed rows.
+                if (col->size() != passed_count)
+                    col = col->filter(block_filter, passed_count);


We'd better ensure that all the columns returned by read(IColumn::Filter * filter) have the same number of rows, rather than handling it in this for-loop.

I have addressed this in #9687. Since we will rewrite this soon, let's just keep it as is in this PR.
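
As a rough illustration of the invariant being discussed, the sketch below brings every column to `passed_count` rows in one place. It assumes a column arrives either already filtered (`passed_count` rows) or unfiltered (`filter.size()` rows). `Column`, `applyFilter`, and `normalizeColumns` are hypothetical names, not the DMFileReader API.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Hypothetical stand-ins for IColumn::Filter and a concrete column type.
using Filter = std::vector<std::uint8_t>;
using Column = std::vector<std::int64_t>;

std::size_t countPassed(const Filter & filter)
{
    std::size_t passed = 0;
    for (std::uint8_t keep : filter)
        passed += (keep != 0);
    return passed;
}

// Returns a copy of `col` containing only the rows whose filter byte is non-zero.
Column applyFilter(const Column & col, const Filter & filter, std::size_t passed_count)
{
    Column result;
    result.reserve(passed_count);
    for (std::size_t i = 0; i < col.size(); ++i)
        if (filter[i])
            result.push_back(col[i]);
    return result;
}

// Makes every column hold exactly the rows that passed the filter.
void normalizeColumns(std::vector<Column> & columns, const Filter & filter)
{
    const std::size_t passed_count = countPassed(filter);
    for (Column & col : columns)
    {
        if (col.size() == passed_count)
            continue; // treated as already filtered during deserialization
        if (col.size() != filter.size())
            throw std::logic_error("column has neither the filtered nor the unfiltered row count");
        col = applyFilter(col, filter, passed_count);
    }
}
```

Doing this once where the columns are read, rather than in each caller's loop, is the arrangement the review comment asks for.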

ti-chi-bot bot added the release-note-none Denotes a PR that doesn't merit a release note. label Nov 28, 2024

ti-chi-bot bot added the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label Nov 28, 2024

JaySon-Huang reviewed Dec 4, 2024

JaySon-Huang mentioned this pull request Dec 4, 2024

Optimize the performance of filtering data #9699

Closed

2 tasks

Lloyd-Pottiger added 7 commits December 5, 2024 12:47

support filter in deserializeBinaryBulk

edb8500

Signed-off-by: Lloyd-Pottiger <[email protected]>

last

0413ffa

Signed-off-by: Lloyd-Pottiger <[email protected]>

fix tidy

538e0f0

Signed-off-by: Lloyd-Pottiger <[email protected]>

delta

016238c

Signed-off-by: Lloyd-Pottiger <[email protected]>

fix

6a0b6a3

Signed-off-by: Lloyd-Pottiger <[email protected]>

refine

b7f787e

Signed-off-by: Lloyd-Pottiger <[email protected]>

refine

46b4b61

Signed-off-by: Lloyd-Pottiger <[email protected]>

Lloyd-Pottiger force-pushed the read-with-filter-deserial branch from 228880d to 46b4b61 Compare December 5, 2024 04:47

Lloyd-Pottiger requested review from JaySon-Huang, JinheLin and CalvinNeo December 5, 2024 04:51

JaySon-Huang reviewed Dec 5, 2024

JaySon-Huang mentioned this pull request Dec 6, 2024

Vector: optimize read performance #9687

Closed

12 tasks

Lloyd-Pottiger closed this Dec 10, 2024

Lloyd-Pottiger deleted the read-with-filter-deserial branch December 10, 2024 07:37