Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[C++] Add columns from in-memory table #337

Merged
merged 20 commits into from
Dec 1, 2022
Merged

[C++] Add columns from in-memory table #337

merged 20 commits into from
Dec 1, 2022

Conversation

eddyxu
Copy link
Contributor

@eddyxu eddyxu commented Nov 30, 2022

Provide a Dataset::AddColumns(const Table& table, const string& on) interface in C++.

Closes #320

@eddyxu eddyxu requested review from changhiskhan and removed request for changhiskhan November 30, 2022 05:31
@eddyxu eddyxu self-assigned this Nov 30, 2022
@eddyxu eddyxu added c++ C++ issues WIP work in progress donotmerge Do not merge labels Nov 30, 2022
@eddyxu eddyxu changed the title [C++] Add column with in-memory table [C++] Add columns from in-memory table Nov 30, 2022
@eddyxu eddyxu requested a review from changhiskhan November 30, 2022 16:50
@eddyxu eddyxu marked this pull request as ready for review November 30, 2022 17:04
@eddyxu eddyxu removed WIP work in progress donotmerge Do not merge labels Dec 1, 2022
Copy link
Contributor

@changhiskhan changhiskhan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice. this is fancy stuff.
really blurring the line between data format and db lol

cpp/include/lance/arrow/dataset.h Show resolved Hide resolved
@@ -130,6 +130,46 @@ class LanceDataset : public ::arrow::dataset::Dataset {
::arrow::Result<std::shared_ptr<UpdaterBuilder>> NewUpdate(
const std::shared_ptr<::arrow::Field>& new_field) const;

::arrow::Result<std::shared_ptr<UpdaterBuilder>> NewUpdate(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does either this or the single-field addition work for new subfield of nested field? If yes is there a test? If not, should it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not sure i understand your question. You mean add some tests for a column with nested data? or something else?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i'm just asking what the behavior is (i forgot)


/// HashMerger constructor.
explicit HashMerger(std::shared_ptr<::arrow::Table> table,
std::string index_column,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we want to leave room for more complex join behavior? Or is the idea that we would resolve that at a higher level and construct some kind of virtual join column with combined values?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel that should be done in the application / DB level?


/// Collect the batch records with the same keys in the column.
::arrow::Result<std::shared_ptr<::arrow::RecordBatch>> Collect(
const std::shared_ptr<::arrow::Array>& on_col);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the diff between the index_column and the on_col?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh, no difference. on_col is for the column specified by the on keyword. , which is the same.

Which one do you prefer ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

either is fine as long as we're consistent. Oic one is the column name and the other is the data.

index_column and index_array?

cpp/src/lance/arrow/hash_merger.cc Outdated Show resolved Hide resolved
cpp/src/lance/arrow/hash_merger.cc Outdated Show resolved Hide resolved
@@ -382,4 +391,52 @@ ::arrow::Result<::arrow::dataset::FragmentIterator> LanceDataset::GetFragmentsIm
return ::arrow::MakeVectorIterator(fragments);
}

::arrow::Result<std::shared_ptr<LanceDataset>> LanceDataset::AddColumns(
const std::shared_ptr<::arrow::Table>& other,
const std::string& on,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not sure we'll always have the luxury of the same name on both. Can change this later tho i guess

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we can change the signature to AddColumns(other, left_on, right_on)? I tried, but could not come up with a good function signature.

Suggestions?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

left_on and right_on seems fine. In python we can have optional on= if the column name is indeed the same ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

cpp/src/lance/arrow/dataset.cc Show resolved Hide resolved

// Second phase
auto table_schema = other->schema();
ARROW_ASSIGN_OR_RAISE(auto merged_schema,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

merged_schema usually means the combined schema. It looks like this is the "schema that is being merged in" ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

schema_for_merging_columns?

any suggestions?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

incoming_schema maybe?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

}
assert(batch->schema()->Equals(::arrow::schema({left_column})));
auto index_arr = batch->GetColumnByName(on);
ARROW_ASSIGN_OR_RAISE(auto right_batch, merger.Collect(index_arr));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fancypants

Copy link
Contributor

@changhiskhan changhiskhan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pending the join item you're working on and CI

@eddyxu eddyxu merged commit 1592cbd into main Dec 1, 2022
@eddyxu eddyxu deleted the lei/hash_merge branch December 1, 2022 23:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
c++ C++ issues
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Append column takes a already-computed array.
2 participants