-
Notifications
You must be signed in to change notification settings - Fork 245
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[C++] Add columns from in-memory table #337
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nice. this is fancy stuff.
really blurring the line between data format and db lol
@@ -130,6 +130,46 @@ class LanceDataset : public ::arrow::dataset::Dataset { | |||
::arrow::Result<std::shared_ptr<UpdaterBuilder>> NewUpdate( | |||
const std::shared_ptr<::arrow::Field>& new_field) const; | |||
|
|||
::arrow::Result<std::shared_ptr<UpdaterBuilder>> NewUpdate( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
does either this or the single-field addition work for new subfield of nested field? If yes is there a test? If not, should it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not sure i understand your question. You mean add some tests for a column with nested data? or something else?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i'm just asking what the behavior is (i forgot)
|
||
/// HashMerger constructor. | ||
explicit HashMerger(std::shared_ptr<::arrow::Table> table, | ||
std::string index_column, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we want to leave room for more complex join behavior? Or is the idea that we would resolve that at a higher level and construct some kind of virtual join column with combined values?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel that should be done in the application / DB level?
cpp/src/lance/arrow/hash_merger.h
Outdated
|
||
/// Collect the batch records with the same keys in the column. | ||
::arrow::Result<std::shared_ptr<::arrow::RecordBatch>> Collect( | ||
const std::shared_ptr<::arrow::Array>& on_col); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the diff between the index_column and the on_col?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh, no difference. on_col
is for the column specified by the on
keyword. , which is the same.
Which one do you prefer ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
either is fine as long as we're consistent. Oic one is the column name and the other is the data.
index_column
and index_array
?
@@ -382,4 +391,52 @@ ::arrow::Result<::arrow::dataset::FragmentIterator> LanceDataset::GetFragmentsIm | |||
return ::arrow::MakeVectorIterator(fragments); | |||
} | |||
|
|||
::arrow::Result<std::shared_ptr<LanceDataset>> LanceDataset::AddColumns( | |||
const std::shared_ptr<::arrow::Table>& other, | |||
const std::string& on, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not sure we'll always have the luxury of the same name on both. Can change this later tho i guess
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we can change the signature to AddColumns(other, left_on, right_on)
? I tried, but could not come up with a good function signature.
Suggestions?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
left_on
and right_on
seems fine. In python we can have optional on=
if the column name is indeed the same ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
cpp/src/lance/arrow/dataset.cc
Outdated
|
||
// Second phase | ||
auto table_schema = other->schema(); | ||
ARROW_ASSIGN_OR_RAISE(auto merged_schema, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
merged_schema
usually means the combined schema. It looks like this is the "schema that is being merged in" ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
schema_for_merging_columns
?
any suggestions?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
incoming_schema
maybe?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
} | ||
assert(batch->schema()->Equals(::arrow::schema({left_column}))); | ||
auto index_arr = batch->GetColumnByName(on); | ||
ARROW_ASSIGN_OR_RAISE(auto right_batch, merger.Collect(index_arr)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fancypants
f58c13b
to
db6d628
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pending the join item you're working on and CI
Provide a
Dataset::AddColumns(const Table& table, const string& on)
interface in C++.Closes #320