Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merge a pre-computed column / array. #824

Merged
merged 16 commits into from
Jun 6, 2023
Merged

Merge a pre-computed column / array. #824

merged 16 commits into from
Jun 6, 2023

Conversation

eddyxu
Copy link
Contributor

@eddyxu eddyxu commented May 4, 2023

Closes #587

@eddyxu eddyxu force-pushed the lei/add_column_rust branch from dac6795 to ce1a476 Compare May 4, 2023 18:55
@eddyxu eddyxu self-assigned this May 15, 2023
@wjones127 wjones127 force-pushed the lei/add_column_rust branch from ce1a476 to 949dbab Compare June 2, 2023 20:52
@wjones127 wjones127 self-assigned this Jun 5, 2023
Comment on lines 315 to 324
pub(crate) fn lance_supports_nulls(datatype: &DataType) -> bool {
match datatype {
DataType::Utf8
| DataType::LargeUtf8
| DataType::Binary
| DataType::FixedSizeBinary(_)
| DataType::FixedSizeList(_, _) => true,
_ => false,
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are there types I am missing?

Copy link
Contributor Author

@eddyxu eddyxu Jun 6, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

struct and list?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah yes list. struct I am not sure. How do we encode structs? They don't have offsets so I don't think we support nulls there.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm . good point, we think we don't really encode struct itself, but for each field, it encodes separately.

Let's get it going for now.

@wjones127 wjones127 force-pushed the lei/add_column_rust branch from aebd0e0 to 6b2e770 Compare June 5, 2023 23:00
arrow-schema = "37.0"
arrow-select = "37.0"
async-recursion = "1.0"
async-trait = "0.1.60"
byteorder = "1.4.3"
chrono = "0.4.23"
clap = { version = "4.1.1", features = ["derive"], optional = true }
# This is already used by datafusion
dashmap = "5"
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Curious what is benefit of this lib over std::HashMap in this case

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dashmap allows multiple threads to insert concurrently, whereas std::HashMap doesn't allow concurrent inserts. This is just useful for building up the hashmap very fast, given we have the RHS data in memory already for most cases. And as mentioned in the comment, this is already used by DataFusion, so there's no incremental compilation cost from adding it to our dependencies.

Copy link
Contributor Author

@eddyxu eddyxu Jun 6, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, i am fine with that.

Minor question: we need to keep dashmap version sync with datafusion, right? Will cargo pull two different packages if we used the different versions? Not blocker for sure

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, we do. They use 5 right now. We should check all of our dependencies for duplicates periodically, including tokio.

rust/src/dataset.rs Show resolved Hide resolved
rust/src/dataset.rs Show resolved Hide resolved
@wjones127 wjones127 marked this pull request as ready for review June 6, 2023 16:39
@wjones127 wjones127 changed the title [WIP] Merge a pre-computed column / array. Merge a pre-computed column / array. Jun 6, 2023
.duration_since(SystemTime::UNIX_EPOCH)
.unwrap()
.as_nanos();
self.timestamp_nanos = nanos as u128;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this persisted in protobuf?
do we need nano resolution ?

Also might want to have a timezone, so that this is reliable for either "system clock" or cross-datacenter writes.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this persisted in protobuf?

Yes. This is just the existing behavior moved over to this file so it's easier to work with.

But I'll note this in our doc on OCC, since we might want to update when considering concurrent writers.

Copy link
Contributor

@gsilvestrin gsilvestrin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approved per Lei's review

@wjones127 wjones127 merged commit 2197458 into main Jun 6, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add a new column from already computed arrow Table / Arrow
3 participants