[C++] Fix reading dictionary values from manifest files #314
/// The file offset for storing the dictionary value.
/// It is only valid if encoding is DICTIONARY.
///
/// The logic type presents the value type of the column, i.e., string value.
where's the logical type stored?
The logical type is stored within the protobuf Field message.
@@ -81,6 +81,16 @@ ::arrow::Status FileWriter::Write(const std::shared_ptr<::arrow::RecordBatch>& b
  return ::arrow::Status::OK();
}

::arrow::Status FileWriter::WriteManifest(const std::shared_ptr<::arrow::io::OutputStream>& destination,
                                          const lance::format::Manifest& manifest) {
  // Write dictionary values first.
why do these have to be written first?
After the dictionary values are written, the offset where they are stored becomes available to record in the manifest.
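To illustrate the ordering constraint, here is a minimal sketch that uses a plain `std::ostream` rather than the Arrow output stream. `DictField`, `WriteDictionary`, and `WriteManifestSketch` are hypothetical names, and the length-prefixed layout is an assumption for illustration, not the format used in this PR. The point is only that the offset is unknown until the values have been written, so the manifest has to be serialized afterwards.

```cpp
#include <cassert>
#include <cstdint>
#include <ostream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical stand-in for a dictionary-encoded field tracked by the manifest.
struct DictField {
  std::vector<std::string> values;      // dictionary values to persist
  std::int64_t dictionary_offset = -1;  // unknown until the values are written
};

// Writes each value as a 4-byte length followed by its bytes (assumed layout).
// Returns the stream position where the dictionary block starts.
std::int64_t WriteDictionary(std::ostream& out, const std::vector<std::string>& values) {
  const auto offset = static_cast<std::int64_t>(out.tellp());
  for (const auto& v : values) {
    const auto len = static_cast<std::uint32_t>(v.size());
    out.write(reinterpret_cast<const char*>(&len), sizeof(len));
    out.write(v.data(), static_cast<std::streamsize>(v.size()));
  }
  return offset;
}

// Step 1: write the values and record where they landed.
// Step 2: serialize the "manifest" (here just the offset), which is now valid.
// Returns the position where the manifest itself starts.
std::int64_t WriteManifestSketch(std::ostream& out, DictField& field) {
  field.dictionary_offset = WriteDictionary(out, field.values);
  const auto manifest_pos = static_cast<std::int64_t>(out.tellp());
  out.write(reinterpret_cast<const char*>(&field.dictionary_offset),
            sizeof(field.dictionary_offset));
  return manifest_pos;
}
```

Writing the manifest first would require seeking back to patch the offset, which an append-only output stream may not allow.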
cpp/src/lance/io/reader.cc
Outdated
@@ -106,6 +142,7 @@ Status FileReader::Open() {

  auto num_batches = metadata_->num_batches();
  auto num_columns = manifest_->schema()->GetFieldsCount();
  fmt::print("num_columns: {}\n", num_columns);
do we need to keep this debug print? or is there a way to change the log level?
  ARROW_RETURN_NOT_OK(reader->Open());
  return reader;
}

::arrow::Result<std::shared_ptr<::lance::format::Manifest>> FileReader::OpenManifest(
    const std::shared_ptr<::arrow::io::RandomAccessFile>& in) {
  constexpr auto kBufReadBytes = 8 * 1024 * 1024;  // Read 8 MB;
why 8 MB?
This is just an arbitrary, not-too-small number. If it is set too small, reading the manifest file could take more than one I/O.
A length is needed here because Arrow's RandomAccessFile does not provide a "read the full file" API. It can be larger, though.
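For reference, a rough sketch of the single-read idea using a standard `std::istream` in place of the Arrow file. `ReadWindow`, `TailWindow`, and `ReadTail` are hypothetical helpers, and reading from the end of the file is an assumption made for this illustration. With Arrow, the file size and the read would come from `RandomAccessFile::GetSize()` and `ReadAt(position, nbytes)`.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <istream>
#include <sstream>
#include <string>

// Hypothetical description of the byte range to fetch in one read.
struct ReadWindow {
  std::int64_t offset;
  std::int64_t length;
};

// Clamp the read to the file size so small files are read whole, and
// larger files are read only up to max_bytes from the end (assumed layout).
ReadWindow TailWindow(std::int64_t file_size, std::int64_t max_bytes) {
  const std::int64_t length = std::min(file_size, max_bytes);
  return {file_size - length, length};
}

// Performs exactly one seek and one read for the computed window.
std::string ReadTail(std::istream& in, std::int64_t file_size, std::int64_t max_bytes) {
  const ReadWindow w = TailWindow(file_size, max_bytes);
  std::string buf(static_cast<std::size_t>(w.length), '\0');
  in.seekg(static_cast<std::streamoff>(w.offset));
  in.read(&buf[0], static_cast<std::streamsize>(w.length));
  return buf;
}
```

If the manifest is larger than the buffer, one read is not enough and a second read is needed, which is the cost a larger constant avoids.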
Writes the dictionary values in the manifest files. Reads the manifest only once and passes it to the FileReaders.