[C++] Fix reading dictionary values from manifest files #314

Merged
merged 10 commits into from
Nov 16, 2022

@eddyxu eddyxu commented Nov 15, 2022

Writes the dictionary values to the manifest file. The manifest is now read only once and passed through to the FileReaders.

@eddyxu eddyxu requested a review from changhiskhan November 15, 2022 22:04
@eddyxu eddyxu marked this pull request as ready for review November 15, 2022 22:04
/// The file offset for storing the dictionary value.
/// It is only valid if encoding is DICTIONARY.
///
/// The logical type represents the value type of the column, i.e., string value.
Contributor

where's the logical type stored?

Contributor Author

The logical type is stored within the protobuf Field message.

@@ -81,6 +81,16 @@ ::arrow::Status FileWriter::Write(const std::shared_ptr<::arrow::RecordBatch>& b
  return ::arrow::Status::OK();
}

::arrow::Status FileWriter::WriteManifest(const std::shared_ptr<::arrow::io::OutputStream>& destination,
                                          const lance::format::Manifest& manifest) {
  // Write dictionary values first.
Contributor

why do these have to be written first?

Contributor Author

After the dictionary values are written, the offset where they are stored becomes available in the manifest.
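
A minimal sketch of that ordering, not the code in this PR: `FieldSketch` and `WriteManifestSketch` are hypothetical names, and the text layout is an assumption chosen only to keep the example short. The payload goes out first so that the metadata written afterwards can refer to a known offset.

```cpp
#include <cassert>
#include <cstdint>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical stand-in for a manifest field; not Lance's real type.
struct FieldSketch {
  std::string name;
  std::vector<std::string> dictionary_values;
  int64_t dictionary_offset = -1;  // Unknown until the values are written.
};

// Writes each field's dictionary values first and records where they start,
// then writes a toy "manifest" that refers to those offsets.
// Returns the position at which the toy manifest begins.
int64_t WriteManifestSketch(std::ostringstream& out, std::vector<FieldSketch>& fields) {
  for (auto& field : fields) {
    field.dictionary_offset = static_cast<int64_t>(out.tellp());
    for (const auto& value : field.dictionary_values) {
      out << value << '\n';
    }
  }
  const auto manifest_pos = static_cast<int64_t>(out.tellp());
  for (const auto& field : fields) {
    // Every offset must be known before the metadata is serialized.
    assert(field.dictionary_offset >= 0);
    out << field.name << '@' << field.dictionary_offset << '\n';
  }
  return manifest_pos;
}
```

Writing the metadata first would require either reserving space for the offsets and patching them later, or a second pass over the output.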

@@ -106,6 +142,7 @@ Status FileReader::Open() {

  auto num_batches = metadata_->num_batches();
  auto num_columns = manifest_->schema()->GetFieldsCount();
  fmt::print("num_columns: {}\n", num_columns);
Contributor

do we need to keep this debug print? or is there a way to change the log level?

  ARROW_RETURN_NOT_OK(reader->Open());
  return reader;
}

::arrow::Result<std::shared_ptr<::lance::format::Manifest>> FileReader::OpenManifest(
    const std::shared_ptr<::arrow::io::RandomAccessFile>& in) {
  constexpr auto kBufReadBytes = 8 * 1024 * 1024;  // Read 8 MB;
Contributor

why 8 MB?

Contributor Author

This is just an arbitrary, not-too-small number. If it is set too small, reading the manifest file could take more than one IO.

A length is needed here because Arrow's RandomAccessFile does not have a "read the full file" API. It can be larger, though.
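
To illustrate the trade-off, here is a rough sketch under stated assumptions. It does not use Arrow or the real Lance layout: `CountingFile`, `ReadManifestSketch`, and `BuildToyFile` are hypothetical, and the toy file is assumed to end with an 8-byte manifest offset in host byte order. One tail read suffices when the buffer covers the manifest; otherwise a second read is issued.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <cstring>
#include <string>

// Hypothetical in-memory "file" that counts reads; stands in for a random access file.
struct CountingFile {
  std::string data;
  int num_reads = 0;

  std::string ReadAt(int64_t position, int64_t nbytes) {
    ++num_reads;
    return data.substr(static_cast<size_t>(position), static_cast<size_t>(nbytes));
  }
  int64_t Size() const { return static_cast<int64_t>(data.size()); }
};

// Assumed toy layout (not Lance's real one): [payload][manifest][8-byte manifest offset].
std::string ReadManifestSketch(CountingFile& file, int64_t buf_read_bytes) {
  const int64_t size = file.Size();
  const int64_t footer = static_cast<int64_t>(sizeof(int64_t));
  assert(size >= footer);
  const int64_t tail_len = std::min(size, std::max(buf_read_bytes, footer));
  const int64_t tail_start = size - tail_len;
  const std::string tail = file.ReadAt(tail_start, tail_len);  // First IO.

  int64_t manifest_pos = 0;
  std::memcpy(&manifest_pos, tail.data() + tail.size() - footer, sizeof(manifest_pos));
  const int64_t manifest_len = size - footer - manifest_pos;

  if (manifest_pos >= tail_start) {
    // The whole manifest is already inside the tail buffer.
    return tail.substr(static_cast<size_t>(manifest_pos - tail_start),
                       static_cast<size_t>(manifest_len));
  }
  // The buffer was too small, so a second IO is needed.
  return file.ReadAt(manifest_pos, manifest_len);
}

// Builds a toy file in the assumed layout.
std::string BuildToyFile(const std::string& payload, const std::string& manifest) {
  const auto manifest_pos = static_cast<int64_t>(payload.size());
  std::string file = payload + manifest;
  file.append(reinterpret_cast<const char*>(&manifest_pos), sizeof(manifest_pos));
  return file;
}
```

With a 100-byte payload and an 8-byte manifest, a 64-byte buffer finds the manifest in one read, while a 10-byte buffer needs two.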

@eddyxu eddyxu merged commit 2e85bdc into main Nov 16, 2022
@eddyxu eddyxu deleted the lei/read_bug branch November 16, 2022 01:21