Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Only load manifest once within the dataset and share Manifest amount the readers #155

Merged
merged 5 commits into from
Sep 13, 2022

Conversation

eddyxu
Copy link
Contributor

@eddyxu eddyxu commented Sep 9, 2022

Additionally it only reads the dictionary columns once , to prevent repeated read dictionary on each open file.

@eddyxu
Copy link
Contributor Author

eddyxu commented Sep 9, 2022

After this PR, seems the FileReader::Open in s3sdk is the most expensive operation

Before the PR
Screenshot 2022-09-08 at 4 54 21 PM

After this PR
Screenshot 2022-09-09 at 3 42 33 PM

@eddyxu eddyxu requested a review from changhiskhan September 9, 2022 23:20
@@ -12,7 +12,7 @@
// See the License for the specific language governing permissions and
// limitations under the License.

#include "lance/arrow/reader.h"
#include "lance/io/reader.h"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why move this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I removed the public FileReader interface. We did not use it publicly before, and it only acts a proxy to lance::io::FileReader.

So this PR needs to pass Schema / Manifest to FileReader, if we are still using lance::arrow::FileReader it means that we need to public Schema / Manifest as well. I feel that it is not necessary in the moment.

@@ -232,6 +232,7 @@ class Field final {
friend class FieldVisitor;
friend class ToArrowVisitor;
friend class WriteDictionaryVisitor;
friend class LoadDictionaryVisitor;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the difference between Read and Load?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, this is loading the string values of the dictionary. I am fine to rename it. wdyt.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's just be consistent if it's doing the same thing

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed.

@@ -345,6 +346,7 @@ std::shared_ptr<Field> Field::Copy(bool include_children) const {
new_field->logical_type_ = logical_type_;
new_field->extension_name_ = extension_name_;
new_field->encoding_ = encoding_;
new_field->dictionary_ = dictionary_;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is for dictionary arrays? or is this some other dictionary?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this is the dictionary value array (i.e., strings), do not contain indices.

@eddyxu eddyxu added arrow Apache Arrow related issues benchmark labels Sep 12, 2022
@eddyxu eddyxu self-assigned this Sep 12, 2022
@eddyxu eddyxu marked this pull request as ready for review September 12, 2022 16:48
Copy link
Contributor

@changhiskhan changhiskhan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

one small nit in the comments

@eddyxu eddyxu merged commit 4e316a8 into main Sep 13, 2022
@eddyxu eddyxu deleted the lei/shared_manifest branch September 13, 2022 18:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arrow Apache Arrow related issues benchmark
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants