Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Add HashStringAllocator::InputStream #12364

Closed
wants to merge 1 commit into from

Conversation

Yuhta
Copy link
Contributor

@Yuhta Yuhta commented Feb 17, 2025

Summary:
When we get ByteInputStream from HashStringAllocator, we used to
have to materialize all the byte ranges in a vector, which is not efficient.
This change improve the efficiency by creating a ByteInputStream directly over
the linked list of a multi-part allocation.

Differential Revision: D69750088

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Feb 17, 2025
@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D69750088

Copy link

netlify bot commented Feb 17, 2025

Deploy Preview for meta-velox canceled.

Name Link
🔨 Latest commit 6b23ab6
🔍 Latest deploy log https://app.netlify.com/sites/meta-velox/deploys/67be03cfe3437b0008a02d9b

Yuhta added a commit to Yuhta/velox that referenced this pull request Feb 18, 2025
Summary:

When we get `ByteInputStream` from `HashStringAllocator`, we used to
have to materialize all the byte ranges in a vector, which is not efficient.
This change improves the efficiency by creating a `ByteInputStream` directly over
the linked list of a multi-part allocation.

Differential Revision: D69750088
@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D69750088

return range_.position == range_.size && !header_->isContinued();
}

std::streampos tellp() const final {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Do we end up calling this often? It would probably be cheaper to just maintain a counter instead of counting it on the fly.

Copy link
Contributor Author

@Yuhta Yuhta Feb 25, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No I don't even see this got called in real query. It's doing the same thing as BufferedOutputStream::tellp so at least there will be no regression.

}

void seekp(std::streampos pos) final {
header_ = begin_;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updates to header_ and range_ are tightly bound, everywhere you call resetRange(), the line above sets header_ to some new value.

it looks like it'd be easier/safer to make resetRange into resetHeader (or something like that)

void resetHeader(const Header* header) {
  VELOX_DCHECK_GT(header->usableSize(), 0);
  header_ = header;
  range_.buffer = reinterpret_cast<uint8_t*>(header_->begin());
  range_.size = header_->usableSize();
  range_.position = 0;
}

HashStringAllocator::headerOf(value.data()));
stream->readBytes(dest, copySize);
stream.ByteInputStream::readBytes(dest, copySize);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You have to add this prefix because the InputStream class was forward declared inside the HashStringAllocator, right?

Could you inline the definition in side of HashStringAllocator, or define it outside the class in the same header? HashStringAllocatorInputStream would be just as clear as HashStringAllocator::InputStream, slightly shorter, and then we don't need all these ByteInputStream:: prefixes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No we need to prefix this because we are calling readBytes<T> (a templated helper) instead of the virtual readBytes, they are 2 different methods, and because they have same name, the virtual one is shadowing the templated one if called from subclass. Moving the subclass to different location cannot fix this.

Summary:

When we get `ByteInputStream` from `HashStringAllocator`, we used to
have to materialize all the byte ranges in a vector, which is not efficient.
This change improves the efficiency by creating a `ByteInputStream` directly over
the linked list of a multi-part allocation.

Reviewed By: kevinwilfong

Differential Revision: D69750088
@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D69750088

@facebook-github-bot
Copy link
Contributor

This pull request has been merged in c560aaf.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. fb-exported Merged
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants