feat: Add HashStringAllocator::InputStream #12364

Yuhta · 2025-02-17T23:09:08Z

Summary:
Previously, getting a ByteInputStream from HashStringAllocator required
materializing all the byte ranges in a vector, which is inefficient.
This change avoids that by creating a ByteInputStream directly over
the linked list of a multi-part allocation.

Differential Revision: D69750088
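
For readers unfamiliar with the layout, the difference between the two approaches can be sketched roughly as follows. This is a simplified illustration, not the Velox implementation: `Block`, `ByteRangeLite`, `materializeRanges`, and `LinkedBlockReader` are hypothetical names, and every part is assumed to be non-empty.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical stand-in for one part of a multi-part allocation.
struct Block {
  const uint8_t* data;
  size_t size;       // assumed > 0
  const Block* next; // nullptr on the last part
};

// Hypothetical stand-in for a (pointer, size) byte range.
struct ByteRangeLite {
  const uint8_t* data;
  size_t size;
};

// Old style: walk the whole chain up front and collect every part
// into a vector before any byte is read.
std::vector<ByteRangeLite> materializeRanges(const Block* head) {
  std::vector<ByteRangeLite> ranges;
  for (const Block* b = head; b != nullptr; b = b->next) {
    ranges.push_back({b->data, b->size});
  }
  return ranges;
}

// New style: read straight from the chain, following 'next' lazily.
class LinkedBlockReader {
 public:
  explicit LinkedBlockReader(const Block* head) : current_(head) {}

  // True once the last part has been fully consumed.
  bool atEnd() const {
    return current_ == nullptr ||
        (position_ == current_->size && current_->next == nullptr);
  }

  // Copies up to n bytes into dest and returns the number copied.
  size_t read(uint8_t* dest, size_t n) {
    size_t copied = 0;
    while (copied < n && current_ != nullptr) {
      if (position_ == current_->size) {
        // Current part exhausted: move on to the next one.
        current_ = current_->next;
        position_ = 0;
        continue;
      }
      const size_t chunk = std::min(n - copied, current_->size - position_);
      std::memcpy(dest + copied, current_->data + position_, chunk);
      position_ += chunk;
      copied += chunk;
    }
    return copied;
  }

 private:
  const Block* current_;
  size_t position_{0};
};
```

The reader keeps only a pointer and an offset, so no per-stream vector is allocated regardless of how many parts the allocation has.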

facebook-github-bot · 2025-02-17T23:09:19Z

This pull request was exported from Phabricator. Differential Revision: D69750088

netlify · 2025-02-17T23:09:28Z

✅ Deploy Preview for meta-velox canceled.

Name	Link
🔨 Latest commit	`6b23ab6`
🔍 Latest deploy log	https://app.netlify.com/sites/meta-velox/deploys/67be03cfe3437b0008a02d9b

facebook-github-bot · 2025-02-18T15:06:51Z

This pull request was exported from Phabricator. Differential Revision: D69750088

kevinwilfong · 2025-02-24T18:55:54Z

velox/common/memory/HashStringAllocator.h

+    return range_.position == range_.size && !header_->isContinued();
+  }
+
+  std::streampos tellp() const final {


nit: Do we end up calling this often? It would probably be cheaper to just maintain a counter instead of counting it on the fly.

Yuhta (Feb 25, 2025): No, I don't even see this getting called in real queries. It does the same thing as BufferedOutputStream::tellp, so at least there will be no regression.
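
The trade-off discussed here can be illustrated with a small sketch. It is not the code under review: `Part` and `Cursor` are hypothetical names. One accessor recomputes the position by walking the chain from the start, the other returns a counter that is updated on every advance.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical part of a chained allocation; only sizes matter here.
struct Part {
  size_t size;
  const Part* next; // nullptr on the last part
};

class Cursor {
 public:
  explicit Cursor(const Part* begin) : begin_(begin), current_(begin) {}

  // Advances by up to n bytes, stopping at the end of the chain.
  void skip(size_t n) {
    while (n > 0 && current_ != nullptr) {
      const size_t left = current_->size - offset_;
      if (n < left) {
        offset_ += n;
        counter_ += n;
        return;
      }
      // Consume the rest of the current part.
      n -= left;
      counter_ += left;
      if (current_->next == nullptr) {
        offset_ = current_->size;
        return;
      }
      current_ = current_->next;
      offset_ = 0;
    }
  }

  // On the fly: cost grows with the number of parts before the cursor.
  size_t tellByWalking() const {
    size_t pos = 0;
    for (const Part* p = begin_; p != current_; p = p->next) {
      pos += p->size;
    }
    return pos + offset_;
  }

  // Counter: constant time, at the cost of one extra field to maintain.
  size_t tellByCounter() const {
    return counter_;
  }

 private:
  const Part* begin_;
  const Part* current_;
  size_t offset_{0};
  size_t counter_{0};
};
```

If the position is rarely queried, as the reply suggests, walking the chain keeps the hot read path free of extra bookkeeping.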

kevinwilfong · 2025-02-24T18:56:23Z

velox/common/memory/HashStringAllocator.h

+  }
+
+  void seekp(std::streampos pos) final {
+    header_ = begin_;


Updates to header_ and range_ are tightly bound: everywhere you call resetRange(), the line above sets header_ to some new value.

It looks like it'd be easier/safer to turn resetRange into resetHeader (or something like that):

    void resetHeader(const Header* header) {
      VELOX_DCHECK_GT(header->usableSize(), 0);
      header_ = header;
      range_.buffer = reinterpret_cast<uint8_t*>(header_->begin());
      range_.size = header_->usableSize();
      range_.position = 0;
    }

kevinwilfong · 2025-02-24T18:56:53Z

velox/exec/prefixsort/PrefixSortEncoder.h

        HashStringAllocator::headerOf(value.data()));
-    stream->readBytes(dest, copySize);
+    stream.ByteInputStream::readBytes(dest, copySize);


You have to add this prefix because the InputStream class was forward declared inside the HashStringAllocator, right?

Could you inline the definition inside HashStringAllocator, or define it outside the class in the same header? HashStringAllocatorInputStream would be just as clear as HashStringAllocator::InputStream, slightly shorter, and then we wouldn't need all these ByteInputStream:: prefixes.

Yuhta (Feb 25, 2025): No, we need the prefix because we are calling readBytes<T> (a templated helper) rather than the virtual readBytes. They are two different methods, and because they share a name, the virtual one shadows the templated one when called through the subclass. Moving the subclass to a different location cannot fix this.
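
The name hiding described in this reply can be reproduced in isolation. The sketch below uses hypothetical classes (`BaseStream`, `ChainStream`) and a hypothetical helper (`readInto`) that only mimic the shape of the problem. It assumes a base class with a virtual `readBytes` plus a templated overload of the same name, and a subclass that overrides only the virtual one.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical base class: one virtual method and one templated helper
// that share the name readBytes.
class BaseStream {
 public:
  virtual ~BaseStream() = default;

  virtual void readBytes(uint8_t* bytes, int32_t size) = 0;

  // Convenience overload for other pointer types; forwards to the
  // virtual method (the non-template overload wins for uint8_t*).
  template <typename T>
  void readBytes(T* data, int32_t size) {
    readBytes(reinterpret_cast<uint8_t*>(data), size);
  }
};

// Hypothetical subclass: declaring readBytes here hides every
// BaseStream::readBytes overload during unqualified member lookup.
class ChainStream final : public BaseStream {
 public:
  explicit ChainStream(const uint8_t* data) : data_(data) {}

  void readBytes(uint8_t* bytes, int32_t size) override {
    std::memcpy(bytes, data_ + pos_, size);
    pos_ += size;
  }

 private:
  const uint8_t* data_;
  size_t pos_{0};
};

inline void readInto(ChainStream& stream, char* dest, int32_t size) {
  // stream.readBytes(dest, size);
  // The line above would not compile: lookup stops at
  // ChainStream::readBytes(uint8_t*, int32_t), and char* does not
  // convert implicitly to uint8_t*.

  // Qualifying the call makes lookup start in BaseStream, so the
  // template is found; it then dispatches virtually to ChainStream.
  stream.BaseStream::readBytes(dest, size);
}
```

In general C++, a `using BaseStream::readBytes;` declaration inside the subclass is another way to make the hidden overloads visible again. Whether that would suit this PR is not something the thread addresses.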

Summary: When we get `ByteInputStream` from `HashStringAllocator`, we used to have to materialize all the byte ranges in a vector, which is not efficient. This change improves the efficiency by creating a `ByteInputStream` directly over the linked list of a multi-part allocation. Reviewed By: kevinwilfong Differential Revision: D69750088

facebook-github-bot · 2025-02-25T17:54:41Z

This pull request was exported from Phabricator. Differential Revision: D69750088

facebook-github-bot · 2025-02-26T06:22:40Z

This pull request has been merged in c560aaf.

facebook-github-bot added the CLA Signed label Feb 17, 2025

facebook-github-bot added the fb-exported label Feb 17, 2025

Yuhta force-pushed the export-D69750088 branch from 2be69b0 to d8f1814 Compare February 18, 2025 15:06

kevinwilfong approved these changes Feb 24, 2025

Yuhta force-pushed the export-D69750088 branch from d8f1814 to 6b23ab6 Compare February 25, 2025 17:54

facebook-github-bot closed this in c560aaf Feb 26, 2025

facebook-github-bot added the Merged label Feb 26, 2025
