Add AUTO_RANDOM mode #1242

Open
wants to merge 3 commits into
base: branch-2.2.x

singhravidutt
Contributor

No description provided.

@singhravidutt
Contributor Author

/gcbrun

codecov bot commented Aug 26, 2024

Codecov Report

Attention: Patch coverage is 84.48276% with 9 lines in your changes missing coverage. Please review.

Project coverage is 80.77%. Comparing base (e956f85) to head (205dee0).

| Files | Patch % | Lines |
|---|---|---|
| ...oop/gcsio/GoogleCloudStorageClientReadChannel.java | 82.00% | 3 Missing and 6 partials ⚠️ |

Additional details and impacted files
@@                Coverage Diff                 @@
##             branch-2.2.x    #1242      +/-   ##
==================================================
- Coverage           80.83%   80.77%   -0.06%     
+ Complexity           2417     2416       -1     
==================================================
  Files                 167      167              
  Lines               10815    10861      +46     
  Branches             1197     1211      +14     
==================================================
+ Hits                 8742     8773      +31     
- Misses               1544     1552       +8     
- Partials              529      536       +7     

| Flag | Coverage Δ | |
|---|---|---|
| hadoop2integrationtest | 63.64% <36.20%> (-0.24%) | ⬇️ |
| hadoop2unittest | 67.21% <84.48%> (-0.05%) | ⬇️ |
| hadoop3integrationtest | ? | |
| hadoop3unittest | 67.27% <84.48%> (+<0.01%) | ⬆️ |


@singhravidutt singhravidutt marked this pull request as ready for review August 26, 2024 16:53
@@ -210,14 +212,41 @@ private class ContentReadChannel {
// in-place seeks.
private byte[] skipBuffer = null;
private ReadableByteChannel byteChannel = null;
// Keeps track of distance between last 2 consecutive request.
private LimitedFifoQueue<Long> requestDistance = new LimitedFifoQueue<Long>(2);
Contributor

nit: make this field `final`. The same applies in other places.

}

@Override
public boolean add(E o) {
Contributor

Can vectored IO call this concurrently?

Contributor Author

No, VectoredRead doesn't use GcsReadChannel concurrently from multiple threads. The GcsReadChannel class is not thread safe, and neither is the FifoQueue.
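
For readers following this thread, a bounded FIFO of the kind being discussed can be sketched as below. This is only an illustration, not the PR's `LimitedFifoQueue`: the class name `BoundedFifo` and its methods are hypothetical. Like the classes mentioned above, the sketch is not thread safe and assumes a single calling thread.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical bounded FIFO: keeps only the most recent {@code capacity} elements. */
class BoundedFifo<E> {
  private final int capacity;
  private final Deque<E> items = new ArrayDeque<>();

  BoundedFifo(int capacity) {
    if (capacity <= 0) {
      throw new IllegalArgumentException("capacity must be positive");
    }
    this.capacity = capacity;
  }

  /** Appends an element, evicting the oldest one when the queue is full. Not thread safe. */
  void add(E item) {
    if (items.size() == capacity) {
      items.removeFirst();
    }
    items.addLast(item);
  }

  int size() {
    return items.size();
  }

  /** Returns the oldest retained element, or null if the queue is empty. */
  E oldest() {
    return items.peekFirst();
  }

  /** Returns the most recently added element, or null if the queue is empty. */
  E newest() {
    return items.peekLast();
  }
}
```

With a capacity of 2, adding three values leaves only the last two in the queue, which is all that is needed to compare the two most recent request distances.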

* `AUTO_RANDOM` - in this mode connector starts with bounded range
requests when reading non gzip-encoded object and switches to streaming
request, bounded by `fs.gs.block.size`, if previous two requests follows
sequential read pattern i.e. forward seeks which are within
Contributor

I think there is scope for improvement in the documentation. I did not get the gist of this flag from reading it.

It explains WHAT the feature does. If you could also add WHEN this flag makes sense, it would be useful to future readers of the documentation.

Contributor Author

Updated


private boolean isSequentialAccessPattern() {
if (servedRequestLastIndex != -1) {
requestDistance.add(currentPosition - servedRequestLastIndex);
Contributor

This is a bit of a code smell: a "get" method that updates state. Is there a way to avoid it?

Contributor Author

It's not updating the state of the content, only the internal state about how to access the data. This is the feature we are offering with AUTO and AUTO_RANDOM.
Similarly, a read operation updates the currentPosition pointer in the file, which is also a state change.
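
One way to address the concern raised in this thread is to split the state update from the query. The sketch below is an illustration under stated assumptions, not the PR's code: the class `AccessPatternTracker`, its method names, and the `maxForwardSeek` threshold are all hypothetical, and the threshold stands in for the limit that the truncated documentation text refers to.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical tracker; names and the threshold are assumptions, not the PR's code. */
class AccessPatternTracker {
  private static final int WINDOW = 2; // number of recent request distances to keep

  private final long maxForwardSeek; // assumed threshold, in bytes
  private final Deque<Long> distances = new ArrayDeque<>();
  private long lastServedEnd = -1; // -1 means no request has been served yet

  AccessPatternTracker(long maxForwardSeek) {
    this.maxForwardSeek = maxForwardSeek;
  }

  /** Command: call once per served request; this is the only method that mutates state. */
  void recordRequest(long start, long end) {
    if (lastServedEnd != -1) {
      if (distances.size() == WINDOW) {
        distances.removeFirst();
      }
      distances.addLast(start - lastServedEnd);
    }
    lastServedEnd = end;
  }

  /** Query: true only if the last WINDOW distances are all short forward seeks. */
  boolean isSequential() {
    if (distances.size() < WINDOW) {
      return false;
    }
    for (long distance : distances) {
      if (distance < 0 || distance > maxForwardSeek) {
        return false;
      }
    }
    return true;
  }
}
```

With this split, `isSequential()` can be called any number of times without changing the outcome, while `recordRequest` is invoked exactly once per request. The trade-off is one extra call on the read path.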

@@ -210,14 +212,41 @@ private class ContentReadChannel {
// in-place seeks.
private byte[] skipBuffer = null;
private ReadableByteChannel byteChannel = null;
// Keeps track of distance between last 2 consecutive request.
private LimitedFifoQueue<Long> requestDistance = new LimitedFifoQueue<Long>(2);
Contributor

Should we create this only for AUTO_RANDOM?

Contributor Author

We can, but it adds multiple `if` checks to the regular path. Given that it's limited to just 2 longs, I am not too worried about it.

Contributor Author

Anyhow, it wasn't a big headache. Removed it and also made it configurable.

@@ -43,6 +44,7 @@ public enum Fadvise {
public static final boolean DEFAULT_FAST_FAIL_ON_NOT_FOUND = true;
public static final boolean DEFAULT_SUPPORT_GZIP_ENCODING = true;
public static final long DEFAULT_INPLACE_SEEK_LIMIT = 8 * 1024 * 1024;
public static final long BLOCK_SIZE = 64 * 1024 * 1024;
Contributor

Shouldn't we take the connector's block_size?

Contributor Author

This class defines its own constants. The default picked up in [GoogleHadoopFileSystemConfiguration.java](https://github.com/GoogleCloudDataproc/hadoop-connectors/pull/1242/files#diff-f06c91b66e47300ff6c940ca14f152898b99e6e48033502fd4c1dd69c07f0c68) uses the connector BLOCK_SIZE.

long endPosition = objectSize;

if (sequentialAccess) {
endPosition = objectSize;
Contributor

Why is this line required?

Contributor Author

Not required.
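
For context, the end-of-range calculation this hunk belongs to might look roughly like the sketch below. It is an assumption-laden illustration, not the PR's implementation: the helper `RangeEnd.endPosition` and the `minRangeRequestSize` parameter are hypothetical, and the only detail taken from the PR's documentation text is that a streaming request is bounded by the block size.

```java
/** Hypothetical helper; parameter names are assumptions, not the PR's code. */
class RangeEnd {
  /**
   * Computes the exclusive end offset of the next request.
   * Assumes 0 <= position <= objectSize and non-negative sizes.
   */
  static long endPosition(
      long position,
      long objectSize,
      boolean sequentialAccess,
      long blockSize,
      long minRangeRequestSize) {
    // Sequential reads stream up to one block; random reads use a smaller bounded range.
    long limit = sequentialAccess ? blockSize : minRangeRequestSize;
    // Compare against the remaining bytes first so position + limit cannot overflow.
    long remaining = objectSize - position;
    return limit >= remaining ? objectSize : position + limit;
  }
}
```

Because the end offset defaults to the object size whenever the limit exceeds the remaining bytes, no separate reassignment to `objectSize` is needed in the sequential branch.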

2 participants