
Upgrade GCS client to 2.2.8 #3507

Merged

1 commit merged into trinodb:master on Sep 22, 2022

Conversation

elonazoulay
Member

This fixes the issue with performance cache: GoogleCloudDataproc/hadoop-connectors#342

@cla-bot cla-bot bot added the cla-signed label Apr 22, 2020
@findepi findepi removed their request for review April 25, 2020 21:37
pom.xml Outdated
@@ -1010,6 +1030,20 @@
<groupId>com.google.cloud.bigdataoss</groupId>
<artifactId>gcsio</artifactId>
Member

If you are using GCS connector only through HDFS/HCFS API you can depend on shaded JAR instead (<classifier>shaded</classifier>). This way you will have only one dependency instead of 4 dependencies now.
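For reference, the shaded-classifier dependency suggested here takes roughly this form in a pom (a sketch; `${dep.gcs.version}` is the property this PR already defines, and this is the shape the merged change ultimately uses):

```xml
<dependency>
    <groupId>com.google.cloud.bigdataoss</groupId>
    <artifactId>gcs-connector</artifactId>
    <version>hadoop2-${dep.gcs.version}</version>
    <!-- the shaded JAR bundles the transitive Google/HTTP dependencies -->
    <classifier>shaded</classifier>
</dependency>
```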

pom.xml Outdated
@@ -51,7 +51,7 @@
<dep.jdbi3.version>3.4.0</dep.jdbi3.version>
<dep.drift.version>1.14</dep.drift.version>
<dep.tempto.version>174</dep.tempto.version>
<dep.gcs.version>2.0.0</dep.gcs.version>
<dep.gcs.version>2.1.1</dep.gcs.version>
Member

2.1.3 is the latest, should we use that instead?

Member Author

Yep, thanks!

pom.xml Outdated (resolved)
pom.xml Outdated
<dependency>
<groupId>com.google.http-client</groupId>
<artifactId>google-http-client</artifactId>
<version>1.34.2</version>
Member

Latest is 1.35.0

Member Author

No longer needed with the shaded gcs-connector jar.

pom.xml Outdated (three resolved review threads)
@elonazoulay elonazoulay force-pushed the update_gcs211 branch 2 times, most recently from 7092b42 to 29fb8ce Compare May 14, 2020 10:12
pom.xml Outdated
Comment on lines 1020 to 1503
<dependency>
<groupId>com.google.cloud.bigdataoss</groupId>
<artifactId>util-hadoop</artifactId>
<version>hadoop2-${dep.gcs.version}</version>
</dependency>

<dependency>
<groupId>com.google.cloud.bigdataoss</groupId>
<artifactId>gcsio</artifactId>
<version>${dep.gcs.version}</version>
</dependency>

<dependency>
<groupId>org.apache.httpcomponents</groupId>
<artifactId>httpclient</artifactId>
<version>4.5.11</version>
</dependency>

<dependency>
<groupId>org.apache.httpcomponents</groupId>
<artifactId>httpcore</artifactId>
<version>4.4.13</version>
</dependency>

<dependency>
<groupId>com.google.code.gson</groupId>
<artifactId>gson</artifactId>
<version>2.8.6</version>
</dependency>

Member

All these dependencies should be redundant if you are using shaded GCS connector.

Member Author

They are there due to enforcer errors. Without those dependencies, aws-java-sdk and a few other artifacts complain about upper bounds.

Member Author

If the versions for gson, httpclient and httpcore dependencies are not explicitly defined in dependency management, maven-enforcer fails with the following upper bound dependency violations:

[WARNING] Rule 0: org.apache.maven.plugins.enforcer.RequireUpperBoundDeps failed with message:
Failed while enforcing RequireUpperBoundDeps. The error(s) are [
Require upper bound dependencies error for org.apache.httpcomponents:httpclient:4.5.9 paths to dependency are:
+-io.prestosql:presto-hive:334-SNAPSHOT
  +-com.amazonaws:aws-java-sdk-core:1.11.749
    +-org.apache.httpcomponents:httpclient:4.5.9
and
+-io.prestosql:presto-hive:334-SNAPSHOT
  +-com.google.cloud.bigdataoss:gcs-connector:hadoop2-2.1.3
    +-com.google.oauth-client:google-oauth-client:1.30.6
      +-com.google.http-client:google-http-client:1.34.2
        +-org.apache.httpcomponents:httpclient:4.5.11
, 
Require upper bound dependencies error for org.apache.httpcomponents:httpcore:4.4.11 paths to dependency are:
+-io.prestosql:presto-hive:334-SNAPSHOT
  +-com.amazonaws:aws-java-sdk-core:1.11.749
    +-org.apache.httpcomponents:httpclient:4.5.9
      +-org.apache.httpcomponents:httpcore:4.4.11
and
+-io.prestosql:presto-hive:334-SNAPSHOT
  +-com.google.cloud.bigdataoss:gcs-connector:hadoop2-2.1.3
    +-com.google.oauth-client:google-oauth-client:1.30.6
      +-com.google.http-client:google-http-client:1.34.2
        +-org.apache.httpcomponents:httpcore:4.4.13
, 
Require upper bound dependencies error for com.google.code.gson:gson:2.2.4 paths to dependency are:
+-io.prestosql:presto-hive:334-SNAPSHOT
  +-com.google.cloud.bigdataoss:gcsio:2.1.3
    +-com.google.code.gson:gson:2.2.4
and
+-io.prestosql:presto-hive:334-SNAPSHOT
  +-com.google.cloud.bigdataoss:gcsio:2.1.3
    +-com.google.protobuf:protobuf-java-util:3.11.4
      +-com.google.code.gson:gson:2.8.6
and
+-io.prestosql:presto-hive:334-SNAPSHOT
  +-com.google.cloud.bigdataoss:gcsio:2.1.3
    +-io.grpc:grpc-alts:1.29.0
      +-io.grpc:grpc-core:1.29.0
        +-com.google.code.gson:gson:2.8.6

Member Author

If the gcsio version is not explicitly defined in dependencyManagement, maven-enforcer fails with the following upper bound dependency violations:

[WARNING] Rule 0: org.apache.maven.plugins.enforcer.RequireUpperBoundDeps failed with message:
Failed while enforcing RequireUpperBoundDeps. The error(s) are [
Require upper bound dependencies error for io.grpc:grpc-context:1.22.1 paths to dependency are:
+-io.prestosql:presto-hive:334-SNAPSHOT
  +-com.google.cloud.bigdataoss:gcs-connector:hadoop2-2.1.3
    +-com.google.oauth-client:google-oauth-client:1.30.6
      +-com.google.http-client:google-http-client:1.34.2
        +-io.opencensus:opencensus-api:0.24.0
          +-io.grpc:grpc-context:1.22.1
and
+-io.prestosql:presto-hive:334-SNAPSHOT
  +-com.google.cloud.bigdataoss:gcs-connector:hadoop2-2.1.3
    +-com.google.cloud.bigdataoss:gcsio:2.1.3
      +-io.grpc:grpc-auth:1.29.0
        +-io.grpc:grpc-api:1.29.0
          +-io.grpc:grpc-context:1.29.0

Member Author

If util-hadoop is not explicitly defined, then presto-hive fails with a used-but-undeclared dependency error.

Member

I think it depends on the next Presto release date: if it is scheduled after the new GCS connector release in two weeks or so, then it makes sense to wait and update straight to the shaded GCS connector 2.1.4. But if the next Presto release comes earlier, then I think we can go ahead with unshaded artifacts for now.

Member Author

I'll update the PR with non-shaded artifacts and save this one; we can always go back to it. We'll see which one happens first :)

Member Author

Thanks @medb!

Member

GCS connector 2.1.4 was just released, so you can update this PR to use it now.

Member Author

Great!

@elonazoulay elonazoulay force-pushed the update_gcs211 branch 5 times, most recently from 7bfe57c to d398a5a Compare May 20, 2020 14:58
@elonazoulay
Member Author

elonazoulay commented Aug 5, 2020

Hi @medb, update: I tried 2.1.4 locally and will post issues. One of them is that gcsio.GoogleCloudStorageFileSystem is not repackaged in com.google.cloud.hadoop.fs.gcs, so it still needs to be referred to as com.google.cloud.hadoop.repackaged.gcs.com.google.cloud.hadoop.gcsio.GoogleCloudStorageFileSystem

@medb
Member

medb commented Aug 5, 2020

Hi @medb, update: tried 2.1.4 locally, will post issues - one of them is that gcsio.GoogleCloudStorageFileSystem is not repackaged in com.google.cloud.hadoop.fs.gcs so it still needs to be referred to as com.google.cloud.hadoop.repackaged.gcs.com.google.cloud.hadoop.gcsio.GoogleCloudStorageFileSystem

Yep, if you directly depend on the gcsio package then you should not use the shaded GCS connector. I thought that Presto uses the HDFS/HCFS interface in the GCS connector for GCS access.

Why is there a need to use classes in gcsio directly? Usually if you use classes from gcsio then you should use only the gcsio library, not the GCS connector that implements the HDFS/HCFS API on top of it (i.e. you should depend directly on either gcs-connector or the gcsio library, not on both of them).
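In pom terms, the either/or above would be one of these two sketches (illustrative only; the version coordinates follow this PR's conventions):

```xml
<!-- Option A: access GCS only through the HDFS/HCFS API -->
<dependency>
    <groupId>com.google.cloud.bigdataoss</groupId>
    <artifactId>gcs-connector</artifactId>
    <version>hadoop2-${dep.gcs.version}</version>
    <classifier>shaded</classifier>
</dependency>

<!-- Option B: use the gcsio classes directly, without the connector -->
<dependency>
    <groupId>com.google.cloud.bigdataoss</groupId>
    <artifactId>gcsio</artifactId>
    <version>${dep.gcs.version}</version>
</dependency>
```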

@elonazoulay
Member Author

elonazoulay commented Aug 6, 2020

It's only used to get the scheme in GcsConfigurationProvider, i.e. GoogleCloudStorageFileSystem.SCHEME.

Also, I noticed that if I select with a limit then I see the following stack trace in the logs, but the query succeeds. Is there some config that would avoid the stack trace?

Here it is:

java.nio.channels.ClosedByInterruptException
	at java.base/java.nio.channels.spi.AbstractInterruptibleChannel.end(AbstractInterruptibleChannel.java:199)
	at java.base/java.nio.channels.Channels$ReadableByteChannelImpl.read(Channels.java:390)
	at com.google.cloud.hadoop.repackaged.gcs.com.google.cloud.hadoop.gcsio.GoogleCloudStorageReadChannel.read(GoogleCloudStorageReadChannel.java:385)
	at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFSInputStream.read(GoogleHadoopFSInputStream.java:131)
	at org.apache.hadoop.fs.FSInputStream.read(FSInputStream.java:75)
	at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFSInputStream.read(GoogleHadoopFSInputStream.java:161)
	at org.apache.hadoop.fs.FSInputStream.readFully(FSInputStream.java:121)
	at org.apache.hadoop.fs.FSDataInputStream.readFully(FSDataInputStream.java:111)
	at org.apache.hadoop.fs.FSDataInputStream.readFully(FSDataInputStream.java:111)
	at io.prestosql.plugin.hive.parquet.HdfsParquetDataSource.readFully(HdfsParquetDataSource.java:114)
	at io.prestosql.plugin.hive.parquet.HdfsParquetDataSource$ReferenceCountedReader.read(HdfsParquetDataSource.java:290)
	at io.prestosql.plugin.hive.parquet.HdfsParquetDataSource$1.read(HdfsParquetDataSource.java:187)
	at io.prestosql.parquet.reader.ParquetReader.readPrimitive(ParquetReader.java:249)
	at io.prestosql.parquet.reader.ParquetReader.readColumnChunk(ParquetReader.java:310)
	at io.prestosql.parquet.reader.ParquetReader.readBlock(ParquetReader.java:293)
	at io.prestosql.plugin.hive.parquet.ParquetPageSource$ParquetBlockLoader.load(ParquetPageSource.java:164)
	at io.prestosql.spi.block.LazyBlock$LazyData.load(LazyBlock.java:381)
	at io.prestosql.spi.block.LazyBlock$LazyData.getFullyLoadedBlock(LazyBlock.java:360)
	at io.prestosql.spi.block.LazyBlock.getLoadedBlock(LazyBlock.java:276)
	at io.prestosql.spi.Page.getLoadedPage(Page.java:273)
	at io.prestosql.operator.TableScanOperator.getOutput(TableScanOperator.java:304)
	at io.prestosql.operator.Driver.processInternal(Driver.java:379)
	at io.prestosql.operator.Driver.lambda$processFor$8(Driver.java:283)
	at io.prestosql.operator.Driver.tryWithLock(Driver.java:675)
	at io.prestosql.operator.Driver.processFor(Driver.java:276)
	at io.prestosql.execution.SqlTaskExecution$DriverSplitRunner.processFor(SqlTaskExecution.java:1076)
	at io.prestosql.execution.executor.PrioritizedSplitRunner.process(PrioritizedSplitRunner.java:163)
	at io.prestosql.execution.executor.TaskExecutor$TaskRunner.run(TaskExecutor.java:484)
	at io.prestosql.$gen.Presto_unknown____20200806_135605_2.run(Unknown Source)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	at java.base/java.lang.Thread.run(Thread.java:834)

@elonazoulay
Member Author

elonazoulay commented Aug 6, 2020

Looks like the stack trace is OK; it happens with the current version as well. It definitely doesn't result in query failures.
@medb, @ebyhr when you have a chance let me know if this is ok.

@elonazoulay elonazoulay changed the title Upgrade gcs to 2.1.1 Upgrade gcs to 2.1.4 Aug 6, 2020
@medb
Member

medb commented Aug 7, 2020

Yeah, we don't have a SCHEME constant in the GCS connector directly; I added it in GoogleCloudDataproc/hadoop-connectors#426, so it will be in the 2.1.5 release.

@elonazoulay
Member Author

Looks like 2.1.5 was just released, will update shortly.

@ebyhr
Member

ebyhr commented Sep 2, 2022

Sent #13969 within the repository.

@elonazoulay
Member Author

elonazoulay commented Sep 15, 2022

Hey @ebyhr @brokenjacobs @medb - I did some testing, and cross-cloud (from Trino in Azure -> GCS) there was a drastic latency increase in GCS calls due to the keep-alive cache. I added the Connection: close custom header and the latency disappeared, but HTTP requests would then fail and queries were unreliable; I tried setting HTTP retries to higher values (20k) but it was still unreliable.
Here's a thread dump - all threads but one are blocked trying to put the client back in the keep-alive cache, which uses a synchronized method to do so:

 java.lang.Thread.State: BLOCKED (on object monitor)
        at sun.net.www.http.KeepAliveCache.put(java.base/KeepAliveCache.java:89)
        - waiting to lock <0x00007fa656a70650> (a sun.net.www.http.KeepAliveCache)
        at sun.net.www.protocol.https.HttpsClient.putInKeepAliveCache(java.base/HttpsClient.java:663)
        at sun.net.www.http.HttpClient.finished(java.base/HttpClient.java:440)
        at sun.net.www.http.KeepAliveStream.close(java.base/KeepAliveStream.java:99)
        at sun.net.www.MeteredStream.justRead(java.base/MeteredStream.java:93)
        at sun.net.www.MeteredStream.read(java.base/MeteredStream.java:135)
        - locked <0x00007fb6533639e0> (a sun.net.www.http.KeepAliveStream)
        at java.io.FilterInputStream.read(java.base/FilterInputStream.java:133)
        at sun.net.www.protocol.http.HttpURLConnection$HttpInputStream.read(java.base/HttpURLConnection.java:3525)
        at com.google.cloud.hadoop.repackaged.gcs.com.google.api.client.http.javanet.NetHttpResponse$SizeValidatingInputStream.read(NetHttpResponse.java:164)
        at com.google.cloud.hadoop.repackaged.gcs.com.google.cloud.hadoop.gcsio.GoogleCloudStorageReadChannel.cacheFooter(GoogleCloudStorageReadChannel.java:802)
        at com.google.cloud.hadoop.repackaged.gcs.com.google.cloud.hadoop.gcsio.GoogleCloudStorageReadChannel.openStream(GoogleCloudStorageReadChannel.java:980)
        at com.google.cloud.hadoop.repackaged.gcs.com.google.cloud.hadoop.gcsio.GoogleCloudStorageReadChannel.openContentChannel(GoogleCloudStorageReadChannel.java:712)
        at com.google.cloud.hadoop.repackaged.gcs.com.google.cloud.hadoop.gcsio.GoogleCloudStorageReadChannel.performLazySeek(GoogleCloudStorageReadChannel.java:703)
        at com.google.cloud.hadoop.repackaged.gcs.com.google.cloud.hadoop.gcsio.GoogleCloudStorageReadChannel.read(GoogleCloudStorageReadChannel.java:306)
        at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFSInputStream.read(GoogleHadoopFSInputStream.java:118)

Would you know if there is a way to disable the keepalive cache? Found this, but I tried 2.1.6 and it had similar behavior.

Anyone else familiar with this issue?

This affected metadata calls the most, not reads; the use case was reading tons of tiny files that make up Hudi timelines (on the coordinator). For reading data there was no degradation.
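One process-wide knob worth noting (an assumption that it applies here: the connector's HTTP transport ultimately uses the JDK's HttpURLConnection, which honors the standard `http.keepAlive` networking property): setting that property to false disables the JDK keep-alive cache entirely, at the cost of a fresh connection per request. A minimal sketch:

```java
// Sketch: disabling the JDK's HTTP keep-alive cache via the standard
// networking property. It must be set before the first HTTP connection
// is made, e.g. as -Dhttp.keepAlive=false on the JVM command line.
public class DisableKeepAlive {
    public static void main(String[] args) {
        System.setProperty("http.keepAlive", "false");
        System.out.println(System.getProperty("http.keepAlive"));
    }
}
```

Whether this helps the cross-cloud case above is untested; it trades the lock contention for per-request connection setup, much like the Connection: close experiment described earlier.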

@brokenjacobs
Contributor

Is it HTTP or TCP keepalives causing your issue? That link seems to be referring to TCP keepalives and the SSLSocketFactory, not HTTP keepalives (an HTTP/1.1 feature). It seems odd that HTTP keepalives (implied by the custom header) would be causing delays?

@elonazoulay
Member Author

By disabling HTTP keepalive (the custom header) the connection is closed after one request. That causes the next request to fail, and the code then creates a new HTTP client, which eliminates the latency but is unreliable. The keep-alive cache is causing locking, which results in delays (specifically for the Hudi use case with tons of tiny metadata files).

@elonazoulay
Member Author

So the main issue is that all threads which use the HTTP client are blocked except one, due to the keep-alive cache. The custom headers, etc. were just for testing, but they showed that the latency is due to the synchronized method which puts connections back in the keep-alive cache.
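The contention pattern in the dump can be reduced to a toy sketch (a hypothetical `Cache` class standing in for sun.net.www.http.KeepAliveCache, which is not reproduced here): because a single monitor guards every put, each thread returning a connection must queue behind whichever thread holds the lock.

```java
// Toy model of the KeepAliveCache contention: one synchronized monitor
// serializes all puts, so concurrent readers block each other on return.
public class KeepAliveContention {
    static class Cache {
        private final java.util.Map<String, Object> map = new java.util.HashMap<>();

        // All callers contend on this single monitor, matching the
        // "waiting to lock <...> (a sun.net.www.http.KeepAliveCache)"
        // frames in the thread dump above.
        synchronized void put(String key, Object conn) {
            map.put(key, conn);
        }

        synchronized int size() {
            return map.size();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        Cache cache = new Cache();
        Thread[] threads = new Thread[8];
        for (int i = 0; i < threads.length; i++) {
            final int id = i;
            threads[i] = new Thread(() -> cache.put("host-" + id, new Object()));
            threads[i].start();
        }
        for (Thread t : threads) {
            t.join();
        }
        System.out.println(cache.size()); // all 8 puts completed, one at a time
    }
}
```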

@elonazoulay
Member Author

Also, to be clear: since the behavior appears in older versions as well, it is not a regression with 2.2.7.

@alexjo2144
Member

Also, to be clear: since the behavior appears in older versions as well, it is not a regression with 2.2.7.

Maybe we split this conversation into a separate GH Issue then?

@elonazoulay
Member Author

@alexjo2144 absolutely, I will create an issue for it. Might not be something people encounter outside of this specific use case (tons of tiny files, firewall, cross cloud, etc.).

@findepi findepi requested review from electrum and removed request for electrum September 16, 2022 12:23
@kokosing kokosing self-requested a review September 16, 2022 12:47
pom.xml Outdated
@@ -60,7 +60,7 @@
<dep.oracle.version>19.3.0.0</dep.oracle.version>
<dep.drift.version>1.14</dep.drift.version>
<dep.tempto.version>189</dep.tempto.version>
<dep.gcs.version>2.0.0</dep.gcs.version>
<dep.gcs.version>2.2.7</dep.gcs.version>
Member

2.2.8 is already out

@@ -1062,6 +1062,7 @@
<groupId>com.google.cloud.bigdataoss</groupId>
<artifactId>gcs-connector</artifactId>
<version>hadoop2-${dep.gcs.version}</version>
<classifier>shaded</classifier>
Member

I think it is because we can have issues with other dependencies without shading.

@@ -1952,6 +1973,44 @@
<useNativeGit>true</useNativeGit>
</configuration>
</plugin>
<plugin>
<groupId>org.basepom.maven</groupId>
<artifactId>duplicate-finder-maven-plugin</artifactId>
Member

What would it take to fix these issues instead of ignoring them? Can you please list the issues you got without this?

Member Author

Here are the issues:

[WARNING] Found duplicate (but equal) resources in [com.google.cloud.bigdataoss:gcs-connector:hadoop2-2.2.8:jar:shaded, org.apache.httpcomponents:httpclient:4.5.13]:
[WARNING]   mozilla/public-suffix-list.txt
[WARNING] Found duplicate classes/resources in compile classpath.
[WARNING] Found duplicate (but equal) resources in [com.google.cloud.bigdataoss:gcs-connector:hadoop2-2.2.8:jar:shaded, org.apache.httpcomponents:httpclient:4.5.13]:
[WARNING]   mozilla/public-suffix-list.txt
[WARNING] Found duplicate classes/resources in runtime classpath.
[WARNING] Found duplicate (but equal) resources in [com.google.cloud.bigdataoss:gcs-connector:hadoop2-2.2.8:jar:shaded, org.apache.httpcomponents:httpclient:4.5.13]:
[WARNING]   mozilla/public-suffix-list.txt
[WARNING] Found duplicate classes/resources in test classpath.

Member Author (Sep 22, 2022)

Second warning is from the trino-faulttolerant-tests module:

[WARNING] Found duplicate and different resources in [com.google.api:gax:2.17.0, com.google.cloud.bigdataoss:gcs-connector:hadoop2-2.2.8:jar:shaded]:
[WARNING]   dependencies.properties
[WARNING] Found duplicate classes/resources in test classpath.
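For the record, the exceptions discussed here can be expressed with the plugin's ignore list (a sketch; the duplicate-finder-maven-plugin supports `ignoredResourcePatterns`, and the two resources are the ones named in the warnings above):

```xml
<plugin>
    <groupId>org.basepom.maven</groupId>
    <artifactId>duplicate-finder-maven-plugin</artifactId>
    <configuration>
        <ignoredResourcePatterns>
            <!-- equal copies shipped by httpclient and the shaded gcs-connector -->
            <ignoredResourcePattern>mozilla/public-suffix-list.txt</ignoredResourcePattern>
            <!-- differing copies in gax and the shaded gcs-connector -->
            <ignoredResourcePattern>dependencies.properties</ignoredResourcePattern>
        </ignoredResourcePatterns>
    </configuration>
</plugin>
```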

Member

Thank you for the update.

@kokosing
Member

@elonazoulay Are you working on this actively or would like me or somebody else to take it over?

pom.xml Outdated
@@ -1124,33 +1125,53 @@
<groupId>org.slf4j</groupId>
<artifactId>slf4j-log4j12</artifactId>
</exclusion>
Member

Can we use a wildcard exclusion here?
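Maven does support wildcard exclusions (since Maven 3.2.1), which would collapse the per-artifact list into a single entry, e.g.:

```xml
<exclusions>
    <!-- exclude all transitive dependencies of this artifact -->
    <exclusion>
        <groupId>*</groupId>
        <artifactId>*</artifactId>
    </exclusion>
</exclusions>
```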

Member Author

@kokosing either way works. I think the changes in 2.2.8 will help our use case, and we were able to deploy 2.2.7 (this PR) recently; it worked as well as or better than the previous version we were using (2.2.2). Either way I will be testing 2.2.8 in the next day or so - lmk what works best for you. If you want to use the other PR I can test that as well.

Member Author

Updated. I will remove the duplicate finder config and paste the results as well, and will update after we deploy it. And if you want to use the other PR I will test it also.

@elonazoulay elonazoulay changed the title Upgrade GCS client to 2.2.7 Upgrade GCS client to 2.2.8 Sep 21, 2022
@kokosing
Member

kokosing commented Sep 22, 2022

CI hit: #14248; retrying to verify if it is truly flaky. EDIT: it is truly flaky.

@kokosing kokosing merged commit 34984f1 into trinodb:master Sep 22, 2022
@kokosing
Member

Merged, thanks!

@github-actions github-actions bot added this to the 398 milestone Sep 22, 2022