Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

storage: stream error: stream ID xxx; INTERNAL_ERROR with increased load on calling Attr to check object existence #3735

Closed
ivymilkyway opened this issue Feb 24, 2021 · 4 comments
Assignees
Labels
api: storage Issues related to the Cloud Storage API. type: question Request for information or clarification. Not an issue.

Comments

@ivymilkyway
Copy link

Our application is streaming tons of data from GCS files.
With increased workload, we’ve seen increased stream error like
Get "https://storage.googleapis.com/storage/v1/b/bagomatic-prod/o/bag%2Fstandard_v1%2F5G21A6P05L4100129%3A1565945553%3A1565945573.bag?alt=json&prettyPrint=false&projection=full": stream error: stream ID 3787; INTERNAL_ERROR

from calling the Attrs method to check the existence of an object.

func (o *ObjectHandle) Attrs(ctx context.Context) (attrs *ObjectAttrs, err error) {
https://github.com/googleapis/google-cloud-go/blob/master/storage/storage.go#L803

I've seen previous discussion regarding the stream error: stream IDxxx; INTERNAL_ERROR. seems adding retries on this error solves issue for reads. However for Attrs call, it doesn't seem to retry on those errors.

We have added simple exponential backoff retries o this method, seems to reduce the error rate by some extent, however still seeing them with increased workload.

Our application first check the existence of the object, and then start range reader to stream from the file. Interestingly we didn't see lots of errors from the range reader, most of the errors from the Attr call, which is low volume comparing to the range reader since it's called only once per request.

@ivymilkyway ivymilkyway changed the title stream error: stream ID xxx; INTERNAL_ERROR with increased load on calling Attr to check object existence [storage] stream error: stream ID xxx; INTERNAL_ERROR with increased load on calling Attr to check object existence Feb 24, 2021
@codyoss codyoss changed the title [storage] stream error: stream ID xxx; INTERNAL_ERROR with increased load on calling Attr to check object existence storage: stream error: stream ID xxx; INTERNAL_ERROR with increased load on calling Attr to check object existence Feb 25, 2021
@codyoss codyoss added api: storage Issues related to the Cloud Storage API. type: question Request for information or clarification. Not an issue. labels Feb 25, 2021
@andrewsg
Copy link

Thanks for your report. I'll check with our team on whether retries can support this error on actions other than reads, and look into why you might be experiencing such a high error rate on this call.

@I-Cat
Copy link
Contributor

I-Cat commented Apr 17, 2021

Thanks for your report. I'll check with our team on whether retries can support this error on actions other than reads, >> and look into why you might be experiencing such a high error rate on this call.

@andrewsg

The code comes back as expecting prog.go:1:1:

{"error":{"code":401,"message":"Anonymous caller does not have storage.objects.get access to the Google Cloud Storage object.","errors":[{"message":"Anonymous caller does not have storage.objects.get access to the Google Cloud Storage object.","domain":"global","reason":"required","locationType":"header","location":"Authorization"}]}}

Not Found
== No file

@frankyn
Copy link
Member

frankyn commented Jan 7, 2022

@tritone and @BrennaEpp is this issue addressed by the retry work y'all are doing? (I believe so but want to double check).

@tritone
Copy link
Contributor

tritone commented Jan 7, 2022

We have not added this particular error to the default ShouldRetry method. However, we now provide a method to specify a custom error predicate for retries: https://pkg.go.dev/cloud.google.com/go/storage@main#WithErrorFunc (note that this feature is merged but not yet released).

You can use ObjectHandle.Retryer to customize the retry behavior for ObjectHandle.Attrs, including using a custom ErrorFunc which will mark this error as retryable.

I'm going to close this issue for now but feel free to re-open if you have further concerns.

@tritone tritone closed this as completed Jan 7, 2022
rhu713 pushed a commit to rhu713/cockroach that referenced this issue Jul 29, 2022
…AL_ERROR

Currently, errors like
`stream error: stream ID <x>; INTERNAL_ERROR; received from peer`
are not being retried. Create a custom retryer to retry these errors as
suggested by:

googleapis/google-cloud-go#3735
googleapis/google-cloud-go#784

Fixes: cockroachdb#84162

Release note: None
rhu713 pushed a commit to rhu713/cockroach that referenced this issue Jul 29, 2022
…AL_ERROR

Currently, errors like
`stream error: stream ID <x>; INTERNAL_ERROR; received from peer`
are not being retried. Create a custom retryer to retry these errors as
suggested by:

googleapis/google-cloud-go#3735
googleapis/google-cloud-go#784

Fixes: cockroachdb#85217, cockroachdb#85216, cockroachdb#85204, cockroachdb#84162

Release note: None
craig bot pushed a commit to cockroachdb/cockroach that referenced this issue Jul 29, 2022
…85329

84975: storage: add `MVCCRangeKeyStack` for range keys r=nicktrav,jbowens a=erikgrinaker

**storage: add `MVCCRangeKeyStack` for range keys**

This patch adds `MVCCRangeKeyStack` and `MVCCRangeKeyVersion`, a new
range key representation that will be returned by `SimpleMVCCIterator`.
It is more compact, for efficiency, and comes with a set of convenience
methods to simplify common range key processing.

Resolves #83895.

Release note: None
  
**storage: return `MVCCRangeKeyStack` from `SimpleMVCCIterator`**

This patch changes `SimpleMVCCIterator.RangeKeys()` to return
`MVCCRangeKeyStack` instead of `[]MVCCRangeKeyValue`. Callers have not
been migrated to properly make use of this -- instead, they call
`AsRangeKeyValues()` and construct and use the old data structure.

The MVCC range tombstones tech note is also updated to reflect this.

Release note: None
  
**storage: migrate MVCC code to `MVCCRangeKeyStack`**

Release note: None
  
***: migrate higher-level code to `MVCCRangeKeyStack`**

Release note: None
  
**kvserver/gc: partially migrate to `MVCCRangeKeyStack`**

Some parts require invasive changes to MVCC stats helpers. These will
shortly be consolidated with other MVCC stats logic elsewhere, so the
existing logic is retained for now by using `AsRangeKeyValues()`.

Release note: None
  
**storage: remove `FirstRangeKeyAbove()` and `HasRangeKeyBetween()`**

Release note: None

85017: Revert "sql: Add database ID to sampled query log" r=THardy98 a=THardy98

Reverts: #84195
This reverts commit 307817e.

Removes the DatabaseID field from the
`SampledQuery` telemetry log due to the potential of indefinite blocking
in the case of a lease acquisition failure. Protobuf field not reserved as 
no official build was released with these changes yet.

Release note (sql change): Removes the DatabaseID field from the
`SampledQuery` telemetry log due to the potential of indefinite blocking
in the case of a lease acquisition failure.

85024: cloud/gcp: add custom retryer for gcs storage, retry on stream INTERNAL_ERROR r=rhu713 a=rhu713

Currently, errors like
`stream error: stream ID <x>; INTERNAL_ERROR; received from peer`
are not being retried. Create a custom retryer to retry these errors as
suggested by:

googleapis/google-cloud-go#3735
googleapis/google-cloud-go#784

Fixes: #85217, #85216, #85204, #84162

Release note: None


85069: optbuilder: handle unnest returning a tuple r=DrewKimball a=DrewKimball

Currently, the return types of SRFs that return multiple columns are
represented as tuples with labels. The tuple labels are used to decide
whether or not to create a single output column for the SRF, or multiple.
The `unnest` function can return a single column if it has a single argument,
and the type of that column can be a tuple with labels. This could cause the
old logic to mistakenly create multiple output columns for `unnest`, which
could lead to panics down the line and incorrect behavior otherwise.

This commit adds a special case for `unnest` in the `optbuilder` to only expand
tuple return types if there is more than one argument (implying more than one
output column). Other SRFs do not have the same problem because they either
always return the same number of columns, cannot return tuples, or both.

Fixes #58438

Release note (bug fix): Fixed a bug existing since release 20.1 that could
cause a panic in rare cases when the unnest function was used with a
tuple return type.

85100: opt: perf improvements for large queries r=DrewKimball a=DrewKimball

**opt: add bench test for slow queries**

This commit adds two slow-planning queries pulled from #64793 to be used
in benchmarking the optimizer. In addition, the `ReorderJoinsLimit` has been
set to the default 8 for benchmarking tests.

**opt: add struct for tracking column equivalence sets**

Previously, the `JoinOrderBuilder` would construct a `FuncDepSet` from
scratch on each call to `addJoins` in order to eliminate redundant join
filters. This led to unnecessary large allocations because `addJoins` is
called an exponential number of times in query size.

This commit adds a struct `EquivSet` that efficiently stores equivalence
relations as `ColSets` in a slice. Rather than being constructed on each
call to `addJoins`, a `Reset` method is called that maintains slice memory.

In the future, `EquivSet` can be used to handle equivalencies within `FuncDepSet`
structs as well. This well avoid a significant number of allocations in cases with
many equivalent columns, as outlined in #83963.

**opt: avoid usage of FastIntMap in optimizer hot paths**

Previously, `computeHashJoinCost` would use a `FastIntMap` to represent join
equality filters to pass to `computeFiltersCost`. In addition,
`GenerateMergeJoins` used a `FastIntMap` to look up columns among its join
equality columns. This lead to unnecessary allocations since column IDs are
often large enough to exceed the small field of `FastIntMap`.

This commit modifies `computeFiltersCost` to take an anonymous function
that is used to decide whether to skip an equality condition, removing the
need for a mapping between columns.

This commit also refactors `GenerateMergeJoins` to simply perform a linear
scan of its equality columns; this avoids the allocation issue, and should be
fast in practice because the number of equalities will not generally be large.

Release note: None

85146: [backupccl] Use Expr for backup's Detached and Revision History options r=benbardin a=benbardin

This will allow us to set them to null, which will be helpful for ALTER commands.

Release note: None

85234: dev: add rewritable paths for norm tests r=mgartner a=mgartner

Tests in `pkg/sql/opt/norm` are similar to tests in `pkg/sql/opt/xform`
and `pkg/sql/opt/memo` in that they rely on fixtures in
`pkg/sql/opt/testutils/opttester/testfixtures`. This commit adds these
fixtures as rewritable paths for norm tests so that
`./dev test pkg/sql/opt/xform --rewrite` does not fail with errors like:

    open pkg/sql/opt/testutils/opttester/testfixtures/tpcc_schema: operation not permitted

Release note: None

85325: sql: fix explain gist output to show number of scan span constraints r=cucaroach a=cucaroach

If there were span constraints we would always print 1, need to actually
append them to get the count right.

Fixes: #85324

Release note: None


85327: sql: fix udf logic test r=chengxiong-ruan a=chengxiong-ruan

Fixes: #85303

Release note: None

85329: colexec: fix recent concat fix r=yuzefovich a=yuzefovich

The recent fix of the Concat operator in the vectorized engine doesn't
handle the array concatenation correctly and this is now fixed.

Fixes: #85295.

Release note: None

Co-authored-by: Erik Grinaker <[email protected]>
Co-authored-by: Thomas Hardy <[email protected]>
Co-authored-by: Rui Hu <[email protected]>
Co-authored-by: DrewKimball <[email protected]>
Co-authored-by: Andrew Kimball <[email protected]>
Co-authored-by: Ben Bardin <[email protected]>
Co-authored-by: Marcus Gartner <[email protected]>
Co-authored-by: Tommy Reilly <[email protected]>
Co-authored-by: Chengxiong Ruan <[email protected]>
Co-authored-by: Yahor Yuzefovich <[email protected]>
rhu713 pushed a commit to rhu713/cockroach that referenced this issue Aug 3, 2022
Currently, errors like
`stream error: stream ID <x>; INTERNAL_ERROR; received from peer`
are not being retried. Retry these errors assuggested by:

googleapis/google-cloud-go#3735
googleapis/google-cloud-go#784

Fixes: cockroachdb#85217, cockroachdb#85216, cockroachdb#85204, cockroachdb#84162

Release note: None
rhu713 pushed a commit to rhu713/cockroach that referenced this issue Aug 3, 2022
Currently, errors like
`stream error: stream ID <x>; INTERNAL_ERROR; received from peer`
are not being retried. Retry these errors assuggested by:

googleapis/google-cloud-go#3735
googleapis/google-cloud-go#784

Fixes: cockroachdb#85217, cockroachdb#85216, cockroachdb#85204, cockroachdb#84162

Release note: None
adityamaru added a commit to adityamaru/cockroach that referenced this issue Aug 8, 2022
…o v1.21.0

This commit bumps the `cloud.google.com/go/storage` vendor to
include the ability to inject custom retry functions when reading
and writing from the underlying SDK -
https://pkg.go.dev/cloud.google.com/go/storage#Client.SetRetry.

The motivation for this change is to combat the high rate of
restores we are seeing fail due to an internal http2 stream error
that is being surfaced by the SDK in our roachtests. As seen in
cockroachdb#85024 we would like
to wrap the default retry logic with our custom retry handling for this
particular error. This is the recommended solution as per:
googleapis/google-cloud-go#3735
googleapis/google-cloud-go#784

Note, the dependencies have been bumped to the version that we have
been running on master since the 22.1 branch was cut.

Release note (general change): bump cloud.google.com/go/storage from
v18.2.0 to v1.21.0 to allow for injection of custom retry logic in the
SDK
adityamaru added a commit to adityamaru/cockroach that referenced this issue Aug 8, 2022
…o v1.21.0

This commit bumps the `cloud.google.com/go/storage` vendor to
include the ability to inject custom retry functions when reading
and writing from the underlying SDK -
https://pkg.go.dev/cloud.google.com/go/storage#Client.SetRetry.

The motivation for this change is to combat the high rate of
restores we are seeing fail due to an internal http2 stream error
that is being surfaced by the SDK in our roachtests. As seen in cockroachdb#85024
we would like to wrap the default retry logic with our custom retry
handling for this particular error. This is the recommended solution as per:
googleapis/google-cloud-go#3735
googleapis/google-cloud-go#784

Note, the dependencies have been bumped to the version that we have
been running on master since the 22.1 branch was cut.

Release note (general change): bump cloud.google.com/go/storage from
v18.2.0 to v1.21.0 to allow for injection of custom retry logic in the
SDK
rhu713 pushed a commit to rhu713/cockroach that referenced this issue Aug 9, 2022
…AL_ERROR

Currently, errors like
`stream error: stream ID <x>; INTERNAL_ERROR; received from peer`
are not being retried. Create a custom retryer to retry these errors as
suggested by:

googleapis/google-cloud-go#3735
googleapis/google-cloud-go#784

Fixes: cockroachdb#85217, cockroachdb#85216, cockroachdb#85204, cockroachdb#84162

Release note: None

Release justification: add retries for temporary errors that were causing
roachtests to fail.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api: storage Issues related to the Cloud Storage API. type: question Request for information or clarification. Not an issue.
Projects
None yet
Development

No branches or pull requests

6 participants