Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sentry: stopper.go:179: runtime.errorString: runtime error: index out of range #31624

Closed
tbg opened this issue Oct 19, 2018 · 7 comments · Fixed by #35630
Closed

sentry: stopper.go:179: runtime.errorString: runtime error: index out of range #31624

tbg opened this issue Oct 19, 2018 · 7 comments · Fixed by #35630
Assignees
Labels
C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. O-sentry Originated from an in-the-wild panic report.

Comments

@tbg
Copy link
Member

tbg commented Oct 19, 2018

https://sentry.io/cockroach-labs/cockroachdb/issues/735970788/

*log.safeError: stopper.go:179: runtime.errorString: runtime error: index out of range
  File "runtime/panic.go", line 505, in gopanic
  File "runtime/panic.go", line 28, in panicindex
  File "github.com/cockroachdb/cockroach/vendor/github.com/andy-kimball/arenaskl/arena.go", line 92, in Prev
  File "github.com/cockroachdb/cockroach/vendor/github.com/andy-kimball/arenaskl/skl.go", line 241, in Prev
  File "github.com/cockroachdb/cockroach/vendor/github.com/andy-kimball/arenaskl/iterator.go", line 83, in Prev
  File "github.com/cockroachdb/cockroach/pkg/storage/tscache/interval_skl.go", line 1075, in prevInitNode
  File "github.com/cockroachdb/cockroach/pkg/storage/tscache/interval_skl.go", line 979, in incomingGapVal
  File "github.com/cockroachdb/cockroach/pkg/storage/tscache/interval_skl.go", line 575, in addNode
  File "github.com/cockroachdb/cockroach/pkg/storage/tscache/interval_skl.go", line 300, in addRange
  File "github.com/cockroachdb/cockroach/pkg/storage/tscache/interval_skl.go", line 259, in AddRange
  File "github.com/cockroachdb/cockroach/pkg/storage/tscache/interval_skl.go", line 202, in Add
  File "github.com/cockroachdb/cockroach/pkg/storage/tscache/skl_impl.go", line 84, in Add
  File "github.com/cockroachdb/cockroach/pkg/storage/replica.go", line 2067, in updateTimestampCache
  File "github.com/cockroachdb/cockroach/pkg/storage/replica.go", line 2015, in done
  File "github.com/cockroachdb/cockroach/pkg/storage/replica.go", line 2705, in func1
@tbg tbg added C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot. labels Oct 19, 2018
@tbg
Copy link
Member Author

tbg commented Oct 19, 2018

@nvanbenschoten this looks like a panic in the skiplist based timestamp cache 😬

@tbg
Copy link
Member Author

tbg commented Oct 19, 2018

note that there was a re-panic, so the message printed might not be the original one.

@petermattis
Copy link
Collaborator

@tschottdorf How did this get the C-test-failure label? Do you have some script which automatically creates these issues?

@petermattis petermattis added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. O-sentry Originated from an in-the-wild panic report. and removed C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot. labels Oct 19, 2018
@tbg
Copy link
Member Author

tbg commented Oct 19, 2018

I have my account hooked up to auto-relay issues from sentry via zapier. Let me fix that up.

@petermattis
Copy link
Collaborator

Also, s/O-robot/O-sentry/g.

@tbg
Copy link
Member Author

tbg commented Oct 19, 2018

done.

@tbg tbg closed this as completed Oct 19, 2018
@tbg tbg reopened this Oct 19, 2018
nvanbenschoten added a commit to nvanbenschoten/arenaskl that referenced this issue Mar 10, 2019
This change protects against large allocations causing Arena's
internal size accounting to overflow and produce incorrect results.
This overflow could allow for invalid offsets being returned from
`Arena.Alloc`, leading to slice bounds and index out of range panics
when passed to `Arena.GetBytes` or `Arena.GetPointer`.

Specifically, if `Arena.n` overflowed in `Arena.Alloc`, which was
possible because it was a uint32 and we allow up to MaxUint32 size
allocations, then `Arena.Alloc` could bypass the length check against
`Arena.buf`. The "padded" size would then be subtracted from the offset,
allowing the offset to underflow back around to an offset that was out
of `Arena.buf`'s bounds.

I believe this is the underlying cause of the following two crashes:
- cockroachdb/cockroach#31624
- cockroachdb/cockroach#35557

The reason for this is subtle and has to do with failed allocations
slowly building up and eventually overflowing `Arena.n`. I'll explain
on the PR that vendors this fix.

We now protect against this overflow in two ways. We perform an initial
size check before performing the atomic addition in `Arena.Alloc` to
prevent failed allocations from "building up" to an overflow. We also
use 64-bit arithmetic so that it would take 2^32 concurrent allocations
with the maximum allocation size (math.MaxUint32) to overflow the
accounting. An alternative would be to use a CAS loop to strictly
serialize all additions to `Arena.n`, but that would limit concurrency.
@nvanbenschoten
Copy link
Member

I have a suspected fix for this: andy-kimball/arenaskl#4.

nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Mar 12, 2019
Fixes cockroachdb#31624.
Fixes cockroachdb#35557.

This commit picks up andy-kimball/arenaskl#4.

I strongly suspect that the uint32 overflow fixed in that PR was the
cause of the two index out of bounds panics. See that commit for more
details.

Release note: None
craig bot pushed a commit that referenced this issue Mar 12, 2019
35321: opt: propagate set operation output types to input columns r=rytaft a=rytaft

This commit updates the `optbuilder` logic for set operations in which
the types of the input columns do not match the types of the output
columns. This can happen if a column on one side has type Unknown,
but the corresponding column on the other side has a known type such
as Int. The known type must be propagated to the side with the unknown
type to prevent errors in the execution engine related to decoding
types.

If there are any column types on either side that don't match the output,
then the `optbuilder` propagates the output types of the set operation down
to the input columns by wrapping the side with mismatched types in a
Project operation. The Project operation passes through columns that
already have the correct type, and creates cast expressions for those
that don't.

Fixes #34524

Release note (bug fix): Fixed an error that happened when executing
some set operations containing only nulls in one of the input columns.

35587: opt: add more cost for lookup joins with more ON conditions r=RaduBerinde a=RaduBerinde

This is a very limited fix for #34810.

The core problem is that we don't take into account that if we have an
ON condition, not only there's a cost to evaluate it on each row, but
we are generating more internal rows to get a given number of output
rows.

I attempted to do a more general fix (for all join types), where I
tried to estimate the "internal" number of rows using
`unknownFilterSelectivity` for each ON condition. There were two
problems:
 - in some cases (especially with lookup joins) we have an extra ON
   condition that doesn't actually do anything:
     `ab JOIN xy ON a=x AND a=10` becomes
     `ab JOIN xy ON a=x AND a=10 AND x=10` becomes
   and `a=10` could remain as an ON condition. This results in bad
   query plans in important cases (e.g. TPCC) where it prefers to do
   an extra lookup join (due to a non-covering index) just because of
   that condition.

 - we don't have the equality columns readily available for hash join
   (and didn't want to extract them each time we cost). In the future
   we may split the planning into a logical and physical stage, and we
   should then separate the logical joins from hash join.

For 19.1, we simply simply add a cost for lookup joins that is
proportional to the number of remaining ON conditions. This is the
least disruptive method that still fixes the case observed in #34810.
I will leave the issue open to address this properly in the next
release.

Note that although hash joins and merge joins have the same issue in
principle, in practice we always generate these expressions with
equality on all possible columns.

Release note: None

35630: storage/tscache: Pick up andy-kimball/arenaskl fix r=nvanbenschoten a=nvanbenschoten

Fixes #31624.
Fixes #35557.

This commit picks up andy-kimball/arenaskl#4.

I strongly suspect that the uint32 overflow fixed in that PR was the
cause of the two index out of bounds panics. See that commit for more
details.

The PR also fixes a bug in memory recylcling within the tscache. I confirmed
on adriatic that over 900 64MB arenas had been allocated since it was last
wiped.

35644: opt: use correct ordering for insert input in execbuilder r=RaduBerinde a=RaduBerinde

We were setting up a projection on the Insert's input but we were
accidentally using the parent Insert's ordering instead of that of the
input.

Fixes #35564.

Release note (bug fix): Fixed a "column not in input" crash when
`INSERT ... RETURNING` is used inside a clause that requires an
ordering.

35651: jobs, sql, ui: Create `AutoCreateStats` job type r=celiala a=celiala

With #34279, enabling the cluster setting
`sql.stats.experimental_automatic_collection.enabled` has the potential
to create many CreateStats jobs, which can cause the Jobs view on the
AdminUI to become cluttered.

This commit creates a new `AutoCreateStats` job type for these auto-created
CreateStats jobs, so that users are able to still see their own manual runs
of CREATE STATISTICS, via the pre-existing `CreateStats` type.

cc @danhhz, @piyush-singh, @rolandcrosby 

![jobs-auto-create-stats](https://user-images.githubusercontent.com/3051672/54212467-5cea2c80-44b9-11e9-9c11-db749814f019.gif)

Release note (admin ui change): AutoCreateStats type added to
Jobs page to filter automatic statistics jobs.

Fixes #34377.

Co-authored-by: Rebecca Taft <[email protected]>
Co-authored-by: Radu Berinde <[email protected]>
Co-authored-by: Nathan VanBenschoten <[email protected]>
Co-authored-by: Celia La <[email protected]>
@craig craig bot closed this as completed in #35630 Mar 12, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. O-sentry Originated from an in-the-wild panic report.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants