Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

storage: sideloading bypasses RocksDB env for some operations #31913

Closed
tbg opened this issue Oct 26, 2018 · 5 comments
Closed

storage: sideloading bypasses RocksDB env for some operations #31913

tbg opened this issue Oct 26, 2018 · 5 comments
Assignees
Labels
A-storage Relating to our storage engine (Pebble) on-disk storage. C-cleanup Tech debt, refactors, loose ends, etc. Solution not expected to significantly change behavior.
Milestone

Comments

@tbg
Copy link
Member

tbg commented Oct 26, 2018

I noticed in #31732 (comment) that the sideloaded storage does not perform the TruncateTo operation through the RocksDB env:

return err
}
var deleted int
for _, match := range matches {
base := filepath.Base(match)
if len(base) < 1 || base[0] != 'i' {
continue
}
base = base[1:]
upToDot := strings.SplitN(base, ".", 2)
i, err := strconv.ParseUint(upToDot[0], 10, 64)
if err != nil {
return errors.Wrapf(err, "while parsing %q during TruncateTo", match)
}
if i >= index {

This may or may not be a problem, but it's definitely suspicious. Note how Clear just above the snippet does go through extra trouble to use the env.

Perhaps this all works as intended, depending on how the RocksDB env is implemented. It does seem to work when encryption is on, probably because the env really just does the same thing writing directly to the FS would do. With encryption, is that still true, i.e. are files still passed through to the filesystem into the same location (mod scrambling the contents)? I suspect it might be, but then why does Clear go the extra mile through the env?

This really seems like something needs to be investigated and cleaned up.

Note that there's one case where things are definitely broken, namely the one in which you use an in-memory RocksDB instance. In that case, TruncateTo will snoop around on disk, but the env will keep everything in memory (i.e. TruncateTo becomes a no-op and garbage will pile up in-mem).

@tbg tbg added the A-storage Relating to our storage engine (Pebble) on-disk storage. label Oct 26, 2018
@tbg tbg added this to the 2.2 milestone Oct 26, 2018
tbg added a commit to tbg/cockroach that referenced this issue Oct 26, 2018
We were accounting for sideloaded payloads (SSTs) when adding them to
the log, but were omitting them during truncations. As a result, the
tracked raft log size would permanently skyrocket which in turn would
lead to extremely aggressive truncations and resulted in pathological
amounts of Raft snapshots.

I'm still concerned about this logic as we're now relying on numbers
obtained from the file system to match exactly a prior in-mem
computation, and there may be other bugs that cause a divergence. But
this fixes the blatant obvious one, so it's a step in the right
direction.

The added tests highlight a likely omission in the sideloaded storage
code which doesn't access the file system through the RocksDB env as it
seems like it should, filed as cockroachdb#31913.

At this point it's unclear whether it fixes the below issues, but at the
very least it seems to address a major problem they encountered:

Touches cockroachdb#31732.
Touches cockroachdb#31740.
Touches cockroachdb#30261.
Touches cockroachdb#31768.
Touches cockroachdb#31745.

Release note (bug fix): avoid a performance degradation related to
overly aggressive Raft log truncations that could occur during RESTORE
or IMPORT operations.
tbg added a commit to tbg/cockroach that referenced this issue Oct 26, 2018
We were accounting for sideloaded payloads (SSTs) when adding them to
the log, but were omitting them during truncations. As a result, the
tracked raft log size would permanently skyrocket which in turn would
lead to extremely aggressive truncations and resulted in pathological
amounts of Raft snapshots.

I'm still concerned about this logic as we're now relying on numbers
obtained from the file system to match exactly a prior in-mem
computation, and there may be other bugs that cause a divergence. But
this fixes the blatant obvious one, so it's a step in the right
direction.

The added tests highlight a likely omission in the sideloaded storage
code which doesn't access the file system through the RocksDB env as it
seems like it should, filed as cockroachdb#31913.

At this point it's unclear whether it fixes the below issues, but at the
very least it seems to address a major problem they encountered:

Touches cockroachdb#31732.
Touches cockroachdb#31740.
Touches cockroachdb#30261.
Touches cockroachdb#31768.
Touches cockroachdb#31745.

Release note (bug fix): avoid a performance degradation related to
overly aggressive Raft log truncations that could occur during RESTORE
or IMPORT operations.
craig bot pushed a commit that referenced this issue Oct 26, 2018
31914: storage: fix Raft log size accounting r=petermattis a=tschottdorf

We were accounting for sideloaded payloads (SSTs) when adding them to
the log, but were omitting them during truncations. As a result, the
tracked raft log size would permanently skyrocket which in turn would
lead to extremely aggressive truncations and resulted in pathological
amounts of Raft snapshots.

I'm still concerned about this logic as we're now relying on numbers
obtained from the file system to match exactly a prior in-mem
computation, and there may be other bugs that cause a divergence. But
this fixes the blatant obvious one, so it's a step in the right
direction.

The added tests highlight a likely omission in the sideloaded storage
code which doesn't access the file system through the RocksDB env as it
seems like it should, filed as #31913.

At this point it's unclear whether it fixes the below issues, but at the
very least it seems to address a major problem they encountered:

Touches #31732.
Touches #31740.
Touches #30261.
Touches #31768.
Touches #31745.

Release note (bug fix): avoid a performance degradation related to
overly aggressive Raft log truncations that could occur during RESTORE
or IMPORT operations.

Co-authored-by: Tobias Schottdorf <[email protected]>
@tbg tbg added the S-3-erroneous-edge-case Database produces or stores erroneous data without visible error/warning, in rare edge cases. label Oct 27, 2018
@petermattis petermattis added the C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. label Oct 28, 2018
tbg added a commit to tbg/cockroach that referenced this issue Nov 16, 2018
We were accounting for sideloaded payloads (SSTs) when adding them to
the log, but were omitting them during truncations. As a result, the
tracked raft log size would permanently skyrocket which in turn would
lead to extremely aggressive truncations and resulted in pathological
amounts of Raft snapshots.

I'm still concerned about this logic as we're now relying on numbers
obtained from the file system to match exactly a prior in-mem
computation, and there may be other bugs that cause a divergence. But
this fixes the blatant obvious one, so it's a step in the right
direction.

The added tests highlight a likely omission in the sideloaded storage
code which doesn't access the file system through the RocksDB env as it
seems like it should, filed as cockroachdb#31913.

At this point it's unclear whether it fixes the below issues, but at the
very least it seems to address a major problem they encountered:

Touches cockroachdb#31732.
Touches cockroachdb#31740.
Touches cockroachdb#30261.
Touches cockroachdb#31768.
Touches cockroachdb#31745.

Release note (bug fix): avoid a performance degradation related to
overly aggressive Raft log truncations that could occur during RESTORE
or IMPORT operations.
tbg added a commit to tbg/cockroach that referenced this issue Nov 16, 2018
We were accounting for sideloaded payloads (SSTs) when adding them to
the log, but were omitting them during truncations. As a result, the
tracked raft log size would permanently skyrocket which in turn would
lead to extremely aggressive truncations and resulted in pathological
amounts of Raft snapshots.

I'm still concerned about this logic as we're now relying on numbers
obtained from the file system to match exactly a prior in-mem
computation, and there may be other bugs that cause a divergence. But
this fixes the blatant obvious one, so it's a step in the right
direction.

The added tests highlight a likely omission in the sideloaded storage
code which doesn't access the file system through the RocksDB env as it
seems like it should, filed as cockroachdb#31913.

At this point it's unclear whether it fixes the below issues, but at the
very least it seems to address a major problem they encountered:

Touches cockroachdb#31732.
Touches cockroachdb#31740.
Touches cockroachdb#30261.
Touches cockroachdb#31768.
Touches cockroachdb#31745.

Release note (bug fix): avoid a performance degradation related to
overly aggressive Raft log truncations that could occur during RESTORE
or IMPORT operations.
@petermattis
Copy link
Collaborator

@mberhault Also related to encryption at rest. It is believed that this leaks non-encrypted data to disk.

@tbg tbg mentioned this issue Feb 14, 2019
29 tasks
@tbg tbg added C-cleanup Tech debt, refactors, loose ends, etc. Solution not expected to significantly change behavior. and removed C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. S-3-erroneous-edge-case Database produces or stores erroneous data without visible error/warning, in rare edge cases. labels Feb 14, 2019
@tbg
Copy link
Member Author

tbg commented Feb 14, 2019

It is believed that this leaks non-encrypted data to disk.

Hmm, not sure about that. The files are written through the env:

f, err := eng.OpenFile(filename)
if err != nil {
if strings.Contains(err.Error(), "No such file or directory") {
return os.ErrNotExist
}
return err
}
for i := int64(0); i < int64(len(data)); i += chunkSize {
end := i + chunkSize
if l := int64(len(data)); end > l {
end = l
}
chunk := data[i:end]
// rate limit
limitBulkIOWrite(ctx, limiter, len(chunk))
err = f.Append(chunk)
if err == nil && sync {
err = f.Sync()
}
if err != nil {
break
}

@mberhault also tells me in #19783 that the encrypted files have the same size as the unencrypted version, so in my eyes this becomes a cleanup rather than a technical necessity.

@mberhault mberhault removed their assignment Dec 18, 2019
jbowens added a commit to jbowens/cockroach that referenced this issue May 29, 2020
Update filesystem access to always go through the storage engine's
filesystem interface, which ensures correctness for in-mem and encrypted
filesystems

Also, add a Stat function to the storage/fs.FS interface. The RocksDB
implementation is still a hack, because the RocksDB Env doesn't expose
sufficient information for implementing. For on-disk RocksDB engines,
this implementation circumvents the Env, performing a direct os.Stat
of the filesystem. For in-memory RocksDB engines, it provides a mocked
os.FileInfo implementation.

Fixes cockroachdb#42034.
Related to cockroachdb#31913.

Release note: None
jbowens added a commit to jbowens/cockroach that referenced this issue May 29, 2020
Update filesystem access to always go through the storage engine's
filesystem interface, which ensures correctness for in-mem and encrypted
filesystems

Also, add a Stat function to the storage/fs.FS interface. The RocksDB
implementation is still a hack, because the RocksDB Env doesn't expose
sufficient information for implementing. For on-disk RocksDB engines,
this implementation circumvents the Env, performing a direct os.Stat
of the filesystem. For in-memory RocksDB engines, it provides a mocked
os.FileInfo implementation that panics when any of its methods are
invoked. This is sufficient for existing call sites that use Stat to
check for existence.

Fixes cockroachdb#42034.
Related to cockroachdb#31913.

Release note: None
jbowens added a commit to jbowens/cockroach that referenced this issue Jun 4, 2020
Update filesystem access to always go through the storage engine's
filesystem interface, which ensures correctness for in-mem and encrypted
filesystems

Also, add a Stat function to the `storage/fs.FS` interface. The RocksDB
implementation is still a hack, because the RocksDB Env doesn't expose
sufficient information for implementing. For on-disk RocksDB engines,
this implementation circumvents the Env, performing a direct os.Stat of
the filesystem. For in-memory RocksDB engines, it provides a mocked
`os.FileInfo` implementation that panics when any of its methods are
invoked. This is sufficient for existing call sites that use Stat to
check for existence.

Fixes cockroachdb#42034.

Related to cockroachdb#31913.

Release note: None
craig bot pushed a commit that referenced this issue Jun 8, 2020
49717: kvserver: use engine's filesystem r=jbowens a=jbowens

Update filesystem access to always go through the storage engine's
filesystem interface, which ensures correctness for in-mem and encrypted
filesystems

Also, add a Stat function to the storage/fs.FS interface. The RocksDB
implementation is still a hack, because the RocksDB Env doesn't expose
sufficient information for implementing. For on-disk RocksDB engines,
this implementation circumvents the Env, performing a direct os.Stat
of the filesystem. For in-memory RocksDB engines, it provides a mocked
os.FileInfo implementation.

Fixes #42034.
Related to #31913.

Release note: None

Co-authored-by: Jackson Owens <[email protected]>
@tbg tbg removed their assignment Sep 22, 2020
@tbg
Copy link
Member Author

tbg commented Sep 22, 2020

@petermattis could storage take a look at this? Concretely, whether we're leaking unencrypted data to disk and if not, if we feel good about the status quo.

@petermattis
Copy link
Collaborator

I think @jbowens possibly already fixed this in 20.2. @jbowens can you see if there is anything left to do here?

@jbowens
Copy link
Collaborator

jbowens commented Sep 28, 2020

@jbowens can you see if there is anything left to do here?

I think we're all good. Closing.

@jbowens jbowens closed this as completed Sep 28, 2020
tbg added a commit to tbg/cockroach that referenced this issue Jan 19, 2021
While investigating cockroachdb#59126, I realized we were still going through the
engine for listing the files. I rectified this and as a result,
completely removed the in-mem implementation of the sideloaded storage
that we had in place: instead of it, we simply use an in-mem engine,
which now works like a charm thanks to not bypassing the engine for
listing the file contents (`filepath.Glob` --> `eng.List`).

While I was here, I added some defense in depth: it was unreasonable
that `TruncateTo` was returning an error as a result of failing to
perform best-effort cleanup on its directory.

I am no longer seeing the spurious errors observed in cockroachdb#59126 with
this change. This makes sense, as in this test the sideloaded storage
is set up with an in-memory engine. The old code, which read from
actual disk, would try to glob `auxiliary/...` (i.e. some path relative
to the cwd - oops), which would return no results. As a result, the
sideloaded storage would try to invoke `eng.Remove("auxiliary/...")`
which the engine would refuse, since there would actually be (in-mem)
files stored in the engine. This all resolves now that everyone is
looking at the same files.

It's possible that this fixes the issue with BenchmarkUserFile, as
previously errors from TruncateTo could block the raft log queue,
which in turn could at least slow down the test, but it's hard to
say for sure.

Touches cockroachdb#59126.
Touches cockroachdb#31913.

Release note: None
tbg added a commit to tbg/cockroach that referenced this issue Jan 20, 2021
While investigating cockroachdb#59126, I realized we were still going through the
engine for listing the files. I rectified this and as a result,
completely removed the in-mem implementation of the sideloaded storage
that we had in place: instead of it, we simply use an in-mem engine,
which now works like a charm thanks to not bypassing the engine for
listing the file contents (`filepath.Glob` --> `eng.List`).

While I was here, I added some defense in depth: it was unreasonable
that `TruncateTo` was returning an error as a result of failing to
perform best-effort cleanup on its directory.

I am no longer seeing the spurious errors observed in cockroachdb#59126 with
this change. This makes sense, as in this test the sideloaded storage
is set up with an in-memory engine. The old code, which read from
actual disk, would try to glob `auxiliary/...` (i.e. some path relative
to the cwd - oops), which would return no results. As a result, the
sideloaded storage would try to invoke `eng.Remove("auxiliary/...")`
which the engine would refuse, since there would actually be (in-mem)
files stored in the engine. This all resolves now that everyone is
looking at the same files.

It's possible that this fixes the issue with BenchmarkUserFile, as
previously errors from TruncateTo could block the raft log queue,
which in turn could at least slow down the test, but it's hard to
say for sure.

Touches cockroachdb#59126.
Touches cockroachdb#31913.

Release note: None
craig bot pushed a commit that referenced this issue Jan 21, 2021
59103: tracing: link child into parent, even if not verbose r=irfansharif a=tbg

Prior to this change, when a child was derived from a local parent, we
would not register the child with the parent. In effect, this meant that
any payloads in the child would not be collected.

Release note: None


59134: kvserver: avoid engine bypass in sideload storage r=jbowens a=tbg

While investigating #59126, I realized we were still going through the
engine for listing the files. I rectified this and as a result,
completely removed the in-mem implementation of the sideloaded storage
that we had in place: instead of it, we simply use an in-mem engine,
which now works like a charm thanks to not bypassing the engine for
listing the file contents (`filepath.Glob` --> `eng.List`).

While I was here, I added some defense in depth: it was unreasonable
that `TruncateTo` was returning an error as a result of failing to
perform best-effort cleanup on its directory.

I am no longer seeing the spurious errors observed in #59126 with
this change.

Touches #59126.
Touches #31913.
Fixes #58948.

Release note: None


Co-authored-by: Tobias Grieger <[email protected]>
pbardea pushed a commit to pbardea/cockroach that referenced this issue Jan 21, 2021
While investigating cockroachdb#59126, I realized we were still going through the
engine for listing the files. I rectified this and as a result,
completely removed the in-mem implementation of the sideloaded storage
that we had in place: instead of it, we simply use an in-mem engine,
which now works like a charm thanks to not bypassing the engine for
listing the file contents (`filepath.Glob` --> `eng.List`).

While I was here, I added some defense in depth: it was unreasonable
that `TruncateTo` was returning an error as a result of failing to
perform best-effort cleanup on its directory.

I am no longer seeing the spurious errors observed in cockroachdb#59126 with
this change. This makes sense, as in this test the sideloaded storage
is set up with an in-memory engine. The old code, which read from
actual disk, would try to glob `auxiliary/...` (i.e. some path relative
to the cwd - oops), which would return no results. As a result, the
sideloaded storage would try to invoke `eng.Remove("auxiliary/...")`
which the engine would refuse, since there would actually be (in-mem)
files stored in the engine. This all resolves now that everyone is
looking at the same files.

It's possible that this fixes the issue with BenchmarkUserFile, as
previously errors from TruncateTo could block the raft log queue,
which in turn could at least slow down the test, but it's hard to
say for sure.

Touches cockroachdb#59126.
Touches cockroachdb#31913.

Release note: None
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-storage Relating to our storage engine (Pebble) on-disk storage. C-cleanup Tech debt, refactors, loose ends, etc. Solution not expected to significantly change behavior.
Projects
None yet
Development

No branches or pull requests

4 participants