
sequencer: Cache compact.Tree between sequencing runs #1598

Closed · wants to merge 1 commit

Conversation

@pav-kv (Contributor) commented May 14, 2019

This change adds compact.Tree caching to the sequencer. It should significantly
reduce the cost of sequencing, because the sequencer typically performs many
runs in a row.


@pav-kv (Contributor, Author) commented May 14, 2019

@Martin2112 @AlCutter What do you think about this idea?
Context: #1188 (comment).

codecov bot commented May 14, 2019

Codecov Report

❗ No coverage uploaded for pull request base (master@629410d).
The diff coverage is 58.33%.

@@            Coverage Diff            @@
##             master    #1598   +/-   ##
=========================================
  Coverage          ?   67.14%           
=========================================
  Files             ?      111           
  Lines             ?     8988           
  Branches          ?        0           
=========================================
  Hits              ?     6035           
  Misses            ?     2347           
  Partials          ?      606
Impacted Files Coverage Δ
log/cache.go 50% <50%> (ø)
log/sequencer.go 75.1% <75%> (ø)

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 629410d...1a25c20.

@pav-kv (Contributor, Author) commented May 14, 2019

Caching is a consequence of the IntegrateBatch method being stateless. An alternative approach is having per-tree Sequencer objects, which eliminates the need for caching (the "cached" state could simply be stored in the Sequencer itself). I think the latter approach is better, because it removes the need to keep the lifetimes of cached objects in sync with mastership. But that obviously requires a higher-level redesign of the log manager.

@Martin2112 (Contributor) commented:

It may be time for some history from the dim and distant past of those long-forgotten days in 2016 ...

Not retaining state was done for several reasons: 1) not holding something in RAM that could get corrupted; 2) a reduced need to think about various complicated distributed failure modes; 3) limited time and resources to implement, which made us pretty cautious.

So fast forward to the glorious days of 2019. The kind of questions that should be asked are whether this opens up any possible ways to go wrong that don't currently exist. I think the versioning protects against most stale-database / accidental multiple-master cases, but it's worth thinking it through with worst-case assumptions.

If we're going to do this, I think it should check that what comes out of the cache appears to be the same as what went in. Also, what happens if time on the machine appears to have gone backwards? I think there is a check somewhere against writing a root older than the current one, but I'm not sure it covers this. It might be possible to make it either write an incorrect timestamp or get stuck failing to update the root until it resigns and a new signer takes over.

@pav-kv (Contributor, Author) commented May 14, 2019

> 1.) Not holding something that could get corrupted in RAM
> ...
> If we're going to do this I think it should check that what comes out of the cache appears to be the same as what went in.

This is a good point. In this PR I check that the tree size and root hash of the compact.Tree match the latest signed log root read from storage. Assuming the hash is cryptographic, a match is strong evidence that nothing has been corrupted.
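The check described above could be sketched roughly as follows. This is a minimal illustration, not Trillian's actual API: `CompactTree` and `SignedLogRoot` are stand-in types with illustrative field names, and `checkCached` is a hypothetical helper.

```go
package main

import (
	"bytes"
	"crypto/sha256"
	"errors"
	"fmt"
)

// CompactTree is a stand-in for compact.Tree, reduced to the two
// properties the cache-validity check needs.
type CompactTree struct {
	Size int64
	Root []byte
}

// SignedLogRoot is a stand-in for the latest signed log root read
// from storage.
type SignedLogRoot struct {
	TreeSize int64
	RootHash []byte
}

// checkCached returns nil only if the cached tree matches the latest
// signed root; on any mismatch the caller should discard the cache
// entry and rebuild the compact tree from storage.
func checkCached(ct *CompactTree, slr SignedLogRoot) error {
	if ct == nil {
		return errors.New("cache miss")
	}
	if ct.Size != slr.TreeSize || !bytes.Equal(ct.Root, slr.RootHash) {
		return errors.New("cached tree does not match signed root")
	}
	return nil
}

func main() {
	root := sha256.Sum256([]byte("leaf data"))
	ct := &CompactTree{Size: 10, Root: root[:]}
	slr := SignedLogRoot{TreeSize: 10, RootHash: root[:]}
	fmt.Println(checkCached(ct, slr) == nil) // true: size and hash match
	slr.TreeSize = 11
	fmt.Println(checkCached(ct, slr) == nil) // false: cache must be rebuilt
}
```

Because the root hash is a cryptographic digest over the whole tree, the size-plus-hash comparison detects both stale and corrupted cache entries.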

> Also what happens if time on the machine appears to have gone backwards? I think there is a check somewhere for writing an older than current root but not sure it covers this. It might be possible make it either write an incorrect timestamp or get stuck failing to update the root until it resigns and a new signer takes over.

This is an orthogonal question, as the cache only stores hashes. This PR doesn't change behavior w.r.t. timestamps.

@Martin2112 (Contributor) commented:

Yes, it currently checks what it uses, so it could be OK. Otherwise it would be safer to hang on to the serialized format and verify it each time it's used. That may be overkill, though.

The timestamp thing is an example, as there's a timestamp in LogRootV1. I don't think the code checks that, so if multiple roots are written at the same tree size, the cached root gets out of date? Again, I don't think that timestamp gets used for anything, but if it did then the door would be open for something along the lines of the bug that caused us an ~8 hour log outage a few years ago.

It's not a bad idea. We just need to make sure we've thought through how it might break. If it can't, that's great, but we at least have to do the thinking.

log/cache.go Outdated
//
// TODO(pavelkalinnikov): Support some cache clearing mechanism.
type TreeCache struct {
	ct map[int64]*compact.Tree
Member commented:

You may want to use sync.Map here, or otherwise protect with sync.Mutex?

@pav-kv (Author) replied:

Fantastic point.
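The mutex-protected variant suggested above could look roughly like this. It is a minimal sketch: `Tree` stands in for Trillian's `compact.Tree` (stubbed so the example is self-contained), and the `Get`/`Put` method names are illustrative, not from the PR. A `sync.Map` would also fit this access pattern, as the reviewer notes.

```go
package main

import (
	"fmt"
	"sync"
)

// Tree is a stand-in for compact.Tree from Trillian's merkle package,
// used here only to make the sketch self-contained.
type Tree struct{ Size int64 }

// TreeCache guards the per-tree map with a mutex so that concurrent
// sequencing runs for different trees can share one cache safely.
type TreeCache struct {
	mu sync.Mutex
	ct map[int64]*Tree
}

func NewTreeCache() *TreeCache {
	return &TreeCache{ct: make(map[int64]*Tree)}
}

// Get returns the cached tree for treeID, or nil if absent.
func (c *TreeCache) Get(treeID int64) *Tree {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.ct[treeID]
}

// Put stores the tree for treeID, replacing any previous entry.
func (c *TreeCache) Put(treeID int64, t *Tree) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.ct[treeID] = t
}

func main() {
	c := NewTreeCache()
	c.Put(42, &Tree{Size: 100})
	fmt.Println(c.Get(42).Size)  // 100
	fmt.Println(c.Get(7) == nil) // true: no entry for that tree ID
}
```

A plain mutex keeps the invariants easy to reason about here; `sync.Map` mainly pays off when keys are written once and read many times across many goroutines.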

log/sequencer.go Outdated
@@ -401,7 +405,7 @@ func (s Sequencer) IntegrateBatch(ctx context.Context, tree *trillian.Tree, limi
}

// Collate node updates.
-	nodeMap, err := s.updateCompactTree(merkleTree, sequencedLeaves, label)
+	nodeMap, err := s.updateCompactTree(merkleTree, tree.TreeId, sequencedLeaves, label)
Member commented:
Ideally you probably only want this to happen once the tx has successfully committed?

It's probably fine because on a rollback the replayed root hash won't match the cached one so you'll end up having to init the cache from storage, but ...

@pav-kv (Author) replied:

That was precisely my thinking. But it's probably still not a bad idea to make it safer. OTOH, it's not trivial: note that the returned compact.Tree is mutable, and we can't roll back these updates without some sort of "clone" function.

@pav-kv (Author) replied May 14, 2019:

This triggers another interesting observation: this code can be retried, as it's within a RunInTransaction function. If it is retried, the 1st retry will fail (and clear the cache) because of compact.Tree mutability and the resulting root hash mismatch, but the 2nd retry has a chance of succeeding. This is probably a good reason to have a "clone" function instead.

@pav-kv (Author) replied:

Oh wait, ignore the last comment. The 1st retry will not fail, it will only be forced to reload the compact.Tree from storage.
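The "clone" idea discussed above could be sketched as a deep copy taken before the transaction mutates the tree, so a rollback never leaves a half-mutated object in the cache. This is a hypothetical helper, not code from the PR; `CompactTree` and its fields are illustrative stand-ins for compact.Tree's internal state.

```go
package main

import "fmt"

// CompactTree is a stand-in for compact.Tree; a real clone would copy
// all internal node hashes, represented here by a slice of byte slices.
type CompactTree struct {
	Size   int64
	Hashes [][]byte
}

// Clone returns a deep copy. The transaction mutates the copy, and the
// original is stored in the cache only after a successful commit, so a
// rollback or retry cannot corrupt the cached snapshot.
func (t *CompactTree) Clone() *CompactTree {
	h := make([][]byte, len(t.Hashes))
	for i, b := range t.Hashes {
		h[i] = append([]byte(nil), b...)
	}
	return &CompactTree{Size: t.Size, Hashes: h}
}

func main() {
	cached := &CompactTree{Size: 3, Hashes: [][]byte{{1, 2}, {3}}}
	work := cached.Clone()
	work.Hashes[0][0] = 9 // mutate the working copy inside the txn
	work.Size = 4
	fmt.Println(cached.Hashes[0][0]) // 1: cached snapshot unaffected
	fmt.Println(cached.Size)         // 3
}
```

With a clone per transaction attempt, a RunInTransaction retry starts from the intact cached snapshot instead of being forced to reload the compact.Tree from storage.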

@pav-kv force-pushed the sequencer_cache branch from e3f0835 to 1a25c20 on May 28, 2019
@pav-kv mentioned this pull request on May 28, 2019
@pav-kv (Contributor, Author) commented May 28, 2019

Closing in favor of a larger refactoring: #1640

@pav-kv pav-kv closed this May 28, 2019
4 participants