Validate LastLedgerHash for downloaded shards.
* Add documentation for shard validation
* Retrieve last ledger hash for imported shards
* Compare the reference hash to the lastLedgerHash in Shard::finalize
* Limit retry attempts for imported shards with unconfirmed last ledger hashes
* Use a common function for removing failed shards
* Add new ShardInfo::State for imported shards
undertome committed Apr 1, 2020
1 parent 3780657 commit 1793624
Showing 5 changed files with 324 additions and 38 deletions.
130 changes: 130 additions & 0 deletions src/ripple/nodestore/ShardValidation.md
@@ -0,0 +1,130 @@
# Downloaded Shard Validation

## Overview

In order to validate shards that have been downloaded from file servers (as opposed to shards acquired from peers), the application must confirm the validity of the downloaded shard's last ledger. The following sections describe this confirmation process in greater detail.

## Execution Concept

### Flag Ledger

Since the number of ledgers contained in each shard is always a multiple of 256, a shard's last ledger is always a flag ledger. Conveniently, the application provides a mechanism for retrieving the hash for a given flag ledger:

```C++
boost::optional<uint256>
hashOfSeq (ReadView const& ledger,
           LedgerIndex seq,
           beast::Journal journal)
```
When validating downloaded shards, we use this function to retrieve the hash of the shard's last ledger. If the function returns a hash that matches the hash stored in the shard, validation of the shard can proceed.
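
As a rough illustration of why a shard's last ledger is always a flag ledger, the sketch below computes the final sequence of a shard and checks it against the 256-ledger interval. The helper names `isFlagLedger` and `lastLedgerSeqOf`, the contiguous layout in which shard N ends at `(N + 1) * ledgersPerShard`, and the example value 16384 are assumptions made for illustration, not the application's own definitions.

```cpp
#include <cassert>
#include <cstdint>

// Interval between flag ledgers, as stated in the text above.
constexpr std::uint32_t flagLedgerInterval = 256;

// A ledger is treated as a flag ledger when its sequence
// is a multiple of the interval.
constexpr bool
isFlagLedger(std::uint32_t seq)
{
    return seq % flagLedgerInterval == 0;
}

// Hypothetical helper: last ledger sequence of a shard, assuming
// shards are contiguous and shard N ends at (N + 1) * ledgersPerShard.
inline std::uint32_t
lastLedgerSeqOf(std::uint32_t shardIndex, std::uint32_t ledgersPerShard)
{
    // The argument only holds when the shard size is a
    // multiple of the flag ledger interval.
    assert(ledgersPerShard % flagLedgerInterval == 0);
    return (shardIndex + 1) * ledgersPerShard;
}
```

With an assumed shard size of 16384, `lastLedgerSeqOf(1, 16384)` yields 32768, which `isFlagLedger` accepts because any multiple of a multiple of 256 is itself a multiple of 256.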

### Caveats

#### Later Ledger

The `getHashBySeq` function will provide the hash of a flag ledger only if the application has stored a later ledger. When validating a downloaded shard, if there is no later ledger stored, validation of the shard will be deferred until a later ledger has been stored.

We employ a simple heuristic for determining whether the application has stored a ledger later than the last ledger of the downloaded shard:

```C++
// We use the presence (or absence) of the validated
// ledger as a heuristic for determining whether or
// not we have stored a ledger that comes after the
// last ledger in this shard. A later ledger must
// be present in order to reliably retrieve the hash
// of the shard's last ledger.
if (app_.getLedgerMaster().getValidatedLedger())
{
    auto const hash = app_.getLedgerMaster().getHashBySeq(
        lastLedgerSeq(shardIndex));
    .
    .
    .
}
```

The `getHashBySeq` function will be invoked only when a call to `LedgerMaster::getValidatedLedger` returns a validated ledger, rather than a `nullptr`. Otherwise validation of the shard will be deferred.

### Retries

#### Retry Limit

If the server must defer shard validation, the software starts a timer that, upon expiration, re-attempts confirmation of the last ledger hash. We place an upper limit on the number of attempts the server makes to achieve this confirmation. When the maximum number of attempts has been reached, validation of the shard fails and the shard is removed. An attempt counts toward the limit only when we are able to get a validated ledger (implying a current view of the network) but are unable to retrieve the last ledger hash. Retries that occur because no validated ledger was available are not counted.
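
The counting rule can be summarized as a small decision function. This is only a sketch of the rule as described in the paragraph above: the `Outcome` enum, the `classify` function, and its parameters are hypothetical names introduced here and do not appear in the application.

```cpp
#include <cassert>

// Hypothetical outcomes of one hash confirmation pass.
enum class Outcome
{
    deferUncounted,  // no validated ledger yet; retry later, limit untouched
    retryCounted,    // validated ledger present, hash missing; attempt consumed
    fail,            // hash missing and no attempts left; shard is removed
    proceed          // hash retrieved; shard validation can begin
};

// Applies the retry rule: only a pass that had a validated ledger
// but could not retrieve the hash counts toward the limit.
inline Outcome
classify(bool haveValidatedLedger, bool hashFound, int& attempts, int maxAttempts)
{
    assert(maxAttempts > 0 && attempts >= 0);

    if (!haveValidatedLedger)
        return Outcome::deferUncounted;

    if (hashFound)
        return Outcome::proceed;

    if (attempts < maxAttempts)
    {
        ++attempts;
        return Outcome::retryCounted;
    }

    return Outcome::fail;
}
```

Keeping the uncounted deferral separate from the counted retry means a server that is still syncing does not exhaust its attempts before it has a current view of the network.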

#### ShardInfo

The `DatabaseShardImp` class stores a container of `ShardInfo` structs, each of which contains information pertaining to a shard held by the server. These structs will be used during shard import to store the last ledger hash (when available) and to track the number of hash confirmation attempts that have been made.

```C++
struct ShardInfo
{
    .
    .
    .

    // Used to limit the number of times we attempt
    // to retrieve a shard's last ledger hash, when
    // the hash should have been found. See
    // scheduleFinalizeShard(). Once this limit has
    // been exceeded, the shard has failed and is
    // removed.
    bool
    attemptHashConfirmation()
    {
        if (lastLedgerHashAttempts + 1 <= maxLastLedgerHashAttempts)
        {
            ++lastLedgerHashAttempts;
            return true;
        }

        return false;
    }

    // This variable is used during the validation
    // of imported shards and must match the
    // imported shard's last ledger hash.
    uint256 lastLedgerHash;

    // The number of times we've attempted to
    // confirm this shard's last ledger hash.
    uint16_t lastLedgerHashAttempts;

    // The upper limit on attempts to confirm
    // the shard's last ledger hash.
    static const uint8_t maxLastLedgerHashAttempts = 5;
};
```
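
To make the counting behavior of `attemptHashConfirmation` concrete, the following standalone sketch copies only the counter fields into a hypothetical `HashAttemptCounter` struct and counts how many calls are granted. The struct name, the helper `countGrantedAttempts`, and the zero initialization of the counter are assumptions made here so the example can run on its own.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in holding only the counter fields of ShardInfo.
struct HashAttemptCounter
{
    // Same check as in the struct above: grants an attempt
    // while the limit has not yet been reached.
    bool
    attemptHashConfirmation()
    {
        if (lastLedgerHashAttempts + 1 <= maxLastLedgerHashAttempts)
        {
            ++lastLedgerHashAttempts;
            return true;
        }

        return false;
    }

    // Assumed to start at zero for this sketch.
    std::uint16_t lastLedgerHashAttempts = 0;

    static constexpr std::uint8_t maxLastLedgerHashAttempts = 5;
};

// Calls attemptHashConfirmation() until it is refused and
// returns the number of attempts that were granted.
inline int
countGrantedAttempts(HashAttemptCounter& counter)
{
    int granted = 0;
    while (counter.attemptHashConfirmation())
    {
        ++granted;
        assert(granted <= HashAttemptCounter::maxLastLedgerHashAttempts);
    }
    return granted;
}
```

Starting from zero, exactly five calls are granted and every later call is refused. In the accompanying change to `DatabaseShardImp::importShard`, the first attempt is taken at import time, before `scheduleFinalizeShard` is called.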

### Shard Import

Once a shard has been successfully downloaded by the `ShardArchiveHandler`, this class invokes the `importShard` method on the shard database:

```C++
bool
DatabaseShardImp::importShard(
    std::uint32_t shardIndex,
    boost::filesystem::path const& srcDir)
```

At the end of this method, `DatabaseShardImp::finalizeShard` is invoked, which begins validation of the downloaded shard. This will be changed so that the software instead first creates a task to confirm the last ledger hash. Upon successful completion of this task, shard validation will begin.

```C++
bool
DatabaseShardImp::importShard(
    std::uint32_t shardIndex,
    boost::filesystem::path const& srcDir)
{
    .
    .
    .
    taskQueue_->addTask([this]()
    {
        // Verify hash.
        // Invoke DatabaseShardImp::finalizeShard on success.
        // Defer task if necessary.
    });
}
```
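
The "defer task if necessary" step amounts to a task that re-queues itself until the hash can be confirmed. The sketch below shows that pattern with a hypothetical single-threaded `TaskQueue` and `Importer`; both types, the `hashAvailable` callback, and the immediate re-queue (standing in for the timer-based retry) are assumptions for illustration only.

```cpp
#include <cassert>
#include <functional>
#include <queue>
#include <utility>

// Hypothetical single-threaded stand-in for the task queue.
class TaskQueue
{
public:
    void
    addTask(std::function<void()> task)
    {
        tasks_.push(std::move(task));
    }

    // Runs tasks, including ones queued while running,
    // until none remain. Returns the number of tasks run.
    int
    runAll()
    {
        int ran = 0;
        while (!tasks_.empty())
        {
            auto task = std::move(tasks_.front());
            tasks_.pop();
            assert(task);
            task();
            ++ran;
        }
        return ran;
    }

private:
    std::queue<std::function<void()>> tasks_;
};

// Hypothetical importer that finalizes once the hash is available
// and otherwise defers by queuing the same task again.
struct Importer
{
    TaskQueue& queue;
    std::function<bool()> hashAvailable;
    bool finalized = false;

    void
    scheduleFinalize()
    {
        queue.addTask([this] {
            if (hashAvailable())
                finalized = true;    // stands in for finalizeShard
            else
                scheduleFinalize();  // defer: try again later
        });
    }
};
```

A real implementation would wait between attempts and bound the number of counted retries; this sketch leaves both out to keep the re-queuing pattern visible.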
167 changes: 131 additions & 36 deletions src/ripple/nodestore/impl/DatabaseShardImp.cpp
@@ -428,7 +428,14 @@ DatabaseShardImp::importShard(
}

it->second.shard = std::move(shard);
finalizeShard(it->second, true, lock);
it->second.state = ShardInfo::State::postImport;

[[maybe_unused]] auto
canAttempt = it->second.attemptHashConfirmation();

assert(canAttempt);
scheduleFinalizeShard(shardIndex);

return true;
}

@@ -1202,6 +1209,95 @@ DatabaseShardImp::findAcquireIndex(
return boost::none;
}

void
DatabaseShardImp::scheduleFinalizeShard(std::uint32_t shardIndex)
{
taskQueue_->addTask([this, shardIndex]
{
auto retry = [this, shardIndex]
{
const uint8_t TIMER_INTERVAL = 60;

using timer = boost::asio::basic_waitable_timer<
std::chrono::steady_clock>;

std::shared_ptr<timer> retryTimer(new timer(app_.getIOService()));

retryTimer->expires_from_now(std::chrono::seconds(TIMER_INTERVAL));
retryTimer->async_wait([this, shardIndex, retryTimer]
(boost::system::error_code const& ec)
{
scheduleFinalizeShard(shardIndex);
});
};

// We use the presence (or absence) of the validated
// ledger as a heuristic for determining whether or
// not we have stored a ledger that comes after the
// last ledger in this shard. A later ledger must
// be present in order to reliably retrieve the hash
// of the shard's last ledger.
if (app_.getLedgerMaster().getValidatedLedger())
{
auto const hash = app_.getLedgerMaster().walkHashBySeq(
lastLedgerSeq(shardIndex));

std::lock_guard lock(mutex_);
auto const it {shards_.find(shardIndex)};

if (auto found = (it == shards_.end()); found ||
it->second.state != ShardInfo::State::postImport)
{
JLOG(j_.error()) <<
"Shard " << shardIndex << " failed to import";
return;
}

if (!hash || hash->isZero())
{
if (it->second.attemptHashConfirmation())
{
JLOG(j_.warn())
<< "Retrying last ledger hash confirmation"
" for shard " << shardIndex << ".";

retry();
}
else
{
JLOG(j_.error())
<< "Shard "
<< shardIndex
<< " failed to import."
<< " Maximum last ledger hash"
" confirmation attempts reached.";

removeFailedShard(it->second.shard, lock);
}
return;
}

it->second.lastLedgerHash = *hash;
finalizeShard(it->second, true, lock);
}

// Our heuristic technique didn't check out.
// Try again when our timer expires.
else
{
JLOG(j_.warn())
<< "Failed to retrieve validated ledger while importing shard "
<< shardIndex
<< ". Retrying on timer expiration.";

// No limit on retry attempts when the
// hash isn't queried, so no need to call
// ShardInfo::attemptHashConfirmation()
retry();
}
});
}

void
DatabaseShardImp::finalizeShard(
ShardInfo& shardInfo,
@@ -1212,6 +1308,8 @@ DatabaseShardImp::finalizeShard(
assert(shardInfo.shard->index() != acquireIndex_);
assert(shardInfo.shard->isBackendComplete());
assert(shardInfo.state != ShardInfo::State::finalize);
assert((shardInfo.state == ShardInfo::State::postImport) ==
shardInfo.lastLedgerHash.isNonZero());

auto const shardIndex {shardInfo.shard->index()};

@@ -1222,10 +1320,14 @@ DatabaseShardImp::finalizeShard(
return;

std::shared_ptr<Shard> shard;
uint256 lastLedgerHash;
{
std::lock_guard lock(mutex_);
if (auto const it {shards_.find(shardIndex)}; it != shards_.end())
{
shard = it->second.shard;
lastLedgerHash = it->second.lastLedgerHash;
}
else
{
JLOG(j_.error()) <<
@@ -1234,29 +1336,15 @@ DatabaseShardImp::finalizeShard(
}
}

if (!shard->finalize(writeSQLite))
if (!shard->finalize(writeSQLite, lastLedgerHash))
{
if (isStopping())
return;

// Bad shard, remove it
{
std::lock_guard lock(mutex_);
shards_.erase(shardIndex);
updateStatus(lock);

using namespace boost::filesystem;
path const dir {shard->getDir()};
shard.reset();
try
{
remove_all(dir);
}
catch (std::exception const& e)
{
JLOG(j_.error()) <<
"exception " << e.what() << " in function " << __func__;
}
removeFailedShard(shard, lock);
}

setFileStats();
@@ -1406,25 +1494,7 @@ DatabaseShardImp::storeLedgerInShard(
{
// Shard may be corrupt, remove it
std::lock_guard lock(mutex_);

shards_.erase(shard->index());
if (shard->index() == acquireIndex_)
acquireIndex_ = 0;

updateStatus(lock);

using namespace boost::filesystem;
path const dir {shard->getDir()};
shard.reset();
try
{
remove_all(dir);
}
catch (std::exception const& e)
{
JLOG(j_.error()) <<
"exception " << e.what() << " in function " << __func__;
}
removeFailedShard(shard, lock);

result = false;
}
@@ -1453,6 +1523,31 @@ DatabaseShardImp::storeLedgerInShard(
return result;
}

void
DatabaseShardImp::removeFailedShard(
std::shared_ptr<Shard> shard,
std::lock_guard<std::mutex> & lock)
{
shards_.erase(shard->index());
if (shard->index() == acquireIndex_)
acquireIndex_ = 0;

updateStatus(lock);

using namespace boost::filesystem;
path const dir {shard->getDir()};
shard.reset();
try
{
remove_all(dir);
}
catch (std::exception const& e)
{
JLOG(j_.error()) <<
"exception " << e.what() << " in function " << __func__;
}
}

//------------------------------------------------------------------------------

std::unique_ptr<DatabaseShard>