Rippled seemingly not aware of its own state, reports incorrectly #2428

Closed
professorhantzen opened this issue Mar 11, 2018 · 21 comments

@professorhantzen

professorhantzen commented Mar 11, 2018

Just started running a stock rippled server (other than a switch to a NuDB backend and an increased node size, it's "out-of-the-box").

After running for a few hours, I did the following (all requests over local ws, with a few ledgers' worth of delay in between; a sketch of the requests follows the list):

  1. Checked its complete ledgers using server_info. It reported: "37142307-37145003".
  2. Requested a specific account and lines for ledger_index "37142507" (a ledger well within its range). This request failed with "ledgerNotFound".
  3. Checked its complete ledgers again. Now it reported: "37142307-37142506,37142508-37145005". I.e., immediately after I made the request, it seemed to "realise" it didn't have that ledger, and updated its list accordingly.
  4. I then requested the specific account and lines for the missing ledger again. This time, it returned the correct lines result from that ledger, without any error.
  5. Checked complete ledgers again, it still reported "37142307-37142506,37142508-37145005".
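Roughly what those requests look like over the local WebSocket port - a minimal sketch, assuming the "ws" npm package, a default-style port, and a placeholder account address (this is not the original session's code):

```typescript
// Minimal repro sketch. Assumptions: the "ws" npm package, rippled's
// WebSocket port on 127.0.0.1:6006, and a placeholder account address.
import WebSocket from "ws";

const ws = new WebSocket("ws://127.0.0.1:6006");
let nextId = 0;

// Send one command and resolve with the matching response.
function request(command: string, params: object = {}): Promise<any> {
  return new Promise((resolve) => {
    const id = ++nextId;
    const onMessage = (raw: WebSocket.RawData) => {
      const msg = JSON.parse(raw.toString());
      if (msg.id === id) {
        ws.off("message", onMessage);
        resolve(msg);
      }
    };
    ws.on("message", onMessage);
    ws.send(JSON.stringify({ id, command, ...params }));
  });
}

ws.on("open", async () => {
  // 1. Check the claimed range.
  const a = await request("server_info");
  console.log(a.result.info.complete_ledgers);   // "37142307-37145003"

  // 2. Ask for trust lines at a ledger inside that range.
  const b = await request("account_lines", {
    account: "rPLACEHOLDER...",                  // hypothetical account
    ledger_index: 37142507,
  });
  console.log(b.error);                          // "ledgerNotFound" per the report

  // 3. The failed lookup punches a hole in the claimed range.
  const c = await request("server_info");
  console.log(c.result.info.complete_ledgers);   // "...-37142506,37142508-..."
  ws.close();
});
```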

So:

  1. Why did it report the range incorrectly? (1)
  2. Why did it only update its state after I asked it something it didn't know? (2, 3)
  3. Why did it return a result it claimed to not have? (4)
  4. Why did it not update its own state to reflect it now had the information? (5)
@sublimator
Contributor

ouch

@sublimator
Contributor

What version of rippled are you running? Might help them with debugging

@professorhantzen
Author

Good point! 0.90.0

@professorhantzen
Author

professorhantzen commented Mar 11, 2018

This is on a stock installation of Ubuntu 16.04. Only nodejs and rippled are on this box - rippled was installed using alien (as per the guide).

I've located the debug.log from a couple of minutes before and after this occurred:

2018-Mar-11 06:44:19 LoadMonitor:WRN Job: recvTransaction->checkTransaction run: 0ms wait: 1059ms
2018-Mar-11 06:44:21 LoadMonitor:WRN Job: processLedgerData run: 1443ms wait: 0ms
2018-Mar-11 06:44:21 LoadMonitor:WRN Job: acceptLedger run: 2964ms wait: 0ms
2018-Mar-11 06:44:21 LedgerConsensus:WRN By the time we got 6A09DEAEB9488C8F641A767A2A75C4F6A8C1237DCACBE2A2C8A55996A86CA6E4 no peers were proposing it
2018-Mar-11 06:44:34 LoadMonitor:WRN Job: processLedgerData run: 1123ms wait: 0ms
2018-Mar-11 06:44:49 LoadMonitor:WRN Job: getConsensusLedger run: 1142ms wait: 0ms
2018-Mar-11 06:44:49 LoadMonitor:WRN Job: processLedgerData run: 1319ms wait: 0ms
2018-Mar-11 06:44:49 LoadMonitor:WRN Job: getConsensusLedger run: 1141ms wait: 0ms
2018-Mar-11 06:44:49 LoadMonitor:WRN Job: getConsensusLedger run: 1233ms wait: 0ms
2018-Mar-11 06:45:20 LoadMonitor:WRN Job: acceptLedger run: 1581ms wait: 0ms
2018-Mar-11 06:45:54 Ledger:WRN Don't have TX root for ledger
2018-Mar-11 06:45:54 Ledger:WRN Don't have AS root for ledger
2018-Mar-11 06:45:54 Ledger:ERR Missing node in 2B74234078C0BFDA17891C3DD42319389951F6C32379E5A7AF8597F406311878
2018-Mar-11 06:45:54 Ledger:WRN Don't have TX root for ledger
2018-Mar-11 06:45:54 Ledger:WRN Don't have AS root for ledger
2018-Mar-11 06:45:54 Ledger:ERR Missing node in 2B74234078C0BFDA17891C3DD42319389951F6C32379E5A7AF8597F406311878
2018-Mar-11 06:45:55 LoadMonitor:WRN Job: sweep run: 1070ms wait: 0ms
2018-Mar-11 06:45:55 LoadMonitor:WRN Job: processLedgerData run: 1416ms wait: 0ms
2018-Mar-11 06:45:55 LoadMonitor:WRN Job: processLedgerData run: 1112ms wait: 0ms
2018-Mar-11 06:46:00 LoadMonitor:WRN Job: processLedgerData run: 1252ms wait: 22ms
2018-Mar-11 06:46:04 LoadMonitor:WRN Job: acceptLedger run: 1313ms wait: 0ms
2018-Mar-11 06:46:05 LoadMonitor:WRN Job: processLedgerData run: 6153ms wait: 0ms
2018-Mar-11 06:46:05 LoadMonitor:WRN Job: processLedgerData run: 3028ms wait: 0ms
2018-Mar-11 06:46:05 LoadMonitor:WRN Job: processLedgerData run: 357ms wait: 2987ms
2018-Mar-11 06:46:23 Ledger:ERR Missing node in 6856BD9669ABADD4D65A7CF12A2189C3EA7D42B300C61629FD37D90C54E678C1
2018-Mar-11 06:46:25 LoadMonitor:WRN Job: cmd:account_info run: 2073ms wait: 0ms
2018-Mar-11 06:46:25 LoadMonitor:WRN Job: WS-Client run: 2074ms wait: 0ms
2018-Mar-11 06:46:25 Ledger:ERR Missing node in 6856BD9669ABADD4D65A7CF12A2189C3EA7D42B300C61629FD37D90C54E678C1
2018-Mar-11 06:46:45 LoadMonitor:WRN Job: processLedgerData run: 1256ms wait: 1ms
2018-Mar-11 06:47:12 LoadMonitor:WRN Job: getConsensusLedger run: 1651ms wait: 0ms
2018-Mar-11 06:47:12 LoadMonitor:WRN Job: getConsensusLedger run: 1603ms wait: 0ms
2018-Mar-11 06:47:12 LoadMonitor:WRN Job: getConsensusLedger run: 1501ms wait: 0ms
2018-Mar-11 06:47:12 LoadMonitor:WRN Job: recvPropose->checkPropose run: 1ms wait: 1350ms
2018-Mar-11 06:47:12 LoadMonitor:WRN Job: recvPropose->checkPropose run: 0ms wait: 1284ms

The "missing node" hashes are, respectively, 37142507 (the ledger in question above), and 37142528 (21 ledgers later).

@sublimator
Contributor

I'm not sure if this will help, but it might close the gaps in your reported state:
https://ripple.com/build/rippled-apis/#ledger-request
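For reference, asking the server to backfill one specific ledger looks roughly like the following; ledger_request is an admin command, so it has to go over an admin-enabled connection (this reuses the hypothetical request() helper from the sketch further up):

```typescript
// Ask rippled to fetch a specific missing ledger from its peers. While the
// fetch is in progress the response describes the acquisition; once the
// ledger is present it returns the ledger header.
const r = await request("ledger_request", { ledger_index: 37142507 });
console.log(r.result);
```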

@professorhantzen
Author

That's very useful, thanks.

I've since rebooted with online_delete disabled. It seemed to forget all ledgers and start acquiring again from that point, which struck me as odd.

@bachase
Collaborator

bachase commented Mar 12, 2018

@professorhantzen have you experienced this issue since restarting?

@professorhantzen
Author

@bachase No. I haven't done extensive testing, but it seems to be working as expected now.

@professorhantzen
Author

I just grepped debug.log for ERR. It turned up that the missing-node errors happened again a few hours later:

2018-Mar-10 18:20:03 PeerFinder:ERR Logic testing     111.231.244.93:38330 with error, Connection timed out
2018-Mar-11 06:45:54 Ledger:ERR Missing node in 2B74234078C0BFDA17891C3DD42319389951F6C32379E5A7AF8597F406311878
2018-Mar-11 06:45:54 Ledger:ERR Missing node in 2B74234078C0BFDA17891C3DD42319389951F6C32379E5A7AF8597F406311878
2018-Mar-11 06:46:23 Ledger:ERR Missing node in 6856BD9669ABADD4D65A7CF12A2189C3EA7D42B300C61629FD37D90C54E678C1
2018-Mar-11 06:46:25 Ledger:ERR Missing node in 6856BD9669ABADD4D65A7CF12A2189C3EA7D42B300C61629FD37D90C54E678C1
2018-Mar-11 07:52:42 PeerFinder:ERR Logic testing     111.231.244.93:55766 with error, Connection timed out
2018-Mar-11 07:55:17 Ledger:ERR Missing node in 2B74234078C0BFDA17891C3DD42319389951F6C32379E5A7AF8597F406311878
2018-Mar-11 07:55:17 Ledger:ERR Missing node in 2B74234078C0BFDA17891C3DD42319389951F6C32379E5A7AF8597F406311878
2018-Mar-11 08:08:42 PeerFinder:ERR Logic testing     111.231.244.93:58950 with error, Connection timed out
2018-Mar-11 08:49:41 PeerFinder:ERR Logic testing     203.211.155.6:26443 with error, Connection timed out
2018-Mar-11 08:49:51 PeerFinder:ERR Logic testing     219.250.83.124:53842 with error, Connection timed out
2018-Mar-11 09:27:46 Ledger:ERR Missing node in 6856BD9669ABADD4D65A7CF12A2189C3EA7D42B300C61629FD37D90C54E678C1
2018-Mar-11 15:51:56 PeerFinder:ERR Logic testing     212.199.177.241:46664 with error, Connection timed out

I also specifically queried the full contents of that ledger and it seemed to work fine.

@miguelportilla
Contributor

@professorhantzen thank you for submitting the issue. The short answer is that complete ledgers in server_info is only an estimate and does not accurately represent what is stored in the database. I have answered your questions below in more detail.

  1. Why did it report the range incorrectly? (1)
    When complete ledgers is populated, rippled does not check to make sure each ledger is present in the local database, as that process is intensive and can take too long. Instead it assumes ranges are contiguous and backfills in magnitudes of 500. So indexes 37142307 and 37142807 may have been present and everything in between was assumed to be as well.

  2. Why did it only update its state after I asked it something it didn't know? (2, 3)
    If an RPC requests a ledger by index claimed to be complete and fails to find it, the index is immediately removed from complete ledgers.

  3. Why did it return a result it claimed to not have? (4)
    When an RPC requests a ledger claimed to be complete and fails to find it, the missing node is automatically acquired.

  4. Why did it not update its own state to reflect it now had the information? (5)
    Unfortunately, when the missing ledger is acquired under the previous circumstances, complete ledgers is not updated. This can likely be improved and I may submit a fix for it.
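To make that bookkeeping concrete, here is a toy model of the behavior described in points 1, 2, and 4. This is not rippled's code (internally a range set serves this role); it only illustrates the assume-then-punch-holes pattern:

```typescript
// Toy model: ranges are assumed contiguous, a failed lookup punches a hole
// without re-checking disk, and a later acquisition never re-adds the index.
class CompleteLedgers {
  private ranges: Array<[number, number]> = [];

  // Startup/backfill: mark a whole span complete without verifying each index.
  assumeRange(lo: number, hi: number): void {
    this.ranges.push([lo, hi]);
  }

  // Called when an RPC asked for an index we claimed to have but couldn't find.
  remove(index: number): void {
    const next: Array<[number, number]> = [];
    for (const [lo, hi] of this.ranges) {
      if (index < lo || index > hi) {
        next.push([lo, hi]);
      } else {
        if (lo <= index - 1) next.push([lo, index - 1]);
        if (index + 1 <= hi) next.push([index + 1, hi]);
      }
    }
    this.ranges = next;
  }

  toString(): string {
    return this.ranges
      .map(([lo, hi]) => (lo === hi ? `${lo}` : `${lo}-${hi}`))
      .join(",");
  }
}

const cl = new CompleteLedgers();
cl.assumeRange(37142307, 37145003);
cl.remove(37142507); // the failed account_lines lookup
console.log(cl.toString()); // "37142307-37142506,37142508-37145003"
// When the node is later acquired on demand, nothing calls a corresponding
// add(), which is why step 5 of the report still shows the gap.
```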

@miguelportilla
Contributor

@sublimator Excellent point, thank you.

@sublimator
Contributor

sublimator commented Mar 12, 2018 via email

@professorhantzen
Author

@miguelportilla Thanks for your great explanation.

How is "complete_ledgers" populated?

If all of the mechanisms by which a ledger could be added to or deleted from the db are defined, shouldn't it be possible in principle for "complete_ledgers" to remain accurate without affecting performance? If "complete_ledgers" were only updated at the point of each ledger addition or deletion, there should be no need for an iterative scan. Or is it limited by how asynchrony is handled?

@JoelKatz
Collaborator

JoelKatz commented Mar 12, 2018

The challenge is server startup. Consider a server that has full history: it can't serve queries until it has confirmed that it has tens of millions of ledgers, but it's not practical to check every entry in every ledger. The server compromises and checks the SQLite databases, assuming the back end contains a ledger if the SQLite database says it should. If the back end database (the one that holds the actual ledger nodes) is corrupt or was deleted (without deleting the SQLite databases), then the server will be lied to.

Note that the server will not update its complete ledgers unless it has some reason to. Say your server happens to have ledgers from 10,000,000 up in its SQLite databases. It will detect this on startup and report that in complete ledgers. But if you deleted your node back end database, then it won't actually have those ledgers. When it discovers that it doesn't have them, it will update its tracking of what ledgers are complete. But it won't update complete ledgers to add them even if it happens to retrieve them because it wasn't asked to.

During operation, your server acquires all ledgers that it sees on the network and updates its tracking. But if it's only configured to have, say, 100,000 ledgers, it won't add them to its tracking of complete ledgers later during the run. The discovery for ledgers we "happen to have" only occurs on startup.
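A minimal sketch of that startup compromise. The ledger.db path, the Ledgers table, and the LedgerSeq column are assumptions about a stock install, and better-sqlite3 stands in for rippled's internal SQLite access:

```typescript
// Trust the cheap SQLite index instead of touching every node in the
// back end. Path and schema below are assumptions for illustration.
import Database from "better-sqlite3";

const db = new Database("/var/lib/rippled/db/ledger.db", { readonly: true });

// One query instead of tens of millions of node-store reads.
const row = db
  .prepare("SELECT MIN(LedgerSeq) AS lo, MAX(LedgerSeq) AS hi FROM Ledgers")
  .get() as { lo: number; hi: number };

// The server then *assumes* lo..hi is complete. If the node back end was
// deleted or corrupted without touching SQLite, this claim is wrong until
// a failed lookup (as in the original repro) punches holes in it.
console.log(`assumed complete_ledgers: ${row.lo}-${row.hi}`);
```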

@sublimator
Contributor

sublimator commented Mar 12, 2018 via email

@professorhantzen
Author

Taking off my rose-coloured glasses, the present situation looks like this:

Every rippled node is restricted to having only 0.2% knowledge (1/500th) of which ledgers it has available. The minimal justification for this is that one rippled node may have one missing ledger.

If it might further the discussion, would something like the following work (apologies if my naive view of rippled limits its usefulness):

  1. Rippled updates "complete_ledgers" every time it adds or deletes a ledger, and this is the only way for the value to be updated. Ledgers are expected to be present or not, and are typically in contiguous blocks, so for efficiency this value could be stored and accessed as a run-length compressed bitfield with the address of each bit corresponding to the ledger_index (see the sketch at the end of this comment). This is then persisted to disk regularly.

  2. At shutdown, after closing all files, rippled saves a shutdown object, containing some integrity checksum of the current database files. If the database store is linear or easily deterministic, these should be partitioned to eliminate wastefulness in the next step.

  3. At startup, rippled checks for the presence of the shutdown object, the database and "complete_ledgers". If there is a mismatch between what it expects and finds per the shutdown object, it runs a concurrent process that audits the presence of each ledger it expects to have on disk, at a rate determined by load (and only for incorrectly checksum'd partitions). For the duration of this process, additions and deletions initiated by any other method are queued and executed by this process. (By being made aware of every action it can continually update what it expects to have and not waste time attempting to delete a ledger it doesn't have or has not yet verified it has, for example.)

  4. In the case of complete_ledgers becoming corrupted or lost, a config setting can be checked to determine whether rippled should rebuild it from everything it has on disk, or clean and start over. (I.e., in the case of a full-history node, rebuilding would be preferred.)

In all cases, the goal is that if a bit is set to on in complete_ledgers, that ledger can reliably be expected to be available on disk, and since the value is normally updated during ordinary ledger-close events or backfill retrieval, this shouldn't slow things down.

Users could be advised that if they want their server to run at optimum, they should not "manually delete" database files to free up space; instead, a delete_ledgers RPC command could be provided for the purpose (which would update complete_ledgers as it goes).
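A rough sketch of the bookkeeping in item 1, assuming a run-length representation of [start, length] pairs; the class, names, and persistence path are illustrative, and deletion would mirror add() by splitting a run (as in the toy model earlier in the thread):

```typescript
// Sketch of item 1: complete_ledgers as run-length encoded [start, length]
// pairs, mutated only on ledger add/delete and flushed to disk regularly.
// All names and the file path below are illustrative assumptions.
import { writeFileSync } from "fs";

type Run = { start: number; length: number };

class LedgerSet {
  private runs: Run[] = []; // sorted, non-overlapping

  // O(number of runs) update at each ledger-close or backfill event.
  add(index: number): void {
    for (let i = 0; i < this.runs.length; i++) {
      const r = this.runs[i];
      if (index >= r.start && index < r.start + r.length) return; // already set
      if (index === r.start + r.length) {
        r.length++;
        const next = this.runs[i + 1]; // coalesce if we just closed a gap
        if (next && next.start === r.start + r.length) {
          r.length += next.length;
          this.runs.splice(i + 1, 1);
        }
        return;
      }
      if (index === r.start - 1) {
        r.start--;
        r.length++;
        return;
      }
    }
    this.runs.push({ start: index, length: 1 });
    this.runs.sort((a, b) => a.start - b.start);
  }

  // Persisted regularly so startup can trust the last flushed state
  // instead of re-scanning the database.
  persist(path: string): void {
    writeFileSync(path, JSON.stringify(this.runs));
  }
}

const set = new LedgerSet();
set.add(37145004); // called from the ledger-close path
set.persist("/var/lib/rippled/complete_ledgers.json"); // hypothetical path
```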

@professorhantzen
Author

professorhantzen commented Mar 14, 2018

The current "complete_ledgers" does what its designed to do, which as I understand it is to provide a reasonable enough estimate such that rippled can start up and get to work quickly. This made me think the existing problem is partly due to nomenclature. The existing value is not the complete_ledgers. It's an estimate.

Perhaps it would be better if "complete_ledgers" were called "complete_ledgers_estimate", or - if something providing an accurate measure is also implemented - "complete_ledgers" could be split in two:

complete_ledgers = { estimate: <current output>, actual: <some future more accurate output> }

Thus, any agent that relies on the value - such as ripple-lib - can switch to "complete_ledgers.actual", whereas rippled could continue to operate as normal on "complete_ledgers.estimate".

@sublimator
Contributor

sublimator commented Mar 14, 2018 via email

@professorhantzen
Author

Well, anything more thorough than the current method has the potential to take up to 500 times longer. At present, the fastest the above server appears to do the 1/500th indexing is about 31250 ledgers/sec (so 62.5 actual ledger checks per second) - this is on a pretty decent machine with SSDs benchmarked at 2-3Gb/sec reads. At that rate it might take 20 minutes to "index" a complete set of ledgers. Multiply that by 500 to check every ledger and it's looking like a one-week startup time. Obviously it's fair to assume any decent solution won't take 500 times as long, but it's also fair to assume any solution will necessarily fall somewhere between the existing time and that.
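Spelling that arithmetic out (the total ledger count here is an assumption, taken roughly from the ledger indexes in the original report):

```typescript
// Rough check of the startup-time arithmetic above.
const totalLedgers = 37_145_000;   // approx. full history as of Mar 2018 (assumed)
const indexedPerSec = 31_250;      // observed rate for the 1-in-500 scan
const estimateSecs = totalLedgers / indexedPerSec; // ≈ 1,189 s ≈ 20 minutes
const fullCheckSecs = estimateSecs * 500;          // ≈ 594,000 s ≈ 6.9 days
console.log((estimateSecs / 60).toFixed(0), "min;",
            (fullCheckSecs / 86_400).toFixed(1), "days");
```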

@mDuo13 mDuo13 added the Food for Thought Ideas, moonshots, etc. Stuff to think about broadly. Not necessarily directly actionable. label Aug 2, 2019
@carlhua
Contributor

carlhua commented Aug 6, 2020

Closing this issue - I think we have a sufficient explanation. If there are additional issues, please submit a new ticket.

@carlhua carlhua closed this as completed Aug 6, 2020