
Reporting Mode #3609

Closed — wants to merge 11 commits into from

Conversation

@cjcobb23 (Contributor) commented Sep 16, 2020

High Level Overview of Change

This PR adds a new operating mode to rippled called "Reporting Mode". When rippled runs in reporting mode, the server continuously extracts validated data from a rippled node as soon as ledgers are validated. The data is then written to Postgres and Cassandra (though NuDB is also an option, for systems with only a single reporting node). Multiple reporting nodes can share access to the same databases; at any given time, only one is writing to the databases, and the others simply read the data to handle RPCs. A reporting node will forward any RPCs that require access to the open ledger or the p2p network to a rippled node that is connected to the p2p network (known as a p2p node).

Travis is going to fail right now; we will sort that out as part of the review process. To build the code, configure cmake with -Dreporting=ON. Additional dependencies are required for Postgres and Cassandra. Postgres dependencies can be installed with the OS package manager. The Cassandra cpp driver must also be installed: https://github.com/datastax/cpp-driver. See the updated Linux build instructions for more details.
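The configure step described above might look like the following sketch. The -Dreporting=ON flag comes from this PR; the package name and build commands are assumptions that vary by platform:

```shell
# Install Postgres client development headers via the OS package manager
# (package name is an assumption; e.g. libpq-dev on Debian/Ubuntu).
sudo apt-get install -y libpq-dev

# The Cassandra cpp driver is built/installed separately per its own
# instructions: https://github.com/datastax/cpp-driver

# Configure and build rippled with reporting mode enabled
cmake -Dreporting=ON -DCMAKE_BUILD_TYPE=Release ..
cmake --build . --parallel
```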

This was a joint effort between @mtrippled and me. Keep in mind that the git blame might not always be correct, due to rebasing and squashing.

Context of Change

Updated spec is here: https://github.com/ripple/rippled-specs/pull/13.
All of the ETL related code is under src/ripple/app/reporting/.
Cassandra related code is in CassandraFactory.cpp
Postgres related code is in Pg.h and Pg.cpp
Several RPC calls have a new codepath when running in reporting mode, mainly account_tx, tx, ledger and tx_history.
Several new gRPC methods were added.
Code related to forwarding requests to a p2p node is spread throughout several places.

I know this patch is big, so a good place to start the review would be in the ReportingETL class, specifically with ReportingETL::runETLPipeline() or ReportingETL::monitor(). But feel free to start wherever you like.

Type of Change

  • [x] New feature (non-breaking change which adds functionality)
  • [x] Tests (tests added for existing code, or tests for the new feature included in this PR)
  • [x] Documentation updates

Test Plan

Unit tests were added for the new gRPC methods.
We will perform AB testing of all public RPC methods, verifying RPC results from a reporting node match RPC results from a regular rippled node.
We will also perform performance testing of Postgres and Cassandra with full history data.

Future Tasks

Transition from SHAMap (tree) to a flat map for the key-value store.
Put reporting mode in its own repo, separate from rippled.



@cjcobb23 (Contributor, Author):

We are currently rigorously testing Reporting Mode. If bugs are found during testing, I will push [FOLD] commits to fix those bugs.

@mtrippled (Collaborator) commented Sep 18, 2020

I created a patch to make the postgres connection pool work with Stoppable: mtrippled@3c434b9

Resource::Consumer
GRPCServerImpl::CallData<Request, Response>::getUsage()
bool
GRPCServerImpl::CallData<Request, Response>::wasForwarded()
Collaborator:

This implementation is different from that of secure_gateway for json-rpc and websocket. Namely, a port becomes UNLIMITED if there is a "User" field in the HTTP header. The presence of an IP address supplied by the client via the "Forwarded" or "X-Forwarded-For" headers means that the client of the proxy is charged any fees for usage. This is the role known as "PROXY".

To make gRPC consistent, I suggest adding a field to the protocol buffer message, such as "user", which, if set and coming from a secure_gateway host, will make the port UNLIMITED.

Contributor Author:

The way it works right now is that there is a field in each protobuf message called client_ip. If client_ip is set, then the usage fee is charged to the value of client_ip. If client_ip is not set and the request came from a secure_gateway host, then the client is considered UNLIMITED.
We can add a "User" field to the protobuf message, but I'm not sure how that would actually help or change anything from what we have now. As is, there is already a mechanism for bypassing usage fees via secure_gateway, as well as a mechanism for charging usage fees to the original client IP in the case of proxied RPCs. The role is set for each case. What would adding a "User" field to the message give us that we don't currently have?

Collaborator:

The User field was intended as a way to track activities in the log file by identified user. As implemented for gRPC, I left a comment about what happens if the client IP can't be figured out, which may lead to a port wrongly being interpreted as coming from secure_gateway. Other than that, the problem I see is a lack of consistency between gRPC and how the other port types behave with secure_gateway. But the intended result for this particular purpose (allowing the reporting server to perform ETL without resource limits) works.

Contributor Author:

The point you made in the other comment, about the client IP not being figured out, is actually a good one. OK, we can add the User field to the gRPC messages then, both for when the IP can't be figured out and for consistency.

Contributor Author:

I added a user field to the protobuf messages used for ETL to make this consistent with the json-rpc and websocket interfaces.

@@ -280,7 +280,8 @@ class ApplicationImp : public Application, public RootStoppable, public BasicApp
, pgPool_(
config_->reporting() ? make_PgPool(
config_->section("ledger_tx_tables"),
logs_->journal("PgPool"))
logs_->journal("PgPool"),
*this)
Contributor Author:

I think the parent should be ReportingETL actually. ReportingETL depends on PgPool, and needs to be stopped after PgPool is stopped. Application is already a Stoppable parent of ReportingETL. We should also update the diagram in Stoppable.h

Collaborator:

I'm OK with this; we also need to make sure that RPC works with this arrangement.

std::mutex mutex_;
std::condition_variable cond_;
std::size_t connections_{};
bool stop_{false};
Contributor Author:

I think this should be an atomic bool. This variable is read in checkout, which could be called from different threads concurrently.

Collaborator:

In checkout (and each of its other users), it's locked by mutex_ already.

Contributor Author:

Oh you're right. I had just seen the unprotected access at the top of checkout(), but it's checked under the mutex later down in that function, so I think it's fine to leave as a normal bool.

Contributor:

I'm concerned that this bool is read by Pg::query, seemingly outside the protection of the mutex. Pg::stop_ is a reference to PgPool::stop_.

Collaborator:

Fixed. Good catch.

auto res = beast::IP::Endpoint::from_string_checked(peer);
if (res)
return {res.get()};
else
Collaborator:

If the IP address can't be figured out and none is returned, then the port will become UNLIMITED, right, because no IP is set in the message.

Contributor Author:

Right, we can add a user field to the protobuf messages to make sure the port does not become UNLIMITED when the software fails to figure out the value of a non-empty client_ip.

@mtrippled (Collaborator):

Another commit with two cleanups: mtrippled@1d62fd3

@mtrippled (Collaborator):

Here's a fix that cleans up after errors found in bulk import: mtrippled@ecba03f

ret[jss::node_reads_hit] = app.getNodeStore().getFetchHitCount();
ret[jss::node_written_bytes] = app.getNodeStore().getStoreSize();
ret[jss::node_read_bytes] = app.getNodeStore().getFetchSize();
ret[jss::nodestore] = app.getNodeStore().getCountsJson();
Collaborator:

This changes the output formatting of the get_counts RPC call to have a sub-object called "nodestore" that contains nodestore counters. The function Database::getCountsJson() was created to allow both get_counts and server_info counters to access this data. The following commit reverts get_counts so that the counters are no longer in a new object.
mtrippled@3faaa43

@mtrippled (Collaborator):

@mDuo13 This PR definitely changes things from a published API standpoint.

@HowardHinnant (Contributor):

I'm trying to build on macOS and am so far not able to.

CMake Error at /Applications/CMake.app/Contents/share/cmake-3.15/Modules/FindPackageHandleStandardArgs.cmake:137 (message):
  Could NOT find PostgreSQL (missing: PostgreSQL_INCLUDE_DIR
  PostgreSQL_TYPE_INCLUDE_DIR)

I tried following the new directions in linux/README

Packages for the cpp driver are available on some operating systems, and are available at
[https://downloads.datastax.com/cpp-driver/](https://downloads.datastax.com/cpp-driver/).

But this URL leads me to:

You tried to access a web page but it couldn't be found; it may have moved, been deleted or never existed in the first place.

@ximinez (Collaborator) commented Jan 13, 2021

> I'm having trouble reproducing the unittest failures. I see a log message that says "Address already in use". Are you possibly running a rippled server at the same time as the unit tests? If the unit test is trying to use a port that the server is already using, this type of failure will occur.

@cjcobb23 Nope and nope. This is just rippled running the tests by itself. It is using 16 or 24 jobs, but only one rippled instance.

@cjcobb23 (Contributor, Author) commented Jan 13, 2021

> I'm having trouble reproducing the unittest failures. I see a log message that says "Address already in use". Are you possibly running a rippled server at the same time as the unit tests? If the unit test is trying to use a port that the server is already using, this type of failure will occur.
>
> @cjcobb23 Nope and nope. This is just rippled running the tests by itself. It is using 16 or 24 jobs, but only one rippled instance.

Do you have any idea why the logs are outputting "address already in use" then? And can you confirm that this failure does not occur on develop?

@cjcobb23 (Contributor, Author) commented Jan 13, 2021

> I'm having trouble reproducing the unittest failures. I see a log message that says "Address already in use". Are you possibly running a rippled server at the same time as the unit tests? If the unit test is trying to use a port that the server is already using, this type of failure will occur.
>
> @cjcobb23 Nope and nope. This is just rippled running the tests by itself. It is using 16 or 24 jobs, but only one rippled instance.

@ximinez Are these failures occurring on windows or linux (or both)?

@ximinez (Collaborator) commented Jan 13, 2021

> Do you have any idea why the logs are outputting "address already in use" then? And can you confirm that this failure does not occur on develop?

@cjcobb23 I don't know why this is happening. My first guess is that one of the jobs is not shutting down correctly, or a new test has a hard-coded port somewhere that is being run in different jobs. I had just started testing develop before I saw this message. 😀

> Are these failures occurring on windows or linux (or both)?

Both.

@cjcobb23 (Contributor, Author):

@ximinez Can you share more details about how intermittent these failures are? Does this failure occur every few times you run the tests: 1 in 10 times, 1 in 50, etc.? I've run the unit tests about a dozen times today and have not been able to reproduce it.

@cjcobb23 (Contributor, Author) commented Jan 13, 2021

> @cjcobb23 I don't know why this is happening. My first guess is that one of the jobs is not shutting down correctly, or a new test has a hard-coded port somewhere that is being run in different jobs. I had just started testing develop before I saw this message. 😀

@ximinez The log message you posted says bind port failed for port_alt. port_alt is only mentioned here: https://github.com/cjcobb23/rippled/blob/08848f1c54905c9185865c2117652c84b5b88b8b/src/test/server/ServerStatus_test.cpp#L83
The port number is hardcoded. Maybe this is the culprit? That line is not part of this changeset though.

@ximinez (Collaborator) commented Jan 13, 2021

@cjcobb23 Yeah, that port definitely raises a red flag, and it's definitely not new, but it only exists in one test, which should be restricted to one job. Someone will have to dig in to that. My best guess is that the previous test is not releasing the port before the next test tries to open it. Or maybe the random port assignment just happens to grab that same port every once in a while.

The good news for you and this PR is that the problem does occur with the current develop branch, so there's no reason for it to hold this up.

@ximinez (Collaborator) left a comment:

Things are looking good so far. I'm going to tentatively approve, pending the results of running the tests overnight.

@ximinez (Collaborator) commented Jan 14, 2021

Follow up: Of 126 run sets overnight, there were 32 that had one or more failures. All of those failures were in tests that involve downloads, so I don't know how it could be related to Reporting. I'll run more tests today with the develop branch just to be sure.

$ grep -H  -e "1 failure" -e "failures" -e "failed to complete" -e "failed" build/testrun.* |grep -v -e "0 failures" -e "ERR:Env Env::close() failed: null" | cut -d: -f1 | sort -u | wc
     32      32     800
$ ls -1 build/testrun.* | wc
    126     126    3150
$ grep -H -e "#.*failed" -e "failed to complete" $( grep -H  -e "1 failure" -e "failures" -e "failed to complete" -e "failed" build/testrun.* |grep -v -e "0 failures" -e "ERR:Env Env::close() failed: null" | cut -d: -f1 | sort -u  ) | cut -d: -f3- | sort | uniq -c
     17  DatabaseDownloader_test.cpp(148)
     10  DatabaseDownloader_test.cpp(190)
      1  DatabaseDownloader_test.cpp(192)
      9  DatabaseDownloader_test.cpp(195)
      6  DatabaseDownloader_test.cpp(214)
      4  DatabaseDownloader_test.cpp(216)
      2  DatabaseDownloader_test.cpp(219)
      1  ShardArchiveHandler_test.cpp(243)
      1  ShardArchiveHandler_test.cpp(250)
      1  ShardArchiveHandler_test.cpp(347)
      1  ShardArchiveHandler_test.cpp(354)
      1  unhandled exception: boost::filesystem::copy_file: File exists: "/tmp/36d7-1620-3642-d9a8/state.db", "/tmp/36d7-1620-3642-d9a8/download/state.db"

@ximinez (Collaborator) commented Jan 15, 2021

After running develop overnight, it looks like these intermittent failures predate this PR, so you're off the hook. 😄

$ grep -H  -e "1 failure" -e "failures" -e "failed to complete" -e "failed" build/testrun.* |grep -v -e "0 failures" -e "ERR:Env Env::close() failed: null" | cut -d: -f1 | sort -u | wc
     27      27     675
$ ls -1 build/testrun.* | wc
    128     128    3193
$ grep -H -e "#.*failed" -e "failed to complete" $( grep -H  -e "1 failure" -e "failures" -e "failed to complete" -e "failed" build/testrun.* |grep -v -e "0 failures" -e "ERR:Env Env::close() failed: null" | cut -d: -f1 | sort -u  ) | cut -d: -f3- | sort | uniq -c
     10  DatabaseDownloader_test.cpp(148)
      8  DatabaseDownloader_test.cpp(190)
      3  DatabaseDownloader_test.cpp(192)
      5  DatabaseDownloader_test.cpp(195)
      8  DatabaseDownloader_test.cpp(214)
      4  DatabaseDownloader_test.cpp(216)
      4  DatabaseDownloader_test.cpp(219)
      1  ShardArchiveHandler_test.cpp(243)
      1  ShardArchiveHandler_test.cpp(250)

Though it would be good for someone to look into these, too.

@manojsdoshi mentioned this pull request Jan 15, 2021
@codecov-io:
Codecov Report

Merging #3609 (3970344) into develop (55dc7a2) will not change coverage.
The diff coverage is n/a.


@manojsdoshi mentioned this pull request Jan 18, 2021
This was referenced Jan 20, 2021
Labels: Documentation (README changes, code comments, etc.); Ready to merge (PR author thinks it's ready to merge; has passed code review; perf sign-off may still be required).