
Reporting Mode #3609

Closed — wants to merge 11 commits into from

Conversation

@cjcobb23 (Contributor) commented Sep 16, 2020

High Level Overview of Change

This PR adds a new operating mode to rippled called "Reporting Mode". When rippled runs in reporting mode, the server continuously extracts validated data from a rippled node as soon as ledgers are validated. The data is then written to Postgres and Cassandra (though NuDB is also an option, for systems with only a single reporting node). Multiple reporting nodes can share access to the same databases; at any given time, only one is writing to the databases, and the others simply read the data to handle RPCs. A reporting node will forward any RPCs that require access to the open ledger or the p2p network to a rippled node that is connected to the p2p network (known as a p2p node).

Travis is going to fail right now; we will sort that out as part of the review process. To build the code, configure cmake with -Dreporting=ON. Additional dependencies are required for Postgres and Cassandra. Postgres dependencies can be installed with the OS package manager. The Cassandra cpp driver must also be installed: https://github.com/datastax/cpp-driver. See the updated Linux build instructions for more details.
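The configure step described above might look like the following sketch. The -Dreporting=ON flag comes from this PR; the package name and build commands are assumptions that vary by platform:

```shell
# Install Postgres client development headers via the OS package manager
# (package name is an assumption; e.g. libpq-dev on Debian/Ubuntu).
sudo apt-get install -y libpq-dev

# The Cassandra cpp driver is built/installed separately per its own
# instructions: https://github.com/datastax/cpp-driver

# Configure and build rippled with reporting mode enabled
cmake -Dreporting=ON -DCMAKE_BUILD_TYPE=Release ..
cmake --build . --parallel
```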

This was a joint effort between @mtrippled and me. Keep in mind that the git blame might not always be correct, due to rebasing and squashing.

Context of Change

Updated spec is here: https://github.com/ripple/rippled-specs/pull/13.
All of the ETL related code is under src/ripple/app/reporting/.
Cassandra related code is in CassandraFactory.cpp
Postgres related code is in Pg.h and Pg.cpp
Several RPC calls have a new codepath when running in reporting mode, mainly account_tx, tx, ledger and tx_history.
Several new gRPC methods were added.
Code related to forwarding requests to a p2p node is spread throughout several places.

I know this patch is big, so a good place to start the review would be in the ReportingETL class, specifically with ReportingETL::runETLPipeline() or ReportingETL::monitor(). But feel free to start wherever you like.

Type of Change

  • [x] New feature (non-breaking change which adds functionality)
  • [x] Tests (tests added for existing code, or tests for the new feature included in this PR)
  • [x] Documentation updates

Test Plan

Unit tests were added for the new gRPC methods.
We will perform AB testing of all public RPC methods, verifying RPC results from a reporting node match RPC results from a regular rippled node.
We will also perform performance testing of Postgres and Cassandra with full history data.

Future Tasks

Transition from SHAMap (tree) to a flat map for the key-value store.
Put reporting mode in its own repo, separate from rippled.



@cjcobb23 (Contributor, Author):

We are currently rigorously testing Reporting Mode. If bugs are found during testing, I will push [FOLD] commits to fix those bugs.

@mtrippled (Collaborator) commented Sep 18, 2020

I created a patch to make the postgres connection pool work with Stoppable: mtrippled@3c434b9

Resource::Consumer
GRPCServerImpl::CallData<Request, Response>::getUsage()
bool
GRPCServerImpl::CallData<Request, Response>::wasForwarded()
Collaborator:

This implementation is different from that of secure_gateway for json-rpc and websocket. Namely, a port becomes UNLIMITED if there is a "User" field in the HTTP header. The presence of an IP address supplied by the client via the "Forwarded" or "X-Forwarded-For" headers means that the client of the proxy is charged any fees for usage. This is the role known as "PROXY".

To make gRPC consistent, I suggest adding a field to the protocol buffer message, such as "user", which, if set and coming from a secure_gateway host, will make the port UNLIMITED.

Contributor Author:

The way it works right now is that there is a field in each protobuf message called client_ip. If client_ip is set, then the usage fee is charged to the value of client_ip. If client_ip is not set and the request came from a secure_gateway host, then the client is considered UNLIMITED.
We can add a "User" field to the protobuf message, but I'm not sure how that would actually help or change anything from what we have now. As is, there is already a mechanism for bypassing usage fees via secure_gateway, as well as a mechanism for charging usage fees to the original client IP in the case of proxied RPCs. The role is set for each case. What would adding a "User" field to the message give us that we don't currently have?

Collaborator:

The User field was intended as a way to track activities in the log file by identified user. As implemented for gRPC, I left a comment about what happens if the client IP can't be figured out, which may lead to a port wrongly being interpreted as coming from secure_gateway. Other than that, the problem I see is a lack of consistency between gRPC and how the other port types behave with secure_gateway. But the intended result for this particular purpose (allowing the reporting server to perform ETL without resource limits) works.

Contributor Author:

The point you made in the other comment, about the client IP not being figured out, is actually a good one. OK, we can add the User field to the gRPC messages then, both for when the IP can't be figured out and for consistency.

Contributor Author:

I added a user field to the protobuf messages used for ETL to make this consistent with the json-rpc and websocket interfaces.

@@ -280,7 +280,8 @@ class ApplicationImp : public Application, public RootStoppable, public BasicApp
, pgPool_(
config_->reporting() ? make_PgPool(
config_->section("ledger_tx_tables"),
logs_->journal("PgPool"))
logs_->journal("PgPool"),
*this)
Contributor Author:

I think the parent should be ReportingETL actually. ReportingETL depends on PgPool, and needs to be stopped after PgPool is stopped. Application is already a Stoppable parent of ReportingETL. We should also update the diagram in Stoppable.h

Collaborator:

I'm OK with this; we also need to make sure that RPC works with this arrangement.

std::mutex mutex_;
std::condition_variable cond_;
std::size_t connections_{};
bool stop_{false};
Contributor Author:

I think this should be an atomic bool. This variable is read in checkout, which could be called from different threads concurrently.

Collaborator:

In checkout (and each of its other users), it's locked by mutex_ already.

Contributor Author:

Oh you're right. I had just seen the unprotected access at the top of checkout(), but it's checked under the mutex later down in that function, so I think it's fine to leave as a normal bool.

Contributor:

I'm concerned that this bool is read by Pg::query, seemingly outside the protection of the mutex. Pg::stop_ is a reference to PgPool::stop_.

Collaborator:

Fixed. Good catch.

auto res = beast::IP::Endpoint::from_string_checked(peer);
if (res)
return {res.get()};
else
Collaborator:

If the IP address can't be figured out and none is returned, then the port will become UNLIMITED, right, because no IP is set in the message.

Contributor Author:

Right, we can add a user field to the protobuf messages to make sure the port does not become UNLIMITED when the software fails to figure out the value of a non-empty client_ip.

@mtrippled (Collaborator):

Another commit with two cleanups: mtrippled@1d62fd3

@mtrippled (Collaborator):

Here's a fix that cleans up after errors found in bulk import: mtrippled@ecba03f

ret[jss::node_reads_hit] = app.getNodeStore().getFetchHitCount();
ret[jss::node_written_bytes] = app.getNodeStore().getStoreSize();
ret[jss::node_read_bytes] = app.getNodeStore().getFetchSize();
ret[jss::nodestore] = app.getNodeStore().getCountsJson();
Collaborator:

This changes the output formatting of the get_counts RPC call to have a sub-object called "nodestore" that contains nodestore counters. The function Database::getCountsJson() was created to allow both get_counts and server_info counters to access this data. The following commit reverts get_counts so that the counters are no longer in a new object.
mtrippled@3faaa43

@mtrippled (Collaborator):

@mDuo13 This PR definitely changes things from a published API standpoint.

@HowardHinnant (Contributor):

I'm trying to build on macOS and am so far not able to.

CMake Error at /Applications/CMake.app/Contents/share/cmake-3.15/Modules/FindPackageHandleStandardArgs.cmake:137 (message):
  Could NOT find PostgreSQL (missing: PostgreSQL_INCLUDE_DIR
  PostgreSQL_TYPE_INCLUDE_DIR)

I tried following the new directions in linux/README

Packages for the cpp driver are available on some operating systems, and are available at
[https://downloads.datastax.com/cpp-driver/](https://downloads.datastax.com/cpp-driver/).

But this URL leads me to:

You tried to access a web page but it couldn't be found; it may have moved, been deleted or never existed in the first place.

@ximinez (Collaborator) commented Jan 13, 2021

> I'm having trouble reproducing the unittest failures. I see a log message that says "Address already in use". Are you possibly running a rippled server at the same time as the unit tests? If the unit test is trying to use a port that the server is already using, this type of failure will occur.

@cjcobb23 Nope and nope. This is just rippled running the tests by itself. It is using 16 or 24 jobs, but only one rippled instance.

@cjcobb23 (Contributor, Author) commented Jan 13, 2021

> I'm having trouble reproducing the unittest failures. I see a log message that says "Address already in use". Are you possibly running a rippled server at the same time as the unit tests? If the unit test is trying to use a port that the server is already using, this type of failure will occur.
>
> @cjcobb23 Nope and nope. This is just rippled running the tests by itself. It is using 16 or 24 jobs, but only one rippled instance.

Do you have any idea why the logs are outputting "address already in use" then? And can you confirm that this failure does not occur on develop?

@cjcobb23 (Contributor, Author) commented Jan 13, 2021

> I'm having trouble reproducing the unittest failures. I see a log message that says "Address already in use". Are you possibly running a rippled server at the same time as the unit tests? If the unit test is trying to use a port that the server is already using, this type of failure will occur.
>
> @cjcobb23 Nope and nope. This is just rippled running the tests by itself. It is using 16 or 24 jobs, but only one rippled instance.

@ximinez Are these failures occurring on windows or linux (or both)?

@ximinez (Collaborator) commented Jan 13, 2021

> Do you have any idea why the logs are outputting "address already in use" then? And can you confirm that this failure does not occur on develop?

@cjcobb23 I don't know why this is happening. My first guess is that one of the jobs is not shutting down correctly, or a new test has a hard-coded port somewhere that is being run in different jobs. I had just started testing develop before I saw this message. 😀

> Are these failures occurring on windows or linux (or both)?

Both.

@cjcobb23 (Contributor, Author):

@ximinez Can you share more details about how intermittent these failures are? Does this failure occur every few times you run the tests: 1 in 10 times, 1 in 50, etc.? I've run the unit tests about a dozen times today and have not been able to reproduce it.

@cjcobb23 (Contributor, Author) commented Jan 13, 2021

> @cjcobb23 I don't know why this is happening. My first guess is that one of the jobs is not shutting down correctly, or a new test has a hard-coded port somewhere that is being run in different jobs. I had just started testing develop before I saw this message. 😀

@ximinez The log message you posted says bind port failed for port_alt. port_alt is only mentioned here: https://github.com/cjcobb23/rippled/blob/08848f1c54905c9185865c2117652c84b5b88b8b/src/test/server/ServerStatus_test.cpp#L83
The port number is hardcoded. Maybe this is the culprit? That line is not part of this changeset though.

@ximinez (Collaborator) commented Jan 13, 2021

@cjcobb23 Yeah, that port definitely raises a red flag, and it's definitely not new, but it only exists in one test, which should be restricted to one job. Someone will have to dig in to that. My best guess is that the previous test is not releasing the port before the next test tries to open it. Or maybe the random port assignment just happens to grab that same port every once in a while.

The good news for you and this PR is that the problem does occur with the current develop branch, so there's no reason for it to hold this up.

@ximinez (Collaborator) left a comment:

Things are looking good so far. I'm going to tentatively approve, pending the results of running the tests overnight.

@ximinez (Collaborator) commented Jan 14, 2021

Follow up: Of 126 run sets overnight, there were 32 that had one or more failures. All of those failures were in tests that involve downloads, so I don't know how it could be related to Reporting. I'll run more tests today with the develop branch just to be sure.

$ grep -H  -e "1 failure" -e "failures" -e "failed to complete" -e "failed" build/testrun.* |grep -v -e "0 failures" -e "ERR:Env Env::close() failed: null" | cut -d: -f1 | sort -u | wc
     32      32     800
$ ls -1 build/testrun.* | wc
    126     126    3150
$ grep -H -e "#.*failed" -e "failed to complete" $( grep -H  -e "1 failure" -e "failures" -e "failed to complete" -e "failed" build/testrun.* |grep -v -e "0 failures" -e "ERR:Env Env::close() failed: null" | cut -d: -f1 | sort -u  ) | cut -d: -f3- | sort | uniq -c
     17  DatabaseDownloader_test.cpp(148)
     10  DatabaseDownloader_test.cpp(190)
      1  DatabaseDownloader_test.cpp(192)
      9  DatabaseDownloader_test.cpp(195)
      6  DatabaseDownloader_test.cpp(214)
      4  DatabaseDownloader_test.cpp(216)
      2  DatabaseDownloader_test.cpp(219)
      1  ShardArchiveHandler_test.cpp(243)
      1  ShardArchiveHandler_test.cpp(250)
      1  ShardArchiveHandler_test.cpp(347)
      1  ShardArchiveHandler_test.cpp(354)
      1  unhandled exception: boost::filesystem::copy_file: File exists: "/tmp/36d7-1620-3642-d9a8/state.db", "/tmp/36d7-1620-3642-d9a8/download/state.db"

@ximinez (Collaborator) commented Jan 15, 2021

After running develop overnight, it looks like these intermittent failures predate this PR, so you're off the hook. 😄

$ grep -H  -e "1 failure" -e "failures" -e "failed to complete" -e "failed" build/testrun.* |grep -v -e "0 failures" -e "ERR:Env Env::close() failed: null" | cut -d: -f1 | sort -u | wc
     27      27     675
$ ls -1 build/testrun.* | wc
    128     128    3193
$ grep -H -e "#.*failed" -e "failed to complete" $( grep -H  -e "1 failure" -e "failures" -e "failed to complete" -e "failed" build/testrun.* |grep -v -e "0 failures" -e "ERR:Env Env::close() failed: null" | cut -d: -f1 | sort -u  ) | cut -d: -f3- | sort | uniq -c
     10  DatabaseDownloader_test.cpp(148)
      8  DatabaseDownloader_test.cpp(190)
      3  DatabaseDownloader_test.cpp(192)
      5  DatabaseDownloader_test.cpp(195)
      8  DatabaseDownloader_test.cpp(214)
      4  DatabaseDownloader_test.cpp(216)
      4  DatabaseDownloader_test.cpp(219)
      1  ShardArchiveHandler_test.cpp(243)
      1  ShardArchiveHandler_test.cpp(250)

Though it would be good for someone to look into these, too.

@manojsdoshi mentioned this pull request Jan 15, 2021
@codecov-io:
Codecov Report

Merging #3609 (3970344) into develop (55dc7a2) will not change coverage.
The diff coverage is n/a.


@manojsdoshi mentioned this pull request Jan 18, 2021
This was referenced Jan 20, 2021
Labels: Documentation (README changes, code comments, etc.); Ready to merge (PR author thinks it's ready to merge; has passed code review; perf sign-off may still be required).