Reporting Mode #3609
Conversation
We are currently rigorously testing Reporting Mode. If bugs are found during testing, I will push fixes.
I created a patch to make the postgres connection pool work with Stoppable: mtrippled@3c434b9
Resource::Consumer
GRPCServerImpl::CallData<Request, Response>::getUsage()

bool
GRPCServerImpl::CallData<Request, Response>::wasForwarded()
This implementation is different from that of secure_gateway for json-rpc and websocket. Namely, a port becomes UNLIMITED if there is a "User" field in the HTTP header. The presence of an IP address supplied by the client via the "Forwarded" or "X-Forwarded-For" headers means that the client of the proxy is charged any fees for usage. This is the role known as "PROXY".
To make grpc consistent, I suggest adding a field to the protocol buffer message, such as "user", which, if set and coming from a secure_gateway host, will make the port UNLIMITED.
The way it works right now is there is a field in each protobuf message called client_ip. If client_ip is set, then the usage fee is charged to the value of client_ip. If client_ip is not set and the request came from a secure_gateway host, then the client is considered UNLIMITED.
We can add a "User" field to the protobuf message, but I'm not sure how that would actually help or change anything from what we have now. As is, there is already a mechanism for bypassing usage fees via secure_gateway, as well as a mechanism for charging usage fees to the original client ip in the case of proxied RPCs. The role is set for each case. What would adding a "User" field to the message give us that we don't currently have?
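To spell out the decision described above, here is a rough sketch. This is illustrative only, not the PR's GRPCServer code; the Role names, the UsageDecision struct, and the parameters are placeholders:

```cpp
#include <string>

// Sketch of the charging decision described above; not the PR's actual code.
enum class Role { Proxy, Unlimited, Direct };

struct UsageDecision
{
    Role role;
    std::string chargedAddress;  // which address, if any, fees are charged to
};

UsageDecision
decideUsage(
    std::string const& clientIp,     // client_ip field from the protobuf message
    std::string const& peerAddress,  // address of the directly connected peer
    bool peerIsSecureGateway)        // peer is a configured secure_gateway host
{
    if (!clientIp.empty())
        // Proxied request: the usage fee is charged to the original client.
        return {Role::Proxy, clientIp};

    if (peerIsSecureGateway)
        // No client_ip and the request came from a secure_gateway host:
        // the caller is treated as UNLIMITED (no resource charging).
        return {Role::Unlimited, {}};

    // Otherwise charge the directly connected peer.
    return {Role::Direct, peerAddress};
}
```

The sketch deliberately leaves out the case where the peer address itself cannot be parsed, which comes up below.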
The User field was intended as a way to track activities in the log file by identified user. As implemented for grpc, I left a comment about what happens if the client IP can't be determined, which may lead to a port wrongly being interpreted as coming from secure_gateway. Other than that, the problem I see is a lack of consistency between grpc and how the other port types behave with secure_gateway. But the intended result for this particular purpose (allowing the reporting server to perform ETL without resource limits) works.
What you left in the other comment, about the client IP not being figured out, is a good point actually. OK, we can add the User field to the gRPC messages then, both for when the IP can't be determined and for consistency.
I added a user field to the protobuf messages used for ETL to make this consistent with the json-rpc and websocket interfaces.
src/ripple/app/main/Application.cpp
Outdated
@@ -280,7 +280,8 @@ class ApplicationImp : public Application, public RootStoppable, public BasicApp
, pgPool_(
config_->reporting() ? make_PgPool(
config_->section("ledger_tx_tables"),
logs_->journal("PgPool"))
logs_->journal("PgPool"),
*this)
I think the parent should be ReportingETL actually. ReportingETL depends on PgPool, and needs to be stopped after PgPool is stopped. Application is already a Stoppable parent of ReportingETL. We should also update the diagram in Stoppable.h.
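For illustration, the parent/child relationship and stop ordering being suggested look roughly like the following. This is a minimal, self-contained sketch, not rippled's actual Stoppable implementation; it only demonstrates that a child (PgPool) is stopped before its parent (ReportingETL):

```cpp
#include <iostream>
#include <string>
#include <utility>
#include <vector>

// Toy stand-in for a Stoppable-style parent/child tree. The real rippled code
// differs; this only illustrates the ordering described above.
class ToyStoppable
{
public:
    explicit ToyStoppable(std::string name, ToyStoppable* parent = nullptr)
        : name_(std::move(name))
    {
        if (parent)
            parent->children_.push_back(this);
    }

    // Stop children first, then this object, so PgPool (child) finishes
    // stopping before ReportingETL (parent) does.
    void stop()
    {
        for (auto* child : children_)
            child->stop();
        std::cout << "stopped " << name_ << '\n';
    }

private:
    std::string name_;
    std::vector<ToyStoppable*> children_;
};

int main()
{
    ToyStoppable app{"Application"};
    ToyStoppable etl{"ReportingETL", &app};
    ToyStoppable pool{"PgPool", &etl};  // proposed: PgPool parented to ReportingETL
    app.stop();  // prints: stopped PgPool, stopped ReportingETL, stopped Application
}
```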
I'm OK with this; we also need to make sure that RPC works with this arrangement.
std::mutex mutex_;
std::condition_variable cond_;
std::size_t connections_{};
bool stop_{false};
I think this should be an atomic bool. This variable is read in checkout, which could be called from different threads concurrently.
In checkout (and each of its other users), it's locked by mutex_ already.
Oh you're right. I had just seen the unprotected access at the top of checkout(), but it's checked under the mutex later down in that function, so I think it's fine to leave as a normal bool.
I'm concerned that this bool is read by Pg::query, seemingly outside the protection of the mutex. Pg::stop_ is a reference to PgPool::stop_.
Fixed. Good catch.
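For reference, the pattern settled on above looks roughly like this. It is a simplified sketch, not the PR's Pg/PgPool code; the class and member names are illustrative:

```cpp
#include <condition_variable>
#include <cstddef>
#include <mutex>

// Simplified sketch of a plain (non-atomic) stop flag that is safe because
// every read and write of stop_ happens while mutex_ is held.
class ToyPool
{
    std::mutex mutex_;
    std::condition_variable cond_;
    std::size_t connections_{};
    bool stop_{false};

public:
    // Returns false once the pool has been told to stop.
    bool
    checkout()
    {
        std::lock_guard<std::mutex> lock(mutex_);
        if (stop_)  // read under the mutex, so a plain bool suffices
            return false;
        ++connections_;
        return true;
    }

    void
    checkin()
    {
        std::lock_guard<std::mutex> lock(mutex_);
        if (connections_)
            --connections_;
        cond_.notify_all();
    }

    // Signals stop and waits until all checked-out connections are returned.
    void
    stop()
    {
        std::unique_lock<std::mutex> lock(mutex_);
        stop_ = true;
        cond_.wait(lock, [this] { return connections_ == 0; });
    }
};
```

The concern about Pg::query arises when a reference to the flag escapes the class and is read without the mutex; keeping every access behind mutex_ (or switching to std::atomic<bool>) avoids that.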
src/ripple/app/main/GRPCServer.cpp
Outdated
auto res = beast::IP::Endpoint::from_string_checked(peer);
if (res)
    return {res.get()};
else
If the IP address can't be figured out and none is returned, then the port will become UNLIMITED, right, because no IP is set in the message.
Right, we can add a user field to the protobuf messages to make sure the port does not become UNLIMITED when the software fails to figure out the value of a non-empty client_ip.
Another commit: mtrippled@1d62fd3
Here's a fix that cleans up after errors found in bulk import: mtrippled@ecba03f
ret[jss::node_reads_hit] = app.getNodeStore().getFetchHitCount();
ret[jss::node_written_bytes] = app.getNodeStore().getStoreSize();
ret[jss::node_read_bytes] = app.getNodeStore().getFetchSize();
ret[jss::nodestore] = app.getNodeStore().getCountsJson();
This changes the output formatting of the get_counts RPC call to have a sub-object called "nodestore" that contains the nodestore counters. The function Database::getCountsJson() was created to allow both get_counts and server_info to access this data. The following commit reverts get_counts so that the counters are no longer in a new object.
mtrippled@3faaa43
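In outline, the shape of that refactor is roughly the following. This is a sketch, not the actual diff; the Json::Value usage and jss:: field names follow the snippet above, while the signatures and helper functions are assumptions:

```cpp
// Sketch only: one accessor for the nodestore counters that both RPC handlers
// can reuse. The real getCountsJson() signature and contents may differ.
Json::Value
NodeStore::Database::getCountsJson()
{
    Json::Value counts(Json::objectValue);
    counts[jss::node_reads_hit] = getFetchHitCount();
    counts[jss::node_written_bytes] = getStoreSize();
    counts[jss::node_read_bytes] = getFetchSize();
    return counts;
}

// get_counts: copy the counters back to the top level, preserving the
// published output format (this is what the revert commit above restores).
void
addNodeStoreCounts(Json::Value& ret, Application& app)
{
    Json::Value const counts = app.getNodeStore().getCountsJson();
    for (auto it = counts.begin(); it != counts.end(); ++it)
        ret[it.key().asString()] = *it;
}

// server_info: the same counters can live under their own sub-object.
void
addNodeStoreInfo(Json::Value& info, Application& app)
{
    info[jss::nodestore] = app.getNodeStore().getCountsJson();
}
```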
@mDuo13 This PR definitely changes things from a published API standpoint.
I'm trying to build on macOS and am so far not able to. I tried following the new directions in linux/README, but this URL leads me to:
@cjcobb23 Nope and nope. This is just rippled running the tests by itself. It is using 16 or 24 jobs, but only one rippled instance.
Do you have any idea why the logs are outputting "address already in use" then? And can you confirm that this failure does not occur on develop?
@ximinez Are these failures occurring on Windows or Linux (or both)?
@cjcobb23 I don't know why this is happening. My first guess is that one of the jobs is not shutting down correctly, or a new test has a hard-coded port somewhere that is being run in different jobs. I just started testing
Both.
@ximinez can you share more details about how intermittent these failures are? Does this failure occur every few times you run the tests, 1 in 10 times, 1 in 50, etc.? I've run the unit tests about a dozen times today and have not been able to reproduce.
@ximinez The log message you posted says bind port failed for
@cjcobb23 Yeah, that port definitely raises a red flag, and it's definitely not new, but it only exists in one test, which should be restricted to one job. Someone will have to dig into that. My best guess is that the previous test is not releasing the port before the next test tries to open it. Or maybe the random port assignment just happens to grab that same port every once in a while. The good news for you and this PR is that the problem does occur with the current develop branch, so there's no reason for it to hold this up.
Things are looking good so far. I'm going to tentatively approve, pending the results of running the tests overnight.
Follow-up: Of 126 run sets overnight, there were 32 that had one or more failures. All of those failures were in tests that involve downloads, so I don't know how it could be related to Reporting. I'll run more tests today with the
After running develop overnight, it looks like these intermittent failures predate this PR, so you're off the hook. 😄
Though it would be good for someone to look into these, too.
Codecov Report
[Coverage diff table comparing develop and #3609 omitted.]
Continue to review full report at Codecov.
High Level Overview of Change
This PR adds a new operating mode to rippled called "Reporting Mode". When rippled runs in reporting mode, the server will continuously extract validated data from a rippled node, as soon as ledgers are validated. The data is then written to Postgres and Cassandra (though NuDB is an option as well, for systems with only a single reporting node). Multiple reporting nodes can share access to the same databases; at any given time, only one is writing to the database, and the others simply read the data to handle RPCs. A reporting node will forward any RPCs that require access to the open ledger or the p2p network to a rippled node that is connected to the p2p network (known as a p2p node).
Travis is going to fail right now; we will sort that out as part of the review process. To build the code, configure cmake with -Dreporting=ON. Additional dependencies are required for Postgres and Cassandra. Postgres dependencies can be installed with the OS package manager. The Cassandra cpp driver must also be installed: https://github.com/datastax/cpp-driver. See the updated Linux build instructions for more details.
This was a joint effort between @mtrippled and me. Keep in mind that the git blame might not always be correct, due to rebasing and squashing.
Context of Change
Updated spec is here: https://github.com/ripple/rippled-specs/pull/13.
All of the ETL related code is under src/ripple/app/reporting/.
Cassandra-related code is in CassandraFactory.cpp
Postgres-related code is in Pg.h and Pg.cpp
Several RPC calls have a new codepath when running in reporting mode, mainly account_tx, tx, ledger and tx_history.
Several new gRPC methods were added.
Code related to forwarding requests to a p2p node is spread throughout several places.
I know this patch is big, so a good place to start the review would be in the ReportingETL class, specifically with ReportingETL::runETLPipeline() or ReportingETL::monitor(). But feel free to start wherever you like.
Type of Change
Test Plan
Unit tests were added for the new gRPC methods.
We will perform A/B testing of all public RPC methods, verifying that RPC results from a reporting node match RPC results from a regular rippled node.
We will also perform performance testing of Postgres and Cassandra with full history data.
Future Tasks
Transition from SHAMap (tree) to a flat map for the key-value store.
Put reporting mode in its own repo, separate from rippled.