
Payment couldn't be completed because: there is no route available PFS5 #4927

Closed
agatsoh opened this issue Sep 13, 2019 · 21 comments
@agatsoh
Contributor

agatsoh commented Sep 13, 2019

Description

It seems the node is not getting any routes from the PFS for the target node.

f'HTTP status code "{resp.status_code}" while fetching {url}. '
scenario_player.exceptions.legacy.RESTAPIStatusMismatchError: HTTP status code "409" while fetching http://127.0.0.1:41163/api/v1/payments/0x62083c80353Df771426D209eF578619EE68D5C7A/0x536b7368575301950A00E0B945f75D22AD64534e. Expected 200: {"errors": "Payment couldn't be completed because: there is no route available"}
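For reference, a minimal sketch of the request that produced the 409 above, reconstructed from the error URL; the amount is a placeholder, not the value used in the scenario:

import requests

# Addresses and port copied from the failing request above; the port changes on
# every scenario run. The amount is a placeholder, not the scenario's value.
url = (
    "http://127.0.0.1:41163/api/v1/payments/"
    "0x62083c80353Df771426D209eF578619EE68D5C7A/"
    "0x536b7368575301950A00E0B945f75D22AD64534e"
)
resp = requests.post(url, json={"amount": 1})
print(resp.status_code, resp.json())
# 409 with "there is no route available" means the node could not obtain a
# usable route (via the PFS) to the target.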

scenario-player-run_pfs5_too_low_capacity_2019-09-13T13:08:38.log

node_logs_0_1_2_3_4.zip

A snapshot of the SP console from my local machine:
https://gist.github.com/agatsoh/f53f16372f87e672c21ec4989b8ea845

Edit:
Raiden commit

(raiden_env) [krishna@krishna-pc raiden]$ git log
commit bb0d6563ba5328399e303c74c00ed68bb210535b (HEAD -> develop, upstream/develop)
Author: Yoichi Hirai <[email protected]>
Date:   Thu Sep 12 18:36:43 2019 +0200

    Fixes on @LefterisJP's comments

SP commit

(raiden_env) [krishna@krishna-pc scenario-player]$ git log
commit 272da93ef08d53d176e18ef261a0952185912244 (HEAD -> dev, upstream/dev)
Merge: f28b481 c54466e
Author: Nils Diefenbach <[email protected]>
Date:   Fri Sep 13 08:58:44 2019 +0200

    [FIX-#319] Respect token.address scenario setting

Expected Behavior

The node should be able to find the correct path, and the PFS should reply with a valid route.

@czepluch
Contributor

czepluch commented Sep 13, 2019

For me this scenario passes locally. Will try to run it some more times and see if I can reproduce. Maybe it's flaky. Might be related to https://github.com/orgs/raiden-network/projects/12#card-26353251 that also seems to have shown flaky behaviour.

@czepluch
Contributor

It was indeed the first assert_pfs_history that failed. So my assumption would be that the PFS acted wrongly. I will look into it.

@agatsoh
Contributor Author

agatsoh commented Sep 13, 2019

Just checked; for me it fails all the time. I run the scenario like this:

scenario_player --chain=goerli:http://10.104.6.13:8545 run --no-ui --keystore-file=/home/krishna/.ethereum/keystore/UTC--2018-10-12T07-01-18.438476520Z--8f2e0940bed6f90f1cb14feb37f045bb79c41b2d --password=${KEYSTORE_PW} /home/krishna/raidenforked/raiden/raiden/tests/scenarios/pfs5_too_low_capacity.yaml

@czepluch
Contributor

czepluch commented Sep 13, 2019

I took a look at the logs and also had @palango take a look. It looks like nodes are UNREACHABLE or offline when the payment takes place. I have also managed to run into this problem locally myself now. It looks like a Matrix problem. @ulope, can you take a look when you're back from vacation? Or maybe @err508 has time?

@czepluch
Contributor

This doesn't just happen for this specific scenario. I have experienced it for several different ones, which points even further towards a transport/Matrix problem.

@christianbrb transferred this issue from raiden-network/scenario-player on Sep 19, 2019
@christianbrb added the Component / Transport label on Sep 19, 2019
@czepluch
Contributor

@ulope may I assign this to you? I know you're looking into it.

@christianbrb
Contributor

@andrevmatos also experienced this issue. A restart of the Raiden node fixed it.

@czepluch
Contributor

@ulope Here are all the logs from a run that failed.

ms1_node21_fails_logs.zip

@christianbrb assigned ulope and unassigned czepluch and agatsoh on Sep 20, 2019
@christianbrb added the State / Investigating label on Sep 20, 2019
@palango
Contributor

palango commented Sep 20, 2019

I think I ran into this as well yesterday, so if the PFS logs help you I can upload them.

@andrevmatos
Contributor

andrevmatos commented Sep 20, 2019

This seems related to the Matrix transport. I've seen a Python client I had online appear as offline to the light client, even though it continued to operate normally. I couldn't notice anything in the logs, but it seems like the sync stopped and the servers set the presence of the node to offline. If that's the cause, the PFS is also not going to find the route, because it will see the node as offline in Matrix. Restarting the node seems to fix it because it restarts the client/transport, and the presence event is picked up again by the server and broadcast to the peers.
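If that's what happens, the server-side view can be checked directly with the standard Matrix client-server presence endpoint. A minimal sketch, assuming you have a valid access token and the affected node's Matrix user ID (both are placeholders below):

import requests

# Placeholders: the affected node's Matrix user ID (Raiden user IDs are based
# on the node's Ethereum address) and a valid access token for any account on
# the homeserver.
HOMESERVER = "https://transport01.raiden.network"
USER_ID = "@0x<node-address>:transport01.raiden.network"
ACCESS_TOKEN = "<access token>"

resp = requests.get(
    f"{HOMESERVER}/_matrix/client/r0/presence/{USER_ID}/status",
    headers={"Authorization": f"Bearer {ACCESS_TOKEN}"},
)
# The response contains "presence" ("online"/"offline"/"unavailable") and
# "last_active_ago"; an unexpected "offline" here would support the theory.
print(resp.status_code, resp.json())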

@ulope
Collaborator

ulope commented Sep 24, 2019

The current working theory is that some sort of problem is happening with federation requests between (at least) transport03 and transport01.

As mentioned in chat, to help track down the cause, please run the following command if you see this issue while running scenarios:

cat ~/.raiden/scenario-player/scenarios/<scenario-name>/node_<highest-run-number>_*/run-*.log | jq 'select(.event == "Using Matrix server") | .server_url'

(please substitute the correct values in the angle brackets).

@palango
Contributor

palango commented Sep 24, 2019

cat ~/.raiden/scenario-player/scenarios/mfee1_flat_fee/node_9_*/run-*.log | jq 'select(.event == "Using Matrix server") | .server_url'
"https://transport01.raiden.network"
"https://transport03.raiden.network"
"https://transport03.raiden.network"
"https://transport01.raiden.network"

@Dominik1999
Contributor

Dominik1999 commented Sep 24, 2019

➜  BF1-test cat node_3_*/run-*.log | jq 'select(.event == "Using Matrix server") | .server_url'
"https://transport01.raiden.network"
"https://transport03.raiden.network"
"https://transport01.raiden.network"
"https://transport01.raiden.network"
"https://transport01.raiden.network"
"https://transport01.raiden.network"

@LefterisJP
Contributor

Got the same problem running PFS1.

cat ~/.raiden/scenario-player/scenarios/pfs1_get_a_simple_path/node_0_003/run-000.log  | jq 'select(.event == "Using Matrix server") | .server_url'
"https://transport03.raiden.network"

@rakanalh added this to the Alderaan milestone on Sep 24, 2019
@konradkonrad
Contributor

Same problem for MS3.

LOGPATH="$HOME/.raiden/scenario-player/scenarios/ms3_simple_monitoring/"; cat $LOGPATH/node_$(cat $LOGPATH/run_number.txt)_*/*.log | jq 'select(.event == "Using Matrix server") | .server_url'
"https://transport01.raiden.network"
"https://transport01.raiden.network"

@agatsoh
Contributor Author

agatsoh commented Sep 24, 2019

I tested this today on my local machine with the latest SP and latest Raiden:

(raiden_env) [krishna@krishna-pc scenario-player]$ git log
commit 8ded944959e6d469056cbce268e0c4aa64d4c66a (HEAD -> dev, upstream/dev)
Author: Nils Diefenbach <[email protected]>
Date:   Tue Sep 24 11:35:57 2019 +0200

    Update scenario-example-v2.yaml

scenario-player-run_pfs5_too_low_capacity_2019-09-24T16:48:32.log

pfs5_too_low_capacity_node_logs.zip

@err508
Contributor

err508 commented Sep 24, 2019

I think these three logs are interesting because all nodes are on transport3 here.
They are from a failed run of mfee1_flat_fee.
The user that causes problems is Alice; cb64 and 16f0 are neither any of the other nodes nor the scenario player.

Timeline:

  • Scenario run starts at 10:10
  • at 10:11:30.75 Alice is online at transport3
  • at 10:11:30.76 transport1 learns that Alice is online
  • at 10:12:10.07156 transport1 receives a presence update that Alice is offline from 16f0@transport1
  • at 10:12:10.0716 transport3 receives a presence update that Alice is offline from cb64@transport1
  • at 10:17:10 Alice syncs with transport03, noticing that Bob is online
  • at 10:17:19.7250 transport3 receives another presence update that Alice is offline from cb64@transport1
  • at 10:17:19.740495 transport1 receives a presence update that Alice is offline from 16f0@transport1
  • at 10:17:50 the scenario fails, as Alice is assumed to be offline

shortened_html_logs.zip

@err508
Contributor

err508 commented Sep 24, 2019

@palango could you let me know what cb64 and 16f0 are doing in the scenario? Could this be the PFS/MS?

@palango
Contributor

palango commented Sep 24, 2019

@err508 What are the full addresses?

@err508
Contributor

err508 commented Sep 24, 2019

@palango 0x16f090b46c9eea72478a21f00764e2798d966d0b and 0xcb645ac5a359fb2c6dd2971cb5f556f6ceb1ea06, but never mind: these are two clients used for the presence logging.

@ulope
Collaborator

ulope commented Sep 25, 2019

After I purged the test transport servers' databases, the situation seems to have improved for many of the testers, but unfortunately not all (I know @czepluch is still seeing the issue).

What I've discovered so far:

  • Raiden nodes that experience this issue appear to go offline because:
    • They perform the normal Matrix /sync long-polling HTTP request
    • This triggers a federation request server-side
    • This request stalls for some as yet unknown reason (eventually getting aborted with "Connection reset by peer" or a similar error)
    • This leads to a 500 response from the /sync endpoint
      • Crucially, 500 responses don't seem to update the 'last seen' mechanism on the server
    • Raiden retries Matrix requests that return 500 responses
    • If this goes on long enough, the 30 s timeout in synapse expires (see two bullets above) and the node is assumed offline (which gets broadcast as normal to all listening peers)

It doesn't seem to be a networking issue. I've run networking tests in the background all day yesterday and haven't seen any indications that this could be the reason.

I'm going ahead with upgrading the test servers to a recent Synapse release (with a workaround for #4634) anyway, in the hope that this fixes the federation issue.
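As a rough illustration of the failure sequence in the bullet list above (this is not Raiden's actual transport code; the URL, timeout values and missing authentication are simplified placeholders):

import time
import requests

# Placeholder values; authentication is omitted for brevity.
SYNC_URL = "https://transport03.raiden.network/_matrix/client/r0/sync"
PRESENCE_TIMEOUT = 30  # seconds after which synapse considers the user offline

last_successful_sync = time.monotonic()

for _ in range(100):
    # Normal long-polling /sync request. A stalled federation request on the
    # server side can make this come back as a 500 instead of a sync response.
    resp = requests.get(SYNC_URL, params={"timeout": 30000}, timeout=35)

    if resp.status_code == 500:
        # The client simply retries, but the failed request apparently does
        # not refresh the server's 'last seen' marker for this user ...
        if time.monotonic() - last_successful_sync > PRESENCE_TIMEOUT:
            # ... so once the retries have gone on long enough, the server has
            # already broadcast this user as offline to all listening peers.
            print("presence for this node has likely flipped to offline")
        continue

    # A successful sync is what keeps the node marked as online.
    last_successful_sync = time.monotonic()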
