Payments wait forever to be conducted #2779

Closed
Dominik1999 opened this issue Oct 15, 2018 · 19 comments
Labels: State / Investigating (for issues that are currently being looked into before labeling further)

@Dominik1999
Contributor

Dominik1999 commented Oct 15, 2018

Potential Bug / Possible Improvement for the API

Scenario #2778 ran for 5 hours without receiving any response code for the started payments. The scenario was manually aborted after 5 hours.
See https://gist.github.com/Dominik1999/ceee2ec4ede7c4f815664610d078998a

I would expect the API to respond with code 408 Request Timeout.
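
In the meantime, a client-side timeout at least keeps the caller from blocking forever. A minimal sketch using requests; the endpoint path may differ between Raiden versions, and the port, addresses and amount are placeholders for illustration (the addresses are just copied from logs later in this thread), not taken from this scenario:

```python
import requests

# Placeholder values for illustration only.
NODE_API = "http://127.0.0.1:5001"
TOKEN = "0x30e04cb1ccf93585c2228f651369d8d1e9e3d7a1"
TARGET = "0x2f6df0dc2b8ad73e0e5d75cadd5b0508625e865e"

url = f"{NODE_API}/api/1/payments/{TOKEN}/{TARGET}"

try:
    # Give up after 5 minutes instead of waiting on the node forever.
    response = requests.post(url, json={"amount": 1}, timeout=300)
    print(response.status_code, response.json())
except requests.exceptions.Timeout:
    print("payment request timed out on the client side")
```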

Reproduce

Run the following simple scenario.

@hackaugusto
Contributor

hackaugusto commented Oct 15, 2018

@Dominik1999 the scenario runner logs are not sufficient to figure out what happened. Could you provide the logs from the individual nodes and their reveal timeout? (the timeout of the request depends on the reveal timeout)

Edit: node 0 seems the most important one.

@Dominik1999
Contributor Author

@hackaugusto sorry this took me a while... Here is the log file of node 10, which started a payment that never succeeded. At least that is how I would interpret the scenario player log.

raiden.log

@hackaugusto
Contributor

@Dominik1999 could you get the logs for node 0 too?

@Dominik1999
Contributor Author

@hackaugusto
Here I provide you with everything. It is an easier scenario, but it shows the same problem. The payment from Node:0 to Node:6 never succeeds.
(Note: in the log file of Node 0 I only see 5 POST messages, i.e. the payments from Node:0 to Node:1 - Node:5. The 6th POST message, to Node:6, is not in the log file. The 5 other payments were successful in the scenario run. So the problem might be that the scenario player never actually starts the payment, even though it says so.)

I can run the same scenario several times and get this error at different transfers in the scenario. @czepluch faces the same error.

@czepluch
Contributor

czepluch commented Oct 16, 2018

Yes. I just faced the problem in an even simpler scenario where only one transfer is made, without any mediators. Ulo tried to run a similar scenario to mine and it works for him. Note that the first time I ran the scenario it succeeded, but it has not since then, even without reusing the same token network.

Edit: I did kill the SP while running the scenario the second time, so maybe this resulted in some bad state in one of the nodes.

Log from sending node: https://gist.github.com/czepluch/7e337ea81c0385264d9910f9ebb224db
Log from receiving node: https://gist.github.com/czepluch/e9cea4b09c2f3e09cd0cdc2b408fb985

@ulope
Collaborator

ulope commented Oct 16, 2018

I just discussed this with Dominik in a call. The POST log messages are misleading since they are only logged after the request has finished.

The log message to look for is emitted by the raiden.api.rest logger, for example Initiating payment.
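
A quick way to pull those events out of a node log; this sketch assumes the plain-text structlog output seen elsewhere in this thread (adjust the matching if your node writes JSON logs instead):

```python
import sys

def rest_api_events(path):
    # Yield log lines emitted by the raiden.api.rest logger, plus any
    # "Initiating payment" events, from a plain-text node log.
    with open(path) as log:
        for line in log:
            if "[raiden.api.rest]" in line or "Initiating payment" in line:
                yield line.rstrip()

if __name__ == "__main__":
    for event in rest_api_events(sys.argv[1] if len(sys.argv) > 1 else "raiden.log"):
        print(event)
```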

@LefterisJP
Contributor

So I had a quick look at @czepluch's two logs.

Who is getting stuck? Receiver or sender?

In those logs the sender (0xe5c68d721324d00e6c90b98b6e6c64250b2bcb9a) tries multiple times to send a payment with identifier 18154601289524244314 to the target 0x2f6df0dc2b8ad73e0e5d75cadd5b0508625e865e. The token is 0x30e04cb1ccf93585c2228f651369d8d1e9e3d7a1.

The sender retries many times but the target logs do not show any received transfers. The target only seems to have an active healthcheck with the sender, early in the game:

2018-10-16 09:46:29 [debug    ] Healthcheck                    [raiden.network.transport.matrix] current_user=@0x2f6df0dc2b8ad73e0e5d75cadd5b0508625e865e:transport01.raiden.network node=2f6df0dc peer_address= 0xe5c68d721324d00e6c90b98b6e6c64250b2bcb9a
2018-10-16 09:46:29 [debug    ] Changing address presence state [raiden.network.transport.matrix] address=0xe5c68d721324d00e6c90b98b6e6c64250b2bcb9a current_user=@0x2f6df0dc2b8ad73e0e5d75cadd5b0508625e865e:transport01.raiden.network node=2f6df0dc prev_state=None state=<UserPresence.ONLINE: 'online'>
2018-10-16 09:46:29 [debug    ] State change                   [raiden.raiden_service] node=2f6df0dc state_change={"node_address": "0xE5c68D721324D00E6C90B98b6E6C64250B2BcB9a", "network_state": "reachable", "_type": "raiden.transfer.state_change.ActionChangeNodeNetworkState", "_version": 0}

And then some logs showing it has acknowledged the channel opening.

2018-10-16 09:48:03 [debug    ] Opening channel                [raiden.api.rest] partner_address= 0xE5c68D721324D00E6C90B98b6E6C64250B2BcB9a registry_address=0xbfa863Ac58a3E0A82B58a8e958F2752Bfb573388 settle_timeout=None token_address=0x30e04CB1CCF93585c2228F651369d8d1e9e3D7a1
2018-10-16 09:48:42 [debug    ] Depositing to new channel      [raiden.api.rest] partner_address=0xE5c68D721324D00E6C90B98b6E6C64250B2BcB9a registry_address=0xbfa863Ac58a3E0A82B58a8e958F2752Bfb573388 token_address=0x30e04CB1CCF93585c2228F651369d8d1e9e3D7a1 total_deposit=1000

@ulope
Collaborator

ulope commented Oct 16, 2018

Yes this is definitely weird. Wild guess: maybe a matrix problem?

@christianbrb christianbrb added the State / Investigating For issues that are currently being looked into before labeling further label Oct 16, 2018
@czepluch
Contributor

czepluch commented Oct 16, 2018

@LefterisJP I tried to make a transfer to the "receiving node" from another node that has a path while the above problem occurs, and that transfer also just hangs.

It's a new scenario so here are the logs for that one:
Logs of sender address when hanging starts: https://gist.github.com/czepluch/23ce8ae58afde131a986f42b690f04eb
Logs of receiving address when hanging starts: https://gist.github.com/czepluch/0145be18e8ed95f54d0e31feb08599ef
Logs of another address trying to send a transfer to the receiving node after the hanging has started (this also hangs): https://gist.github.com/czepluch/07e6af8bb820dc2645194ae8f5cba438 (payment starts at line 1863).

If I stop the scenario when it hangs and start it again with a new token, it does just fine until it reaches the same transfer again, and then it hangs again.

Edit: The scenario in my case is this one: https://gist.github.com/czepluch/7f2e6c92f3892d2a37a29a8864e2de69

@ulope
Collaborator

ulope commented Oct 18, 2018

More testing yesterday showed that once a transfer between a particular set of nodes has entered this hanging state, it doesn't recover even after a node restart and deploying a new token network.

The parameters determining the hanging state seem to be the two nodes involved and the Eth chain used.

That makes me even more suspicious that the problem may be in the matrix transport, since those three parameters also control room assignment for node-to-node communication.
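
For context, a rough illustration of why the same node pair on the same chain always maps to the same room. The alias format and chain name below are made-up examples, not necessarily what Raiden's matrix transport actually uses:

```python
# Illustration only: a deterministic room alias derived from the chain name
# and the two peer addresses. Sorting the addresses makes both peers compute
# the same alias regardless of who initiates.
def room_alias(chain_name: str, address_a: str, address_b: str) -> str:
    first, second = sorted((address_a.lower(), address_b.lower()))
    return f"raiden_{chain_name}_{first}_{second}"

print(room_alias(
    "kovan",  # placeholder chain name
    "0xE5c68D721324D00E6C90B98b6E6C64250B2BcB9a",
    "0x2f6df0dc2b8ad73e0e5d75cadd5b0508625e865e",
))
```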

@LefterisJP
Contributor

Thank you @ulope for your insight. Is there any way to debug this by using a dedicated matrix server (our testing servers), recreating the problem in its simplest form and watching what happens in Matrix?

@ulope
Collaborator

ulope commented Oct 19, 2018

@LefterisJP You can already observe what's happening. Simply log in to the matrix server used by the nodes (or register an account; if you do, don't enter an email address in the registration form, it's not supported on our setup). In this case it was transport01.

In the corresponding room you can see that the initiating node is sending messages but no reply exists from the target.
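
If you prefer to watch programmatically rather than through a browser, here is a minimal sketch using the matrix-client SDK. The homeserver URL matches this thread, but the account credentials are placeholders, and you will only see events in rooms that account has joined:

```python
from matrix_client.client import MatrixClient

# Homeserver from this thread; credentials are placeholders.
client = MatrixClient("https://transport01.raiden.network")
client.login_with_password(username="observer", password="...")

def on_event(event):
    # Print every synced event so you can watch whether the target node
    # ever replies to the initiator's messages.
    print(event.get("room_id"), event.get("type"), event.get("sender"))

client.add_listener(on_event)
client.listen_forever()
```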

Unfortunately, as mentioned before, I'm not working today, but I will look into this more over the weekend.

@christianbrb christianbrb assigned andrevmatos and unassigned ulope Oct 22, 2018
@christianbrb
Contributor

@czepluch What is the even simpler scenario you are referring to in this comment? #2779 (comment)

@czepluch
Contributor

@christianbrb As I explained in the comment, it was just a simple scenario with one node sending a transfer to another, with no mediators.

@ulope
Collaborator

ulope commented Oct 24, 2018

Here is a simpler scenario that also causes this behavior: https://gist.github.com/ulope/9113c794e431e04f460f65e9695cfede

(Please remember to change the raiden executable path to fit your local environment)

While testing this I also came across #2838 again and discovered a new issue, #2879.

@andrevmatos
Contributor

Talking with @konradkonrad and @hackaugusto, I can see two ways for the receiver not to be receiving the sender's messages:

  1. A deadlock or other blocking operation happened in the callback handling for a message or event (in _handle_response from raiden-libs's GMatrixClient). That would block the next /sync calls and event handling (which is intended if one event handler blocks). We've seen this in the past, and their approach to debugging it is to install a signal handler that prints the stacktrace of all running greenlets to identify the deadlock (see the sketch after this list). I think that's not what's happening here, because it would also make the user appear offline (no /sync -> offline), and if the sender keeps sending messages, it means the receiver is online.
  2. We didn't join, or aren't listening for events on, this room: that's probably what's happening, and it can have a couple of causes. My main bet is a race where the whitelist of peers we accept invites from (currently populated in start_health_check) isn't populated yet during Transport.start, when the initialSync takes place and the invite events sent while we were offline are processed; those events are ignored, so we don't join the room and don't listen to events on it.
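
A minimal sketch of that stack-dump approach, assuming a gevent/greenlet based process and a Unix signal; this is not Raiden's actual debug tooling:

```python
# Register this inside the node process, then send SIGUSR1
# (kill -USR1 <pid>) while a transfer is hanging to see where
# each greenlet is blocked.
import gc
import signal
import traceback

import greenlet


def dump_greenlet_stacks(signum, frame):
    for obj in gc.get_objects():
        if isinstance(obj, greenlet.greenlet) and obj.gr_frame is not None:
            print(f"--- greenlet {id(obj):#x} ---")
            traceback.print_stack(obj.gr_frame)


signal.signal(signal.SIGUSR1, dump_greenlet_stacks)
```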

I'll make a PR to fix 2., and we can test.

@christianbrb
Contributor

Added this to the testnet 16 milestone, as PR #2889 is needed for PR #2948.

@LefterisJP
Contributor

As discussed in the standup, we are not sure if #2948 will fix this. @ulope tried with the WIP PR and it did not seem to fix the problem.

ulope added a commit to ulope/raiden that referenced this issue Dec 5, 2018
ulope added a commit to ulope/raiden that referenced this issue Dec 6, 2018
LefterisJP pushed a commit that referenced this issue Dec 6, 2018
hackaugusto pushed a commit to hackaugusto/raiden that referenced this issue Jan 18, 2019
hackaugusto pushed a commit to hackaugusto/raiden that referenced this issue Jan 25, 2019
err508 pushed a commit to err508/raiden that referenced this issue Mar 6, 2019
condition:
- Client A invites
- The invite triggers _handle_invite in Client B's transport
- Client A starts sending messages to Client B
- Messages are lost, as the invite was not processed yet

The race condition will be fixed in another PR.
Appeared during raiden-network#3124, related raiden-network#2779, raiden-network#3123.
err508 pushed a commit to err508/raiden that referenced this issue Mar 6, 2019
err508 pushed a commit to err508/raiden that referenced this issue Mar 7, 2019
err508 pushed a commit to err508/raiden that referenced this issue Mar 8, 2019
err508 pushed a commit to err508/raiden that referenced this issue Mar 13, 2019
err508 pushed a commit to err508/raiden that referenced this issue Mar 14, 2019
err508 pushed a commit to err508/raiden that referenced this issue Mar 18, 2019
err508 pushed a commit to err508/raiden that referenced this issue Mar 27, 2019