Payments wait forever to be conducted #2779

Closed
Dominik1999 opened this issue Oct 15, 2018 · 19 comments
Labels: State / Investigating (for issues that are currently being looked into before labeling further)

@Dominik1999
Contributor

Dominik1999 commented Oct 15, 2018

Potential Bug / Possible Improvement for the API

Scenario #2778 ran for 5 hours without receiving any response code for the started payments. The scenario was manually aborted after 5 hours.
See https://gist.github.com/Dominik1999/ceee2ec4ede7c4f815664610d078998a

I would expect the API to respond with code 408 Request Timeout.
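
In the meantime, a client-side timeout at least keeps the caller from blocking forever. A minimal sketch using requests; the endpoint path may differ between Raiden versions, and the port, addresses and amount are placeholders for illustration (the addresses are just copied from logs later in this thread), not taken from this scenario:

```python
import requests

# Placeholder values for illustration only.
NODE_API = "http://127.0.0.1:5001"
TOKEN = "0x30e04cb1ccf93585c2228f651369d8d1e9e3d7a1"
TARGET = "0x2f6df0dc2b8ad73e0e5d75cadd5b0508625e865e"

url = f"{NODE_API}/api/1/payments/{TOKEN}/{TARGET}"

try:
    # Give up after 5 minutes instead of waiting on the node forever.
    response = requests.post(url, json={"amount": 1}, timeout=300)
    print(response.status_code, response.json())
except requests.exceptions.Timeout:
    print("payment request timed out on the client side")
```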

Reproduce

Run the following simple scenario.

@hackaugusto
Contributor

hackaugusto commented Oct 15, 2018

@Dominik1999 the scenario runner logs are not sufficient to figure out what happened. Could you provide the logs from the individual nodes and their reveal timeout? (the timeout of the request depends on the reveal timeout)

Edit: node 0 seems the most important one.

@Dominik1999
Contributor Author

@hackaugusto sorry this took me a while... Here is the log file of node 10, which started a payment that never succeeded. At least that is how I would interpret the scenario player log.

raiden.log

@hackaugusto
Contributor

@Dominik1999 could you get the logs for node 0 too?

@Dominik1999
Contributor Author

@hackaugusto
Here I provide you with everything. It is an easier scenario, but it shows the same problem. The payment from Node:0 to Node:6 never succeeds.
(Note: in the log file of Node 0 I only see 5 POST messages, i.e. the payments from Node:0 to Node:1 - Node:5. The 6th POST message, to Node:6, is not in the log file. The 5 other payments were successful in the scenario run. So the problem might be that the scenario player never actually starts the payment, even though it says so.)

I can run the same scenario several times and get this error at different transfers in the scenario. @czepluch faces the same error.

@czepluch
Contributor

czepluch commented Oct 16, 2018

Yes. I just faced the problem in an even simpler scenario where only one transfer is made, without any mediators. Ulo tried to run a similar scenario to mine and it works for him. Note that the first time I ran the scenario it succeeded, but it has not since then, even without reusing the same token network.

Edit: I did kill the SP while running the scenario the second time, so maybe this resulted in some bad state in one of the nodes.

Log from sending node: https://gist.github.com/czepluch/7e337ea81c0385264d9910f9ebb224db
Log from receiving node: https://gist.github.com/czepluch/e9cea4b09c2f3e09cd0cdc2b408fb985

@ulope
Collaborator

ulope commented Oct 16, 2018

I just discussed this with Dominik in a call. The POST log messages are misleading since they are only logged after the request has finished.

The log message to look for is emitted by the raiden.api.rest logger, for example Initiating payment.
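
A quick way to pull those events out of a node log; this sketch assumes the plain-text structlog output seen elsewhere in this thread (adjust the matching if your node writes JSON logs instead):

```python
import sys

def rest_api_events(path):
    # Yield log lines emitted by the raiden.api.rest logger, plus any
    # "Initiating payment" events, from a plain-text node log.
    with open(path) as log:
        for line in log:
            if "[raiden.api.rest]" in line or "Initiating payment" in line:
                yield line.rstrip()

if __name__ == "__main__":
    for event in rest_api_events(sys.argv[1] if len(sys.argv) > 1 else "raiden.log"):
        print(event)
```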

@LefterisJP
Contributor

So I had a quick look at @czepluch's two logs.

Who is getting stuck? Receiver or sender?

In those logs the sender (0xe5c68d721324d00e6c90b98b6e6c64250b2bcb9a) tries multiple times to send a payment with identifier 18154601289524244314 to the target 0x2f6df0dc2b8ad73e0e5d75cadd5b0508625e865e. The token is 0x30e04cb1ccf93585c2228f651369d8d1e9e3d7a1.

The sender retries many times but the target logs do not show any received transfers. The target only seems to have an active healthcheck with the sender, early in the game:

2018-10-16 09:46:29 [debug    ] Healthcheck                    [raiden.network.transport.matrix] current_user=@0x2f6df0dc2b8ad73e0e5d75cadd5b0508625e865e:transport01.raiden.network node=2f6df0dc peer_address= 0xe5c68d721324d00e6c90b98b6e6c64250b2bcb9a
2018-10-16 09:46:29 [debug    ] Changing address presence state [raiden.network.transport.matrix] address=0xe5c68d721324d00e6c90b98b6e6c64250b2bcb9a current_user=@0x2f6df0dc2b8ad73e0e5d75cadd5b0508625e865e:transport01.raiden.network node=2f6df0dc prev_state=None state=<UserPresence.ONLINE: 'online'>
2018-10-16 09:46:29 [debug    ] State change                   [raiden.raiden_service] node=2f6df0dc state_change={"node_address": "0xE5c68D721324D00E6C90B98b6E6C64250B2BcB9a", "network_state": "reachable", "_type": "raiden.transfer.state_change.ActionChangeNodeNetworkState", "_version": 0}

And then some logs showing it has acknowledged the channel opening.

2018-10-16 09:48:03 [debug    ] Opening channel                [raiden.api.rest] partner_address= 0xE5c68D721324D00E6C90B98b6E6C64250B2BcB9a registry_address=0xbfa863Ac58a3E0A82B58a8e958F2752Bfb573388 settle_timeout=None token_address=0x30e04CB1CCF93585c2228F651369d8d1e9e3D7a1
2018-10-16 09:48:42 [debug    ] Depositing to new channel      [raiden.api.rest] partner_address=0xE5c68D721324D00E6C90B98b6E6C64250B2BcB9a registry_address=0xbfa863Ac58a3E0A82B58a8e958F2752Bfb573388 token_address=0x30e04CB1CCF93585c2228F651369d8d1e9e3D7a1 total_deposit=1000

@ulope
Collaborator

ulope commented Oct 16, 2018

Yes this is definitely weird. Wild guess: maybe a matrix problem?

@christianbrb christianbrb added the State / Investigating For issues that are currently being looked into before labeling further label Oct 16, 2018
@czepluch
Contributor

czepluch commented Oct 16, 2018

@LefterisJP I tried to make a transfer to the "receiving node" from another node that has a path while the above problem occurs, and that transfer also just hangs.

It's a new scenario so here are the logs for that one:
Logs of sender address when hanging starts: https://gist.github.com/czepluch/23ce8ae58afde131a986f42b690f04eb
Logs of receiving address when hanging starts: https://gist.github.com/czepluch/0145be18e8ed95f54d0e31feb08599ef
Logs of another address trying to send a transfer to the receiving node after the hanging has started (this also hangs): https://gist.github.com/czepluch/07e6af8bb820dc2645194ae8f5cba438 (payment starts at line 1863).

If I stop the scenario when it hangs and start it again with a new token, it does just fine until it reaches the same transfer again, and then it hangs again.

Edit: The scenario in my case is this one: https://gist.github.com/czepluch/7f2e6c92f3892d2a37a29a8864e2de69

@ulope
Collaborator

ulope commented Oct 18, 2018

More testing yesterday showed that once a transfer between a particular set of nodes has entered this hanging state, it doesn't recover even after a node restart and deploying a new token network.

The parameters determining the hanging state seem to be the two nodes involved and the Eth chain used.

That makes me even more suspicious that the problem may be in the matrix transport, since those three parameters also control room assignment for node-to-node communication.
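
For context, a rough illustration of why the same node pair on the same chain always maps to the same room. The alias format and chain name below are made-up examples, not necessarily what Raiden's matrix transport actually uses:

```python
# Illustration only: a deterministic room alias derived from the chain name
# and the two peer addresses. Sorting the addresses makes both peers compute
# the same alias regardless of who initiates.
def room_alias(chain_name: str, address_a: str, address_b: str) -> str:
    first, second = sorted((address_a.lower(), address_b.lower()))
    return f"raiden_{chain_name}_{first}_{second}"

print(room_alias(
    "kovan",  # placeholder chain name
    "0xE5c68D721324D00E6C90B98b6E6C64250B2BcB9a",
    "0x2f6df0dc2b8ad73e0e5d75cadd5b0508625e865e",
))
```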

@LefterisJP
Contributor

Thank you @ulope for your insight. Is there any way to debug this by using a dedicated matrix server (our testing servers), recreating the problem in its simplest form and watching what happens in Matrix?

@ulope
Collaborator

ulope commented Oct 19, 2018

@LefterisJP You can already observe what's happening. Simply log in to the matrix server used by the nodes (or register an account; if you do, don't enter an email address in the registration form, it's not supported on our setup). In this case it was transport01.

In the corresponding room you can see that the initiating node is sending messages but no reply exists from the target.
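
If you prefer to watch programmatically rather than through a browser, here is a minimal sketch using the matrix-client SDK. The homeserver URL matches this thread, but the account credentials are placeholders, and you will only see events in rooms that account has joined:

```python
from matrix_client.client import MatrixClient

# Homeserver from this thread; credentials are placeholders.
client = MatrixClient("https://transport01.raiden.network")
client.login_with_password(username="observer", password="...")

def on_event(event):
    # Print every synced event so you can watch whether the target node
    # ever replies to the initiator's messages.
    print(event.get("room_id"), event.get("type"), event.get("sender"))

client.add_listener(on_event)
client.listen_forever()
```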

Unfortunately, as mentioned before, I'm not working today, but I will look into this more over the weekend.

@christianbrb christianbrb assigned andrevmatos and unassigned ulope Oct 22, 2018
@christianbrb
Contributor

@czepluch What is the even simpler scenario you are referring to in this comment? #2779 (comment)

@czepluch
Contributor

@christianbrb As I explained in the comment, it was just a simple scenario with one node sending a transfer to another, with no mediators.

@ulope
Collaborator

ulope commented Oct 24, 2018

Here is a simpler scenario that also causes this behavior: https://gist.github.com/ulope/9113c794e431e04f460f65e9695cfede

(Please remember to change the raiden executable path to fit your local environment)

While testing this I also came across #2838 again and discovered a new issue, #2879.

@andrevmatos
Contributor

Talking with @konradkonrad and @hackaugusto, I can see two ways for the receiver not to be receiving the sender's messages:

  1. A deadlock or other blocking operation happened in the callback handling for a message or event (in _handle_response from raiden-libs's GMatrixClient). That would block the next /sync calls and event handling (which is intended if one event handler blocks). We've seen this in the past, and their approach to debugging it is to install a signal handler that prints the stacktrace of all running greenlets to identify the deadlock (see the sketch after this list). I think that's not what's happening here, because it would also make the user appear offline (no /sync -> offline), and if the sender keeps sending messages, it means the receiver is online.
  2. We didn't join, or aren't listening for events on, this room: that's probably what's happening, and it can have a couple of causes. My main bet is a race where the whitelist of peers we accept invites from (currently populated in start_health_check) isn't populated yet during Transport.start, when the initialSync takes place and the invite events sent while we were offline are processed; those events are ignored, so we don't join the room and don't listen to events on it.
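
A minimal sketch of that stack-dump approach, assuming a gevent/greenlet based process and a Unix signal; this is not Raiden's actual debug tooling:

```python
# Register this inside the node process, then send SIGUSR1
# (kill -USR1 <pid>) while a transfer is hanging to see where
# each greenlet is blocked.
import gc
import signal
import traceback

import greenlet


def dump_greenlet_stacks(signum, frame):
    for obj in gc.get_objects():
        if isinstance(obj, greenlet.greenlet) and obj.gr_frame is not None:
            print(f"--- greenlet {id(obj):#x} ---")
            traceback.print_stack(obj.gr_frame)


signal.signal(signal.SIGUSR1, dump_greenlet_stacks)
```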

I'll make a PR to fix 2., and we can test.

@christianbrb
Contributor

Added this to the testnet 16 milestone, as PR #2889 is needed for PR #2948.

@LefterisJP
Contributor

As discussed in the standup, we are not sure if #2948 will fix this. @ulope tried with the WIP PR and it did not seem to fix the problem.

ulope added a commit to ulope/raiden that referenced this issue Dec 5, 2018
ulope added a commit to ulope/raiden that referenced this issue Dec 6, 2018
LefterisJP pushed a commit that referenced this issue Dec 6, 2018
hackaugusto pushed a commit to hackaugusto/raiden that referenced this issue Jan 18, 2019
hackaugusto pushed a commit to hackaugusto/raiden that referenced this issue Jan 25, 2019
err508 pushed a commit to err508/raiden that referenced this issue Mar 6, 2019
condition:
- Client A invites
- The invite triggers _handle_invite in Client B's transport
- Client A starts sending messages to Client B
- Messages are lost, as the invite was not processed yet

The race condition will be fixed in another PR.
Appeared during raiden-network#3124, related raiden-network#2779, raiden-network#3123.
err508 pushed a commit to err508/raiden that referenced this issue Mar 6, 2019
err508 pushed a commit to err508/raiden that referenced this issue Mar 7, 2019
err508 pushed a commit to err508/raiden that referenced this issue Mar 8, 2019
err508 pushed a commit to err508/raiden that referenced this issue Mar 13, 2019
err508 pushed a commit to err508/raiden that referenced this issue Mar 14, 2019
err508 pushed a commit to err508/raiden that referenced this issue Mar 18, 2019
err508 pushed a commit to err508/raiden that referenced this issue Mar 27, 2019