boards/cc2538/radio: networking has high losses #5786

Closed
PeterKietzmann opened this issue Aug 30, 2016 · 24 comments

@PeterKietzmann
Member

Pinging several times between two remote revision A nodes (remote-reva) with bigger payloads (e.g. 200, 500, 1000 Bytes) leads to high losses.

@PeterKietzmann PeterKietzmann added Type: bug The issue reports a bug / The PR fixes a bug (including spelling errors) Area: drivers Area: Device drivers labels Aug 30, 2016
@alignan
Contributor

alignan commented Aug 31, 2016

Hi @PeterKietzmann, I will try to reproduce, thanks for testing! Mind pointing me to the steps to reproduce?

@PeterKietzmann
Member Author

  1. Flash gnrc_networking on two remotes.
  2. Repeatedly ping the link-local address with a big payload.
    For example: ping6 20 <ipv6 addr> 500

@alignan
Contributor

alignan commented Sep 2, 2016

A quick sweep & repeat (5-7 runs of each step), without too much difference between runs:

(...) 

ping6 20 fe80::212:4b00:615:a029 264 500
--- fe80::212:4b00:615:a029 ping statistics ---
20 packets transmitted, 20 received, 0% packet loss, time 19.06746942 s
rtt min/avg/max = 30.404/30.411/30.417 ms

ping6 20 fe80::212:4b00:615:a029 350
--- fe80::212:4b00:615:a029 ping statistics ---
20 packets transmitted, 20 received, 0% packet loss, time 19.06919232 s
rtt min/avg/max = 38.928/38.934/38.941 ms

ping6 20 fe80::212:4b00:615:a029 364
--- fe80::212:4b00:615:a029 ping statistics ---
20 packets transmitted, 20 received, 0% packet loss, time 19.06939407 s
rtt min/avg/max = 39.921/39.928/39.942 ms

ping6 20 fe80::212:4b00:615:a029 368
--- fe80::212:4b00:615:a029 ping statistics ---
20 packets transmitted, 20 received, 0% packet loss, time 19.06945231 s
rtt min/avg/max = 40.208/40.215/40.222 ms

ping6 20 fe80::212:4b00:615:a029 369
--- fe80::212:4b00:615:a029 ping statistics ---
20 packets transmitted, 0 received, 100% packet loss

ping6 20 fe80::212:4b00:615:a029 500
--- fe80::212:4b00:615:a029 ping statistics ---
20 packets transmitted, 14 received, 30% packet loss, time 26.0610268 s
rtt min/avg/max = 54.413/38.100/54.454 ms

Strangely, packet sizes of 369-380 always yielded 100% packet loss (over 10+ runs; note that for 368 the packet loss is 0%), whereas larger packet sizes yield some packet loss, but not 100%. I will keep testing, and also try with another CC2538-based platform.

@aeneby
Member

aeneby commented Sep 3, 2016

Strangely packet size of 369-380 always yielded 100% packet loss

Wow... bizarre. I can reproduce this too (well, 99.9% loss). Do we know if the same thing occurs on other platforms?

@aeneby
Member

aeneby commented Sep 3, 2016

It seems that introducing a slight delay in the transmission improves things significantly. The delay only needs to be very slight - enabling DEBUG in cpu/cc2538/radio/cc2538_rf_netdev.c for all nodes is even sufficient, since this adds a delay for printing debug output.

Why this helps, though, I have no idea.
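
For reference, RIOT drivers gate their debug prints with an ENABLE_DEBUG switch at the top of the source file, and flipping it on is enough to add the small per-packet delay described above. A minimal sketch (hypothetical excerpt, not a specific line from the driver):

```c
/* Near the top of cpu/cc2538/radio/cc2538_rf_netdev.c */
#define ENABLE_DEBUG (1)    /* 1 compiles the DEBUG() output in, 0 strips it */
#include "debug.h"

/* With ENABLE_DEBUG set, every DEBUG("...") call in the driver prints over
 * the serial console, which adds a small per-packet delay as a side effect. */
```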

@OlegHahm OlegHahm added this to the Release 2016.10 milestone Sep 3, 2016
@alignan
Contributor

alignan commented Sep 15, 2016

Have you tried #5804 with DEBUG enabled? Wondering if this contributes to the required delay as well.

@alignan alignan changed the title remote-reva/cc2538/radio: networking has high losses boards/cc2538/radio: networking has high losses Sep 20, 2016
@smlng
Member

smlng commented Sep 20, 2016

I made the same observations, see #5840 for some debug output. I saw 100% packet loss with ping6 and 256B payload from samr21-xpro to remote-revb

@aeneby
Member

aeneby commented Sep 20, 2016

For those testing with large payloads over a layer 3 protocol, please also see #5803

@aeneby
Member

aeneby commented Sep 20, 2016

Have you tried #5804 with DEBUG enabled? Wondering if this contributes to the required delay as well.

@alignan no, I haven't tested this yet. One thing I can't currently test, but would like to, is whether pinging the cc2538 from another platform exhibits the same problem, i.e. whether other drivers send their packets as quickly as the cc2538 does. I know that RIOT's cc2538 driver is capable of sending consecutive packets very quickly - it doesn't wait for transmission of the first packet to complete before starting to process the next one, because it doesn't need to. So if other drivers are not capable of sending packets this fast, and therefore not provoking the problem, one workaround might be to wait for transmission to complete before starting to process the next packet.

However, the real underlying issue here seems to be the cc2538 driver not handling multiple packets in the RX FIFO (I think the cc2538 hardware is quite unique in its ability to do this?) To me, this seems like a non-trivial problem to solve, and I would welcome any ideas on how to handle it properly rather than introducing artificial delays.
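
A rough sketch of the "wait for transmission to complete" workaround mentioned above, assuming a hypothetical cc2538_tx_active() helper that polls the radio's TX-active status (the concrete register and macro names from the cc2538 headers are not reproduced here):

```c
#include <stdbool.h>
#include "xtimer.h"

/* Hypothetical helper: true while the RF core is still clocking out the
 * previous frame (would poll a TX-active status bit on the cc2538). */
extern bool cc2538_tx_active(void);

/* Called at the end of the driver's send path, so that the next packet is
 * not prepared while the previous one is still on the air. */
static void cc2538_wait_tx_done(void)
{
    while (cc2538_tx_active()) {
        xtimer_usleep(50);  /* back off briefly instead of busy-spinning */
    }
}
```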

@smlng
Member

smlng commented Sep 23, 2016

Another observation that causes kernel panic:

I have a remote-revb, a samr21-xpro, and an RPi with an openlabs transceiver, and I send pings (no large payload, just standard size, i.e. 4B) between those devices. The following happens:

  • ping between remote-revb and samr21-xpro works in both directions
  • ping between remote-revb and RPi does not work in any direction
    • remote-revb -> RPi: see ping request in sniffer but nothing else
    • RPi -> remote-revb: see NS and NA in sniffer but no ping request nor replies
  • ping between RPi and samr21-xpro works in both directions
    • ping from samr21-xpro to RPi instantly causes a kernel panic on the remote-revb, even though it is only listening and there are no other packets in the air

@LucaZulberti

LucaZulberti commented Sep 23, 2016

@aeneby Why is the RX FIFO unconditionally flushed? At the end of the "_recv" function, when the packet has been copied into the buffer, there could be another packet in the FIFO.
The driver should flush the FIFO only in case of overflow. I don't know how the upper layer works, but it should be able to yield and wait for the next packet-reception IRQ without problems, shouldn't it?
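
To make the idea concrete, here is a sketch of a receive path that flushes only on overflow and otherwise drains the FIFO frame by frame; all helper functions are hypothetical stand-ins for the actual register accesses in the driver:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helpers standing in for the real register accesses: */
extern bool   rx_fifo_overflowed(void);        /* RX FIFO overflow status   */
extern size_t rx_fifo_frames_pending(void);    /* complete frames available */
extern size_t rx_fifo_read_frame(uint8_t *buf, size_t max_len);
extern void   rx_fifo_flush(void);             /* ISFLUSHRX-style strobe    */
extern void   deliver_to_upper_layer(const uint8_t *buf, size_t len);

static void handle_rx(void)
{
    uint8_t frame[128];    /* the cc2538 RX FIFO is 128 bytes deep */

    if (rx_fifo_overflowed()) {
        /* Only in the overflow case is dropping everything unavoidable. */
        rx_fifo_flush();
        return;
    }
    /* Otherwise drain frame by frame; a second packet that arrived while
     * the first was being copied out stays in the FIFO and is handled on
     * the next iteration (or the next RX interrupt). */
    while (rx_fifo_frames_pending() > 0) {
        size_t len = rx_fifo_read_frame(frame, sizeof(frame));
        deliver_to_upper_layer(frame, len);
    }
}
```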

@aeneby
Member

aeneby commented Sep 23, 2016

@smlng

ping between remote-revb and RPi does not work in any direction

Could this perhaps be caused by the Linux kernel on the RPi dropping the packets? RIOT actually sends out incorrectly constructed packets for intra-PAN communication, and my observation has been that Linux will drop these. If that's the cause, however, it seems strange that it would work for the samr21-xpro, since both RIOT devices should be using the same stack.

ping between RPi and samr21-xpro works in both directions

I'd be interested to know if the packets sent out from the samr21-xpro are identical to the ones being sent out from the remote-revb, i.e. identical src/dst PANs and the PAN compression/intra-PAN bit set to 0. Do you know what causes the panic in this scenario? I'm guessing it's a failed assertion, but which one?

@LucaZulberti, you are 100% correct. But the reason it hasn't been fixed is that I don't think the solution is quite as obvious as it first seems. If we let the RX FIFO overflow before we flush it, then we potentially lose a packet anyway, right? The one which didn't fit into the buffer, because we didn't flush it earlier. So there are some corner cases to consider. Unfortunately I will not have much time personally to look at this for the next several weeks, but I would be happy to review any proposed solutions.

@PeterKietzmann
Member Author

@alignan, @aeneby in #5869 we completely disabled ACK interrupts and in that scope I realized that the cc2538 does not handle ACKs and retransmissions in hardware. Is that correct?

So if other drivers are not capable of sending packets this fast, and therefore not provoking the problem, one workaround might be to wait for transmission to complete before starting to process the next packet.

If my above assumption holds and we end up implementing ACK handling as well as retransmission handling in software, it might be reasonable to slow down the sender by waiting for an ACK (even if this is not optimal in terms of performance).
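
A minimal sketch of what such software ACK handling with retransmissions could look like, assuming hypothetical cc2538_send_frame()/cc2538_ack_received() helpers; the 864 µs wait is roughly the 802.15.4 macAckWaitDuration at 2.4 GHz:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include "xtimer.h"

#define ACK_WAIT_US   (864U)   /* ~802.15.4 macAckWaitDuration at 2.4 GHz */
#define MAX_RETRIES   (3U)

/* Hypothetical helpers wrapping the existing driver functions: */
extern int  cc2538_send_frame(const uint8_t *buf, size_t len);
extern bool cc2538_ack_received(uint8_t expected_seq);

/* Send with software ACK handling: retransmit until an ACK for the frame's
 * sequence number arrives or the retry budget is exhausted. */
static int send_with_sw_retries(const uint8_t *buf, size_t len, uint8_t seq)
{
    for (unsigned i = 0; i <= MAX_RETRIES; i++) {
        cc2538_send_frame(buf, len);
        xtimer_usleep(ACK_WAIT_US);
        if (cc2538_ack_received(seq)) {
            return 0;           /* acknowledged */
        }
    }
    return -1;                  /* gave up after MAX_RETRIES retransmissions */
}
```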

@aeneby
Member

aeneby commented Oct 1, 2016

in #5869 we completely disabled ACK interrupts and in that scope I realized that the cc2538 does not handle ACKs and retransmissions in hardware. Is that correct?

@PeterKietzmann yes that is correct (although automatic sending of acknowledgements is supported in hardware)

If my above assumption applies and in case we will implement ACK handling as well as retransmission handling in software, it might be reasonable to slow down the sender by waiting for an ACK (even if this is not with regards to the highest performance)

Hmm but that would imply that we always need to have the ACK_REQ (acknowledgement request) flag set in the frame header of every sent packet, or risk running into the same problem again?

I notice that the cc2538 driver in a certain other IoT platform (hint: starts with C, rhymes with "non-sticky") waits for transmission to complete before returning from the send function. But as far as I'm concerned this should not be necessary, which is why I did not implement it. So the question is, are we really too fast at sending, or just too slow at receiving?

[edit] Clarification of ACK_REQ bit.
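
For reference, the ACK_REQ flag is the acknowledgment request (AR) bit of the IEEE 802.15.4 frame control field; a minimal, non-RIOT-specific sketch of setting it on a raw MAC header:

```c
#include <stdint.h>

/* IEEE 802.15.4 frame control field, acknowledgment request (AR) bit.
 * The FCF is the first two octets of the MAC header, little-endian;
 * the AR bit is bit 5, i.e. 0x20 in the first octet. */
#define IEEE802154_FCF_ACK_REQ  (0x0020U)

static void set_ack_request(uint8_t *mhr)
{
    /* mhr points at the start of the MAC header (first FCF octet). */
    mhr[0] |= (uint8_t)(IEEE802154_FCF_ACK_REQ & 0xff);
}
```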

@smlng
Member

smlng commented Oct 1, 2016

@aeneby from what I saw during my (interop) tests with other boards, i.e. pba-d-01-kw2x and samr21-xpro, the cc2538 is sending too fast when there is fragmentation. samr21 and kw2x require something > 3 ms between distinct frame fragments, otherwise they are unable to reassemble the original frame.

When I enabled debugging in the cc2538 driver (using gnrc_networking for my tests, btw), the time between fragments rises to about 4.8 ms, and then samr21 and kw2x were able to reassemble the original frame and large pings succeed in both directions.

So your suggestion to wait until transmission of one frame/fragment returns might be a good idea, to solve this in general. Nevertheless, I think @PeterKietzmann is right too, if the device cannot handle ACKs and retrans, the driver should implement those in software.

Currently, with #5869 in place we can only send unfragmented frames successfully; everything else is a mess - this needs to be fixed. On the other hand, we should also look into the receive functions of the other boards; maybe we can speed them up a bit as well?!
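
A crude sender-side throttle matching the > 3 ms gap observed above might look like this (a hypothetical hook in the driver's send path, using RIOT's xtimer API):

```c
#include "xtimer.h"

/* Minimum gap observed to let samr21/kw2x receivers keep up with
 * back-to-back 6LoWPAN fragments (roughly > 3 ms in the tests above). */
#define CC2538_INTER_FRAME_GAP_US   (4000U)

/* Hypothetical hook called at the end of the driver's send routine. */
static void throttle_after_send(void)
{
    xtimer_usleep(CC2538_INTER_FRAME_GAP_US);
}
```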

@aeneby
Member

aeneby commented Oct 1, 2016

I don't know about other boards/MCUs unfortunately, since all I have is ones based on the cc2538. I suppose in the interests of interoperability we could introduce a slight delay between sending packets; at this point optimization seems somewhat premature. Are you able to confirm that something like this (untested) patch resolves the issue?

As for handling ACKs and CSMA etc, wouldn't doing this in the driver be a duplication of this effort (and similar others)? Or am I misunderstanding something?

@smlng
Member

smlng commented Oct 6, 2016

@aeneby your untested patch is now tested by me 😄 and it works! I can now send fragments, that is, pings with large payloads, successfully between remote-revb and samr21-xpro.

However, as @PeterKietzmann said: ACKs and retransmissions are currently missing and have to be implemented in/by the driver.

@PeterKietzmann
Member Author

@smlng great! As said, for me it's OK to introduce this "blocker" if it fixes interoperability. As @aeneby proposed, I also think that software retransmission and ACK handling might be generalized by a MAC protocol, even though I didn't look into the details of #3730.

@smlng
Member

smlng commented Oct 6, 2016

@PeterKietzmann fine with me to generalize such functionality, so we should move forward there, too.

However, @aeneby will you provide a PR with your patch or shall I do so? Further, I'd like to get those two remote-xyz / cc2538 PRs (#5823 and #5840) merged (very) soon, so we can move on.

@alignan
Contributor

alignan commented Oct 6, 2016

Unfortunately, I have very low bandwidth to do any tests/changes until next week, so any help on those is welcome, thanks!

@aeneby
Member

aeneby commented Oct 6, 2016

@smlng see #5915

WRT ACK/CSMA, I don't know how far off a MAC protocol capable of doing this is, but it would certainly save duplication of effort if every driver could utilise the same code. That said, the cc2538 does have a dedicated programmable "Command Strobe/CSMA-CA Processor" (CSP) which would be much more efficient at handling this. I haven't looked very deeply into this, though.

@smlng
Member

smlng commented Oct 10, 2016

@aeneby and @PeterKietzmann, I tested and merged #5915 - I think we can close this one then?

@PeterKietzmann
Member Author

I will close it and set the memo label because there is still room for improvement!

@smlng
Member

smlng commented Oct 10, 2016

@PeterKietzmann, yes you're right, but that's the case with (m)any solutions/PRs - we might never close any issue then 😬

@smlng smlng added the State: archived State: The PR has been archived for possible future re-adaptation label Oct 10, 2016