Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

gnrc_sock_tcp: add gnrc sock tcp #16494

Merged

Conversation

brummer-simon
Copy link
Member

@brummer-simon brummer-simon commented May 22, 2021

Contribution description

This PR adds gnrc_sock_tcp, finally adding a sock_tcp implementation based on gnrc_tcp.

Testing procedure

To test this PR a simplified version of the gnrc_tcp testsuite is supplied under tests/gnrc_sock_tcp.
To execute it, just follow the instructions there.

Issues/PRs references

Depends on PR #16459
Depends on PR #16461
Depends on PR #16493
Closes #16493 - TCP Sockets can not be used / built

@github-actions github-actions bot added Area: network Area: Networking Area: sys Area: System Area: tests Area: tests and testing framework labels May 22, 2021
@brummer-simon brummer-simon force-pushed the gnrc_sock_tcp-add_gnrc_sock_tcp branch from a732ede to 4e8063c Compare May 22, 2021 12:03
@MrKevinWeiss MrKevinWeiss added this to the Release 2021.07 milestone Jun 21, 2021
@brummer-simon brummer-simon force-pushed the gnrc_sock_tcp-add_gnrc_sock_tcp branch from 4e8063c to 43d57ab Compare July 2, 2021 08:04
@github-actions github-actions bot added the Area: doc Area: Documentation label Jul 2, 2021
@benpicco benpicco added the State: waiting for other PR State: The PR requires another PR to be merged first label Jul 8, 2021
@benpicco benpicco removed the State: waiting for other PR State: The PR requires another PR to be merged first label Jul 13, 2021
@benpicco
Copy link
Contributor

benpicco commented Jul 13, 2021

Now this also needs a rebase 😉

You want to fix the last commit message while you are at it:

gnrc_sock_tcp: Add gnrc_tcp _tcp

@brummer-simon brummer-simon force-pushed the gnrc_sock_tcp-add_gnrc_sock_tcp branch 3 times, most recently from 646bb6b to 134180c Compare July 14, 2021 04:27
@brummer-simon
Copy link
Member Author

Done

@MrKevinWeiss MrKevinWeiss removed this from the Release 2021.07 milestone Jul 15, 2021
@benpicco benpicco added the CI: ready for build If set, CI server will compile all applications for all available boards for the labeled PR label Jul 21, 2021
Copy link
Contributor

@benpicco benpicco left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! If you use the existing types intended for implementing this you should be able to get rid of this casting show 😉

@brummer-simon brummer-simon force-pushed the gnrc_sock_tcp-add_gnrc_sock_tcp branch 2 times, most recently from b1d0cef to 8f62149 Compare August 9, 2021 09:53
@brummer-simon brummer-simon force-pushed the gnrc_sock_tcp-add_gnrc_sock_tcp branch 2 times, most recently from ba1588b to fe77e03 Compare August 9, 2021 18:56
@brummer-simon
Copy link
Member Author

Ping: @benpicco , @miri64 All of the above Issues have been addressed.

Comment on lines 38 to 42
if (err) {
printf("%s: returns %s\n", name, strerror(err));
} else {
printf("%s: returns %d\n", name, err);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need for this. strerror(0) returns "Success"

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not on every platform ;). And the python code calling the functions matches for the specific result. Thats why I handle 0 manually.

Copy link
Member Author

@brummer-simon brummer-simon Aug 9, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW.: You added an implementation returning OK on 0, a few hours ago. :)

Copy link
Contributor

@benpicco benpicco Aug 9, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

arg sorry for that - feel free to change to return value to "Success" so you can get rid of this adapter function

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I prefer to keep it as it is. I am not sure if the returned string is always "Success" on all platforms. For any error code the return value should always be textual representation of the enum value but for 0 different strings are possible. Since this is test code I think keeping the tiny function is okay.

Comment on lines +47 to +48
dump_args(argc, argv);

Copy link
Contributor

@benpicco benpicco Aug 10, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
dump_args(argc, argv);
if (argc < 3) {
printf("usage: %s <host> <port>\n", argv[0]);
return -1;
}

let's be a bit more user friendly. Right now if you just call gnrc_tcp_open expecting to get usage information, the test will simply crash on you.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also when I try

> gnrc_tcp_open fe80::90a7:a6ff:fe4b:2e32 12345
gnrc_tcp_open: argc=3, argv[0] = gnrc_tcp_open, argv[1] = fe80::90a7:a6ff:fe4b:2e32, argv[2] = 12345
gnrc_tcp_open: returns -EAFNOSUPPORT

> gnrc_tcp_open fe80::90a7:a6ff:fe4b:2e32%5 12345
gnrc_tcp_open: argc=3, argv[0] = gnrc_tcp_open, argv[1] = fe80::90a7:a6ff:fe4b:2e32%5, argv[2] = 12345
gnrc_tcp_open: returns -EAFNOSUPPORT

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure if user-friendlyness should be is the design goal for a test API only called from python code.

For the -EAFNOSUPPORT thing I need to take a deeper look. This test should only expose sock_connect. Did you run this agains the gnrc_tcp tests instead of the gnrc_sock_tests that are the subject of this PR?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh sorry I ran tests/gnrc_tcp instead of tests/gnrc_sock_tcp 😅

I also figured out the format

> sock_tcp_connect [fe80::90a7:a6ff:fe4b:2e32]:12345 6666
sock_tcp_connect: argc=3, argv[0] = sock_tcp_connect, argv[1] = [fe80::90a7:a6ff:fe4b:2e32]:12345, argv[2] = 6666
sock_tcp_connect: returns Unknown error -110

I wanted to run this against netcat since I expected it to behave very much like netcat, just with more manual steps

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah I have to specify the interface as well

> sock_tcp_connect [fe80::90a7:a6ff:fe4b:2e32%5]:12345 7412
sock_tcp_connect: argc=3, argv[0] = sock_tcp_connect, argv[1] = [fe80::90a7:a6ff:fe4b:2e32%5]:12345, argv[2] = 7412
sock_tcp_connect: returns 0

Have you considered using netutils_get_ipv6() in gnrc_tcp_ep_from_str() 😉

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm since we use strerror, most of the used errorcodes seem to be Unknown :(

I have not considered netutils_get_ipv6(), but I would argue that this is not in scope of this PR. Cleaning this up should happen as a seperate change. This PR is for the interface between gnrc and sock, not changes in gnrc.

Copy link
Contributor

@benpicco benpicco Aug 10, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm since we use strerror, most of the used errorcodes seem to be Unknown :(

That's because strerror expects positive error codes

With this

--- a/tests/gnrc_sock_tcp/main.c
+++ b/tests/gnrc_sock_tcp/main.c
@@ -36,7 +36,7 @@ void dump_args(int argc, char **argv)
 void print_result(const char *name, int err)
 {
     if (err) {
-        printf("%s: returns %s\n", name, strerror(err));
+        printf("%s: returns %s\n", name, strerror(-err));
     } else {
         printf("%s: returns %d\n", name, err);
     }

I get

sock_tcp_read: returns Resource temporarily unavailable

instead of

sock_tcp_read: returns Unknown error -11

I have not considered netutils_get_ipv6(), but I would argue that this is not in scope of this PR. Cleaning this up should happen as a seperate change.

I agree. But we still should make the CLI interface of the test a bit nicer (print usage instead of crashing if argument count does not match, maybe even use default timeout / local port if the argument was omitted).

There is value in being able to test out individual bits of the API in an exploitative way, it shouldn't be a frustrating experience.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll fix it

@brummer-simon
Copy link
Member Author

brummer-simon commented Aug 10, 2021

If you try to establish a new connection before sock_tcp_disconnect was finisched, the only TCB that could accept the connection is still in use.

How do I know sock_tcp_disconnect() has finished? Shouldn't it just block?
And since I kill the other half of the connection, I don't expect any response (unless Linux is taking care of this).

Yes it blocks. Internally it tries to close the connection. If the other side doesn't send anymore it still has to wait two minutes before it is unblocked. This has something todo with slow roundtrip times from the early internet. So if you try to open a new connection before sock_tcp_disconnect finished it must fail.

The entire process is quite complicated to explain in a code review, if you wanna know details, the TCP RFC is your friend. For your test: just print after your disconnect and then call nc again. In addition: you can tweak all TCP Timouts or the number of preallocated buffers via kconfig

@benpicco
Copy link
Contributor

The TCP sock API actually contains an example echo server.

I would expect this to work with gnrc_sock_tcp.

@brummer-simon
Copy link
Member Author

The TCP sock API actually contains an example echo server.

I would expect this to work with gnrc_sock_tcp.

The TCP sock API actually contains an example echo server.

I would expect this to work with gnrc_sock_tcp.

I haven't tried it but after the changes related to SOCK_NO_TIMEOUT, we should test it.

@benpicco
Copy link
Contributor

please squash! (&rebase)

@brummer-simon brummer-simon force-pushed the gnrc_sock_tcp-add_gnrc_sock_tcp branch from b979807 to 6fba0d9 Compare August 14, 2021 19:32
@brummer-simon
Copy link
Member Author

brummer-simon commented Aug 14, 2021

please squash! (&rebase)

Done. I fixed the NO_TIMOUT mapping and the asserts as well.

@benpicco
Copy link
Contributor

benpicco commented Aug 14, 2021

Hmm the echo server example still does not behave as one would expect :/

main.c
#include "net/sock/tcp.h"

#include "net/ipv6/addr.h"
#include "net/gnrc.h"
#include "net/gnrc/netif.h"

#define SOCK_QUEUE_LEN  (1U)

sock_tcp_t sock_queue[SOCK_QUEUE_LEN];
uint8_t buf[128];

int main(void)
{
    /* get interfaces and print their addresses */
    gnrc_netif_t *netif = NULL;
    while ((netif = gnrc_netif_iter(netif))) {
        ipv6_addr_t ipv6_addrs[CONFIG_GNRC_NETIF_IPV6_ADDRS_NUMOF];
        int res = gnrc_netapi_get(netif->pid, NETOPT_IPV6_ADDR, 0, ipv6_addrs,
                                  sizeof(ipv6_addrs));

        if (res < 0) {
            continue;
        }
        for (unsigned i = 0; i < (unsigned)(res / sizeof(ipv6_addr_t)); i++) {
            char ipv6_addr[IPV6_ADDR_MAX_STR_LEN];

            ipv6_addr_to_str(ipv6_addr, &ipv6_addrs[i], IPV6_ADDR_MAX_STR_LEN);
            printf("My address is %s\n", ipv6_addr);
        }
    }

    sock_tcp_ep_t local = SOCK_IPV6_EP_ANY;
    sock_tcp_queue_t queue;
    local.port = 12345;
    if (sock_tcp_listen(&queue, &local, sock_queue, SOCK_QUEUE_LEN, 0) < 0) {
        puts("Error creating listening queue");
        return 1;
    }
    puts("Listening on port 12345");
    while (1) {
        sock_tcp_t *sock;
        if (sock_tcp_accept(&queue, &sock, SOCK_NO_TIMEOUT) < 0) {
            puts("Error accepting new sock");
        }
        else {
            int read_res = 0;
            puts("Reading data");
            while (read_res >= 0) {
                read_res = sock_tcp_read(sock, &buf, sizeof(buf),
                                         SOCK_NO_TIMEOUT);
                if (read_res < 0) {
                    puts("Disconnected");
                    break;
                }
                else {
                    int write_res;
                    printf("Read: \"");
                    for (int i = 0; i < read_res; i++) {
                        printf("%c", buf[i]);
                    }
                    puts("\"");
                    if ((write_res = sock_tcp_write(sock, &buf,
                                                    read_res)) < 0) {
                        puts("Errored on write, finished server loop");
                        break;
                    }
                }
            }
            sock_tcp_disconnect(sock);
        }
    }
    sock_tcp_stop_listen(&queue);
    return 0;
}
Makefile
# name of your application
APPLICATION = tcp_echo_server

# If no BOARD is found in the environment, use this default:
BOARD ?= native

# This has to be the absolute path to the RIOT base directory:
RIOTBASE ?= $(CURDIR)/../..

USEMODULE += gnrc_netdev_default
USEMODULE += auto_init_gnrc_netif
USEMODULE += gnrc_ipv6
USEMODULE += gnrc_sock_tcp

include $(RIOTBASE)/Makefile.include
main(): This is RIOT! (Version: 2021.10-devel-369-g6fba0-gnrc_sock_tcp-add_gnrc_sock_tcp)
My address is fe80::748f:e6ff:fed7:426d
Listening on port 12345
Reading data
Read: "Hello World
"
Read: ""
Read: ""
Read: ""
Read: ""
Read: ""
Read: ""
Read: ""
% echo Hello World | nc -6 fe80::748f:e6ff:fed7:426d%tapbr0 12345
Hello World

^C
% echo Hello World | nc -6 fe80::748f:e6ff:fed7:426d%tapbr0 12345                :(
1 %

@brummer-simon
Copy link
Member Author

brummer-simon commented Aug 15, 2021

I managed to figure out whats happening here. In short: The example is broken. Here are more details:

We see here interesting behavior: On entering Ctrl+C netcat actually tries to teardown the connection normally by sending a FIN that gets an ACK in reply. Receiving a FIN before sending our own FIN puts a TCP connection into a state called CLOSED_WAIT.

This state means basically, our peer has sent all its data and excepts from us to send a FIN in return (aka. call close).
If you call read on a TCB in this state, your either get the remaining stored bytes or zero if everything was read already. The timeout is not in effect here, because we can't expect to receive any more data from the peer because the FIN was received already.

The zero bytes read returns mean here: You have read all data and from this connection no more data will come.
-> Since the example does not leave the processing loop if zero bytes are returned, the processing loop is stuck forever, therefore the connection is never closed from the RIOT side.

This behavior is not a bug in gnrc_tcp. Infact there was a bug (see #12367) a while ago that this behavior was not implemented this way.
The test function "test_gnrc_tcp_recv_behavior_on_closed_connection" ensures this behavior is enforced.

If you change:

if (read_res < 0) {
    puts("Disconnected");
    break;
}

to

if (read_res <= 0) {
    puts("Disconnected");
    break;
}

It works as intended.

@benpicco
Copy link
Contributor

That's good news (or maybe bad news as we have a bug in the example code 😕)

I moved the example code to a proper example (#16739) and noticed something else: The example client will not disconnect when the example server disconnects:

client

All up, running the shell now
> send fe80::7837:fcff:fe7d:1aaf 12345 Test
send fe80::7837:fcff:fe7d:1aaf 12345 Test
Read: "Test"
ps
asdsadas

server

main(): This is RIOT! (Version: 2021.10-devel-369-g6fba0-gnrc_sock_tcp-add_gnrc_sock_tcp)
My address is fe80::7837:fcff:fe7d:1aaf
Listening on port 12345
Reading data
Read: "Test"
Disconnected

Maybe there is a bug in the client code too?

@brummer-simon
Copy link
Member Author

That's good news (or maybe bad news as we have a bug in the example code confused)

I moved the example code to a proper example (#16739) and noticed something else: The example client will not disconnect when the example server disconnects:

client

All up, running the shell now
> send fe80::7837:fcff:fe7d:1aaf 12345 Test
send fe80::7837:fcff:fe7d:1aaf 12345 Test
Read: "Test"
ps
asdsadas

server

main(): This is RIOT! (Version: 2021.10-devel-369-g6fba0-gnrc_sock_tcp-add_gnrc_sock_tcp)
My address is fe80::7837:fcff:fe7d:1aaf
Listening on port 12345
Reading data
Read: "Test"
Disconnected

Maybe there is a bug in the client code too?

Not necessarily. Depending on the timing in connection teardown, disconnect can take up to two minutes for its return.
I mentioned this above somewhere.

@benpicco
Copy link
Contributor

Depending on the timing in connection teardown, disconnect can take up to two minutes for its return.
I mentioned this above somewhere.

Hm I thought this was only the case if the other side does not properly disconnect?
If the other end calls sock_tcp_disconnect() what is there to wait for?

@brummer-simon
Copy link
Member Author

Retransmissions of FIN Packets in cases where the ACK sent in response was lost on the way. A TCP connection is practically dead if after two times the maximum segment lifetime no packet was transmitted. Linux handles this this internally and return fast after close and allocates a new TCB for the next connection. RIOT can't do this since everything is statically allocated. Thats why close/disconnect has to wait until the used TCB is reusable again. If you want more details look for TIME-WAIT in the TCP RFC.

TCP connection teardown is pretty complicated

@benpicco
Copy link
Contributor

benpicco commented Aug 15, 2021

But I'm testing with netdev_tap - it's very unlikely that a packet gets lost here. And the behavior is consistent.

I don't see this issue with LWIP=1

@brummer-simon
Copy link
Member Author

brummer-simon commented Aug 15, 2021

Okay. Let me explain the Issue in detail:
Lets assume we have a established TCP connection between host A and B.
A wants to terminate its connection and sends a FIN to B. B receives it and sends an ACK in response.
A receives the ACK and knows that B received the FIN before.

At some point later B sends a FIN to A. Now A knows that B wants to terminate its side of the connection.
Now A sends an ACK to B and here comes the Issue. There are two possibilities:

  1. B received the ACK. This was the last packet transmitted.
  2. The ACK gets lost on its way to B. After a while B's retransmission mechanism sends FIN again since it was never ACKed.

At this Point A never knows if B received the ACK because in either case B would not answer immediately.
TCPs solution to this is, that A has wait two times the maximum segment lifetime (the standard uses a Minute) for packets from B before it assumes that B received the last ACK and B will not send anything again.

So if the so called LAST WAIT timer times out and B didn't send anything within the last to minutes -> A knows that B received the ACK before and closes its connection. If A receives a FIN from B triggered by its re-transmit mechanism, during its waiting period, A sends a new ACK, resets the timer and waits again.

That's why you observe this waiting period regardless of the underlying network stability. TCP must assume the most unstable underlying connection (TCP was invented around 1981, just assume the network quality and roundtrip times back then ;)).

I don't know how LWIP is handling this (maybe they use some hack like sending reset packages on connection teardown, who knows). I thought about how I implement it differently but all solutions I came up with, would enable easily exploitable memory leakage in the packet buffer that's why I didn't try this.

You could of course shorten the maximum segment lifetime constant by defines. The gnrc_tcp tests do this to speedup the connection teardown process.

@brummer-simon
Copy link
Member Author

brummer-simon commented Aug 15, 2021

A possible solution might be to calculate a better fitting and much shorter maximum segment lifetime dynamically based on to the mean round-trip time used by the re-transmission mechanism. This could speed things up, but this is definitely non-standard behavior and out of scope of this PR.

Copy link
Contributor

@benpicco benpicco left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not really happy yet as the explanation doesn't make the unexpected behavior go away or give a way for an application developer to mitigate that (other than always using timeouts).

However, this is indeed unrelated to the SOCK API adoption that now really just is a thin wrapper. So ACK!

@benpicco benpicco merged commit 8d9ca8e into RIOT-OS:master Aug 15, 2021
@brummer-simon brummer-simon deleted the gnrc_sock_tcp-add_gnrc_sock_tcp branch August 15, 2021 18:19
@brummer-simon
Copy link
Member Author

@benpicco - I did some digging regarding that timeout. Current Linux and BSD Systems don't use the 2 Minute MSL anymore. They use one minute instead. We can to this too. This eases the pain a little bit.

@brummer-simon
Copy link
Member Author

brummer-simon commented Aug 15, 2021

Interestingly enough this has been done already. The current MSL value is 30s. I think I'll take a lock into the dynamic MSL calculation thing. This might be a good but experimental optional solution.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Area: doc Area: Documentation Area: network Area: Networking Area: sys Area: System Area: tests Area: tests and testing framework CI: ready for build If set, CI server will compile all applications for all available boards for the labeled PR
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants