
Data corruption while serving large files via HTTP with TCP2 #28587

Closed
armandciejak opened this issue Sep 22, 2020 · 15 comments
Assignees
Labels
area: Networking bug The issue is a bug, or the PR is fixing a bug priority: low Low impact/importance bug

Comments

@armandciejak
Contributor

Describe the bug
While testing TCP2 with the dumb_http_server sample application on the MIMXRT1050_EVK board, I regularly receive a corrupted file.

To Reproduce
Steps to reproduce the behavior:

  1. extract http_download_test.zip
  2. apply test_http_download.patch
  3. west build -b mimxrt1050_evk zephyr/samples/net/sockets/dumb_http_server
  4. west flash
  5. ping the board from a terminal: ping -A 192.168.1.241
  6. run the test script to download the file via HTTP and validate it (size and md5sum): ./test_http_download.sh
  7. when a problem occurs, the script stops and the downloaded data is left in the file "test"

Expected behavior
Received files should not be corrupted.

Impact
Since we need to serve "large" files via HTTP, this problem prevents us from delivering stable firmware to our customers.

Logs and console output
Diffs between the original file and the received files:

--- samples/net/sockets/dumb_http_server/src/response_100k.html.bin	2020-09-22 08:36:41.050513782 +0200
+++ test_corrupted6	2020-09-22 08:52:32.679979057 +0200
@@ -1,6 +1,3 @@
-HTTP/1.0 200 OK
-Content-Type: text/html; charset=utf-8
-
 <html>
 <head>
 <style>
@@ -468,7 +465,7 @@
 b2YgZmlmdHkgcGVyY2VudCAoNTAlKSBvciBtb3JlIG9mIHRoZQogICAgICBvdXRzdGFuZGluZyBz
 aGFyZXMsIG9yIChpaWkpIGJlbmVmaWNpYWwgb3duZXJzaGlwIG9mIHN1Y2ggZW50aXR5LgoKICAg
 ICAgIllvdSIgKG9yICJZb3VyIikgc2hhbGwgbWVhbiBhbiBpbmRpdmlkdWFsIG9yIExlZ2FsIEVu
-dGl0eQogICAgICBleGVyY2lzaW5nIHBlcm1pc3Npb25zIGdyYW50ZWQgYnkgdGhpcyBMaWNlbnNl
+dGlceedgICAgICBleGVyY2lzaW5nIHBlcm1pc3Npb25zIGdyYW50ZWQgYnkgdGhpcyBMaWNlbnNl
 LgoKICAgICAgIlNvdXJjZSIgZm9ybSBzaGFsbCBtZWFuIHRoZSBwcmVmZXJyZWQgZm9ybSBmb3Ig
 bWFraW5nIG1vZGlmaWNhdGlvbnMsCiAgICAgIGluY2x1ZGluZyBidXQgbm90IGxpbWl0ZWQgdG8g
 c29mdHdhcmUgc291cmNlIGNvZGUsIGRvY3VtZW50YXRpb24KICAgICAgc291cmNlLCBhbmQgY29u
--- samples/net/sockets/dumb_http_server/src/response_100k.html.bin	2020-09-22 08:36:41.050513782 +0200
+++ test_corrupted7	2020-09-22 09:09:52.446353114 +0200
@@ -1,6 +1,3 @@
-HTTP/1.0 200 OK
-Content-Type: text/html; charset=utf-8
-
 <html>
 <head>
 <style>
@@ -508,7 +505,7 @@
 IGNvbnNwaWN1b3VzbHkgbWFya2VkIG9yIG90aGVyd2lzZQogICAgICBkZXNpZ25hdGVkIGluIHdy
 aXRpbmcgYnkgdGhlIGNvcHlyaWdodCBvd25lciBhcyAiTm90IGEgQ29udHJpYnV0aW9uLiIKCiAg
 ICAgICJDb250cmlidXRvciIgc2hhbGwgbWVhbiBMaWNlbnNvciBhbmQgYW55IGluZGl2aWR1YWwg
-b3IgTGVnYWwgRW50aXR5CiAgICAgIG9uIGJlaGFsZiBvZiB3aG9tIGEgQ29udHJpYnV0aW9uIGhh
+b3IW4gYSBsYgRW50aXR5CiAgICAgIG9uIGJlaGFsZiBvZiB3aG9tIGEgQ29udHJpYnV0aW9uIGhh
 cyBiZWVuIHJlY2VpdmVkIGJ5IExpY2Vuc29yIGFuZAogICAgICBzdWJzZXF1ZW50bHkgaW5jb3Jw
 b3JhdGVkIHdpdGhpbiB0aGUgV29yay4KCiAgIDIuIEdyYW50IG9mIENvcHlyaWdodCBMaWNlbnNl
 LiBTdWJqZWN0IHRvIHRoZSB0ZXJtcyBhbmQgY29uZGl0aW9ucyBvZgogICAgICB0aGlzIExpY2Vu
--- samples/net/sockets/dumb_http_server/src/response_100k.html.bin	2020-09-22 08:36:41.050513782 +0200
+++ test_corrupted8	2020-09-22 09:19:59.896412894 +0200
@@ -1,6 +1,3 @@
-HTTP/1.0 200 OK
-Content-Type: text/html; charset=utf-8
-
 <html>
 <head>
 <style>
@@ -635,7 +632,7 @@
 b3duZXIKICAgICAgb3IgYnkgYW4gaW5kaXZpZHVhbCBvciBMZWdhbCBFbnRpdHkgYXV0aG9yaXpl
 ZCB0byBzdWJtaXQgb24gYmVoYWxmIG9mCiAgICAgIHRoZSBjb3B5cmlnaHQgb3duZXIuIEZvciB0
 aGUgcHVycG9zZXMgb2YgdGhpcyBkZWZpbml0aW9uLCAic3VibWl0dGVkIgogICAgICBtZWFucyBh
-bnkgZm9ybSBvZiBlbGVjdHJvbmljLCB2ZXJiYWwsIG9yIHdyaXR0ZW4gY29tbXVuaWNhdGlvbiBz
+bnkgZm9ybSBvZiBlbGVjdHJvbmljLCB2ZXJiYWwsIG9yIHdyaXR0ZW4gYSBvdXVuaWNhdGlvbiBz
 ZW50CiAgICAgIHRvIHRoZSBMaWNlbnNvciBvciBpdHMgcmVwcmVzZW50YXRpdmVzLCBpbmNsdWRp
 bmcgYnV0IG5vdCBsaW1pdGVkIHRvCiAgICAgIGNvbW11bmljYXRpb24gb24gZWxlY3Ryb25pYyBt
 YWlsaW5nIGxpc3RzLCBzb3VyY2UgY29kZSBjb250cm9sIHN5c3RlbXMsCiAgICAgIGFuZCBpc3N1
--- samples/net/sockets/dumb_http_server/src/response_100k.html.bin	2020-09-22 08:36:41.050513782 +0200
+++ test_corrupted9	2020-09-22 09:33:27.500521826 +0200
@@ -1,6 +1,3 @@
-HTTP/1.0 200 OK
-Content-Type: text/html; charset=utf-8
-
 <html>
 <head>
 <style>
@@ -959,7 +956,8 @@
 U291cmNlIGZvcm0gb2YgYW55IERlcml2YXRpdmUgV29ya3MKICAgICAgICAgIHRoYXQgWW91IGRp
 c3RyaWJ1dGUsIGFsbCBjb3B5cmlnaHQsIHBhdGVudCwgdHJhZGVtYXJrLCBhbmQKICAgICAgICAg
 IGF0dHJpYnV0aW9uIG5vdGljZXMgZnJvbSB0aGUgU291cmNlIGZvcm0gb2YgdGhlIFdvcmssCiAg
-ICAgICAgICBleGNsdWRpbmcgdGhvc2Ugbm90aWNlcyB0aGF0IGRvIG5vdCBwZXJ0YWluIHRvIGFu
+ICAgICAgICBleGNsdWRpbmcgdGhvc2Ugbm90aoYXQgcmVtYWluCiAgICAg
+IXJ0YWluIHRvIGFu
 eSBwYXJ0IG9mCiAgICAgICAgICB0aGUgRGVyaXZhdGl2ZSBXb3JrczsgYW5kCgogICAgICAoZCkg
 SWYgdGhlIFdvcmsgaW5jbHVkZXMgYSAiTk9USUNFIiB0ZXh0IGZpbGUgYXMgcGFydCBvZiBpdHMK
 ICAgICAgICAgIGRpc3RyaWJ1dGlvbiwgdGhlbiBhbnkgRGVyaXZhdGl2ZSBXb3JrcyB0aGF0IFlv

Environment (please complete the following information):

  • OS: Kubuntu 20.04
  • Toolchain: Zephyr SDK 0.11.4, west 0.7.3
  • Zephyr: v2.4.0-rc2
@pfalcon
Contributor

pfalcon commented Sep 22, 2020

extract http_download_test.zip
apply test_http_download.patch

I'd suggest developing a separate sample for such test scenarios, with a configurable download size and content generated programmatically as a pseudo-random sequence (e.g. a hash of an incrementing value).

Overall comments on this repro case: it mixes too many variables at once: a) it uses mimxrt1050_evk; b) it even patches the sample; c) it patches network stack parameters.

And the issue can be a result of any of the changes above. The baseline for testing should be qemu_x86 with standard (SLIP) networking (which is a hard case, btw) and default Zephyr config options. Does that work or not? Only after answering that question can we start varying parameters, one by one. (Well, the monte-carlo way of testing, betting on zero, may also work with enough effort, of course ;-).)

@jukkar
Member

jukkar commented Sep 22, 2020

I could replicate this; it took 500 HTTP requests in my test run to see the issue. The difference between the files seems to be quite small: in your examples from 4 to 24 bytes, and in my test it was 24 bytes.

@jukkar
Member

jukkar commented Sep 22, 2020

@armandciejak Please try PR #28595; it has a locking fix that might help here. At least my device is running fine and has so far served 2500 requests without any issues.

@jukkar
Member

jukkar commented Sep 22, 2020

I have been running the HTTP curl script attached to this issue and seen the following symptoms:

mimxrt1050_evk

nucleo_f767zi

sam-e70

  • no issues seen, even after 12000 HTTP requests
  • disappointing and fluctuating transfer rate, from 100 kB/s to 300 kB/s, though typically around 150 kB/s

frdm-k64f

  • transferred 5600 HTTP requests, no issues found
  • frdm-k64f had the fastest transfer rate of the devices I tried, a constant 590 kB/s, even though it is not the fastest device in this set

All of the tests above were run with #28595 applied.

The issue might be in the NXP mcux Ethernet driver; at least it is the only platform where this data mismatch has been seen in my tests. On the other hand, both mimxrt1050_evk and frdm-k64f use the same Ethernet driver, yet I did not see issues on frdm-k64f.

@pfalcon
Contributor

pfalcon commented Sep 22, 2020

@jukkar:

On the other hand both the mimxrt1050-evk and frdm-k64f are using the same Ethernet driver but I did not see issues in frdm-k64f.

But before that, you wrote:

All of the tests above were with #28595 applied.

So, do you mean that before testing frdm_k64f with #28595 applied, as described above, you also tested this board without the patch and didn't see any issues either?

@jukkar
Member

jukkar commented Sep 23, 2020

So, do you mean that before testing frdm_k64f with #28595 applied, as described above, you also tested this board without patch, and didn't see any issues either?

No, I did not use frdm_k64f in my earlier tests.

@pfalcon
Contributor

pfalcon commented Sep 23, 2020

@jukkar:

No, I did not use frdm_k64f in my earlier tests.

Thanks for clarifying. Do you have all the code for the test somewhere in a branch you could push to GitHub? I'd like to run it on frdm_k64f (with and without #28595), but my concern is reproducibility: starting from @armandciejak's instructions in the description of this ticket, each of us doing things manually will introduce various deviations, and in the end we'll be testing slightly different things. It would be nice to have a single setup to start from. Thanks.

@jukkar
Member

jukkar commented Sep 23, 2020

Do you have all the code for the test somewhere in a branch

I used the http_download_test.zip file attached at the beginning of this issue and modified the IP addresses to suit my network setup.

@jukkar
Member

jukkar commented Sep 23, 2020

In case it is useful: I modified the test_http_download.sh script so that I can run multiple instances against different hosts at the same time.

$ cat test_http_download.sh
#!/bin/sh

# Usage: ./test_http_download.sh <server-ip>
#SERVER='192.168.178.122:8080'
SERVER="$1:8080"
EXPECTED_FILE_SIZE=108222
FILE="test.$SERVER"

echo "d969441601d473db911f28dae62101b7 $FILE" > "MD5SUM.$FILE"

while true; do
    # download the file; stop on any curl error
    curl "http://$SERVER" --max-time 10 --output "$FILE" || exit $?
    FILE_SIZE=$(stat --printf=%s "$FILE")
    if [ "$FILE_SIZE" -ne "$EXPECTED_FILE_SIZE" ]; then
        echo "wrong size $FILE, expected $EXPECTED_FILE_SIZE, got $FILE_SIZE"
        exit 1
    fi
    # verify the content as well, not just the size
    md5sum -c "MD5SUM.$FILE" || exit $?
done

pfalcon added a commit to pfalcon/zephyr that referenced this issue Sep 26, 2020
@pfalcon
Contributor

pfalcon commented Sep 26, 2020

Ok, so I tried to do testing from my side. I took the http_download_test.zip attached to this ticket, removed the unrelated changes (which literally leaves it replacing response_big.html.bin with a symlink to response_100k.html.bin), and pushed it as the https://github.com/pfalcon/zephyr/tree/dumb_http_server-100k branch for reproducibility. I used test_http_download.sh from http_download_test.zip for testing (with trivial modifications, like printing the number of the current iteration).

I started by testing against revision 9bbbf17, which is by now a bit dated, but: a) I did extensive testing against that revision with the pristine dumb_http_server/big_http_download; b) I'd like to triage the intermediate changes, to avoid possible regressions.

  1. So, with the setup as in https://github.com/pfalcon/zephyr/tree/dumb_http_server-100k, i.e. response_100k.html.bin being served while the rest of the options are per the original prj.conf, it is very easy to reproduce an issue of truncated output with curl, on both qemu_x86 and frdm_k64f. This is the crux of the issue; more on this below.

  2. Then, the (seemingly) only relevant override option from http_download_test.zip is CONFIG_NET_BUF_TX_COUNT=256. This option masks the issue of receiving an incomplete file: with it applied, I was able to run 40000 iterations of the test_http_download.sh script on both qemu_x86 and frdm_k64f.

So, regarding test scenario 1 above. What I see is that send() may fail with ENOMEM (errno 12). Well, it shouldn't do that so easily. In the default socket configuration (no non-blocking mode, no socket timeout set), send() should keep trying to send at least some data out (forever). The socket stack may return an error like ENOMEM only if it's sure that no further progress will ever be made, e.g. if a deadlock is detected. In that case, the socket should be marked as being in an "irrecoverable error state" (remember, the stack decided that the situation is irrecoverable; if it were recoverable, it should just keep trying). An "irrecoverable error state" on a socket should in turn lead to sending a TCP RST to the peer - preferably ASAP, if not "soon", but definitely "eventually".

Here we hit on the concept of "transient" vs "irrecoverable" POSIX errors. The only transient error is EAGAIN: it means the current operation didn't succeed, but the user can retry later. Any other error is effectively fatal - the user should end the lifetime of the current object (the socket, in this case) and may retry with a new instance. But the user application relies on the system to handle that lifecycle properly, including any associated error states. In the case of a TCP socket, that includes sending an RST packet to the peer.

What we have currently is that the user application receives ENOMEM from send(), expectedly treats it as a fatal error, and closes the socket. But the IP stack doesn't send an RST to the peer - it sends a normal FIN - so the peer has received incomplete data, but has no indication that it's invalid/incomplete.

To repeat, returning EAGAIN would be another option, but it should not be returned unsolicited; it is enabled by the O_NONBLOCK or SO_SNDTIMEO options. So, there are only two choices in the current situation:

  1. Don't return ENOMEM; patiently keep trying to send the data.
  2. If a guaranteed "no progress" condition is detected, make sure an RST will be sent on the socket, and return ENOMEM.

I wasn't able to reproduce the data corruption (i.e. changed bytes) issue in my testing so far (with qemu_x86/frdm_k64f), so as already hinted by @jukkar, I'd suggest that that issue is specific to mimxrt1050-evk.

@pfalcon
Contributor

pfalcon commented Oct 8, 2020

For reference, I retested everything with tag v2.4.0 (rev 7a3b253). The picture is the same as reported above: (dumb_http_server, big_http_download) x (qemu_x86, frdm_k64f) work as expected, with 100K requests tested for dumb_http_server and ~100MB downloaded for big_http_download.

I then retested the https://github.com/pfalcon/zephyr/tree/dumb_http_server-100k branch rebased onto v2.4.0, and the picture is also as reported above: with default settings it doesn't work, as discussed in #28587 (comment), and with CONFIG_NET_BUF_TX_COUNT=256 I tested ~40K requests successfully.

@github-actions

github-actions bot commented Dec 8, 2020

This issue has been marked as stale because it has been open (more than) 60 days with no activity. Remove the stale label or add a comment saying that you would like to have the label removed, otherwise this issue will automatically be closed in 14 days. Note that you can always re-open a closed issue at any time.

@github-actions github-actions bot added the Stale label Dec 8, 2020
@MaureenHelm MaureenHelm removed the Stale label Dec 8, 2020
@carlescufi
Member

@jukkar did you have a chance to look at this one? According to @pfalcon, this was still an issue in 2.4.0.

@jukkar
Member

jukkar commented Jan 26, 2021

I fixed a lot of things in TCP2 that helped with these issues. But I still occasionally saw issues the last time I ran the sample app against ab (Apache Bench). Also, if the system runs out of network buffers (which easily happens with HTTP benchmark tools), it becomes very fragile, so it is difficult to say exactly which issue is causing the problem.

@jukkar jukkar added priority: low Low impact/importance bug and removed priority: medium Medium impact/importance bug labels Jan 26, 2021
@carlescufi
Member

Closing since, according to @jukkar, this is potentially fixed in master now. @armandciejak, please reopen if you can reproduce this with Zephyr 2.5.x or later.
