
Data corruption while serving large files via HTTP with TCP2 #28587

Closed
armandciejak opened this issue Sep 22, 2020 · 15 comments
Assignees
Labels
area: Networking bug The issue is a bug, or the PR is fixing a bug priority: low Low impact/importance bug

Comments

@armandciejak
Contributor

Describe the bug
While testing TCP2 with the dumb_http_server sample application on the MIMXRT1050_EVK board, I regularly receive a corrupted file.

To Reproduce
Steps to reproduce the behavior:

  1. extract http_download_test.zip
  2. apply test_http_download.patch
  3. west build -b mimxrt1050_evk zephyr/samples/net/sockets/dumb_http_server
  4. west flash
  5. ping the board from a terminal: ping -A 192.168.1.241
  6. run the test script to download the file via HTTP and validate it (size and md5sum): ./test_http_download.sh
  7. when a problem occurs, the script stops and the downloaded data is left in the file "test"

Expected behavior
Received files should not be corrupted.

Impact
Since we need to serve "large" files via HTTP, this problem prevents us from delivering stable firmware to our customers.

Logs and console output
Diffs between the original file and the received files:

--- samples/net/sockets/dumb_http_server/src/response_100k.html.bin	2020-09-22 08:36:41.050513782 +0200
+++ test_corrupted6	2020-09-22 08:52:32.679979057 +0200
@@ -1,6 +1,3 @@
-HTTP/1.0 200 OK
-Content-Type: text/html; charset=utf-8
-
 <html>
 <head>
 <style>
@@ -468,7 +465,7 @@
 b2YgZmlmdHkgcGVyY2VudCAoNTAlKSBvciBtb3JlIG9mIHRoZQogICAgICBvdXRzdGFuZGluZyBz
 aGFyZXMsIG9yIChpaWkpIGJlbmVmaWNpYWwgb3duZXJzaGlwIG9mIHN1Y2ggZW50aXR5LgoKICAg
 ICAgIllvdSIgKG9yICJZb3VyIikgc2hhbGwgbWVhbiBhbiBpbmRpdmlkdWFsIG9yIExlZ2FsIEVu
-dGl0eQogICAgICBleGVyY2lzaW5nIHBlcm1pc3Npb25zIGdyYW50ZWQgYnkgdGhpcyBMaWNlbnNl
+dGlceedgICAgICBleGVyY2lzaW5nIHBlcm1pc3Npb25zIGdyYW50ZWQgYnkgdGhpcyBMaWNlbnNl
 LgoKICAgICAgIlNvdXJjZSIgZm9ybSBzaGFsbCBtZWFuIHRoZSBwcmVmZXJyZWQgZm9ybSBmb3Ig
 bWFraW5nIG1vZGlmaWNhdGlvbnMsCiAgICAgIGluY2x1ZGluZyBidXQgbm90IGxpbWl0ZWQgdG8g
 c29mdHdhcmUgc291cmNlIGNvZGUsIGRvY3VtZW50YXRpb24KICAgICAgc291cmNlLCBhbmQgY29u
--- samples/net/sockets/dumb_http_server/src/response_100k.html.bin	2020-09-22 08:36:41.050513782 +0200
+++ test_corrupted7	2020-09-22 09:09:52.446353114 +0200
@@ -1,6 +1,3 @@
-HTTP/1.0 200 OK
-Content-Type: text/html; charset=utf-8
-
 <html>
 <head>
 <style>
@@ -508,7 +505,7 @@
 IGNvbnNwaWN1b3VzbHkgbWFya2VkIG9yIG90aGVyd2lzZQogICAgICBkZXNpZ25hdGVkIGluIHdy
 aXRpbmcgYnkgdGhlIGNvcHlyaWdodCBvd25lciBhcyAiTm90IGEgQ29udHJpYnV0aW9uLiIKCiAg
 ICAgICJDb250cmlidXRvciIgc2hhbGwgbWVhbiBMaWNlbnNvciBhbmQgYW55IGluZGl2aWR1YWwg
-b3IgTGVnYWwgRW50aXR5CiAgICAgIG9uIGJlaGFsZiBvZiB3aG9tIGEgQ29udHJpYnV0aW9uIGhh
+b3IW4gYSBsYgRW50aXR5CiAgICAgIG9uIGJlaGFsZiBvZiB3aG9tIGEgQ29udHJpYnV0aW9uIGhh
 cyBiZWVuIHJlY2VpdmVkIGJ5IExpY2Vuc29yIGFuZAogICAgICBzdWJzZXF1ZW50bHkgaW5jb3Jw
 b3JhdGVkIHdpdGhpbiB0aGUgV29yay4KCiAgIDIuIEdyYW50IG9mIENvcHlyaWdodCBMaWNlbnNl
 LiBTdWJqZWN0IHRvIHRoZSB0ZXJtcyBhbmQgY29uZGl0aW9ucyBvZgogICAgICB0aGlzIExpY2Vu
--- samples/net/sockets/dumb_http_server/src/response_100k.html.bin	2020-09-22 08:36:41.050513782 +0200
+++ test_corrupted8	2020-09-22 09:19:59.896412894 +0200
@@ -1,6 +1,3 @@
-HTTP/1.0 200 OK
-Content-Type: text/html; charset=utf-8
-
 <html>
 <head>
 <style>
@@ -635,7 +632,7 @@
 b3duZXIKICAgICAgb3IgYnkgYW4gaW5kaXZpZHVhbCBvciBMZWdhbCBFbnRpdHkgYXV0aG9yaXpl
 ZCB0byBzdWJtaXQgb24gYmVoYWxmIG9mCiAgICAgIHRoZSBjb3B5cmlnaHQgb3duZXIuIEZvciB0
 aGUgcHVycG9zZXMgb2YgdGhpcyBkZWZpbml0aW9uLCAic3VibWl0dGVkIgogICAgICBtZWFucyBh
-bnkgZm9ybSBvZiBlbGVjdHJvbmljLCB2ZXJiYWwsIG9yIHdyaXR0ZW4gY29tbXVuaWNhdGlvbiBz
+bnkgZm9ybSBvZiBlbGVjdHJvbmljLCB2ZXJiYWwsIG9yIHdyaXR0ZW4gYSBvdXVuaWNhdGlvbiBz
 ZW50CiAgICAgIHRvIHRoZSBMaWNlbnNvciBvciBpdHMgcmVwcmVzZW50YXRpdmVzLCBpbmNsdWRp
 bmcgYnV0IG5vdCBsaW1pdGVkIHRvCiAgICAgIGNvbW11bmljYXRpb24gb24gZWxlY3Ryb25pYyBt
 YWlsaW5nIGxpc3RzLCBzb3VyY2UgY29kZSBjb250cm9sIHN5c3RlbXMsCiAgICAgIGFuZCBpc3N1
--- samples/net/sockets/dumb_http_server/src/response_100k.html.bin	2020-09-22 08:36:41.050513782 +0200
+++ test_corrupted9	2020-09-22 09:33:27.500521826 +0200
@@ -1,6 +1,3 @@
-HTTP/1.0 200 OK
-Content-Type: text/html; charset=utf-8
-
 <html>
 <head>
 <style>
@@ -959,7 +956,8 @@
 U291cmNlIGZvcm0gb2YgYW55IERlcml2YXRpdmUgV29ya3MKICAgICAgICAgIHRoYXQgWW91IGRp
 c3RyaWJ1dGUsIGFsbCBjb3B5cmlnaHQsIHBhdGVudCwgdHJhZGVtYXJrLCBhbmQKICAgICAgICAg
 IGF0dHJpYnV0aW9uIG5vdGljZXMgZnJvbSB0aGUgU291cmNlIGZvcm0gb2YgdGhlIFdvcmssCiAg
-ICAgICAgICBleGNsdWRpbmcgdGhvc2Ugbm90aWNlcyB0aGF0IGRvIG5vdCBwZXJ0YWluIHRvIGFu
+ICAgICAgICBleGNsdWRpbmcgdGhvc2Ugbm90aoYXQgcmVtYWluCiAgICAg
+IXJ0YWluIHRvIGFu
 eSBwYXJ0IG9mCiAgICAgICAgICB0aGUgRGVyaXZhdGl2ZSBXb3JrczsgYW5kCgogICAgICAoZCkg
 SWYgdGhlIFdvcmsgaW5jbHVkZXMgYSAiTk9USUNFIiB0ZXh0IGZpbGUgYXMgcGFydCBvZiBpdHMK
 ICAgICAgICAgIGRpc3RyaWJ1dGlvbiwgdGhlbiBhbnkgRGVyaXZhdGl2ZSBXb3JrcyB0aGF0IFlv

Environment (please complete the following information):

  • OS: Kubuntu 20.04
  • Toolchain: Zephyr SDK 0.11.4, west 0.7.3
  • Zephyr: v2.4.0-rc2
@pfalcon
Contributor

pfalcon commented Sep 22, 2020

extract http_download_test.zip
apply test_http_download.patch

I'd suggest developing a separate sample for such test scenarios, with a configurable download size and content generated programmatically as a pseudo-random sequence (e.g. a hash of an incrementing value).

Overall comments on this repro case: it mixes too many variables at once: a) it uses mimxrt1050_evk; b) it even patches the sample; c) it patches network stack parameters.

And the issue can be a result of any of the changes above. The baseline for testing should be qemu_x86 with standard (SLIP) networking (which is a hard case, btw) and default Zephyr config options. Does that work or not? Only after answering that question can we start varying parameters, one by one. (Well, the monte-carlo way of testing, betting on zero, may also work with enough effort, of course ;-).)

@jukkar
Member

jukkar commented Sep 22, 2020

I could replicate this; it took 500 HTTP requests in my test run to see the issue. The difference between the files seems to be quite small: in your examples from 4 to 24 bytes, and in my test it was 24 bytes.

@jukkar
Member

jukkar commented Sep 22, 2020

@armandciejak Please try PR #28595; it has a locking fix that might help here. At least my device is running fine and has so far served 2500 requests without any issues.

@jukkar
Member

jukkar commented Sep 22, 2020

I have been running the HTTP curl script attached to this issue and seen the following symptoms:

mimxrt1050_evk

nucleo_f767zi

sam-e70

  • no issues seen, even after 12000 HTTP requests
  • disappointing and fluctuating transfer rate, from 100 kB/s to 300 kB/s, though typically around 150 kB/s

frdm-k64f

  • transferred 5600 HTTP requests, no issues found
  • frdm-k64f had the fastest transfer rate of the devices I tried, a constant 590 kB/s, even though it is not the fastest device in this set

All of the tests above were run with #28595 applied.

The issue might be in the NXP mcux Ethernet driver; at least it is the only platform where this data mismatch has been seen in my tests. On the other hand, both mimxrt1050_evk and frdm-k64f use the same Ethernet driver, yet I did not see issues on frdm-k64f.

@pfalcon
Contributor

pfalcon commented Sep 22, 2020

@jukkar:

On the other hand both the mimxrt1050-evk and frdm-k64f are using the same Ethernet driver but I did not see issues in frdm-k64f.

But before that, you wrote:

All of the tests above were with #28595 applied.

So, do you mean that before testing frdm_k64f with #28595 applied, as described above, you also tested this board without the patch and didn't see any issues either?

@jukkar
Member

jukkar commented Sep 23, 2020

So, do you mean that before testing frdm_k64f with #28595 applied, as described above, you also tested this board without patch, and didn't see any issues either?

No, I did not use frdm_k64f in my earlier tests.

@pfalcon
Contributor

pfalcon commented Sep 23, 2020

@jukkar:

No, I did not use frdm_k64f in my earlier tests.

Thanks for clarifying. Do you have all the code for the test somewhere in a branch you could push to GitHub? I'd like to run it on frdm_k64f (with and without #28595), but my concern is reproducibility: starting from @armandciejak's instructions in the description of this ticket, each of us doing things manually will introduce various deviations, and in the end we'll be testing slightly different things. It would be nice to have a single setup to start from. Thanks.

@jukkar
Member

jukkar commented Sep 23, 2020

Do you have all the code for the test somewhere in a branch

I used the http_download_test.zip file attached at the beginning of this issue and modified the IP addresses to suit my network setup.

@jukkar
Member

jukkar commented Sep 23, 2020

In case it is useful: I modified the test_http_download.sh script so that I can run multiple instances against different hosts at the same time.

$ cat test_http_download.sh
#!/bin/sh

# Usage: ./test_http_download.sh <server-ip>
#SERVER='192.168.178.122:8080'
SERVER="$1:8080"
EXPECTED_FILE_SIZE=108222
FILE="test.$SERVER"

echo "d969441601d473db911f28dae62101b7 $FILE" > "MD5SUM.$FILE"

while true; do
    # download the file; stop on any curl error
    curl "http://$SERVER" --max-time 10 --output "$FILE" || exit $?
    FILE_SIZE=$(stat --printf=%s "$FILE")
    if [ "$FILE_SIZE" -ne "$EXPECTED_FILE_SIZE" ]; then
        echo "wrong size $FILE, expected $EXPECTED_FILE_SIZE, got $FILE_SIZE"
        exit 1
    fi
    # verify the content as well, not just the size
    md5sum -c "MD5SUM.$FILE" || exit $?
done

pfalcon added a commit to pfalcon/zephyr that referenced this issue Sep 26, 2020
@pfalcon
Contributor

pfalcon commented Sep 26, 2020

Ok, so I tried to do testing from my side. I took the http_download_test.zip attached to this ticket, removed the unrelated changes (which literally leaves it replacing response_big.html.bin with a symlink to response_100k.html.bin), and pushed it as the https://github.com/pfalcon/zephyr/tree/dumb_http_server-100k branch for reproducibility. I used test_http_download.sh from http_download_test.zip for testing (with trivial modifications, like printing the number of the current iteration).

I started by testing against revision 9bbbf17, which is by now a bit dated, but: a) I did extensive testing against that revision with the pristine dumb_http_server/big_http_download; b) I'd like to triage the intermediate changes, to avoid possible regressions.

  1. So, with the setup as in https://github.com/pfalcon/zephyr/tree/dumb_http_server-100k, i.e. response_100k.html.bin being served while the rest of the options are per the original prj.conf, it is very easy to reproduce an issue of truncated output with curl, on both qemu_x86 and frdm_k64f. This is the crux of the issue; more on this below.

  2. Then, the (seemingly) only relevant override option from http_download_test.zip is CONFIG_NET_BUF_TX_COUNT=256. This option masks the issue of receiving an incomplete file: with it applied, I was able to run 40000 iterations of the test_http_download.sh script on both qemu_x86 and frdm_k64f.

So, regarding test scenario 1 above. What I see is that send() may fail with ENOMEM (errno 12). Well, it shouldn't do that so easily. In the default socket configuration (no non-blocking mode, no socket timeout set), send() should keep trying to send at least some data out (forever). The socket stack may return an error like ENOMEM only if it's sure that no further progress will ever be made, e.g. if a deadlock is detected. In that case, the socket should be marked as being in an "irrecoverable error state" (remember, the stack decided that the situation is irrecoverable; if it were recoverable, it should just keep trying). An "irrecoverable error state" on a socket should in turn lead to sending a TCP RST to the peer - preferably ASAP, if not "soon", but definitely "eventually".

Here we hit on the concept of "transient" vs "irrecoverable" POSIX errors. The only transient error is EAGAIN: it means the current operation didn't succeed, but the user can retry later. Any other error is effectively fatal - the user should end the lifetime of the current object (the socket, in this case) and may retry with a new instance. But the user application relies on the system to handle that lifecycle properly, including any associated error states. In the case of a TCP socket, that includes sending an RST packet to the peer.

What we have currently is that the user application receives ENOMEM from send(), expectedly treats it as a fatal error, and closes the socket. But the IP stack doesn't send an RST to the peer - it sends a normal FIN - so the peer has received incomplete data, but has no indication that it's invalid/incomplete.

To repeat, returning EAGAIN would be another option, but it should not be returned unsolicited; it is enabled by the O_NONBLOCK or SO_SNDTIMEO options. So, there are only two choices in the current situation:

  1. Don't return ENOMEM; patiently keep trying to send the data.
  2. If a guaranteed "no progress" condition is detected, make sure an RST will be sent on the socket, and return ENOMEM.

I wasn't able to reproduce the data corruption (i.e. changed bytes) issue in my testing so far (with qemu_x86/frdm_k64f), so as already hinted by @jukkar, I'd suggest that that issue is specific to mimxrt1050-evk.

@pfalcon
Contributor

pfalcon commented Oct 8, 2020

For reference, I retested everything with tag v2.4.0 (rev 7a3b253). The picture is the same as reported above: (dumb_http_server, big_http_download) x (qemu_x86, frdm_k64f) work as expected, with 100K requests tested for dumb_http_server and ~100MB downloaded for big_http_download.

I then retested the https://github.com/pfalcon/zephyr/tree/dumb_http_server-100k branch rebased onto v2.4.0, and the picture is also as reported above: with default settings it doesn't work, as discussed in #28587 (comment), and with CONFIG_NET_BUF_TX_COUNT=256 I tested ~40K requests successfully.

@github-actions

github-actions bot commented Dec 8, 2020

This issue has been marked as stale because it has been open (more than) 60 days with no activity. Remove the stale label or add a comment saying that you would like to have the label removed, otherwise this issue will automatically be closed in 14 days. Note that you can always re-open a closed issue at any time.

@github-actions github-actions bot added the Stale label Dec 8, 2020
@MaureenHelm MaureenHelm removed the Stale label Dec 8, 2020
@carlescufi
Member

@jukkar did you have a chance to look at this one? According to @pfalcon, this was still an issue in 2.4.0.

@jukkar
Member

jukkar commented Jan 26, 2021

I fixed a lot of things in TCP2 that helped with these issues. But I still occasionally saw issues the last time I ran the sample app against ab (Apache Bench). Also, if the system runs out of network buffers (which easily happens with HTTP benchmark tools), it becomes very fragile, so it is difficult to say exactly which issue is causing the problem.

@jukkar jukkar added priority: low Low impact/importance bug and removed priority: medium Medium impact/importance bug labels Jan 26, 2021
@carlescufi
Member

Closing since, according to @jukkar, this is potentially fixed in master now. @armandciejak, please reopen if you can reproduce this with Zephyr 2.5.x or later.
