-
Notifications
You must be signed in to change notification settings - Fork 6.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Data corruption while serving large files via HTTP with TCP2 #28587
Comments
I'd suggest to develop a separate sample for such test scenarios, with configurable download size, and context generated programmatically as a pseudo-random sequence (e.g. a hash of incrementing value). Overall comments on this repro case: it mixes too much variables at once: a) it uses mimxrt1050_evk; b) but it even patches it; c) it patches network stack parameters. And the issue can be a result of any of the changes above. The baseline of the testing should be qemu_x86 + standard (SLIP) networking (which is hard case btw), and default Zephyr config options. Does that work or not? Only after answering that question, can start varying parameters, one by one. (Well, monte-carlo way of testing with betting on zero may also work with enough effort of course ;-) ). |
I could replicate this, it took 500 HTTP requests in my test run to see the issue. The difference of the files seems to be quite small, in your examples from 4 to 24 bytes, in my test it was 24 bytes. |
@armandciejak Please try the PR #28595, it has a locking fix that might help here. At least my device is running fine and so far has served 2500 reqs without any issues. |
I have been running the http curl script attached to this issue and seen following symptoms: mimxrt1050_evk
nucleo_f767zi
sam-e70
frdm-k64f
All of the tests above were with #28595 applied. The issue might be in NXP mcux Ethernet driver, at least it is the only platform where this data mismatch has been seen in my tests. On the other hand both the mimxrt1050-evk and frdm-k64f are using the same Ethernet driver but I did not see issues in frdm-k64f. |
But before that, you wrote:
So, do you mean that before testing frdm_k64f with #28595 applied, as described above, you also tested this board without patch, and didn't see any issues either? |
No, I did not use frdm_k64f in my earlier tests. |
Thanks fro clarifying. Do you have all the code for the test somewhere in a branch, could you push to github? I'd like to run it on frdm_k64f (with and without #28595), but my concern is reproducibility starting from @armandciejak's instructions in the description of this ticket - my concern that doing things manually, each of us will make various deviations, and in the end we'll be testing slightly different things. It would be nice to have single setup to start from. Thanks. |
I used the |
If it is useful, I modified the
|
Based on zephyrproject-rtos#28587. Signed-off-by: Paul Sokolovsky <[email protected]>
Ok, so I tried to do testing from my side. I took http_download_test.zip attached to this ticket, removed unrelated changes, which literally makes it replacing I started with testing against revision 9bbbf17, which is by now a bit dated, but: a) I did extensive testing against that revision with pristine dumb_http_server/big_http_download; b) I'd like to triage intermediate changes, to avoid possible regressions.
So, regarding test scenario 1 above. What I see is that send() may fail with ENOMEM (errno 12). Well, it shouldn't do that so easily. In the default socket configuration (no non-blocking socket, no socket timeout set), send() should keep trying to send at least some data out (forever). Socket stack may return an error like ENOMEM only if it's sure that no further progress will be ever made, e.g. if a deadlock is detected. In this case, the socket should be marked as having an "irrecoverable error state" (remember, the stack decided that the situation is irrecoverable, because if it's recoverable, it should just keep trying). "Irrecoverable error state" on a socket in turn should lead to sending TCP RST to the peer - preferrably ASAP, if not "soon", but definitely "eventually". Because here we hit on concept of "transient" vs "irrecoverable" POSIX errors. The only transient error is EAGAIN. It means that current operation didn't succeed, but user can retry later. Any other error is effectively fatal - user should finish lifetime of the current object (socket in this case), and may retry with a new instance. But user application relies on the system to handle lifecycle properly, including handling any associated error states. In the case of a TCP socket, that includes sending a RST packet to peer. Because what we have currently is that user application received ENOMEM from send(), expectedly treats that as a fatal error, and closes the socket. But the IP stack doesn't send RST to peer, but a normal FIN, so peer has incomplete data received, but no indication that it's invalid/incomplete. To repeat, returning EAGAIN would be another option, but it should not be returned unsolicited, it's enabled by O_NONBLOCK or SO_SNDTIMEO options. So, there're only 2 choices in the current situation:
I wasn't able to detect data corruption (i.e. changed bytes) issue in my testing so far (with qemu_x86/frdm_k64f), so as already hinted by @jukkar, I'd suggest that the issue is specific to mimxrt1050-evk. |
For reference, I retested everything with tag As then retested https://github.com/pfalcon/zephyr/tree/dumb_http_server-100k branch rebased to |
This issue has been marked as stale because it has been open (more than) 60 days with no activity. Remove the stale label or add a comment saying that you would like to have the label removed otherwise this issue will automatically be closed in 14 days. Note, that you can always re-open a closed issue at any time. |
I fixed lot of things in tcp2 that helped the issues. But I occacionally saw still issues last time when running the sample app against ab (apache bench) tester. Also if the system runs out of network buffers (like what easily happens with http benchmark tools), it becomes very fragile, so it is difficult to say exactly what issue is causing the problem. |
Closing since, according to @jukkar, this is potentially fixed in master now. @armandciejak please reopen if you reproduce this with Zephyr 2.5.x or later. |
Describe the bug
While testing TCP2 with the dumb_http_server sample application on the MIMXRT1050_EVK board I regularly receive corrupted file.
To Reproduce
Steps to reproduce the behavior:
test_http_download.patch
west build -b mimxrt1050_evk zephyr/samples/net/sockets/dumb_http_server
west flash
ping -A 192.168.1.241
./test_http_download.sh
test
Expected behavior
Receive non corrupted files.
Impact
Since we need to serve "large" files via HTTP this problem prevents us to deliver stable firmware to our customers.
Logs and console output
Diff between the original file and the received files.
Environment (please complete the following information):
The text was updated successfully, but these errors were encountered: