
Fix #984 #1155

Closed
wants to merge 2,588 commits
Conversation

krizhanovsky
Contributor

Actually there are several fixes around skb->truesize and splitting; please see the commit comments for details. Now the tests from #984 pass.

krizhanovsky and others added 30 commits June 25, 2018 14:18
Move DTLS common routines to a separate file;
TLS handshake cleanups
…erse

proxy and EC J-PAKE as experimental and non-mandatory.
Some FSM DSL defines are moved to lib/fsm.h; http_limit.c is ported to the new API.
Address #391.12: ss_skb_alloc() extended with an argument for headroom.
Many cleanups again.
Fix #1013: Avoid gcc-7 warning in conditional ternary operator.
However, linux asymmetric and elliptic crypto seems incomplete, so I leave
the old RSA and ECDH for now. Also note that, generally speaking, the cipher.[ch]
and md.[ch] wrappers are to be eliminated in the future; for now they just link
a bunch of mbedTLS code with linux/crypto.
…d even TCP segment;

Multiple handshakes FSM fixes.
Fix #1033: Change header name format in HTTP tables configuration.
Fix #772: Change how the 'keepalive' and 'client_body' timeouts are applied.
   The page is later used as an skb page fragment in the TLS handshake.
2. Cleanups in PEM decoder.
3. Remove DHM routines called only for FS IO (unused).
4. Revert DTLS routines (dirty code) as they are supposed to be used for QUIC.
Fix #900: Change some comments and add unit tests.
vankoven and others added 17 commits December 28, 2018 04:24
split skb as soon as http_parser returns TFW_PASS
Use standard bitops functions for message flags.
Fix code review comments in this patch, since there are significant
overlaps in skb offsets.

1. chop the skb before splitting in HTTP processing to chop all skbs;
2. the offset at the TLS layer comes from TCP seqno overlap, so don't
   account it as TLS overhead;
3. chop the skb's TCP overhead before inserting the skb into the list of
   TLS record skbs.
so tfw_tls_msg_process() must store skb_list for further chopping
in tfw_tls_chop_skb_rec(). Since ttls_recv() zeroes almost the full
IO, tfw_tls_chop_skb_rec() doesn't need to zero it.
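The overlap arithmetic behind point 2 above can be sketched as follows (a toy model with illustrative names, not the actual Tempesta FW functions; TCP sequence number wraparound is ignored):

```c
#include <stdint.h>

/*
 * When a TCP segment starts before the next expected sequence number,
 * the leading overlap repeats already-processed bytes.  It must be
 * chopped at the TCP level and must NOT be charged to the TLS record
 * overhead.
 */
static uint32_t
tcp_overlap(uint32_t seq, uint32_t expected)
{
	/* Sequence number wraparound is ignored in this toy model. */
	return seq < expected ? expected - seq : 0;
}

static uint32_t
tls_payload_len(uint32_t skb_len, uint32_t seq, uint32_t expected)
{
	/* Only the non-overlapping tail feeds the TLS record parser. */
	uint32_t off = tcp_overlap(seq, expected);

	return off < skb_len ? skb_len - off : 0;
}
```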
Tempesta TLS performance optimizations
fewer bytes than the TLS overhead, as well as allocate an extra skb.
1. @nsize was copy&pasted from tcp_fragment(), but the latter uses it only
   for the fast path with an skb w/o frags.

2. reserved_tailroom is in a union with mark, which we process separately, so the
   field isn't compatible with the current Tempesta code. Also, it's used on the
   egress path only, and we don't need it on the ingress path where ss_skb_split()
   is called.

3. GSO segmentation for the skb wasn't accounted for: added a couple of comments
   in the TLS code and initialized it for the split skb. (A later kernel patch
   will bring small logic for it as well.)
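The initialization in point 3 boils down to the standard ceil(len / mss) segment count TCP uses for GSO skbs. A minimal sketch (the helper name is illustrative, not a kernel API):

```c
/*
 * After a split, the new half's gso_segs counter must be set by hand,
 * since the generic split path doesn't do it.  TCP counts the segments
 * a GSO skb expands to as ceil(len / mss).
 */
static unsigned int
gso_segs_for(unsigned int len, unsigned int mss)
{
	return (len + mss - 1) / mss;	/* DIV_ROUND_UP(len, mss) */
}
```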

Some cleanups. -jN builds still sometimes fail on the libtdb/tdbq dependency
(see commit 1fc007d).
1. accurately fix skb->truesize and TCP write memory in the kernel
   with tcp_skb_unclone();

2. __split_pgfrag_del(): if we just move pointers, then we do not free
   TCP write memory, so do not change skb->truesize.

3. ss_skb_unroll(): truesize and data_len/len are completely different counters,
   so do not mix them in ss_skb_adjust_data_len(). By the way, during the tests
   I saw crazy skb overheads: truesize can be larger than len by tens of kilobytes.
   The explanation for such overheads is various fragment stealing (e.g. our
   __split_pgfrag_del) and cloning.

4. cleanup: move ss_skb coalescing functions closer to their calls.
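The accounting rule from points 1-2 can be illustrated with a toy model (an illustrative struct, not the kernel's sk_buff):

```c
/*
 * skb->truesize charges allocated memory to the socket; skb->len counts
 * payload bytes.  Advancing a fragment pointer drops payload but frees
 * nothing, so truesize must stay unchanged; only actually releasing a
 * page may lower it.
 */
struct toy_skb {
	unsigned int len;	/* payload bytes */
	unsigned int truesize;	/* allocated bytes charged to the socket */
};

/* Chop @n leading bytes of a fragment by advancing its offset. */
static void
frag_advance(struct toy_skb *skb, unsigned int n)
{
	skb->len -= n;
	/* No memory is freed here: truesize intentionally unchanged. */
}

/* Drop a whole fragment backed by a @page_sz page. */
static void
frag_free(struct toy_skb *skb, unsigned int frag_len, unsigned int page_sz)
{
	skb->len -= frag_len;
	skb->truesize -= page_sz;	/* the page really is returned */
}
```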
skb->truesize -= n;

/*
* Initialize GSO segments counter to let TCP set it accoring to
Contributor

accoring -> according


tfw_tls_tcp_add_overhead(sk, head_sz + tag_sz);
/*
* TLS record header is always allocated form the reserved skb headroom.
Contributor

form -> from

/*
* TLS record header is always allocated form the reserved skb headroom.
* The room for the tag may also be allocated from the reserved tailroom
* or in a new page frament in slb_tail or next, probably new, skb.
Contributor

frament -> fragment
slb_tail -> skb_tail

buff->truesize += nlen;
skb->truesize -= nlen;
n = skb->len - len;
buff->truesize += n;
Contributor
@aleksostapenko Jan 24, 2019

It seems that without nsize (or asize) we double the '(skb_headlen(skb) - len)' part in buff->truesize if 'len < skb_headlen(skb)': alloc_skb_fclone() -> __alloc_skb_init() sets 'skb->truesize = SKB_TRUESIZE(size)', and 'size == asize' in this case; so we have already included asize into skb->truesize during alloc_skb_fclone(), and we need to subtract it from the 'skb->len - len' value.
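The suspected double count can be checked with small numbers (a sketch with SKB_TRUESIZE() simplified to the raw allocation size; the helper is illustrative):

```c
/*
 * buff gets a head of size = skb_headlen(skb) - len bytes, already
 * charged to buff->truesize at allocation; adding the whole moved
 * payload (skb->len - len) then counts the linear part a second time.
 */
static unsigned int
split_buff_truesize(unsigned int headlen, unsigned int skb_len, unsigned int len)
{
	unsigned int size = headlen - len;	/* linear bytes moved to buff */
	unsigned int truesize = size;		/* SKB_TRUESIZE(size), simplified */

	truesize += skb_len - len;		/* the questioned 'buff->truesize += n' */
	return truesize;
}
```

With headlen = 500, skb->len = 2000, len = 300, the moved payload is 1700 bytes, but the function yields 1900: the extra 200 bytes are exactly headlen - len, counted twice.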

@@ -634,15 +634,17 @@ __split_pgfrag_del(struct sk_buff *skb_head, struct sk_buff *skb, int i, int off
if (likely(!off)) {
frag->page_offset += len;
skb_frag_size_sub(frag, len);
ss_skb_adjust_data_len(skb, -len);
skb->len -= len;
skb->data_len -= len;
Contributor
@aleksostapenko Jan 24, 2019

It's not quite clear why we do not adjust skb->truesize here when removing a part of a fragment, while in the case of full fragment deletion (in the previous block, above) we do adjust it.
E.g. in another, similar case, in ss_skb_split(), during splitting of the original skb we adjust skb->truesize, including the cases when the splitting boundary is located in the middle of a fragment of the original skb: this can be seen in skb_split() -> skb_split_no_header(), and we set skb->truesize = skb->len - len in ss_skb_split().

Contributor Author

The reason for the change is that skb->truesize must account for all the memory used by the skb. The accounted memory is required for socket memory accounting: how much memory is used by all the skbs in the socket. If we call skb_frag_size_sub(), then we just move the fragment pointer and don't free any memory, so the allocated memory for the skb remains the same.

Regarding skb_split() -> skb_split_no_header(), I'd say that the kernel is very inconsistent in accounting skb->truesize. If you print skb->len and skb->truesize for some benchmark, you can see a difference of 10-30KB for 40-60KB skbs due to clones and fragment games. See, for example, skb_gro_receive() for skb->head_frag != NULL: the skb head is stolen into head->frags, the head truesize is incremented, and the data will be processed as its frag, but the skb continues to account for the head data in truesize. See also pskb_expand_head(), which does not always update truesize.

If you run a test against a vanilla kernel with a small patch:

root@debian:~/linux-4.14.32-tfw# git log
commit 07d7b6fe5a8973b79b4e1bff2d9da889e3fd4379
Author: Alexander K <[email protected]>
Date:   Thu Apr 5 16:28:25 2018 +0300

    Vanilla Linux 4.14.32
root@debian:~/linux-4.14.32-tfw# git diff
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index cab4b935..87b7b9c1 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -1638,6 +1638,9 @@ int tcp_v4_rcv(struct sk_buff *skb)
        if (!pskb_may_pull(skb, sizeof(struct tcphdr)))
                goto discard_it;
 
+       if (skb->truesize > skb->len +2048)
+               pr_err("large truesize=%u > len=%u+2k\n", skb->truesize, skb->len);
+
        th = (const struct tcphdr *)skb->data;
 
        if (unlikely(th->doff < sizeof(struct tcphdr) / 4))

and run a benchmark involving heavy GRO, then you'll see truesize much larger than len. This is not the largest difference which I saw:

TCP: large truesize=19456 > len=14512+2k

@@ -297,6 +297,8 @@ tfw_tls_encrypt(struct sock *sk, struct sk_buff *skb, unsigned int limit)
* if there is no free frag slot in skb_tail, a new skb is allocated.
*/
next = skb_tail->next;
t_sz_curr = skb_tail->truesize;
t_sz_next = next != skb ? next->truesize : 0;
Contributor

It looks like the condition next != skb is always true here, since next can be either skb_tail->next (which is definitely not the skb itself) or sk->sk_write_queue if skb_tail is the last in the write queue.
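The observation can be checked on a toy circular queue (an illustrative struct, not the kernel's sk_write_queue):

```c
/*
 * In a circular write queue an element's next pointer is either another
 * element or the queue-head sentinel, never the element itself, so a
 * 'next != skb' check against an unrelated skb is always true.
 */
struct toy_node {
	struct toy_node *next;
};

/* Mirrors 't_sz_next = next != skb ? next->truesize : 0' structurally. */
static int
next_differs(struct toy_node *skb_tail, struct toy_node *skb)
{
	return skb_tail->next != skb;
}
```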

/*
* A new frag is added to the end of the current skb or
* begin of the next skb.
*/
Contributor

We pass skb->next or skb_tail->next as skb_head to ss_skb_expand_head_tail(); the code of the __extend_pgfrags() procedure,

if (nskb != skb_head && !skb_headlen(nskb)

in turn indicates that in this case a new skb will be allocated instead of inserting the frag into the next skb, and the comment before ss_skb_expand_head_tail() also confirms that.

6 participants