Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Mekhanik evgenii/1196 #17

Merged
merged 9 commits into from
Jun 27, 2024
Merged
71 changes: 8 additions & 63 deletions include/linux/skbuff.h
Original file line number Diff line number Diff line change
Expand Up @@ -731,12 +731,10 @@ struct sk_buff {
*/
unsigned long dev_scratch;
#ifdef CONFIG_SECURITY_TEMPESTA
struct {
__u8 present : 1;
__u8 tls_type : 7;
__u16 flags : 16;
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still need flags to notify when skb is transmitted, e.g. tempesta-tech/tempesta#2108.
Of course, on_tcp_entail can do this as well, but it is not currently used for this purpose, at least not for all control frames.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should implement additional on_send_ function in tempesta for 2108, not use flags in skb. Moreover we remove additional callback from kernel code, so 2108 should be reworked

Copy link

@kingluo kingluo Jun 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, it can't be in on_send callback, it only indicates entry into the write queue of sk, but the counter decrement must happen at the time of actual transmission, because only at that point in time the skb is moved to the TCP protocol transmission process, i.e. tcp_write_xmit or its callers, e.g. __tcp_push_pending_frames, and not accumulated in kernel memory for some reason, such as congestion or zero TCP window of the malicious peer. This is exactly the meaning of this counter, to avoid control frame flooding.

That is, for 2108, the solution is either to keep the counter decrement in sk_write_xmit, and the tfw_cb.flag is still needed, or to manipulate the counter directly in ss_skb_on_tcp_entail. No additional callbacks needed here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ss_skb_on_tcp_entail is called before pushing skb to socket write queue. It is a good place to decrement counter.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Applications which works in user space and counted control frames don't know nothing about __tcp_push_pending_frames and there is no problem.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can postpone such frames if it is really necessary (and send them only when we have enough window) and count them in ss_skb_on_tcp_entail. I think that we move tls_encrypt to Tempesta code and there is no sense to have one more callback in linux kernel only to count control frames. User space applications like nginx doesn't know anything about tcp_transmit_skb. How they can count control frames?

Copy link

@kingluo kingluo Jun 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

User space applications like nginx doesn't know anything about tcp_transmit_skb. How they can count control frames?

@EvgeniiMekhanik @krizhanovsky @const-t

I have to update some thoughts.

No, the user program relies on sockbuf to determine whether the skb send is in-flight. If sockbuf is full, the send() system call will fail because the peer does not provide enough window, or in non-blocking fd, the write event is shelved, then the counter will accumulate, and then the user program will know that it is in a flood attack.

(Moreover, the user program can use TCP_INFO getsockopt to retrieve the TCP information to get known the in-flight bytes. In fact, Nginx uses it to define some variables, although not for our purpose exactly.)

However, in our case, it's a completely different story. tempesta does not use sockbuf, but directly enqueues to sk's sk_write_queue, so it will have memory risks unless we know that skb has started to be transmitted, that is, in-flight. How to know this fact? The only place is that tcp_transmit_skb successfully handles skb. So this is why I suggest that we should apply the counting logic there. Neither on_send (called in ss_tx_action) nor entail (corresponding to the original prepare_xmit), or even write_xmit callback, can meet our needs to mitigate the flood of control frame attacks. This is not a theory, but has been verified through testing.

And the method suggested by @EvgeniiMekhanik does not work either, because similar to entail, the skbs pushed when the window is available does not mean that all skbs can be sent immediately, because it does not know how large the network capacity is and how many skbs can be sent, so it's likely that some skbs will be hold in the memory for an undetermined time.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We call ss_skb_on_tcp_entail just before we push skb to socket write queue. For HEADERS and DATA frames we calculate tcp window in sk_fill_write_queue and don't call ss_skb_tcp_entail->ss_skb_on_tcp_entail if we have no enough window. To solve problem with control frames we should not call ss_skb_tcp_entail->ss_skb_on_tcp_entail for all other type of frames if there is no enough tcp window. We should push them to the queue of postponed frames and push them to socket write queue only when we have enough TCP window. In this case we can counted control frames in ss_skb_on_tcp_entail

Copy link

@kingluo kingluo Jun 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair enough. This is sufficient if we do not care about the strict accuracy of whether the skb was successfully sent by tcp_transimit_skb or was even eliminated from tcp_rtx_queue.

Copy link
Contributor

@krizhanovsky krizhanovsky Jun 25, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that the flags are replaced with the on_send callback in #1973, so it's not desired to keep flags, especially with only one user. Meantine, I also don't like the idea of additional queueing since to wasted memory decreasing performance and opens opportunities for DoS attacks.

However, in first place, I didn't get what do we fix with #2108, so I requested changes. Maybe I'm wrong and let's discuss the fix, but I proposed to control and limit flows to client in tempesta-tech/tempesta#498 (comment)

P.S. @kingluo @EvgeniiMekhanik thank you for inviting me to the discussion

unsigned int cb;
} tfw_cb;
struct {
__u8 present : 1;
__u8 tls_type : 7;
} tfw_cb;
#endif
};
};
Expand Down Expand Up @@ -806,7 +804,7 @@ struct sk_buff {
__u8 active_extensions;
#endif
#ifdef CONFIG_SECURITY_TEMPESTA
__u8 tail_lock:1;
__u8 tail_lock:1;
#endif
/* fields enclosed in headers_start/headers_end are copied
* using a single memcpy() in __copy_skb_header()
Expand Down Expand Up @@ -951,27 +949,6 @@ struct sk_buff {
#define SKB_ALLOC_NAPI 0x04

#ifdef CONFIG_SECURITY_TEMPESTA
enum {
/* This skb contains start of http2 frame. */
SS_F_HTTP2_FRAME_START = 0x01,
/* This skb contains new hpack dynamic table size. */
SS_F_HTTT2_HPACK_TBL_SZ_ENCODED = 0x02,
/* This skb contains headers frame. */
SS_F_HTTT2_FRAME_HEADERS = 0x04,
/* This skb contains data frame. */
SS_F_HTTT2_FRAME_DATA = 0x08,
/* This skb was already prepared. */
SS_F_HTTP2_FRAME_PREPARED = 0x10,
/* This skb acks new hpack dynamic tbl size. */
SS_F_HTTP2_ACK_FOR_HPACK_TBL_RESIZING = 0x20,
/*
* These flags should be cleared when we copy flags
* from one skb to another one.
*/
TEMPESTA_SKB_FLAG_CLEAR_MASK = SS_F_HTTP2_ACK_FOR_HPACK_TBL_RESIZING |
SS_F_HTTT2_HPACK_TBL_SZ_ENCODED |
SS_F_HTTP2_FRAME_START,
};

static inline unsigned long
skb_tfw_is_present(struct sk_buff *skb)
Expand All @@ -982,9 +959,9 @@ skb_tfw_is_present(struct sk_buff *skb)
static inline void
skb_set_tfw_tls_type(struct sk_buff *skb, unsigned char tls_type)
{
BUG_ON(tls_type > 0x7F);
skb->tfw_cb.present = 1;
skb->tfw_cb.tls_type = tls_type;
BUG_ON(tls_type > 0x7F);
skb->tfw_cb.present = 1;
skb->tfw_cb.tls_type = tls_type;
}

static inline unsigned char
Expand All @@ -993,38 +970,6 @@ skb_tfw_tls_type(struct sk_buff *skb)
return skb->tfw_cb.present ? skb->tfw_cb.tls_type : 0;
}

static inline void
skb_set_tfw_flags(struct sk_buff *skb, unsigned short flags)
{
skb->tfw_cb.present = 1;
skb->tfw_cb.flags |= flags;
}

static inline void
skb_clear_tfw_flag(struct sk_buff *skb, unsigned short flag)
{
skb->tfw_cb.flags &= ~flag;
}

static inline unsigned short
skb_tfw_flags(struct sk_buff *skb)
{
return skb->tfw_cb.present ? skb->tfw_cb.flags : 0;
}

static inline void
skb_set_tfw_cb(struct sk_buff *skb, unsigned int cb)
{
skb->tfw_cb.present = 1;
skb->tfw_cb.cb = cb;
}

static inline unsigned int
skb_tfw_cb(struct sk_buff *skb)
{
return skb->tfw_cb.present ? skb->tfw_cb.cb : 0;
}

static inline void
skb_copy_tfw_cb(struct sk_buff *dst, struct sk_buff *src)
{
Expand Down
27 changes: 20 additions & 7 deletions include/net/sock.h
Original file line number Diff line number Diff line change
Expand Up @@ -507,16 +507,28 @@ struct sock {
void (*sk_data_ready)(struct sock *sk);
void (*sk_write_space)(struct sock *sk);
#ifdef CONFIG_SECURITY_TEMPESTA
int (*sk_prepare_xmit)(struct sock *sk,
struct sk_buff *skb,
unsigned int mss_now,
unsigned int *limit,
unsigned int *skbs);
/*
* Tempesta FW callback to ecrypt one
* or more skb in socket write queue
* before sending.
*/
int (*sk_write_xmit)(struct sock *sk,
struct sk_buff *skb,
unsigned int mss_now,
unsigned int limit,
unsigned int skbs);
unsigned int limit);
/*
* Tempesta FW callback to prepare and push
* skbs from Tempesta FW private scheduler
* to socket write queue according sender
* and receiver window.
*/
int (*sk_fill_write_queue)(struct sock *sk,
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should document this breaking change, explaining why sk_prepare_xmit is deprecated, and why the new sk_fill_write_queue requires some somewhat complicated limit calculations. If not, maybe the author himself will forget about it after a few weeks. It's better to document it in the code as a comment, because tempesta-tech/tempesta#1196 only discusses some theoretical design (which is even outdated after the code rewrite), not the actual implementation.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, it could make sense to document both the two callback, when they are called and what they do

Copy link

@kingluo kingluo Jun 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Personally, I don't quite understand why we moved frames from prepare_xmit to tcp_push_pending_frames, which is the biggest change in this PR. In prepare_xmit, we already know that TCP has enough window to send data, but in tcp_push_pending_frames, we have to calculate it ourselves, which seems to be duplication of work. Could you clarify this further in comments?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I will add comment before merge PR. There were several problems with the previous approach:

  1. We push response skb directly to socket write queue. If skb belongs to stream, which is blocked by exceeding HTTP2 stream window, we can't send this skb to the client. This skb blockes all other skb in socket write queue, until window update will come.
  2. We can't prioritize streams. We push responses to socket write queue directly, so there is no prioritization. Now we push response skbs to our private scheduler queue. Then we call tcp_push_pending_frames where we run our scheduler to choose the most priority stream and send data according TCP window. Also tcp_push_pending_frames called from tcp_data_snd_check when we receive data from the peer and TCP window opened again.

unsigned int mss_now,
int ss_action);
/*
* Tempesta FW callback to free all private
* resources associated with socket.
*/
void (*sk_destroy_cb)(struct sock *sk);
#endif
void (*sk_error_report)(struct sock *sk);
Expand Down Expand Up @@ -876,6 +888,7 @@ enum sock_flags {
SOCK_TSTAMP_NEW, /* Indicates 64 bit timestamps always */
#ifdef CONFIG_SECURITY_TEMPESTA
SOCK_TEMPESTA, /* The socket is managed by Tempesta FW */
SOCK_TEMPESTA_HAS_DATA /* The socket has data in Tempesta FW write queue */
#endif
};

Expand Down
42 changes: 40 additions & 2 deletions include/net/tcp.h
Original file line number Diff line number Diff line change
Expand Up @@ -585,8 +585,6 @@ enum tcp_queue {
TCP_FRAG_IN_WRITE_QUEUE,
TCP_FRAG_IN_RTX_QUEUE,
};
int tso_fragment(struct sock *sk, struct sk_buff *skb, unsigned int len,
unsigned int mss_now, gfp_t gfp);
int tcp_fragment(struct sock *sk, enum tcp_queue tcp_queue,
struct sk_buff *skb, u32 len,
unsigned int mss_now, gfp_t gfp);
Expand Down Expand Up @@ -1877,11 +1875,51 @@ static inline void tcp_rtx_queue_unlink_and_free(struct sk_buff *skb, struct soc
sk_wmem_free_skb(sk, skb);
}

#ifdef CONFIG_SECURITY_TEMPESTA
/**
* This function is similar to `tcp_write_err` except that we send
* TCP RST to remote peer. We call this function when an error occurs
* while sending data from which we cannot recover, so we close the
* connection with TCP RST.
*/
static inline void
tcp_tfw_handle_error(struct sock *sk, int error)
{
tcp_send_active_reset(sk, GFP_ATOMIC);
sk->sk_err = error;
sk->sk_error_report(sk);
tcp_write_queue_purge(sk);
tcp_done(sk);
}
#endif

static inline void tcp_push_pending_frames(struct sock *sk)
{
#ifdef CONFIG_SECURITY_TEMPESTA
unsigned int mss_now = 0;

if (sock_flag(sk, SOCK_TEMPESTA_HAS_DATA)
&& sk->sk_fill_write_queue)
{
int result;

mss_now = tcp_current_mss(sk);
result = sk->sk_fill_write_queue(sk, mss_now, 0);
if (unlikely(result < 0 && result != -ENOMEM)) {
tcp_tfw_handle_error(sk, result);
return;
}
}
#endif
if (tcp_send_head(sk)) {
struct tcp_sock *tp = tcp_sk(sk);

#ifdef CONFIG_SECURITY_TEMPESTA
if (mss_now != 0) {
int nonagle = TCP_NAGLE_OFF | TCP_NAGLE_PUSH;
__tcp_push_pending_frames(sk, mss_now, nonagle);
} else
#endif
__tcp_push_pending_frames(sk, tcp_current_mss(sk), tp->nonagle);
}
}
Expand Down
Loading