-
Notifications
You must be signed in to change notification settings - Fork 105
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Delete keep-alive timer on connection_drop hook #1429
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While the timer_pending()
check is unreliable, it turns out that we have much more serious problems with connections reference counting - need to explore and probably fix them.
msecs_to_jiffies((long)tfw_cli_cfg_ka_timeout * 1000)); | ||
|
||
if (timer_pending(&cli_conn->timer)) | ||
mod_timer(&cli_conn->timer, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Generally speaking, this check is unreliable: the timer can be deactivated right between the check and mod_timer()
. Why we ever call tfw_cli_conn_send()
on dropped connection? There are only 2 places calling the function:
-
tfw_http_resp_fwd()
->__tfw_http_resp_fwd()
- the former one doestfw_cli_conn_get()
, so the connection is alive; -
tfw_h2_resp_fwd()
- this one seems doesn't care about the client connection state.
The interesting thing is that there are only two places where connection reference counter is incremented: the 1st one in HTTP/1 forwarding and the 2nd one is tfw_http_conn_msg_alloc()
. Nobody calls tfw_srv_conn_get()
.
So it seems HTTP/2 is prone for the this and I think other similar referencing a dead connection bugs. Please test HTTP/2 code against #1428
There are two macros over tfw_connection_get()
, which are not needed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you please elaborate what exactly this lock protects against? What party can call this function concurrently? A proper comment here would not hurt.
In my current view, once connection accounting in HTTP/2 case is fixed - which it is in this PR, this lock is not needed. The SS layer is stopped first what Tempesta is unloaded, as I recall.
UPDATE: Looks like the lock is needed because the timer deletion was moved from release()
to drop()
. While release()
is called when there are no other users, there's no such luxury with drop()
and the connection may still be in use by lingering threads of execution. If that's so, then a good comment in the commit (it won't be easy to understand in the code) would be nice.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The timer fix is good for now, but we need to fix the connection reference counting issue. Need review from @ikoveshnikov since there are plenty of subtle issues with the reference counting.
tempesta_fw/sock_clnt.c
Outdated
|
||
spin_lock_bh(&((TfwCliConn *)conn)->timer_lock); | ||
del_timer_sync(&((TfwCliConn *)conn)->timer); | ||
spin_unlock_bh(&((TfwCliConn *)conn)->timer_lock); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I love moving the timer deletion here, the connection_drop
hook, which is called when we're going to terminate the connection.
tempesta_fw/sock_clnt.c
Outdated
mod_timer(&cli_conn->timer, | ||
jiffies + | ||
msecs_to_jiffies((long)tfw_cli_cfg_ka_timeout * 1000)); | ||
spin_unlock_bh(&cli_conn->timer_lock); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The keep alive timer lock is awkward, but I didn't find any better solution for now. Also it seems there is no much sense to put significant effort to eliminate the lock - this is just per connection lock after all.
However, the whole timers management is complex and has performance issues, so I added an appropriate comment in #736 (comment) .
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same here. I also tried to find another solution, e.g. for server connections we have a special value TFW_CONN_DEATHCNT
in reference counter and state TFW_CONN_B_DEL
in TfwSrvConn->flags
. But this doesn't solve unordered del_timer/mod_timer
calls, which is the problem root.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With @keshonok the PR looks good. I agree that comments are good to have. From my side I'd love to see comments for tfw_srv_conn_put()
calls that they are paired with tfw_srv_conn_get_if_live()
.
tempesta_fw/sock_clnt.c
Outdated
@@ -234,6 +243,11 @@ tfw_sock_clnt_drop(struct sock *sk) | |||
|
|||
T_DBG3("connection lost: close client socket: sk=%p, conn=%p, " | |||
"client=%p\n", sk, conn, conn->peer); | |||
|
|||
spin_lock_bh(&((TfwCliConn *)conn)->timer_lock); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The lock is acquired and released only in SoftIRQ context and can be relaxed to spin_lock()
(without bh
)
tempesta_fw/sock_clnt.c
Outdated
mod_timer(&cli_conn->timer, | ||
jiffies + | ||
msecs_to_jiffies((long)tfw_cli_cfg_ka_timeout * 1000)); | ||
spin_unlock_bh(&cli_conn->timer_lock); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same here. I also tried to find another solution, e.g. for server connections we have a special value TFW_CONN_DEATHCNT
in reference counter and state TFW_CONN_B_DEL
in TfwSrvConn->flags
. But this doesn't solve unordered del_timer/mod_timer
calls, which is the problem root.
tempesta_fw/connection.h
Outdated
@@ -380,7 +379,10 @@ tfw_connection_put(TfwConn *conn) | |||
conn->destructor(conn); | |||
} | |||
|
|||
#define tfw_cli_conn_put(c) tfw_connection_put((TfwConn *)(c)) | |||
/* | |||
* Paired with tfw_srv_conn_get_if_live() via tfw_http_get_srv_conn() or |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The comment is misplaced. Not the tfw_srv_conn_put()
function itself is paired with tfw_srv_conn_get_if_live()
. Instead commit must be placed on that place where the function is used.
#1428
The keep-alive timer may work after https://github.com/tempesta-tech/tempesta/blob/master/tempesta_fw/tls.c#L582, but before https://github.com/tempesta-tech/tempesta/blob /master/tempesta_fw/tls.c#L584 (where the timer is deleted) and we will catch the bug https://github.com/tempesta-tech/tempesta/blob/master/tls/ttls.c#L2242 (tfw_sock_cli_keepalive_timer_cb () tfw_connection_close () -> tfw_tls_conn_close () -> ttls_close_notify ()), because The tls context will be filled with zeros and tls-> conf == NULL after https://github.com/tempesta-tech/tempesta/blob/master/tempesta_fw/tls.c#L582