Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Delete keep-alive timer on connection_drop hook #1429

Merged
merged 3 commits into from
Aug 27, 2020
Merged

Delete keep-alive timer on connection_drop hook #1429

merged 3 commits into from
Aug 27, 2020

Conversation

avbelov23
Copy link
Contributor

#1428

The keep-alive timer may work after https://github.com/tempesta-tech/tempesta/blob/master/tempesta_fw/tls.c#L582, but before https://github.com/tempesta-tech/tempesta/blob /master/tempesta_fw/tls.c#L584 (where the timer is deleted) and we will catch the bug https://github.com/tempesta-tech/tempesta/blob/master/tls/ttls.c#L2242 (tfw_sock_cli_keepalive_timer_cb () tfw_connection_close () -> tfw_tls_conn_close () -> ttls_close_notify ()), because The tls context will be filled with zeros and tls-> conf == NULL after https://github.com/tempesta-tech/tempesta/blob/master/tempesta_fw/tls.c#L582

Copy link
Contributor

@krizhanovsky krizhanovsky left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While the timer_pending() check is unreliable, it turns out that we have much more serious problems with connections reference counting - need to explore and probably fix them.

msecs_to_jiffies((long)tfw_cli_cfg_ka_timeout * 1000));

if (timer_pending(&cli_conn->timer))
mod_timer(&cli_conn->timer,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally speaking, this check is unreliable: the timer can be deactivated right between the check and mod_timer(). Why we ever call tfw_cli_conn_send() on dropped connection? There are only 2 places calling the function:

  1. tfw_http_resp_fwd() -> __tfw_http_resp_fwd() - the former one does tfw_cli_conn_get(), so the connection is alive;

  2. tfw_h2_resp_fwd() - this one seems doesn't care about the client connection state.

The interesting thing is that there are only two places where connection reference counter is incremented: the 1st one in HTTP/1 forwarding and the 2nd one is tfw_http_conn_msg_alloc(). Nobody calls tfw_srv_conn_get().

So it seems HTTP/2 is prone for the this and I think other similar referencing a dead connection bugs. Please test HTTP/2 code against #1428

There are two macros over tfw_connection_get(), which are not needed.

Copy link
Contributor

@keshonok keshonok Aug 21, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please elaborate what exactly this lock protects against? What party can call this function concurrently? A proper comment here would not hurt.
In my current view, once connection accounting in HTTP/2 case is fixed - which it is in this PR, this lock is not needed. The SS layer is stopped first what Tempesta is unloaded, as I recall.

UPDATE: Looks like the lock is needed because the timer deletion was moved from release() to drop(). While release() is called when there are no other users, there's no such luxury with drop() and the connection may still be in use by lingering threads of execution. If that's so, then a good comment in the commit (it won't be easy to understand in the code) would be nice.

Copy link
Contributor

@krizhanovsky krizhanovsky left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The timer fix is good for now, but we need to fix the connection reference counting issue. Need review from @ikoveshnikov since there are plenty of subtle issues with the reference counting.


spin_lock_bh(&((TfwCliConn *)conn)->timer_lock);
del_timer_sync(&((TfwCliConn *)conn)->timer);
spin_unlock_bh(&((TfwCliConn *)conn)->timer_lock);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I love moving the timer deletion here, the connection_drop hook, which is called when we're going to terminate the connection.

mod_timer(&cli_conn->timer,
jiffies +
msecs_to_jiffies((long)tfw_cli_cfg_ka_timeout * 1000));
spin_unlock_bh(&cli_conn->timer_lock);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The keep alive timer lock is awkward, but I didn't find any better solution for now. Also it seems there is no much sense to put significant effort to eliminate the lock - this is just per connection lock after all.

However, the whole timers management is complex and has performance issues, so I added an appropriate comment in #736 (comment) .

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here. I also tried to find another solution, e.g. for server connections we have a special value TFW_CONN_DEATHCNT in reference counter and state TFW_CONN_B_DEL in TfwSrvConn->flags. But this doesn't solve unordered del_timer/mod_timer calls, which is the problem root.

Copy link
Contributor

@krizhanovsky krizhanovsky left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With @keshonok the PR looks good. I agree that comments are good to have. From my side I'd love to see comments for tfw_srv_conn_put() calls that they are paired with tfw_srv_conn_get_if_live().

@@ -234,6 +243,11 @@ tfw_sock_clnt_drop(struct sock *sk)

T_DBG3("connection lost: close client socket: sk=%p, conn=%p, "
"client=%p\n", sk, conn, conn->peer);

spin_lock_bh(&((TfwCliConn *)conn)->timer_lock);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The lock is acquired and released only in SoftIRQ context and can be relaxed to spin_lock() (without bh)

mod_timer(&cli_conn->timer,
jiffies +
msecs_to_jiffies((long)tfw_cli_cfg_ka_timeout * 1000));
spin_unlock_bh(&cli_conn->timer_lock);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here. I also tried to find another solution, e.g. for server connections we have a special value TFW_CONN_DEATHCNT in reference counter and state TFW_CONN_B_DEL in TfwSrvConn->flags. But this doesn't solve unordered del_timer/mod_timer calls, which is the problem root.

@@ -380,7 +379,10 @@ tfw_connection_put(TfwConn *conn)
conn->destructor(conn);
}

#define tfw_cli_conn_put(c) tfw_connection_put((TfwConn *)(c))
/*
* Paired with tfw_srv_conn_get_if_live() via tfw_http_get_srv_conn() or
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment is misplaced. Not the tfw_srv_conn_put() function itself is paired with tfw_srv_conn_get_if_live(). Instead commit must be placed on that place where the function is used.

@avbelov23 avbelov23 merged commit 571136e into master Aug 27, 2020
@avbelov23 avbelov23 deleted the avb-1404 branch August 27, 2020 14:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants