
Fix #76 #886 #908 #911, many other fixes and optimizations #916

Merged
13 commits merged into master on Mar 3, 2018

Conversation

krizhanovsky (Contributor):

Please read the commit descriptions.

   all the network and scheduler locks as well as RCU calls _bh.
2. Use RCU barriers to wait until all memory is freed on configuration
   reloading before trying to allocate memory for new server groups.
3. Call the RCU barrier once in a while to avoid massive CPU utilization
   by RCU callbacks (freeing throttling);
4. Move __tfw_sg_release_all_reconfig() to the unit test where it's
   supposed to be.

The system still hangs with a ksoftirqd soft lockup on reloading 100K
server groups. It seems not all locks are fixed by _bh...
…ntexts;

2. Use GFP_KERNEL for configuration (process) context;
3. Introduce tfw_srv_loop_sched_rcu() to let RCU make progress with a long
   process context on !CONFIG_PREEMPT kernels;
4. Use a semaphore in configuration (process) context instead of sg_lock;
5. Remove sg->lock since it's only needed for configuration and procfs
   (process) context, so there is no need to care about concurrency, nor
   to spend more memory for each group. Also, a sleepable context with
   tfw_srv_loop_sched_rcu() is highly desirable for RCU progress
   (see Documentation/RCU/stallwarn.txt);
procfs directory. With a huge number of files procfs becomes almost
unusable, but this is a choice for the system administrator, or a TODO
for us to build a filesystem tree.
@krizhanovsky krizhanovsky changed the title Fix #76 #908 #911, many other fixes and optimizations Fix #76 #886 #908 #911, many other fixes and optimizations Feb 20, 2018
* to be forwarded. The main effort is put into servicing requests
* that are on time. Unlucky requests are just given another chance
* with minimal effort.
* That's the intended behaviour. Suche rescheduled requests are unlucky
Contributor:

Suche -> Such

@@ -414,10 +414,10 @@ tfw_http_conn_msg_free(TfwHttpMsg *hm)
static inline void
tfw_http_conn_req_clean(TfwHttpReq *req)
{
-	spin_lock(&((TfwCliConn *)req->conn)->seq_qlock);
+	spin_lock_bh(&((TfwCliConn *)req->conn)->seq_qlock);
Contributor:

It seems there is no necessity to insert *_bh spin locks into code that is called from SoftIRQ context only (tfw_http_conn_req_clean() and tfw_http_resp_fwd()): SoftIRQ handlers are serialized on the same CPU, so a simple spin_lock without disabling bottom halves should be enough in these cases.

Contributor:

Release of server connections may happen in user context; tfw_http_conn_release() also acquires the lock and works with seq_list, which is why _bh() is used here.

@aleksostapenko (Contributor) Feb 27, 2018:

Yes, but I didn't find any call from user context (e.g. for tfw_http_conn_req_clean()).
On the other hand, it seems that using *_bh spin locks in SoftIRQ context affects neither performance nor lock contention, so this is just a discussion note.

Contributor Author:

Example of process context calls: tfw_cfgop_update_sg_srv_list() -> tfw_sock_srv_start_srv() -> tfw_sock_srv_connect_srv() -> tfw_sock_srv_connect_try_later() -> tfw_connection_repair() -> conn_repair = tfw_http_conn_repair() -> tfw_http_conn_shrink_fwdq() -> tfw_http_req_zap_error() -> tfw_http_error_resp_switch() -> tfw_http_send_resp() -> tfw_http_resp_build_error() -> tfw_http_conn_req_clean().

Contributor:

Yes, I was wrong here; I missed this path.
Thanks.

@@ -1136,12 +1136,12 @@ ss_sock_create(int family, int type, int protocol, struct sock **res)
struct sock *sk = NULL;
const struct net_proto_family *pf;

-	rcu_read_lock();
+	rcu_read_lock_bh();
@aleksostapenko (Contributor) Feb 23, 2018:

I don't quite understand this replacement: there seem to be no spinlocks here (with the possibility of a user/softirq deadlock) that would make it necessary to disable softirqs; besides, synchronize_rcu() (not the *_bh flavor) is used in the kernel with the net_families array.

Contributor Author:

Basically the reason is the same as for #916 (comment): ss_sock_create() can be called from process context at system start as well as from softirq context on a server reconnection attempt (timer context). Updated #736 (comment)

static LIST_HEAD(sg_list);
static LIST_HEAD(sg_list_reconfig);
-static DEFINE_RWLOCK(sg_lock);
+#define TFW_SG_HBITS 10
@aleksostapenko (Contributor) Feb 23, 2018:

Maybe this parameter should be configurable, for cases when the server group count is much smaller (or much larger) than 1K.

Contributor Author:

We spend 4 pages for both of the hash tables, so reducing the hash size won't give us a big memory win. There is no need to move the hash size to the configuration, since we're able to estimate the optimal hash size from the number of configured server groups. I have no idea whether it makes sense to optimize the data structure further to handle 100K and more server groups, since profiling results have shown timers to be the main bottleneck.


struct list_head srv_list;
rwlock_t lock;
Contributor:

Since the lock is removed, no server group or server timers may fire during the reconfiguration process. Just a note for #736.

@vankoven (Contributor):

Still reviewing: there is some issue in the reconfiguration code. Reconfiguration tests fail and crash Tempesta:

[   40.543629] [tdb] Start Tempesta DB
[   40.554990] net_ratelimit: 1 callbacks suppressed
[   40.556483] [tempesta] Initializing Tempesta FW kernel module...
[   40.564618] [tempesta] Registering new scheduler: hash
[   40.570745] [tempesta] Registering new scheduler: http
[   40.581123] [tempesta] Registering new scheduler: ratio
[   40.626207] [tempesta] Prepearing for the configuration processing.
[   40.628830] [tempesta] Configuration processing is completed.
[   40.639633] [tdb] Opened table /opt/tempesta/db/filter.tdb: size=16777216 rec_size=20 base=ffff94157c400000
[   40.762709] [tdb] Opened table /opt/tempesta/db/cache.tdb: size=268435456 rec_size=0 base=ffff94156c400000
[   40.765524] [tempesta] Open listen socket on: 0.0.0.0
[   40.780090] [tempesta] modules are started
[   50.519206] [tempesta] Live reconfiguration of Tempesta.
[   50.521350] [tempesta] Prepearing for the configuration processing.
[   50.523342] [tempesta] Configuration processing is completed.
[   50.525629] ------------[ cut here ]------------
[   50.527083] kernel BUG at /home/user/tempesta/tempesta/tempesta_fw/http.c:1330!
[   50.528735] invalid opcode: 0000 [#1] SMP
[   50.528735] Modules linked in: tfw_sched_ratio(O) tfw_sched_http(O) tfw_sched_hash(O) tempesta_fw(O) tempesta_db(O) tempesta_tls(O) crct10dif_pclmul binfmt_misc crc32_pclmul snd_hda_codec_generic iTCO_wdt iTCO_vendor_support ghash_clmulni_intel ppdev snd_hda_intel snd_hda_codec snd_hda_core snd_hwdep snd_pcm evdev snd_timer snd lpc_ich serio_raw pcspkr soundcore mfd_core virtio_balloon sg virtio_console qxl parport_pc parport ttm drm_kms_helper button drm shpchp ip_tables x_tables autofs4 ext4 crc16 jbd2 crc32c_generic fscrypto ecb mbcache sr_mod cdrom virtio_blk virtio_net crc32c_intel ahci libahci aesni_intel i2c_i801 i2c_smbus aes_x86_64 glue_helper lrw gf128mul ablk_helper cryptd psmouse uhci_hcd ehci_pci ehci_hcd libata usbcore usb_common sym53c8xx scsi_transport_spi scsi_mod virtio_pci virtio_ring virtio [last unloaded: tempesta_tls]
[   50.534804] CPU: 0 PID: 1453 Comm: sysctl Tainted: G           O    4.9.0-tempesta-amd64 #1 Debian 4.9.35-tfw4-1
[   50.534804] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.11.0-20171110_100015-anatol 04/01/2014
[   50.534804] task: ffff94155b449d00 task.stack: ffffb0c1816c8000
[   50.534804] RIP: 0010:[<ffffffffc085bd23>]  [<ffffffffc085bd23>] tfw_http_req_destruct+0x53/0x60 [tempesta_fw]
[   50.534804] RSP: 0018:ffff94157fc03c30  EFLAGS: 00010206
[   50.534804] RAX: ffff9415597e63f0 RBX: ffff94155aeba020 RCX: 000000010010000a
[   50.534804] RDX: ffff94155aeba3f0 RSI: ffffd99fc1667c40 RDI: ffff94155aeba020
[   50.534804] RBP: ffff94157fc03c50 R08: ffff9415599f1200 R09: 000000010010000a
[   50.534804] R10: ffff9415599f1400 R11: 2ce33e6c02ce3300 R12: ffff9415599b5678
[   50.534804] R13: ffff94157fc03c50 R14: ffff9415599b572c R15: ffff94157fc03c50
[   50.534804] FS:  00007f4d1a61b880(0000) GS:ffff94157fc00000(0000) knlGS:0000000000000000
[   50.534804] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   50.534804] CR2: 00007f4d19d15d00 CR3: 000000005b8fb000 CR4: 00000000003406f0
[   50.534804] Stack:
[   50.534804]  ffff9415599b5678 ffffffffc085f579 ffff94155aeba020 ffffffffc085be92
[   50.534804]  ffff94157fc03c50 ffff94157fc03c50 0000000059645e70 212fd317b59b8a3f
[   50.534804]  0000000700000024 ffff9415599f0020 ffff94155aeba020 ffff94157fc1a540
[   50.534804] Call Trace:
[   50.534804]  <IRQ> [   50.534804]  [<ffffffffc085f579>] ? tfw_http_msg_free+0x89/0xa0 [tempesta_fw]
[   50.534804]  [<ffffffffc085be92>] ? tfw_http_resp_fwd+0x162/0x2c0 [tempesta_fw]
[   50.534804]  [<ffffffffc085c177>] ? tfw_http_send_resp+0x187/0x1a0 [tempesta_fw]
[   50.534804]  [<ffffffffc085cc2c>] ? tfw_http_conn_shrink_fwdq_resched+0x23c/0x2b0 [tempesta_fw]
[   50.534804]  [<ffffffffc085ceb4>] ? tfw_http_conn_release+0x214/0x240 [tempesta_fw]
[   50.534804]  [<ffffffffc0859334>] ? tfw_connection_release+0x24/0x50 [tempesta_fw]
[   50.534804]  [<ffffffffc087ed7e>] ? tfw_srv_conn_release+0xe/0x80 [tempesta_fw]
[   50.534804]  [<ffffffffc087bbcc>] ? ss_tx_action+0x38c/0x4b0 [tempesta_fw]
[   50.534804]  [<ffffffffb95048db>] ? net_tx_action+0x9b/0x1d0
[   50.534804]  [<ffffffffb9610d1a>] ? __do_softirq+0x10a/0x29a
[   50.534804]  [<ffffffffc087f090>] ? tfw_sock_srv_grace_shutdown_srv+0x110/0x110 [tempesta_fw]
[   50.534804]  [<ffffffffb960f16c>] ? do_softirq_own_stack+0x1c/0x30
[   50.534804]  <EOI> [   50.534804]  [<ffffffffb908056d>] ? do_softirq.part.18+0x3d/0x50
[   50.534804]  [<ffffffffb90805f8>] ? __local_bh_enable_ip+0x78/0x80
[   50.534804]  [<ffffffffc087ee73>] ? tfw_sock_srv_disconnect_srv+0x83/0xb0 [tempesta_fw]
[   50.534804]  [<ffffffffc087f07c>] ? tfw_sock_srv_grace_shutdown_srv+0xfc/0x110 [tempesta_fw]
[   50.534804]  [<ffffffffc087f0f7>] ? __tfw_cfgop_update_sg_srv_list+0x67/0x1a0 [tempesta_fw]
[   50.534804]  [<ffffffffc087f090>] ? tfw_sock_srv_grace_shutdown_srv+0x110/0x110 [tempesta_fw]
[   50.534804]  [<ffffffffc087f090>] ? tfw_sock_srv_grace_shutdown_srv+0x110/0x110 [tempesta_fw]
[   50.534804]  [<ffffffffc0879940>] ? __tfw_sg_for_each_srv+0x70/0xa0 [tempesta_fw]
[   50.534804]  [<ffffffffc087eac5>] ? tfw_sock_srv_start+0xf5/0x330 [tempesta_fw]
[   50.534804]  [<ffffffffc087784d>] ? tfw_ctlfn_state_io+0x36d/0x530 [tempesta_fw]
[   50.534804]  [<ffffffffb960b77e>] ? mutex_lock+0xe/0x30
[   50.534804]  [<ffffffffc0877524>] ? tfw_ctlfn_state_io+0x44/0x530 [tempesta_fw]
[   50.534804]  [<ffffffffb9282b43>] ? proc_sys_call_handler+0xe3/0x100
[   50.534804]  [<ffffffffb9208820>] ? vfs_write+0xb0/0x190
[   50.534804]  [<ffffffffb9209c12>] ? SyS_write+0x52/0xc0
[   50.534804]  [<ffffffffb960e1fb>] ? system_call_fast_compare_end+0xc/0x9b
[   50.534804] Code: 03 00 00 48 8d 97 e0 03 00 00 48 39 c2 75 1e 48 8b bf 58 03 00 00 48 85 ff 74 09 48 83 c4 08 e9 74 b0 01 00 48 83 c4 08 c3 0f 0b <0f> 0b 0f 0b 66 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 41 57 41 
[   50.534804] RIP  [<ffffffffc085bd23>] tfw_http_req_destruct+0x53/0x60 [tempesta_fw]
[   50.534804]  RSP <ffff94157fc03c30>

The crash happened on reconf.test_stress_sched_hash.SchedHashCustomSg.test_hash_del_add_srvs. At the same time, I have no issues with the reconf tests on the current master.

@krizhanovsky (Contributor Author) commented Mar 2, 2018:

The crash #916 (comment) is due to the assertion BUG_ON(!list_empty(&req->fwd_list)). The root cause is that tfw_http_conn_shrink_fwdq_resched() calls tfw_http_req_err() with srv_conn == NULL, since it passes the whole srv_conn->fwd_queue to schq, so tfw_http_req_delist() isn't called for a request. This works fine if req is going to be added to the error queue, but we hit the crash if we try to send an error response immediately. The assertion failure is expected and harmless in general, but still must be fixed.

@krizhanovsky krizhanovsky merged commit bfe33c0 into master Mar 3, 2018
@krizhanovsky krizhanovsky deleted the ak-76 branch March 3, 2018 08:45