Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Crash in APM #752

Closed
krizhanovsky opened this issue Jun 16, 2017 · 2 comments
Closed

Crash in APM #752

krizhanovsky opened this issue Jun 16, 2017 · 2 comments
Assignees
Milestone

Comments

@krizhanovsky
Copy link
Contributor

From #692 (comment) . The problem seems happen at https://github.com/tempesta-tech/tempesta/blob/master/tempesta_fw/apm.c#L223 , i.e. cnt isn't incremented in the loop above.

I got the oops at the below using workload ./wrk -t 16 -c 4096 -d 3600s http://centos:80/ in VM which also runs Apache HTTPD with connection reset each 100 requests.

    [ 4512.704257] divide error: 0000 [#1] SMP
    [ 4512.705113] Modules linked in: tfw_sched_rr(O) tfw_sched_http(O) tfw_sched_hash(O) tempesta_fw(O) tempesta_db(O) tempesta_tls(O) nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack ipt_REJECT nf_reject_ipv4 tun bridge stp llc ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter ip_vs nf_conntrack dm_mirror dm_region_hash dm_log dm_mod xfs libcrc32c ppdev ghash_clmulni_intel aesni_intel aes_x86_64 glue_helper lrw gf128mul ablk_helper cryptd input_leds led_class pcspkr i2c_piix4 i2c_core parport_pc parport acpi_cpufreq uinput ip_tables btrfs xor raid6_pq ata_generic pata_acpi crc32c_intel serio_raw e1000 ata_piix floppy ipv6 crc_ccitt autofs4
    [ 4512.711182] CPU: 0 PID: 12929 Comm: httpd Tainted: G           O    4.8.15-tfw #41
    [ 4512.711182] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.9.1-1.fc24 04/01/2014
    [ 4512.711182] task: ffff8800357cc500 task.stack: ffff880023b1c000
    [ 4512.711182] RIP: 0010:[<ffffffffa068aea7>]  [<ffffffffa068aea7>] __tfw_stats_adjust+0x67/0x250 [tempesta_fw]
    [ 4512.711182] RSP: 0018:ffff88007fc03898  EFLAGS: 00010246
    [ 4512.711182] RAX: 0000000000000000 RBX: 0000000000000040 RCX: ffff88005b964688
    [ 4512.711182] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff88005b964608
    [ 4512.735428] RBP: ffff88007fc038c8 R08: 0000000000000010 R09: 0000000000000000
    [ 4512.735428] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
    [ 4512.735428] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
    [ 4512.735428] FS:  00007f07c8d3e880(0000) GS:ffff88007fc00000(0000) knlGS:0000000000000000
    [ 4512.735428] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [ 4512.735428] CR2: 00007f9a699ee640 CR3: 0000000026e40000 CR4: 00000000003406f0
    [ 4512.735428] Stack:
    [ 4512.735428]  0000000000000000 ffff88005b964608 0000000000000625 ffff88005b964008
    [ 4512.735428]  0000000000000000 0000000000000000 ffff88007fc038f8 ffffffffa068bddf
    [ 4512.735428]  ffff88001f78c608 ffff8800216ce020 ffff8800265da020 ffffffffa0693660
    [ 4512.735428] Call Trace:
    [ 4512.735428]  <IRQ> 
    [ 4512.735428]  [<ffffffffa068bddf>] tfw_apm_update+0x2cf/0x3c0 [tempesta_fw]
    [ 4512.735428]  [<ffffffffa0693660>] ? tfw_http_send_500.isra.28+0xd0/0xd0 [tempesta_fw]
    [ 4512.735428]  [<ffffffffa06936b2>] tfw_http_resp_cache_cb+0x52/0x80 [tempesta_fw]
    [ 4512.735428]  [<ffffffffa068e157>] tfw_cache_process+0x117/0x2a0 [tempesta_fw]
    [ 4512.735428]  [<ffffffff810f0ce4>] ? __getnstimeofday64+0x94/0x140
    [ 4512.735428]  [<ffffffffa0693e8d>] tfw_http_resp_cache+0x7d/0xc0 [tempesta_fw]
    [ 4512.735428]  [<ffffffffa06942bc>] tfw_http_msg_process+0x1fc/0x750 [tempesta_fw]
    [ 4512.735428]  [<ffffffff8154ab4d>] ? __alloc_skb+0x4d/0x2a0
    [ 4512.735428]  [<ffffffffa069161c>] __gfsm_fsm_exec+0x5c/0x90 [tempesta_fw]
    [ 4512.735428]  [<ffffffffa0691844>] tfw_gfsm_dispatch+0x14/0x20 [tempesta_fw]
    [ 4512.735428]  [<ffffffffa0690e3d>] tfw_connection_recv+0x1d/0x20 [tempesta_fw]
    [ 4512.735428]  [<ffffffffa06a84fc>] ss_tcp_process_data+0x17c/0x360 [tempesta_fw]
    [ 4512.735428]  [<ffffffffa06a8730>] ss_tcp_data_ready+0x50/0x90 [tempesta_fw]
    [ 4512.735428]  [<ffffffff815c68a6>] tcp_data_queue+0x6e6/0xd60
    [ 4512.735428]  [<ffffffff815c70fe>] tcp_rcv_established+0x1de/0x710
    [ 4512.735428]  [<ffffffff815d2bdc>] tcp_v4_do_rcv+0x10c/0x210
    [ 4512.735428]  [<ffffffff815d41bb>] tcp_v4_rcv+0xadb/0xcd0
    [ 4512.735428]  [<ffffffff815aa6b1>] ? ip_local_deliver_finish+0x41/0x3b0
    [ 4512.735428]  [<ffffffff815aa7a3>] ip_local_deliver_finish+0x133/0x3b0
    [ 4512.735428]  [<ffffffff815aa6b1>] ? ip_local_deliver_finish+0x41/0x3b0
    [ 4512.735428]  [<ffffffff815ab2d0>] ip_local_deliver+0xc0/0xd0
    [ 4512.735428]  [<ffffffff815aa670>] ? inet_del_offload+0x40/0x40
    [ 4512.735428]  [<ffffffff815aabcf>] ip_rcv_finish+0x1af/0x610
    [ 4512.735428]  [<ffffffff815ab5c9>] ip_rcv+0x2e9/0x410
    [ 4512.735428]  [<ffffffff815aaa20>] ? ip_local_deliver_finish+0x3b0/0x3b0
    [ 4512.735428]  [<ffffffff8155dd85>] __netif_receive_skb_core+0x3a5/0xb40
    [ 4512.735428]  [<ffffffff8155e660>] ? process_backlog+0xe0/0x230
    [ 4512.735428]  [<ffffffff8155e53d>] __netif_receive_skb+0x1d/0x60
    [ 4512.735428]  [<ffffffff8155e5f8>] process_backlog+0x78/0x230
    [ 4512.735428]  [<ffffffff8155e660>] ? process_backlog+0xe0/0x230
    [ 4512.735428]  [<ffffffff815607ce>] net_rx_action+0x14e/0x440
    [ 4512.735428]  [<ffffffff81653ca8>] __do_softirq+0xc8/0x473
    [ 4512.735428]  [<ffffffff815aeed7>] ? ip_finish_output2+0x217/0x5c0
    [ 4512.735428]  [<ffffffff8102066e>] __tempesta_do_softirq_fpusafe+0xe/0x10
    [ 4512.735428]  [<ffffffff816521ec>] do_softirq_own_stack+0x1c/0x30
    [ 4512.735428]  <EOI> 
    [ 4512.735428]  [<ffffffff8107065e>] do_softirq+0x4e/0x60
    [ 4512.735428]  [<ffffffff81070718>] __local_bh_enable_ip+0xa8/0xb0
    [ 4512.735428]  [<ffffffff815aef00>] ip_finish_output2+0x240/0x5c0
    [ 4512.735428]  [<ffffffff815afd46>] ? ip_finish_output+0x1c6/0x320
    [ 4512.735428]  [<ffffffff815afd46>] ip_finish_output+0x1c6/0x320
    [ 4512.735428]  [<ffffffff815afea9>] NF_HOOK_COND.part.33.constprop.47+0x9/0x10
    [ 4512.735428]  [<ffffffff815b1563>] ip_output+0xc3/0x110
    [ 4512.735428]  [<ffffffff815afb80>] ? ip_fragment.constprop.49+0x80/0x80
    [ 4512.735428]  [<ffffffff815b0a19>] ip_local_out+0x39/0x70
    [ 4512.735428]  [<ffffffff815b0e4b>] ip_queue_xmit+0x1eb/0x5a0
    [ 4512.735428]  [<ffffffff815b0c65>] ? ip_queue_xmit+0x5/0x5a0
    [ 4512.735428]  [<ffffffff815ca9f4>] tcp_transmit_skb+0x4a4/0x8b0
    [ 4512.735428]  [<ffffffff815caf61>] tcp_write_xmit+0x161/0xef0
    [ 4512.735428]  [<ffffffff815cbfd1>] __tcp_push_pending_frames+0x31/0xd0
    [ 4512.735428]  [<ffffffff815bd16b>] do_tcp_setsockopt.isra.37+0x6ab/0x860
    [ 4512.735428]  [<ffffffff815bd34c>] tcp_setsockopt+0x2c/0x30
    [ 4512.735428]  [<ffffffff815401a4>] sock_common_setsockopt+0x14/0x20
    [ 4512.735428]  [<ffffffff8153ef11>] SyS_setsockopt+0x71/0xc0
    [ 4512.735428]  [<ffffffff810029c8>] do_syscall_64+0x58/0x110
    [ 4512.735428]  [<ffffffff816513da>] entry_SYSCALL64_slow_path+0x25/0x25
    [ 4512.735428] Code: 48 63 11 49 83 c3 01 48 01 d0 4c 63 09 4d 39 d1 76 06 4c 63 11 4d 89 c5 49 83 c0 01 48 83 c1 04 49 83 f8 10 75 d2 48 01 c0 31 d2 <49> f7 f3 4c 39 d0 72 0f 48 83 c4 08 5b 41 5c 41 5d 41 5e 41 5f 
    [ 4512.735428] RIP  [<ffffffffa068aea7>] __tfw_stats_adjust+0x67/0x250 [tempesta_fw]
    [ 4512.735428]  RSP <ffff88007fc03898>
    [ 4512.852777] ---[ end trace 630a16268dd4e293 ]---
    [ 4512.854608] Kernel panic - not syncing: Fatal exception in interrupt
@krizhanovsky
Copy link
Contributor Author

Got somewhat different crash in APM after running

    # ulimit -n 32768; ./wrk -t 16 -c 4096 -d 3600s http://centos:80/

I used Linux patch https://github.com/tempesta-tech/linux-4.8.15-tfw/commit/9cea1ec0156145217ac6320c40189b40b8160780 and #692 -mitigating patch

--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -3191,6 +3191,12 @@ static int tcp_clean_rtx_queue(struct sock *sk, int prior_fackets,
                if (!fully_acked)
                        break;
 
+               if (!skb->next) {
+                       pr_err("AK_DBG peer=%x:%x state=%u sk_refcnt=%d flags=%lx\n",
+                               sk->sk_daddr, sk->sk_dport, sk->sk_state,
+                               atomic_read(&sk->sk_refcnt), sk->sk_flags);
+                       continue;
+               }
                tcp_unlink_write_queue(skb, sk);
                sk_wmem_free_skb(sk, skb);
                if (unlikely(skb == tp->retransmit_skb_hint))

Tempesta master commit b60cc3d) is used. The crash trace is

    [ 4307.593595] NMI watchdog: BUG: soft lockup - CPU#1 stuck for 44s! [swapper/1:0]
    [ 4307.593595] Modules linked in: tfw_sched_ratio(O) tfw_sched_http(O) tfw_sched_hash(O) tempesta_fw(O) tempesta_db(O) tempesta_tls(O) nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack ipt_REJECT nf_reject_ipv4 tun bridge stp llc ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter ip_vs nf_conntrack dm_mirror dm_region_hash dm_log dm_mod ghash_clmulni_intel aesni_intel aes_x86_64 glue_helper lrw xfs libcrc32c ppdev gf128mul ablk_helper cryptd input_leds led_class pcspkr parport_pc parport i2c_piix4 i2c_core acpi_cpufreq uinput ip_tables btrfs xor raid6_pq ata_generic pata_acpi crc32c_intel serio_raw e1000 ata_piix floppy ipv6 crc_ccitt autofs4
    [ 4307.593595] irq event stamp: 8232141
    [ 4307.593595] hardirqs last  enabled at (8232140): [<ffffffff81651d92>] restore_regs_and_iret+0x0/0x1d
    [ 4307.593595] hardirqs last disabled at (8232141): [<ffffffff81652ce4>] apic_timer_interrupt+0x84/0x90
    [ 4307.593595] softirqs last  enabled at (8229960): [<ffffffff8106f631>] _local_bh_enable+0x21/0x50
    [ 4307.593595] softirqs last disabled at (8229961): [<ffffffff81070828>] irq_exit+0x98/0xb0
    [ 4307.593595] CPU: 1 PID: 0 Comm: swapper/1 Tainted: G           O    4.8.15-tfw #45
    [ 4307.593595] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.9.1-1.fc24 04/01/2014
    [ 4307.593595] task: ffff88005c8e1700 task.stack: ffff88005c8e8000
    [ 4307.593595] RIP: 0010:[<ffffffffa068fb82>]  [<ffffffffa068fb82>] tfw_apm_prcntl_tmfn+0x7a2/0x970 [tempesta_fw]
    [ 4307.593595] RSP: 0018:ffff88007fd03dc0  EFLAGS: 00000216
    [ 4307.593595] RAX: fc25fb3500000004 RBX: ffff88005946d600 RCX: 000000009b2945f3
    [ 4307.593595] RDX: 000000000000ff52 RSI: 000000000000ff52 RDI: 00000000fc9dfb35
    [ 4307.593595] RBP: ffff88007fd03e30 R08: 000000009b2945f4 R09: 00000000fc25fb35
    [ 4307.593595] R10: 000000000000fb35 R11: 0000000000000001 R12: ffff88005a3cbd90
    [ 4307.593595] R13: 0000000000000000 R14: 0000000000000015 R15: ffffe8ffffd04930
    [ 4307.593595] FS:  0000000000000000(0000) GS:ffff88007fd00000(0000) knlGS:0000000000000000
    [ 4307.593595] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [ 4307.593595] CR2: 00007f99b61de000 CR3: 00000000220d2000 CR4: 00000000003406e0
    [ 4307.593595] Stack:
    [ 4307.593595]  ffff88007fd03e30 ffff88005946d000 000000015c8e1700 ffff88005946d180
    [ 4307.593595]  ffffffff810be849 ffff88005c8e1700 0000000000000246 0000000000000101
    [ 4307.593595]  ffff88007fd03e20 ffff88005946d110 ffff88005946d110 0000000000000101
    [ 4307.593595] Call Trace:
    [ 4307.593595]  <IRQ> 
    [ 4307.593595]  [<ffffffff810be849>] ? __lock_is_held+0x49/0x70
    [ 4307.593595]  [<ffffffffa068f3e0>] ? tfw_apm_stats+0x90/0x90 [tempesta_fw]
    [ 4307.593595]  [<ffffffff810e677e>] call_timer_fn+0x9e/0x310
    [ 4307.593595]  [<ffffffff810e66e5>] ? call_timer_fn+0x5/0x310
    [ 4307.593595]  [<ffffffffa068f3e0>] ? tfw_apm_stats+0x90/0x90 [tempesta_fw]
    [ 4307.593595]  [<ffffffff810e76b9>] run_timer_softirq+0x1c9/0x620
    [ 4307.593595]  [<ffffffff810f07a6>] ? ktime_get+0x96/0x120
    [ 4307.593595]  [<ffffffff810f7454>] ? clockevents_program_event+0x44/0x110
    [ 4307.593595]  [<ffffffff81653d68>] __do_softirq+0xc8/0x473
    [ 4307.593595]  [<ffffffff81070828>] irq_exit+0x98/0xb0
    [ 4307.593595]  [<ffffffff81653a22>] smp_apic_timer_interrupt+0x42/0x50
    [ 4307.593595]  [<ffffffff81652ce9>] apic_timer_interrupt+0x89/0x90
    [ 4307.593595]  <EOI> 
    [ 4307.593595]  [<ffffffff81028085>] ? default_idle+0x25/0x180
    [ 4307.593595]  [<ffffffff81028083>] ? default_idle+0x23/0x180
    [ 4307.593595]  [<ffffffff810289ef>] arch_cpu_idle+0xf/0x20
    [ 4307.593595]  [<ffffffff810b5563>] default_idle_call+0x33/0x50
    [ 4307.593595]  [<ffffffff810b57ba>] cpu_startup_entry+0x23a/0x420
    [ 4307.593595]  [<ffffffff8104311b>] start_secondary+0x10b/0x130
    [ 4307.593595] Code: 05 da cd e0 48 8b 45 98 f0 ff 80 00 01 00 00 4c 89 e7 e8 f2 16 fc e0 e9 e1 fa ff ff 48 89 c7 8d 48 01 48 c1 ef 20 49 89 f9 eb 03 <44> 89 c1 bf 0f 00 00 00 44 8d 41 01 d3 e7 44 01 cf 44 0f b7 d7 
    [ 4307.593595] Kernel panic - not syncing: softlockup: hung tasks
    [ 4307.593595] CPU: 1 PID: 0 Comm: swapper/1 Tainted: G           O L  4.8.15-tfw #45
    [ 4307.593595] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.9.1-1.fc24 04/01/2014
    [ 4307.593595]  0000000000000000 ffff88007fd03b78 ffffffff8135ecbe ffff88005c8e1700
    [ 4307.593595]  ffffffff819dcb0e ffff88007fd03bf0 ffffffff8117783e 0000000000000008
    [ 4307.593595]  ffff88007fd03c00 ffff88007fd03ba0 0000000000000001 0000000000000000
    [ 4307.593595] Call Trace:
    [ 4307.593595]  <IRQ>  [<ffffffff8135ecbe>] dump_stack+0x67/0x99
    [ 4307.593595]  [<ffffffff8117783e>] panic+0xda/0x220
    [ 4307.593595]  [<ffffffff81131100>] watchdog_timer_fn+0x1d0/0x1d0
    [ 4307.593595]  [<ffffffff81130f30>] ? watchdog+0x40/0x40
    [ 4307.593595]  [<ffffffff810e9ba0>] __hrtimer_run_queues+0x100/0x500
    [ 4307.593595]  [<ffffffff810ea38b>] hrtimer_interrupt+0xab/0x1b0
    [ 4307.593595]  [<ffffffff81044ae5>] local_apic_timer_interrupt+0x35/0x60
    [ 4307.593595]  [<ffffffff81653a1d>] smp_apic_timer_interrupt+0x3d/0x50
    [ 4307.593595]  [<ffffffff81652ce9>] apic_timer_interrupt+0x89/0x90
    [ 4307.593595]  [<ffffffffa068fb82>] ? tfw_apm_prcntl_tmfn+0x7a2/0x970 [tempesta_fw]
    [ 4307.593595]  [<ffffffff810be849>] ? __lock_is_held+0x49/0x70
    [ 4307.593595]  [<ffffffffa068f3e0>] ? tfw_apm_stats+0x90/0x90 [tempesta_fw]
    [ 4307.593595]  [<ffffffff810e677e>] call_timer_fn+0x9e/0x310
    [ 4307.593595]  [<ffffffff810e66e5>] ? call_timer_fn+0x5/0x310
    [ 4307.593595]  [<ffffffffa068f3e0>] ? tfw_apm_stats+0x90/0x90 [tempesta_fw]
    [ 4307.593595]  [<ffffffff810e76b9>] run_timer_softirq+0x1c9/0x620
    [ 4307.593595]  [<ffffffff810f07a6>] ? ktime_get+0x96/0x120
    [ 4307.593595]  [<ffffffff810f7454>] ? clockevents_program_event+0x44/0x110
    [ 4307.593595]  [<ffffffff81653d68>] __do_softirq+0xc8/0x473
    [ 4307.593595]  [<ffffffff81070828>] irq_exit+0x98/0xb0
    [ 4307.593595]  [<ffffffff81653a22>] smp_apic_timer_interrupt+0x42/0x50
    [ 4307.593595]  [<ffffffff81652ce9>] apic_timer_interrupt+0x89/0x90
    [ 4307.593595]  <EOI>  [<ffffffff81028085>] ? default_idle+0x25/0x180
    [ 4307.593595]  [<ffffffff81028083>] ? default_idle+0x23/0x180
    [ 4307.593595]  [<ffffffff810289ef>] arch_cpu_idle+0xf/0x20
    [ 4307.593595]  [<ffffffff810b5563>] default_idle_call+0x33/0x50
    [ 4307.593595]  [<ffffffff810b57ba>] cpu_startup_entry+0x23a/0x420
    [ 4307.593595]  [<ffffffff8104311b>] start_secondary+0x10b/0x130

The oops corresponds to following code in APM:

   # nm tempesta_fw.ko|grep tfw_apm_prcntl_tmfn
   00000000000013e0 t tfw_apm_prcntl_tmfn

   # addr2line -e tempesta_fw.ko -i `perl -e 'printf("%x", 0x13e0+0x7a2)'`
   /root/tempesta/tempesta_fw/apm.c:180
   /root/tempesta/tempesta_fw/apm.c:353
   /root/tempesta/tempesta_fw/apm.c:924

@keshonok
Copy link
Contributor

APM has been reworked in #753. Bugs that caused the crashes were eliminated.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants