AVX2 memcpy(), memcmp() & Ko #635

Closed
krizhanovsky opened this issue Nov 4, 2016 · 2 comments

The Linux kernel uses simple x86 assembly for memcpy(), memcmp(), etc. Meanwhile, Tempesta FW and Tempesta DB copy small pieces of data with these routines. HTTP headers are one example: they are usually small, but some of them (like Cookie) can reach enormous size, and we have to copy them to be able to adjust them.

Actually with #634 in mind we should not copy HTTP headers, but we still need to copy some small data.
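To make the idea concrete, here is a minimal user-space sketch of a copy built on 32-byte AVX2 loads and stores. It is not the code from this issue: `copy_avx2()` and `small_memcpy()` are hypothetical names, the tail handling is deliberately naive, and the sketch assumes GCC or Clang on x86-64 (elsewhere it falls back to plain memcpy()). In kernel context the vector registers can normally be touched only between kernel_fpu_begin() and kernel_fpu_end(), which this sketch does not model.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#if defined(__x86_64__) && defined(__GNUC__)
#include <immintrin.h>

/* Hypothetical helper: copy n bytes in 32-byte unaligned AVX2 chunks. */
__attribute__((target("avx2")))
static void copy_avx2(unsigned char *d, const unsigned char *s, size_t n)
{
    while (n >= 32) {
        __m256i v = _mm256_loadu_si256((const __m256i *)s);
        _mm256_storeu_si256((__m256i *)d, v);
        d += 32;
        s += 32;
        n -= 32;
    }
    /* Naive tail: leave the remaining 0..31 bytes to the library. */
    memcpy(d, s, n);
}
#endif

/* Hypothetical entry point: use AVX2 when the CPU has it, else memcpy(). */
static void *small_memcpy(void *dst, const void *src, size_t n)
{
    assert(dst != NULL && src != NULL);
#if defined(__x86_64__) && defined(__GNUC__)
    if (n >= 32 && __builtin_cpu_supports("avx2")) {
        copy_avx2(dst, src, n);
        return dst;
    }
#endif
    return memcpy(dst, src, n);
}
```

The runtime check keeps the sketch safe on CPUs without AVX2; a kernel implementation would more likely pick the routine once at boot rather than test the feature on every call.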

krizhanovsky added this to the 0.5.0 Web Server milestone Nov 4, 2016
krizhanovsky modified the milestones: 0.6 WebOS, 0.5.0 Web Server Feb 12, 2017
krizhanovsky modified the milestones: backlog, 0.9 Web server Mar 9, 2018
krizhanovsky self-assigned this Mar 31, 2018
krizhanovsky modified the milestones: 0.9 Web server, 0.6 KTLS Mar 31, 2018
krizhanovsky added a commit to tempesta-tech/blog that referenced this issue Apr 16, 2018

krizhanovsky commented Apr 23, 2018

The PoCs for memcpy(), memset()/bzero(), and memcmp(), with benchmarks for kernel and user space, can be found here: https://github.com/natsys/blog/tree/master/kstrings . (Replacing the vector implementations with kernel functions, e.g. memcpy_avx() in the test with __memcpy_avx() exported by the patched kernel, doesn't change the results.) While the benchmarks show at least 2x better performance, the resulting gain is noticeable in cache mode only (where we call these functions relatively often). Pure proxying mostly suffers from the large number of interrupts (I tested the optimizations in a VM with an e1000 adapter), so the improvements there are negligible.
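For readers who have not looked at the PoCs, a vectorized memcmp() can be sketched roughly as follows. This is an illustrative user-space version, not the code from the repository above: `cmp_avx2()` and `small_memcmp()` are hypothetical names, and the sketch assumes GCC or Clang on x86-64 with a fallback to the library memcmp() elsewhere.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#if defined(__x86_64__) && defined(__GNUC__)
#include <immintrin.h>

/* Hypothetical helper: compare 32 bytes per iteration with AVX2. */
__attribute__((target("avx2")))
static int cmp_avx2(const unsigned char *a, const unsigned char *b, size_t n)
{
    while (n >= 32) {
        __m256i va = _mm256_loadu_si256((const __m256i *)a);
        __m256i vb = _mm256_loadu_si256((const __m256i *)b);
        /* Bit i of the mask is set when byte i is equal in both chunks. */
        unsigned int eq =
            (unsigned int)_mm256_movemask_epi8(_mm256_cmpeq_epi8(va, vb));

        if (eq != 0xFFFFFFFFu) {
            /* The lowest cleared bit marks the first differing byte. */
            unsigned int i = (unsigned int)__builtin_ctz(~eq);
            return (int)a[i] - (int)b[i];
        }
        a += 32;
        b += 32;
        n -= 32;
    }
    /* Remaining 0..31 bytes go to the library routine. */
    return memcmp(a, b, n);
}
#endif

/* Hypothetical entry point: use AVX2 when the CPU has it, else memcmp(). */
static int small_memcmp(const void *a, const void *b, size_t n)
{
    assert(a != NULL && b != NULL);
#if defined(__x86_64__) && defined(__GNUC__)
    if (n >= 32 && __builtin_cpu_supports("avx2"))
        return cmp_avx2(a, b, n);
#endif
    return memcmp(a, b, n);
}
```

Like memcmp(), the sketch compares bytes as unsigned values, so only the sign of the result is meaningful to callers.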

I believe there are more optimization opportunities for these functions. Glibc uses much more complicated algorithms, e.g. https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S;h=cbd0d077cf61718a8f558829bd368fb79505fa26;hb=HEAD
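One of the tricks in that glibc file, as far as I understand it, is to avoid byte loops for small sizes by loading the head and the tail of the buffer with two possibly overlapping loads and then storing both. The sketch below shows the idea in portable C for sizes up to 16 bytes. `copy_upto16()` is a hypothetical name, and the fixed-size memcpy() calls stand in for single unaligned register loads and stores, which is an assumption about how the compiler lowers them.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: copy 0..16 bytes without a per-byte loop. */
static void copy_upto16(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    assert(n <= 16);
    if (n >= 8) {
        /* 8..16 bytes: the first 8 and the last 8 bytes may overlap. */
        uint64_t head, tail;
        memcpy(&head, s, 8);
        memcpy(&tail, s + n - 8, 8);
        memcpy(d, &head, 8);
        memcpy(d + n - 8, &tail, 8);
    } else if (n >= 4) {
        /* 4..7 bytes: two overlapping 4-byte moves. */
        uint32_t head, tail;
        memcpy(&head, s, 4);
        memcpy(&tail, s + n - 4, 4);
        memcpy(d, &head, 4);
        memcpy(d + n - 4, &tail, 4);
    } else if (n >= 2) {
        /* 2..3 bytes: two overlapping 2-byte moves. */
        uint16_t head, tail;
        memcpy(&head, s, 2);
        memcpy(&tail, s + n - 2, 2);
        memcpy(d, &head, 2);
        memcpy(d + n - 2, &tail, 2);
    } else if (n == 1) {
        d[0] = s[0];
    }
}
```

Because both loads complete before either store, the same code also behaves correctly when source and destination overlap. The same pattern scales to wider registers, e.g. two 32-byte moves for sizes between 32 and 64 bytes.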

Wrk benchmark results

I used the same setup and workload as described in HTTP Requests Proxying article.

proxying of 1KB file

Optimized version (3 runs): 28436 RPS, 28617 RPS, 28515 RPS (perf: __bzero_avx ~0.81%, __memcpy_avx ~0.73%)

Vanilla: 28611 RPS, 28826 RPS, 29064 RPS (perf for memcpy_erms ~ 0.70%)

Caching of 1KB file

Optimized: 51479 RPS, 53292 RPS, 53004 RPS (perf: __memcpy_avx ~2.17% , __bzero_avx ~1.7%)

Vanilla: 44149 RPS, 46638 RPS, 51809 RPS (perf: memcpy_erms ~3% (2nd))
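To put the wrk numbers above side by side, the following sketch averages the three runs of each configuration and prints nothing, leaving the relative change to the caller. The RPS values are copied from the results above; `mean()` and `rel_change_pct()` are hypothetical helper names, and averaging three runs is my own simplification, not a method used in this issue.

```c
#include <assert.h>
#include <stddef.h>

/* RPS values copied from the wrk results above (3 runs each). */
static const double proxy_opt[]     = { 28436, 28617, 28515 };
static const double proxy_vanilla[] = { 28611, 28826, 29064 };
static const double cache_opt[]     = { 51479, 53292, 53004 };
static const double cache_vanilla[] = { 44149, 46638, 51809 };

/* Arithmetic mean of n samples. */
static double mean(const double *v, size_t n)
{
    double sum = 0.0;

    assert(n > 0);
    for (size_t i = 0; i < n; i++)
        sum += v[i];
    return sum / (double)n;
}

/* Relative change of `changed` against `base`, in percent. */
static double rel_change_pct(double base, double changed)
{
    assert(base != 0.0);
    return (changed - base) / base * 100.0;
}
```

With these inputs, `rel_change_pct(mean(proxy_vanilla, 3), mean(proxy_opt, 3))` comes out at about -1.1%, and the same call on the cache arrays at about +10.6%. The vanilla cache runs range from 44149 to 51809 RPS, so three runs are thin evidence for the exact size of the gain.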


krizhanovsky commented Apr 26, 2018

More performance data from the hardware testbed. In most of the tests mwait_idle is at the top of the profile, i.e. the CPUs are mostly idle and the 12-core machine isn't enough to fully load the 8-core Tempesta FW machine, so the absolute benchmark numbers don't make much sense. Also, #984 makes a mess with tons of warnings at the end of the benchmarks when connections are closed, hopefully without affecting the workload itself.

# VANILLA, PROXY, 3B
# ulimit -n 65536; for i in `seq 0 3`; do wrk -c 12288 -t 96 -d 30 http://192.168.0.1:80/; done
# Skip first result to fill in caches
# Format: RPS/timeout errors
228961/0 230919/1000 228408/0

# root@9011:~/ak/linux-4.14.32-tfw/tools/perf# ./perf record -a
     7.13%  swapper          [kernel.vmlinux]                 [k] mwait_idle
     1.32%  nginx            [kernel.vmlinux]                 [k] syscall_return_via_sysret
     1.16%  nginx            [kernel.vmlinux]                 [k] skb_release_data
     1.01%  nginx            [mlx4_en]                        [k] mlx4_en_process_rx_cq
     0.95%  nginx            [kernel.vmlinux]                 [k] _raw_spin_lock
     0.92%  nginx            nginx                            [.] ngx_http_parse_header_line
     0.89%  nginx            [kernel.vmlinux]                 [k] tcp_ack
     0.83%  nginx            [kernel.vmlinux]                 [k] native_irq_return_iret
     0.81%  nginx            [mlx4_core]                      [k] mlx4_eq_int
     0.79%  nginx            nginx                            [.] ngx_open_cached_file
     0.71%  nginx            [kernel.vmlinux]                 [k] __x86_indirect_thunk_rax
     0.68%  nginx            [kernel.vmlinux]                 [k] memcpy_erms
     0.65%  nginx            [mlx4_en]                        [k] mlx4_en_xmit
     0.64%  nginx            [kernel.vmlinux]                 [k] __inet_lookup_established
     0.59%  nginx            [kernel.vmlinux]                 [k] tcp_write_xmit
     0.59%  nginx            [kernel.vmlinux]                 [k] tcp_transmit_skb
     0.58%  nginx            nginx                            [.] ngx_http_parse_request_line
     0.53%  nginx            [kernel.vmlinux]                 [k] copy_user_generic_unrolled
     0.52%  nginx            nginx                            [.] ngx_vslprintf
     0.52%  nginx            [kernel.vmlinux]                 [k] _raw_spin_lock_bh
     0.50%  nginx            [kernel.vmlinux]                 [k] __local_bh_enable_ip
     0.49%  nginx            [kernel.vmlinux]                 [k] __alloc_skb
     ........
     0.22%  nginx            [kernel.vmlinux]                 [k] memset_erms


# VANILLA, CACHE, 3B
383591/0 380323/0 381964/0

# perf
    29.74%  swapper          [kernel.vmlinux]          [k] mwait_idle
     2.88%  swapper          [mlx4_en]                 [k] mlx4_en_process_rx_cq
     2.45%  swapper          [mlx4_core]               [k] mlx4_eq_int
     1.91%  swapper          [kernel.vmlinux]          [k] memcpy_erms
    ........
     0.07%  swapper          [kernel.vmlinux]          [k] memset_erms


# VANILLA, PROXY, 1401B
132466/8809 131659/0 138782/1858

# perf
    16.74%  nginx            [kernel.vmlinux]                 [k] read_hpet
     8.96%  swapper          [kernel.vmlinux]                 [k] read_hpet
     7.87%  swapper          [kernel.vmlinux]                 [k] mwait_idle
     4.99%  ksoftirqd/2      [kernel.vmlinux]                 [k] read_hpet
     1.46%  ksoftirqd/7      [kernel.vmlinux]                 [k] read_hpet
     0.72%  nginx            [mlx4_en]                        [k] mlx4_en_process_rx_cq
     0.69%  nginx            [kernel.vmlinux]                 [k] syscall_return_via_sysret
     0.56%  nginx            [kernel.vmlinux]                 [k] copy_user_generic_unrolled
     ....
     0.26%  nginx            [kernel.vmlinux]                 [k] memcpy_erms
     ....
     0.07%  nginx            [kernel.vmlinux]                 [k] memset_erms


# VANILLA, CACHE, 1401B
302142/0 306771/10252 298390/0

# perf
    30.94%  swapper          [kernel.vmlinux]          [k] mwait_idle
     3.56%  swapper          [mlx4_en]                 [k] mlx4_en_process_rx_cq
     2.78%  swapper          [mlx4_core]               [k] mlx4_eq_int
     2.17%  swapper          [kernel.vmlinux]          [k] __inet_lookup_established
     1.75%  swapper          [kernel.vmlinux]          [k] memcpy_erms
     ....
     0.06%  swapper          [kernel.vmlinux]          [k] memset_erms



# AK-635 OPTIMIZATION, PROXY, 3B
217537/0 234976/0 229096/0

# perf
     5.10%  swapper          [kernel.vmlinux]                 [k] mwait_idle
     1.39%  nginx            [kernel.vmlinux]                 [k] syscall_return_via_sysret
     1.08%  nginx            [kernel.vmlinux]                 [k] skb_release_data
     0.97%  nginx            [kernel.vmlinux]                 [k] tcp_ack
     0.92%  nginx            [kernel.vmlinux]                 [k] _raw_spin_lock
     0.92%  nginx            nginx                            [.] ngx_http_parse_header_line
     0.87%  nginx            [mlx4_en]                        [k] mlx4_en_process_rx_cq
     0.87%  nginx            [kernel.vmlinux]                 [k] native_irq_return_iret
     0.80%  nginx            [mlx4_core]                      [k] mlx4_eq_int
     0.77%  nginx            [kernel.vmlinux]                 [k] __x86_indirect_thunk_rax
     0.73%  nginx            [kernel.vmlinux]                 [k] __bzero_avx
     0.71%  nginx            nginx                            [.] ngx_open_cached_file
     0.65%  nginx            [kernel.vmlinux]                 [k] tcp_transmit_skb
     0.61%  nginx            [kernel.vmlinux]                 [k] __memcpy_avx


# AK-635 OPTIMIZATION, CACHE, 3B
356364/0 355789/0 353979/0

# perf
    37.95%  swapper          [kernel.vmlinux]          [k] mwait_idle
     2.49%  swapper          [mlx4_en]                 [k] mlx4_en_process_rx_cq
     2.33%  swapper          [mlx4_core]               [k] mlx4_eq_int
     1.77%  swapper          [kernel.vmlinux]          [k] __inet_lookup_established
     1.48%  swapper          [kernel.vmlinux]          [k] skb_release_data
     1.30%  swapper          [kernel.vmlinux]          [k] tcp_ack
     1.27%  swapper          [kernel.vmlinux]          [k] _raw_spin_lock
     1.07%  swapper          [mlx4_en]                 [k] mlx4_en_xmit
     0.92%  swapper          [kernel.vmlinux]          [k] _find_next_bit
     0.90%  swapper          [kernel.vmlinux]          [k] __memcpy_avx
     0.87%  swapper          [kernel.vmlinux]          [k] native_irq_return_iret
     0.84%  swapper          [kernel.vmlinux]          [k] tcp_v4_rcv
     0.83%  swapper          [kernel.vmlinux]          [k] __x86_indirect_thunk_rax
     0.81%  swapper          [kernel.vmlinux]          [k] __bzero_avx


# AK-635 OPTIMIZATION, PROXY, 1401B
187530/0 196462/8 192323

# perf

    10.22%  swapper          [kernel.vmlinux]                 [k] mwait_idle
     1.30%  nginx            [kernel.vmlinux]                 [k] syscall_return_via_sysret
     1.13%  nginx            [mlx4_en]                        [k] mlx4_en_process_rx_cq
     0.96%  nginx            [mlx4_core]                      [k] mlx4_eq_int
     0.85%  nginx            [kernel.vmlinux]                 [k] skb_release_data
     0.78%  nginx            [kernel.vmlinux]                 [k] native_irq_return_iret
     0.78%  nginx            [kernel.vmlinux]                 [k] _raw_spin_lock
     0.74%  nginx            nginx                            [.] ngx_http_parse_header_line
     0.71%  nginx            [kernel.vmlinux]                 [k] tcp_ack
     0.68%  nginx            nginx                            [.] ngx_open_cached_file
     0.62%  nginx            [kernel.vmlinux]                 [k] __x86_indirect_thunk_rax
     0.59%  nginx            [kernel.vmlinux]                 [k] __bzero_avx
     0.57%  nginx            nginx                            [.] ngx_http_parse_request_line
     0.52%  nginx            [kernel.vmlinux]                 [k] __inet_lookup_established
     0.51%  nginx            [kernel.vmlinux]                 [k] __memcpy_avx


# AK-635 OPTIMIZATION, CACHE, 1401B
293196/0 302198/13752 287412/0

# perf
    38.62%  swapper          [kernel.vmlinux]          [k] mwait_idle
     2.94%  swapper          [mlx4_en]                 [k] mlx4_en_process_rx_cq
     2.59%  swapper          [mlx4_core]               [k] mlx4_eq_int
     1.61%  swapper          [kernel.vmlinux]          [k] __inet_lookup_established
     1.45%  swapper          [kernel.vmlinux]          [k] _raw_spin_lock
     1.10%  swapper          [kernel.vmlinux]          [k] get_nohz_timer_target
     1.07%  swapper          [kernel.vmlinux]          [k] skb_release_data
     1.06%  swapper          [kernel.vmlinux]          [k] tcp_ack
     1.01%  swapper          [kernel.vmlinux]          [k] _find_next_bit
     0.97%  swapper          [kernel.vmlinux]          [k] native_irq_return_iret
     0.87%  swapper          [mlx4_en]                 [k] mlx4_en_xmit
     0.82%  swapper          [kernel.vmlinux]          [k] tcp_v4_rcv
     0.75%  swapper          [kernel.vmlinux]          [k] __x86_indirect_thunk_rax
     0.73%  swapper          [kernel.vmlinux]          [k] __memcpy_avx
     0.73%  swapper          [kernel.vmlinux]          [k] __bzero_avx

krizhanovsky added a commit that referenced this issue May 4, 2018
Fix #635: SIMD implementations for memcpy(), memset(), and memcmp()