Use bit vector to store CESU-8 lookup table, #794

Closed
wants to merge 1 commit into from


huxinx
Contributor

@huxinx huxinx commented Dec 30, 2015

JerryScript-DCO-1.0-Signed-off-by: Xin Hu [email protected]

Ran ./tools/run-perf-test.sh 10 times.
OS: Ubuntu 15.04, 32-bit
CPU: Intel(R) Celeron(R) CPU N2820 @ 2.13GHz, 2 cores

Benchmark | RSS before -> after (% change, + is better) | Perf before -> after (% change, + is better)
3d-cube.js 120-> 120 (0.000) 1.092->1.08178 (0.936)
access-binary-trees.js 92-> 88 (4.348) 0.660444->0.660889 (-0.067)
access-fannkuch.js 44-> 44 (0.000) 3.51244->3.54133 (-0.823)
bitops-3bit-bits-in-byte.js 36-> 32 (11.111) 0.910222->0.920444 (-1.123)
bitops-bits-in-byte.js 36-> 32 (11.111) 1.19556->1.22444 (-2.416)
bitops-bitwise-and.js 40-> 36 (10.000) 1.27956->1.24711 (2.536)
controlflow-recursive.js 244-> 244 (0.000) 0.56-> 0.576 (-2.857)
crypto-aes.js 128-> 128 (0.000) 2.31156->2.25244 (2.558)
crypto-md5.js 196-> 192 (2.041) 12.3347-> 9.312 (24.506)
crypto-sha1.js 140-> 136 (2.857) 5.51956-> 4.296 (22.168)
date-format-xparb.js 80-> 76 (5.000) 0.64->0.644444 (-0.694)
math-cordic.js 44-> 44 (0.000) 1.32-> 1.332 (-0.909)
math-spectral-norm.js 44-> 44 (0.000) 0.793333->0.796444 (-0.392)
string-base64.js 172-> 168 (2.326) 97.4058->82.5538 (15.248)
string-fasta.js 56-> 56 (0.000) 2.27156->2.17556 (4.226)
Geometric mean: RSS reduction: 3.3416% Speed up: 4.6192%

Wed Dec 30 20:40:51 EST 2015
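
The summary figures can be reproduced from the rows. Each per-row percentage is consistent with (before - after) / before, and the RSS summary of 3.3416% matches one minus the geometric mean of the after/before ratios. The sketch below assumes that definition, which is inferred from the numbers rather than taken from run-perf-test.sh itself. The function names are hypothetical, and the n-th root is computed by bisection only to keep the example free of libm.

```c
#include <stddef.h>

/* Percentage gain for one row: positive when `after` is smaller. */
static double percent_gain (double before, double after)
{
  return (before - after) / before * 100.0;
}

/* n-th root by bisection. Assumes x > 0 and n > 0. */
static double nth_root (double x, size_t n)
{
  double lo = 0.0;
  double hi = (x > 1.0) ? x : 1.0;

  for (int iter = 0; iter < 100; iter++)
  {
    double mid = (lo + hi) / 2.0;
    double power = 1.0;

    for (size_t i = 0; i < n; i++)
    {
      power *= mid;
    }

    if (power < x)
    {
      lo = mid;
    }
    else
    {
      hi = mid;
    }
  }

  return (lo + hi) / 2.0;
}

/* Overall gain in percent: 1 minus the geometric mean of after/before. */
static double geomean_gain (const double *before, const double *after, size_t n)
{
  double product = 1.0;

  for (size_t i = 0; i < n; i++)
  {
    product *= after[i] / before[i];
  }

  return (1.0 - nth_root (product, n)) * 100.0;
}

/* RSS columns of the x86 table above, in row order. */
static const double rss_before[15] =
  { 120, 92, 44, 36, 36, 40, 244, 128, 196, 140, 80, 44, 44, 172, 56 };
static const double rss_after[15] =
  { 120, 88, 44, 32, 32, 36, 244, 128, 192, 136, 76, 44, 44, 168, 56 };
```

Calling geomean_gain (rss_before, rss_after, 15) gives a value that rounds to the 3.3416% reported in the summary line.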

Another one for lit_get_unicode_char_size_by_utf8_first_byte.

Use a bit vector to store the CESU-8 length table.
Since each value is 0, 1, 2, or 3, two bits are enough to represent one table item.
The whole table is 16 * 2 = 32 bits, so a single uint32 can store the CESU-8 table.
This saves 12 bytes compared to the uint8 table version (16 bytes down to 4).
The average performance seems fine. crypto-md5.js, crypto-sha1.js and string-base64.js improve a lot, but several others are not as good as with #793.
It is hard to say which one is better.
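
As a rough illustration of the packing described above, the sketch below builds the 32-bit word from a 16-entry byte table and reads an entry back. The table values are obtained by unpacking the constant 0x3a005555 quoted later in this conversation. The names length_by_high_nibble, pack_table and packed_lookup are hypothetical and do not come from the patch.

```c
#include <stdint.h>

/* Byte-table form: sequence length indexed by the high nibble of the first
 * byte. 0 marks nibbles that cannot start a CESU-8 sequence. */
static const uint8_t length_by_high_nibble[16] =
{
  1, 1, 1, 1, 1, 1, 1, 1, /* 0x00..0x7F: one byte */
  0, 0, 0, 0,             /* 0x80..0xBF: continuation bytes */
  2, 2,                   /* 0xC0..0xDF: two bytes */
  3,                      /* 0xE0..0xEF: three bytes */
  0                       /* 0xF0..0xFF: stored as 0 in the constant */
};

/* Pack the 16 two-bit entries into one word, entry i at bits 2i and 2i+1. */
static uint32_t pack_table (const uint8_t table[16])
{
  uint32_t packed = 0;

  for (unsigned i = 0; i < 16; i++)
  {
    packed |= (uint32_t) (table[i] & 0x3u) << (2u * i);
  }

  return packed;
}

/* Read the entry that belongs to first_byte back out of the packed word. */
static unsigned packed_lookup (uint32_t packed, uint8_t first_byte)
{
  unsigned shift = (unsigned) (first_byte >> 4) << 1;

  return (packed >> shift) & 0x3u;
}
```

pack_table (length_by_high_nibble) yields 0x3a005555, so the 16-byte table and the 4-byte word carry the same information.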

@zherczeg
Member

This idea seems clever. However, I don't think it is little endian specific. The value is stored in a 32 bit register, and endianness is irrelevant for values in registers.
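
A small sketch of the point about endianness, with hypothetical function names. A lookup written with shifts works on the numeric value and gives the same answer on any byte order. Only a lookup that indexes the bytes of the word in memory would depend on it.

```c
#include <stdint.h>
#include <string.h>

/* Value-based lookup: shifts act on the numeric value, so the result does
 * not depend on how the word is laid out in memory. */
static unsigned lookup_by_shift (uint32_t packed, uint8_t first_byte)
{
  return (packed >> ((unsigned) (first_byte >> 4) << 1)) & 0x3u;
}

/* Memory-based lookup: picks a byte out of the in-memory representation.
 * This form is only correct on a little-endian target, where byte 0 holds
 * entries 0..3. */
static unsigned lookup_by_memory (uint32_t packed, uint8_t first_byte)
{
  unsigned char bytes[sizeof packed];
  unsigned index = first_byte >> 4;

  memcpy (bytes, &packed, sizeof packed);

  return (bytes[index / 4] >> ((index % 4) * 2)) & 0x3u;
}

/* Returns 1 when the running target stores the low byte first. */
static int is_little_endian (void)
{
  const uint32_t probe = 1;
  unsigned char first;

  memcpy (&first, &probe, 1);

  return first == 1;
}
```

Since the patch uses the shift form, no byte-order guard should be needed around it.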

@egavrin egavrin added the enhancement An improvement label Dec 31, 2015
@egavrin
Contributor

egavrin commented Dec 31, 2015

RPi2:

Benchmark | RSS before -> after (% change, + is better) | Perf before -> after (% change, + is better)
3d-cube.js 112-> 112 (0.000) 3.0688->3.0816 (-0.417)
access-binary-trees.js 80-> 80 (0.000) 1.876->1.8768 (-0.043)
access-fannkuch.js 40-> 40 (0.000) 9.0992->9.1248 (-0.281)
access-nbody.js 48-> 48 (0.000) 4.184->4.1744 (0.229)
bitops-3bit-bits-in-byte.js 32-> 32 (0.000) 2.356->2.3624 (-0.272)
bitops-bits-in-byte.js 32-> 32 (0.000) 3.1264->3.1392 (-0.409)
bitops-bitwise-and.js 28-> 28 (0.000) 3.5096->3.4992 (0.296)
bitops-nsieve-bits.js 156-> 156 (0.000) 27.592->27.5952 (-0.012)
controlflow-recursive.js 236-> 236 (0.000) 1.6152->1.6232 (-0.495)
crypto-aes.js 120-> 120 (0.000) 5.4672-> 5.484 (-0.307)
crypto-md5.js 184-> 184 (0.000) 29.3912->29.3256 (0.223)
crypto-sha1.js 132-> 132 (0.000) 13.2592->13.2616 (-0.018)
date-format-xparb.js 72-> 72 (0.000) 1.6576->1.6568 (0.048)
math-cordic.js 40-> 40 (0.000) 3.388->3.3968 (-0.260)
math-partial-sums.js 36-> 36 (0.000) 1.9776->1.9632 (0.728)
math-spectral-norm.js 40-> 40 (0.000) 2.1384->2.1224 (0.748)
string-fasta.js 48-> 48 (0.000) 5.4864->5.4816 (0.087)
Geometric mean: RSS reduction: 0% Speed up: -0.008%

@@ -757,6 +757,17 @@ lit_utf8_string_code_unit_at (const lit_utf8_byte_t *utf8_buf_p, /**< utf-8 stri
return code_unit;
} /* lit_utf8_string_code_unit_at */

/* CESU-8 number of bytes occupied lookup table */
#ifndef __LITTLE_ENDIAN
const __attribute__ ((aligned (CESU_8_TABLE_MEM_ALIGNMENT))) lit_utf8_byte_t table[]
Contributor

Why do we need the alignment attribute?

@huxinx
Contributor Author

huxinx commented Jan 3, 2016

@zherczeg , yes, you are right. Updated.

@huxinx
Contributor Author

huxinx commented Jan 6, 2016

On x86, most of the performance gain comes from inlining lit_get_unicode_char_size_by_utf8_first_byte().
Even if I only force lit_get_unicode_char_size_by_utf8_first_byte to be inlined,

--- a/jerry-core/lit/lit-strings.cpp
+++ b/jerry-core/lit/lit-strings.cpp
@@ -762,7 +762,7 @@ lit_utf8_string_code_unit_at (const lit_utf8_byte_t *utf8_buf_p, /**< utf-8 stri

-lit_utf8_size_t
+lit_utf8_size_t __attr_always_inline___
lit_get_unicode_char_size_by_utf8_first_byte (const lit_utf8_byte_t first_byte) /**< buffer with characters */

the performance is still pretty good.
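
For readers unfamiliar with forced inlining, here is a minimal sketch using the GCC/Clang always_inline attribute directly. The macro SKETCH_ALWAYS_INLINE and the function names are hypothetical. The function body follows the branch logic visible in the "original version" disassembly later in this thread, not the actual source.

```c
#include <stdint.h>

/* Hypothetical macro for this sketch. GCC and Clang accept the
 * always_inline attribute; other compilers fall back to plain inline. */
#if defined (__GNUC__)
#define SKETCH_ALWAYS_INLINE static inline __attribute__ ((always_inline))
#else
#define SKETCH_ALWAYS_INLINE static inline
#endif

/* Length in bytes of the sequence that starts with first_byte. */
SKETCH_ALWAYS_INLINE unsigned
char_size_by_first_byte (uint8_t first_byte)
{
  if ((first_byte & 0x80) == 0)
  {
    return 1;
  }

  return ((first_byte & 0xE0) == 0xC0) ? 2 : 3;
}

/* A hot loop of the kind that benefits: with the helper inlined there is
 * no call and return per character. Assumes buf holds well-formed input. */
static unsigned
count_sequences (const uint8_t *buf, unsigned size)
{
  unsigned count = 0;
  unsigned pos = 0;

  while (pos < size)
  {
    pos += char_size_by_first_byte (buf[pos]);
    count++;
  }

  return count;
}
```

For the six bytes 0x41, 0xC3, 0xA9, 0xE2, 0x82, 0xAC ("A", "é", "€"), count_sequences returns 3.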

OS, ubuntu 15.04, 32 bit
CPU, Intel(R) Celeron(R) CPU N2820 @ 2.13GHz, 2 core

Benchmark | RSS before -> after (% change, + is better) | Perf before -> after (% change, + is better)
3d-cube.js 120-> 116 (3.333) 1.1-> 1.1 (0.000)
access-binary-trees.js 88-> 88 (0.000) 0.656-> 0.676 (-3.049)
access-fannkuch.js 44-> 44 (0.000) 3.52-> 3.54 (-0.568)
bitops-3bit-bits-in-byte.js 36-> 36 (0.000) 0.912-> 0.912 (0.000)
bitops-bits-in-byte.js 36-> 32 (11.111) 1.196-> 1.172 (2.007)
bitops-bitwise-and.js 40-> 40 (0.000) 1.276-> 1.26 (1.254)
controlflow-recursive.js 244-> 244 (0.000) 0.56-> 0.564 (-0.714)
crypto-aes.js 128-> 128 (0.000) 2.312-> 2.304 (0.346)
crypto-md5.js 192-> 192 (0.000) 12.34-> 9.944 (19.417)
crypto-sha1.js 140-> 140 (0.000) 5.528-> 4.548 (17.728)
date-format-xparb.js 76-> 76 (0.000) 0.648-> 0.632 (2.469)
math-cordic.js 40-> 40 (0.000) 1.32-> 1.304 (1.212)
math-spectral-norm.js 44-> 44 (0.000) 0.8-> 0.792 (1.000)
string-base64.js 168-> 168 (0.000) 97.42->82.872 (14.933)
string-fasta.js 56-> 56 (0.000) 2.272-> 2.164 (4.754)
Geometric mean: RSS reduction: 1.0061% Speed up: 4.3189%

Wed Jan 6 11:57:36 EST 2016

Maybe lit_get_unicode_char_size_by_utf8_first_byte() is not a bottleneck on ARM?

@egavrin, do you have LTO turned on for RPi2?

@zherczeg
Member

zherczeg commented Jan 6, 2016

It is strange that this change improves x86 by 4% but has a negligible effect on ARM. Regardless, I like this optimization, and I have no objection to landing it. However, it would be good to figure out why the difference is so big. Any ideas?

@huxinx
Contributor Author

huxinx commented Jan 8, 2016

On Raspberry Pi 2

Only inline lit_get_unicode_char_size_by_utf8_first_byte():

--- a/jerry-core/lit/lit-strings.cpp
+++ b/jerry-core/lit/lit-strings.cpp
@@ -762,7 +762,7 @@ lit_utf8_string_code_unit_at (const lit_utf8_byte_t *utf8_buf_p, /**< utf-8 stri

-lit_utf8_size_t
+lit_utf8_size_t __attr_always_inline___
lit_get_unicode_char_size_by_utf8_first_byte (const lit_utf8_byte_t first_byte) /**< buffer with characters */

Benchmark | RSS before -> after (% change, + is better) | Perf before -> after (% change, + is better)
3d-cube.js 132-> 132 (0.000) 3-> 3.02 (-0.667)
access-binary-trees.js 104-> 104 (0.000) 1.82667->1.84333 (-0.912)
access-fannkuch.js 60-> 60 (0.000) 9.05667-> 9.15 (-1.031)
access-nbody.js 68-> 68 (0.000) 4.07667-> 4.15 (-1.799)
bitops-3bit-bits-in-byte.js 52-> 52 (0.000) 2.25->2.26333 (-0.592)
bitops-bits-in-byte.js 52-> 52 (0.000) 3.05333-> 3.06 (-0.218)
bitops-bitwise-and.js 52-> 52 (0.000) 3.51333->3.43667 (2.182)
bitops-nsieve-bits.js 172-> 172 (0.000) 27.94->28.0367 (-0.346)
controlflow-recursive.js 240-> 240 (0.000) 1.54667->1.56333 (-1.077)
crypto-aes.js 140-> 140 (0.000) 5.98667->5.82667 (2.673)
crypto-md5.js 204-> 204 (0.000) 30.86->24.1933 (21.603)
crypto-sha1.js 152-> 152 (0.000) 13.8367->11.1367 (19.513)
date-format-xparb.js 96-> 96 (0.000) 1.65333-> 1.66 (-0.403)
math-cordic.js 60-> 60 (0.000) 3.31-> 3.36 (-1.511)
math-partial-sums.js 52-> 52 (0.000) 1.90333-> 1.95 (-2.452)
math-spectral-norm.js 56-> 56 (0.000) 2.06->2.07333 (-0.647)
string-base64.js 184-> 184 (0.000) 261.103->220.25 (15.646)
string-fasta.js 68-> 68 (0.000) 5.68->5.53667 (2.523)
Geometric mean: RSS reduction: 0% Speed up: 3.2224%

ARM benefits from inlining too.

@huxinx
Contributor Author

huxinx commented Jan 8, 2016

Raspberry Pi 2

Not inlined, using the bit vector version:

lit_utf8_size_t __attribute__ ((noinline))
lit_get_unicode_char_size_by_utf8_first_byte (const lit_utf8_byte_t first_byte) /**< buffer with characters */
{
  /* 16 two-bit length entries, indexed by the high nibble of first_byte */
  const uint32_t cesu_8_store = 0x3a005555;
  int shift = (first_byte >> 4) << 1;

  return (cesu_8_store >> shift) & 0x3;
} /* lit_get_unicode_char_size_by_utf8_first_byte */

Benchmark | RSS before -> after (% change, + is better) | Perf before -> after (% change, + is better)
crypto-aes.js 140-> 140 (0.000) 6.35-> 6.3 (0.787)
crypto-md5.js 204-> 204 (0.000) 30.85-> 35.6 (-15.397)
crypto-sha1.js 152-> 152 (0.000) 13.8-> 15.74 (-14.058)
math-cordic.js 60-> 60 (0.000) 3.31-> 3.28 (0.906)
math-partial-sums.js 52-> 52 (0.000) 1.89-> 1.89 (0.000)
string-base64.js 184-> 184 (0.000) 263.09->293.75 (-11.654)
string-fasta.js 68-> 68 (0.000) 5.69-> 5.82 (-2.285)
Geometric mean: RSS reduction: 0% Speed up: -5.738%

This shows that the bit vector hurts ARM performance a lot.

Here is the objdump of the bit vector version:

0000dec4 <_Z44lit_get_unicode_char_size_by_utf8_first_byteh.9439>:
dec4: 4b03 ldr r3, [pc, #12] ; (ded4 <_Z44lit_get_unicode_char_size_by_utf8_first_byteh.9439+0x10>)
dec6: 0900 lsrs r0, r0, #4
dec8: 0040 lsls r0, r0, #1
deca: fa43 f000 asr.w r0, r3, r0
dece: f000 0003 and.w r0, r0, #3
ded2: 4770 bx lr
ded4: 3a005555 .word 0x3a005555

This is the original version,
0000dec4 <_Z44lit_get_unicode_char_size_by_utf8_first_byteh.9439>:
dec4: 0603 lsls r3, r0, #24
dec6: d506 bpl.n ded6 <_Z44lit_get_unicode_char_size_by_utf8_first_byteh.9439+0x12>
dec8: f000 00e0 and.w r0, r0, #224 ; 0xe0
decc: 28c0 cmp r0, #192 ; 0xc0
dece: bf14 ite ne
ded0: 2003 movne r0, #3
ded2: 2002 moveq r0, #2
ded4: 4770 bx lr
ded6: 2001 movs r0, #1
ded8: 4770 bx lr

I am not familiar with ARM. Does this ldr hurt performance?
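
For comparison, the two listings can be read back into C roughly as follows. This is a sketch: size_branch is reconstructed from the disassembly of the original version, not copied from the source file, and all names are hypothetical. It also makes a behavioural difference visible. The two versions agree for every byte that can start a CESU-8 sequence (0x00..0x7F and 0xC0..0xEF), but for continuation bytes and 0xF0..0xFF the branch version returns 3 while the bit vector returns 0. That only matters if the function is ever called on malformed input.

```c
#include <stdint.h>

/* Branch version, read back from the objdump of the original function. */
static unsigned size_branch (uint8_t b)
{
  if ((b & 0x80) == 0) /* lsls / bpl: bit 7 clear */
  {
    return 1;
  }

  return ((b & 0xE0) == 0xC0) ? 2 : 3; /* and / cmp / ite */
}

/* Bit vector version, as posted above. */
static unsigned size_bitvector (uint8_t b)
{
  const uint32_t cesu_8_store = 0x3a005555u;

  return (cesu_8_store >> ((unsigned) (b >> 4) << 1)) & 0x3u;
}

/* Number of byte values in [from, to] for which the two versions differ. */
static unsigned count_mismatches (unsigned from, unsigned to)
{
  unsigned mismatches = 0;

  for (unsigned b = from; b <= to; b++)
  {
    if (size_branch ((uint8_t) b) != size_bitvector ((uint8_t) b))
    {
      mismatches++;
    }
  }

  return mismatches;
}
```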

@zherczeg
Member

zherczeg commented Jan 8, 2016

Yes, since the constant is in the constant pool, and that load probably wastes a whole cache line.

I suspect this is an aggressive "size" optimization that hurts performance too much. ARM could produce this constant with two 4-byte instructions (8 bytes altogether), but it takes only 6 bytes this way. However, I don't think this optimization is worth it at all.

Does this asm block help?

uint32_t cesu_8_store;
asm ( " movw %0, 0x5555 \n movt %0, 0x3a00" : "=r" (cesu_8_store));
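
A slightly more defensive form of the same idea, as a sketch only. The wrapper name load_cesu_8_store and the preprocessor guard are assumptions. The ARM path expects a GCC-compatible compiler targeting ARMv7 or later, where movw/movt are available, and every other target falls back to a plain constant. Note that the follow-up comment reports no improvement from this approach.

```c
#include <stdint.h>

/* Materialize 0x3a005555 in a register without a literal-pool load. */
static inline uint32_t load_cesu_8_store (void)
{
#if defined (__GNUC__) && defined (__arm__) && defined (__ARM_ARCH) && (__ARM_ARCH >= 7)
  uint32_t value;

  /* movw writes the low halfword and clears the high one,
   * movt then fills in the high halfword. */
  __asm__ ("movw %0, #0x5555\n\t"
           "movt %0, #0x3a00"
           : "=r" (value));

  return value;
#else
  /* Other targets: let the compiler choose how to load the constant. */
  return 0x3a005555u;
#endif
}
```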

@huxinx
Contributor Author

huxinx commented Jan 8, 2016

uint32_t cesu_8_store;
asm ( " movw %0, 0x5555 \n movt %0, 0x3a00" : "=r" (cesu_8_store));

does not help.

@huxinx
Contributor Author

huxinx commented Jan 8, 2016

How about using __attr_always_inline___ everywhere, and the bit vector only on x86?
ARM benefits from inlining,
and x86 gets better performance from inlining plus the bit vector, which removes the if/else branches.
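
One way this proposal could look, as a sketch under stated assumptions. The macro SKETCH_ALWAYS_INLINE, the function name and the choice of predefined macros (__i386__, __x86_64__) are illustrative, and the branch body is read back from the disassembly above rather than from the source file.

```c
#include <stdint.h>

/* Hypothetical macro for this sketch, standing in for the project's own. */
#if defined (__GNUC__)
#define SKETCH_ALWAYS_INLINE static inline __attribute__ ((always_inline))
#else
#define SKETCH_ALWAYS_INLINE static inline
#endif

SKETCH_ALWAYS_INLINE unsigned
char_size_by_first_byte (uint8_t first_byte)
{
#if defined (__i386__) || defined (__x86_64__)
  /* x86: branch-free lookup in the packed two-bit table. */
  const uint32_t cesu_8_store = 0x3a005555u;

  return (cesu_8_store >> ((unsigned) (first_byte >> 4) << 1)) & 0x3u;
#else
  /* Other targets, including ARM: keep the branch version. */
  if ((first_byte & 0x80) == 0)
  {
    return 1;
  }

  return ((first_byte & 0xE0) == 0xC0) ? 2 : 3;
#endif
}
```

Both paths return the same length for every byte that can start a sequence, so callers should not need to know which one was compiled in.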

lit_get_unicode_char_size_by_utf8_first_byte is a small function, and it is not called everywhere.
After inlining, the binary size is still 568k.

Raspberry Pi 2, perf.sh run 6 times:

Benchmark | RSS before -> after (% change, + is better) | Perf before -> after (% change, + is better)
3d-cube.js 132-> 132 (0.000) 3.16333->3.17333 (-0.316)
access-binary-trees.js 104-> 104 (0.000) 1.92667->1.94333 (-0.865)
access-fannkuch.js 60-> 60 (0.000) 9.57667-> 9.64 (-0.661)
access-nbody.js 68-> 68 (0.000) 4.31667-> 4.39 (-1.699)
bitops-3bit-bits-in-byte.js 52-> 52 (0.000) 2.37->2.37667 (-0.281)
bitops-bits-in-byte.js 52-> 52 (0.000) 3.2->3.22333 (-0.729)
bitops-bitwise-and.js 52-> 52 (0.000) 3.51333-> 3.45 (1.803)
bitops-nsieve-bits.js 172-> 172 (0.000) 27.9533-> 27.96 (-0.024)
controlflow-recursive.js 240-> 240 (0.000) 1.55->1.55667 (-0.430)
crypto-aes.js 140-> 140 (0.000) 5.97333->5.81333 (2.679)
crypto-md5.js 204-> 204 (0.000) 31.0433->24.1633 (22.163)
crypto-sha1.js 152-> 152 (0.000) 13.8933-> 11.19 (19.458)
date-format-xparb.js 96-> 96 (0.000) 1.65->1.66333 (-0.808)
math-cordic.js 60-> 60 (0.000) 3.30667->3.35333 (-1.411)
math-partial-sums.js 52-> 52 (0.000) 1.92-> 1.95 (-1.562)
math-spectral-norm.js 56-> 56 (0.000) 2.06-> 2.08 (-0.971)
string-base64.js 184-> 184 (0.000) 260.84->219.527 (15.838)
string-fasta.js 68-> 68 (0.000) 5.66->5.51667 (2.532)
Geometric mean: RSS reduction: 0% Speed up: 3.35%

Sun 22 Nov 21:36:59 UTC 2015

@sand1k
Contributor

sand1k commented Jan 11, 2016

It is better just to inline lit_get_unicode_char_size_by_utf8_first_byte without introducing bit vector.

@sand1k
Contributor

sand1k commented Jan 12, 2016

LGTM.

@galpeter
Contributor

lgtm

@galpeter galpeter removed their assignment Jan 13, 2016
@ruben-ayrapetyan
Contributor

Merged (7255d64)
