Use bit vector to store CESU-8 lookup table, #794

huxinx · 2015-12-30T18:29:53Z

JerryScript-DCO-1.0-Signed-off-by: Xin Hu [email protected]

Run ./tools/run-perf-test.sh 10 times,
OS, ubuntu 15.04, 32 bit
CPU, Intel(R) Celeron(R) CPU N2820 @ 2.13GHz, 2 core

Benchmark	RSS (+ is better)	Perf (+ is better)
3d-cube.js	120-> 120 (0.000)	1.092->1.08178 (0.936)
access-binary-trees.js	92-> 88 (4.348)	0.660444->0.660889 (-0.067)
access-fannkuch.js	44-> 44 (0.000)	3.51244->3.54133 (-0.823)
bitops-3bit-bits-in-byte.js	36-> 32 (11.111)	0.910222->0.920444 (-1.123)
bitops-bits-in-byte.js	36-> 32 (11.111)	1.19556->1.22444 (-2.416)
bitops-bitwise-and.js	40-> 36 (10.000)	1.27956->1.24711 (2.536)
controlflow-recursive.js	244-> 244 (0.000)	0.56-> 0.576 (-2.857)
crypto-aes.js	128-> 128 (0.000)	2.31156->2.25244 (2.558)
crypto-md5.js	196-> 192 (2.041)	12.3347-> 9.312 (24.506)
crypto-sha1.js	140-> 136 (2.857)	5.51956-> 4.296 (22.168)
date-format-xparb.js	80-> 76 (5.000)	0.64->0.644444 (-0.694)
math-cordic.js	44-> 44 (0.000)	1.32-> 1.332 (-0.909)
math-spectral-norm.js	44-> 44 (0.000)	0.793333->0.796444 (-0.392)
string-base64.js	172-> 168 (2.326)	97.4058->82.5538 (15.248)
string-fasta.js	56-> 56 (0.000)	2.27156->2.17556 (4.226)
Geometric mean:	RSS reduction: 3.3416%	Speed up: 4.6192%
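
For anyone reading the tables: the figure in parentheses appears to be the relative improvement of the second measurement over the first, in percent. A minimal sketch under that assumption (the helper name `percent_gain` is made up here and is not taken from run-perf-test.sh):

```c
#include <assert.h>

/* Hypothetical helper: relative improvement, in percent, when a
   "smaller is better" measurement goes from old_value to new_value.
   A positive result means the new build is better. */
static double
percent_gain (double old_value, double new_value)
{
  assert (old_value > 0.0);
  return (old_value - new_value) / old_value * 100.0;
}
```

With the crypto-md5.js row, `percent_gain (12.3347, 9.312)` comes out at roughly 24.506, which matches the table. How the script derives the geometric-mean line is not shown in this thread.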

Wed Dec 30 20:40:51 EST 2015

Another one for lit_get_unicode_char_size_by_utf8_first_byte.

Use a bit vector to store the CESU-8 length table.
Since each value is 0, 1, 2, or 3, two bits are enough to represent one table item.
The whole table is 16 * 2 = 32 bits, so a single uint32 can store the CESU-8 table.
This saves 12 bytes compared to the uint8 table version.
The average performance seems fine. crypto-md5.js, crypto-sha1.js and string-base64.js improve a lot, but several others are not as good as with #793.
Hard to say which one is better.
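
The packing described above can be sketched as follows. The table contents are an assumption, chosen so that the packed word equals the 0x3a005555 constant quoted later in this thread, and all names are illustrative only:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical uint8 table indexed by the high nibble of the first byte.
   0 marks nibbles that cannot start a CESU-8 sequence. */
static const uint8_t length_by_high_nibble[16] = {
  1, 1, 1, 1, 1, 1, 1, 1, /* 0x0-0x7: one byte */
  0, 0, 0, 0,             /* 0x8-0xB: continuation bytes */
  2, 2,                   /* 0xC-0xD: two bytes */
  3,                      /* 0xE: three bytes */
  0                       /* 0xF: not used by CESU-8 */
};

/* Pack the 16 entries into one word, two bits per entry,
   with entry 0 in the lowest bits. */
static uint32_t
pack_length_table (void)
{
  uint32_t packed = 0;

  for (unsigned i = 0; i < 16; i++)
  {
    assert (length_by_high_nibble[i] <= 3); /* must fit in two bits */
    packed |= (uint32_t) length_by_high_nibble[i] << (2 * i);
  }

  return packed;
}

/* Read one entry back out of the packed word. */
static uint32_t
packed_length (uint32_t packed, uint8_t first_byte)
{
  unsigned shift = (unsigned) (first_byte >> 4) << 1;

  return (packed >> shift) & 0x3u;
}
```

Under these assumptions `pack_length_table ()` yields 0x3a005555, so the 16-byte table shrinks to one 4-byte word, which is where the 12 bytes come from.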

zherczeg · 2015-12-31T05:02:06Z

This idea seems clever. However, I don't think it is little endian specific. The value is stored in a 32 bit register, and endianness is irrelevant for values in registers.
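
A small sketch of that point, with made-up helper names: extracting a field with shifts works on the value and gives the same answer on any host, while looking at the bytes in memory is the kind of access that does depend on byte order.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Extract the two-bit field at position index using shifts only.
   Shifts operate on the value, so byte order does not matter. */
static uint32_t
field_by_shift (uint32_t word, unsigned index)
{
  assert (index < 16);
  return (word >> (2 * index)) & 0x3u;
}

/* Return the byte stored at the lowest address of the word.
   This result does depend on the host byte order. */
static uint8_t
lowest_address_byte (uint32_t word)
{
  uint8_t first;

  memcpy (&first, &word, 1);
  return first;
}
```

`field_by_shift (0x3a005555, 14)` is 3 on every host, whereas `lowest_address_byte (0x3a005555)` is 0x55 on a little-endian machine and 0x3a on a big-endian one.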

egavrin · 2015-12-31T08:52:59Z

RPi2:

Benchmark	RSS (+ is better)	Perf (+ is better)
3d-cube.js	112-> 112 (0.000)	3.0688->3.0816 (-0.417)
access-binary-trees.js	80-> 80 (0.000)	1.876->1.8768 (-0.043)
access-fannkuch.js	40-> 40 (0.000)	9.0992->9.1248 (-0.281)
access-nbody.js	48-> 48 (0.000)	4.184->4.1744 (0.229)
bitops-3bit-bits-in-byte.js	32-> 32 (0.000)	2.356->2.3624 (-0.272)
bitops-bits-in-byte.js	32-> 32 (0.000)	3.1264->3.1392 (-0.409)
bitops-bitwise-and.js	28-> 28 (0.000)	3.5096->3.4992 (0.296)
bitops-nsieve-bits.js	156-> 156 (0.000)	27.592->27.5952 (-0.012)
controlflow-recursive.js	236-> 236 (0.000)	1.6152->1.6232 (-0.495)
crypto-aes.js	120-> 120 (0.000)	5.4672-> 5.484 (-0.307)
crypto-md5.js	184-> 184 (0.000)	29.3912->29.3256 (0.223)
crypto-sha1.js	132-> 132 (0.000)	13.2592->13.2616 (-0.018)
date-format-xparb.js	72-> 72 (0.000)	1.6576->1.6568 (0.048)
math-cordic.js	40-> 40 (0.000)	3.388->3.3968 (-0.260)
math-partial-sums.js	36-> 36 (0.000)	1.9776->1.9632 (0.728)
math-spectral-norm.js	40-> 40 (0.000)	2.1384->2.1224 (0.748)
string-fasta.js	48-> 48 (0.000)	5.4864->5.4816 (0.087)
Geometric mean:	RSS reduction: 0%	Speed up: -0.008%

ruben-ayrapetyan · 2015-12-31T11:26:59Z

jerry-core/lit/lit-strings.cpp

@@ -757,6 +757,17 @@ lit_utf8_string_code_unit_at (const lit_utf8_byte_t *utf8_buf_p, /**< utf-8 stri
  return code_unit;
 } /* lit_utf8_string_code_unit_at */

+/* CESU-8 number of bytes occupied lookup table */
+#ifndef __LITTLE_ENDIAN
+const __attribute__ ((aligned (CESU_8_TABLE_MEM_ALIGNMENT))) lit_utf8_byte_t table[]


Why do we need the alignment attribute?

huxinx · 2016-01-03T09:28:48Z

@zherczeg , yes, you are right. Updated.

huxinx · 2016-01-06T09:14:46Z

On x86, most of the performance gain comes from inlining lit_get_unicode_char_size_by_utf8_first_byte().
Even if I only force lit_get_unicode_char_size_by_utf8_first_byte to be inlined,

--- a/jerry-core/lit/lit-strings.cpp
+++ b/jerry-core/lit/lit-strings.cpp
@@ -762,7 +762,7 @@ lit_utf8_string_code_unit_at (const lit_utf8_byte_t *utf8_buf_p, /**< utf-8 stri

-lit_utf8_size_t
+lit_utf8_size_t __attr_always_inline___
 lit_get_unicode_char_size_by_utf8_first_byte (const lit_utf8_byte_t first_byte) /**< buffer with characters */

the performance is still pretty good.

OS, ubuntu 15.04, 32 bit
CPU, Intel(R) Celeron(R) CPU N2820 @ 2.13GHz, 2 core

Benchmark	RSS (+ is better)	Perf (+ is better)
3d-cube.js	120-> 116 (3.333)	1.1-> 1.1 (0.000)
access-binary-trees.js	88-> 88 (0.000)	0.656-> 0.676 (-3.049)
access-fannkuch.js	44-> 44 (0.000)	3.52-> 3.54 (-0.568)
bitops-3bit-bits-in-byte.js	36-> 36 (0.000)	0.912-> 0.912 (0.000)
bitops-bits-in-byte.js	36-> 32 (11.111)	1.196-> 1.172 (2.007)
bitops-bitwise-and.js	40-> 40 (0.000)	1.276-> 1.26 (1.254)
controlflow-recursive.js	244-> 244 (0.000)	0.56-> 0.564 (-0.714)
crypto-aes.js	128-> 128 (0.000)	2.312-> 2.304 (0.346)
crypto-md5.js	192-> 192 (0.000)	12.34-> 9.944 (19.417)
crypto-sha1.js	140-> 140 (0.000)	5.528-> 4.548 (17.728)
date-format-xparb.js	76-> 76 (0.000)	0.648-> 0.632 (2.469)
math-cordic.js	40-> 40 (0.000)	1.32-> 1.304 (1.212)
math-spectral-norm.js	44-> 44 (0.000)	0.8-> 0.792 (1.000)
string-base64.js	168-> 168 (0.000)	97.42->82.872 (14.933)
string-fasta.js	56-> 56 (0.000)	2.272-> 2.164 (4.754)
Geometric mean:	RSS reduction: 1.0061%	Speed up: 4.3189%

Wed Jan 6 11:57:36 EST 2016

Maybe lit_get_unicode_char_size_by_utf8_first_byte() is not a bottleneck on ARM?

@egavrin, did you turn LTO on for RPi2?
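
As an illustration of why forcing the inline in the diff above can matter, here is a sketch of a helper that sits in a per-character loop. Everything in it is an assumption: `SKETCH_ALWAYS_INLINE` stands in for the project's macro, the branch-based body is read off the "original version" disassembly posted later in this thread, and `sketch_count_chars` is a made-up caller.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the project's always-inline macro (GCC/Clang). */
#define SKETCH_ALWAYS_INLINE __attribute__ ((always_inline))

/* Branch-based size lookup for the first byte of a sequence. */
static inline uint32_t SKETCH_ALWAYS_INLINE
sketch_char_size_by_first_byte (uint8_t first_byte)
{
  if ((first_byte & 0x80) == 0)
  {
    return 1;
  }

  return ((first_byte & 0xE0) == 0xC0) ? 2 : 3;
}

/* Made-up hot loop: hop from first byte to first byte and count sequences.
   With the helper inlined, there is no call and return per character. */
static size_t
sketch_count_chars (const uint8_t *buf_p, size_t size)
{
  size_t count = 0;
  size_t offset = 0;

  assert (buf_p != NULL || size == 0);

  while (offset < size)
  {
    offset += sketch_char_size_by_first_byte (buf_p[offset]);
    count++;
  }

  return count;
}
```

For the six bytes 0x61, 0xC3, 0xA9, 0xE2, 0x82, 0xAC the loop visits three first bytes and returns 3.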

zherczeg · 2016-01-06T10:27:34Z

It is strange that this change improves x86 by 4% but is negligible on ARM. Regardless, I like this optimization, and I have no objection to landing it. However, it would be good to figure out why the difference is so big. Any ideas?

huxinx · 2016-01-08T05:06:49Z

on Raspberry Pi 2

Just inline lit_get_unicode_char_size_by_utf8_first_byte()

--- a/jerry-core/lit/lit-strings.cpp
+++ b/jerry-core/lit/lit-strings.cpp
@@ -762,7 +762,7 @@ lit_utf8_string_code_unit_at (const lit_utf8_byte_t *utf8_buf_p, /**< utf-8 stri

-lit_utf8_size_t
+lit_utf8_size_t __attr_always_inline___
 lit_get_unicode_char_size_by_utf8_first_byte (const lit_utf8_byte_t first_byte) /**< buffer with characters */

Benchmark	RSS (+ is better)	Perf (+ is better)
3d-cube.js	132-> 132 (0.000)	3-> 3.02 (-0.667)
access-binary-trees.js	104-> 104 (0.000)	1.82667->1.84333 (-0.912)
access-fannkuch.js	60-> 60 (0.000)	9.05667-> 9.15 (-1.031)
access-nbody.js	68-> 68 (0.000)	4.07667-> 4.15 (-1.799)
bitops-3bit-bits-in-byte.js	52-> 52 (0.000)	2.25->2.26333 (-0.592)
bitops-bits-in-byte.js	52-> 52 (0.000)	3.05333-> 3.06 (-0.218)
bitops-bitwise-and.js	52-> 52 (0.000)	3.51333->3.43667 (2.182)
bitops-nsieve-bits.js	172-> 172 (0.000)	27.94->28.0367 (-0.346)
controlflow-recursive.js	240-> 240 (0.000)	1.54667->1.56333 (-1.077)
crypto-aes.js	140-> 140 (0.000)	5.98667->5.82667 (2.673)
crypto-md5.js	204-> 204 (0.000)	30.86->24.1933 (21.603)
crypto-sha1.js	152-> 152 (0.000)	13.8367->11.1367 (19.513)
date-format-xparb.js	96-> 96 (0.000)	1.65333-> 1.66 (-0.403)
math-cordic.js	60-> 60 (0.000)	3.31-> 3.36 (-1.511)
math-partial-sums.js	52-> 52 (0.000)	1.90333-> 1.95 (-2.452)
math-spectral-norm.js	56-> 56 (0.000)	2.06->2.07333 (-0.647)
string-base64.js	184-> 184 (0.000)	261.103->220.25 (15.646)
string-fasta.js	68-> 68 (0.000)	5.68->5.53667 (2.523)
Geometric mean:	RSS reduction: 0%	Speed up: 3.2224%

ARM benefits from inline too.

huxinx · 2016-01-08T05:20:29Z

Raspberry Pi 2

Not inline, use bit vector version

lit_utf8_size_t __attribute__ ((noinline))
lit_get_unicode_char_size_by_utf8_first_byte (const lit_utf8_byte_t first_byte) /**< buffer with characters */
{
  const uint32_t cesu_8_store = 0x3a005555;
  int shift = (first_byte >> 4) << 1;

  return (cesu_8_store >> shift) & 0x3;
} /* lit_get_unicode_char_size_by_utf8_first_byte */

Benchmark	RSS (+ is better)	Perf (+ is better)
crypto-aes.js	140-> 140 (0.000)	6.35-> 6.3 (0.787)
crypto-md5.js	204-> 204 (0.000)	30.85-> 35.6 (-15.397)
crypto-sha1.js	152-> 152 (0.000)	13.8-> 15.74 (-14.058)
math-cordic.js	60-> 60 (0.000)	3.31-> 3.28 (0.906)
math-partial-sums.js	52-> 52 (0.000)	1.89-> 1.89 (0.000)
string-base64.js	184-> 184 (0.000)	263.09->293.75 (-11.654)
string-fasta.js	68-> 68 (0.000)	5.69-> 5.82 (-2.285)
Geometric mean:	RSS reduction: 0%	Speed up: -5.738%

This shows that the bit vector hurts ARM performance a lot.

Here is the objdump of the bit vector version:

0000dec4 <_Z44lit_get_unicode_char_size_by_utf8_first_byteh.9439>:
dec4: 4b03 ldr r3, [pc, #12] ; (ded4 <_Z44lit_get_unicode_char_size_by_utf8_first_byteh.9439+0x10>)
dec6: 0900 lsrs r0, r0, #4
dec8: 0040 lsls r0, r0, #1
deca: fa43 f000 asr.w r0, r3, r0
dece: f000 0003 and.w r0, r0, #3
ded2: 4770 bx lr
ded4: 3a005555 .word 0x3a005555

This is the original version,
0000dec4 <_Z44lit_get_unicode_char_size_by_utf8_first_byteh.9439>:
dec4: 0603 lsls r3, r0, #24
dec6: d506 bpl.n ded6 <_Z44lit_get_unicode_char_size_by_utf8_first_byteh.9439+0x12>
dec8: f000 00e0 and.w r0, r0, #224 ; 0xe0
decc: 28c0 cmp r0, #192 ; 0xc0
dece: bf14 ite ne
ded0: 2003 movne r0, #3
ded2: 2002 moveq r0, #2
ded4: 4770 bx lr
ded6: 2001 movs r0, #1
ded8: 4770 bx lr

I am not familiar with ARM. Does this ldr hurt performance?
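
To make the two listings easier to compare, here is one reading of them back into C. It is a sketch based only on the disassembly above, with made-up function names, and not the project's source.

```c
#include <stdint.h>

/* Reading of the "original version" listing. */
static uint32_t
size_with_branches (uint8_t first_byte)
{
  if ((first_byte & 0x80) == 0) /* lsls #24 / bpl: bit 7 clear */
  {
    return 1;
  }

  return ((first_byte & 0xE0) == 0xC0) ? 2 : 3; /* and / cmp / ite */
}

/* Reading of the bit vector listing. */
static uint32_t
size_with_bit_vector (uint8_t first_byte)
{
  const uint32_t cesu_8_store = 0x3a005555u;          /* ldr from the literal pool */
  uint32_t shift = (uint32_t) (first_byte >> 4) << 1; /* lsrs #4 / lsls #1 */

  return (cesu_8_store >> shift) & 0x3u;              /* shift / and #3 */
}

/* Count the first-byte values on which the two readings disagree. */
static unsigned
count_disagreements (void)
{
  unsigned count = 0;

  for (unsigned byte = 0; byte < 256; byte++)
  {
    if (size_with_branches ((uint8_t) byte) != size_with_bit_vector ((uint8_t) byte))
    {
      count++;
    }
  }

  return count;
}
```

Under this reading the two versions agree for every byte that can start a CESU-8 sequence (0x00-0x7F and 0xC0-0xEF). They differ only for 0x80-0xBF and 0xF0-0xFF, where the branch version returns 3 and the bit vector returns 0.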

zherczeg · 2016-01-08T07:03:16Z

Yes, since it is in the constant pool, and that load probably wastes a whole cache line.

I suspect this is an aggressive "size" optimization, which hurts performance too much. ARM could produce this constant with two 4-byte instructions (8 bytes altogether), but it only takes 6 bytes this way. However, I don't think this optimization is worth it at all.

Does this ASM block help?

uint32_t cesu_8_store;
asm ( " movw %0, 0x5555 \n movt %0, 0x3a00" : "=r" (cesu_8_store));
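
For readers who do not know the two instructions: MOVW writes a 16-bit immediate into the low half of a register and clears the top half, and MOVT then writes a 16-bit immediate into the top half while keeping the low half. A portable model of that pair, with made-up function names, shows how the constant is assembled without a memory load:

```c
#include <assert.h>
#include <stdint.h>

/* Model of MOVW: low half = imm16, top half = 0. */
static uint32_t
model_movw (uint32_t imm16)
{
  assert (imm16 <= 0xFFFFu);
  return imm16;
}

/* Model of MOVT: top half = imm16, low half unchanged. */
static uint32_t
model_movt (uint32_t reg, uint32_t imm16)
{
  assert (imm16 <= 0xFFFFu);
  return (reg & 0x0000FFFFu) | (imm16 << 16);
}
```

`model_movt (model_movw (0x5555), 0x3a00)` gives 0x3a005555, the same word the ldr fetches from the constant pool.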

huxinx · 2016-01-08T08:39:37Z

uint32_t cesu_8_store;
asm ( " movw %0, 0x5555 \n movt %0, 0x3a00" : "=r" (cesu_8_store));

does not help.

huxinx · 2016-01-08T08:55:57Z

How about using __attr_always_inline___ everywhere, and using the bit vector only for x86?
ARM benefits from inlining;
x86 gets better performance with inlining plus the bit vector, which removes the if/else branches.

lit_get_unicode_char_size_by_utf8_first_byte is a small function and is not called from many places.
After inlining, the binary size is still 568k.

Raspberry Pi 2, run perf.sh 6 times

Benchmark	RSS (+ is better)	Perf (+ is better)
3d-cube.js	132-> 132 (0.000)	3.16333->3.17333 (-0.316)
access-binary-trees.js	104-> 104 (0.000)	1.92667->1.94333 (-0.865)
access-fannkuch.js	60-> 60 (0.000)	9.57667-> 9.64 (-0.661)
access-nbody.js	68-> 68 (0.000)	4.31667-> 4.39 (-1.699)
bitops-3bit-bits-in-byte.js	52-> 52 (0.000)	2.37->2.37667 (-0.281)
bitops-bits-in-byte.js	52-> 52 (0.000)	3.2->3.22333 (-0.729)
bitops-bitwise-and.js	52-> 52 (0.000)	3.51333-> 3.45 (1.803)
bitops-nsieve-bits.js	172-> 172 (0.000)	27.9533-> 27.96 (-0.024)
controlflow-recursive.js	240-> 240 (0.000)	1.55->1.55667 (-0.430)
crypto-aes.js	140-> 140 (0.000)	5.97333->5.81333 (2.679)
crypto-md5.js	204-> 204 (0.000)	31.0433->24.1633 (22.163)
crypto-sha1.js	152-> 152 (0.000)	13.8933-> 11.19 (19.458)
date-format-xparb.js	96-> 96 (0.000)	1.65->1.66333 (-0.808)
math-cordic.js	60-> 60 (0.000)	3.30667->3.35333 (-1.411)
math-partial-sums.js	52-> 52 (0.000)	1.92-> 1.95 (-1.562)
math-spectral-norm.js	56-> 56 (0.000)	2.06-> 2.08 (-0.971)
string-base64.js	184-> 184 (0.000)	260.84->219.527 (15.838)
string-fasta.js	68-> 68 (0.000)	5.66->5.51667 (2.532)
Geometric mean:	RSS reduction: 0%	Speed up: 3.35%

Sun 22 Nov 21:36:59 UTC 2015
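
The proposal at the top of this comment (inline everywhere, bit vector only on x86) could be sketched roughly as below. The macro names, the architecture test, and the function name are all assumptions made for illustration. The branch fallback is read off the disassembly earlier in the thread.

```c
#include <stdint.h>

/* Hypothetical stand-in for the project's always-inline macro (GCC/Clang). */
#define SKETCH_ALWAYS_INLINE __attribute__ ((always_inline))

/* Assumed architecture test: pick the bit vector on x86 only. */
#if defined (__i386__) || defined (__x86_64__)
#define SKETCH_USE_BIT_VECTOR 1
#else
#define SKETCH_USE_BIT_VECTOR 0
#endif

static inline uint32_t SKETCH_ALWAYS_INLINE
sketch_first_byte_size (uint8_t first_byte)
{
#if SKETCH_USE_BIT_VECTOR
  /* Branch-free lookup: two bits per high-nibble value. */
  const uint32_t cesu_8_store = 0x3a005555u;
  uint32_t shift = (uint32_t) (first_byte >> 4) << 1;

  return (cesu_8_store >> shift) & 0x3u;
#else
  /* Branch-based lookup, which avoids the constant load on ARM. */
  if ((first_byte & 0x80) == 0)
  {
    return 1;
  }

  return ((first_byte & 0xE0) == 0xC0) ? 2 : 3;
#endif
}
```

Both paths return the same size for any byte that can start a sequence, so callers see no difference on well-formed input. The cost of this design is two code paths to maintain.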

sand1k · 2016-01-11T12:51:42Z

It is better just to inline lit_get_unicode_char_size_by_utf8_first_byte without introducing the bit vector.

sand1k · 2016-01-12T10:19:38Z

LGTM.

- inline JerryScript-DCO-1.0-Signed-off-by: Xin Hu [email protected]

galpeter · 2016-01-13T16:05:57Z

lgtm

ruben-ayrapetyan · 2016-01-13T16:14:12Z

Merged (7255d64)

egavrin added the enhancement label Dec 31, 2015

egavrin added this to the Engine optimization & enhancement milestone Dec 31, 2015

ruben-ayrapetyan reviewed Dec 31, 2015

huxinx force-pushed the cesu88 branch from 1e0356b to 217df05 on January 3, 2016 09:24

huxinx force-pushed the cesu88 branch from 217df05 to fac0585 on January 8, 2016 08:40

huxinx force-pushed the cesu88 branch from fac0585 to 4519e0c on January 12, 2016 02:23

lit_get_unicode_char_size_by_utf8_first_byte performance improvement

4519e0c

- inline JerryScript-DCO-1.0-Signed-off-by: Xin Hu [email protected]

ruben-ayrapetyan assigned zherczeg and galpeter and unassigned zherczeg Jan 13, 2016

galpeter removed their assignment Jan 13, 2016

ruben-ayrapetyan closed this Jan 13, 2016
