# Use heap allocated buffer instead of char collection in ecma_strings #678

Closed

@dbatyai (Member) commented Oct 20, 2015

Remove the char collection from ecma_strings and use a heap buffer instead, in which we also store the string's size and length.

Results (measured on rpi2):

| Benchmark | RSS: before -> after (% change, + is better) | Perf: before -> after (% change, + is better) |
|---|---|---|
| 3d-cube.js | 136 -> 136 (0) | 3.728 -> 3.71 (0.4828) |
| 3d-raytrace.js | 304 -> 264 (13.1579) | 5.918 -> 5.554 (6.1507) |
| access-binary-trees.js | 88 -> 92 (-4.545) | 2.72 -> 2.72 (0) |
| access-fannkuch.js | 52 -> 52 (0) | 10.03 -> 9.94 (0.8973) |
| access-nbody.js | 68 -> 68 (0) | 4.65 -> 4.63 (0.4301) |
| bitops-3bit-bits-in-byte.js | 40 -> 40 (0) | 3.18 -> 3.13 (1.5723) |
| bitops-bits-in-byte.js | 40 -> 40 (0) | 4.35 -> 4.36 (-0.23) |
| bitops-bitwise-and.js | 36 -> 36 (0) | 4.26 -> 4.26 (0) |
| controlflow-recursive.js | 224 -> 224 (0) | 3.23 -> 3.17 (1.8576) |
| crypto-aes.js | 156 -> 152 (2.5641) | 5.96 -> 5.71 (4.1946) |
| crypto-md5.js | 216 -> 212 (1.8519) | 33.57 -> 10.53 (68.6327) |
| crypto-sha1.js | 156 -> 152 (2.5641) | 15.1 -> 5.71 (62.1854) |
| date-format-xparb.js | 104 -> 96 (7.6923) | 2.03 -> 1.82 (10.3448) |
| math-cordic.js | 48 -> 48 (0) | 4.25 -> 4.23 (0.4706) |
| math-partial-sums.js | 40 -> 40 (0) | 2.55 -> 2.5 (1.9608) |
| math-spectral-norm.js | 52 -> 52 (0) | 3.22 -> 3.18 (1.2422) |
| string-base64.js | 192 -> 164 (14.5833) | 235.22 -> 38.6 (83.5898) |
| string-fasta.js | 60 -> 60 (0) | 5.54 -> 4.72 (14.8014) |
Geometric mean: RSS reduction 2.2246%, speed-up 21.6495%.

@egavrin (Contributor) commented Oct 20, 2015

3d-raytrace.js fails for me. Please check your measurements.

```
egavrin@ubuntu:~/jr$ ./jr-native-str ../jsperf/sunspider-1.0.2/3d-raytrace.js
egavrin@ubuntu:~/jr$ echo $?
1
```

Measurements on Raspberry Pi 2:

```
$ clear; setarch linux32 -R ../tools/run-perf-test.sh ./jr-master ./jr-native-str 5 5000 ../jsperf/sunspider-1.0.2/
```
| Benchmark | RSS: before -> after (% change, + is better) | Perf: before -> after (% change, + is better) |
|---|---|---|
| 3d-cube.js | 132 -> 136 (-3.030) | 3.676 -> 3.652 (0.653) |
| 3d-morph.js | | |
| 3d-raytrace.js | | |
| access-binary-trees.js | 88 -> 88 (0.000) | 2.712 -> 2.724 (-0.442) |
| access-fannkuch.js | 40 -> 40 (0.000) | 9.824 -> 9.772 (0.529) |
| access-nbody.js | 64 -> 64 (0.000) | 4.608 -> 4.604 (0.087) |
| access-nsieve.js | | |
| bitops-3bit-bits-in-byte.js | 36 -> 36 (0.000) | 3.272 -> 3.236 (1.100) |
| bitops-bits-in-byte.js | 36 -> 36 (0.000) | 4.448 -> 4.452 (-0.090) |
| bitops-bitwise-and.js | 32 -> 32 (0.000) | 4.288 -> 4.164 (2.892) |
| bitops-nsieve-bits.js | | |
| controlflow-recursive.js | 224 -> 224 (0.000) | 3.196 -> 3.172 (0.751) |
| crypto-aes.js | 156 -> 148 (5.128) | 5.928 -> 5.68 (4.184) |
| crypto-md5.js | 208 -> 200 (3.846) | 31.132 -> 10.072 (67.647) |
| crypto-sha1.js | 148 -> 148 (0.000) | 13.9 -> 5.54 (60.144) |
| date-format-tofte.js | | |
| date-format-xparb.js | 100 -> 88 (12.000) | 2.012 -> 1.808 (10.139) |
| math-cordic.js | 48 -> 48 (0.000) | 4.852 -> 4.836 (0.330) |
| math-partial-sums.js | 40 -> 40 (0.000) | 2.544 -> 2.616 (-2.830) |
| math-spectral-norm.js | 52 -> 52 (0.000) | 3.172 -> 3.176 (-0.126) |
| regexp-dna.js | | |
| string-base64.js | 208 -> 200 (3.846) | 216.588 -> 34.792 (83.936) |
| string-fasta.js | 56 -> 52 (7.143) | 5.52 -> 4.692 (15.000) |
| string-tagcloud.js | | |
| string-unpack-code.js | | |
| string-validate-input.js | | |
Geometric mean: RSS reduction 1.7676%, speed-up 21.9761%.

This patch greatly improves performance, nice work!
Allocating strings as contiguous native arrays increases fragmentation, though: the engine won't be able to allocate a new string if there is not enough contiguous heap space. Since our primary concern is reducing memory consumption and fragmentation, I'd propose to mix both approaches and implement strings on top of the recordset, which effectively manages non-contiguous memory (a code sketch follows the diagrams below).

Heap state before allocation of a string:

```
+---------------------------------------+
|-------|       |-------|       |-------|
|-------|       |-------|       |-------|
+---------------------------------------+
```

Current implementation (chunked char collection):

```
              +-----+-----+
          +---+     |     +---+
          |   +-----+-----+   |
         Fits.               Fits.
          |                   |
+---------v-------------------v---------+
|-------|       |-------|       |-------|
|-------|       |-------|       |-------|
+---------------------------------------+
```

PR's implementation (single contiguous buffer):

```
              +-----------+
          +---+           +---+
          |   +-----------+   |
    Does not fit.        Does not fit.
          |                   |
+---------v-------------------v---------+
|-------|       |-------|       |-------|
|-------|       |-------|       |-------|
+---------------------------------------+
```
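To make the trade-off concrete, here is a minimal sketch of recordset-style chunked storage (hypothetical names and sizes, not JerryScript's actual API): each allocation only needs one chunk-sized contiguous block, so the string can reuse small gaps that a single large buffer could not.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical chunked string: data is split across fixed-size chunks
 * linked by pointers, so no single large contiguous block is needed. */
#define CHUNK_DATA_SIZE 12

typedef struct chunk
{
  struct chunk *next;            /* next chunk of this string, or NULL */
  uint8_t data[CHUNK_DATA_SIZE]; /* a slice of the string's bytes */
} chunk_t;

/* Store 'size' bytes into a chain of chunks obtained from 'alloc_chunk'
 * (assumed to return NULL when no chunk-sized block is free). */
static chunk_t *
store_chunked (const uint8_t *bytes, size_t size, chunk_t *(*alloc_chunk) (void))
{
  chunk_t *head = NULL;
  chunk_t **tail = &head;

  while (size > 0)
  {
    chunk_t *c = alloc_chunk ();
    if (c == NULL)
    {
      return NULL; /* not even a single chunk-sized gap is available */
    }
    size_t part = size < CHUNK_DATA_SIZE ? size : CHUNK_DATA_SIZE;
    memcpy (c->data, bytes, part);
    c->next = NULL;
    *tail = c;
    tail = &c->next;
    bytes += part;
    size -= part;
  }
  return head;
}
```

The price is the per-chunk pointer overhead and the loss of O(1) access into the string's bytes, which is exactly the overhead this PR removes.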

@egavrin (Contributor) commented Oct 20, 2015

```js
a = [];

for (i = 0; i < 3000; i++)
{
   a[i] = i + '.';
}
```

Master: OK
PR: OUT_OF_MEMORY
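(A rough estimate of why this runs out of memory, assuming chunk usage scales linearly with the iteration count: the mem-stats in the next comment show a peak of 1386 64-byte heap chunks for 1000 such strings, i.e. about 87 KB; three times that is roughly 260 KB, which exceeds the entire 258048-byte heap.)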

@egavrin (Contributor) commented Oct 20, 2015

Mem stats for the following code:

```js
a = [];

for (i = 0; i < 1000; i++)
{
   a[i] = i + '.';
}
```
```
./jerry.master --mem-stats ./t.js ; echo $?
Heap stats:
   Heap size = 258048 bytes
   Chunk size = 64 bytes
   Allocated chunks count = 0
   Allocated = 0 bytes
   Waste = 0 bytes
   Peak allocated chunks count = 636
   Peak allocated = 40632 bytes
   Peak waste = 205 bytes

Pools stats:
  Chunk size: 8
   Pools: 0
   Allocated chunks: 0
   Free chunks: 0
   Peak pools: 631
   Peak allocated chunks: 5041
```

```
./jerry.new --mem-stats ./t.js ; echo $?
Heap stats:
   Heap size = 258048 bytes
   Chunk size = 64 bytes
   Allocated chunks count = 0
   Allocated = 0 bytes
   Waste = 0 bytes
   Peak allocated chunks count = 1386
   Peak allocated = 33469 bytes
   Peak waste = 55235 bytes

Pools stats:
  Chunk size: 8
   Pools: 0
   Allocated chunks: 0
   Free chunks: 0
   Peak pools: 380
   Peak allocated chunks: 3039
```

The most important difference is the following: peak waste now exceeds the peak allocated memory.

```
                                        master -> new
   Peak allocated chunks count =           636 -> 1386
   Peak allocated =                40632 bytes -> 33469 bytes
   Peak waste =                      205 bytes -> 55235 bytes
```
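(Per string, the numbers add up, assuming each of the ~1000 short strings from the loop occupies one 64-byte chunk: a value like "123." needs only a few bytes of character data plus a small header, so roughly 55 of the 64 bytes in the chunk remain unused, and 1000 × ~55 bytes ≈ 55 KB matches the 55235-byte peak waste.)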

@ruben-ayrapetyan (Contributor) commented

@dbatyai, could you please describe the causes of the performance and memory consumption changes, and also try to figure out ways to prevent fragmentation and reduce the amount of wasted space?

Maybe improving the "literal storage" and switching to it for storing strings could solve these issues. Could you please share your opinion on this?

@dbatyai (Member, Author) commented Oct 27, 2015

Most of the performance gain comes from storing the length of strings, so we no longer have to iterate over them every time the length is needed.
Using contiguous heap space for storing strings also brought a notable improvement, since the previous char collection had significant overhead in both memory usage (2 bytes of each 8-byte chunk were used by a compressed pointer) and performance (copying characters into and out of the collection).
The least significant part is that string concatenation no longer requires intermediate buffers; instead, it can be done directly inside the space of the new string.
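For illustration, a minimal sketch of such a heap-buffer representation (hypothetical names and field widths, not the PR's actual code): size and length sit in a small header directly in front of the contiguous character data, so both are O(1) reads, and concatenation writes straight into the new buffer.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical heap-buffer string: one allocation holding a small
 * header (byte size and character length) followed by the bytes. */
typedef struct
{
  uint16_t size;   /* size in bytes (an encoded character may take several bytes) */
  uint16_t length; /* length in characters, precomputed once at creation */
  uint8_t data[];  /* character data, contiguous */
} heap_string_t;

/* Concatenation writes directly into the new buffer: no intermediate copy. */
static heap_string_t *
concat (const heap_string_t *a, const heap_string_t *b)
{
  heap_string_t *r = malloc (sizeof (heap_string_t) + a->size + b->size);
  if (r == NULL)
  {
    return NULL;
  }
  r->size = (uint16_t) (a->size + b->size);
  r->length = (uint16_t) (a->length + b->length);
  memcpy (r->data, a->data, a->size);
  memcpy (r->data + a->size, b->data, b->size);
  return r;
}
```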

The cause of the large amount of wasted memory is that blocks are allocated as 64-byte chunks, and since short strings don't need that much, the rest remains unused.
Using the literal storage could solve this issue, as it manages this unused space pretty well. The problem is that it currently does not support freeing allocated records, and as far as I can tell, adding that would require serious changes. Also, we could no longer treat strings as contiguous data.

Adding a special case for short strings, and only allocating a pool chunk if they fit into it, could help in the test case mentioned by @egavrin, but there would still be other cases where the wasted space would be large.
On the other hand, using smaller heap chunks would reduce the waste effectively, but that could affect other parts of the engine as well. (A little radical, but with 16-byte heap chunks the memory allocated with the patch applied was less than on current master, even with the additional waste.)

What do you think?

@ruben-ayrapetyan (Contributor) commented

> Most of the performance gain comes from storing the length of strings

The length of the string is already in `unit_number` (see `ecma_new_chars_collection`).

```c
/**
 * Description of a collection's header.
 */
typedef struct
{
  /** Number of elements in the collection */
  ecma_length_t unit_number;

  /** Compressed pointer to first chunk with collection's data */
  mem_cpointer_t first_chunk_cp;

  /** Compressed pointer to last chunk with collection's data */
  mem_cpointer_t last_chunk_cp;
} ecma_collection_header_t;
```

@dbatyai (Member, Author) commented Oct 27, 2015

`unit_number` stored the size of the string in bytes; see ecma-helpers-string.cpp:73.
The actual length, as in the number of Unicode characters, was not stored previously.
Instead, `ecma_get_chars_collection_length` (ecma-helpers-string.cpp:113) was called every time it was requested.
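For contrast, a sketch of what computing the length on demand costs (assuming a UTF-8-style encoding; this is not the engine's actual routine): every query has to scan all bytes of the string.

```c
#include <stddef.h>
#include <stdint.h>

/* Count characters by scanning the bytes each time the length is
 * requested: O(n) per query, which storing the length up front avoids.
 * In UTF-8, continuation bytes look like 10xxxxxx; every other byte
 * starts a new character. */
static uint32_t
count_characters (const uint8_t *bytes, size_t size)
{
  uint32_t length = 0;
  for (size_t i = 0; i < size; i++)
  {
    if ((bytes[i] & 0xC0) != 0x80)
    {
      length++;
    }
  }
  return length;
}
```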

@ruben-ayrapetyan (Contributor) commented

@dbatyai, I see. Thanks. Could we add the length to the beginning of a string's storage?
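Presumably that would look something like the following (a sketch only; the names and the stand-in typedef are hypothetical): the precomputed character length would occupy the first bytes of the collection's data area, ahead of the string bytes themselves.

```c
#include <stdint.h>

typedef uint32_t ecma_length_t; /* stand-in for the engine's own typedef */

/* Hypothetical layout: the character length is stored at the start of
 * the collection's data, so reading it no longer requires a scan. */
typedef struct
{
  ecma_length_t char_length; /* number of characters, computed once on creation */
  /* string bytes follow, continuing into the chained chunks */
} ecma_collection_prefix_t;
```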

@egavrin (Contributor) commented Oct 27, 2015

> The problem is that it currently does not support freeing allocated records, and as far as I can tell, this would require serious changes.

I would say that is a big but needed change, regardless of the string optimization.

> with 16 byte heap chunks the allocated memory with the patch applied was less than the current master

Could you share the exact measurements for SunSpider, using both run-perf-test and mem-stats?
My current understanding is that using 16-byte chunks would decrease performance, so we could instead try increasing the chunk size to 128 bytes.

egavrin assigned dbatyai and unassigned egavrin on Oct 28, 2015
@LaszloLango (Contributor) commented

@dbatyai, what about this PR? The initial results were really promising. Could you update your patch?

@dbatyai (Member, Author) commented Feb 11, 2016

Closing, as this is no longer relevant.

Labels: enhancement