+= to concat strings is heavily underperforming #67

StringManolo · 2021-05-26T20:26:28Z

Using concat method or array push this operation takes around 0.015 - 0.030 seconds.

const str1 = "abc";
const str2 = "defghijklmnñopqrstuvwxyz";

const addStrings = times => {
  let str = "";
  for(let i = 0; i < times; ++i) {
    str += str1 + str2;
  }
  return str;
}

addStrings(30000); // 8 seconds

This code in node takes 0.013 seconds to run.
In quickjs takes 8.672 seconds

If you want to test, you can run next command:

curl https://raw.githubusercontent.com/StringManolo/javascript-performance-study/main/joinStrings/joinStrings.js -O && node joinStrings.js && qjs joinStrings.js

Notice qjs takes a few seconds to run the code.

The text was updated successfully, but these errors were encountered:

ibraheemdev · 2021-07-18T18:18:43Z

I would guess v8 is optimizing the code and executing it at compile-time (the same program with a console.log is similarly fast). Quickjs runs it on my machine in 1.071 seconds. An equivalent ruby program is executed in 1.209 seconds, so I think it's more a testament to how good v8 is than quickjs being slow - execution speeds are comparable to other non-jit interpreters.

fstirlitz · 2021-08-11T11:40:42Z

So this is kind of interesting. At first I thought this is because QuickJS lacks the unique-reference in-place concatenation optimisation. As it turns out, not quite: the optimisation is already there, but a number of implementation flaws (and some design flaws) in QuickJS prevent it from being actually useful, at least in interpreted code.

The problems are as follows:

No opcode for in-place concatenation of lexically-scoped variables

QuickJS compiles JavaScript functions into bytecode, which is then interpreted by a stack-based virtual machine. The central performance bottleneck is obviously result += str1 + str2;. This is the unoptimised bytecode this statement compiles to:
```
;;     result += str1 + str2;

   42:  get_loc_check 0: result
        get_var str1
        get_var str2
        add
        add
        dup
        put_loc_check 0: result
        drop
```
The get_loc_check opcode puts the value of a lexically-scoped variable on the stack, while put_loc_check removes a value from the stack and puts it in a lexically-scoped variable. The equivalents for function-scoped (var) variables are get_loc and put_loc. Values are generally reference-counted (with cycle detection): putting a value on the stack increments its reference count, removing it decrements it. As such, the string cannot be concatenated in-place, because there are two references to it: one from the stack, the other from the local variable. We would like to have an opcode which can update the variable in place, without incrementing the reference count as an intermediate step. Such an opcode exists for var variables: its name is add_loc. There is no equivalent add_loc_check for let variables.
Simplistic optimiser performs peephole analysis only

We can avoid the above flaw by declaring result with the var keyword. But all this does is change get_loc_check and put_loc_check into get_loc and put_loc. An optimisation step is necessary to transform the opcode stream to use add_loc.

QuickJS does, in fact, have an optimiser. It is able to simplify the dup / put_loc / drop sequence into just a put_loc. To be able to use add_loc, in full generality it ought to match on get_loc (X) / {Y} / add / put_loc (X) and transform it to {Y} / add_loc (X), for (Y) an arbitrary sequence of opcodes that touches neither the local variable (X) itself nor the stack slot in which its value is put by get_loc. The optimiser is nowhere near that sophisticated; it can only perform bounded-lookahead pattern matching against the opcode stream. There is no data flow analysis, just dumb substitution of fixed opcode patterns.

We can avoid this flaw by putting the concatenation of str1 and str2 in an intermediate variable. Adding a variable to another variable generates an opcode sequence get_loc (X) / get_loc (Y) / add / put_loc (X) that QuickJS optimises into get_loc (Y) / add_loc (X):
```
;;     var tmp = str1 + str2;

        get_var str1
        get_var str2
        add
        put_loc2 2: tmp

;;     result += tmp;

        get_loc2 2: tmp
        add_loc 0: result
```
Spurious internal reference-count increments

Using the add_loc opcode avoids putting the string on the stack, so you would think that it would avoid incrementing the reference count. Except that the opcode’s implementation internally increments the reference count anyway:
```
                    op1 = JS_ConcatString(ctx, JS_DupValue(ctx, *pv), op1);
                    if (JS_IsException(op1))
                        goto exception;
                    set_value(ctx, pv, op1);
```
While the call to JS_ConcatString is active, there are two references to the string: one from the temporary (created by JS_DupValue), the other from the variable. JS_ConcatString drops the former, set_value drops the latter.

We can address this with a kludge:
```
                    op1 = JS_ConcatString(ctx, *pv, op1);
                    if (JS_IsException(op1))
                        goto exception;
                    *pv = op1;
```
Kind of ugly, but it gets the job done. Now the reference count stays at 1 the whole time.
No memory allocation amortisation

Now the punchline: even if the reference count is 1, this may still not be enough to reuse the allocation of one of the strings.
```
    str = js_malloc_rt(rt, sizeof(JSString) + (max_len << is_wide_char) + 1 - is_wide_char);
```
When allocating memory for strings, QuickJS requests no more space than required to store the string. The allocator is allowed to provide more storage; when QuickJS detects this, the allocation is reused. But far too often, there simply isn’t enough room, and the opportunity never arises. This triggers the slow path where a new string is allocated and the contents are copied manually: not even js_realloc is used to reuse the old object’s data.

By adding js_realloc on top of the above patches, I managed to get execution time to drop on my machine from nearly 6 seconds to about 300 ms. (Your performance may vary: I compile QuickJS either with TinyCC or with ASan and UBSan enabled, either of which may slow things down.)

bellard · 2021-08-12T08:28:51Z

Good analysis. Our conclusion at the time was that it was not worth continuing on this track because there will still be corner cases which will give a large slowdown. Instead, a rope based string representation may be a simpler solution.

…

On 8/11/21 1:40 PM, Felix S. wrote: So this is kind of interesting. At first I thought this is because QuickJS lacks the unique-reference in-place concatenation optimisation <https://blog.ganssle.io/articles/2019/11/string-concat.html>. As it turns out, not quite: the optimisation is already there, but a number of implementation flaws (and some design flaws) in QuickJS prevent it from being actually useful, at least in interpreted code. The problems are as follows: * *No opcode for in-place concatenation of lexically-scoped variables* QuickJS compiles JavaScript functions into bytecode, which is then interpreted by a stack-based virtual machine. The central performance bottleneck is obviously |result += str1 + str2;|. This is the unoptimised bytecode this statement compiles to: |;; result += str1 + str2; 42: get_loc_check 0: result get_var str1 get_var str2 add add dup put_loc_check 0: result drop | The |get_loc_check| opcode puts the value of a lexically-scoped variable on the stack, while |put_loc_check| removes a value from the stack and puts it in a lexically-scoped variable. The equivalents for function-scoped (|var|) variables are |get_loc| and |put_loc|. Values are generally reference-counted (with cycle detection): putting a value on the stack increments its reference count, removing it decrements it. As such, the string cannot be concatenated in-place, because there are two references to it: one from the stack, the other from the local variable. We would like to have an opcode which can update the variable in place, without incrementing the reference count as an intermediate step. Such an opcode exists for |var| variables: its name is |add_loc|. There is no equivalent |add_loc_check| for |let| variables. * *Simplistic optimiser performs peephole analysis only* We can avoid the above flaw by declaring |result| with the |var| keyword. But all this does is change |get_loc_check| and |put_loc_check| into |get_loc| and |put_loc|. An optimisation step is necessary to transform the opcode stream to use |add_loc|. QuickJS does, in fact, have an optimiser. It is able to simplify the |dup| / |put_loc| / |drop| sequence into just a |put_loc|. To be able to use |add_loc|, in full generality it ought to match on |get_loc| (X) / {Y} / |add| / |put_loc| (X) and transform it to {Y} / |add_loc| (X), for (Y) an arbitrary sequence of opcodes that touches neither the local variable (X) itself nor the stack slot in which its value is put by |get_loc|. The optimiser is nowhere near that sophisticated; it can only perform bounded-lookahead pattern matching against the opcode stream. There is no data flow analysis, just dumb substitution of fixed opcode patterns. We can avoid this flaw by putting the concatenation of |str1| and |str2| in an intermediate variable. Adding a variable to another variable generates an opcode sequence |get_loc| (X) / |get_loc| (Y) / |add| / |put_loc| (X) that QuickJS optimises into |get_loc| (Y) / |add_loc| (X): |;; var tmp = str1 + str2; get_var str1 get_var str2 add put_loc2 2: tmp ;; result += tmp; get_loc2 2: tmp add_loc 0: result | * *Spurious /internal/ reference-count increments* Using the |add_loc| opcode avoids putting the string on the stack, so you would think that it would avoid incrementing the reference count. Except that the opcode’s implementation internally increments the reference count anyway: op1 = JS_ConcatString(ctx, JS_DupValue(ctx, *pv), op1); if (JS_IsException(op1)) goto exception; set_value(ctx, pv, op1); While the call to |JS_ConcatString| is active, there are two references to the string: one from the temporary (created by |JS_DupValue|), the other from the variable. |JS_ConcatString| drops the former, |set_value| drops the latter. We can address this with a kludge: op1 = JS_ConcatString(ctx, *pv, op1); if (JS_IsException(op1)) goto exception; *pv = op1; Kind of ugly, but it gets the job done. Now the reference count stays at 1 the whole time. * *No memory allocation amortisation* Now the punchline: even if the reference count is 1, this may still not be enough to reuse the allocation of one of the strings. str = js_malloc_rt(rt,sizeof(JSString) + (max_len << is_wide_char) +1 - is_wide_char); When allocating memory for strings, QuickJS requests no more space than required to store the string. The allocator is allowed to provide more storage; when QuickJS detects this, the allocation is reused. But far too often, there simply isn’t enough room, and the opportunity never arises. This triggers the slow path where a new string is allocated and the contents are copied manually: not even |js_realloc| is used to reuse the old object’s data. By adding |js_realloc| on top of the above patches, I managed to get execution time to drop on my machine from nearly 6 seconds to about 300 ms. (Your performance may vary: I compile QuickJS either with TinyCC or with ASan and UBSan enabled, either of which may slow things down.) — You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub <#67 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/ABRQQIERDTW54YJMARWGZELT4JOUNANCNFSM45SZ7GCA>. Triage notifications on the go with GitHub Mobile for iOS <https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675> or Android <https://play.google.com/store/apps/details?id=com.github.android&utm_campaign=notification-email>.

SGrondin · 2022-09-19T13:52:53Z

I've been in awe at the excellent performance of quickjs, I really appreciate when performance emerges from simplicity and lightness, as opposed to complex layers upon layers of JIT and runtime analysis.

The only issue I'm running into is the fact that a good number of JS programmers assume that string concatenation is optimized. An example of this is the official Pug parser. A 1,000-line Pug file takes over 55 seconds to parse in quickjs versus ~100 ms in Node/V8.

Consider this a "+1" for this issue, I suppose! I love quickjs and I'd love to be able to trust it will work on any codebase, but until someone has time to implement the rope-based string representation I'll have to restrict my usage to the programs that won't exhibit problematic slowdowns. Thanks again for quickjs!

mingodad · 2023-01-06T14:04:09Z

Doing tests I found that using += individually is faster see changes bellow and output:

/usr/bin/time qjs  joinStrings.js 

Config:
  - Running 30000 times

Results:
strPlusStr -> 3732ms
strPlusStr2 -> 1689ms
strToArrayPush -> 8ms
strPush -> 5ms
strConcatStr -> 8ms

2.41user 3.03system 0:05.44elapsed 99%CPU (0avgtext+0avgdata 6880maxresident)k
0inputs+0outputs (0major+2113473minor)pagefaults 0swaps

joinStrings.js:

const str1 = "abc";
const str2 = "defghijklmnñopqrstuvwxyz";


const strPlusStr = times => {
  let str = "";
  for(let i = 0; i < times; ++i) {
    str += str1 +  str2;
  }
  return str;
}

const strPlusStr2 = times => {
  let str = "";
  for(let i = 0; i < times; ++i) {
    str += str1;
    str += str2;
  }
  return str;
}

const strToArrayPush = times => {
  let str = [];
  for(let i = 0; i < times; ++i) {
    str.push(str1);
    str.push(str2);
  }
  return str.join("");
}

const strPush = times => {
  let str = [];
  for(let i = 0; i < times; ++i) {
    str.push(str1, str2);
  }
  return str.join("");
}

const strConcatStr = times => {
  let str = [];;
  for (let i = 0; i < times; ++i) {
    str.push(str1.concat(str2));
  }
  return str.join("");
}


const _ = console.log;
//const _ = alert; // For Android Browser Testing


const test = times => {
  let time;
  let results = [];

  time = new Date();
  strPlusStr(times);
  results.push(new Date() - time);

  time = new Date();
  strPlusStr2(times);
  results.push(new Date() - time);

  time = new Date();
  strToArrayPush(times);
  results.push(new Date() - time);

  time = new Date();
  strPush(times);
  results.push(new Date() - time);

  time = new Date();
  strConcatStr(times);
  results.push(new Date() - time);

  _(`
Config:
  - Running ${times} times

Results:
strPlusStr -> ${results[0]}ms
strPlusStr2 -> ${results[1]}ms
strToArrayPush -> ${results[2]}ms
strPush -> ${results[3]}ms
strConcatStr -> ${results[4]}ms
`);
}

/*
test(1)
test(10)
test(100)
test(1000)
test(10000)
test(100000) */
//test(1000000)
test(30000); // qjs bug on strPlusStr function

/* > qjs joinStrings.js

Config:
  - Running 30000 times

Results:
strPlusStr -> 8100ms           HERE
strToArrayPush -> 19ms
strConcatStr -> 24ms


> node joinStrings.js

Config:
  - Running 30000 times

Results:
strPlusStr -> 13ms        THIS IS ALL GOOD
strToArrayPush -> 14ms
strConcatStr -> 13ms

*/

mingodad · 2023-01-06T14:47:04Z

Hello @fstirlitz I didn't saw your changes in your fork at https://github.com/fstirlitz/quickjs , could you please attach or copy and paste your diff/patch here ?

funny-falcon · 2024-02-09T04:53:39Z

@bellard was it fixed in 2024.Jan release?

laishere · 2024-07-12T14:44:45Z

Tested with hermes engine. They do optimize += on string. However like @bellard said, there're corner cases += optimization won't work. For example, changing str = str + str1 to str = str1 + str (a rarely seen case in a loop I think), the hermes engine took 16778ms to complete vs qjs took 2407ms on my PC.

I agree the rope based string approach may be a correct direction for general string optimization.

The optimization is undoubtedly required, for example:

let longText;
...
// To append a newline to the end of long text, we have to duplicate the whole text.
// The time complexity is O(n) for appending one character, which is unacceptable!
longText += '\n';

// The below example takes 66531ms on my PC!
// Hermes finishs in 142ms
// And I think quickjs will behave better with similar optimization
let str = '';
for (let i = 0; i < 1000_000; i++) {
  str += 'a';
}

But do we need a general string optimization and does that perform better in practice?

Here're my thoughts:

I think str = str + str1 or str += str1 is a more commonly seen case. Now that string concat is used in almost everywhere, I think it would make a great improvement to the performance even we just optimize for this particular case.
It's easier to implement optimization for string assign equal operation than rope string.
Rope string may perform well in frequently string editing. But it may not perform as well as a linear string structure for read operations because of the Locality Principle (memory cache).

So in my opinions:

WRITE
Rope based string is a general optimization approach for writing operations. But appending a value to another string's tail is more commonly seen.
READ
Linear memory layout performs better for read operations.

laishere · 2024-07-12T16:01:37Z

I just learn that V8 doesn't adopt rope string too. Maybe we can implement similar optimization on quickjs too.

Here're my brief thoughts.

For tiny string, inline store to make use of memory cache.
For large string, use heap.
For string concats, use concat list to store references.
For string slices, use slice reference.

Four string formats in total.

Let's discuss the operations on these formats:

Type 1
Read: It is straight.
Concat:

In place if still tiny.
Reform to Type 3 otherwise.

Slice: Copy the slice range to new string.

Type 2
Read: It is straight.
Concat:

In place if append Type 1 string.
Reform to Type 3 otherwise.

Slice:

Copy the slice range to new string if tiny.
New string with Type 4 otherwise.

Type 3
Read: Estimate the cost to read. When it's cheaper to reform to Type 2 then do it. For example:

Reading first block or last block is cheap.
Frequently reading across inner blocks is expensive.

Concat: Remain Type 3.
Slice: New string with Type 4.

Type 4
Read: Do read on its reference(may cause its reference to reform but no change to itself).
Concat: Reform to Type 3.
Slice: New string with Type 4.

chqrlie · 2024-07-14T22:17:35Z

Here're my brief thoughts.

For tiny string, inline store to make use of memory cache.

For large string, use heap.

For string concats, use concat list to store references.

For string slices, use slice reference.

Thank you for your summary. Do string slices really deserve optimizing? I am not sure real world code would benefit from such an optimisation, adding more string types is somewhat painful and error prone.

laishere · 2024-07-15T01:36:03Z

Thanks for reply. I think slice does have some weight in string operations. For example, split and regexp. I think these two are commonly used.

I think you are right about it may introduce more difficulty and complexity in code implementation.
For example, we need to consider restricting the depth of references in both concat link and slice.

But I think, they can be done progressively. We can do it simpler at first. For example, before reading u.str8 and u.str16 of JSString struct, we do a preread operation, which is responsible to turn concat link or slice into normal string so we can directly access from u.str8 and u.str16. For Inline String or Heap String, we can use the pointer directly. In this way, we don't introduce too many changes in read process. As for the write process, we can do it in both the old way and the new way at first. If we do it in the old way, no changes are needed. If we do it in the new way, preread will make it right for reading. Anyway, it is progressive.

If simple optimizations are not good enough, we can progressively introduce more deeper optimizations.

Actually, I'm implementing them on my own when I have free time. It's a rough implementation, but I will update the benchmark comparison here after I finish. We will see if it is worthy.

swankjesse mentioned this issue Sep 20, 2021

Be sad about Kotlin’s StringBuilder cashapp/zipline#285

Closed

fstirlitz mentioned this issue Nov 10, 2021

feature request: lazy string #100

Closed

bellard mentioned this issue Dec 13, 2023

Loop related optimization issues #169

Closed

TooTallNate pushed a commit to TooTallNate/quickjs that referenced this issue Dec 18, 2023

Build examples on CMake (bellard#67)

da3688f

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

+= to concat strings is heavily underperforming #67

+= to concat strings is heavily underperforming #67

StringManolo commented May 26, 2021

ibraheemdev commented Jul 18, 2021 •

edited

Loading

fstirlitz commented Aug 11, 2021

bellard commented Aug 12, 2021 via email

SGrondin commented Sep 19, 2022 •

edited

Loading

mingodad commented Jan 6, 2023

mingodad commented Jan 6, 2023

funny-falcon commented Feb 9, 2024

laishere commented Jul 12, 2024

laishere commented Jul 12, 2024 •

edited

Loading

chqrlie commented Jul 14, 2024

laishere commented Jul 15, 2024 •

edited

Loading

+= to concat strings is heavily underperforming #67

+= to concat strings is heavily underperforming #67

Comments

StringManolo commented May 26, 2021

ibraheemdev commented Jul 18, 2021 • edited Loading

fstirlitz commented Aug 11, 2021

bellard commented Aug 12, 2021 via email

SGrondin commented Sep 19, 2022 • edited Loading

mingodad commented Jan 6, 2023

mingodad commented Jan 6, 2023

funny-falcon commented Feb 9, 2024

laishere commented Jul 12, 2024

laishere commented Jul 12, 2024 • edited Loading

chqrlie commented Jul 14, 2024

laishere commented Jul 15, 2024 • edited Loading

ibraheemdev commented Jul 18, 2021 •

edited

Loading

SGrondin commented Sep 19, 2022 •

edited

Loading

laishere commented Jul 12, 2024 •

edited

Loading

laishere commented Jul 15, 2024 •

edited

Loading