Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

+= to concat strings is heavily underperforming #67

Open
StringManolo opened this issue May 26, 2021 · 11 comments
Open

+= to concat strings is heavily underperforming #67

StringManolo opened this issue May 26, 2021 · 11 comments

Comments

@StringManolo
Copy link

Using concat method or array push this operation takes around 0.015 - 0.030 seconds.

const str1 = "abc";
const str2 = "defghijklmnñopqrstuvwxyz";

const addStrings = times => {
  let str = "";
  for(let i = 0; i < times; ++i) {
    str += str1 + str2;
  }
  return str;
}

addStrings(30000); // 8 seconds
  • This code in node takes 0.013 seconds to run.
  • In quickjs takes 8.672 seconds

If you want to test, you can run next command:

curl https://raw.githubusercontent.com/StringManolo/javascript-performance-study/main/joinStrings/joinStrings.js -O && node joinStrings.js && qjs joinStrings.js

Notice qjs takes a few seconds to run the code.

@ibraheemdev
Copy link

ibraheemdev commented Jul 18, 2021

I would guess v8 is optimizing the code and executing it at compile-time (the same program with a console.log is similarly fast). Quickjs runs it on my machine in 1.071 seconds. An equivalent ruby program is executed in 1.209 seconds, so I think it's more a testament to how good v8 is than quickjs being slow - execution speeds are comparable to other non-jit interpreters.

@fstirlitz
Copy link
Contributor

So this is kind of interesting. At first I thought this is because QuickJS lacks the unique-reference in-place concatenation optimisation. As it turns out, not quite: the optimisation is already there, but a number of implementation flaws (and some design flaws) in QuickJS prevent it from being actually useful, at least in interpreted code.

The problems are as follows:

  • No opcode for in-place concatenation of lexically-scoped variables

    QuickJS compiles JavaScript functions into bytecode, which is then interpreted by a stack-based virtual machine. The central performance bottleneck is obviously result += str1 + str2;. This is the unoptimised bytecode this statement compiles to:

    ;;     result += str1 + str2;
    
       42:  get_loc_check 0: result
            get_var str1
            get_var str2
            add
            add
            dup
            put_loc_check 0: result
            drop
    

    The get_loc_check opcode puts the value of a lexically-scoped variable on the stack, while put_loc_check removes a value from the stack and puts it in a lexically-scoped variable. The equivalents for function-scoped (var) variables are get_loc and put_loc. Values are generally reference-counted (with cycle detection): putting a value on the stack increments its reference count, removing it decrements it. As such, the string cannot be concatenated in-place, because there are two references to it: one from the stack, the other from the local variable. We would like to have an opcode which can update the variable in place, without incrementing the reference count as an intermediate step. Such an opcode exists for var variables: its name is add_loc. There is no equivalent add_loc_check for let variables.

  • Simplistic optimiser performs peephole analysis only

    We can avoid the above flaw by declaring result with the var keyword. But all this does is change get_loc_check and put_loc_check into get_loc and put_loc. An optimisation step is necessary to transform the opcode stream to use add_loc.

    QuickJS does, in fact, have an optimiser. It is able to simplify the dup / put_loc / drop sequence into just a put_loc. To be able to use add_loc, in full generality it ought to match on get_loc (X) / {Y} / add / put_loc (X) and transform it to {Y} / add_loc (X), for (Y) an arbitrary sequence of opcodes that touches neither the local variable (X) itself nor the stack slot in which its value is put by get_loc. The optimiser is nowhere near that sophisticated; it can only perform bounded-lookahead pattern matching against the opcode stream. There is no data flow analysis, just dumb substitution of fixed opcode patterns.

    We can avoid this flaw by putting the concatenation of str1 and str2 in an intermediate variable. Adding a variable to another variable generates an opcode sequence get_loc (X) / get_loc (Y) / add / put_loc (X) that QuickJS optimises into get_loc (Y) / add_loc (X):

    ;;     var tmp = str1 + str2;
    
            get_var str1
            get_var str2
            add
            put_loc2 2: tmp
    
    ;;     result += tmp;
    
            get_loc2 2: tmp
            add_loc 0: result
    
  • Spurious internal reference-count increments

    Using the add_loc opcode avoids putting the string on the stack, so you would think that it would avoid incrementing the reference count. Except that the opcode’s implementation internally increments the reference count anyway:

                        op1 = JS_ConcatString(ctx, JS_DupValue(ctx, *pv), op1);
                        if (JS_IsException(op1))
                            goto exception;
                        set_value(ctx, pv, op1);

    While the call to JS_ConcatString is active, there are two references to the string: one from the temporary (created by JS_DupValue), the other from the variable. JS_ConcatString drops the former, set_value drops the latter.

    We can address this with a kludge:

                        op1 = JS_ConcatString(ctx, *pv, op1);
                        if (JS_IsException(op1))
                            goto exception;
                        *pv = op1;

    Kind of ugly, but it gets the job done. Now the reference count stays at 1 the whole time.

  • No memory allocation amortisation

    Now the punchline: even if the reference count is 1, this may still not be enough to reuse the allocation of one of the strings.

        str = js_malloc_rt(rt, sizeof(JSString) + (max_len << is_wide_char) + 1 - is_wide_char);

    When allocating memory for strings, QuickJS requests no more space than required to store the string. The allocator is allowed to provide more storage; when QuickJS detects this, the allocation is reused. But far too often, there simply isn’t enough room, and the opportunity never arises. This triggers the slow path where a new string is allocated and the contents are copied manually: not even js_realloc is used to reuse the old object’s data.

    By adding js_realloc on top of the above patches, I managed to get execution time to drop on my machine from nearly 6 seconds to about 300 ms. (Your performance may vary: I compile QuickJS either with TinyCC or with ASan and UBSan enabled, either of which may slow things down.)

@bellard
Copy link
Owner

bellard commented Aug 12, 2021 via email

@SGrondin
Copy link

SGrondin commented Sep 19, 2022

I've been in awe at the excellent performance of quickjs, I really appreciate when performance emerges from simplicity and lightness, as opposed to complex layers upon layers of JIT and runtime analysis.

The only issue I'm running into is the fact that a good number of JS programmers assume that string concatenation is optimized. An example of this is the official Pug parser. A 1,000-line Pug file takes over 55 seconds to parse in quickjs versus ~100 ms in Node/V8.

Consider this a "+1" for this issue, I suppose! I love quickjs and I'd love to be able to trust it will work on any codebase, but until someone has time to implement the rope-based string representation I'll have to restrict my usage to the programs that won't exhibit problematic slowdowns. Thanks again for quickjs!

@mingodad
Copy link

mingodad commented Jan 6, 2023

Doing tests I found that using += individually is faster see changes bellow and output:

/usr/bin/time qjs  joinStrings.js 

Config:
  - Running 30000 times

Results:
strPlusStr -> 3732ms
strPlusStr2 -> 1689ms
strToArrayPush -> 8ms
strPush -> 5ms
strConcatStr -> 8ms

2.41user 3.03system 0:05.44elapsed 99%CPU (0avgtext+0avgdata 6880maxresident)k
0inputs+0outputs (0major+2113473minor)pagefaults 0swaps

joinStrings.js:

const str1 = "abc";
const str2 = "defghijklmnñopqrstuvwxyz";


const strPlusStr = times => {
  let str = "";
  for(let i = 0; i < times; ++i) {
    str += str1 +  str2;
  }
  return str;
}

const strPlusStr2 = times => {
  let str = "";
  for(let i = 0; i < times; ++i) {
    str += str1;
    str += str2;
  }
  return str;
}

const strToArrayPush = times => {
  let str = [];
  for(let i = 0; i < times; ++i) {
    str.push(str1);
    str.push(str2);
  }
  return str.join("");
}

const strPush = times => {
  let str = [];
  for(let i = 0; i < times; ++i) {
    str.push(str1, str2);
  }
  return str.join("");
}

const strConcatStr = times => {
  let str = [];;
  for (let i = 0; i < times; ++i) {
    str.push(str1.concat(str2));
  }
  return str.join("");
}


const _ = console.log;
//const _ = alert; // For Android Browser Testing


const test = times => {
  let time;
  let results = [];

  time = new Date();
  strPlusStr(times);
  results.push(new Date() - time);

  time = new Date();
  strPlusStr2(times);
  results.push(new Date() - time);

  time = new Date();
  strToArrayPush(times);
  results.push(new Date() - time);

  time = new Date();
  strPush(times);
  results.push(new Date() - time);

  time = new Date();
  strConcatStr(times);
  results.push(new Date() - time);

  _(`
Config:
  - Running ${times} times

Results:
strPlusStr -> ${results[0]}ms
strPlusStr2 -> ${results[1]}ms
strToArrayPush -> ${results[2]}ms
strPush -> ${results[3]}ms
strConcatStr -> ${results[4]}ms
`);
}

/*
test(1)
test(10)
test(100)
test(1000)
test(10000)
test(100000) */
//test(1000000)
test(30000); // qjs bug on strPlusStr function

/* > qjs joinStrings.js

Config:
  - Running 30000 times

Results:
strPlusStr -> 8100ms           HERE
strToArrayPush -> 19ms
strConcatStr -> 24ms


> node joinStrings.js

Config:
  - Running 30000 times

Results:
strPlusStr -> 13ms        THIS IS ALL GOOD
strToArrayPush -> 14ms
strConcatStr -> 13ms

*/

@mingodad
Copy link

mingodad commented Jan 6, 2023

Hello @fstirlitz I didn't saw your changes in your fork at https://github.com/fstirlitz/quickjs , could you please attach or copy and paste your diff/patch here ?

TooTallNate pushed a commit to TooTallNate/quickjs that referenced this issue Dec 18, 2023
@funny-falcon
Copy link

@bellard was it fixed in 2024.Jan release?

@laishere
Copy link

Tested with hermes engine. They do optimize += on string. However like @bellard said, there're corner cases += optimization won't work. For example, changing str = str + str1 to str = str1 + str (a rarely seen case in a loop I think), the hermes engine took 16778ms to complete vs qjs took 2407ms on my PC.

I agree the rope based string approach may be a correct direction for general string optimization.

The optimization is undoubtedly required, for example:

let longText;
...
// To append a newline to the end of long text, we have to duplicate the whole text.
// The time complexity is O(n) for appending one character, which is unacceptable!
longText += '\n';

// The below example takes 66531ms on my PC!
// Hermes finishs in 142ms
// And I think quickjs will behave better with similar optimization
let str = '';
for (let i = 0; i < 1000_000; i++) {
  str += 'a';
}

But do we need a general string optimization and does that perform better in practice?

Here're my thoughts:

  1. I think str = str + str1 or str += str1 is a more commonly seen case. Now that string concat is used in almost everywhere, I think it would make a great improvement to the performance even we just optimize for this particular case.
  2. It's easier to implement optimization for string assign equal operation than rope string.
  3. Rope string may perform well in frequently string editing. But it may not perform as well as a linear string structure for read operations because of the Locality Principle (memory cache).

So in my opinions:

WRITE
Rope based string is a general optimization approach for writing operations. But appending a value to another string's tail is more commonly seen.
READ
Linear memory layout performs better for read operations.

@laishere
Copy link

laishere commented Jul 12, 2024

I just learn that V8 doesn't adopt rope string too. Maybe we can implement similar optimization on quickjs too.

Here're my brief thoughts.

  1. For tiny string, inline store to make use of memory cache.
  2. For large string, use heap.
  3. For string concats, use concat list to store references.
  4. For string slices, use slice reference.

Four string formats in total.

Let's discuss the operations on these formats:

Type 1
Read: It is straight.
Concat:

  • In place if still tiny.
  • Reform to Type 3 otherwise.

Slice: Copy the slice range to new string.

Type 2
Read: It is straight.
Concat:

  • In place if append Type 1 string.
  • Reform to Type 3 otherwise.

Slice:

  • Copy the slice range to new string if tiny.
  • New string with Type 4 otherwise.

Type 3
Read: Estimate the cost to read. When it's cheaper to reform to Type 2 then do it. For example:

  • Reading first block or last block is cheap.
  • Frequently reading across inner blocks is expensive.

Concat: Remain Type 3.
Slice: New string with Type 4.

Type 4
Read: Do read on its reference(may cause its reference to reform but no change to itself).
Concat: Reform to Type 3.
Slice: New string with Type 4.

@chqrlie
Copy link
Collaborator

chqrlie commented Jul 14, 2024

Here're my brief thoughts.

  1. For tiny string, inline store to make use of memory cache.
  2. For large string, use heap.
  3. For string concats, use concat list to store references.
  4. For string slices, use slice reference.

Thank you for your summary. Do string slices really deserve optimizing? I am not sure real world code would benefit from such an optimisation, adding more string types is somewhat painful and error prone.

@laishere
Copy link

laishere commented Jul 15, 2024

Thanks for reply. I think slice does have some weight in string operations. For example, split and regexp. I think these two are commonly used.

I think you are right about it may introduce more difficulty and complexity in code implementation.
For example, we need to consider restricting the depth of references in both concat link and slice.

But I think, they can be done progressively. We can do it simpler at first. For example, before reading u.str8 and u.str16 of JSString struct, we do a preread operation, which is responsible to turn concat link or slice into normal string so we can directly access from u.str8 and u.str16. For Inline String or Heap String, we can use the pointer directly. In this way, we don't introduce too many changes in read process. As for the write process, we can do it in both the old way and the new way at first. If we do it in the old way, no changes are needed. If we do it in the new way, preread will make it right for reading. Anyway, it is progressive.

If simple optimizations are not good enough, we can progressively introduce more deeper optimizations.

Actually, I'm implementing them on my own when I have free time. It's a rough implementation, but I will update the benchmark comparison here after I finish. We will see if it is worthy.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

9 participants