Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Buffer.from() output breaks after optimizing in nodejs 22.7 #54521

Closed
unilynx opened this issue Aug 23, 2024 · 20 comments
Closed

Buffer.from() output breaks after optimizing in nodejs 22.7 #54521

unilynx opened this issue Aug 23, 2024 · 20 comments
Labels
buffer Issues and PRs related to the buffer subsystem. confirmed-bug Issues with confirmed bugs. regression Issues related to regressions. v22.x v22.x Issues that can be reproduced on v22.x or PRs targeting the v22.x-staging branch.

Comments

@unilynx
Copy link

unilynx commented Aug 23, 2024

Version

v22.7.0

Platform

Darwin moe.fritz.box 23.5.0 Darwin Kernel Version 23.5.0: Wed May  1 20:13:18 PDT 2024; root:xnu-10063.121.3~5/RELEASE_ARM64_T6030 arm64

Subsystem

Buffer

What steps will reproduce the bug?

let i = 0;
for(; i < 1_000_000; i++) {
  const ashex = Buffer.from("\x80").toString("hex");
  if (ashex === '80')
    break;
 else if(ashex !== 'c280')
    console.log("Unexpected return value", ashex); //this never happened for me
}

if(i<1_000_000) {
  console.log("FAILED after %d iterations",i);
  process.exit(1);
} else
  console.log("PASSED after %d iterations",i);

How often does it reproduce? Is there a required condition?

For me, it will consistently fail somewhere between 7000 to 1200 iterations.

What is the expected behavior? Why is that the expected behavior?

In node 20 this code does not fail , even after 1_000_000 iterations. Buffer.from("\x80").toString("hex") always returns c280 on node 20

What do you see instead?

Buffer.from("\x80").toString("hex") incorrectly returns 80 after sufficient iterations

Additional information

No response

@avivkeller avivkeller added buffer Issues and PRs related to the buffer subsystem. v22.x v22.x Issues that can be reproduced on v22.x or PRs targeting the v22.x-staging branch. repro-exists labels Aug 23, 2024
@avivkeller
Copy link
Member

I managed to reproduce using your code snippet, and the slightly modified one below (for readability)

let i = 0;

while (i < 1_000_000) {
  const asHex = Buffer.from("\x80").toString("hex");

  if (asHex === '80') {
    break;
  } else if (asHex !== 'c280') {
    console.log("Unexpected return value:", asHex);
  }

  i++;
}

if (i < 1_000_000) {
  console.log("FAILED after %d iterations", i);
  process.exit(1);
} else {
  console.log("PASSED after %d iterations", i);
}
$ node repro.js
FAILED after 8827 iterations
$ node repro.js
FAILED after 5509 iterations
$ node repro.js
FAILED after 48741 iterations
$ node repro.js
FAILED after 7487 iterations
$ node repro.js
FAILED after 9189 iterations

@unilynx
Copy link
Author

unilynx commented Aug 23, 2024

bisecting gives me, as first bad commit:

commit c9dabe2
Author: Robert Nagy [email protected]
Date: Thu Aug 15 03:01:05 2024 +0200

buffer: use fast API for writing one-byte strings

however, that one's even worse and shows a lot of incorrect 'asHex' values:

Unexpected return value afdead0b
Unexpected return value 00dead0b
Unexpected return value 00000000
Unexpected return value 00dead0b
Unexpected return value 01dead0b
Unexpected return value 00000000
Unexpected return value 905e0150
Unexpected return value 00738148

so it looks like the more extreme bug got fixed (probably in 7800893) but then issues still remained for (eg) \x80

@avivkeller
Copy link
Member

CC @ronag

@unilynx
Copy link
Author

unilynx commented Aug 23, 2024

just to confirm the above - if I take the v22.x branch (5ae0260) but revert 7800893 and a5a60e6, the issue is no longer present

@ronag
Copy link
Member

ronag commented Aug 23, 2024

I'll have a look.

@ronag
Copy link
Member

ronag commented Aug 23, 2024

let i = 0;

while (i < 1_000_000) {
  const buf = Buffer.from("\x80")

  if (buf[0] !== 194 || buf[1] !== 128) {
    console.log("Unexpected return value:", buf, buf[0], buf[1]);
    break
  }

  i++;
}

if (i < 1_000_000) {
  console.log("FAILED after %d iterations", i);
  process.exit(1);
} else {
  console.log("PASSED after %d iterations", i);
}

@ronag
Copy link
Member

ronag commented Aug 23, 2024

This seems to be something in V8.

In

uint32_t FastWriteString(Local<Value> receiver,
                         const v8::FastApiTypedArray<uint8_t>& dst,
                         const v8::FastOneByteString& src,
                         uint32_t offset,
                         uint32_t max_length) {
}

src.length === 1 i.e. FastOneByteString thinks that Buffer.from(['\x80']) has a length of 1.

@targos @joyeecheung @nodejs/buffer

@ronag
Copy link
Member

ronag commented Aug 23, 2024

Which is kind of correct... is it the slow path that is broken?

@ronag
Copy link
Member

ronag commented Aug 23, 2024

actually @anonrig @lemire might have ideas?

@ronag
Copy link
Member

ronag commented Aug 23, 2024

Is the problem here that we are not handling incomplete utf8 sequences?

@ronag
Copy link
Member

ronag commented Aug 23, 2024

I'm not sure where the c2 from the expected c280 is coming from?

@unilynx
Copy link
Author

unilynx commented Aug 23, 2024

If 0x80 was a valid Unicode position, 0xC2 0x80 woud be its UTF8 Encoding. At the very least, this is a regression compared to earlier node (or V8?) versions.

I'm not sure about correctness or what the spec says - JavaScript has always kind of tolerated invalid unicode in its string (otherwise Buffer.from(xxx, "latin1") wouldn't be useful either). I think #54518 (comment) is coming from the same direction (I'm also seeing issues other Invalid UTF-8 encoding issues pop up - 0x80 was just the cleanest example of the regression)

Anyway, if we ignore 0x80 for now... 0xA0 (non breaking space)is certainly a valid Unicode character and perhaps the most often occurring Unicode codepoint above 128. It should encode to 0xC2 0xA0 in utf8

let i = 0;
let prev = '';
for(; i < 1_000_000; i++) {
  const ashex = Buffer.from("\xA0").toString("hex");
  if(prev !== ashex) {
    console.log("Got ", ashex);
    prev = ashex;
  }
}

returns

Got  c2a0
Got  a0

The encoding here also changes for the fast path

node 20 only returns c2a0

@ronag
Copy link
Member

ronag commented Aug 23, 2024

Looking at the slow implementation there is a REPLACE_INVALID_UTF8 flag which I guess is what is causing the difference.

@lemire does simdutf have a "memcpy" + replace invalid utf8 routine?

@lemire
Copy link
Member

lemire commented Aug 23, 2024

@ronag

does simdutf have a "memcpy" + replace invalid utf8 routine?

It does not. The simdutf library does not do replacement. It only transcodes valid inputs.

ronag added a commit to nxtedition/node that referenced this issue Aug 23, 2024
Fast API handles invalid UTF differently than the slow API.

Fixes: nodejs#54521
ronag added a commit to nxtedition/node that referenced this issue Aug 23, 2024
Fast API handles invalid UTF differently than the slow API.

Fixes: nodejs#54521
Refs: nodejs#54525
ronag added a commit to nxtedition/node that referenced this issue Aug 23, 2024
Fast API handles invalid UTF differently than the slow API.

Fixes: nodejs#54521
PR-URL: nodejs#54525
@lemire
Copy link
Member

lemire commented Aug 23, 2024

@ronag

In simdutf, the expectation (currently) is that you assume that the input is valid. If it is not, then simdutf will tell you, and you can use a slow path. We expect most inputs to be valid strings...

@ronag
Copy link
Member

ronag commented Aug 23, 2024

so is there a way I could do:

if (!simd_utf8_copy(src, dst, size)) {
  // slow path
}

@lemire
Copy link
Member

lemire commented Aug 23, 2024

so is there a way I could do:

Let me have a look at the offending code.

ronag added a commit to nxtedition/node that referenced this issue Aug 23, 2024
Fast API handles invalid UTF differently than the slow API.

Fixes: nodejs#54521
PR-URL: nodejs#54525
@lemire
Copy link
Member

lemire commented Aug 23, 2024

@ronag Let me create a PR (for discussion).

ronag added a commit to nxtedition/node that referenced this issue Aug 27, 2024
Fast API handles invalid UTF differently than the slow API.

Fixes: nodejs#54521
PR-URL: nodejs#54525
ronag added a commit to nxtedition/node that referenced this issue Aug 28, 2024
Fast API handles invalid UTF differently than the slow API.

Fixes: nodejs#54521
PR-URL: nodejs#54525
ronag added a commit to nxtedition/node that referenced this issue Aug 28, 2024
Fast API handles invalid UTF differently than the slow API.

Fixes: nodejs#54521
PR-URL: nodejs#54525
ronag added a commit to nxtedition/node that referenced this issue Aug 29, 2024
Fast API handles invalid UTF differently than the slow API.

Fixes: nodejs#54521
PR-URL: nodejs#54525
nodejs-github-bot pushed a commit that referenced this issue Aug 29, 2024
It should resolve the regressions while we work on fixing them.

Refs: #54521
PR-URL: #54565
Reviewed-By: Benjamin Gruenbaum <[email protected]>
Reviewed-By: Robert Nagy <[email protected]>
Reviewed-By: Luigi Pinca <[email protected]>
Reviewed-By: Yagiz Nizipli <[email protected]>
Reviewed-By: Anna Henningsen <[email protected]>
Reviewed-By: James M Snell <[email protected]>
@avivkeller
Copy link
Member

Fixed by #54565

@avivkeller avivkeller closed this as not planned Won't fix, can't repro, duplicate, stale Aug 29, 2024
ronag added a commit to nxtedition/node that referenced this issue Aug 30, 2024
Fast API handles invalid UTF differently than the slow API.

Fixes: nodejs#54521
PR-URL: nodejs#54525
ronag added a commit to nxtedition/node that referenced this issue Aug 30, 2024
Fast API handles invalid UTF differently than the slow API.

Fixes: nodejs#54521
PR-URL: nodejs#54525
RafaelGSS pushed a commit that referenced this issue Aug 30, 2024
It should resolve the regressions while we work on fixing them.

Refs: #54521
PR-URL: #54565
Reviewed-By: Benjamin Gruenbaum <[email protected]>
Reviewed-By: Robert Nagy <[email protected]>
Reviewed-By: Luigi Pinca <[email protected]>
Reviewed-By: Yagiz Nizipli <[email protected]>
Reviewed-By: Anna Henningsen <[email protected]>
Reviewed-By: James M Snell <[email protected]>
RafaelGSS pushed a commit that referenced this issue Aug 30, 2024
It should resolve the regressions while we work on fixing them.

Refs: #54521
PR-URL: #54565
Reviewed-By: Benjamin Gruenbaum <[email protected]>
Reviewed-By: Robert Nagy <[email protected]>
Reviewed-By: Luigi Pinca <[email protected]>
Reviewed-By: Yagiz Nizipli <[email protected]>
Reviewed-By: Anna Henningsen <[email protected]>
Reviewed-By: James M Snell <[email protected]>
RafaelGSS pushed a commit that referenced this issue Aug 30, 2024
It should resolve the regressions while we work on fixing them.

Refs: #54521
PR-URL: #54565
Reviewed-By: Benjamin Gruenbaum <[email protected]>
Reviewed-By: Robert Nagy <[email protected]>
Reviewed-By: Luigi Pinca <[email protected]>
Reviewed-By: Yagiz Nizipli <[email protected]>
Reviewed-By: Anna Henningsen <[email protected]>
Reviewed-By: James M Snell <[email protected]>
ronag added a commit to nxtedition/node that referenced this issue Aug 31, 2024
Fast API handles invalid UTF differently than the slow API.

Fixes: nodejs#54521
PR-URL: nodejs#54525
ronag added a commit to nxtedition/node that referenced this issue Sep 2, 2024
Fast API handles invalid UTF differently than the slow API.

Fixes: nodejs#54521
PR-URL: nodejs#54525
ronag added a commit to nxtedition/node that referenced this issue Sep 2, 2024
Re-enables fast Fast API for Buffer.write after fixing
UTF8 handling.

Fixes: nodejs#54521
PR-URL: nodejs#54525
ronag added a commit to nxtedition/node that referenced this issue Sep 2, 2024
Re-enables fast Fast API for Buffer.write after fixing
UTF8 handling.

Fixes: nodejs#54521
nodejs-github-bot pushed a commit that referenced this issue Sep 4, 2024
Re-enables fast Fast API for Buffer.write after fixing
UTF8 handling.

Fixes: #54521
PR-URL: #54526
Reviewed-By: Daniel Lemire <[email protected]>
Reviewed-By: Benjamin Gruenbaum <[email protected]>
Reviewed-By: James M Snell <[email protected]>
Reviewed-By: Anna Henningsen <[email protected]>
Reviewed-By: Yagiz Nizipli <[email protected]>
Reviewed-By: Paolo Insogna <[email protected]>
aduh95 pushed a commit that referenced this issue Sep 12, 2024
Re-enables fast Fast API for Buffer.write after fixing
UTF8 handling.

Fixes: #54521
PR-URL: #54526
Reviewed-By: Daniel Lemire <[email protected]>
Reviewed-By: Benjamin Gruenbaum <[email protected]>
Reviewed-By: James M Snell <[email protected]>
Reviewed-By: Anna Henningsen <[email protected]>
Reviewed-By: Yagiz Nizipli <[email protected]>
Reviewed-By: Paolo Insogna <[email protected]>
louwers pushed a commit to louwers/node that referenced this issue Nov 2, 2024
It should resolve the regressions while we work on fixing them.

Refs: nodejs#54521
PR-URL: nodejs#54565
Reviewed-By: Benjamin Gruenbaum <[email protected]>
Reviewed-By: Robert Nagy <[email protected]>
Reviewed-By: Luigi Pinca <[email protected]>
Reviewed-By: Yagiz Nizipli <[email protected]>
Reviewed-By: Anna Henningsen <[email protected]>
Reviewed-By: James M Snell <[email protected]>
louwers pushed a commit to louwers/node that referenced this issue Nov 2, 2024
Re-enables fast Fast API for Buffer.write after fixing
UTF8 handling.

Fixes: nodejs#54521
PR-URL: nodejs#54526
Reviewed-By: Daniel Lemire <[email protected]>
Reviewed-By: Benjamin Gruenbaum <[email protected]>
Reviewed-By: James M Snell <[email protected]>
Reviewed-By: Anna Henningsen <[email protected]>
Reviewed-By: Yagiz Nizipli <[email protected]>
Reviewed-By: Paolo Insogna <[email protected]>
tpoisseau pushed a commit to tpoisseau/node that referenced this issue Nov 21, 2024
Re-enables fast Fast API for Buffer.write after fixing
UTF8 handling.

Fixes: nodejs#54521
PR-URL: nodejs#54526
Reviewed-By: Daniel Lemire <[email protected]>
Reviewed-By: Benjamin Gruenbaum <[email protected]>
Reviewed-By: James M Snell <[email protected]>
Reviewed-By: Anna Henningsen <[email protected]>
Reviewed-By: Yagiz Nizipli <[email protected]>
Reviewed-By: Paolo Insogna <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
buffer Issues and PRs related to the buffer subsystem. confirmed-bug Issues with confirmed bugs. regression Issues related to regressions. v22.x v22.x Issues that can be reproduced on v22.x or PRs targeting the v22.x-staging branch.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants