-
Notifications
You must be signed in to change notification settings - Fork 30.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Buffer.from() output breaks after optimizing in nodejs 22.7 #54521
Comments
I managed to reproduce using your code snippet, and the slightly modified one below (for readability) let i = 0;
while (i < 1_000_000) {
const asHex = Buffer.from("\x80").toString("hex");
if (asHex === '80') {
break;
} else if (asHex !== 'c280') {
console.log("Unexpected return value:", asHex);
}
i++;
}
if (i < 1_000_000) {
console.log("FAILED after %d iterations", i);
process.exit(1);
} else {
console.log("PASSED after %d iterations", i);
} $ node repro.js
FAILED after 8827 iterations
$ node repro.js
FAILED after 5509 iterations
$ node repro.js
FAILED after 48741 iterations
$ node repro.js
FAILED after 7487 iterations
$ node repro.js
FAILED after 9189 iterations |
bisecting gives me, as first bad commit: commit c9dabe2
however, that one's even worse and shows a lot of incorrect 'asHex' values:
so it looks like the more extreme bug got fixed (probably in 7800893) but then issues still remained for (eg) \x80 |
CC @ronag |
I'll have a look. |
let i = 0;
while (i < 1_000_000) {
const buf = Buffer.from("\x80")
if (buf[0] !== 194 || buf[1] !== 128) {
console.log("Unexpected return value:", buf, buf[0], buf[1]);
break
}
i++;
}
if (i < 1_000_000) {
console.log("FAILED after %d iterations", i);
process.exit(1);
} else {
console.log("PASSED after %d iterations", i);
} |
This seems to be something in V8. In uint32_t FastWriteString(Local<Value> receiver,
const v8::FastApiTypedArray<uint8_t>& dst,
const v8::FastOneByteString& src,
uint32_t offset,
uint32_t max_length) {
}
@targos @joyeecheung @nodejs/buffer |
Which is kind of correct... is it the slow path that is broken? |
Is the problem here that we are not handling incomplete utf8 sequences? |
I'm not sure where the |
If 0x80 was a valid Unicode position, 0xC2 0x80 woud be its UTF8 Encoding. At the very least, this is a regression compared to earlier node (or V8?) versions. I'm not sure about correctness or what the spec says - JavaScript has always kind of tolerated invalid unicode in its string (otherwise Buffer.from(xxx, "latin1") wouldn't be useful either). I think #54518 (comment) is coming from the same direction (I'm also seeing issues other Invalid UTF-8 encoding issues pop up - 0x80 was just the cleanest example of the regression) Anyway, if we ignore 0x80 for now... 0xA0 (non breaking space)is certainly a valid Unicode character and perhaps the most often occurring Unicode codepoint above 128. It should encode to 0xC2 0xA0 in utf8
returns
The encoding here also changes for the fast path node 20 only returns |
Looking at the slow implementation there is a @lemire does simdutf have a "memcpy" + replace invalid utf8 routine? |
It does not. The simdutf library does not do replacement. It only transcodes valid inputs. |
Fast API handles invalid UTF differently than the slow API. Fixes: nodejs#54521
Fast API handles invalid UTF differently than the slow API. Fixes: nodejs#54521 Refs: nodejs#54525
Fast API handles invalid UTF differently than the slow API. Fixes: nodejs#54521 PR-URL: nodejs#54525
In simdutf, the expectation (currently) is that you assume that the input is valid. If it is not, then simdutf will tell you, and you can use a slow path. We expect most inputs to be valid strings... |
so is there a way I could do: if (!simd_utf8_copy(src, dst, size)) {
// slow path
} |
Let me have a look at the offending code. |
Fast API handles invalid UTF differently than the slow API. Fixes: nodejs#54521 PR-URL: nodejs#54525
@ronag Let me create a PR (for discussion). |
Fast API handles invalid UTF differently than the slow API. Fixes: nodejs#54521 PR-URL: nodejs#54525
Fast API handles invalid UTF differently than the slow API. Fixes: nodejs#54521 PR-URL: nodejs#54525
Fast API handles invalid UTF differently than the slow API. Fixes: nodejs#54521 PR-URL: nodejs#54525
Fast API handles invalid UTF differently than the slow API. Fixes: nodejs#54521 PR-URL: nodejs#54525
It should resolve the regressions while we work on fixing them. Refs: #54521 PR-URL: #54565 Reviewed-By: Benjamin Gruenbaum <[email protected]> Reviewed-By: Robert Nagy <[email protected]> Reviewed-By: Luigi Pinca <[email protected]> Reviewed-By: Yagiz Nizipli <[email protected]> Reviewed-By: Anna Henningsen <[email protected]> Reviewed-By: James M Snell <[email protected]>
Fixed by #54565 |
Fast API handles invalid UTF differently than the slow API. Fixes: nodejs#54521 PR-URL: nodejs#54525
Fast API handles invalid UTF differently than the slow API. Fixes: nodejs#54521 PR-URL: nodejs#54525
It should resolve the regressions while we work on fixing them. Refs: #54521 PR-URL: #54565 Reviewed-By: Benjamin Gruenbaum <[email protected]> Reviewed-By: Robert Nagy <[email protected]> Reviewed-By: Luigi Pinca <[email protected]> Reviewed-By: Yagiz Nizipli <[email protected]> Reviewed-By: Anna Henningsen <[email protected]> Reviewed-By: James M Snell <[email protected]>
It should resolve the regressions while we work on fixing them. Refs: #54521 PR-URL: #54565 Reviewed-By: Benjamin Gruenbaum <[email protected]> Reviewed-By: Robert Nagy <[email protected]> Reviewed-By: Luigi Pinca <[email protected]> Reviewed-By: Yagiz Nizipli <[email protected]> Reviewed-By: Anna Henningsen <[email protected]> Reviewed-By: James M Snell <[email protected]>
It should resolve the regressions while we work on fixing them. Refs: #54521 PR-URL: #54565 Reviewed-By: Benjamin Gruenbaum <[email protected]> Reviewed-By: Robert Nagy <[email protected]> Reviewed-By: Luigi Pinca <[email protected]> Reviewed-By: Yagiz Nizipli <[email protected]> Reviewed-By: Anna Henningsen <[email protected]> Reviewed-By: James M Snell <[email protected]>
Fast API handles invalid UTF differently than the slow API. Fixes: nodejs#54521 PR-URL: nodejs#54525
Fast API handles invalid UTF differently than the slow API. Fixes: nodejs#54521 PR-URL: nodejs#54525
Re-enables fast Fast API for Buffer.write after fixing UTF8 handling. Fixes: nodejs#54521 PR-URL: nodejs#54525
Re-enables fast Fast API for Buffer.write after fixing UTF8 handling. Fixes: nodejs#54521
Re-enables fast Fast API for Buffer.write after fixing UTF8 handling. Fixes: #54521 PR-URL: #54526 Reviewed-By: Daniel Lemire <[email protected]> Reviewed-By: Benjamin Gruenbaum <[email protected]> Reviewed-By: James M Snell <[email protected]> Reviewed-By: Anna Henningsen <[email protected]> Reviewed-By: Yagiz Nizipli <[email protected]> Reviewed-By: Paolo Insogna <[email protected]>
Re-enables fast Fast API for Buffer.write after fixing UTF8 handling. Fixes: #54521 PR-URL: #54526 Reviewed-By: Daniel Lemire <[email protected]> Reviewed-By: Benjamin Gruenbaum <[email protected]> Reviewed-By: James M Snell <[email protected]> Reviewed-By: Anna Henningsen <[email protected]> Reviewed-By: Yagiz Nizipli <[email protected]> Reviewed-By: Paolo Insogna <[email protected]>
It should resolve the regressions while we work on fixing them. Refs: nodejs#54521 PR-URL: nodejs#54565 Reviewed-By: Benjamin Gruenbaum <[email protected]> Reviewed-By: Robert Nagy <[email protected]> Reviewed-By: Luigi Pinca <[email protected]> Reviewed-By: Yagiz Nizipli <[email protected]> Reviewed-By: Anna Henningsen <[email protected]> Reviewed-By: James M Snell <[email protected]>
Re-enables fast Fast API for Buffer.write after fixing UTF8 handling. Fixes: nodejs#54521 PR-URL: nodejs#54526 Reviewed-By: Daniel Lemire <[email protected]> Reviewed-By: Benjamin Gruenbaum <[email protected]> Reviewed-By: James M Snell <[email protected]> Reviewed-By: Anna Henningsen <[email protected]> Reviewed-By: Yagiz Nizipli <[email protected]> Reviewed-By: Paolo Insogna <[email protected]>
Re-enables fast Fast API for Buffer.write after fixing UTF8 handling. Fixes: nodejs#54521 PR-URL: nodejs#54526 Reviewed-By: Daniel Lemire <[email protected]> Reviewed-By: Benjamin Gruenbaum <[email protected]> Reviewed-By: James M Snell <[email protected]> Reviewed-By: Anna Henningsen <[email protected]> Reviewed-By: Yagiz Nizipli <[email protected]> Reviewed-By: Paolo Insogna <[email protected]>
Version
v22.7.0
Platform
Subsystem
Buffer
What steps will reproduce the bug?
How often does it reproduce? Is there a required condition?
For me, it will consistently fail somewhere between 7000 to 1200 iterations.
What is the expected behavior? Why is that the expected behavior?
In node 20 this code does not fail , even after 1_000_000 iterations.
Buffer.from("\x80").toString("hex")
always returnsc280
on node 20What do you see instead?
Buffer.from("\x80").toString("hex")
incorrectly returns80
after sufficient iterationsAdditional information
No response
The text was updated successfully, but these errors were encountered: