Converting binary data to UTF-8 changes the data #3674

MatteoT9890 · 2022-01-02T15:55:58Z

Version

14.14.0

Platform

Linux matt 4.15.0-163-generic #171-Ubuntu SMP Fri Nov 5 11:55:11 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

Subsystem

No response

What steps will reproduce the bug?

const fs = require("fs")

let buffer = fs.readFileSync("file.pdf")
let byteLength = Buffer.byteLength(buffer) // Valid PDF, length: 1440177
let utf8String = buffer.toString("utf8")
let bufferAgain = Buffer.from(utf8String, "utf8") // Bad PDF, length: 2551916
buffer.equals(bufferAgain) // Gives false

How often does it reproduce? Is there a required condition?

Always with PDF files

What is the expected behavior?

The buffer of a PDF file should convert to a UTF-8 string and then convert back into the same initial buffer.

What do you see instead?

The two buffers, one before conversion and one after conversion, are different.

Additional information

The PDF file can be downloaded from this URL.

After downloading, rename it to "file.pdf" in order to run the provided snippet.


tniessen · 2022-01-02T19:25:15Z

This is not a bug but expected behavior.

PDF files are binary files. As per the documentation of buffer.toString, the returned string may not accurately represent the binary data if that data was not valid UTF-8 in the first place:

If encoding is 'utf8' and a byte sequence in the input is not valid UTF-8, then each invalid byte is replaced with the replacement character U+FFFD.

Calling buffer.toString('utf8') is safe, and retains the original binary representation, only if the buffer contains valid UTF-8. PDF files are not UTF-8 strings.
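One way to check up front whether a buffer would survive the conversion is to decode it strictly. The following is a minimal sketch, not part of the original report: `isValidUtf8` is a hypothetical helper name, and it assumes a Node.js build with ICU support so that `TextDecoder` accepts the `fatal` option. The sample bytes are made up for illustration.

```javascript
const { TextDecoder } = require("util");

// Hypothetical helper: returns true only if every byte sequence in the
// buffer is valid UTF-8. With { fatal: true } the decoder throws on
// invalid input instead of substituting U+FFFD.
function isValidUtf8(buffer) {
  try {
    new TextDecoder("utf-8", { fatal: true }).decode(buffer);
    return true;
  } catch (err) {
    return false;
  }
}

// Text that was encoded as UTF-8 decodes cleanly.
const text = Buffer.from("héllo", "utf8");

// Illustrative binary bytes: "%PDF" followed by bytes that do not form
// valid UTF-8 sequences.
const binary = Buffer.from([0x25, 0x50, 0x44, 0x46, 0xe2, 0xe3, 0xcf, 0xd3]);

console.log(isValidUtf8(text));
console.log(isValidUtf8(binary));
```

If the check fails, the buffer should be treated as binary and not passed through a 'utf8' string.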

When you convert the resulting string back to a Buffer, it is impossible to restore the original binary data from the string.
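A small sketch of why the round trip fails, using made-up bytes in place of the PDF from the report. Each invalid byte becomes U+FFFD, and U+FFFD encodes to three bytes (EF BF BD) in UTF-8, which is consistent with the buffer in the report growing from 1440177 to 2551916 bytes. If the data has to travel as a string, a byte-preserving encoding such as 'base64', 'hex', or 'latin1' keeps it intact.

```javascript
// Illustrative binary data: "%PDF" followed by three bytes that are
// not valid UTF-8 on their own.
const original = Buffer.from([0x25, 0x50, 0x44, 0x46, 0xff, 0xfe, 0x80]);

// Lossy: invalid bytes are replaced with U+FFFD during decoding, so the
// original byte values cannot be recovered when encoding again.
const viaUtf8 = Buffer.from(original.toString("utf8"), "utf8");

// Lossless: base64 maps every possible byte value reversibly.
const viaBase64 = Buffer.from(original.toString("base64"), "base64");

console.log(original.length, viaUtf8.length);
console.log(original.equals(viaUtf8));
console.log(original.equals(viaBase64));
```

When no string is needed at all, the simplest option is to keep the data as a Buffer from start to finish.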

MatteoT9890 changed the title Jan 2, 2022

mscdex transferred this issue from nodejs/node Jan 2, 2022

tniessen closed this as completed Jan 2, 2022

tniessen changed the title ~~BUG in PDF conversion from utf8 string to buffer again~~ Converting binary data to UTF-8 changes the data Jan 2, 2022
