Converting binary data to UTF-8 changes the data #3674

Closed
MatteoT9890 opened this issue Jan 2, 2022 · 1 comment


Version

14.14.0

Platform

Linux matt 4.15.0-163-generic #171-Ubuntu SMP Fri Nov 5 11:55:11 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

Subsystem

No response

What steps will reproduce the bug?

const fs = require("fs")

let buffer = fs.readFileSync("file.pdf") // raw bytes of the PDF
let byteLength = Buffer.byteLength(buffer) // valid PDF, length: 1440177
let utf8String = buffer.toString("utf8") // decode the bytes as UTF-8
let bufferAgain = Buffer.from(utf8String, "utf8") // bad PDF, length: 2551916
buffer.equals(bufferAgain) // gives false

How often does it reproduce? Is there a required condition?

Always with PDF files

What is the expected behavior?

The buffer of the PDF file should convert to a UTF-8 string and then convert back to a buffer identical to the initial one.

What do you see instead?

The two buffers, one before the conversion and one after it, are different.

Additional information

The PDF file can be downloaded from this URL.

After downloading, rename it to "file.pdf" to run the provided snippet.

@MatteoT9890 MatteoT9890 changed the title Jan 2, 2022
@mscdex mscdex transferred this issue from nodejs/node Jan 2, 2022
tniessen (Member) commented Jan 2, 2022

This is not a bug but expected behavior.

PDF files are binary files. As per the documentation of buffer.toString, the returned string may not be an accurate representation of the binary data if that data was not a valid UTF-8 byte sequence in the first place:

If encoding is 'utf8' and a byte sequence in the input is not valid UTF-8, then each invalid byte is replaced with the replacement character U+FFFD.

Calling buffer.toString('utf8') is only safe, and will only retain the original binary representation, if the buffer contains valid UTF-8. PDF files are not UTF-8 strings.
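
To make the replacement rule concrete, here is a minimal sketch using a hypothetical four-byte buffer (not taken from the reported PDF). It assumes only the standard Node.js Buffer API. Each invalid byte decodes to U+FFFD, and U+FFFD occupies three bytes when encoded back to UTF-8, which would also account for the round-tripped buffer in the report being larger than the original.

```javascript
// Hypothetical input: 0xFF and 0xFE never occur in valid UTF-8,
// while 0x25 ('%') and 0x50 ('P') are plain ASCII.
const sample = Buffer.from([0x25, 0xff, 0xfe, 0x50]);

// Decoding replaces each invalid byte with the replacement character U+FFFD.
const decoded = sample.toString("utf8");

// Encoding again writes each U+FFFD as the three bytes EF BF BD.
const reencoded = Buffer.from(decoded, "utf8");

console.log(sample.length);                // 4
console.log(reencoded.length);             // 8
console.log(reencoded.toString("hex"));    // 25efbfbdefbfbd50
console.log(sample.equals(reencoded));     // false
```

Both 0xFF and 0xFE end up as the same three bytes, so nothing in the re-encoded buffer records which byte was originally there.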

When you convert the resulting string back to a Buffer, it is impossible to restore the original binary data from the string, because every invalid byte has already been replaced by the same character, U+FFFD.
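
If the binary data really has to travel as a string, one possible approach (a sketch, not something proposed in this thread) is to use an encoding that maps every byte value reversibly, such as 'base64', 'hex', or 'latin1'. The payload below is a hypothetical stand-in for the PDF bytes.

```javascript
// Hypothetical payload standing in for binary file content ("%PDF" plus
// bytes that are not valid UTF-8).
const payload = Buffer.from([0x25, 0x50, 0x44, 0x46, 0xff, 0x00, 0x80]);

// base64: about 4 output characters per 3 input bytes, safe for text channels.
const asBase64 = payload.toString("base64");
const fromBase64 = Buffer.from(asBase64, "base64");

// hex: 2 characters per byte, easy to read when debugging.
const asHex = payload.toString("hex");
const fromHex = Buffer.from(asHex, "hex");

// latin1: one character per byte (code points 0 to 255).
const asLatin1 = payload.toString("latin1");
const fromLatin1 = Buffer.from(asLatin1, "latin1");

console.log(asHex);                       // 25504446ff0080
console.log(payload.equals(fromBase64));  // true
console.log(payload.equals(fromHex));     // true
console.log(payload.equals(fromLatin1));  // true
```

When no string is actually needed, keeping the data as a Buffer throughout avoids the question entirely.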

@tniessen tniessen closed this as completed Jan 2, 2022
@tniessen tniessen changed the title from "BUG in PDF conversion from utf8 string to buffer again" to "Converting binary data to UTF-8 changes the data" Jan 2, 2022