Allow arbitrary binary data for triggerDataEvent/onData #2326
Am I reading this right that there is no way to fix #1962 without something like this? It would really complicate the simple example if there were 2 data events.
@Tyriar Thus my proposal to remove string output completely from the attach addon here #2295 (comment). We could just spit out binary data always, and do the string to Uint8Array conversion with a TextEncoder instance.
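A minimal sketch of that idea for the attach addon - `socket` stands for whatever transport is used, and the wiring is just an illustration:

```typescript
// Sketch: always emit binary frames from the attach addon.
const encoder = new TextEncoder();

term.onData((data: string) => {
  // convert the UTF-16 JS string to UTF-8 bytes before sending
  socket.send(encoder.encode(data));
});
```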
😕 I think it would be especially confusing to the layman how ...
Yeah, building a similar interface to node's Buffer API might help here. I agree, currently we need better naming to clearly separate things here.
Proposal for a better naming scheme:

```typescript
/**
 * interfaces
 */

// data input

// data is treated as UTF16 input by StringDecoder
write(data: string, cb?: () => void);
// bytes will be decoded with given encoding (currently we have only Utf8Decoder)
writeBytes(data: Uint8Array, encoding: 'UTF-8' | 'binary' | 'UTF-16', cb?: () => void);

// data output

// data should be treated as given by encoding
onData(handler: (data: string, encoding: 'UTF-8' | 'binary') => void);
// already decoded bytes, no further encoding hint needed
onDataBytes(handler: (data: Uint8Array) => void);

/**
 * examples
 */

// usage of write
pty.setEncoding('UTF-8'); // translates raw bytes into JS string applying UTF-8
pty.on('data', data => term.write(data)); // string itself is UTF-16

// usage of writeBytes:
pty.setEncoding(null); // no auto decoding to JS string by node-pty
pty.on('data', data => term.writeBytes(data, 'UTF-8')); // assuming the OS pty sends UTF-8
pty.on('data', data => term.writeBytes(data, 'UTF-16')); // assuming the OS pty sends UTF-16
// with this we could theoretically support ancient encodings as well
pty.on('data', data => term.writeBytes(data, 'ancientXY'));

// usage of onData:
term.onData((data, encoding) => pty.write(Buffer.from(data, encoding)));

// usage of onDataBytes:
term.onDataBytes(data => pty.write(data));
```

As said above, not sure if we need a ...
To get a better overview of what's going on:

And here is the idea of how it could look with an IOService that hides/abstracts the encoding nightmare:

This would also work nicely together with the flow control stuff and the thin protocol needed for websockets (#2295):

Summary: To be able to send/receive arbitrary byte sequences we need to get rid of the implicit string conversions done by ...
Now the question is - what would an IOService look like? Basically it would incorporate our current IO methods and events plus the needed Encoders/Decoders we want to support. Furthermore I think we should move our ...

```typescript
// public
interface IOService {
  // encoding of incoming bytes
  setInputEncoding(encoding: InputEncoding): void;
  // encoding of outgoing string messages
  setOutputEncoding(encoding: OutputEncoding): void;
  // write some bytes to the terminal (conversion: inputEncoding --> UTF32)
  write(data: Uint8Array, cb?: () => void);
  // convenience method to write string data (assumes UTF16)
  writeString(data: string, cb?: () => void);
  // listen to bytes sent from the terminal (conversion: InternalFormat --> outputEncoding)
  // contains all data (string and binary)
  onData: IEvent<Uint8Array>;
  // listen to strings sent from the terminal
  // only contains string data
  onDataString: IEvent<string>;
}
```
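As a usage sketch for the proposed interface (assuming, as in the examples above, that node-pty runs without auto decoding and accepts buffers on write):

```typescript
// Sketch: wiring the proposed IOService to node-pty (proposal names, not an existing API).
io.setInputEncoding('UTF-8');   // pty --> terminal: incoming bytes decoded as UTF-8
io.setOutputEncoding('UTF-8');  // terminal --> pty: outgoing data encoded as UTF-8

pty.setEncoding(null);                                 // no auto decoding to JS strings
pty.on('data', (chunk: Uint8Array) => io.write(chunk));
io.onData(bytes => pty.write(Buffer.from(bytes)));     // all data, string and binary
```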
```typescript
// internally:

// input
write(...) {
  // buffering/batching
  ...
  // async
  setTimeout(() => {
    // conversion: inputEncoding --> UTF32
    // parse chunk
    // callback
  }, 0);
}

// output
triggerDataEvent(data: InternalFormat) {
  if (hasOnDataListeners) {
    // conversion: InternalFormat --> outputEncoding
  }
  if (hasOnDataStringListeners) {
    // conversion: InternalFormat --> string
  }
}
```
The internal format is easy to find - currently we use 100% string for that (not being able to annotate raw binary), thus we could stick with string and add a "binary" representation on top. Let's call it InternalString:

```typescript
interface InternalString {
  content: string;
  binary: boolean;
}

// onData/triggerDataEvent:
...
let result: Uint8Array;
if (data.binary) {
  // lets stick with "binary string" for binary types
  // e.g. only charCodes 0-255 are allowed
  result = new Uint8Array(data.content.length);
  for (let i = 0; i < data.content.length; ++i) {
    result[i] = data.content.charCodeAt(i) & 255;
  }
} else {
  // if data is string data (not binary), use UTF8
  // or any other supported output encoding
  // (prolly selfwritten since W3C removed all encodings
  // beside UTF8 from TextEncoder)
  const te = new TextEncoder();
  result = te.encode(data.content);
}
...
```

A lot to deal with 😸 ...
Resolved by #2566.
While poking around in the mouse realms I stumbled again over our UTF8 limitation with `onData`.

Some background

`onData` currently uses a string to represent data that is going to be sent to the pty. A JS string is UTF16/UCS-2 encoded, which can hold values up to 0xFFFF (2-byte "characters"). On byte level a JS string `'\x01\x02'` will look like `[0x00, 0x01, 0x00, 0x02]` (UTF-16BE). On the contrary, the OS pty is a raw byte sink, thus it swallows whatever byte sequence it is fed. The slave app typically expects UTF-8 encoding for string content, otherwise raw bytes.
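For illustration, the code-unit view of such a string can be inspected directly (plain JS, nothing xterm.js-specific):

```typescript
// One 16-bit code unit per "character" of a JS string.
const s = '\x01\x02';
console.log([...s].map(c => c.charCodeAt(0))); // [ 1, 2 ] --> bytes 00 01 00 02 in UTF-16BE
```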
Now what will happen if we feed `'\x01\x02'` with `onData` to the pty?

Short answer: It writes `[0x01, 0x02]`, prolly as expected in the first place.

Long answer: It depends on the encoding applied. By default JS strings are seen as UTF-8 content, thus get encoded/decoded with UTF-8 at "raw byte" interfaces. UTF-16 and UTF-8 are stable for ASCII chars, thus '\x01' (UTF-16) maps to '\x01' (UTF-8). On byte/memory level the differences are those padding bytes.
Now what will happen if we feed `'\x80\x81'` with `onData` to the pty?

Same as above, the string gets UTF-8 encoded - `[0xc2, 0x80, 0xc2, 0x81]`. Here the UTF-8 multibyte encoding kicks in. That's perfectly fine as long as we meant to send those as string characters.
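A quick sketch of that encoding step with a plain TextEncoder (runs in any modern browser or Node):

```typescript
// TextEncoder always produces UTF-8 - the same conversion onData consumers apply implicitly.
const encoder = new TextEncoder();

console.log(encoder.encode('\x01\x02')); // Uint8Array [ 1, 2 ]                - ASCII survives
console.log(encoder.encode('\x80\x81')); // Uint8Array [ 194, 128, 194, 129 ]  - 0xc2 0x80 0xc2 0x81
```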
What is the actual problem?

We cannot send arbitrary raw byte values up to 255 (8 bit); we are limited to 127 (7 bit).
Do we need to send arbitrary bytes at all?
Currently we have one use case for them - X10 default mode mouse encoding. In general I think we should not be limited by default encoding assumptions at all, since an OS pty is still a raw byte sink.
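To make the X10 case concrete, here is a rough sketch of how such a report is assembled (the helper name and button handling are simplified for illustration):

```typescript
// X10 mouse report: ESC [ M Cb Cx Cy, each value offset by 32.
// A click at column 150 yields the coordinate byte 150 + 32 = 182 (0xb6),
// which is > 127 and cannot pass through an implicit UTF-8 encoding as a
// single raw byte (it would turn into 0xc2 0xb6 on the wire).
function mouseReport(button: number, col: number, row: number): string {
  return '\x1b[M'
    + String.fromCharCode(32 + button)
    + String.fromCharCode(32 + col)
    + String.fromCharCode(32 + row);
}
```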
How to fix this?
To get away with only small API changes I propose to extend `triggerDataEvent`/`onData` with an optional flag indicating whether the data is meant as raw binary data (a rough sketch follows below).

Now if `binary=true`, the data string should be a "binary string" containing only values from 0-255. A consumer of `onData` must not apply UTF-8 encoding to it, e.g. node-pty should not feed it as a string to the pty, but send a decoded buffer instead: `pty.write(Buffer.from(string, 'binary'))`. Same for the attach addon - it should not apply the default string send method, but needs to escape the data somehow (my tiny protocol in #2295 can be used for this).

Yes, the UTF-8 auto-assumption is everywhere and hard to overcome.
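A minimal sketch of what the flagged variant could look like - the exact signature here is an illustration, not the final API:

```typescript
// Sketch only: optional binary flag on the data event.
// emit side (xterm.js internals):
//   triggerDataEvent(data: string, binary: boolean = false): void;

// consumer side (e.g. a node-pty integration):
term.onData((data: string, binary?: boolean) => {
  // binary data carries raw byte values 0-255 and must bypass UTF-8 encoding
  pty.write(binary ? Buffer.from(data, 'binary') : data);
});
```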
Edit: We could also keep string and binary data separated, as we do for the input stuff (a sketch of the idea follows below), which would save a few string to Uint8Array conversions, but bloat the API further.
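For illustration, the separated variant could look roughly like this (event names are placeholders):

```typescript
// Sketch: keep string and binary output on separate events.
interface ISeparatedOutput {
  onData: IEvent<string>;     // regular string data, encoded to UTF-8 by the consumer
  onBinary: IEvent<string>;   // "binary string" data, char codes 0-255 only
}

// consumer side:
term.onData(data => pty.write(data));
term.onBinary(data => pty.write(Buffer.from(data, 'binary')));
```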