
Allow arbitrary binary data for triggerDataEvent/onData #2326

Closed
jerch opened this issue Jul 20, 2019 · 8 comments
Labels
area/api type/bug Something is misbehaving type/enhancement Features or improvements to existing features

Comments

@jerch
Member

jerch commented Jul 20, 2019

While poking around in the mouse handling I stumbled again over our UTF-8 limitation with onData.

Some background
onData currently uses a string to represent data that is going to be sent to the pty. A JS string is UTF-16 encoded, holding 16-bit code units up to 0xFFFF (2-byte "characters"). On the byte level the JS string '\x01\x02' looks like [0x00, 0x01, 0x00, 0x02] (UTF-16BE).
By contrast, the OS pty is a raw byte sink and swallows whatever byte sequence it is fed. The slave app typically expects UTF-8 encoding for string content, and raw bytes otherwise.

Now what will happen if we feed '\x01\x02' with onData to the pty?

Short answer: It writes [0x01, 0x02], probably as expected in the first place.
Long answer: It depends on the encoding applied. By default JS strings are treated as UTF-8 content, thus get encoded/decoded with UTF-8 at "raw byte" interfaces. UTF-16 and UTF-8 agree for ASCII chars, thus '\x01' (UTF-16) maps to '\x01' (UTF-8). On the byte/memory level the only difference is the padding bytes.

Now what will happen if we feed '\x80\x81' with onData to the pty?

Same as above, the string gets UTF-8 encoded - [0xc2, 0x80, 0xc2, 0x81]. Here the UTF-8 multibyte encoding kicks in. That's perfectly fine as long as we meant to send those as string characters.
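Both cases can be checked with a standard TextEncoder (a sketch; TextEncoder is available in browsers and modern Node):

```javascript
// UTF-8 encoding of JS strings at a "raw byte" boundary.
const te = new TextEncoder();

// ASCII range: code units 0x00-0x7F map 1:1 to single bytes.
const ascii = te.encode('\x01\x02');
console.log(Array.from(ascii));  // [1, 2]

// Beyond 0x7F: UTF-8 multibyte encoding kicks in.
const high = te.encode('\x80\x81');
console.log(Array.from(high));   // [194, 128, 194, 129]
```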

What is the actual problem?
We cannot send arbitrary raw bytes up to 255 (8bit). We are limited to 127 (7bit).

Do we need to send arbitrary bytes at all?
Currently we have one use case for them - the X10 default mode mouse encoding. In general I think we should not be limited by default encoding assumptions at all, since an OS pty is still a raw byte sink.
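To illustrate why X10 mouse reports hit the limit: the report encodes button and coordinates as single bytes offset by 32, so any column beyond 95 already produces a byte above 127 (a sketch of the encoding scheme, not xterm.js internals):

```javascript
// X10 mouse report: ESC [ M Cb Cx Cy, each value offset by 32.
function x10Report(button, col, row) {
  return [0x1b, 0x5b, 0x4d, 32 + button, 32 + col, 32 + row];
}

// Left click (button 0) at column 200, row 10:
const report = x10Report(0, 200, 10);
console.log(report);  // [27, 91, 77, 32, 232, 42]
// 232 > 127 - this byte cannot pass a UTF-8 round trip unchanged.
```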

How to fix this?
To get away with only small API changes I propose to extend triggerDataEvent/onData with an optional flag indicating whether the data is meant as raw binary data:

public triggerDataEvent(
  data: string,
  binary: boolean = false,
  wasUserInput: boolean = false): void
{
  ...
  this._onData.fire(data, binary);
}
...
onData(): IEvent<string, boolean>;

Now if binary=true the data string should be a "binary string" containing only values from 0-255. A consumer of onData must not apply UTF-8 encoding to it; e.g. node-pty should not feed it as a string to the pty, but send a decoded buffer instead: pty.write(Buffer.from(string, 'binary')). Same for the attach addon - it should not apply the default string send method, but needs to escape the data somehow (my tiny protocol in #2295 can be used for this).
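A quick sketch of why the flag matters on the consumer side: Node's 'binary' (latin1) encoding maps charCodes 0-255 to bytes 1:1, while the default UTF-8 path inflates everything above 0x7F (the onData handler at the end is hypothetical):

```javascript
// A "binary string" carrying the raw byte values 0x80 0x81.
const data = '\x80\x81';

// Wrong for raw bytes: default UTF-8 encoding inflates each value.
console.log(Array.from(Buffer.from(data, 'utf8')));    // [194, 128, 194, 129]

// Right for raw bytes: 'binary' (latin1) maps charCodes 0-255 to single bytes.
console.log(Array.from(Buffer.from(data, 'binary')));  // [128, 129]

// Hypothetical consumer wiring with the proposed flag:
// term.onData((data, binary) =>
//   pty.write(binary ? Buffer.from(data, 'binary') : data));
```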

Yes, the UTF-8 auto-assumption is everywhere and hard to overcome.

Edit: We could also keep string and binary data separated as we do for the input stuff like this:

public triggerBinaryDataEvent(data: Uint8Array, wasUserInput: boolean = false): void {
  ...
  this._onBinaryData.fire(data);
}
...
onBinaryData(): IEvent<Uint8Array>;

which would save a few string to Uint8Array conversions, but bloat the API further.

@jerch jerch added type/bug Something is misbehaving type/enhancement Features or improvements to existing features area/api labels Jul 20, 2019
This was referenced Jul 21, 2019
@Tyriar
Member

Tyriar commented Jul 23, 2019

Am I reading this right that there is no way to fix #1962 without something like this? It would really complicate the simple example if there were 2 data events.

@jerch
Member Author

jerch commented Jul 23, 2019

@Tyriar
Yes, and even worse - any integration would have to make the distinction down to the pty whether UTF-8 encoding/decoding can be applied or not.*
To avoid complicating the API further I think we could go with the binary flag extension; the costs of the additional buffer <--> string transitions are not that high. (And binary data will not occur often.)

(*) Thus my proposal to remove string output completely from the attach addon here #2295 (comment). We could just spit out binary data always, and do the string to Uint8Array conversion with a TextEncoder instance.

@Tyriar
Member

Tyriar commented Jul 23, 2019

😕

I think it would be especially confusing to the layman how writeUtf8 and the binary onData are different. Could we learn something from how node.js' Buffer APIs are structured? https://nodejs.org/api/buffer.html#buffer_class_method_buffer_from_string_encoding

@jerch
Member Author

jerch commented Jul 23, 2019

Yeah, building an interface similar to node's Buffer API might help here. Currently onData and writeUtf8 are not linked to each other (one deals with output, the other with input). The naming is confusing, I agree. In fact writeUtf8 means "write binary data as Uint8Array that is guaranteed to be UTF-8 compliant". Our write method means "write JS string data that is guaranteed to be UTF-16 compliant". The tricky part for the input is: writeUtf8 is the only method allowing non-UTF-8 bytes to be sent (since it operates on raw bytes), but those would be cut off by the input decoder. Thus the name is a bit misleading.

I agree, we need better naming to clearly separate things here.

@jerch
Member Author

jerch commented Jul 23, 2019

Proposal for a better naming scheme:

/**
 * interfaces
 */
// data input
// data is treated as UTF16 input by StringDecoder
write(data: string, cb?: () => void);
// bytes will be decoded with given encoding (currently we have only Utf8Decoder)
writeBytes(data: Uint8Array, encoding: 'UTF-8' | 'binary' | 'UTF-16', cb?: () => void);


// data output
// data should be treated as given by encoding
onData(handler: (data: string, encoding: 'UTF-8' | 'binary') => void);
// already decoded bytes, no further encoding hint needed
onDataBytes(handler: (data: Uint8Array) => void);

/**
 * examples
 */
// usage of write
pty.setEncoding('UTF-8');  // translates raw bytes into JS string applying UTF-8
pty.on('data', data => term.write(data));  // string itself is UTF-16

// usage of writeBytes:
pty.setEncoding(null);  // no auto decoding to JS string by node-pty
pty.on('data', data => term.writeBytes(data, 'UTF-8'));  // assuming the OS pty sends UTF-8
pty.on('data', data => term.writeBytes(data, 'UTF-16')); // assuming the OS pty sends UTF-16
// with this we could theoretically support ancient encodings as well
pty.on('data', data => term.writeBytes(data, 'ancientXY'));


// usage of onData:
term.onData((data, encoding) => pty.write(Buffer.from(data, encoding)));

// usage of onDataBytes:
term.onDataBytes(data => pty.write(data));

As said above, not sure if we need an onDataBytes thingy, though it would save the additional conversion needed in the onData handler. (For the attach addon this would further save the base64 encoding on the data string.) For output only one event is needed; we could also drop the string support and go with bytes only (imho better, since the string variant would be a chimera where the string has different meanings and content restrictions depending on the encoding hint).

@jerch
Member Author

jerch commented Jul 24, 2019

To get a better overview of what's going on:

Current default deployment:

    +-----+         +---------+         +----------+          +------------+
    |     +--------->         +--------->          +---------->            |
    | APP |  bytes  |   PTY   |  bytes  | node-pty |  string  |  xterm.js  |
    |     <---------+         <---------+          <----------+            |
    +-----+         +---------+         +----------+          +------------+

expects/sends string                 data handled as UTF8      in:  converts data
data as raw bytes                    (UTF8 <--> string)             to UTF32 (StringDecoder)
in locale encoding                                             out: sends JS string
(99% UTF8)


With writeUtf8:

    +-----+         +---------+         +----------+  bytes   +------------+
    |     +--------->         +--------->          +---------->            |
    | APP |  bytes  |   PTY   |  bytes  | node-pty |          |  xterm.js  |
    |     <---------+         <---------+          <----------+            |
    +-----+         +---------+         +----------+  string  +------------+


expects/sends string                   ---> unchanged          in:  converts data
data as raw bytes                      <--- UTF8 encoded            to UTF32 (Utf8Decoder)
in locale encoding                                             out: sends JS string
(99% UTF8)

And here the idea how it could look with an IOService that hides/abstracts the encoding nightmare:

IOService proposal:
                                                                        +---------------+
    +-----+         +---------+         +----------+         +-------------+            |
    |     +--------->         +--------->          +--------->             |            |
    | APP |  bytes  |   PTY   |  bytes  | node-pty |  bytes  |  IOService  |  xterm.js  |
    |     <---------+         <---------+          <---------+             |            |
    +-----+         +---------+         +----------+         +-------------+            |
                                                                        +---------------+

expects/sends string                   +--> unchanged        in:  converts bytes to UTF32
data as raw bytes                      <--+ unchanged             (based on InputEncoding)
in locale encoding                                           out: converts string to bytes
(99% UTF8)                                                        (based on OutputEncoding)
                                                                  writes raw bytes directly

This would also work nicely together with the flow control stuff and the thin protocol needed for websockets (#2295):

IOService with websocket and flow control:
                                                  ThinProtocol                                
    +-----+         +---------+         +----------+        +-------------+        +--
    |     +--------->         +--------->          +-------->             +-------->
    | APP |  bytes  |   PTY   |  bytes  | node-pty |  bytes |  WebSocket  |  bytes |
    |     <---------+         <---------+          <--------+  (binary)   <--------+
    +-----+         +---------+         +----------+        +---+---------+        +--
                                                                |                   
                                                                +---> other consumer (bytes)
expects/sends string
data as raw bytes                               content unchanged
in locale encoding                              protocol adds one byte
(99% UTF8)                                      to distinguish message types

Summary: To be able to send/receive arbitrary byte sequences we need to get rid of the implicit string conversions done by node-pty and the browser. The easiest way to achieve this is to turn all data transports into byte representations. It would be much harder to stick with the string transports, as we would have to circumvent those implicit conversions at every occurrence (e.g. by wrapping binary data into base64). For our demo this means we would have to escape these implicit conversion steps for outgoing data:

  • browser to WebSocket (string messages are UTF8 encoded)
  • WebSocket to string repr in node-pty
  • string repr in node-pty to native pty
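If one did stick with string transports, every binary payload would have to be wrapped at each of those hops, e.g. with base64 - a sketch of that round trip (note base64 adds roughly 33% size overhead per message):

```javascript
// Raw bytes that would not survive implicit UTF-8 string conversions.
const bytes = new Uint8Array([0x01, 0x80, 0xff]);

// Sender side: bytes --> base64 string (safe inside UTF-8 string transports).
const wire = Buffer.from(bytes).toString('base64');

// Receiver side: base64 string --> bytes, values arrive unchanged.
const restored = new Uint8Array(Buffer.from(wire, 'base64'));
console.log(Array.from(restored));  // [1, 128, 255]
```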

@jerch
Member Author

jerch commented Jul 24, 2019

Now the question is - how would an IOService look like? Basically it would incorporate our current IO methods and events plus the needed Encoders/Decoders we want to support. Furthermore I think we should move our writeBuffer batching from Terminal.ts into this service as well. Thus an API could look like this:

// public
interface IOService {

  // encoding of incoming bytes
  setInputEncoding(encoding: InputEncoding): void;

  // encoding of outgoing string messages
  setOutputEncoding(encoding: OutputEncoding): void;

  // write some bytes to the terminal (conversion: inputEncoding --> UTF32)
  write(data: Uint8Array, cb?: () => void);

  // convenient method to write string data (assumes UTF16)
  writeString(data: string, cb?: () => void);

  // listen to bytes sent from the terminal (conversion: InternalFormat --> outputEncoding)
  // contains all data (string and binary)
  onData: IEvent<Uint8Array>;

  // listen to strings sent from the terminal
  // only contains string data
  onDataString: IEvent<string>;
}

// internally:
// input
write(...) {
  // buffering/batching
  ...
  // async
  setTimeout(() => {
    // conversion: inputEncoding --> UTF32
    // parse chunk
    // callback
  }, 0);
}

// output
triggerDataEvent(data: InternalFormat) {
  if (hasOnDataListeners) {
    // conversion: InternalFormat --> outputEncoding
  }
  if (hasOnDataStringListeners) {
    // conversion: InternalFormat --> string
  }
}

The internal format is easy to find - currently we use 100% strings for that (not capable of annotating raw binary), thus we could stick with string and add a "binary" representation on top. Let's call it InternalString for a moment. It's important to note that InternalString is not equivalent to a normal JS string due to the added binary representation, which has to be processed differently during the onData event creation:

interface InternalString {
  content: string;
  binary: boolean;
}

// onData/triggerDataEvent:
  ...
  let result: Uint8Array;
  if (data.binary) {
    // let's stick with "binary string" for binary types,
    // i.e. only charCodes 0-255 are allowed
    result = new Uint8Array(data.content.length);
    for (let i = 0; i < data.content.length; ++i) {
      result[i] = data.content.charCodeAt(i) & 255;
    }
  } else {
    // if data is string data (not binary), use UTF-8
    // or any other supported output encoding
    // (probably self-written, since W3C removed all encodings
    // besides UTF-8 from TextEncoder)
    const te = new TextEncoder();
    result = te.encode(data.content);
  }
  ...
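The conversion above as a self-contained sketch (InternalString shape as defined above, function name hypothetical):

```javascript
// Convert an InternalString ({content, binary}) into the bytes for onData.
function toBytes(data) {
  if (data.binary) {
    // "binary string": each charCode holds one raw byte value 0-255
    const result = new Uint8Array(data.content.length);
    for (let i = 0; i < data.content.length; ++i) {
      result[i] = data.content.charCodeAt(i) & 255;
    }
    return result;
  }
  // regular string data: apply UTF-8 via TextEncoder
  return new TextEncoder().encode(data.content);
}

console.log(Array.from(toBytes({content: '\x80', binary: true})));   // [128]
console.log(Array.from(toBytes({content: '\x80', binary: false})));  // [194, 128]
```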

A lot to deal with 😸 ...

@jerch jerch mentioned this issue Aug 7, 2019
@jerch
Member Author

jerch commented Feb 1, 2021

Resolved by #2566.

@jerch jerch closed this as completed Feb 1, 2021