Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Restyling wrapped strings #64

Closed
krlmlr opened this issue Mar 28, 2020 · 4 comments
Closed

Restyling wrapped strings #64

krlmlr opened this issue Mar 28, 2020 · 4 comments

Comments

@krlmlr
Copy link

krlmlr commented Mar 28, 2020

strwrap_ctl() appears to add escape sequences that clear style. This has led to weird printing in tibble.

What's the reason for adding \033[0m and \033[1m before and after newlines? What breaks if these sequences are omitted? Could strwrap_ctl() be more careful about inserting the "right" escape sequences at EOL?

The output of the first example, sent to cat(), looks great in both RStudio and in my terminal.

I have worked around in tibble, this may be too much of a corner case to worry about at all. For now we could add something like this to the documentation:

The output of strwrap_ctl() is intended to use "as is", because each line ends with a sequence that turns off all formatting applied previously. If you intend to apply further formatting to the wrapped output, make sure to apply the formatting to each line individually, and not to the entire result collapsed by e.g. paste(..., collapse = "\n").

library(fansi)
library(crayon)

options(crayon.enabled = TRUE)

split <- function(x) {
  unlist(strsplit(x, "\n", fixed = TRUE))
}

collapse <- function(x) {
  paste(x, collapse = "\n")
}

split(green(bold("xxx\nyyy")))
#> [1] "\033[32m\033[1mxxx"  "yyy\033[22m\033[39m"
split(green(collapse(strwrap_ctl(bold("xxx yyy"), width = 4))))
#> [1] "\033[32m\033[1mxxx\033[0m"  "\033[1myyy\033[22m\033[39m"
split(green(collapse(strwrap2_ctl(bold("xxx yyy"), width = 4))))
#> [1] "\033[32m\033[1mxxx\033[0m"  "\033[1myyy\033[22m\033[39m"

Created on 2020-03-28 by the reprex package (v0.3.0)

Follow-up to tidyverse/tibble#574 (comment).

@brodieG
Copy link
Owner

brodieG commented Mar 28, 2020

I do agree that at a minimum this should be clearly documented.

IIRC I add the "\033[m" for two reasons. First weird things happen when you start adding newlines in the middle of SGR formatted strings:

image

This is actually failing in a different way than I remember last where IIRC the color bled past the newline, not the last line.

Additionally, what is the desired outcome of:

writeLines(paste("| ", strwrap_ctl("\033[42mhello world\033[m", 5), " |"))

Without the closing escape the green would just bleed out. I think you could make the case for that being the correct outcome, but I find it less useful. A big use case of strwrap_ctl is to make narrower text column that can then be pasted with other text columns with or without formatting, and prevent the formatting from cross contaminating across columns. We can't just direct the user to manually add closing escapes between columns because they also need to resume the color for the column on the next line. Having completely self encapsulated lines as strwrap_ctl produces makes it easy to combine columns.

Note that the additional "\033[1m" you ask about is the resumption of the bolding in the subsequent line.

I think part of the problem is that the crayon model tries to mimic XML-like nested semantics (i.e. green(xxx) -> <span style='color: green'>xxx</span>) when in reality ANSI CSI SGR is a state based system so the metaphor and the expectations that come with it fails in cases like these. IIRC crayon does quite a bit internally to try to perpetuate the illusion.

We could try to be more selective in how we terminate the lines, i.e. instead of terminating all styles we can just terminate the styles that are known to be active, but that will require messing with C code that allocates memory for strings, which is not too bad but not something I'm particularly eager to do at the moment. I think if we do this crayon styles will have a better chance of applying as expected. Another option is for crayon to recognize "\033[m" and do the same thing it does for "\033[39m" and similar. This is probably less work (though still some) and if deemed important enough I could submit a PR to crayon.

@brodieG brodieG added this to the 0.4.2 milestone Mar 28, 2020
@brodieG
Copy link
Owner

brodieG commented Apr 5, 2020

Thinking about this more I think the "right" solution is to have fansi terminate only the active styles, and not all styles. Then at least crayon will behave more as expected. There might be some issues with some styles that don't have explicit terminators (and I think that's part of the reason I use \033[m, but I think those are rare).

@brodieG
Copy link
Owner

brodieG commented May 24, 2021

strtrim_ctl relies on strwrap_ctl, so two birds with one stone on that one.

substr_ctl does the termination in R, so we may need an interface to the closing string determination code for that. strsplit_ctl relies on substr_ctl (although it also has dormant C code).

Update: unfortunately I ran out of time due to a CRAN Fix or Else deadline, so this will have to wait for next release.

@brodieG
Copy link
Owner

brodieG commented Jun 6, 2021

@krlmlr I put in an experimental solution in the development branch:

image

normalize takes all SGR sequences and decomposes them into the simplest form, which then makes them compatible with crayon. Most fansi functions gain the normalize parameter, but there is also normalize_sgr which takes any arbitrary string and normalizes the SGR sequences in it, e.g.:

> normalize_sgr("\033[4;44mhello\033[mworld")
[1] "\033[4m\033[44mhello\033[24m\033[49mworld"

@gaborcsardi, this might be of interest as it makes it possible to combine arbitrary SGR strings with crayon, and not just those that are generated by crayon or programs that are careful to always pair the opening sequence with the corresponding closing one.

This is still a bit experimental and probably won't make it onto CRAN for a few weeks, so if you have any feedback let me know.

netbsd-srcmastr pushed a commit to NetBSD/pkgsrc that referenced this issue Jun 18, 2023
# fansi Release Notes

## v1.0.4

CRAN compiled code warning suppression release.

* Fix void function declarations and definitions.
* Change `sprintf` to `snprintf`.

## v1.0.3

* Address problem uncovered by gcc-12 linters, although the issue itself could
  not manifest due to redundancy of checks in the code.

## v1.0.0-2

This is a major release and includes some behavior changes.

### Features

* New functions:
    * [#26](brodieG/fansi#26) Replacement forms of
      `substr_cl` (i.e `substr_ctl<-`).
    * `state_at_end` to compute active state at end of a string.
    * `close_state` to generate a closing sequence given an active state.
    * [#31](brodieG/fansi#31) `trimws_ctl` as an
      equivalent to `trimws`.
    * [#64](brodieG/fansi#64) `normalize_sgr` converts
      compound _Control Sequences_ into normalized form (e.g. "ESC[44;31m"
      becomes "ESC[31mESC[44m") for better compatibility with
      [`crayon`](https://github.com/r-lib/crayon).  Additionally, most functions
      gain a `normalize` parameter so that they may return their output in
      normalized form (h/t @krlmlr).
* [#74](https://github.com/brodieG/fansi/issues/74)`substr_ctl` and related
  functions are now all-C instead of a combination of C offset computations and
  R level `substr` operations.  This greatly improves performance, particularly
  for vectors with many distinct strings.  Despite documentation claiming
  otherwise, `substr_ctl` was quite slow in that case.
* [#66](brodieG/fansi#66) Improved grapheme support,
  including accounting for them in `type="width"` mode, as well as a
  `type="graphemes"` mode to measure in graphemes instead of characters.
  Implementation is based on heuristics designed to work in most common use
  cases.
* `html_esc` gains a `what` parameter to indicate which HTML special characters
  should be escaped.
* Many functions gain `carry` and `terminate` parameters to control how `fansi`
  generated substrings interact with surrounding formats.
* [#71](brodieG/fansi#71) Functions that write SGR and
  OSC are now more parsimonious (see "Behavior Changes" below).
* [#73](brodieG/fansi#73) Default parameter values
  retrieved with `getOption` now always have explicit fallback values defined
  (h/t @gadenbui).
* Better warnings and error messages, including more granular messages for
  `unhandled_ctl` for adjacent _Control Sequences_.
* `term.cap` parameter now accepts "all" as value, like the `ctl` parameter.

### Deprecated Functions

* All the "sgr" functions (e.g., `substr_sgr`, `strwrap_sgr`) are deprecated.
  They will likely live on indefinitely, but they are of limited usefulness and
  with the added support for OSC hyperlinks their name is misleading.
* `sgr_to_html` is now `to_html` with slight modifications to semantics; the old
  function remains and does not warn about unescaped "<" or ">" in the
  input string.

### Behavior Changes

The major intentional behavior change is to default `fansi` to always recognize
true color CSI SGR sequences (e.g. `"ESC[38;2;128;50;245m"`).  The prior
default was to match the active terminal capabilities, but it is unlikely that
the intent of a user manipulating a string with truecolor sequences is to
interpret them incorrectly, even if their terminal does.  `fansi` will continue
to warn in this case.  To keep the pre-1.0 behavior add `"old"` to the
`term.cap` parameter.

Additionally, `to_html` will now warn if it encounters unescaped HTML special
character "<" or ">" in the input string.

Finally, the 1.0 release is an extensive refactoring of many parts of the
SGR and OSC hyperlink controls (_Special Sequences_) intake and output
algorithms.  In some cases this means that some `fansi` functions will output
_Special Sequences_ slightly differently than they did before.  In almost all
cases the rendering of the output should remain unchanged, although there are
some corner cases with changes (e.g. in `strwrap_ctl` SGRs embedded in
whitespace sequences don't break the sequence).

The changes are a side effect of applying more consistent treatment of corner
cases around leading and trailing control sequences and (partially) invalid
control sequences.  Trailing _Special Sequences_ in the output is now omitted as
it would be immediately closed (assuming `terminate=TRUE`, the default).
Leading SGR is interpreted and re-output.

Normally output consistency alone would not be a reason to change behavior, but
in this case the changes should be almost always undetectable in the
**rendered** output, and maintaining old inconsistent behavior in the midst of a
complete refactoring of the internals was beyond my patience.  I apologize if
these behavior changes adversely affect your programs.

> WARNING: we will strive to keep rendered appearance of `fansi` outputs
> consistent across releases, but the exact bytes used in the output of _Special
> Sequences_ may change.

Other changes:

* Tests may no longer pass with R < 4.0 although the package should still
  function correctly.  This is primarily because of changes to the character
  width Unicode Database that ships with R, and many of the newly added grapheme
  tests touch parts of that database that changed (emoji).
* CSI sequences with more than one "intermediate" byte are now considered valid,
  even though they are likely to be very rare, and CSI sequences consume all
  subsequent bytes until a valid closing byte or end of string is encountered.
* `strip_ctl` only warns with malformed CSI and OSC if they are reported as
  supported via the `ctl` parameter.  If CSI and OSC are indicated as not
  supported, but two byte escapes are, the two initial bytes of CSI and OSCs
  will be stripped.
* "unknown" encoded strings are no longer translated to UTF-8 in UTF-8 locales
  (they are instead assumed to be UTF-8).
* `nchar_ctl` preserves `dim`, `dimnames`, and `names` as the base functions do.
* UTF-8 known to be invalid should not be output, even if present in input
  (UTF-8 validation is not complete, only sequences that are obviously wrong are
  detected).

### Bug Fixes

* Fix `tabs_as_spaces` to handle sequential tabs, and to perform better on very
  wide strings.
* Strings with invalid UTF-8 sequences with "unknown" declared encoding in UTF-8
  locales now cause errors instead of being silently translated into byte
  escaped versions (e.g. "\xf0\xc2" (2 bytes), used to be interpreted as
  "<f0><c2>" (four characters).  These now cause errors as they would have if
  they had had "UTF-8" declared encoding.
* In some cases true colors of form "38;2;x;x;x" and "48;2;x;x;x" would only be
  partially transcribed.

### Internal Changes

* More aggressive UTF-8 validation, also, invalid UTF-8 code points now advance
  only one byte instead of their putative width based on the initial byte.
* Reduce peak memory usage by making some intermediate buffers eligible for
  garbage collection prior to native code returning to R.
* Reworked internals to simplify buffer size computation and synchronization, in
  some cases this might cause slightly reduced performance.  Please report any
  significant performance regressions.
* `nchar_ctl(...)` is no longer a wrapper for `nchar(strip_ctl(...))` so that it
  may correctly support grapheme width calculations.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants