Error message pretty printing is unaware of width of a character #370

suhdonghwi · 2019-09-15T04:54:22Z

errorBundlePretty does not work properly if the input contains full-width characters.

The character pointer (^) should be pointing at '이', but it is pointing at the wrong position. Each full-width character should be replaced with two spaces, not one.


mrkkrp · 2019-09-15T10:49:37Z

Possibly related to #362.

mrkkrp · 2019-10-26T15:49:45Z

@simonvandel Can you please provide the example as unicode text, not as an image?

suhdonghwi · 2019-10-26T15:54:38Z

@mrkkrp Did you mean to refer to me? If so, the example text is "123 구구 이면".

mrkkrp · 2019-10-26T15:55:11Z

Yes, sorry. Thanks for the example!

mrkkrp · 2019-10-26T17:22:00Z

OK, let's see. You're saying that full width characters should be replaced with two spaces. What we have at this point is column position. Simply put, we insert as many spaces as necessary to reach the same column position. So are you of the opinion that we need to increment current column by 2 instead of 1 for full width characters? Is this something that software working with Korean characters usually does?

suhdonghwi · 2019-10-26T17:53:10Z

Yes, it is the usual practice, because in a monospaced font two half-width spaces and one full-width character (e.g. a Korean syllable) always have the same width. So I think it is safe to emit two spaces whenever a full-width character is encountered.
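To make the arithmetic concrete, here is a minimal Python sketch (not megaparsec code) that counts terminal columns for the example string. It assumes that characters whose East Asian Width property is `F` or `W` occupy two columns and everything else occupies one. The helper name `display_width` is made up for illustration.

```python
import unicodedata

def display_width(s: str) -> int:
    """Count columns, assuming Fullwidth/Wide characters take two and all others one."""
    return sum(2 if unicodedata.east_asian_width(ch) in ("F", "W") else 1 for ch in s)

line = "123 구구 이면"
idx = line.index("이")                    # number of characters before the target
prefix_cols = display_width(line[:idx])  # assumed number of columns before the target

print(idx, prefix_cols)
print(line)
print(" " * prefix_cols + "^")
```

Under that assumption the pointer needs more padding than the character count alone suggests, because each of the two '구' syllables contributes two columns.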

mrkkrp · 2019-10-26T17:56:21Z

What I'm more concerned about is that it may be confusing to see a position like 1:5 when you have only consumed two characters.

mrkkrp · 2019-10-26T18:05:35Z

Also need to find a good way to detect full-width characters.

suhdonghwi · 2019-10-26T18:15:31Z

I don't know how it works internally, so I can't really tell what the proper solution to that problem is. But maybe it is possible to make use of the full-width space character? It is a single space character, but full-width.

Regarding the detection of full-width characters, there is a method in the ICU library that checks whether a character is full-width or not. The ICU library itself is pretty heavy, though, so you could extract just that method from the library.

mrkkrp · 2019-10-26T18:21:18Z

> But maybe it is possible to make use of full-width space character?

Alas, by the time we're printing parse errors we do not have information about which characters were full-width and which ones were normal. As I mentioned, we only have the column position, so we have to work with that.

mrkkrp · 2019-10-27T13:05:40Z

I looked at this a bit today and I do not know how to check for full-width characters efficiently and without depending on a library. Even with text-icu, you can get a character's Unicode block and check whether it is HalfwidthAndFullwidthForms, but according to Wikipedia's page about this Unicode block (and as the name of the block suggests), those characters are not guaranteed to be full-width: the block is in fact a mix of the two.
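A quick way to see that mix is to look up two characters from that block (U+FF00..U+FFEF) with Python's standard `unicodedata` module. This is only an illustration; the two sample code points were picked by hand.

```python
import unicodedata

samples = ["\uFF21", "\uFF71"]  # both lie in the Halfwidth and Fullwidth Forms block

# Map each sample to its East Asian Width property value.
widths = {ch: unicodedata.east_asian_width(ch) for ch in samples}

for ch in samples:
    print(f"U+{ord(ch):04X}", unicodedata.name(ch), widths[ch])
```

The first character is reported as fullwidth (`F`) and the second as halfwidth (`H`), so membership in the block alone does not determine the width.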

ivan-timokhin · 2019-11-24T10:32:53Z

The relevant property here is probably EastAsianWidth. The detailed description of individual property values is given in UAX #11, but the high-level summary seems to be

| Attribute value | Width |
| --- | --- |
| `EANeutral` | narrow |
| `EAAmbiguous` | ambiguous |
| `EAHalf` | narrow |
| `EAFull` | wide |
| `EANarrow` | narrow |
| `EAWide` | wide |
| `EACount` | Not a valid value; probably an artefact of translation from C |

If you don't want to depend on text-icu, all the relevant information is contained in EastAsianWidth.txt, in a fairly straightforward format.
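As a rough sketch of how little parsing that file needs, the Python snippet below reads a few lines written in the style of EastAsianWidth.txt (a code point or `first..last` range, a semicolon, the property value, then a `#` comment). The `SAMPLE` text is a hand-written stand-in rather than the real file, and `parse_eaw` is a hypothetical helper name.

```python
SAMPLE = """\
# Hand-written sample in the style of EastAsianWidth.txt, not the real file
0020;Na          # Zs         SPACE
3000;F           # Zs         IDEOGRAPHIC SPACE
AC00..D7A3;W     # Lo [11172] HANGUL SYLLABLE GA..HANGUL SYLLABLE HIH
"""

def parse_eaw(text):
    """Return (first, last, property) tuples from EastAsianWidth.txt-style lines."""
    entries = []
    for raw in text.splitlines():
        data = raw.split("#", 1)[0].strip()  # drop the trailing comment
        if not data:
            continue                         # skip blank and comment-only lines
        code_points, prop = (field.strip() for field in data.split(";"))
        first, _, last = code_points.partition("..")
        entries.append((int(first, 16), int(last or first, 16), prop))
    return entries

# Keep only the ranges that would be treated as wide.
wide = [(lo, hi) for lo, hi, prop in parse_eaw(SAMPLE) if prop in ("F", "W")]
print(wide)
```

Stripping whitespace around each field means the same loop tolerates optional spaces around the semicolon.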

obfusk · 2020-05-10T19:06:12Z

Using text-icu, a fix for errorBundlePretty seems pretty easy:

Change

rpadding = replicate rpshift ' '

to

rpadding = [ if isWide c then '　' else ' ' | c <- take rpshift sline ]

Using

import Data.Text.ICU.Char (property, EastAsianWidth(..), EastAsianWidth_(..))
isWide c = property EastAsianWidth c `elem` [EAFull, EAWide]

I'd be happy to make a PR for that.
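For readers who do not have text-icu at hand, the same padding rule can be sketched in Python with the standard `unicodedata` module. This only illustrates the idea above and is not megaparsec code: `pointer_padding` is a hypothetical name, and `shift` plays the role of `rpshift` (the number of characters before the pointer).

```python
import unicodedata

def is_wide(ch: str) -> bool:
    """True for characters whose East Asian Width is Fullwidth or Wide."""
    return unicodedata.east_asian_width(ch) in ("F", "W")

def pointer_padding(line: str, shift: int) -> str:
    """Emit U+3000 for each wide character in the prefix and an ASCII space otherwise."""
    return "".join("\u3000" if is_wide(ch) else " " for ch in line[:shift])

line = "123 구구 이면"
print(line)
print(pointer_padding(line, 7) + "^")
```

Because the padding keeps one character per source character, the reported column can stay a plain character count while the pointer still lines up, provided the terminal font renders U+3000 as two columns.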

mrkkrp · 2020-05-10T20:45:42Z

The question is only whether or not adding an extra dependency is worth it. In the past people were really upset about each and every extra dependency.

recursion-ninja · 2020-05-10T22:38:37Z

I'd prefer the extra dependency if it means better Unicode support in the error message rendering. One of the main attractions of megaparsec (for me and my team) is the high quality, effortless error messages.

obfusk · 2020-05-10T22:54:40Z

I get that. In that case I see 4 options:

1. ~~don't do anything~~
2. add the text-icu dependency (easy but not ideal)
3. add code for EAW directly (ugly)
4. add an errorBundlePrettyThatHandlesCharacterWidth to the API that adds the isWide predicate as an extra parameter to errorBundlePretty (also not ideal but allows downstream users like me who don't mind depending on text-icu to handle these cases fairly easily)

I'd be happy to help with any of those :)

obfusk · 2020-09-07T13:29:44Z

I wrote a few lines of Python to generate the relevant code point ranges:

#!/usr/bin/python3
import itertools as IT, unicodedata as UD

def grouper(iterable, n):
  args = [iter(iterable)] * n
  return IT.zip_longest(*args, fillvalue = None)

ranges, start = [], None
for i in range(0x110000):  # cover every code point up to and including U+10FFFF
  if UD.category(chr(i)) != "Cn" and UD.east_asian_width(chr(i)) in "FW":
    if start is None: start = i
  else:
    if start is not None:
      ranges.append((start, i-1))
      start = None

print("(0, {}) [".format(len(ranges)-1))
print("    " + ",\n    ".join(
  ", ".join( "(0x{:06x}, 0x{:06x})".format(*r) for r in g if r )
  for g in grouper(ranges, 3)
))
print("  ]")

which then results in this bit of Haskell:

#!/usr/bin/runhaskell

import Data.Char (ord)
import Data.Array (Array, Ix, (!), listArray, bounds)
import System.Environment (getArgs)

isWide :: Char -> Bool
isWide c = go $ bounds wideRanges
  where
    go (lo, hi)
        | hi < lo           = False
        | a <= n && n <= b  = True
        | n < a             = go (lo, pred mid)
        | otherwise         = go (succ mid, hi)
      where
        mid     = (lo + hi) `div` 2
        (a, b)  = wideRanges ! mid
    n = ord c

wideRanges :: Array Int (Int, Int)
wideRanges = listArray (0, 118) [
    (0x001100, 0x00115f), (0x00231a, 0x00231b), (0x002329, 0x00232a),
    (0x0023e9, 0x0023ec), (0x0023f0, 0x0023f0), (0x0023f3, 0x0023f3),
    (0x0025fd, 0x0025fe), (0x002614, 0x002615), (0x002648, 0x002653),
    (0x00267f, 0x00267f), (0x002693, 0x002693), (0x0026a1, 0x0026a1),
    (0x0026aa, 0x0026ab), (0x0026bd, 0x0026be), (0x0026c4, 0x0026c5),
    (0x0026ce, 0x0026ce), (0x0026d4, 0x0026d4), (0x0026ea, 0x0026ea),
    (0x0026f2, 0x0026f3), (0x0026f5, 0x0026f5), (0x0026fa, 0x0026fa),
    (0x0026fd, 0x0026fd), (0x002705, 0x002705), (0x00270a, 0x00270b),
    (0x002728, 0x002728), (0x00274c, 0x00274c), (0x00274e, 0x00274e),
    (0x002753, 0x002755), (0x002757, 0x002757), (0x002795, 0x002797),
    (0x0027b0, 0x0027b0), (0x0027bf, 0x0027bf), (0x002b1b, 0x002b1c),
    (0x002b50, 0x002b50), (0x002b55, 0x002b55), (0x002e80, 0x002e99),
    (0x002e9b, 0x002ef3), (0x002f00, 0x002fd5), (0x002ff0, 0x002ffb),
    (0x003000, 0x00303e), (0x003041, 0x003096), (0x003099, 0x0030ff),
    (0x003105, 0x00312f), (0x003131, 0x00318e), (0x003190, 0x0031ba),
    (0x0031c0, 0x0031e3), (0x0031f0, 0x00321e), (0x003220, 0x003247),
    (0x003250, 0x004db5), (0x004e00, 0x009fef), (0x00a000, 0x00a48c),
    (0x00a490, 0x00a4c6), (0x00a960, 0x00a97c), (0x00ac00, 0x00d7a3),
    (0x00f900, 0x00fa6d), (0x00fa70, 0x00fad9), (0x00fe10, 0x00fe19),
    (0x00fe30, 0x00fe52), (0x00fe54, 0x00fe66), (0x00fe68, 0x00fe6b),
    (0x00ff01, 0x00ff60), (0x00ffe0, 0x00ffe6), (0x016fe0, 0x016fe3),
    (0x017000, 0x0187f7), (0x018800, 0x018af2), (0x01b000, 0x01b11e),
    (0x01b150, 0x01b152), (0x01b164, 0x01b167), (0x01b170, 0x01b2fb),
    (0x01f004, 0x01f004), (0x01f0cf, 0x01f0cf), (0x01f18e, 0x01f18e),
    (0x01f191, 0x01f19a), (0x01f200, 0x01f202), (0x01f210, 0x01f23b),
    (0x01f240, 0x01f248), (0x01f250, 0x01f251), (0x01f260, 0x01f265),
    (0x01f300, 0x01f320), (0x01f32d, 0x01f335), (0x01f337, 0x01f37c),
    (0x01f37e, 0x01f393), (0x01f3a0, 0x01f3ca), (0x01f3cf, 0x01f3d3),
    (0x01f3e0, 0x01f3f0), (0x01f3f4, 0x01f3f4), (0x01f3f8, 0x01f43e),
    (0x01f440, 0x01f440), (0x01f442, 0x01f4fc), (0x01f4ff, 0x01f53d),
    (0x01f54b, 0x01f54e), (0x01f550, 0x01f567), (0x01f57a, 0x01f57a),
    (0x01f595, 0x01f596), (0x01f5a4, 0x01f5a4), (0x01f5fb, 0x01f64f),
    (0x01f680, 0x01f6c5), (0x01f6cc, 0x01f6cc), (0x01f6d0, 0x01f6d2),
    (0x01f6d5, 0x01f6d5), (0x01f6eb, 0x01f6ec), (0x01f6f4, 0x01f6fa),
    (0x01f7e0, 0x01f7eb), (0x01f90d, 0x01f971), (0x01f973, 0x01f976),
    (0x01f97a, 0x01f9a2), (0x01f9a5, 0x01f9aa), (0x01f9ae, 0x01f9ca),
    (0x01f9cd, 0x01f9ff), (0x01fa70, 0x01fa73), (0x01fa78, 0x01fa7a),
    (0x01fa80, 0x01fa82), (0x01fa90, 0x01fa95), (0x020000, 0x02a6d6),
    (0x02a700, 0x02b734), (0x02b740, 0x02b81d), (0x02b820, 0x02cea1),
    (0x02ceb0, 0x02ebe0), (0x02f800, 0x02fa1d)
  ]

main :: IO ()
main = mapM_ (print . isWide) . concat =<< getArgs


mrkkrp · 2024-07-11T20:40:34Z

Correct handling of wide characters will be available in the next release (9.7.0).

obfusk · 2024-07-11T21:16:43Z

Nice! Note that this works well for East Asian full-width characters, but not for e.g. emoji (sequences of multiple code points combined by zero-width joiners, variation selectors, etc. are rather intractable). It probably also does not work for languages with more complex rules, or for characters like İ, which can take up multiple code points depending on Unicode normalisation.
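A small Python sketch of those two cases, using a naive per-code-point estimate (two columns for East Asian Width `F`/`W`, one otherwise). The estimate is an assumption made for illustration, not what megaparsec does, and `naive_width` is a hypothetical helper.

```python
import unicodedata

def naive_width(s: str) -> int:
    """Per-code-point estimate: 2 for East Asian Width F/W, 1 for everything else."""
    return sum(2 if unicodedata.east_asian_width(ch) in ("F", "W") else 1 for ch in s)

# Three emoji joined by zero-width joiners (U+200D) into one sequence.
family = "\U0001F468\u200D\U0001F469\u200D\U0001F467"

nfc = unicodedata.normalize("NFC", "\u0130")  # İ as a single code point
nfd = unicodedata.normalize("NFD", "\u0130")  # I followed by a combining dot above

print(len(family), naive_width(family))
print(len(nfc), naive_width(nfc))
print(len(nfd), naive_width(nfd))
```

The emoji sequence is counted code point by code point even where a terminal draws it as a single glyph, and the estimate for İ changes with the normalisation form although the rendered text is the same.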

mrkkrp · 2024-07-12T08:16:39Z

Well, we have to start somewhere. I think it is already an improvement :-)

mrkkrp added the bug label Sep 15, 2019

mrkkrp added this to the 8.0.0 milestone Sep 15, 2019

mrkkrp removed this from the 8.0.0 milestone Oct 27, 2019

banacorn mentioned this issue Nov 28, 2019

Decouple error reporting from Stream #388

Closed

tomjaguarpaw pushed a commit to tomjaguarpaw/megaparsec that referenced this issue Sep 29, 2022

GA(deps): Update actions/setup-haskell requirement to v1.1.2 (mrkkrp#370) b989312

mrkkrp self-assigned this Jul 11, 2024

mrkkrp mentioned this issue Jul 11, 2024

Implement correct handling of wide Unicode characters #564

Merged

mrkkrp closed this as completed in #564 Jul 11, 2024
