-
Notifications
You must be signed in to change notification settings - Fork 87
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Error message pretty printing is unaware of width of a character #370
Comments
Possibly related to #362. |
@simonvandel Can you please provide the example as unicode text, not as an image? |
@mrkkrp Did you mean to refer me? Then the example text is "123 구구 이면". |
Yes, sorry. Thanks for the example! |
OK, let's see. You're saying that full width characters should be replaced with two spaces. What we have at this point is column position. Simply put, we insert as many spaces as necessary to reach the same column position. So are you of the opinion that we need to increment current column by 2 instead of 1 for full width characters? Is this something that software working with Korean characters usually does? |
Yes, it is usual practice because in monospaced font, two half-width spaces and one full-width character (ex. Korean syllable) are always the same width. So I think it is safe to put two spaces when it encounters full-width character. |
What I'm more concerned about, is that it may be confusing to see position like |
Also need to find a good way to detect full-width characters. |
I don't know how it's working inside so I can't really know what is proper solution for that problem. But maybe it is possible to make use of full-width space character? It is a single space character but full-width. Regarding the detection of full-width characters, there is a method in ICU library to check if a character is full-width or not. But ICU library itself is pretty heavy, so you can extract the method from the library. |
Alas, we do not have information which characters were full-width and which ones were normal by the time we're printing parse errors. As I mentioned, we only have column position, so we have to work with that. |
I looked at this a bit today and I do not know how to check for full-width characters efficiently and without depending on a library. Even in |
The relevant property here is probably EastAsianWidth. The detailed description of individual property values is given in UAX #11, but the high-level summary seems to be
If you don't want to depend on |
Using Change rpadding = replicate rpshift ' ' to rpadding = [ if isWide c then ' ' else ' ' | c <- take rpshift sline ] Using import Data.Text.ICU.Char (property, EastAsianWidth(..), EastAsianWidth_(..))
isWide c = property EastAsianWidth c `elem` [EAFull, EAWide] I'd be happy to make a PR for that. |
The question is only whether or not adding an extra dependency is worth it. In the past people were really upset about each and every extra dependency. |
I'd prefer the extra dependency if it means better Unicode support in the error message rendering. One of the main attractions of |
I get that. In that case I see 4 options:
I'd be happy to help with any of those :) |
I wrote a few lines of python to generate the relevant code point ranges: #!/usr/bin/python3
import itertools as IT, unicodedata as UD
def grouper(iterable, n):
args = [iter(iterable)] * n
return IT.zip_longest(*args, fillvalue = None)
ranges, start = [], None
for i in range(0x10FFFF):
if UD.category(chr(i)) != "Cn" and UD.east_asian_width(chr(i)) in "FW":
if start is None: start = i
else:
if start is not None:
ranges.append((start, i-1))
start = None
print("(0, {}) [".format(len(ranges)-1))
print(" " + ",\n ".join(
", ".join( "(0x{:06x}, 0x{:06x})".format(*r) for r in g if r )
for g in grouper(ranges, 3)
))
print(" ]") which then results in this bit of haskell: #!/usr/bin/runhaskell
import Data.Char (ord)
import Data.Array (Array, Ix, (!), listArray, bounds)
import System.Environment (getArgs)
isWide :: Char -> Bool
isWide c = go $ bounds wideRanges
where
go (lo, hi)
| hi < lo = False
| a <= n && n <= b = True
| n < a = go (lo, pred mid)
| otherwise = go (succ mid, hi)
where
mid = (lo + hi) `div` 2
(a, b) = wideRanges ! mid
n = ord c
wideRanges :: Array Int (Int, Int)
wideRanges = listArray (0, 118) [
(0x001100, 0x00115f), (0x00231a, 0x00231b), (0x002329, 0x00232a),
(0x0023e9, 0x0023ec), (0x0023f0, 0x0023f0), (0x0023f3, 0x0023f3),
(0x0025fd, 0x0025fe), (0x002614, 0x002615), (0x002648, 0x002653),
(0x00267f, 0x00267f), (0x002693, 0x002693), (0x0026a1, 0x0026a1),
(0x0026aa, 0x0026ab), (0x0026bd, 0x0026be), (0x0026c4, 0x0026c5),
(0x0026ce, 0x0026ce), (0x0026d4, 0x0026d4), (0x0026ea, 0x0026ea),
(0x0026f2, 0x0026f3), (0x0026f5, 0x0026f5), (0x0026fa, 0x0026fa),
(0x0026fd, 0x0026fd), (0x002705, 0x002705), (0x00270a, 0x00270b),
(0x002728, 0x002728), (0x00274c, 0x00274c), (0x00274e, 0x00274e),
(0x002753, 0x002755), (0x002757, 0x002757), (0x002795, 0x002797),
(0x0027b0, 0x0027b0), (0x0027bf, 0x0027bf), (0x002b1b, 0x002b1c),
(0x002b50, 0x002b50), (0x002b55, 0x002b55), (0x002e80, 0x002e99),
(0x002e9b, 0x002ef3), (0x002f00, 0x002fd5), (0x002ff0, 0x002ffb),
(0x003000, 0x00303e), (0x003041, 0x003096), (0x003099, 0x0030ff),
(0x003105, 0x00312f), (0x003131, 0x00318e), (0x003190, 0x0031ba),
(0x0031c0, 0x0031e3), (0x0031f0, 0x00321e), (0x003220, 0x003247),
(0x003250, 0x004db5), (0x004e00, 0x009fef), (0x00a000, 0x00a48c),
(0x00a490, 0x00a4c6), (0x00a960, 0x00a97c), (0x00ac00, 0x00d7a3),
(0x00f900, 0x00fa6d), (0x00fa70, 0x00fad9), (0x00fe10, 0x00fe19),
(0x00fe30, 0x00fe52), (0x00fe54, 0x00fe66), (0x00fe68, 0x00fe6b),
(0x00ff01, 0x00ff60), (0x00ffe0, 0x00ffe6), (0x016fe0, 0x016fe3),
(0x017000, 0x0187f7), (0x018800, 0x018af2), (0x01b000, 0x01b11e),
(0x01b150, 0x01b152), (0x01b164, 0x01b167), (0x01b170, 0x01b2fb),
(0x01f004, 0x01f004), (0x01f0cf, 0x01f0cf), (0x01f18e, 0x01f18e),
(0x01f191, 0x01f19a), (0x01f200, 0x01f202), (0x01f210, 0x01f23b),
(0x01f240, 0x01f248), (0x01f250, 0x01f251), (0x01f260, 0x01f265),
(0x01f300, 0x01f320), (0x01f32d, 0x01f335), (0x01f337, 0x01f37c),
(0x01f37e, 0x01f393), (0x01f3a0, 0x01f3ca), (0x01f3cf, 0x01f3d3),
(0x01f3e0, 0x01f3f0), (0x01f3f4, 0x01f3f4), (0x01f3f8, 0x01f43e),
(0x01f440, 0x01f440), (0x01f442, 0x01f4fc), (0x01f4ff, 0x01f53d),
(0x01f54b, 0x01f54e), (0x01f550, 0x01f567), (0x01f57a, 0x01f57a),
(0x01f595, 0x01f596), (0x01f5a4, 0x01f5a4), (0x01f5fb, 0x01f64f),
(0x01f680, 0x01f6c5), (0x01f6cc, 0x01f6cc), (0x01f6d0, 0x01f6d2),
(0x01f6d5, 0x01f6d5), (0x01f6eb, 0x01f6ec), (0x01f6f4, 0x01f6fa),
(0x01f7e0, 0x01f7eb), (0x01f90d, 0x01f971), (0x01f973, 0x01f976),
(0x01f97a, 0x01f9a2), (0x01f9a5, 0x01f9aa), (0x01f9ae, 0x01f9ca),
(0x01f9cd, 0x01f9ff), (0x01fa70, 0x01fa73), (0x01fa78, 0x01fa7a),
(0x01fa80, 0x01fa82), (0x01fa90, 0x01fa95), (0x020000, 0x02a6d6),
(0x02a700, 0x02b734), (0x02b740, 0x02b81d), (0x02b820, 0x02cea1),
(0x02ceb0, 0x02ebe0), (0x02f800, 0x02fa1d)
]
main :: IO ()
main = mapM_ (print . isWide) . concat =<< getArgs |
Correct handling of wide characters will be available in the next release (9.7.0). |
Nice! Note that this works well for east asian full-width characters, but not for e.g. emoji (which with multiple code points combined by zero width joiners and variation selectors etc. is rather intractable). And probably also not for languages with more complex rules, or characters like İ which can take up multiple code points depending on unicode normalisation. |
Well, we have to start somewhere. I think it is already an improvement :-) |
errorBundlePretty
is not working properly if input contains full-width character.Character pointer (
^
) should be pointing '이', but it is pointing the wrong position. Full-width character should be replaced with two spaces, not one.The text was updated successfully, but these errors were encountered: