u8 literals should support literal concatenation #60896

stephentoub · 2022-04-21T23:58:32Z

Version Used:
32a81dd

Steps to Reproduce:

using System;
public class C {
    public ReadOnlySpan<byte> M() => "abc"u8 + "def"u8;
}

Expected Behavior:
Produces a result equivalent to:

using System;
public class C {
    public ReadOnlySpan<byte> M() => "abcdef"u8;
}

This is valuable for very long UTF-8 literals such that you want to split them across lines.

Actual Behavior:
Fails to compile.

Relates to test plan #58848


danmoseley · 2022-04-22T03:18:15Z

Here's a 500-line motivating example: https://github.com/dotnet/runtime/blob/1313a689fd5e1d282f8910a4c77fbe15315d6d98/src/libraries/System.Private.CoreLib/src/System/Globalization/IcuLocaleData.cs#L27-L606

jcouv · 2022-04-22T15:24:29Z

Request seems sensible and useful, but I don't remember that we discussed it in LDM. Assigned to @AlekseyTs to drive discussion/resolution.

AlekseyTs · 2022-05-04T13:45:39Z

@stephentoub, to expedite this and improve the chances of it happening, I suggest you put together a concrete language proposal, i.e., what specific language rules you propose to add or change around the + operator, taking into account the latest LDM decisions on the feature.

am11 · 2022-05-07T12:23:22Z

At first glance, the inability to concatenate u8 literals looked like an implementation bug rather than a gap in the language design, given that it is already possible to write it without the u8 suffix:

public ReadOnlySpan<byte> M() => "abc" + "def";
// Without the u8 suffix, the compiler still implicitly converts to UTF-8 bytes
// instead of Unicode / UTF-16 bytes (which I personally found unintuitive).

AlekseyTs · 2022-05-10T16:28:17Z

At first glance, the inability to concatenate u8 literals looked like an implementation bug rather than a gap in the language design, given that it is already possible to write it without the u8 suffix

That was not a bug. UTF-8 conversion of string constants and the u8 suffix were two independent features, each with its own specified behavior. BTW, based on the latest LDM decisions (https://github.com/dotnet/csharplang/blob/main/meetings/2022/LDM-2022-04-18.md#issues-with-utf8-string-literals), we are going to remove UTF-8 conversion of string constants.

stephentoub · 2022-05-17T22:06:57Z

I suggest you put together a concrete language proposal, i.e., what specific language rules you propose to add or change around the + operator, taking into account the latest LDM decisions on the feature.

I'm proposing the language behave as if there's a:

u8literal operator +(u8literal x, u8literal y);

operator that produces a new literal that in turn yields the concatenation of all the bytes from x with all the bytes from y. So you could write:

ReadOnlySpan<byte> s = "abc"u8 + "def"u8;

and that would be compiled identically to:

ReadOnlySpan<byte> s = "abcdef"u8;

Similarly,

ReadOnlySpan<byte> s = "abc"u8 + "def"u8 + "ghi"u8;

would be compiled identically to:

ReadOnlySpan<byte> s = "abcdefghi"u8;

as would:

ReadOnlySpan<byte> s = ("abc"u8 + "def"u8) + "ghi"u8;

and

ReadOnlySpan<byte> s = "abc"u8 + ("def"u8 + "ghi"u8);

This would only apply to literals and not to the natural type ReadOnlySpan<byte> form, so:

ReadOnlySpan<byte> s = "abc"u8 + (ReadOnlySpan<byte>)"def"u8; // error

would be an error, as would:

ReadOnlySpan<byte> s1 = ...;
ReadOnlySpan<byte> s2 = "abc"u8 + s1; // error

AlekseyTs · 2022-05-20T20:52:47Z

@stephentoub

Just to confirm, this proposal requires each literal to be successfully convertible to UTF-8 byte representation individually. Correct?
So the code below won't compile:

ReadOnlySpan<byte> span = "\uD83D"u8 +  // high surrogate
                          "\uDE00"u8;   // low surrogate

Even though this code will:

ReadOnlySpan<byte> span = "\uD83D\uDE00"u8;
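
To see why each half fails on its own, the same distinction can be reproduced at run time with a strict UTF-8 encoder. This is only an analogy for the compile-time check, not how the compiler implements it; the SurrogateDemo type and CanEncode helper are hypothetical names.

```csharp
using System;
using System.Text;

// Hypothetical helper illustrating that a lone surrogate has no UTF-8 encoding.
public static class SurrogateDemo
{
    // Strict encoder: throws on invalid input instead of substituting U+FFFD.
    private static readonly UTF8Encoding Strict =
        new UTF8Encoding(encoderShouldEmitUTF8Identifier: false, throwOnInvalidBytes: true);

    // Returns true when the UTF-16 string can be converted to UTF-8 as-is.
    public static bool CanEncode(string s)
    {
        try
        {
            Strict.GetBytes(s);
            return true;
        }
        catch (EncoderFallbackException)
        {
            return false;
        }
    }
}
```

With this helper, CanEncode("\uD83D") and CanEncode("\uDE00") return false, while CanEncode("\uD83D\uDE00") returns true, which mirrors the rule that each u8 literal must be convertible individually.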

stephentoub · 2022-05-20T21:01:47Z

this proposal requires each literal to be successfully convertible to UTF-8 byte representation individually. Correct?

Yes. Otherwise a standalone "\uD83D"u8 is an invalid literal but concatenating it with an entirely separate literal somehow makes it valid, which feels weird to me. The downside is that you can't split a u8 literal at arbitrary points, only at valid boundaries, but I think that's the lesser evil. (There's an argument to be made that "\uD83D" is a valid string literal, so why not a u8 literal, but the compiler also isn't trying to interpret the contents of the string literal, whereas it is with u8 literals. Given that, I think every u8 literal needs to be able to stand on its own.)

@GrabYourPitchforks, thoughts?

GrabYourPitchforks · 2022-05-20T21:29:44Z

I'm totally fine with the compiler not allowing splits like this. If you're writing a u8 string literal, it's really weird (IMO) for people to use "\uXXXX\uYYYY" syntax anyway to represent supplementary code points. They should either use "\U00ZZZZZZ" syntax or just have the actual raw characters directly in source, which would prevent splits like this in the first place.
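
As a small sketch of the "\U00ZZZZZZ" suggestion, assuming C# 11 or later and .NET 5 or later for Convert.ToHexString, the following writes U+1F600 both ways and compares the resulting bytes. The type and member names are hypothetical.

```csharp
using System;

// Hypothetical demo type comparing the two escape styles for one supplementary code point.
public static class SupplementaryDemo
{
    // Surrogate-pair escapes kept together in a single literal.
    public static ReadOnlySpan<byte> ViaSurrogates() => "\uD83D\uDE00"u8;

    // Single eight-digit escape for the same code point, U+1F600.
    public static ReadOnlySpan<byte> ViaCodePoint() => "\U0001F600"u8;

    // Both forms should encode to the same UTF-8 bytes.
    public static bool Same() => ViaSurrogates().SequenceEqual(ViaCodePoint());

    // Hex dump of the encoded bytes: "F09F9880".
    public static string Hex() => Convert.ToHexString(ViaCodePoint());
}
```

Because the "\U0001F600" form is a single escape, there is no point inside it where the literal could be split between the two surrogates.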

AlekseyTs · 2022-06-22T16:21:57Z

Implemented in #62044.

dotnet-issue-labeler bot added Area-Compilers untriaged Issues and PRs which have not yet been triaged by a lead labels Apr 21, 2022

stephentoub mentioned this issue Apr 21, 2022

Use "..."u8 to simplify some ReadOnlySpan<byte> constructions dotnet/runtime#68334

Merged

jcouv assigned AlekseyTs Apr 22, 2022

jcouv added the Area-Language Design label Apr 22, 2022

jcouv added this to the 17.3 milestone Apr 22, 2022

jcouv added Feature Request and removed untriaged Issues and PRs which have not yet been triaged by a lead labels Apr 22, 2022

AlekseyTs added the Feature - Utf8StringLiterals label Apr 22, 2022

AlekseyTs assigned stephentoub May 4, 2022

jcouv mentioned this issue May 7, 2022

Test plan for Utf8StringLiterals feature. #58848

Closed

53 tasks

AlekseyTs closed this as completed Jun 22, 2022

AlekseyTs added the Resolution-Fixed The bug has been fixed and/or the requested behavior has been implemented label Jun 22, 2022
