u8 literals should support literal concatenation #60896

Closed
stephentoub opened this issue Apr 21, 2022 · 10 comments
Labels
Area-Compilers Area-Language Design Feature - Utf8StringLiterals Feature Request Resolution-Fixed The bug has been fixed and/or the requested behavior has been implemented

@stephentoub
Member

stephentoub commented Apr 21, 2022

Version Used:
32a81dd

Steps to Reproduce:

using System;
public class C {
    public ReadOnlySpan<byte> M() => "abc"u8 + "def"u8;
}

Expected Behavior:
Produces a result equivalent to:

using System;
public class C {
    public ReadOnlySpan<byte> M() => "abcdef"u8;
}

This is valuable for very long UTF-8 literals such that you want to split them across lines.
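As a rough illustration of the line-splitting use case, the sketch below breaks one long UTF-8 literal into three pieces. The class name `HttpConstants` and the header text are made up for the example, and it assumes a compiler that supports `+` between u8 literals (C# 11 or later).

```csharp
using System;

public static class HttpConstants
{
    // Hypothetical long literal, split at line-friendly points.
    // Each piece is a valid u8 literal on its own; the pieces are
    // expected to be folded into one byte sequence at compile time.
    public static ReadOnlySpan<byte> Response =>
        "HTTP/1.1 200 OK\r\n"u8 +
        "Content-Type: text/plain\r\n"u8 +
        "Content-Length: 0\r\n\r\n"u8;
}
```

The result should be byte-for-byte the same as writing the whole text as a single u8 literal on one line.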

Actual Behavior:
Fails to compile.

Relates to test plan #58848

@danmoseley
Member

@jcouv jcouv added this to the 17.3 milestone Apr 22, 2022
@jcouv
Member

jcouv commented Apr 22, 2022

Request seems sensible and useful, but I don't remember that we discussed it in LDM. Assigned to @AlekseyTs to drive discussion/resolution.

@jcouv jcouv added Feature Request and removed untriaged Issues and PRs which have not yet been triaged by a lead labels Apr 22, 2022
@AlekseyTs
Contributor

@stephentoub, to expedite this and make it feasible, I suggest you put together a concrete language proposal, i.e. what specific language rules you propose to add or change around the + operator, etc., taking into account the latest LDM decisions around the feature.

@am11
Member

am11 commented May 7, 2022

At first glance, the inability to concatenate u8 literals looked like an implementation bug rather than a gap in the language design, given that it is already possible to write this without the u8 suffix:

public ReadOnlySpan<byte> M() => "abc" + "def";
// Without the u8 suffix, the compiler still implicitly converts to UTF-8 bytes
// instead of Unicode / UTF-16 bytes (which I personally found unintuitive).

@AlekseyTs
Contributor

At first glance, the inability to concatenate u8 literals looked like an implementation bug rather than a gap in the language design, given that it is already possible to write this without the u8 suffix

That was not a bug. UTF-8 conversion of string constants and the u8 suffix were two independent features, each with its own specified behavior. BTW, based on the latest LDM decisions (https://github.com/dotnet/csharplang/blob/main/meetings/2022/LDM-2022-04-18.md#issues-with-utf8-string-literals), we are going to remove UTF-8 conversion of string constants.

@stephentoub
Member Author

stephentoub commented May 17, 2022

I suggest you put together a concrete language proposal, i.e. what specific language rules you propose to add or change around the + operator, etc., taking into account the latest LDM decisions around the feature.

I'm proposing the language behave as if there's a:

u8literal operator +(u8literal x, u8literal y);

operator that produces a new literal that in turn yields the concatenation of all the bytes from x with all the bytes from y. So you could write:

ReadOnlySpan<byte> s = "abc"u8 + "def"u8;

and that would be compiled identically to:

ReadOnlySpan<byte> s = "abcdef"u8;

Similarly,

ReadOnlySpan<byte> s = "abc"u8 + "def"u8 + "ghi"u8;

would be compiled identically to:

ReadOnlySpan<byte> s = "abcdefghi"u8;

as would:

ReadOnlySpan<byte> s = ("abc"u8 + "def"u8) + "ghi"u8;

and

ReadOnlySpan<byte> s = "abc"u8 + ("def"u8 + "ghi"u8);

This would only apply to literals and not to the natural type ReadOnlySpan<byte> form, so:

ReadOnlySpan<byte> s = "abc"u8 + (ReadOnlySpan<byte>)"def"u8; // error

would be an error, as would:

ReadOnlySpan<byte> s1 = ...;
ReadOnlySpan<byte> s2 = "abc"u8 + s1; // error

@AlekseyTs
Contributor

@stephentoub

Just to confirm: this proposal requires each literal to be individually convertible to its UTF-8 byte representation, correct?
So the code below won't compile:

ReadOnlySpan<byte> span = "\uD83D"u8 +  // high surrogate
                          "\uDE00"u8;   // low surrogate

Even though this code will:

ReadOnlySpan<byte> span = "\uD83D\uDE00"u8; 

@stephentoub
Member Author

this proposal requires each literal to be successfully convertible to UTF-8 byte representation individually. Correct?

Yes. Otherwise a standalone "\uD83D"u8 is an invalid literal but concatenating it with an entirely separate literal somehow makes it valid, which feels weird to me. The downside is that you can't split a u8 literal at arbitrary points, only at valid boundaries, but I think that's the lesser evil. (There's an argument to be made that "\uD83D" is a valid string literal, so why not a u8 literal, but the compiler also isn't trying to interpret the contents of the string literal, whereas it is with u8 literals. Given that, I think every u8 literal needs to be able to stand on its own.)

@GrabYourPitchforks, thoughts?
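The "every literal must stand on its own" rule can be approximated at run time with a strict UTF-8 encoder. This is only an analogy: the compiler performs its check at compile time, and the helper below (`SurrogateCheck.CanEncodeAlone`, a hypothetical name) merely shows that each half of a surrogate pair has no UTF-8 encoding by itself while the pair together does.

```csharp
using System;
using System.Text;

public static class SurrogateCheck
{
    // Strict encoder: throws on invalid input instead of substituting U+FFFD.
    private static readonly UTF8Encoding Strict =
        new UTF8Encoding(encoderShouldEmitUTF8Identifier: false, throwOnInvalidBytes: true);

    // Returns true when the UTF-16 string can be encoded to UTF-8 on its own.
    public static bool CanEncodeAlone(string s)
    {
        try
        {
            Strict.GetBytes(s);
            return true;
        }
        catch (EncoderFallbackException)
        {
            // A lone high or low surrogate lands here.
            return false;
        }
    }
}
```

Under this check, `"\uD83D"` and `"\uDE00"` each fail individually, while `"\uD83D\uDE00"` succeeds, which mirrors why the split form is rejected and the unsplit form is accepted.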

@GrabYourPitchforks
Member

GrabYourPitchforks commented May 20, 2022

I'm totally fine with the compiler not allowing splits like this. If you're writing a u8 string literal, it's really weird (IMO) for people to use "\uXXXX\uYYYY" syntax anyway to represent supplementary code points. They should either use "\U00ZZZZZZ" syntax or just have the actual raw characters directly in source, which would prevent splits like this in the first place.
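To make the suggestion concrete, here is a small sketch comparing the two escape styles for the same supplementary code point, U+1F600. The class and member names are illustrative only, and the code assumes C# 11 u8 literals.

```csharp
using System;

public static class SupplementaryLiterals
{
    // Surrogate-pair spelling: two \u escapes that could tempt a split in the middle.
    public static ReadOnlySpan<byte> ViaSurrogatePair => "\uD83D\uDE00"u8;

    // Single code point spelling: one \U escape, so there is nothing to split.
    public static ReadOnlySpan<byte> ViaCodePoint => "\U0001F600"u8;

    // Both spellings should yield the same four UTF-8 bytes: F0 9F 98 80.
    public static bool SameBytes() => ViaSurrogatePair.SequenceEqual(ViaCodePoint);
}
```

With the `\U` form, the code point is a single escape, so a line break can only fall before or after it.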

@AlekseyTs
Contributor

Implemented in #62044.

@AlekseyTs AlekseyTs added the Resolution-Fixed The bug has been fixed and/or the requested behavior has been implemented label Jun 22, 2022