Unless /codepage is specified the build output might depend on the OS code page #44452

tmat · 2020-05-20T22:59:30Z

Version Used:
Version 16.7.0 Preview 2.0 [30112.204.master]

Steps to Reproduce:

csc a.cs

where a.cs is a file that contains bytes that are not valid UTF-8.

The compiler attempts to use UTF-8 encoding, fails, and falls back to the default OS ANSI code page.

https://github.com/dotnet/roslyn/blob/master/src/Compilers/Core/Portable/EncodedStringText.cs#L24-L52

Proposal

One option is to set CodePage to UTF-8 by default when the project targets net5.

The text was updated successfully, but these errors were encountered:

tmat · 2020-05-20T23:03:31Z

@ryzngard @clairernovotny FYI. For our deterministic rebuild scenario we could capture the OS code page in the PDB. However, that doesn't completely solve the problem since the code page might not be available on the machine when rebuilding. In that case (if this was actually a real problem) we could in theory have a service (database) that provides all existing code pages and download it from there.

Another problem with capturing the OS code page is that by doing so we would make the PDB non-deterministic even in the case when all source files are UTF8 encoded but /codepage is not specified.

tmat · 2020-05-20T23:21:06Z

@jaredpar @agocke @gafter FYI

tmat · 2020-05-20T23:25:59Z

Another problem with capturing the OS code page is that by doing so we would make the PDB non-deterministic even in the case when all source files are UTF8 encoded but /codepage is not specified.

We should only capture fallback encoding if the compiler actually used it and thus the compilation is already non-deterministic.

jaredpar · 2020-05-27T16:35:30Z

One option is to set CodePage to UTF-8 by default when the project targets net5.

Slight change: net5 or higher. Essentially this should be the new default going forward.

We should only capture fallback encoding if the compiler actually used it and thus the compilation is already non-deterministic.

Curious: why do we feel this is more severe than say the non-determinism that comes from floating point constant folding? Both are host specific forms of non-determinism and it's possible, to some degree, to control both of them. For instance we could force a conv.r4 at every layer of floating point folding to make the value fairly deterministic.

tmat · 2020-05-27T17:36:27Z

Curious: why do we feel this is more severe than say the non-determinism that comes from floating point constant folding?

Isn't the floating point deterministic as long as we use the same version of the runtime? Or is there a dependency on the OS/CPU?

Adding the OS encoding would unnecessarily make the build dependent on OS configuration, where it may not today (say all files are UTF8 encoded and the OS encoding is never used by the compiler).

RikkiGibson · 2020-05-27T17:54:13Z

I'm a little confused by the problem described in the issue description. We failed to use a UTF-8 encoding, how can we reasonably fall back to UTF-8? I am not very familiar with CodePages so apologies if I'm missing something obvious.

gafter · 2020-05-27T18:26:49Z

We already do force a conv.r4 at every step of constant folding. I believe that is required by the C# language specification. Perhaps you're thinking of decimal, where we rely on the runtime library to do the math for us, and the library is different in different versions. dotnet/runtime#1611

tmat · 2020-05-27T18:31:19Z

@RikkiGibson we are falling back to the OS ANSI code page (unless /codepage is specified explicitly).

tmat · 2020-05-27T18:32:10Z

@gafter So the decimal operations would only depend on the version of corlib (that the compiler uses), correct?

jaredpar · 2020-05-27T18:43:30Z

@RikkiGibson

Our encoding story is complex and strongly influenced by how the native compiler handled encoding. Essentially though in the absence of an explicit encoding we do the following:

Use encoding specified via BOM
Use UTF-8
Fall back to OS default encoding

In the presence of an explicit /codePage argument the compiler will consider that code page only and no others.

What @tmat is asking for here is essentially that when no /codePage is specified and we're at net5 or higher then pretend like UTF-8 was explicitly specified. This would be done at our MSBuild layer where we make other TF based decisions.

gafter · 2020-05-27T18:44:42Z

@tmat Yes

tmat added Concept-Determinism The issue involves our ability to support determinism in binaries and PDBs created at build time. Area-Compilers labels May 20, 2020

clairernovotny mentioned this issue May 21, 2020

Epic: Deterministic build support dotnet/sourcelink#601

Open

14 tasks

jaredpar assigned RikkiGibson May 27, 2020

jaredpar added the Bug label May 27, 2020

RikkiGibson modified the milestones: 16.7.P3, 16.7 May 28, 2020

jaredpar assigned chsienki and unassigned RikkiGibson Jun 15, 2020

jaredpar modified the milestones: 16.7, 16.8 Jun 24, 2020

jaredpar modified the milestones: 16.8, Compiler.Next Aug 25, 2020

jaredpar modified the milestones: Compiler.Next, Backlog Sep 12, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Unless /codepage is specified the build output might depend on the OS code page #44452

Unless /codepage is specified the build output might depend on the OS code page #44452

tmat commented May 20, 2020

tmat commented May 20, 2020 •

edited

Loading

tmat commented May 20, 2020

tmat commented May 20, 2020

jaredpar commented May 27, 2020

tmat commented May 27, 2020

RikkiGibson commented May 27, 2020

gafter commented May 27, 2020 •

edited

Loading

tmat commented May 27, 2020

tmat commented May 27, 2020 •

edited

Loading

jaredpar commented May 27, 2020

gafter commented May 27, 2020

Unless /codepage is specified the build output might depend on the OS code page #44452

Unless /codepage is specified the build output might depend on the OS code page #44452

Comments

tmat commented May 20, 2020

tmat commented May 20, 2020 • edited Loading

tmat commented May 20, 2020

tmat commented May 20, 2020

jaredpar commented May 27, 2020

tmat commented May 27, 2020

RikkiGibson commented May 27, 2020

gafter commented May 27, 2020 • edited Loading

tmat commented May 27, 2020

tmat commented May 27, 2020 • edited Loading

jaredpar commented May 27, 2020

gafter commented May 27, 2020

tmat commented May 20, 2020 •

edited

Loading

gafter commented May 27, 2020 •

edited

Loading

tmat commented May 27, 2020 •

edited

Loading