Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unless /codepage is specified the build output might depend on the OS code page #44452

Open
tmat opened this issue May 20, 2020 · 11 comments
Open
Assignees
Labels
Area-Compilers Bug Concept-Determinism The issue involves our ability to support determinism in binaries and PDBs created at build time.
Milestone

Comments

@tmat
Copy link
Member

tmat commented May 20, 2020

Version Used:
Version 16.7.0 Preview 2.0 [30112.204.master]

Steps to Reproduce:

csc a.cs

where a.cs is a file that contains bytes that are not valid UTF-8.

The compiler attempts to use UTF-8 encoding, fails, and falls back to the default OS ANSI code page.

https://github.com/dotnet/roslyn/blob/master/src/Compilers/Core/Portable/EncodedStringText.cs#L24-L52

Proposal

One option is to set CodePage to UTF-8 by default when the project targets net5.

@tmat tmat added Concept-Determinism The issue involves our ability to support determinism in binaries and PDBs created at build time. Area-Compilers labels May 20, 2020
@tmat
Copy link
Member Author

tmat commented May 20, 2020

@ryzngard @clairernovotny FYI. For our deterministic rebuild scenario we could capture the OS code page in the PDB. However, that doesn't completely solve the problem since the code page might not be available on the machine when rebuilding. In that case (if this was actually a real problem) we could in theory have a service (database) that provides all existing code pages and download it from there.

Another problem with capturing the OS code page is that by doing so we would make the PDB non-deterministic even in the case when all source files are UTF8 encoded but /codepage is not specified.

@tmat
Copy link
Member Author

tmat commented May 20, 2020

@jaredpar @agocke @gafter FYI

@tmat
Copy link
Member Author

tmat commented May 20, 2020

Another problem with capturing the OS code page is that by doing so we would make the PDB non-deterministic even in the case when all source files are UTF8 encoded but /codepage is not specified.

We should only capture fallback encoding if the compiler actually used it and thus the compilation is already non-deterministic.

@jaredpar
Copy link
Member

One option is to set CodePage to UTF-8 by default when the project targets net5.

Slight change: net5 or higher. Essentially this should be the new default going forward.

We should only capture fallback encoding if the compiler actually used it and thus the compilation is already non-deterministic.

Curious: why do we feel this is more severe than say the non-determinism that comes from floating point constant folding? Both are host specific forms of non-determinism and it's possible, to some degree, to control both of them. For instance we could force a conv.r4 at every layer of floating point folding to make the value fairly deterministic.

@tmat
Copy link
Member Author

tmat commented May 27, 2020

Curious: why do we feel this is more severe than say the non-determinism that comes from floating point constant folding?

Isn't the floating point deterministic as long as we use the same version of the runtime? Or is there a dependency on the OS/CPU?

Adding the OS encoding would unnecessarily make the build dependent on OS configuration, where it may not today (say all files are UTF8 encoded and the OS encoding is never used by the compiler).

@RikkiGibson
Copy link
Contributor

I'm a little confused by the problem described in the issue description. We failed to use a UTF-8 encoding, how can we reasonably fall back to UTF-8? I am not very familiar with CodePages so apologies if I'm missing something obvious.

@gafter
Copy link
Member

gafter commented May 27, 2020

We already do force a conv.r4 at every step of constant folding. I believe that is required by the C# language specification. Perhaps you're thinking of decimal, where we rely on the runtime library to do the math for us, and the library is different in different versions. dotnet/runtime#1611

@tmat
Copy link
Member Author

tmat commented May 27, 2020

@RikkiGibson we are falling back to the OS ANSI code page (unless /codepage is specified explicitly).

@tmat
Copy link
Member Author

tmat commented May 27, 2020

@gafter So the decimal operations would only depend on the version of corlib (that the compiler uses), correct?

@jaredpar
Copy link
Member

@RikkiGibson

Our encoding story is complex and strongly influenced by how the native compiler handled encoding. Essentially though in the absence of an explicit encoding we do the following:

  1. Use encoding specified via BOM
  2. Use UTF-8
  3. Fall back to OS default encoding

In the presence of an explicit /codePage argument the compiler will consider that code page only and no others.

What @tmat is asking for here is essentially that when no /codePage is specified and we're at net5 or higher then pretend like UTF-8 was explicitly specified. This would be done at our MSBuild layer where we make other TF based decisions.

@gafter
Copy link
Member

gafter commented May 27, 2020

@tmat Yes

@RikkiGibson RikkiGibson modified the milestones: 16.7.P3, 16.7 May 28, 2020
@jaredpar jaredpar assigned chsienki and unassigned RikkiGibson Jun 15, 2020
@jaredpar jaredpar modified the milestones: 16.7, 16.8 Jun 24, 2020
@jaredpar jaredpar modified the milestones: 16.8, Compiler.Next Aug 25, 2020
@jaredpar jaredpar modified the milestones: Compiler.Next, Backlog Sep 12, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Area-Compilers Bug Concept-Determinism The issue involves our ability to support determinism in binaries and PDBs created at build time.
Projects
None yet
Development

No branches or pull requests

5 participants