Proposal: Embed sources in PDBs #12625

Closed
3 tasks done
nguerrera opened this issue Jul 19, 2016 · 27 comments

Comments

@nguerrera
Contributor

nguerrera commented Jul 19, 2016

Implementation progress

Windows PDB support is tracked by #13707.

This proposal addresses #5397, which requests a feature for embedding source code inside of a PDB.

Scenarios

Recap from #5397

  • During the build, source code is auto-generated and then compiled. This auto-generated source does not exist on source control server and is often not preserved as a build artifact. Even if it is preserved, it can't be indexed on a symbol server making acquisition difficult at debug time.
  • A company is OK from an IP standpoint to release source for some of their projects, but their source control system is behind a firewall. Their IT security policies prevent giving any external access to the source control system, which prevents typical usage of source server. They already provide PDBs to customers, and by including source in the PDBs the customer's debugging experience improves with minimal additional work.
  • An Open Source project is doing all their development on GitHub and they currently use a source server to distribute source, but they don't like the additional configuration necessary in VS to enable it. By distributing the source in the PDB they eliminate this additional configuration.

Also

  • See Proposal: Add compiler switch to embed PDB inside the assembly #12390, which requests embedding PDBs in PE files and argues for the power of combining that with this.
  • Binary analysis is often chosen due to the ease of acquiring binaries over integrating into someone else's build, but comes at the cost of precision. This is a step towards having tools that can be pointed at binaries, but analyze source, which was my primary motivation for contributing to this. There's more that I want to see in that direction: e.g. serialized compilation options, reference MVIDs in PDB -- ultimately enough to reproduce the compilation from a binary. Access to generated code was just one piece of that, but it overlaps with the use cases noted above and provides substantial value on its own.

Command Line Usage

Since common usage will already leverage a source server and only require generated code to be embedded, we need to be able to specify the files to embed individually.

Proposal: Add a new /embed switch for vbc.exe and csc.exe:

  • /embed: embeds all source files in the PDB.
  • /embed:<file list>: embeds specific files in the PDB.
  • <file list> shall be parsed exactly as /additionalfile with semicolon separation and wildcard expansion.
  • If specific source files are to be embedded, they need to be specified as source files in the usual way AND passed to /embed.

    NOTE: Some care should be taken in the compiler not to read the same files twice. The approach we landed on in design review is that if the /embed argument and source argument expand to the exact same full path (without normalization applied and case-sensitively), then we will not re-read the text of the source file. However, in the edge case, a different spelling of the same file on the command line can lead to reading the same file more than once. It may also lead to repeated document entries in the PDB unless the difference is eliminated by the path normalization or the language-specific case-sensitivity policy put in place by the underlying debug document table. An earlier version of this proposal attempted to address these issues by having distinct mechanisms for embedding source files (without repeating their paths) and additional files. However, it was decided in design review that the complexity added to the command line and API was not worth the marginal gain.

  • It is not an error to pass a file that does not represent source in the compilation to /embed. Such files will simply be added to the PDB, which is a deliberate feature.
  • It is an error to pass /embed without /debug: we can't embed text in the PDB if we're not emitting a PDB.
  • All files passed to /embed shall be included in the PDB regardless of whether or not there are sequence points targeting them.

Examples

  • Embed no sources in PDB (default)
csc /debug+ *.cs 
  • Embed all sources in PDB
csc /debug+ /embed *.cs
  • Embed only some sources in PDB
csc /debug+ src\*.cs /embed:generated\*.cs

#line directives

There is also a scenario where debugging requires external files that are not part of the compilation and are lined up to the actual source code via #line directives.

Proposal: A file targeted by a #line directive shall be embedded in the PDB if either the target file or the referencing source file is embedded.

Example

source.cs

class P {
    static void Main() {
#line 1 "example.xyz"
          System.Console.WriteLine("Hello World");
    }
}

example.xyz

print "Hello World"
  • Compile source.cs and embed only example.xyz in pdb
    • Here we're explicitly asking to embed only example.xyz.
csc source.cs /embed:example.xyz /debug+
  • Compile source.cs and embed both source.cs and example.xyz in pdb
    • Here we're asking to embed all source, which further pulls in example.xyz via #line.
csc source.cs /embed /debug+
  • Compile source.cs and embed source.cs and example.xyz in pdb
    • Here we're explicitly asking to embed source.cs, which further pulls in example.xyz via #line.
csc source.cs /embed:source.cs /debug+

Source Generators

This feature would pair nicely with https://github.com/dotnet/roslyn/blob/features/source-generators/docs/features/generators.md if/when both land, allowing generator output to be debugged without any requirement to acquire (or regenerate) the output by some other means.

We might choose to handle embedding source generator output in one of 3 ways:

  1. Always embed generator output if a PDB is being emitted.
  2. Add a way to decorate a generator as opting in (or out) of having its output embedded.
  3. Add a command-line

After much discussion about an earlier version of this proposal, there was a strong desire to keep the command-line interface minimal, so I think (1) or (2) should be preferred. I personally think always embedding generator output is the best option as it means that generators get good debuggability with no fuss. We could always add a command-line or generator API opt-out later if there was anyone pushing back on embedding the generator output.

I propose that we open a separate follow-up issue to track how to integrate these two features after both have arrived in a common branch and discuss 1-3 or other alternatives there.

Command Line API

Proposal: Add a property to Microsoft.CodeAnalysis.CommandLineArguments to indicate a list of files to be embedded in the PDB.

public class CommandLineArguments {
    ...
    // New property: files to be embedded in the PDB.
    public IEnumerable<CommandLineSourceFile> EmbeddedFiles { get; }
}

Note that if /embed is specified without arguments it is surfaced here by appending the full set of source files to this list and not via a separate API.

Emit API

It should be possible to embed source and additional text via public API without routing through the command-line compiler interface.

Proposal:
NOTE: Additions of optional parameters below to be done in the usual binary-compat-preserving way.

namespace Microsoft.CodeAnalysis.Text {
    // ...
    public abstract class SourceText {
        // ...
        public static SourceText From(
            // existing parameters
            Stream stream,
            Encoding encoding = null,
            SourceHashAlgorithm checksumAlgorithm = SourceHashAlgorithm.Sha1,
            bool throwIfBinaryDetected = false,

            // new parameter: capture enough information to save exact original bytes to PDB
            bool canBeEmbedded = false);

        public static SourceText From(
            // existing parameters
            byte[] buffer,
            int length,
            Encoding encoding = null,
            SourceHashAlgorithm checksumAlgorithm = SourceHashAlgorithm.Sha1,
            bool throwIfBinaryDetected = false,

            // new parameter: capture enough information to save exact original bytes to PDB
            bool canBeEmbedded = false);

        // new property: indicates if it is possible to create EmbeddedText from instance.
        // Either canBeEmbedded=true must have been specified with original bytes, or,
        // if not constructed from bytes/stream, must have Encoding.
        public bool CanBeEmbedded { get; }
    }
}

namespace Microsoft.CodeAnalysis {
    public abstract class Compilation {
        // ...
        public EmitResult Emit(
            // Existing parameters 
            Stream peStream,
            Stream pdbStream = null,
            Stream xmlDocumentationStream = null,
            Stream win32Resources = null,
            IEnumerable<ResourceDescription> manifestResources = null,
            EmitOptions options = null,
            IMethodSymbol debugEntryPoint = null,

             // New parameter: specify the texts (with their paths) to embed
            IEnumerable<EmbeddedText> embeddedTexts = null,

            // Existing parameter
            CancellationToken cancellationToken = default(CancellationToken));
    }

    // new type
    public sealed class EmbeddedText {
        private EmbeddedText();

        public string FilePath { get; }
        public SourceHashAlgorithm ChecksumAlgorithm { get; }
        public ImmutableArray<byte> Checksum { get; }

        // create embedded text from source text, SourceText.CanBeEmbedded must be true
        public static EmbeddedText FromSource(string filePath, SourceText text);

        // create embedded text from a stream (for file that is not source)
        public static EmbeddedText FromStream(string filePath, Stream stream, SourceHashAlgorithm checksumAlgorithm = SourceHashAlgorithm.Sha1);

        // create embedded text from bytes in memory (for file that is not source)
        public static EmbeddedText FromBytes(string filePath, ArraySegment<byte> bytes, SourceHashAlgorithm checksumAlgorithm = SourceHashAlgorithm.Sha1);
    }
}
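
As a rough illustration of how a caller might use the Emit API proposed above, here is a minimal sketch. It assumes a project that references the Microsoft.CodeAnalysis.CSharp NuGet package and that the API shipped in the shape proposed here. The helper name EmitWithEmbeddedSource, the assembly name, and the file path are hypothetical.

```csharp
using System;
using System.IO;
using System.Text;
using Microsoft.CodeAnalysis;
using Microsoft.CodeAnalysis.CSharp;
using Microsoft.CodeAnalysis.Emit;
using Microsoft.CodeAnalysis.Text;

// Hypothetical helper: compiles one source file and embeds its text in a portable PDB.
EmitResult EmitWithEmbeddedSource(string path, byte[] sourceBytes, Stream pe, Stream pdb)
{
    SourceText text;
    using (var stream = new MemoryStream(sourceBytes))
    {
        // canBeEmbedded: true asks SourceText to keep enough information
        // to write the exact original bytes to the PDB.
        text = SourceText.From(stream, Encoding.UTF8, canBeEmbedded: true);
    }

    var tree = CSharpSyntaxTree.ParseText(text, path: path);
    var compilation = CSharpCompilation.Create(
        "EmbedDemo",
        new[] { tree },
        new[] { MetadataReference.CreateFromFile(typeof(object).Assembly.Location) },
        new CSharpCompilationOptions(OutputKind.DynamicallyLinkedLibrary));

    return compilation.Emit(
        pe,
        pdb,
        options: new EmitOptions(debugInformationFormat: DebugInformationFormat.PortablePdb),
        // The caller gathers the texts to embed; here it is just the one source file.
        embeddedTexts: new[] { EmbeddedText.FromSource(path, text) });
}

byte[] source = Encoding.UTF8.GetBytes("class C { static int M() => 42; }");
using var peStream = new MemoryStream();
using var pdbStream = new MemoryStream();
EmitResult result = EmitWithEmbeddedSource("generated/C.g.cs", source, peStream, pdbStream);
Console.WriteLine(result.Success);
```

A non-source file would go through EmbeddedText.FromStream or EmbeddedText.FromBytes instead, skipping SourceText altogether.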

Note that it is the caller's responsibility to gather source and non-source text as appropriate. Text will line up with corresponding source/sequence points by the existing mechanism for de-duping debug documents generated by source trees, #line, and #pragma checksum: i.e. paths will be normalized and then compared case-insensitively for VB and case-sensitively for C#.
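
The de-duping policy described above can be pictured with a small sketch. This is not the compiler's code: the CountDocuments helper is hypothetical, and Path.GetFullPath merely stands in for whatever normalization the debug document table actually applies.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical helper: counts the distinct debug documents a set of paths would produce.
int CountDocuments(IEnumerable<string> paths, bool caseSensitive)
{
    var comparer = caseSensitive ? StringComparer.Ordinal : StringComparer.OrdinalIgnoreCase;
    var documents = new HashSet<string>(comparer);
    foreach (var path in paths)
    {
        // Assumption: Path.GetFullPath approximates the normalization step.
        documents.Add(Path.GetFullPath(path));
    }
    return documents.Count;
}

string[] paths = { "src/A.cs", "src/./A.cs", "src/a.cs" };
Console.WriteLine(CountDocuments(paths, caseSensitive: true));   // C# policy: 2
Console.WriteLine(CountDocuments(paths, caseSensitive: false));  // VB policy: 1
```

Under the C# policy the differently cased spelling yields a second document entry, which is the repeated-entry edge case the command-line note warns about.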

Compression

Files beyond a trivial size should be compressed in the PDB. Deflate format will be used. Tiny files do not benefit from compression and can even waste cycles making the file bigger so we should have a threshold at which we start to compress.

Encoding

Any source text created from raw bytes/stream shall be copied (or compressed and copied) to the PDB without decoding and re-encoding bytes -> chars -> bytes. This is required since encodings do not always round-trip and the checksum must match the original stream.

A source text created by other means (e.g. string + encoding), whose checksum is calculated by encoding to bytes via SourceText.Encoding, will have its text encoded with SourceText.Encoding.

See also CanBeEmbedded requirements above.
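
To see why bytes -> chars -> bytes is unsafe, the following sketch checks whether a byte sequence survives a decode and re-encode. The RoundTrips helper is hypothetical and uses only the standard System.Text encodings.

```csharp
using System;
using System.Linq;
using System.Text;

// Returns true when decoding and re-encoding reproduces the input bytes exactly.
bool RoundTrips(byte[] bytes, Encoding encoding)
{
    string decoded = encoding.GetString(bytes);
    return encoding.GetBytes(decoded).SequenceEqual(bytes);
}

byte[] valid = Encoding.UTF8.GetBytes("class C { }");
byte[] invalid = { 0x41, 0xFF, 0x42 }; // 0xFF never appears in well-formed UTF-8

Console.WriteLine(RoundTrips(valid, Encoding.UTF8));   // True
// False: the decoder replaces 0xFF with U+FFFD, which re-encodes as EF BF BD.
Console.WriteLine(RoundTrips(invalid, Encoding.UTF8));
// False: a byte outside the ASCII range is replaced by a substitute character.
Console.WriteLine(RoundTrips(new byte[] { 0x80 }, Encoding.ASCII));
```

Any input that fails this check would produce a checksum mismatch if the PDB stored re-encoded text, which is why the original bytes are copied verbatim.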

Portable PDB Representation

In portable PDBs, we will put the embedded source as a custom debug info entry (with a new GUID allocated for it) parented by the document entry.

The blob will have a leading int32, which when zero indicates that the remaining bytes are the raw, uncompressed text, and when positive indicates that the remaining bytes are compressed by deflate and the positive value is the byte size when decompressed.
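
The blob layout just described could be written and read roughly as follows. This is a sketch, not the compiler's implementation: the EncodeBlob and DecodeBlob helpers are hypothetical, and the 200-byte threshold is an arbitrary assumption, since the proposal only says that tiny files should stay uncompressed.

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Linq;
using System.Text;

// Assumption: arbitrary cut-off below which text is stored raw.
const int CompressionThreshold = 200;

byte[] EncodeBlob(byte[] original)
{
    using var output = new MemoryStream();
    using (var writer = new BinaryWriter(output, Encoding.UTF8, leaveOpen: true))
    {
        if (original.Length < CompressionThreshold)
        {
            writer.Write(0);                // int32 zero: raw bytes follow
            writer.Write(original);
        }
        else
        {
            writer.Write(original.Length);  // positive int32: size when decompressed
            writer.Flush();
            using var deflate = new DeflateStream(output, CompressionLevel.Optimal, leaveOpen: true);
            deflate.Write(original, 0, original.Length);
        }
    }
    return output.ToArray();
}

byte[] DecodeBlob(byte[] blob)
{
    using var input = new MemoryStream(blob);
    using var reader = new BinaryReader(input);
    int size = reader.ReadInt32();
    if (size == 0)
    {
        return reader.ReadBytes(blob.Length - sizeof(int));
    }
    if (size < 0)
    {
        throw new InvalidDataException("Negative size header.");
    }

    using var deflate = new DeflateStream(input, CompressionMode.Decompress);
    var result = new byte[size];
    int read = 0;
    while (read < size)
    {
        int n = deflate.Read(result, read, size - read);
        if (n == 0) throw new InvalidDataException("Unexpected end of compressed data.");
        read += n;
    }
    return result;
}

byte[] small = Encoding.UTF8.GetBytes("class C { }");
byte[] large = Encoding.UTF8.GetBytes(new string('x', 10_000));
Console.WriteLine(DecodeBlob(EncodeBlob(small)).SequenceEqual(small));
Console.WriteLine(DecodeBlob(EncodeBlob(large)).SequenceEqual(large));
```

Storing the decompressed size up front lets a reader allocate the output buffer once instead of growing it while inflating.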

Portable PDB spec is being updated accordingly: dotnet/corefx#10560

Windows PDB Representation

The traditional Windows PDB already had a provision for embedded source, which we will use via ISymUnmanagedDocumentWriter::SetSource.

The corresponding method for reading back the embedded source returned E_NOTIMPL until recently, but I have made the change to implement it and an update to the nuget package is pending.

The blob format will be identical to the portable PDB. This is already a diasymreader custom PDB "injected source" so we can define the source portion as we wish. Using the same blob for Windows and portable PDBs opens up optimizations in the implementation (less copying) and also simplifies it.

@mjsabby
Contributor

mjsabby commented Jul 20, 2016

I'm super stoked. Skimming through it most things look good. One comment is that while the proposal of additional files and embedding existing files is separate in implementation ... accepting it partially would be a pretty awkward situation. So I would recommend we review those pieces together.

@gafter
Member

gafter commented Jul 21, 2016

Is this intended to support both csc and vbc?

@tmat
Member

tmat commented Jul 21, 2016

Yes. Also F# at some point.

@nguerrera
Contributor Author

@gafter Yes.

Also, I am rewriting this now to reflect the decisions made in our meeting today.

@erik-kallen

I'd love to see this, but it would be even better if the #line case could be handled automatically (embed all files, including all files referenced by #line directives)

@nguerrera
Contributor Author

@gafter @tmat @jaredpar @agocke I've updated the proposal text to match the outcome of our design review. Please review.

@erik-kallen Thanks for the feedback. The revised proposal addresses it. :)

@tmat
Member

tmat commented Jul 25, 2016

LGTM. A couple of things:

  • I assume you haven't updated the above spec to replace GZip with Deflate yet but plan to do so?
  • "Files should be persisted in their original encoding as denoted by SourceText.Encoding." this might be a problem. We need to save the files in the exact representation as we used to calculate their checksums. Otherwise it won't match. This includes BOM for example. This makes it kind of complicated since we don't preserve the original blob in SourceText :(

@nguerrera
Contributor Author

Yes, re deflate. Focusing on the rest first.

It appears to me that SourceText.Encoding captures BOM. Moreover, the checksum computation itself is re-encoding except for an optimization for small files AFAICT.

@tmat
Member

tmat commented Jul 25, 2016

It is not re-encoding when we read the files from command line - the command line compiler calls http://source.roslyn.io/#Microsoft.CodeAnalysis/EncodedStringText.cs,81 thru http://source.roslyn.io/#Microsoft.CodeAnalysis/CommandLine/CommonCompiler.cs,157, which ultimately ends up here: http://source.roslyn.io/#Microsoft.CodeAnalysis/Text/SourceText.cs,102.
There we calculate the checksum upfront and save it to SourceText.

Not all encodings allow text to be round-tripped afaik.

@nguerrera
Contributor Author

OK, my (incorrect) recollection was based on the < 80KB case taking another code path through which we read to a byte[] and there the comment that says it's easy to get checksum up front for this case so I was thinking we didn't do it up front for larger stream, but we do. I actually checked for exactly this reason a little while ago, but I read wrong. :(

I don't believe the BOM is an issue because the Encoding captures that in its "preamble". Another source of non-roundtripability would be invalid input for the given encoding. I wish that were just an error, but it looks like it isn't. :(

Just so I understand the full exposure, is there another case where round-tripping would fail? (Not saying one case isn't enough, I just want to make sure I understand all the cases before proposing a fix and that I add all the right test cases.)

@tmat
Member

tmat commented Jul 26, 2016

If an encoding maps two different byte sequences to a single char then we have a problem.

@nguerrera
Contributor Author

nguerrera commented Jul 26, 2016

OK. I'm going to try the following approach as discussed offline:

  1. Have a way to indicate that SourceText is "embeddable", which will create the compressed source blob up front when we still have the original bytes.
  2. Add back logic to correlate source files to embedded files in compiler based on normalized path and case-sensitivity policy and use it to make only the source text to be embedded embeddable.
  3. Change the shape of EmbeddedText to expose only algorithm, checksum, and blob. Have way to get EmbeddedText from SourceText iff SourceText is embeddable.
  4. Have a way to make EmbeddedText from a Stream and skip SourceText altogether. (Decoding and retaining the chars is a waste for non-source files.)

@jaredpar
Member

Overall LGTM. One small question:

It is an error to pass /embed without /debug: we can't embed text in the PDB if we're not emitting a PDB.

Is this feature limited to portable PDB or does it apply to all PDBs?

@tmat
Member

tmat commented Jul 26, 2016

All PDBs.

@nguerrera
Contributor Author

Update:

  • I've fixed up the proposal to match all of the above feedback
  • I've sent PRs to update portable PDB spec and diasymreader-portable.
  • WIP PR implements all of the proposal as written for C#, but not yet for VB. I'll do VB as a follow-up change as it will be easier to have C# reviewed and approved before porting any language-specific parts to VB.
  • I found a bug in native PDB handling and will revert native PDB support from initial PR (amend error for now to require /debug:portable or /debug:embedded instead of any /debug). Native PDB support will be pushed to a follow-up change as well.
  • I'm working on tests

@nguerrera
Contributor Author

Update: above is done and #12353 is ready for review.

@gafter added the "4 - In Review" and "Resolution-Fixed" labels and removed the "3 - Working" and "4 - In Review" labels on Aug 9, 2016
@gafter closed this as completed on Aug 11, 2016
@nguerrera
Contributor Author

Closing as the only missing piece (Windows PDB support) which does not impact the command-line or API spec is now tracked by #13707. The rest is in.

@roji
Member

roji commented Mar 8, 2017

This looks really really nice, but... I can't see a way to pass the /embed switch to csc via the new csproj (nor can I see a way to pass extra args to csc...)...

@jdasilva

So there's an answer to the /embed question over on Stack Overflow and as far as I can tell that seems to be working. DotPeek would find at least some of the source files and looks to me like it's identifying them as embedded. I can also see the embedded metadata. When debugging in VS2017 it doesn't find the source files though. (I checked Code very quickly also.) Is there some other step we need to take to make this work in a debugger? Does anyone know if VS2017 or Code supports this for debugging or has anyone gotten it to work?

Also, I do have SourceLink enabled (if that's even needed), and tried turning just my code off, and turned off source files must match exactly, but it made no difference.

@nguerrera
Contributor Author

The answer on StackOverflow is correct as to how to pass /embed:X via csproj using @(EmbeddedFiles) items.

However, the VS debugger does not yet support embedded source. cc @gregg-miskelly.

@gregg-miskelly
Contributor

Correct, the debugger doesn't yet support extracting sources from the PDB. At one point there was talk of some team creating a stand-alone tool to extract them out so you could point the debugger at them, not sure if anyone made this. The VS Debugger does support SourceLink if that works for your scenario.

cc @caslan

@tmat
Member

tmat commented Apr 10, 2017

@gregg-miskelly The source extraction tool might be something that could be added to http://github.com/dotnet/symreader-converter.

@jdasilva

For this scenario I'm part of a team using Stash, so I think we may be out of luck: ctaggart/SourceLink#39. Regardless, the idea of embedding the source and keeping it independent of how we host our repo was very appealing.

I was wondering, though, if it would be practical (as a workaround - direct debugger support would obviously be so much better) to use sourcelink with a URL to indicate the symbol file and path and then read and serve them up from an asp.net core project.

@gregg-miskelly
Contributor

@jdasilva I am not certain if this is what you are suggesting, but you could certainly build yourself a tiny HTTP server that would take URLs like:

http://mymirror/<hash>/<filepath>

And run:

get fetch -p
get show <hash> -- <filepath>

And then stream back the results.

@jdasilva

jdasilva commented Apr 10, 2017

@gregg-miskelly Are the get commands part of http://github.com/dotnet/symreader-converter? That's basically what I had in mind. If that library (or some other one) would read the source files out of the pdb, you could just read and deliver them as part of the little server. I was thinking an asp.net core project.

@KirillOsenkov
Member

I've logged a new issue to expose the /embed option through MSBuild: #19127

haritha-mohan added a commit to haritha-mohan/xamarin-macios that referenced this issue Feb 7, 2024
Fixes xamarin#18968

We provide a mapping to the checked in source files via SourceLink.json
and the rest of the generated/untracked sources are embedded into the
PDB to provide a more comprehensive debugging experience. Since we
invoke CSC directly, there were a few workarounds that had to be implemented
(ex: implementing a helper script to account for untracked sources
instead of simply using the EmbedUntrackedSources MSBuild property).

As for testing, the newly added support was validated via the sourcelink
dotnet tool which confirmed all the sources in the PDB either had valid
urls or were embedded.

sourcelink test Microsoft.MacCatalyst.pdb —> sourcelink test passed: Microsoft.MacCatalyst.pdb

The PDB size does increase in size after embedding;
for example, Microsoft.MacCatalyst.pdb went from 5 MB to 15.7 MB.

But considering it would significantly help improve the debugging
experience, be consistent with Android’s offerings, and it’s a
highlighted attribute on the NuGet package explorer I think it’s a
worthy size increase.

Refs:
dotnet/android#7298
dotnet/roslyn#12625
https://github.com/dotnet/sourcelink/tree/main/docs
haritha-mohan added a commit to xamarin/xamarin-macios that referenced this issue Feb 26, 2024
Fixes #18968

We provide a mapping to the checked in source files via SourceLink.json
and the rest of the generated/untracked sources are embedded into the
PDB to provide a more comprehensive debugging experience. Since we
invoke CSC directly, there were a few workarounds that had to be
implemented (ex: implementing a helper script to account for untracked
sources instead of simply using the EmbedUntrackedSources MSBuild
property).

As for testing, the newly added support was validated via the dotnet
sourcelink tool which confirmed all the sources in the PDB either had
valid urls or were embedded.

`sourcelink test Microsoft.MacCatalyst.pdb` —> `sourcelink test passed:
Microsoft.MacCatalyst.pdb`

The PDB size does increase in size after embedding;
Microsoft.MacCatalyst.pdb went from 5 MB to 15.7 MB.

But considering it would significantly help improve the debugging
experience, be consistent with Android’s offerings, and it’s a
highlighted attribute on the NuGet package explorer I think it’s a
worthy size increase.

Refs:
dotnet/android#7298 
dotnet/roslyn#12625
https://github.com/dotnet/sourcelink/tree/main/docs

---------

Co-authored-by: Rolf Bjarne Kvinge <[email protected]>
Co-authored-by: Alex Soto <[email protected]>
Co-authored-by: Michael Cummings (MSFT) <[email protected]>
Co-authored-by: GitHub Actions Autoformatter <[email protected]>