Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add VulnerabilityInfo to NuGet.Protocol public API #5115

Merged
merged 15 commits into from
Apr 10, 2023

Conversation

zivkan
Copy link
Member

@zivkan zivkan commented Mar 29, 2023

Bug

Fixes: NuGet/Home#12518

Regression? Last working version:

Description

Implements VulnerabilityInfo resource as per the vulnerabilities in restore spec.

  • Added some model classes for JSON to be deserialized into
  • Created a class that implements IReadOnlyDictionary, only to have a JsonConverter that allows us to create an instance of Dictionary with StringComparer.InvariantIgnoreCase.
  • Created interface that all sources can (theoretically) implement for vulnerability data
  • Created V3 HTTP resource and provider for the vulnerability data interface.

PR Checklist

@zivkan
Copy link
Member Author

zivkan commented Mar 29, 2023

Public API review suggestions

Since we use Roslyn's Public API Analyzers, we can see all changes to public APIs in PublicAPI.Shipped.txt and PublicAPI.Unshipped.txt. For more info, see https://github.com/NuGet/NuGet.Client/blob/dev/docs/nuget-sdk.md.

In this PR, I added #nullable enable to some files that did not previously have them, so those are the changes we see to PublicAPI.Shipped.txt files. By all means validate, but none of these APIs should have changed, it's just adding nullable annotations. The new APIs are all in PublicAPI.Unshipped.txt

There are some new public APIs due to necessary NuGet.Protocol design, but the most important API in this PR is IVulnerabilityInfoResource. Basically everything else is just to support that API.

IVulnerabilityInfoResource

It only has one method, GetVulnerabilityInfoAsync, which returns a GetVulnerabilityInfoResult. More on that after the sample.

The goal is for usage of the API to look something like this:

SourceRepository source = GetSource();
IVulnerabilityInfoResource? vulnResource = await source.GetResourceAsync(IVulnerabilityInfoResource();
if (vulnResource == null)
{
  // this source does not have vulnerability info
  return null;
}

GetVulnerabilityInfoResult vulnData = await vulnResource.GetVulnerabilityInfoAsync(...);

IReadOnlyList<IDictionary<string, IReadOnlyList<VulnerabilityInfo>>>? knownVulnerabilities
    = vulnData.KnownVulnerabilities;
List<VulnerabilityInfo> packageVulnerabilities = new();
if (knownVulnerabilities != null)
{
    for (int file = 0; file < knownVulnerabilities.Count; file++)
    {
        if (knownVulnerabilities[file].TryGet(packageId, out IReadOnlyList<VulnerabilityInfo>? knownPackageVulnerabilities))
        {
            foreach (VulnerabilityInfo knownPackageVulnerability in knownPackageVulerabilities)
            {
                if (packageVersion.Satisfies(knownPackageVulnerability.Versions))
                {
                    packageVulnerabilities.Add(knownPackageVulnerability);
                }
            }
        }
    }
}

// Present package vulnerabilities however appropriate for the current tool

if (vulnData.Errors?.Count > 0)
{
    // Getting vulnerability data from source was not completely successful.
    // Present errors as appropriate
}

The big nested loop in the middle, that checks all files and checks each known vulnerability if the package version being checked satisfies the vulnerability data range, might become a public helper method in another assembly, but it doesn't feel like it belongs in NuGet.Protocol. To be honest, I don't know which assembly such a public API should belong in. So maybe by the time the feature is complete it will be in NuGet.Protocol. But filtering vulerabilities based on package ID and version is not related to "communication with NuGet feeds", hence it doesn't feel like the right place currently.

Why an interface

(Almost?) All of NuGet.Protocol's other resources are abstract classes. Why not follow that design, and put common code that all implementations (HTTP and in the future local folder source) will need?

Firstly, I'm a believer in composition over inheritance. While well intentioned, I don't believe that inheritance is a good solution to code sharing. Helper functions are better, in my opinion, and when necessary, use delegate parameters. C# 3.5 added Func and Action which makes it much easier to use. Although in perf-critical code it's best to avoid using (anonymous) lamba functions, since that causes allocations. Where possible, it's best to pass in a method name with a matching signature.

More importantly, I'm also a string believer in writing testable code. While NuGet.Protocol.Tests and NuGet.Protocol.FuncTests does test a lot of NuGet.Protocol's code, if you read the code and read the test infrastructure, I think it's clear that it's not easy. For example, look at, and debug while stepping into every method call, anything that uses StaticHttpSource. The intention is to use a dictionary where the key is the URL and the value is the response payload. However, it runs a whole lot of production code, including needing workarounds to avoid the http cache possibly returning a payload from a different test. Not to mention that NuGet.Protocol.FuncTests takes multiple minutes to execute, despite a small number of tests. By making the base resource, which everything outside of NuGet.Protocol uses, an interface, it allows tests to mock the return code, avoiding a whole lot of test infrastructure, and making the whole thing faster to execute.

Why not report failures with ILogger

For performance reasons, it is intended to have a shared context object get the vulerability data from all sources just once. My assumption might turn out to be wrong by the time I get to using these APIs in RestoreCommand, but I believe that reporting errors and warnings with ILogger will cause the message to be reported to the first project that calls the API. However, if a package source's vulnerability data can't be parsed, this means the vulnerability checking on all projects is incomplete. Therefore, any warnings or errors related to incomplete vulnerability checking applies to all projects. By using a list of Exceptions, it allows the per-project restore code to add errors to that project, meaning next restore will fail no-op checks, and will therefore try to recheck the vulnerability data, which will hopefully succeed in a following attempt.

GetVulnerabilityInfoResult

This type was created to avoid using a Tuple or ValueTuple. While ValueTuple can use the syntax (int Thing, string Whatever), so the tuple's items have "real" names, rather than Item1, and Item2, since the two items types have long definitions, it wouldn't be easy to use without var. While we could add a Deconstruct method, I'd rather wait and see when it's useful, rather than preemptively adding another public API that might not be useful.

Another thing to note (and it looks like I need to fill in the XMLDoc) is that either of the two properties can be null. If there's some kind of failure that prevents any of the vulnerability files from being parsed, then the KnownVulerabilities will be null. If there are no errors, then Exceptions will be null. If there are some exceptions, but some files were successfully parsed, then neither will be null, and the caller is expected to still understand that this is an error situation, but one that allows degraded functionality.

Why not have IVulnerabilityInfoResource.GetVulnerabilityInfoAsync throw when any error happens?

Since the design spec of the feature explicitly recommends that package sources split their vulnerability data into large, periodically updated "base" files, and a second frequently updated file, we encourage sources to provide multiple files. In case a server has error, for example the vulnerability index points to a missing file, or a file that is in the process of being updated and therefore the client gets a JSON file that can't be parsed correctly, it allows the client to run in a degraded state, where it can still check vulnerabilities for data is successfully downloaded, before providing appropriate error handling.

Think of it as a kind of "try-get", but since async methods can't use out parameters, a new data type is being used instead of a tuple.

Why not merge multiple files into a single dictionary

IVulnerabilityInfoResult.KnownVulnerabilities is a list of dictionaries. I'll agree that this feels like a leaky abstraction. However, I believe this may be a case where it's warranted because this code is on a known hot path. It's not on the "hottest path" (no-op restore), but any time no-op fails, then this code will be run, even if no packages need to be downloaded. Since speed or performance is often one of the biggest complaint areas from customer when we run surveys, we know it's important to customers that scenarios like this are fast. While my own testing/benchmarking can't possibly cover 100% of all scenarios, and I am making assumptions about how this not-yet-released API is going to be used in the future, benchmarking makes it appear that the time to merge the multiple dictionaries into a single dictionary, or reading the vulnerabilities files serially into a single dictionary, rather than in parallel into multiple dictionaries, is slower than having the vulnerability checking code iterate over multiple dictionaries.

VulerabilityInfoResourceV3Provider

This type is public because SourceRepository is designed to be extensible, so someone using NuGet.Protocol can either filter out, or inject their own providers when they create their own SourceRepository. Whether they use GetCoreV3().Where(p => !(p.Value is VulnerabilityInfoResourceV3Provider, or they create their own provider and pass nameof(VulnerabilityInfoResourceV3Provider) to the beforeProviders argument, both cases need the type to be public. Well, technically the second doesn't, but if it wasn't public they'd have to provide a string without allowing the compiler to validate the correct name for them.

VulnerabilityInfo

This is a 1:1 mapping of the deepest object within the vulnerability data file that package sources need to provide to the client. While I was originally concerned about this being a leaky abstraction, in the end I thought it was the ideal data structure. Clients need the URL and severity to show to customers. And the versions list could be useful as well. Imaging something like VS's Package Manager UI, where you get the package details of a package, and it not only tells you that it has a known vulnerability, but it also tells you the version range the vulerability applies to, so you can know in advance what the lowest version with a fix is, so you don't need to check every version individually to see when the known vulnerability information stops.

@kartheekp-ms
Copy link
Contributor

GetVulnerabilityInfoAsync, which returns a GetVulnerabilityInfoResult

I am curious to understand why not return VulnerabilityInfoResult instead of GetVulnerabilityInfoResult

@kartheekp-ms
Copy link
Contributor

The big nested loop in the middle, that checks all files and checks each known vulnerability if the package version being checked satisfies the vulnerability data range, might become a public helper method in another assembly, but it doesn't feel like it belongs in NuGet.Protocol.

May be the decision is already made, have we considered an option to download package vulnerabilities just like package versions by appending package name to the registration base URL for example https://pkgs.dev.azure.com/dnceng/9ee6d478-d288-47f7-aacc-f6e6d082ae6d/_packaging/45bacae2-5efb-47c8-91e5-8ec20c22b4f8/nuget/v3/flat2/nuget.common/index.json. I don't have the performance metrics yet, but I think if the vulnerabilities database size increases in NuGet.org then performance may degrade.

@loic-sharma
Copy link
Contributor

loic-sharma commented Mar 31, 2023

Since the design spec of the feature explicitly...

Is this design spec public? Would you mind sharing it?

EDIT: Here it is: https://github.com/NuGet/Home/blob/72a2e173e1c7d82880f21b8574a81874eeb15ead/proposed/2022/vulnerabilities-in-restore.md

@loic-sharma
Copy link
Contributor

This looks great! Thanks for the excellent write up to make this easy to review, it's very appreciated.

Did you consider making VulnerabilityInfoResourceV3.GetVulnerabilityFilesAsync and VulnerabilityInfoResourceV3.GetVulnerabilityDataAsync public and part of IVulnerabilityInfoResourceV3? Scenarios where this might be useful:

  1. I want a tool that notifies me whenever nuget.org updates its vulnerability data
  2. I know of nuget.org's files and only want to ingest the update file without ever reading the base file.

for (int i = 0; i < tasks.Length; i++)
{
V3VulnerabilityIndexEntry indexEntry = indexEntries[i];
tasks[i] = GetVulnerabilityDataAsync(indexEntry, cacheContext, logger, cancellationToken);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would clamp this to an upper bound to avoid flooding the thread pool with tasks if nuget.org ever decides to have thousands of files.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think any WhenAll http communication should have an upper bound. This one is a sneaky/unobvious one.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I put in a basic limit of 16 for now. A better throttle so that the max limit isn't all done concurrently can be a perf optimization later.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My comment was about the parallelization degree instead of the total count of pages actually.

@kartheekp-ms
Copy link
Contributor

kartheekp-ms commented Mar 31, 2023

I know of nuget.org's files and only want to ingest the update file without ever reading the base file.

I am not sure if this is possible because as per https://github.com/NuGet/Home/blob/dev/proposed/2022/vulnerabilities-in-restore.md#dedicated-vulnerability-information-resource, it is possible to update the base file also.

The vulnerability resource contains 2 pages.
One page, base, represents the data up to a certain point in time.
The second page, update, represents the data from the last update of base, to present.
Periodically, for example once a month, the data from update should be migrated to base.
If an entry needs to be removed from base, base should be updated.

@erdembayar
Copy link
Contributor

erdembayar commented Apr 1, 2023

The big nested loop in the middle, that checks all files and checks each known vulnerability if the package version being checked satisfies the vulnerability data range,

If performance is a concern here then have you considered other data structure types? Perhaps a Trie data structure could be an alternative instead of a big nested loop. We have already implemented one for Package Source mapping lookup by pattern. Actually, it could be further improved by using string compression. I haven't had time to implement this, and so far there have been no reported performance concerns regarding that.
The only high cost would be the initial Trie tree building, and then the lookup would be cheap. Still tbh, without actual perf test it's hard to say what would be perf difference, it depends on parameters like load size and number of packages etc.

@kartheekp-ms
Copy link
Contributor

If performance is a concern here then have you considered other data structure types? Perhaps a Trie data structure could be an alternative instead of a big nested loop.

In my opinion, the Trie data structure may not be appropriate for this scenario as our objective is not to match prefixes, as in the case of a package source mapping feature.

/// <remarks>Vulnerability data from a package source typically is spread across multiple files. Therefore, it's
/// possible for some data to be retrieved successfully, while other data failed. This type allows partial data
/// to be returned along with error information.</remarks>
public class GetVulnerabilityInfoResult
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From #5115 (comment), since GitHub mysteriously doesn't supported threaded/resolvable conversations unless it's linked to a specific line in a file

I am curious to understand why not return VulnerabilityInfoResult instead of GetVulnerabilityInfoResult

@kartheekp-ms my thought process was that the data type for the payload data (list of dictionary) is very specific to one API and might not be interesting to other APIs. So, rather than using an overly generic name for a specific data type, give a specific name. Although there's a good chance I was over-thinking it.

I also experimented with:

public class ApiResult<T>
{
    public T? Result { get; }
    public IReadOnlyList<Exception>? Exceptions { get; }
}

However, I had some problem trying to get it to work. I no longer remember exactly what problems, possibly something to do with Newtonsoft.Json deserialization.

@zivkan zivkan force-pushed the dev-zivkan-protocol-vulnerability-resource branch from 76b833b to 8fce802 Compare April 3, 2023 16:43
@zivkan zivkan marked this pull request as ready for review April 4, 2023 15:22
@zivkan zivkan requested a review from a team as a code owner April 4, 2023 15:22
@zivkan zivkan force-pushed the dev-zivkan-protocol-vulnerability-resource branch from 62c4aa7 to 6edca4d Compare April 5, 2023 13:14
continue;
}

bool hasOnlyValidChars = true;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: this could be a local or static function to be a little cleaner

jeffkl
jeffkl previously approved these changes Apr 6, 2023
Copy link
Contributor

@jeffkl jeffkl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work!

Copy link
Member

@nkolev92 nkolev92 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great! Some great tests.

I don't see any handling of individual file updates. Is that intentional?

for (int i = 0; i < tasks.Length; i++)
{
V3VulnerabilityIndexEntry indexEntry = indexEntries[i];
tasks[i] = GetVulnerabilityDataAsync(indexEntry, cacheContext, logger, cancellationToken);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think any WhenAll http communication should have an upper bound. This one is a sneaky/unobvious one.


namespace NuGet.Protocol.Tests.Resources
{
public class VulnerabilityInfoResourceV3Tests
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could use a few more tests (can be in a future PR).

  • GetVulnerabilityDataAsync
    • Fetches data from all index entries (testing more than 1).
    • Multiple calls use http cache.
    • Multiple calls when 1 file gets updated only fetches the updated file.

jebriede
jebriede previously approved these changes Apr 6, 2023
Copy link
Contributor

@jebriede jebriede left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approving with suggestions! 👍

Comment on lines 221 to 228
string name = entry.Name;
if (string.IsNullOrWhiteSpace(name))
{
string message = string.Format("Vulnerability page index {0} does not contain a name (Source: '{1}')",
i, _sourceRepository.PackageSource.Name);
AddException(new FatalProtocolException(message), ref exceptions);
continue;
}
Copy link
Contributor

@jebriede jebriede Apr 6, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This method has a series of validations that follow the same pattern:
Test a predicate. If true, build a message, throw an exception, and continue.

We can reduce code duplication and make it easier to read by making a method or object for the repeated validation pattern.

@zivkan zivkan dismissed stale reviews from jebriede and jeffkl via dd60828 April 7, 2023 10:09
{
/// <summary>Represents one page of vulnerability data.</summary>
/// <remarks>More details at <a href="https://learn.microsoft.com/nuget/api/vulnerability-info-resource">https://learn.microsoft.com/nuget/api/vulnerability-info-resource</a></remarks>
public sealed class V3VulnerabilityIndexEntry
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: V3 in the name is probably unnecessary.
Or if it is, I think it should be at the end instead of the beginning.

for (int i = 0; i < tasks.Length; i++)
{
V3VulnerabilityIndexEntry indexEntry = indexEntries[i];
tasks[i] = GetVulnerabilityDataAsync(indexEntry, cacheContext, logger, cancellationToken);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My comment was about the parallelization degree instead of the total count of pages actually.

@zivkan zivkan merged commit c3544ac into dev Apr 10, 2023
@zivkan zivkan deleted the dev-zivkan-protocol-vulnerability-resource branch April 10, 2023 05:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add VulnerabilityInfo APIs into NuGet.Protocol
7 participants