The primary purpose of NuGet Insights is to generate CSV data files which can be easily imported into many different query systems. The implementation and some extra metadata are tailored for making a data import into Azure Data Explorer (a.k.a. Kusto) easier, but since CSVs are supported for import in so many technologies, the hope is that you can adapt the data files to your needs with relative ease.
If you are using NuGet Insights in Azure and want to use Azure Data Explorer, there is a built-in importer feature that utilizes its ingestion endpoint. If you're not using Azure Data Explorer, you can download the CSV files from Blob Storage and import them. Or, if your import flow supports URLs, you can generate SAS tokens for the URLs or make your storage container publicly accessible and import directly using Blob Storage blob URLs.
Each data table is represented by a set of CSV files. The data table is split (i.e. partitioned) across multiple files so that processing the data files in memory feasible. Also, many data import systems can operate in parallel across many files. Therefore, having more files that are smaller is often better than having fewer large files.
The CSV file format itself is loosely defined across the industry but NuGet.Insights adheres to RFC 4180. Notably, strings containing certain characters (e.g. line breaks, double quotes, commas) are enclosed in double quotes. All CSV files are UTF-8 encoded without a byte order mark (BOM).
The following tables are currently generated by NuGet Insights:
Table name | Description |
---|---|
CatalogLeafItems | All catalog leaf items per package |
Certificates | All certificates referenced in available package signatures |
ExcludedPackages | Whether a package ID is excluded from default search results |
NuGetPackageExplorerFiles | SourceLink and debug information for assemblies |
NuGetPackageExplorers | Reproducibility status for NuGet packages |
PackageArchiveEntries | Low level information about .nupkg ZIP file entries |
PackageArchives | Hashes and low level information about .nupkg ZIP archives |
PackageAssemblies | .NET assembly metadata such as assembly name, version, and strong name status |
PackageAssets | Package assets and target framework as interpreted by the NuGet restore |
PackageCertificates | Mapping table between packages and certificates (many-to-many relationship) |
PackageCompatibilities | Framework/platform compatibility defined by several different algorithms |
PackageContents | Subset of text content files in each package |
PackageDeprecations | Package deprecation data per package version, if any |
PackageDownloads | Current download counts for individual package versions |
PackageFiles | Basic information (such as hash) of every file in the package .nupkg |
PackageIcons | Metadata about embedded and icon-URL icons available on NuGet.org |
PackageLicenses | Metadata about package license files, expressions and URLs |
PackageManifests | Recognized values from .nuspec files |
PackageOwners | Current package owner usernames (user or organization) |
PackageReadmes | Metadata NuGet package READMEs (either embedded or legacy) |
PackageSignatures | Metadata about the NuGet package signature, e.g. author signing certificate |
PackageVersions | Components of the package SemVer 2.0.0 and latest version status |
PackageVulnerabilities | Package vulnerability data per package version, if any |
PopularityTransfers | Popularity transfers for a give package ID, if any |
SymbolPackageArchiveEntries | Low level information about .snupkg ZIP file entries |
SymbolPackageArchives | Hashes and low level information about .snupkg ZIP archives |
SymbolPackageFiles | Basic information (such as hash) of every file in the package .snupkg |
VerifiedPackages | Whether a package ID is verified or not |
In these documents, the following language is used. To avoid confusion, here's what is meant.
Record: One row in the CSV data file. May represent a view of a single package, package ID, or something more granular.
Package ID: This is the string identity of a package shared by all of its versions, e.g. Newtonsoft.Json
.
Package version: This is the SemVer string that refers to a specific version of a package ID, e.g. 9.0.1
.
Package: In isolation, this refers to a specific version of a package. A package is identified by an ID and version.
Package Identity: Not to be confused with Package ID, this is the a string with the format {lowercase package ID}/{lowercase normalized package version}
. This is used for easy joins per package.
Deleted packages: When a package is deleted from NuGet.org, it must be scrubbed shortly after in the NuGet Insights data sets. This means that historical metadata about deleted packages is not available. The only information that is retained about deleted packages is the package ID, version, and perhaps some enum value indicated that the package was deleted.
Cardinality: Most tables have data related to individual packages (i.e. package versions) but others are at the package ID level or have multiple records per package.
Database normalization: Traditional database normalization is mostly avoided for two reasons.
- Normalized (de-duped) data shared between packages makes data clean-up due to package deletes much more complicated.
- Query systems like Azure Data Explorer thrive on denormalized data and queries often run faster if you don't have to perform a join.
Table relationships: Sometimes have a child or parent relationship with other tables. This is a primitive form of denormalization and is mostly used to ease some query-time scenarios rather than reducing data size.
Each table document has the following metadata table at the top.
Term | Meaning |
---|---|
Cardinality | Defined in the section above, see Cardinality |
Child tables | Defined in the section above, see Table relationships |
Parent tables | Defined in the section above, see Table relationships |
Column used for CSV partitioning | The table column used for bucketizing records between CSV files, see CSV partitioning algorithm below |
Column used for Kusto partitioning | The table column used for a Kusto partitioning policy, for fast Kusto joins |
Key fields | The sequence of columns that uniquely identify the row, also used for sorting CSV records |
Data file container name | The default Azure Blob Storage container name containing the CSV |
Driver | The implementation of the catalog scan driver that produces these records |
Record type | The C# reference type that represents the CSV record |
CSV itself doesn't prescribe data types for columns but documentation for each table will describe column data types in simple terms than can be easily mapped to a programming language or schema system of your choice. An empty string in a CSV cell indicates a null value for non-string cells. An empty value in a string CSV cell in ambiguous whether it means empty string or null string. In NuGet Insights, an empty CSV field should be considered an empty string, not null.
Data type | Description |
---|---|
string | Simple string of characters. Depending on the column, this can potentially contain line break characters |
bool | A boolean, represented as the string true or false |
short | Signed 16-bit integer, represented in decimal form |
ushort | Unsigned 16-bit integer, represented in decimal form |
uint | Unsigned 32-bit integer, represented in decimal form |
int | Signed 32-bit integer, represented in decimal form |
long | Signed 64-bit integer, represented in decimal form |
float | A floating point number stored in 4 bytes |
double | A floating point number stored in 8 bytes |
timestamp | A date and time, represented as an ISO 8601 string. Always in UTC time with the Z suffix |
enum | An enumerated value, represented as a string. All possible values are documented |
flags enum | One or more enumerated values, represented as a comma separated list of strings. All possible values are documented |
object | A JSON encoded object. The schema of the object is documented |
array of objects | A JSON encoded array of objects. The schema of the array is documented |
array of strings | A JSON encoded array of strings. The schema of the array is documented |
If a table has more than one data file, the following algorithm is used to determine which specific CSV file a record should go in to. The primary implemention is written in C# (copied below for understanding). To calculate the bucket in Kusto (KQL), you can use the GetBucket
user-defined function implemented in this repository.
The bucketCount
is the maximum number of data files. This is a fixed number per table. The bucketKey
is a string used for finding the specific CSV file (a.k.a. bucket) to put the record in to. On each table document, the Column used for partitioning property states which CSV column was used for partitioning.
The bucketKey
for each table is documented as "Column used for CSV partitioning".
Example: when the bucketCount
is 1000 and the bucketKey
is newtonsoft.json
, the bucket is 892
.
public static int GetBucket(int bucketCount, string bucketKey)
{
int bucket;
using (var algorithm = SHA256.Create())
{
var hash = algorithm.ComputeHash(Encoding.UTF8.GetBytes(bucketKey));
bucket = (int)(BitConverter.ToUInt64(hash) % (ulong)bucketCount);
}
return bucket;
}