Name		Name	Last commit message	Last commit date
parent directory ..
CatalogLeafItems.md		CatalogLeafItems.md
Certificates.md		Certificates.md
ExcludedPackages.md		ExcludedPackages.md
NuGetPackageExplorerFiles.md		NuGetPackageExplorerFiles.md
NuGetPackageExplorers.md		NuGetPackageExplorers.md
PackageArchiveEntries.md		PackageArchiveEntries.md
PackageArchives.md		PackageArchives.md
PackageAssemblies.md		PackageAssemblies.md
PackageAssets.md		PackageAssets.md
PackageCertificates.md		PackageCertificates.md
PackageCompatibilities.md		PackageCompatibilities.md
PackageContents.md		PackageContents.md
PackageDeprecations.md		PackageDeprecations.md
PackageDownloads.md		PackageDownloads.md
PackageFiles.md		PackageFiles.md
PackageIcons.md		PackageIcons.md
PackageLicenses.md		PackageLicenses.md
PackageManifests.md		PackageManifests.md
PackageOwners.md		PackageOwners.md
PackageReadmes.md		PackageReadmes.md
PackageSignatures.md		PackageSignatures.md
PackageVersions.md		PackageVersions.md
PackageVulnerabilities.md		PackageVulnerabilities.md
PopularityTransfers.md		PopularityTransfers.md
README.md		README.md
SymbolPackageArchiveEntries.md		SymbolPackageArchiveEntries.md
SymbolPackageArchives.md		SymbolPackageArchives.md
SymbolPackageFiles.md		SymbolPackageFiles.md
VerifiedPackages.md		VerifiedPackages.md

README.md

NuGet Insights data tables

The primary purpose of NuGet Insights is to generate CSV data files which can be easily imported into many different query systems. The implementation and some extra metadata are tailored for making a data import into Azure Data Explorer (a.k.a. Kusto) easier, but since CSVs are supported for import in so many technologies, the hope is that you can adapt the data files to your needs with relative ease.

If you are using NuGet Insights in Azure and want to use Azure Data Explorer, there is a built-in importer feature that utilizes its ingestion endpoint. If you're not using Azure Data Explorer, you can download the CSV files from Blob Storage and import them. Or, if your import flow supports URLs, you can generate SAS tokens for the URLs or make your storage container publicly accessible and import directly using Blob Storage blob URLs.

Each data table is represented by a set of CSV files. The data table is split (i.e. partitioned) across multiple files so that processing the data files in memory feasible. Also, many data import systems can operate in parallel across many files. Therefore, having more files that are smaller is often better than having fewer large files.

The CSV file format itself is loosely defined across the industry but NuGet.Insights adheres to RFC 4180. Notably, strings containing certain characters (e.g. line breaks, double quotes, commas) are enclosed in double quotes. All CSV files are UTF-8 encoded without a byte order mark (BOM).

Tables

The following tables are currently generated by NuGet Insights:

Table name	Description
CatalogLeafItems	All catalog leaf items per package
Certificates	All certificates referenced in available package signatures
ExcludedPackages	Whether a package ID is excluded from default search results
NuGetPackageExplorerFiles	SourceLink and debug information for assemblies
NuGetPackageExplorers	Reproducibility status for NuGet packages
PackageArchiveEntries	Low level information about .nupkg ZIP file entries
PackageArchives	Hashes and low level information about .nupkg ZIP archives
PackageAssemblies	.NET assembly metadata such as assembly name, version, and strong name status
PackageAssets	Package assets and target framework as interpreted by the NuGet restore
PackageCertificates	Mapping table between packages and certificates (many-to-many relationship)
PackageCompatibilities	Framework/platform compatibility defined by several different algorithms
PackageContents	Subset of text content files in each package
PackageDeprecations	Package deprecation data per package version, if any
PackageDownloads	Current download counts for individual package versions
PackageFiles	Basic information (such as hash) of every file in the package .nupkg
PackageIcons	Metadata about embedded and icon-URL icons available on NuGet.org
PackageLicenses	Metadata about package license files, expressions and URLs
PackageManifests	Recognized values from .nuspec files
PackageOwners	Current package owner usernames (user or organization)
PackageReadmes	Metadata NuGet package READMEs (either embedded or legacy)
PackageSignatures	Metadata about the NuGet package signature, e.g. author signing certificate
PackageVersions	Components of the package SemVer 2.0.0 and latest version status
PackageVulnerabilities	Package vulnerability data per package version, if any
PopularityTransfers	Popularity transfers for a give package ID, if any
SymbolPackageArchiveEntries	Low level information about .snupkg ZIP file entries
SymbolPackageArchives	Hashes and low level information about .snupkg ZIP archives
SymbolPackageFiles	Basic information (such as hash) of every file in the package .snupkg
VerifiedPackages	Whether a package ID is verified or not

Terminology & concepts

In these documents, the following language is used. To avoid confusion, here's what is meant.

Record: One row in the CSV data file. May represent a view of a single package, package ID, or something more granular.

Package ID: This is the string identity of a package shared by all of its versions, e.g. Newtonsoft.Json.

Package version: This is the SemVer string that refers to a specific version of a package ID, e.g. 9.0.1.

Package: In isolation, this refers to a specific version of a package. A package is identified by an ID and version.

Package Identity: Not to be confused with Package ID, this is the a string with the format {lowercase package ID}/{lowercase normalized package version}. This is used for easy joins per package.

Deleted packages: When a package is deleted from NuGet.org, it must be scrubbed shortly after in the NuGet Insights data sets. This means that historical metadata about deleted packages is not available. The only information that is retained about deleted packages is the package ID, version, and perhaps some enum value indicated that the package was deleted.

Cardinality: Most tables have data related to individual packages (i.e. package versions) but others are at the package ID level or have multiple records per package.

Database normalization: Traditional database normalization is mostly avoided for two reasons.

Normalized (de-duped) data shared between packages makes data clean-up due to package deletes much more complicated.
Query systems like Azure Data Explorer thrive on denormalized data and queries often run faster if you don't have to perform a join.

Table relationships: Sometimes have a child or parent relationship with other tables. This is a primitive form of denormalization and is mostly used to ease some query-time scenarios rather than reducing data size.

Table metadata

Each table document has the following metadata table at the top.

Term	Meaning
Cardinality	Defined in the section above, see Cardinality
Child tables	Defined in the section above, see Table relationships
Parent tables	Defined in the section above, see Table relationships
Column used for CSV partitioning	The table column used for bucketizing records between CSV files, see CSV partitioning algorithm below
Column used for Kusto partitioning	The table column used for a Kusto partitioning policy, for fast Kusto joins
Key fields	The sequence of columns that uniquely identify the row, also used for sorting CSV records
Data file container name	The default Azure Blob Storage container name containing the CSV
Driver	The implementation of the catalog scan driver that produces these records
Record type	The C# reference type that represents the CSV record

Data types

CSV itself doesn't prescribe data types for columns but documentation for each table will describe column data types in simple terms than can be easily mapped to a programming language or schema system of your choice. An empty string in a CSV cell indicates a null value for non-string cells. An empty value in a string CSV cell in ambiguous whether it means empty string or null string. In NuGet Insights, an empty CSV field should be considered an empty string, not null.

Data type	Description
string	Simple string of characters. Depending on the column, this can potentially contain line break characters
bool	A boolean, represented as the string `true` or `false`
short	Signed 16-bit integer, represented in decimal form
ushort	Unsigned 16-bit integer, represented in decimal form
uint	Unsigned 32-bit integer, represented in decimal form
int	Signed 32-bit integer, represented in decimal form
long	Signed 64-bit integer, represented in decimal form
float	A floating point number stored in 4 bytes
double	A floating point number stored in 8 bytes
timestamp	A date and time, represented as an ISO 8601 string. Always in UTC time with the `Z` suffix
enum	An enumerated value, represented as a string. All possible values are documented
flags enum	One or more enumerated values, represented as a comma separated list of strings. All possible values are documented
object	A JSON encoded object. The schema of the object is documented
array of objects	A JSON encoded array of objects. The schema of the array is documented
array of strings	A JSON encoded array of strings. The schema of the array is documented

CSV partitioning algorithm

If a table has more than one data file, the following algorithm is used to determine which specific CSV file a record should go in to. The primary implemention is written in C# (copied below for understanding). To calculate the bucket in Kusto (KQL), you can use the GetBucket user-defined function implemented in this repository.

The bucketCount is the maximum number of data files. This is a fixed number per table. The bucketKey is a string used for finding the specific CSV file (a.k.a. bucket) to put the record in to. On each table document, the Column used for partitioning property states which CSV column was used for partitioning.

The bucketKey for each table is documented as "Column used for CSV partitioning".

Example: when the bucketCount is 1000 and the bucketKey is newtonsoft.json, the bucket is 892.

C# implementation

public static int GetBucket(int bucketCount, string bucketKey)
{
    int bucket;
    using (var algorithm = SHA256.Create())
    {
        var hash = algorithm.ComputeHash(Encoding.UTF8.GetBytes(bucketKey));
        bucket = (int)(BitConverter.ToUInt64(hash) % (ulong)bucketCount);
    }

    return bucket;
}

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

tables

tables

README.md

NuGet Insights data tables

Tables

Terminology & concepts

Table metadata

Data types

CSV partitioning algorithm

C# implementation

Files

tables

Directory actions

More options

Directory actions

More options

Latest commit

History

tables

Folders and files

parent directory

README.md

NuGet Insights data tables

Tables

Terminology & concepts

Table metadata

Data types

CSV partitioning algorithm

C# implementation