
System.IO.Compression test asset and native dependency proposal #14403

Closed
FiveTimesTheFun opened this issue Mar 31, 2015 · 17 comments


@FiveTimesTheFun
Contributor

System.IO.Compression provides an API to compress and decompress files. It supports the ZIP and gzip archive formats and the Deflate compression algorithm.

_Testing_

System.IO.Compression has existing test assets that fit three categories:

  1. Round trip data through compression and verify it remains intact
  2. Extract and verify data compressed by other compatible compression libraries
  3. Verify behavior when attempting to decompress invalid or edge-case archives

The existing assets include archives as well as decompressed binary files, used both as source material and as an oracle to verify decompression. Altogether, the binary assets weigh in at about 40MB.
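The three categories above map naturally onto executable checks. As a rough sketch, using Python's zlib/gzip modules to stand in for System.IO.Compression (category 2 is omitted here since it requires archives produced by external tools such as 7-Zip):

```python
import gzip
import zlib

# Category 1: round-trip data through compression and verify it survives intact.
original = b"The quick brown fox jumps over the lazy dog. " * 100

deflated = zlib.compress(original)           # Deflate round trip
assert zlib.decompress(deflated) == original

gzipped = gzip.compress(original)            # gzip round trip
assert gzip.decompress(gzipped) == original

# Category 3: invalid input should fail cleanly, not crash or corrupt state.
try:
    zlib.decompress(b"definitely not a deflate stream")
    raise AssertionError("expected zlib.error")
except zlib.error:
    pass
```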

Problem: these test assets need to exist at test run time. Placing them in our corefx git repository is problematic because binary assets do not compress well and will bloat the size of our repository.

Proposal: Binary test assets are packaged into one nuget package per test assembly, to be deployed at test run time. The binary assets used to generate these nuget packages are persisted in a single, separate Git repository.

Git repository structure:
corefx-testdata
corefx-testdata\System.IO.Compression.TestData.nuspec
corefx-testdata\System.IO.Compression.TestData\<binary assets>

Nupkg structure:
content\<binary assets>
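A hypothetical .nuspec for that layout might look like the following (the package id matches the proposal; version, authors, and description are illustrative):

```xml
<?xml version="1.0"?>
<package>
  <metadata>
    <id>System.IO.Compression.TestData</id>
    <version>1.0.0</version>
    <authors>dotnet</authors>
    <description>Binary test assets for the System.IO.Compression tests.</description>
  </metadata>
  <files>
    <!-- Everything under the asset directory lands in the package's content folder. -->
    <file src="System.IO.Compression.TestData\**\*" target="content" />
  </files>
</package>
```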

Each test assembly's packages.config declares a dependency on its test asset nuget package.

Currently, nuget is designed to deploy at package reference time (i.e. when a package is referenced in Visual Studio) rather than at run time. In Dev14 this is changing. Until then, we'll include a target in the test assembly's csproj to copy the assets from the downloaded nuget package to the test's working directory.
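Such a copy target could be sketched along these lines ($(PackagesDir), the item name, and the wildcard path are placeholders, not the actual corefx build properties):

```xml
<Target Name="CopyTestAssets" AfterTargets="Build">
  <ItemGroup>
    <!-- Pick up everything the test-data package placed under content. -->
    <TestAsset Include="$(PackagesDir)\System.IO.Compression.TestData.*\content\**\*" />
  </ItemGroup>
  <!-- Mirror the assets into the test working directory, preserving subfolders. -->
  <Copy SourceFiles="@(TestAsset)" DestinationFolder="$(OutDir)\%(RecursiveDir)" />
</Target>
```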

In addition to System.IO.Compression, the metadata reader uses binary test assets, but at only 100KB combined they are not a practical problem. System.IO.Packaging has 60GB of test assets.

Proposal: System.IO.Packaging splits its test assets into tiers, with a single nuget package per tier. The inner-loop tests should require no more than some number of megabytes.

_Native Dependencies_

In the desktop .NET Framework, although we have a purely managed implementation of the Deflate algorithm, customers have reported that the popular cross-platform open-source zlib library is faster and produces higher-quality compression. In response we include zlib, which we wrap and delegate to. We build it from source in the FX partition, using the unmodified 1.2.3 version of the sources, and name the resulting output "clrcompression.dll."
On Windows, to fit with .NET Core’s app-local, side-by-side model, if we include zlib, we should deploy a copy of the native assembly alongside the managed System.IO.Compression.dll assembly included with an application. zlib distributes both sources and official binaries. While we could redistribute the official zlib binary, unfortunately they only release a 32-bit version and we are currently distributing a 64-bit version of coreclr.
While zlib has a permissive license, it is not the MIT license, and including it in the corefx repository would mean the corefx repository no longer only contained code under a single license.

Proposal: Continue building clrcompression.dll from the FX repository. Publish the 64-bit version of clrcompression.dll in a nuget package System.IO.Compression.clrcompression-x64.nupkg. Make System.IO.Compression.clrcompression-x64 a dependency of the System.IO.Compression package.
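Concretely, this dependency would surface in the System.IO.Compression package's nuspec as something like (version illustrative):

```xml
<dependencies>
  <dependency id="System.IO.Compression.clrcompression-x64" version="1.0.0" />
</dependencies>
```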

X-Plat todo: Linux has a strong history of deploying applications and their dependencies through package managers. Packagers bundle the application binaries and metadata describing dependencies into packages, and publish those on package manager repositories. We are planning to declare other framework and runtime dependencies through package managers where appropriate, and System.IO.Compression's zlib dependency should be consistent with that.

@FiveTimesTheFun FiveTimesTheFun self-assigned this Mar 31, 2015
@akoeplinger
Member

Continue building clrcompression.dll from the FX repository.

Another alternative might be forking https://github.com/madler/zlib into https://github.com/dotnet and applying the few changes there, to keep the corefx repo on a single license.

@FiveTimesTheFun
Contributor Author

Thank you, very good suggestion, I'll reach out to Mark.

@pedrolamas
Contributor

Though it does seem reasonable to me to have the test assets in separate nuget packages, I'd probably just bypass nuget and add a build target to download a specific revision of the assets from a separate GitHub repository.

One could even take advantage of GitHub's built-in download feature to bypass git and just get the files at a specific commit (the target would then download and unzip the assets).

@FiveTimesTheFun
Contributor Author

Hi @pedrolamas - thanks for the reply.

To elaborate, I planned to host the test asset nuget packages on a myget.org feed, "dotnet-corefxtestdata", because we already have dependencies on myget.org and nuget.org ingrained into our build. I'd like to avoid introducing a build-time dependency on github.com for the first time unless there is a compelling reason to.

I'm also unfamiliar with the GitHub feature to download files at a specific commit, but would that download all files in the repository? By splitting the test assets into individual nuget packages per test assembly, I'd hoped to prevent individuals running tests for one assembly from needing to download assets for another assembly; see in the spec where System.IO.Packaging has 60GB of total test assets.

Actually, if some libraries end up having very large test assets across all their tiers, we may want to split the test asset git repositories in the future, too. In the speclet above I said that all the assets would be in a single repository; what do people think of that? It's turned out to be very convenient for the corefx libraries to share the same corefx repository. The advantages we've seen include a single issue and pull request list, a single URL we can point people to in documentation, and a single area for people to collaborate, contribute, and witness progress. Full disclosure: I was originally a proponent of having more repositories with fewer libraries each, but in hindsight I'm glad we did what we did.

@eatdrinksleepcode

Although I understand the desire to not commit large binary assets to the same repo, I am not especially comfortable with keeping test assets in another repo.

As an alternative, would it be possible to generate these assets, with the code to generate them checked in to the main repo?


@eatdrinksleepcode

Also, is it possible to look at the existing tests and understand whether the size and number of existing tests assets is actually required? 60GB of test assets feels particularly excessive.


@akoeplinger
Member

@FiveTimesTheFun

In the speclet above I said that all the assets would be in a single repository, what do people think of that?

Sounds fine to me. In reality, almost no one is going to clone that repo anyway, and the test data doesn't change constantly either, so the usual arguments for "don't put binaries in the repo" don't really apply.

Having said that, the 60 gigabytes of test assets are a different story; I'm not sure that should really be checked into a git repository (does GitHub even support repos of that size?). Personally, I don't really care whether those assets are published in a repo; having them only on myget and internally on some MS file share should be enough.

@eatdrinksleepcode

As an alternative, would it be possible to generate these assets, with the code to generate them checked in to the main repo?

The problem with that is you can't be sure you'll get the exact same results a few years from now, because dependencies, etc. might change and you might not notice that you regressed something.

@FiveTimesTheFun
Contributor Author

Yes, 60GB is a lot. I don't directly own that area, so I don't have full context on the test assets; I'm not sure whether they could be reduced, but I'm operating on the assumption that they shouldn't be.

Unfortunately @eatdrinksleepcode it's not feasible to generate these test assets on the fly. To elaborate on @akoeplinger's response, many of these test assets are designed to verify compatibility with binary formats. In System.IO.Compression's case, we generated archives with 7-Zip, Windows Explorer, and other compression libraries to ensure compatibility.

@akoeplinger - that was my original thought too: jam them all into the same repository, but split them into individual nuget packages for test time. We'll have to see what happens when System.IO.Packaging comes online. If we didn't keep the System.IO.Packaging assets in a git repository, where would we keep the nuspec to regenerate new versions of the package? Keep the nuspec in the repo and the assets on myget, with a package-update workflow of:

  1. clone the repo to get the nuspec
  2. download the previous nupkg
  3. update files
  4. update nuspec version number, commit
  5. generate new package, publish?
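Step 4 of that workflow is mechanical enough to script. A minimal sketch (a hypothetical helper, assuming a simple major.minor.patch version element in the nuspec):

```python
import re

def bump_nuspec_version(nuspec_text: str) -> str:
    """Increment the patch component of the <version> element in a nuspec."""
    def bump(match: re.Match) -> str:
        major, minor, patch = match.group(1).split(".")
        return "<version>%s.%s.%d</version>" % (major, minor, int(patch) + 1)
    return re.sub(r"<version>(\d+\.\d+\.\d+)</version>", bump, nuspec_text)

print(bump_nuspec_version("<version>1.0.3</version>"))
# prints <version>1.0.4</version>
```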

@eatdrinksleepcode

@akoeplinger Why would that be the case? Our build should always be deterministic; if it's not, we need to fix that. And there is no reason for the dependencies of the binary generator to change, so I don't see that as a problem either.

@akoeplinger
Member

@eatdrinksleepcode as Matt mentioned, some of them are designed to ensure compatibility. You'd need to e.g. keep a Windows XP machine around to generate a ZIP file with Windows Explorer as later OSes might've changed the implementation.

@FiveTimesTheFun .nuspec in the repo, yep. sounds acceptable.

@justinvp
Contributor

GitHub announced Git Large File Storage (LFS) a couple of days ago, which might be useful for the test assets.

@robertmclaws

Regarding the native dependencies, I maintain a fork of the Zlib Silverlight implementation from DotNetZip that is portable and currently distributed on NuGet. Would it make sense to provide that code for use in the Framework?

HTH!

@lilith

lilith commented May 7, 2015

While zlib is quite stable/secure, I would like to see minimal friction around updates to native libraries in general. Anything we can do to ease the pain of a native binary nuget package would be wonderful for all of us. We've hacked our tree together on AppVeyor, but have to use custom scripts to extract the nuget packages and assemble the components.

@Petermarcu
Member

@nathanaeljones, we're actively working with the NuGet folks to enable better support for native dependencies.

@stephentoub
Member

@FiveTimesTheFun, can this be closed now?

@lilith

lilith commented Sep 14, 2015

There are much faster versions of zlib. Consider Intel's or Cloudflare's forks of 1.2.8: http://www.htslib.org/benchmarks/zlib.html

Any reason you're using a buggy 2005 edition of zlib (v1.2.3) instead of 1.2.8?

@joshfree
Member

Closing since @FiveTimesTheFun went ahead and created https://github.com/dotnet/corefx-testdata back in April to store Compression, Packaging, and X509 Cert test data.

@msftgits msftgits transferred this issue from dotnet/corefx Jan 31, 2020
@msftgits msftgits added this to the 1.0.0-beta7 milestone Jan 31, 2020
@ghost ghost locked as resolved and limited conversation to collaborators Jan 6, 2021