Skip to content
This repository has been archived by the owner on Jan 23, 2023. It is now read-only.

Update SpanBench test #10216

Merged
merged 1 commit into from
Mar 18, 2017
Merged

Conversation

JosephTremoulet
Copy link

  • Re-enable the tests that were disabled in e859c30.
  • Re-work how tests are invoked from the command line to require less
    boilerplate
    • Now each test has a single [Benchmark] entrypoint
    • That entrypoint invokes its test's single inner-loop by wrapping it
      in a lambda that it passes to a new helper Invoke method (shared
      across all tests) which handles the xunit vs. command-line
      differences..
    • Main finds the entrypoints by using reflection to search for the
      [Benchmark] attributes, so the explicit list of stringified test
      names is no longer needed and the command line will run the same
      set of tests that xunit-perf does.
  • The new SpanAPI tests now get invoked when this tests is run from the
    command line as well.
  • Add [NoInlining] to the SpanAPI tests' kernels.
  • Add some heap writes and conditional writes to prevent deadcode
    optimization from eliminating tests' kernels (assuming we don't do
    interprocedural constant propagation or deadcode, or store sinking and
    final value calculation, or PRE...).
  • Split the Index SpanAPI tests into one version that uses a
    loop-invariant index and thus should get hoisted out of the loop, and
    another version that uses variant indices.

@JosephTremoulet
Copy link
Author

@AndyAyersMS @ahsonkhan PTAL. This seems to be exposing some bug (the checked jit asserts when compiling the tests that I'm un-commenting here), so not ready to merge until I track that down, but it's looking like a jit bug so I think this test change is complete enough for review.
/cc @dotnet/jit-contrib

@JosephTremoulet
Copy link
Author

JosephTremoulet commented Mar 16, 2017

These lab failures are not what I expected:

08:22:46   Unhandled Exception: System.IO.FileNotFoundException: Could not load file or assembly 'System.Memory, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b03f5f7f11d50a3a'. The system cannot find the file specified.
08:22:46            at Span.SpanBench.TestSpanFill[T](T[] array, Int32 iterationCount)
08:22:46            at Span.SpanBench.<>c__DisplayClass74_0`1.<InvokeTestSpanFill>b__0(Int32 innerIterationCount) in D:\j\workspace\x86_checked_w---b3a226f6\tests\src\JIT\Performance\CodeQuality\Span\SpanBench.cs:line 905
08:22:46            at Span.SpanBench.Invoke(Action`1 innerLoop, String nameFormat, Object[] nameArgs) in D:\j\workspace\x86_checked_w---b3a226f6\tests\src\JIT\Performance\CodeQuality\Span\SpanBench.cs:line 81
08:22:46            at Span.SpanBench.InvokeTestSpanFill[T](Int32 length) in D:\j\workspace\x86_checked_w---b3a226f6\tests\src\JIT\Performance\CodeQuality\Span\SpanBench.cs:line 905
08:22:46            at Span.SpanBench.TestSpanFillByte(Int32 length) in D:\j\workspace\x86_checked_w---b3a226f6\tests\src\JIT\Performance\CodeQuality\Span\SpanBench.cs:line 888
08:22:46            at Span.SpanBench.Main(String[] args) in D:\j\workspace\x86_checked_w---b3a226f6\tests\src\JIT\Performance\CodeQuality\Span\SpanBench.cs:line 1130

I'd say the references are messed up, but somehow we're building fine and then failing to run... since the SpanAPI tests weren't previously run when invoked from the command line, it's possible it's been like this for a while but the tests just haven't been running. It runs when I build locally, though. I'm investigating, but if anybody has suggestions I'm all ears.

@AndyAyersMS
Copy link
Member

I believe there is a separate place where test runtime dependencies are declared, though this area has been changing recently so my understanding may outdated.

You might see if tests/src/Common/test_dependencies/project.lock.json has been updated locally or needs to be updated (there maybe some explicit way to update this now).

@wtgodbe any ideas?

@JosephTremoulet
Copy link
Author

I believe there is a separate place where test runtime dependencies are declared

Ah, thanks for the tip. I can indeed repro now with a fresh sync and a clean enlistment -- looks like GenerateLayout is no longer including a copy of System.Memory.dll. Must have been a recent change, since this worked for me a day or two ago...

@JosephTremoulet
Copy link
Author

I tried adding "System.Memory": "4.4.0-beta-25114-01" (which is what we have in benchmarks/project.json) to test_dependencies. I have no idea if that's a reasonable thing to do, but it did change my failure mode from a MissingFileException to System.TypeLoadException: A value type containing a by-ref instance field, such as Span<T>, cannot be used as the type for a class instance field.. That's in fact a perfectly reasonable error with my change, so this feel like progress. I'll rewrite to avoid trying to declare spans as instance members of ref classes...

@JosephTremoulet
Copy link
Author

I'll rewrite to avoid trying to declare spans as instance members of ref classes

That turns out to be harder than I thought. I got rid of the Sink<Span<T>> cases, but am still getting the same TypeLoadException. I think it's complaining about the span members of the display classes that were generated for my lambdas. This must be a known issue, right? How do people usually deal with the situation that they wish their lambda could capture a Span?

@ahsonkhan
Copy link
Member

lambda could capture a Span?

Can a lambda capture a span (which is stack only)? cc @KrzysztofCwalina, @jkotas

#region TestFillAllSpan
[Benchmark(InnerIterationCount = FillAllIterations)]
[InlineData(DefaultLength)]
public static void FillAllSpan(int length)
{
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can remove this test now since we have a Span.Fill method that does this, no? Or is this meant to test writing into a span as well using the indexer?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's reasonable to expect people will keep writing open-coded loops like this and have them in the benchmark. I'm not wedded to the idea, though, if you feel strongly.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's reasonable to expect people will keep writing open-coded loops

Agreed.

@JosephTremoulet
Copy link
Author

It turned out to be trivial to move all the lambda-captured spans into the lamdba methods themselves in this case. I can now run the test locally again, and I've pushed the updates.

Still don't know if just adding System.Memory to test_dependencies like that is ok or not...

{
foreach (var iteration in Benchmark.Iterations)
using (iteration.StartMeasurement())
innerLoop((int)Benchmark.InnerIterationCount);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is interesting. How would we run the benchmarks now? None of the tests with "Benchmark" attributes have iteration.StartMeasurement(). I thought the methods with Benchmark attributes are the entry point.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

xunit-performance calls the [Benchmark]-attributed methods with logic similar to what's in Main now. Those methods now build a delegate and pass it to this method, which does iteration.StartMeasurement(). Note that the references to Benchmark in method bodies are referring to the Xunit.Benchmark class, not the BenchmarkAttribute that [Benchmark] instantiates; the static class fetches the data from the attributes (again, similar to what Main now does) and makes it available to the benchmark code via properties on the Benchmark class (much like what I'm doing with InnerIterationCount for the command-line case now).

@jkotas
Copy link
Member

jkotas commented Mar 16, 2017

Can a lambda capture a span (which is stack only)?

No. Check dotnet/csharplang#264 for list of other limitations.

static void TestSpanIndexHoistable<T>(T[] array, int length, int iterationCount)
{
var sink = Sink<T>.Instance;
var span = new Span<T>(array);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't this test also measuring the construction of Span from an array and not just indexer? When compared to the array indexer this would give worst results.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, when I wrapped the tests in lambdas to make them callable from the command line, I lost the ability to put setup code in the same method as the kernel but outside the measurement loop. So then there was a decision to be made about whether the spans created like this should be created inside the kernel method or created outside the loop and passed in. With the SpanAPI tests, I opted to create the spans as locals in the kernel methods; passing the spans in would make measuring on win-x64 skewed since it would see them as byrefs (pending fixing #10049), so I thought this would best mirror the codegen we want to be tracking. My expectation is (as you surmised in a later comment) that the loop's run-time will dwarf out the span creation time because it iterates so many times.

All that said, for IndexHoistable in particular, we'd hope to get to a point that the kernel loop doesn't even include the store... at that point I'd hope the running-time is effectively zero and the difference between creating the span here and not doing it in the array case would be too small to measure.

For any of these tests that you feel it's important, we could split them back out and have one method that gets invoked from Xunit (which could have a span local that gets initialized outside the timing loop) and another that gets invoked from the command line (which would look like this and we'd hope the codegen was similar enough). My inclination is not to do that.

Regarding the choice of passing the spans into the method or not, while I made the spans local to the kernel methods for the SpanAPI tests, I made them parameters in the other version. I just thought it would be good to have both approaches measured, and now hopefully the other tests will see a bump when #10049 gets implemented.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My inclination is not to do that.

I agree with this as well. As long as the benchmark results are reasonable (and comparisons valid), I see no reason to change the tests.

[MethodImpl(MethodImplOptions.NoInlining)]
static void TestArrayIndexHoistable<T>(T[] array, int length, int iterationCount)
{
var sink = Sink<T>.Instance;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the execution time of Sink<T>.Instance and Sink<Span<T>>.Instance similar and negligible?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe so. Any type initialization should get triggered during the warm-up runs, so then the measured code will just verify that the type is initialized and load the field.


static void InvokeTestSpanDangerousCreate<T>(int length)
{
TestClass<T> testClass = new TestClass<T>();
Copy link
Member

@ahsonkhan ahsonkhan Mar 16, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By doing this, how are we still testing the non-reference case of Span<byte>? Both Span<string> and Span<byte> go down this path.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TestSpanDangerousCreateByte calls InvokeTestSpanDangerousCreate<byte> calls Invoke calls the lambda call TestSpanDangerousCreate<byte> calls Span<byte>.DangerousCreate. Am I missing something?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Am I missing something?

Nope, I was confused.

Question: In the call graph you specified, at what point do we start measuring?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Inside Invoke, we start measuring just before invoking the delegate that's bound to the lambda body that calls TestSpanDangerousCreate<byte>, and stop measuring once that delegate call returns.

{
var sink = Sink<T>.Instance;
var span = new Span<T>(array);
int mask = (length < 2 ? 0 : (length < 8 ? 1 : 7));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How is the index variant for lengths >= 10 which is true for 3/4 test cases?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It follows the sequence 0,1,2,3,4,5,6,7,0,1,2,3,4,5,6,7,0,1,2,3,... By "variant" here I just meant "not the same on each iteration of the loop" so that the JIT can't hoist the span access out of the loop.

static void TestSpanClear<T>(T[] array, int iterationCount)
{
var span = new Span<T>(array);
for (int i = 0; i < iterationCount; i++)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Measures construction + clear. Is the hopes/goal that since we clear "iterationCount" times that it would dwarf the constructor time? If not, when comparing to Array.Clear, span has a disadvantage.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, the loop will both iterate and zero out the whole span on each iteration; construction will just load the length out of the array.

// Under a condition that we know is false but the jit doesn't,
// add a read from 'byteSpan' to make sure it's not dead, and an assignment
// to 'span' so the AsBytes call won't get hoisted.
if (untrue) { sink.Data = byteSpan[0]; span = new Span<T>(); }
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is interesting. I get the read from 'byteSpan', but why do we need to assign to 'span'? If we don't, will the call to AsBytes gets hoisted outside of the for loop and will only run once? Is that what "hoisted" means?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't checked if we currently hoist AsBytes, but yes it could and yes that's what "hoisted" means and yes that's why I included the assignment to span.


static void BubbleSortSpanBase()
[MethodImpl(MethodImplOptions.NoInlining)]
static void TestSpanSliceStringChar(string s, int iterationCount, bool untrue)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Propagate the function name change to methods that are called? i.e. Add 'Wrapper' to this method name?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the other tests, the [Benchmark] method is suffixed with a type and the kernel method is not. Since the kernel method isn't generic in this case, there wasn't a type to suffix the [Benchmark] method with, so I added a "Wrapper" suffix to it. It's a bit easier to inspect jitting of the kernel methods if they aren't overloads, is why I bothered. That said, if you want them to have the same name, I'll change it, though wouldn't it be better to remove "Wrapper" from the other method?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a bit easier to inspect jitting of the kernel methods if they aren't overloads, is why I bothered.

Fair point. No change needed then.

"BubbleSortSpanBase", "BubbleSortArrayBase",
"QuickSortSpanBase", "QuickSortArrayBase"*/
);
BenchmarkAttribute benchAttr = m.GetCustomAttribute<BenchmarkAttribute>();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is awesome :) Good to know.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:)

@ahsonkhan
Copy link
Member

Still don't know if just adding System.Memory to test_dependencies like that is ok or not...

For a standalone application, I had to include a PackageReference to System.Memory or else I would get compiler errors like this:
Program.cs(79,28): error CS0246: The type or namespace name 'Span<>' could not be found (are you missing a using directive or an assembly reference?) [C:\Users\ahkha\Documents\SpanDemo\FastSpan\spantest.csproj]

So, I would imagine this is necessary for the tests to run via command line. That would be my assumption. How come this was not necessary for when benchmarks run via xunit?

@ahsonkhan
Copy link
Member

Split the Index SpanAPI tests into one version that uses a
loop-invariant index and thus should get hoisted out of the loop, and
another version that uses variant indices.

What is the motivation for this?

Copy link
Author

@JosephTremoulet JosephTremoulet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, I would imagine this is necessary for the tests to run via command line. That would be my assumption. How come this was not necessary for when benchmarks run via xunit?

If I back out the change to the test_dependencies file, then even running through xunit I get the MissingFileException. I have no idea why it's working in the lab. (nor do I have any idea what's the right way to resolve this)

Split the Index SpanAPI tests into one version that uses a
loop-invariant index and thus should get hoisted out of the loop, and
another version that uses variant indices.

What is the motivation for this?

In the first version of these tests I saw, the index varied from iteration to iteration of the loop, and so the index operation would remain in the loop and the iteration could would multiply its measurement. In the current version, the index is loop-invariant and so it would be legal to hoist the load out of the loop, at which point I'd expect the test to run fast enough that we can't really meaningfully measure it. So I could see the value both ways -- the Variant one would let us measure indexing, and the Hoistable one would let us track whether the optimizer can hoist the load out of the loop (with the caveat that we'd want to translate its measurement numbers into "effectively zero" and "effectively not zero"). Given the issues, I'm not wedded to keeping the Hoistable one if you think it would be better removed.

{
foreach (var iteration in Benchmark.Iterations)
using (iteration.StartMeasurement())
innerLoop((int)Benchmark.InnerIterationCount);
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

xunit-performance calls the [Benchmark]-attributed methods with logic similar to what's in Main now. Those methods now build a delegate and pass it to this method, which does iteration.StartMeasurement(). Note that the references to Benchmark in method bodies are referring to the Xunit.Benchmark class, not the BenchmarkAttribute that [Benchmark] instantiates; the static class fetches the data from the attributes (again, similar to what Main now does) and makes it available to the benchmark code via properties on the Benchmark class (much like what I'm doing with InnerIterationCount for the command-line case now).

#region TestFillAllSpan
[Benchmark(InnerIterationCount = FillAllIterations)]
[InlineData(DefaultLength)]
public static void FillAllSpan(int length)
{
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's reasonable to expect people will keep writing open-coded loops like this and have them in the benchmark. I'm not wedded to the idea, though, if you feel strongly.


static void InvokeTestSpanDangerousCreate<T>(int length)
{
TestClass<T> testClass = new TestClass<T>();
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TestSpanDangerousCreateByte calls InvokeTestSpanDangerousCreate<byte> calls Invoke calls the lambda call TestSpanDangerousCreate<byte> calls Span<byte>.DangerousCreate. Am I missing something?

static void TestSpanIndexHoistable<T>(T[] array, int length, int iterationCount)
{
var sink = Sink<T>.Instance;
var span = new Span<T>(array);
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, when I wrapped the tests in lambdas to make them callable from the command line, I lost the ability to put setup code in the same method as the kernel but outside the measurement loop. So then there was a decision to be made about whether the spans created like this should be created inside the kernel method or created outside the loop and passed in. With the SpanAPI tests, I opted to create the spans as locals in the kernel methods; passing the spans in would make measuring on win-x64 skewed since it would see them as byrefs (pending fixing #10049), so I thought this would best mirror the codegen we want to be tracking. My expectation is (as you surmised in a later comment) that the loop's run-time will dwarf out the span creation time because it iterates so many times.

All that said, for IndexHoistable in particular, we'd hope to get to a point that the kernel loop doesn't even include the store... at that point I'd hope the running-time is effectively zero and the difference between creating the span here and not doing it in the array case would be too small to measure.

For any of these tests that you feel it's important, we could split them back out and have one method that gets invoked from Xunit (which could have a span local that gets initialized outside the timing loop) and another that gets invoked from the command line (which would look like this and we'd hope the codegen was similar enough). My inclination is not to do that.

Regarding the choice of passing the spans into the method or not, while I made the spans local to the kernel methods for the SpanAPI tests, I made them parameters in the other version. I just thought it would be good to have both approaches measured, and now hopefully the other tests will see a bump when #10049 gets implemented.

[MethodImpl(MethodImplOptions.NoInlining)]
static void TestArrayIndexHoistable<T>(T[] array, int length, int iterationCount)
{
var sink = Sink<T>.Instance;
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe so. Any type initialization should get triggered during the warm-up runs, so then the measured code will just verify that the type is initialized and load the field.

{
var sink = Sink<T>.Instance;
var span = new Span<T>(array);
int mask = (length < 2 ? 0 : (length < 8 ? 1 : 7));
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It follows the sequence 0,1,2,3,4,5,6,7,0,1,2,3,4,5,6,7,0,1,2,3,... By "variant" here I just meant "not the same on each iteration of the loop" so that the JIT can't hoist the span access out of the loop.

static void TestSpanClear<T>(T[] array, int iterationCount)
{
var span = new Span<T>(array);
for (int i = 0; i < iterationCount; i++)
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, the loop will both iterate and zero out the whole span on each iteration; construction will just load the length out of the array.

// Under a condition that we know is false but the jit doesn't,
// add a read from 'byteSpan' to make sure it's not dead, and an assignment
// to 'span' so the AsBytes call won't get hoisted.
if (untrue) { sink.Data = byteSpan[0]; span = new Span<T>(); }
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't checked if we currently hoist AsBytes, but yes it could and yes that's what "hoisted" means and yes that's why I included the assignment to span.


static void BubbleSortSpanBase()
[MethodImpl(MethodImplOptions.NoInlining)]
static void TestSpanSliceStringChar(string s, int iterationCount, bool untrue)
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the other tests, the [Benchmark] method is suffixed with a type and the kernel method is not. Since the kernel method isn't generic in this case, there wasn't a type to suffix the [Benchmark] method with, so I added a "Wrapper" suffix to it. It's a bit easier to inspect jitting of the kernel methods if they aren't overloads, is why I bothered. That said, if you want them to have the same name, I'll change it, though wouldn't it be better to remove "Wrapper" from the other method?

"BubbleSortSpanBase", "BubbleSortArrayBase",
"QuickSortSpanBase", "QuickSortArrayBase"*/
);
BenchmarkAttribute benchAttr = m.GetCustomAttribute<BenchmarkAttribute>();
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:)

@ahsonkhan
Copy link
Member

Given the issues, I'm not wedded to keeping the Hoistable one if you think it would be better removed.

I am for keeping it. I just wanted to get an understanding of what insight we could get from that data point.

@ahsonkhan
Copy link
Member

LGTM

using (iteration.StartMeasurement())
for (int i = 0; i < Benchmark.InnerIterationCount; i++)
span = new Span<byte>(array);
InvokeTestSpanConstructor<byte>(length);
Copy link
Member

@ahsonkhan ahsonkhan Mar 17, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just learned (from @KrzysztofCwalina) how you could run benchmark tests with Type passed in as InlineData and avoid having multiple type specific definitions (like TestSpanConstructorByte and TestSpanConstructorString). In essence, have generic benchmark methods.

So, instead of having:

...
public static void TestSpanConstructorByte(int length)
...
public static void TestSpanConstructorString(int length)
...

You could do this:

[Benchmark(InnerIterationCount = 100000)]
[InlineData(1, typeof(byte))]
[InlineData(10, typeof(byte))]
[InlineData(100, typeof(byte))]
[InlineData(1000, typeof(byte))]
[InlineData(1, typeof(string))]
[InlineData(10, typeof(string))]
[InlineData(100, typeof(string))]
[InlineData(1000, typeof(string))]
public static void TestSpanConstructor(int length, Type type)
{
    MethodInfo method = typeof(MemoryTests).GetMethod("InvokeTestSpanConstructor").MakeGenericMethod(type);
    object[] parameters = { length };
    method.Invoke(null, parameters);
    /*foreach (var iteration in Benchmark.Iterations)
    {
        using (iteration.StartMeasurement())
        {
            for (int i = 0; i < Benchmark.InnerIterationCount; i++)
                method.Invoke(null, parameters);
        }
    }*/
}
public static void InvokeTestSpanConstructor<T>(int length)
{
    var array = new T[length];
    Invoke((int innerIterationCount) => TestSpanConstructor<T>(array, innerIterationCount, false),
"TestSpanConstructor<{0}>({1})", typeof(T).Name, length);
}

@JosephTremoulet, do you think refactoring the tests this way is a good idea?

Edit: However, doing so will prevent us from passing in different InnerIterationCount for byte vs string (currently, since string is slow, we have BaseIterations / 100).

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO the different overloads are easier to read since they don't need the reflection magic, and they also have the benefit you mentioned in your edit. I don't mind changing it if you disagree, though.

Copy link
Member

@ahsonkhan ahsonkhan Mar 17, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO the different overloads are easier to read since they don't need the reflection magic

Agreed, especially since we usually only have two types. Initially there were 4 so this change would have made more sense, then. I thought I would at least share the info since we both were curious on how to run benchmarks on generic types.

@ahsonkhan
Copy link
Member

ahsonkhan commented Mar 17, 2017

Edit: Ignore scenario B and the questions that come from the results of B. I made a typo in the code giving wrong results. InnerIterationCount was 0 instead of BaseIterations (i.e. millions).

I tried to compare the results of the following:
A:

[Benchmark(InnerIterationCount = BaseIterations)]
[InlineData(1)]
public static void TestSpanConstructorByte1(int length)
{
    InvokeTestSpanConstructor<byte>(length);
}

[Benchmark(InnerIterationCount = BaseIterations)]
[InlineData(1)]
public static void TestSpanConstructorString1(int length)
{
    InvokeTestSpanConstructor<string>(length);
}

public static void InvokeTestSpanConstructor<T>(int length)
{
    var array = new T[length];
    Invoke((int innerIterationCount) => TestSpanConstructor<T>(array, innerIterationCount, false), "TestSpanConstructor<{0}>({1})", typeof(T).Name, length);
}

/*[MethodImpl(MethodImplOptions.NoInlining)]*/
static void TestSpanConstructor<T>(T[] array, int iterationCount, bool untrue)
{
    var sink = Sink<T>.Instance;

    for (int i = 0; i < iterationCount; i++)
    {
        var span = new Span<T>(array);
        // Under a condition that we know is false but the jit doesn't,
        // add a read from 'span' to make sure it's not dead, and an assignment
        // to 'array' so the constructor call won't get hoisted.
        if (untrue) { sink.Data = span[0]; array = new T[iterationCount]; }
    }
}

B:

[Benchmark(InnerIterationCount = BaseIterations)]
[InlineData(1, false)]
public static void TestSpanConstructorByte(int length, bool untrue)
{
    var array = new byte[length];
    var sink = Sink<byte>.Instance;
    foreach (var iteration in Benchmark.Iterations)
        using (iteration.StartMeasurement())
            for (int i = 0; i < InnerIterationCount; i++)
            {
                var span = new Span<byte>(array);
                if (untrue) { sink.Data = span[0]; array = new byte[InnerIterationCount]; }
            }
}

[Benchmark(InnerIterationCount = BaseIterations)]
[InlineData(1, false)]
public static void TestSpanConstructorString(int length, bool untrue)
{
    var array = new string[length];
    var sink = Sink<string>.Instance;
    foreach (var iteration in Benchmark.Iterations)
        using (iteration.StartMeasurement())
            for (int i = 0; i < InnerIterationCount; i++)
            {
                var span = new Span<string>(array);
                if (untrue) { sink.Data = span[0]; array = new string[InnerIterationCount]; }
            }
}

C:

[Benchmark(InnerIterationCount = BaseIterations)]
[InlineData(1, typeof(byte))]
[InlineData(1, typeof(string))]
public static void TestSpanConstructor(int length, Type type)
{
    MethodInfo method = typeof(MemoryTests).GetMethod("InvokeTestSpanConstructor").MakeGenericMethod(type);
    object[] parameters = { length };
    method.Invoke(null, parameters);
}

Here are the results:

 Test Name                                                                             | Metric                | Iterations |    AVERAGE |    STDEV.S |        MIN |        MAX
:------------------------------------------------------------------------------------- |:--------------------- |:----------:| ----------:| ----------:| ----------:| ----------:
 System.Buffers.Tests.MemoryTests.TestSpanConstructor(length: 1, type: typeof(byte))   | Duration              |    101     |     48.790 |      7.526 |     44.325 |     78.395
 System.Buffers.Tests.MemoryTests.TestSpanConstructor(length: 1, type: typeof(string)) | Duration              |     20     |    540.724 |     63.090 |    480.255 |    684.930
 System.Buffers.Tests.MemoryTests.TestSpanConstructorByte1(length: 1)                  | Duration              |    101     |     49.833 |      9.072 |     44.213 |     88.616
 System.Buffers.Tests.MemoryTests.TestSpanConstructorString1(length: 1)                | Duration              |     21     |    528.542 |     53.536 |    477.277 |    692.126
 System.Buffers.Tests.MemoryTests.TestSpanConstructorByte(length: 1, untrue: False)    | Duration              |    101     |      0.000 |      0.000 |      0.000 |      0.002
 System.Buffers.Tests.MemoryTests.TestSpanConstructorString(length: 1, untrue: False)  | Duration              |    101     |      0.009 |      0.085 |      0.000 |      0.856

A and C yield very similar results (approximately 50ms for byte, and 540ms for string). However, I would imagine the "real" results come from B, i.e. essentially 0ms for byte and 0.009 ms for string (for the given iteration count). Given that, aren't we getting inaccurate data when we refactor the tests to go through the delegates and Invoke call?

How will we reason about such data? Is it just to highlight the proportional difference between byte/string and compare one run to another as code changes and see improvements? Does the "real" execution time not matter so much?

Copy link
Author

@JosephTremoulet JosephTremoulet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried to compare the results of the following...
A and C yield very similar results (approximately 50ms for byte, and 540ms for string). However... B, [reports]. essentially 0ms for byte and 0.009 ms for string

Things I see that are different between what's being measured in A and C vs B:

  • A and C both include a delegate call (from Invoke to the lambda) and a direct call (from the lambda to TestSpanConstructor<T>. These are fixed costs (evidently < 50ms) that we should be able to drown out by making the iteration count high enough (e.g. target 1000ms).
  • A and C both have the static field load Sink<T>.Instance in the measured portion. That's also a fixed cost as above, though in this case we could move that out to InvokeTestSpanConstructor and pass sink in to TestSpanConstructor<T> as an argument the same way we pass in array... I guess I should probably do that, finding the address of the static field probably involves a dictionary lookup in the string case.
  • A and C both call the Span constructor from a generic method (TestSpanConstructor<T>), B from a nongeneric method (TestSpanConstructorString). I wonder if that makes A and C susceptible to #10031/#10051. If so, that could explain why they get such different measurements for string vs byte. You could rewrite B to call TestSpanConstructor<T> and see if the time for the string case suddenly jumps up (assuming you're trying this via xunit-perf, you could even move the timing loops into TestSpanConstructor<T> when trying that).

How will we reason about such data? Is it just to highlight the proportional difference between byte/string and compare one run to another as code changes and see improvements? Does the "real" execution time not matter so much?

I think (supposing my hypothesis about #10031 is true) that the extra costs in A/C are all small fixed costs that can be made irrelevant by selecting appropriate iteration counts.

using (iteration.StartMeasurement())
for (int i = 0; i < Benchmark.InnerIterationCount; i++)
span = new Span<byte>(array);
InvokeTestSpanConstructor<byte>(length);
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO the different overloads are easier to read since they don't need the reflection magic, and they also have the benefit you mentioned in your edit. I don't mind changing it if you disagree, though.


static void InvokeTestSpanDangerousCreate<T>(int length)
{
TestClass<T> testClass = new TestClass<T>();
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Inside Invoke, we start measuring just before invoking the delegate that's bound to the lambda body that calls TestSpanDangerousCreate<byte>, and stop measuring once that delegate call returns.

Copy link
Member

@AndyAyersMS AndyAyersMS left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like this is a better way to handle xunit-perf/exe duality then what I've been doing elsewhere.

@JosephTremoulet
Copy link
Author

@ahsonkhan, I'm having trouble reproducing your results. I verified that TestSpanConstructor<string> is running into #10031, but when I switched to a non-generic TestSpanConstructorString, even though we then get the inlining, it only made a small performance difference. From comparing the disassembly of the byte and nongeneric string versions, it would appear that this check (which the JIT eliminates in the byte case because the default(T) == null clause is false there) is probably causing the slowdown for the string version, but I can't figure out how you wouldn't be running into the same. Can you share the disassembly you're getting for your TestSpanConstructorString from case B?

FYI, I also tried moving the Sink<T>.Instance calls out of the kernel methods, and found the difference in runtime to be negligible (in fact, in the generic case with T=String, while getting the address of the field does indeed involve some dictionary lookup stuff, it actually made the running time slightly faster because the first part of that sequence is redundant with part of the type check sequence in the loop, so leaving Span<T>.Instance in the kernel method let us remove a load from the loop). I don't have much of a preference either way, so let me know if you want me to move those out.

@AndyAyersMS
Copy link
Member

The jit code to optimize away the runtime type tests is fairly fragile, anything other than typeof(T) == typeof(..sometype..) is unlikely to be recognized today. You might open an issue and assign to me for this array type-check idiom; perhaps there is some way to clean this up.

@JosephTremoulet
Copy link
Author

You might open an issue and assign to me for this array type-check idiom; perhaps there is some way to clean this up

I wouldn't expect us to be able to clean this one up; I think it is the moral equivalent of array store type checks, which is in the Span<T> constructor that takes a T[] because arrays are covariant (and have the checks at stelems) but Spans are not (and don't have the checks at stores).

Though as you get farther in propagating exact types, we could start applying that knowledge for this and array store checks when the array allocation is in scope. Do you want an issue for that?

@AndyAyersMS
Copy link
Member

Maybe I am too pessimistic -- there are patterns in morph for the typeof(T) == obj.GetType() checks, we should at least see why they don't kick in here.

In a non-shared context we should be able to know the types well enough.

@ahsonkhan
Copy link
Member

ahsonkhan commented Mar 17, 2017

but I can't figure out how you wouldn't be running into the same. Can you share the disassembly you're getting for your TestSpanConstructorString from case B?

A and C both include a delegate call (from Invoke to the lambda) and a direct call (from the lambda to TestSpanConstructor. These are fixed costs (evidently < 50ms) that we should be able to drown out by making the iteration count high enough (e.g. target 1000ms).

I was actually testing/running against slow span which likely doesn't have a lot of the .net core 2.0 changes in the JIT/etc. I think you have addressed the main point of my comparison. I wasn't really showcasing why string or generic is so much larger but in general the test time is not measuring the real time of the API call.

Can you share the disassembly you're getting for your TestSpanConstructorString from case B?

I will need some guidance on getting the disassembly, we can chat offline about that.

@ahsonkhan
Copy link
Member

FYI, I also tried moving the Sink<T>.Instance calls out of the kernel methods, and found the difference in runtime to be negligible ...
... so let me know if you want me to move those out.

No change needed.

@ahsonkhan
Copy link
Member

(and have the checks at stelems)

What is stelems?

@JosephTremoulet
Copy link
Author

What is stelems?

Store array element operations

@ahsonkhan
Copy link
Member

ahsonkhan commented Mar 17, 2017

Looking at scenario B, I must be doing something incorrect, because this line is wrong/won't compile:

for (int i = 0; i < InnerIterationCount; i++)

It should instead be:

for (int i = 0; i < Benchmark.InnerIterationCount; i++)

Fixing this, I cannot reproduce the weird numbers that I initially reported for B (and it makes sense, possibly I had a local variable InnerIterationCount=1 or something).

Edit: yep, I had static int InnerIterationCount = 0; and was not using the real Benchmark.InnerIterationCount.

I tried to run tests again for two scenarios and here is what I observed:
1a) xunit.perf (No lambda)
System.Buffers.Tests.MemoryTests.TestSpanConstructorByte(length: 1, untrue: False) - 29.768
System.Buffers.Tests.MemoryTests.TestSpanConstructorString(length: 1, untrue: False) - 429.595

1b) xunit.perf (lambda)
System.Buffers.Tests.MemoryTests.TestSpanConstructorByte(length: 1, untrue: False) - 8.690
System.Buffers.Tests.MemoryTests.TestSpanConstructorString(length: 1, untrue: False) - 437.956

1a) shows Span<byte> constructor take ~30ms while 1b) shows approximately 8-10ms. Span<string> constructor uniformly takes around 400-440ms. Why a 3x reduction in runtime when we use lambdas?

Regardless, this is not blocking, and we should merge this PR!

// Use statics to smuggle some information from Main to Invoke when running tests
// from the command line.
static bool IsXunitInvocation = true; // xunit-perf leaves this true; command line Main sets to false
static int InnerIterationCount = 0; // used to communicate iteration count from BenchmarkAttribute
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggestion: Rename this to something other than InnerIterationCount to avoid confusion between this and BenchmarkAttribute.InnerIterationCount. If code evolves, I don't want to accidentally run 0 iterations instead by using InnerIterationCount in the wrong places where BenchmarkAttribute.InnerIterationCount should have been used.

Maybe ExposedInnerIterationCount?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I could call it CommandLineIterationCount?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That works.

@JosephTremoulet
Copy link
Author

Why a 3x reduction in runtime when we use lambdas?

I really don't know. Do you see that result repeated consistently? 8ms/20ms are really short intervals to try to measure accurately.

@JosephTremoulet
Copy link
Author

Regardless, this is not blocking, and we should merge this PR!

Great, but I still don't know if it's ok to add System.Memory to tests/src/Common/test_dependencies/project.json like that. @wtgodbe, care to weigh in?

I also notice that the System.Memory references over in jit/config/benchmark get updated by @dotnet-bot when it updates references, and I don't know where the logic is that controls that to know if it would automatically start updating this reference in tandem or not. @stephentoub, maybe you know?

@ahsonkhan
Copy link
Member

Do you see that result repeated consistently?

I tried a few times. I could try again with larger iterations.

@jkotas
Copy link
Member

jkotas commented Mar 18, 2017

if it's ok to add System.Memory to tests/src/Common/test_dependencies/project.json

Yes, it is ok to add it.

I don't know where the logic is that controls

The auto update logic goes through all project.jsons and updates all version numbers it can find. It should kick in automatically for System.Memory, without any other special opt-in.

 - Re-enable the tests that were disabled in e859c30.
 - Re-work how tests are invoked from the command line to require less
   boilerplate
     - Now each test has a single [Benchmark] entrypoint
     - That entrypoint invokes its test's single inner-loop by wrapping it
       in a lambda that it passes to a new helper Invoke method (shared
       across all tests) which handles the xunit vs. command-line
       differences..
     - Main finds the entrypoints by using reflection to search for the
       [Benchmark] attributes, so the explicit list of stringified test
       names is no longer needed and the command line will run the same
       set of tests that xunit-perf does.
 - The new SpanAPI tests now get invoked when this tests is run from the
   command line as well.
 - Add [NoInlining] to the SpanAPI tests' kernels.
 - Add some heap writes and conditional writes to prevent deadcode
   optimization from eliminating tests' kernels (assuming we don't do
   interprocedural constant propagation or deadcode, or store sinking and
   final value calculation, or PRE...).
 - Split the Index SpanAPI tests into one version that uses a
   loop-invariant index and thus should get hoisted out of the loop, and
   another version that uses variant indices.
@JosephTremoulet JosephTremoulet merged commit 455d57c into dotnet:master Mar 18, 2017
@JosephTremoulet JosephTremoulet deleted the SpanBench branch March 18, 2017 18:06
@ahsonkhan
Copy link
Member

Part of dotnet/corefxlab#1314

@karelz karelz modified the milestone: 2.0.0 Aug 28, 2017
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants