Update SpanBench test #10216

JosephTremoulet · 2017-03-16T00:58:34Z

Re-enable the tests that were disabled in e859c30.
Re-work how tests are invoked from the command line to require less
boilerplate
- Now each test has a single [Benchmark] entrypoint
- That entrypoint invokes its test's single inner-loop by wrapping it
  in a lambda that it passes to a new helper Invoke method (shared
  across all tests) which handles the xunit vs. command-line
  differences..
- Main finds the entrypoints by using reflection to search for the
  [Benchmark] attributes, so the explicit list of stringified test
  names is no longer needed and the command line will run the same
  set of tests that xunit-perf does.
The new SpanAPI tests now get invoked when this tests is run from the
command line as well.
Add [NoInlining] to the SpanAPI tests' kernels.
Add some heap writes and conditional writes to prevent deadcode
optimization from eliminating tests' kernels (assuming we don't do
interprocedural constant propagation or deadcode, or store sinking and
final value calculation, or PRE...).
Split the Index SpanAPI tests into one version that uses a
loop-invariant index and thus should get hoisted out of the loop, and
another version that uses variant indices.

JosephTremoulet · 2017-03-16T01:00:01Z

@AndyAyersMS @ahsonkhan PTAL. This seems to be exposing some bug (the checked jit asserts when compiling the tests that I'm un-commenting here), so not ready to merge until I track that down, but it's looking like a jit bug so I think this test change is complete enough for review.
/cc @dotnet/jit-contrib

JosephTremoulet · 2017-03-16T16:49:45Z

These lab failures are not what I expected:

08:22:46   Unhandled Exception: System.IO.FileNotFoundException: Could not load file or assembly 'System.Memory, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b03f5f7f11d50a3a'. The system cannot find the file specified.
08:22:46            at Span.SpanBench.TestSpanFill[T](T[] array, Int32 iterationCount)
08:22:46            at Span.SpanBench.<>c__DisplayClass74_0`1.<InvokeTestSpanFill>b__0(Int32 innerIterationCount) in D:\j\workspace\x86_checked_w---b3a226f6\tests\src\JIT\Performance\CodeQuality\Span\SpanBench.cs:line 905
08:22:46            at Span.SpanBench.Invoke(Action`1 innerLoop, String nameFormat, Object[] nameArgs) in D:\j\workspace\x86_checked_w---b3a226f6\tests\src\JIT\Performance\CodeQuality\Span\SpanBench.cs:line 81
08:22:46            at Span.SpanBench.InvokeTestSpanFill[T](Int32 length) in D:\j\workspace\x86_checked_w---b3a226f6\tests\src\JIT\Performance\CodeQuality\Span\SpanBench.cs:line 905
08:22:46            at Span.SpanBench.TestSpanFillByte(Int32 length) in D:\j\workspace\x86_checked_w---b3a226f6\tests\src\JIT\Performance\CodeQuality\Span\SpanBench.cs:line 888
08:22:46            at Span.SpanBench.Main(String[] args) in D:\j\workspace\x86_checked_w---b3a226f6\tests\src\JIT\Performance\CodeQuality\Span\SpanBench.cs:line 1130

I'd say the references are messed up, but somehow we're building fine and then failing to run... since the SpanAPI tests weren't previously run when invoked from the command line, it's possible it's been like this for a while but the tests just haven't been running. It runs when I build locally, though. I'm investigating, but if anybody has suggestions I'm all ears.

AndyAyersMS · 2017-03-16T17:18:21Z

I believe there is a separate place where test runtime dependencies are declared, though this area has been changing recently so my understanding may outdated.

You might see if tests/src/Common/test_dependencies/project.lock.json has been updated locally or needs to be updated (there maybe some explicit way to update this now).

@wtgodbe any ideas?

JosephTremoulet · 2017-03-16T17:49:21Z

I believe there is a separate place where test runtime dependencies are declared

Ah, thanks for the tip. I can indeed repro now with a fresh sync and a clean enlistment -- looks like GenerateLayout is no longer including a copy of System.Memory.dll. Must have been a recent change, since this worked for me a day or two ago...

JosephTremoulet · 2017-03-16T18:26:08Z

I tried adding "System.Memory": "4.4.0-beta-25114-01" (which is what we have in benchmarks/project.json) to test_dependencies. I have no idea if that's a reasonable thing to do, but it did change my failure mode from a MissingFileException to System.TypeLoadException: A value type containing a by-ref instance field, such as Span<T>, cannot be used as the type for a class instance field.. That's in fact a perfectly reasonable error with my change, so this feel like progress. I'll rewrite to avoid trying to declare spans as instance members of ref classes...

JosephTremoulet · 2017-03-16T19:26:38Z

I'll rewrite to avoid trying to declare spans as instance members of ref classes

That turns out to be harder than I thought. I got rid of the Sink<Span<T>> cases, but am still getting the same TypeLoadException. I think it's complaining about the span members of the display classes that were generated for my lambdas. This must be a known issue, right? How do people usually deal with the situation that they wish their lambda could capture a Span?

ahsonkhan · 2017-03-16T19:29:25Z

lambda could capture a Span?

Can a lambda capture a span (which is stack only)? cc @KrzysztofCwalina, @jkotas

ahsonkhan · 2017-03-16T19:33:49Z

tests/src/JIT/Performance/CodeQuality/Span/SpanBench.cs

+        #region TestFillAllSpan
+        [Benchmark(InnerIterationCount = FillAllIterations)]
+        [InlineData(DefaultLength)]
+        public static void FillAllSpan(int length)
        {


I think we can remove this test now since we have a Span.Fill method that does this, no? Or is this meant to test writing into a span as well using the indexer?

I think it's reasonable to expect people will keep writing open-coded loops like this and have them in the benchmark. I'm not wedded to the idea, though, if you feel strongly.

it's reasonable to expect people will keep writing open-coded loops

Agreed.

JosephTremoulet · 2017-03-16T19:36:55Z

It turned out to be trivial to move all the lambda-captured spans into the lamdba methods themselves in this case. I can now run the test locally again, and I've pushed the updates.

Still don't know if just adding System.Memory to test_dependencies like that is ok or not...

ahsonkhan · 2017-03-16T19:37:05Z

tests/src/JIT/Performance/CodeQuality/Span/SpanBench.cs

+            {
+                foreach (var iteration in Benchmark.Iterations)
+                    using (iteration.StartMeasurement())
+                        innerLoop((int)Benchmark.InnerIterationCount);


This is interesting. How would we run the benchmarks now? None of the tests with "Benchmark" attributes have iteration.StartMeasurement(). I thought the methods with Benchmark attributes are the entry point.

xunit-performance calls the [Benchmark]-attributed methods with logic similar to what's in Main now. Those methods now build a delegate and pass it to this method, which does iteration.StartMeasurement(). Note that the references to Benchmark in method bodies are referring to the Xunit.Benchmark class, not the BenchmarkAttribute that [Benchmark] instantiates; the static class fetches the data from the attributes (again, similar to what Main now does) and makes it available to the benchmark code via properties on the Benchmark class (much like what I'm doing with InnerIterationCount for the command-line case now).

jkotas · 2017-03-16T19:37:47Z

Can a lambda capture a span (which is stack only)?

No. Check dotnet/csharplang#264 for list of other limitations.

ahsonkhan · 2017-03-16T19:41:03Z

tests/src/JIT/Performance/CodeQuality/Span/SpanBench.cs

+        static void TestSpanIndexHoistable<T>(T[] array, int length, int iterationCount)
+        {
+            var sink = Sink<T>.Instance;
+            var span = new Span<T>(array);


Isn't this test also measuring the construction of Span from an array and not just indexer? When compared to the array indexer this would give worst results.

Right, when I wrapped the tests in lambdas to make them callable from the command line, I lost the ability to put setup code in the same method as the kernel but outside the measurement loop. So then there was a decision to be made about whether the spans created like this should be created inside the kernel method or created outside the loop and passed in. With the SpanAPI tests, I opted to create the spans as locals in the kernel methods; passing the spans in would make measuring on win-x64 skewed since it would see them as byrefs (pending fixing #10049), so I thought this would best mirror the codegen we want to be tracking. My expectation is (as you surmised in a later comment) that the loop's run-time will dwarf out the span creation time because it iterates so many times.

All that said, for IndexHoistable in particular, we'd hope to get to a point that the kernel loop doesn't even include the store... at that point I'd hope the running-time is effectively zero and the difference between creating the span here and not doing it in the array case would be too small to measure.

For any of these tests that you feel it's important, we could split them back out and have one method that gets invoked from Xunit (which could have a span local that gets initialized outside the timing loop) and another that gets invoked from the command line (which would look like this and we'd hope the codegen was similar enough). My inclination is not to do that.

Regarding the choice of passing the spans into the method or not, while I made the spans local to the kernel methods for the SpanAPI tests, I made them parameters in the other version. I just thought it would be good to have both approaches measured, and now hopefully the other tests will see a bump when #10049 gets implemented.

My inclination is not to do that.

I agree with this as well. As long as the benchmark results are reasonable (and comparisons valid), I see no reason to change the tests.

ahsonkhan · 2017-03-16T19:42:27Z

tests/src/JIT/Performance/CodeQuality/Span/SpanBench.cs

+        [MethodImpl(MethodImplOptions.NoInlining)]
+        static void TestArrayIndexHoistable<T>(T[] array, int length, int iterationCount)
+        {
+            var sink = Sink<T>.Instance;


Is the execution time of Sink<T>.Instance and Sink<Span<T>>.Instance similar and negligible?

I believe so. Any type initialization should get triggered during the warm-up runs, so then the measured code will just verify that the type is initialized and load the field.

ahsonkhan · 2017-03-16T19:49:46Z

tests/src/JIT/Performance/CodeQuality/Span/SpanBench.cs

+
+        static void InvokeTestSpanDangerousCreate<T>(int length)
+        {
+            TestClass<T> testClass = new TestClass<T>();


By doing this, how are we still testing the non-reference case of Span<byte>? Both Span<string> and Span<byte> go down this path.

TestSpanDangerousCreateByte calls InvokeTestSpanDangerousCreate<byte> calls Invoke calls the lambda call TestSpanDangerousCreate<byte> calls Span<byte>.DangerousCreate. Am I missing something?

Am I missing something?

Nope, I was confused.

Question: In the call graph you specified, at what point do we start measuring?

Inside Invoke, we start measuring just before invoking the delegate that's bound to the lambda body that calls TestSpanDangerousCreate<byte>, and stop measuring once that delegate call returns.

ahsonkhan · 2017-03-16T19:51:23Z

tests/src/JIT/Performance/CodeQuality/Span/SpanBench.cs

+        {
+            var sink = Sink<T>.Instance;
+            var span = new Span<T>(array);
+            int mask = (length < 2 ? 0 : (length < 8 ? 1 : 7));


How is the index variant for lengths >= 10 which is true for 3/4 test cases?

It follows the sequence 0,1,2,3,4,5,6,7,0,1,2,3,4,5,6,7,0,1,2,3,... By "variant" here I just meant "not the same on each iteration of the loop" so that the JIT can't hoist the span access out of the loop.

ahsonkhan · 2017-03-16T19:55:06Z

tests/src/JIT/Performance/CodeQuality/Span/SpanBench.cs

+        static void TestSpanClear<T>(T[] array, int iterationCount)
+        {
+            var span = new Span<T>(array);
+            for (int i = 0; i < iterationCount; i++)


Measures construction + clear. Is the hopes/goal that since we clear "iterationCount" times that it would dwarf the constructor time? If not, when comparing to Array.Clear, span has a disadvantage.

Yes, the loop will both iterate and zero out the whole span on each iteration; construction will just load the length out of the array.

ahsonkhan · 2017-03-16T19:56:41Z

tests/src/JIT/Performance/CodeQuality/Span/SpanBench.cs

+                // Under a condition that we know is false but the jit doesn't,
+                // add a read from 'byteSpan' to make sure it's not dead, and an assignment
+                // to 'span' so the AsBytes call won't get hoisted.
+                if (untrue) { sink.Data = byteSpan[0]; span = new Span<T>(); }


This is interesting. I get the read from 'byteSpan', but why do we need to assign to 'span'? If we don't, will the call to AsBytes gets hoisted outside of the for loop and will only run once? Is that what "hoisted" means?

I haven't checked if we currently hoist AsBytes, but yes it could and yes that's what "hoisted" means and yes that's why I included the assignment to span.

ahsonkhan · 2017-03-16T19:58:24Z

tests/src/JIT/Performance/CodeQuality/Span/SpanBench.cs

-
-        static void BubbleSortSpanBase()
+        [MethodImpl(MethodImplOptions.NoInlining)]
+        static void TestSpanSliceStringChar(string s, int iterationCount, bool untrue)


Propagate the function name change to methods that are called? i.e. Add 'Wrapper' to this method name?

In the other tests, the [Benchmark] method is suffixed with a type and the kernel method is not. Since the kernel method isn't generic in this case, there wasn't a type to suffix the [Benchmark] method with, so I added a "Wrapper" suffix to it. It's a bit easier to inspect jitting of the kernel methods if they aren't overloads, is why I bothered. That said, if you want them to have the same name, I'll change it, though wouldn't it be better to remove "Wrapper" from the other method?

It's a bit easier to inspect jitting of the kernel methods if they aren't overloads, is why I bothered.

Fair point. No change needed then.

ahsonkhan · 2017-03-16T19:59:10Z

tests/src/JIT/Performance/CodeQuality/Span/SpanBench.cs

-            "BubbleSortSpanBase", "BubbleSortArrayBase",
-            "QuickSortSpanBase", "QuickSortArrayBase"*/
-        );
+                BenchmarkAttribute benchAttr = m.GetCustomAttribute<BenchmarkAttribute>();


This is awesome :) Good to know.

ahsonkhan · 2017-03-16T20:04:09Z

Still don't know if just adding System.Memory to test_dependencies like that is ok or not...

For a standalone application, I had to include a PackageReference to System.Memory or else I would get compiler errors like this:
Program.cs(79,28): error CS0246: The type or namespace name 'Span<>' could not be found (are you missing a using directive or an assembly reference?) [C:\Users\ahkha\Documents\SpanDemo\FastSpan\spantest.csproj]

So, I would imagine this is necessary for the tests to run via command line. That would be my assumption. How come this was not necessary for when benchmarks run via xunit?

ahsonkhan · 2017-03-16T20:06:08Z

Split the Index SpanAPI tests into one version that uses a
loop-invariant index and thus should get hoisted out of the loop, and
another version that uses variant indices.

What is the motivation for this?

JosephTremoulet

So, I would imagine this is necessary for the tests to run via command line. That would be my assumption. How come this was not necessary for when benchmarks run via xunit?

If I back out the change to the test_dependencies file, then even running through xunit I get the MissingFileException. I have no idea why it's working in the lab. (nor do I have any idea what's the right way to resolve this)

Split the Index SpanAPI tests into one version that uses a
loop-invariant index and thus should get hoisted out of the loop, and
another version that uses variant indices.

What is the motivation for this?

In the first version of these tests I saw, the index varied from iteration to iteration of the loop, and so the index operation would remain in the loop and the iteration could would multiply its measurement. In the current version, the index is loop-invariant and so it would be legal to hoist the load out of the loop, at which point I'd expect the test to run fast enough that we can't really meaningfully measure it. So I could see the value both ways -- the Variant one would let us measure indexing, and the Hoistable one would let us track whether the optimizer can hoist the load out of the loop (with the caveat that we'd want to translate its measurement numbers into "effectively zero" and "effectively not zero"). Given the issues, I'm not wedded to keeping the Hoistable one if you think it would be better removed.

JosephTremoulet · 2017-03-16T20:16:17Z

tests/src/JIT/Performance/CodeQuality/Span/SpanBench.cs

+            {
+                foreach (var iteration in Benchmark.Iterations)
+                    using (iteration.StartMeasurement())
+                        innerLoop((int)Benchmark.InnerIterationCount);


xunit-performance calls the [Benchmark]-attributed methods with logic similar to what's in Main now. Those methods now build a delegate and pass it to this method, which does iteration.StartMeasurement(). Note that the references to Benchmark in method bodies are referring to the Xunit.Benchmark class, not the BenchmarkAttribute that [Benchmark] instantiates; the static class fetches the data from the attributes (again, similar to what Main now does) and makes it available to the benchmark code via properties on the Benchmark class (much like what I'm doing with InnerIterationCount for the command-line case now).

JosephTremoulet · 2017-03-16T20:16:55Z

tests/src/JIT/Performance/CodeQuality/Span/SpanBench.cs

+        #region TestFillAllSpan
+        [Benchmark(InnerIterationCount = FillAllIterations)]
+        [InlineData(DefaultLength)]
+        public static void FillAllSpan(int length)
        {


I think it's reasonable to expect people will keep writing open-coded loops like this and have them in the benchmark. I'm not wedded to the idea, though, if you feel strongly.

JosephTremoulet · 2017-03-16T20:19:05Z

tests/src/JIT/Performance/CodeQuality/Span/SpanBench.cs

+
+        static void InvokeTestSpanDangerousCreate<T>(int length)
+        {
+            TestClass<T> testClass = new TestClass<T>();


TestSpanDangerousCreateByte calls InvokeTestSpanDangerousCreate<byte> calls Invoke calls the lambda call TestSpanDangerousCreate<byte> calls Span<byte>.DangerousCreate. Am I missing something?

JosephTremoulet · 2017-03-16T20:28:55Z

tests/src/JIT/Performance/CodeQuality/Span/SpanBench.cs

+        static void TestSpanIndexHoistable<T>(T[] array, int length, int iterationCount)
+        {
+            var sink = Sink<T>.Instance;
+            var span = new Span<T>(array);


Right, when I wrapped the tests in lambdas to make them callable from the command line, I lost the ability to put setup code in the same method as the kernel but outside the measurement loop. So then there was a decision to be made about whether the spans created like this should be created inside the kernel method or created outside the loop and passed in. With the SpanAPI tests, I opted to create the spans as locals in the kernel methods; passing the spans in would make measuring on win-x64 skewed since it would see them as byrefs (pending fixing #10049), so I thought this would best mirror the codegen we want to be tracking. My expectation is (as you surmised in a later comment) that the loop's run-time will dwarf out the span creation time because it iterates so many times.

All that said, for IndexHoistable in particular, we'd hope to get to a point that the kernel loop doesn't even include the store... at that point I'd hope the running-time is effectively zero and the difference between creating the span here and not doing it in the array case would be too small to measure.

For any of these tests that you feel it's important, we could split them back out and have one method that gets invoked from Xunit (which could have a span local that gets initialized outside the timing loop) and another that gets invoked from the command line (which would look like this and we'd hope the codegen was similar enough). My inclination is not to do that.

Regarding the choice of passing the spans into the method or not, while I made the spans local to the kernel methods for the SpanAPI tests, I made them parameters in the other version. I just thought it would be good to have both approaches measured, and now hopefully the other tests will see a bump when #10049 gets implemented.

JosephTremoulet · 2017-03-16T20:32:15Z

tests/src/JIT/Performance/CodeQuality/Span/SpanBench.cs

+        [MethodImpl(MethodImplOptions.NoInlining)]
+        static void TestArrayIndexHoistable<T>(T[] array, int length, int iterationCount)
+        {
+            var sink = Sink<T>.Instance;


I believe so. Any type initialization should get triggered during the warm-up runs, so then the measured code will just verify that the type is initialized and load the field.

JosephTremoulet · 2017-03-16T20:33:27Z

tests/src/JIT/Performance/CodeQuality/Span/SpanBench.cs

+        {
+            var sink = Sink<T>.Instance;
+            var span = new Span<T>(array);
+            int mask = (length < 2 ? 0 : (length < 8 ? 1 : 7));


It follows the sequence 0,1,2,3,4,5,6,7,0,1,2,3,4,5,6,7,0,1,2,3,... By "variant" here I just meant "not the same on each iteration of the loop" so that the JIT can't hoist the span access out of the loop.

JosephTremoulet · 2017-03-16T20:35:25Z

tests/src/JIT/Performance/CodeQuality/Span/SpanBench.cs

+        static void TestSpanClear<T>(T[] array, int iterationCount)
+        {
+            var span = new Span<T>(array);
+            for (int i = 0; i < iterationCount; i++)


Yes, the loop will both iterate and zero out the whole span on each iteration; construction will just load the length out of the array.

JosephTremoulet · 2017-03-16T20:36:16Z

tests/src/JIT/Performance/CodeQuality/Span/SpanBench.cs

+                // Under a condition that we know is false but the jit doesn't,
+                // add a read from 'byteSpan' to make sure it's not dead, and an assignment
+                // to 'span' so the AsBytes call won't get hoisted.
+                if (untrue) { sink.Data = byteSpan[0]; span = new Span<T>(); }


I haven't checked if we currently hoist AsBytes, but yes it could and yes that's what "hoisted" means and yes that's why I included the assignment to span.

JosephTremoulet · 2017-03-16T20:40:25Z

tests/src/JIT/Performance/CodeQuality/Span/SpanBench.cs

-
-        static void BubbleSortSpanBase()
+        [MethodImpl(MethodImplOptions.NoInlining)]
+        static void TestSpanSliceStringChar(string s, int iterationCount, bool untrue)


In the other tests, the [Benchmark] method is suffixed with a type and the kernel method is not. Since the kernel method isn't generic in this case, there wasn't a type to suffix the [Benchmark] method with, so I added a "Wrapper" suffix to it. It's a bit easier to inspect jitting of the kernel methods if they aren't overloads, is why I bothered. That said, if you want them to have the same name, I'll change it, though wouldn't it be better to remove "Wrapper" from the other method?

JosephTremoulet · 2017-03-16T20:40:53Z

tests/src/JIT/Performance/CodeQuality/Span/SpanBench.cs

-            "BubbleSortSpanBase", "BubbleSortArrayBase",
-            "QuickSortSpanBase", "QuickSortArrayBase"*/
-        );
+                BenchmarkAttribute benchAttr = m.GetCustomAttribute<BenchmarkAttribute>();


ahsonkhan · 2017-03-16T20:52:41Z

Given the issues, I'm not wedded to keeping the Hoistable one if you think it would be better removed.

I am for keeping it. I just wanted to get an understanding of what insight we could get from that data point.

ahsonkhan · 2017-03-16T21:11:14Z

LGTM

ahsonkhan · 2017-03-17T01:58:54Z

tests/src/JIT/Performance/CodeQuality/Span/SpanBench.cs

-                using (iteration.StartMeasurement())
-                    for (int i = 0; i < Benchmark.InnerIterationCount; i++)
-                        span = new Span<byte>(array);
+            InvokeTestSpanConstructor<byte>(length);


I just learned (from @KrzysztofCwalina) how you could run benchmark tests with Type passed in as InlineData and avoid having multiple type specific definitions (like TestSpanConstructorByte and TestSpanConstructorString). In essence, have generic benchmark methods.

So, instead of having:

... public static void TestSpanConstructorByte(int length) ... public static void TestSpanConstructorString(int length) ...

You could do this:

[Benchmark(InnerIterationCount = 100000)] [InlineData(1, typeof(byte))] [InlineData(10, typeof(byte))] [InlineData(100, typeof(byte))] [InlineData(1000, typeof(byte))] [InlineData(1, typeof(string))] [InlineData(10, typeof(string))] [InlineData(100, typeof(string))] [InlineData(1000, typeof(string))] public static void TestSpanConstructor(int length, Type type) { MethodInfo method = typeof(MemoryTests).GetMethod("InvokeTestSpanConstructor").MakeGenericMethod(type); object[] parameters = { length }; method.Invoke(null, parameters); /*foreach (var iteration in Benchmark.Iterations) { using (iteration.StartMeasurement()) { for (int i = 0; i < Benchmark.InnerIterationCount; i++) method.Invoke(null, parameters); } }*/ } public static void InvokeTestSpanConstructor<T>(int length) { var array = new T[length]; Invoke((int innerIterationCount) => TestSpanConstructor<T>(array, innerIterationCount, false), "TestSpanConstructor<{0}>({1})", typeof(T).Name, length); }

@JosephTremoulet, do you think refactoring the tests this way is a good idea?

Edit: However, doing so will prevent us from passing in different InnerIterationCount for byte vs string (currently, since string is slow, we have BaseIterations / 100).

IMO the different overloads are easier to read since they don't need the reflection magic, and they also have the benefit you mentioned in your edit. I don't mind changing it if you disagree, though.

IMO the different overloads are easier to read since they don't need the reflection magic

Agreed, especially since we usually only have two types. Initially there were 4 so this change would have made more sense, then. I thought I would at least share the info since we both were curious on how to run benchmarks on generic types.

ahsonkhan · 2017-03-17T02:46:02Z

Edit: Ignore scenario B and the questions that come from the results of B. I made a typo in the code giving wrong results. InnerIterationCount was 0 instead of BaseIterations (i.e. millions).

I tried to compare the results of the following:
A:

[Benchmark(InnerIterationCount = BaseIterations)]
[InlineData(1)]
public static void TestSpanConstructorByte1(int length)
{
    InvokeTestSpanConstructor<byte>(length);
}

[Benchmark(InnerIterationCount = BaseIterations)]
[InlineData(1)]
public static void TestSpanConstructorString1(int length)
{
    InvokeTestSpanConstructor<string>(length);
}

public static void InvokeTestSpanConstructor<T>(int length)
{
    var array = new T[length];
    Invoke((int innerIterationCount) => TestSpanConstructor<T>(array, innerIterationCount, false), "TestSpanConstructor<{0}>({1})", typeof(T).Name, length);
}

/*[MethodImpl(MethodImplOptions.NoInlining)]*/
static void TestSpanConstructor<T>(T[] array, int iterationCount, bool untrue)
{
    var sink = Sink<T>.Instance;

    for (int i = 0; i < iterationCount; i++)
    {
        var span = new Span<T>(array);
        // Under a condition that we know is false but the jit doesn't,
        // add a read from 'span' to make sure it's not dead, and an assignment
        // to 'array' so the constructor call won't get hoisted.
        if (untrue) { sink.Data = span[0]; array = new T[iterationCount]; }
    }
}

B:

[Benchmark(InnerIterationCount = BaseIterations)]
[InlineData(1, false)]
public static void TestSpanConstructorByte(int length, bool untrue)
{
    var array = new byte[length];
    var sink = Sink<byte>.Instance;
    foreach (var iteration in Benchmark.Iterations)
        using (iteration.StartMeasurement())
            for (int i = 0; i < InnerIterationCount; i++)
            {
                var span = new Span<byte>(array);
                if (untrue) { sink.Data = span[0]; array = new byte[InnerIterationCount]; }
            }
}

[Benchmark(InnerIterationCount = BaseIterations)]
[InlineData(1, false)]
public static void TestSpanConstructorString(int length, bool untrue)
{
    var array = new string[length];
    var sink = Sink<string>.Instance;
    foreach (var iteration in Benchmark.Iterations)
        using (iteration.StartMeasurement())
            for (int i = 0; i < InnerIterationCount; i++)
            {
                var span = new Span<string>(array);
                if (untrue) { sink.Data = span[0]; array = new string[InnerIterationCount]; }
            }
}

C:

[Benchmark(InnerIterationCount = BaseIterations)]
[InlineData(1, typeof(byte))]
[InlineData(1, typeof(string))]
public static void TestSpanConstructor(int length, Type type)
{
    MethodInfo method = typeof(MemoryTests).GetMethod("InvokeTestSpanConstructor").MakeGenericMethod(type);
    object[] parameters = { length };
    method.Invoke(null, parameters);
}

Here are the results:

 Test Name                                                                             | Metric                | Iterations |    AVERAGE |    STDEV.S |        MIN |        MAX
:------------------------------------------------------------------------------------- |:--------------------- |:----------:| ----------:| ----------:| ----------:| ----------:
 System.Buffers.Tests.MemoryTests.TestSpanConstructor(length: 1, type: typeof(byte))   | Duration              |    101     |     48.790 |      7.526 |     44.325 |     78.395
 System.Buffers.Tests.MemoryTests.TestSpanConstructor(length: 1, type: typeof(string)) | Duration              |     20     |    540.724 |     63.090 |    480.255 |    684.930
 System.Buffers.Tests.MemoryTests.TestSpanConstructorByte1(length: 1)                  | Duration              |    101     |     49.833 |      9.072 |     44.213 |     88.616
 System.Buffers.Tests.MemoryTests.TestSpanConstructorString1(length: 1)                | Duration              |     21     |    528.542 |     53.536 |    477.277 |    692.126
 System.Buffers.Tests.MemoryTests.TestSpanConstructorByte(length: 1, untrue: False)    | Duration              |    101     |      0.000 |      0.000 |      0.000 |      0.002
 System.Buffers.Tests.MemoryTests.TestSpanConstructorString(length: 1, untrue: False)  | Duration              |    101     |      0.009 |      0.085 |      0.000 |      0.856

A and C yield very similar results (approximately 50ms for byte, and 540ms for string). However, I would imagine the "real" results come from B, i.e. essentially 0ms for byte and 0.009 ms for string (for the given iteration count). Given that, aren't we getting inaccurate data when we refactor the tests to go through the delegates and Invoke call?

How will we reason about such data? Is it just to highlight the proportional difference between byte/string and compare one run to another as code changes and see improvements? Does the "real" execution time not matter so much?

JosephTremoulet

I tried to compare the results of the following...
A and C yield very similar results (approximately 50ms for byte, and 540ms for string). However... B, [reports]. essentially 0ms for byte and 0.009 ms for string

Things I see that are different between what's being measured in A and C vs B:

A and C both include a delegate call (from Invoke to the lambda) and a direct call (from the lambda to TestSpanConstructor<T>. These are fixed costs (evidently < 50ms) that we should be able to drown out by making the iteration count high enough (e.g. target 1000ms).
A and C both have the static field load Sink<T>.Instance in the measured portion. That's also a fixed cost as above, though in this case we could move that out to InvokeTestSpanConstructor and pass sink in to TestSpanConstructor<T> as an argument the same way we pass in array... I guess I should probably do that, finding the address of the static field probably involves a dictionary lookup in the string case.
A and C both call the Span constructor from a generic method (TestSpanConstructor<T>), B from a nongeneric method (TestSpanConstructorString). I wonder if that makes A and C susceptible to #10031/#10051. If so, that could explain why they get such different measurements for string vs byte. You could rewrite B to call TestSpanConstructor<T> and see if the time for the string case suddenly jumps up (assuming you're trying this via xunit-perf, you could even move the timing loops into TestSpanConstructor<T> when trying that).

How will we reason about such data? Is it just to highlight the proportional difference between byte/string and compare one run to another as code changes and see improvements? Does the "real" execution time not matter so much?

I think (supposing my hypothesis about #10031 is true) that the extra costs in A/C are all small fixed costs that can be made irrelevant by selecting appropriate iteration counts.

JosephTremoulet · 2017-03-17T04:01:32Z

tests/src/JIT/Performance/CodeQuality/Span/SpanBench.cs

-                using (iteration.StartMeasurement())
-                    for (int i = 0; i < Benchmark.InnerIterationCount; i++)
-                        span = new Span<byte>(array);
+            InvokeTestSpanConstructor<byte>(length);


IMO the different overloads are easier to read since they don't need the reflection magic, and they also have the benefit you mentioned in your edit. I don't mind changing it if you disagree, though.

JosephTremoulet · 2017-03-17T04:06:35Z

tests/src/JIT/Performance/CodeQuality/Span/SpanBench.cs

+
+        static void InvokeTestSpanDangerousCreate<T>(int length)
+        {
+            TestClass<T> testClass = new TestClass<T>();


Inside Invoke, we start measuring just before invoking the delegate that's bound to the lambda body that calls TestSpanDangerousCreate<byte>, and stop measuring once that delegate call returns.

AndyAyersMS

Seems like this is a better way to handle xunit-perf/exe duality then what I've been doing elsewhere.

JosephTremoulet · 2017-03-17T16:21:58Z

@ahsonkhan, I'm having trouble reproducing your results. I verified that TestSpanConstructor<string> is running into #10031, but when I switched to a non-generic TestSpanConstructorString, even though we then get the inlining, it only made a small performance difference. From comparing the disassembly of the byte and nongeneric string versions, it would appear that this check (which the JIT eliminates in the byte case because the default(T) == null clause is false there) is probably causing the slowdown for the string version, but I can't figure out how you wouldn't be running into the same. Can you share the disassembly you're getting for your TestSpanConstructorString from case B?

FYI, I also tried moving the Sink<T>.Instance calls out of the kernel methods, and found the difference in runtime to be negligible (in fact, in the generic case with T=String, while getting the address of the field does indeed involve some dictionary lookup stuff, it actually made the running time slightly faster because the first part of that sequence is redundant with part of the type check sequence in the loop, so leaving Span<T>.Instance in the kernel method let us remove a load from the loop). I don't have much of a preference either way, so let me know if you want me to move those out.

AndyAyersMS · 2017-03-17T16:53:20Z

The jit code to optimize away the runtime type tests is fairly fragile, anything other than typeof(T) == typeof(..sometype..) is unlikely to be recognized today. You might open an issue and assign to me for this array type-check idiom; perhaps there is some way to clean this up.

JosephTremoulet · 2017-03-17T17:23:28Z

You might open an issue and assign to me for this array type-check idiom; perhaps there is some way to clean this up

I wouldn't expect us to be able to clean this one up; I think it is the moral equivalent of array store type checks, which is in the Span<T> constructor that takes a T[] because arrays are covariant (and have the checks at stelems) but Spans are not (and don't have the checks at stores).

Though as you get farther in propagating exact types, we could start applying that knowledge for this and array store checks when the array allocation is in scope. Do you want an issue for that?

AndyAyersMS · 2017-03-17T17:26:35Z

Maybe I am too pessimistic -- there are patterns in morph for the typeof(T) == obj.GetType() checks, we should at least see why they don't kick in here.

In a non-shared context we should be able to know the types well enough.

ahsonkhan · 2017-03-17T18:30:28Z

but I can't figure out how you wouldn't be running into the same. Can you share the disassembly you're getting for your TestSpanConstructorString from case B?

A and C both include a delegate call (from Invoke to the lambda) and a direct call (from the lambda to TestSpanConstructor. These are fixed costs (evidently < 50ms) that we should be able to drown out by making the iteration count high enough (e.g. target 1000ms).

I was actually testing/running against slow span which likely doesn't have a lot of the .net core 2.0 changes in the JIT/etc. I think you have addressed the main point of my comparison. I wasn't really showcasing why string or generic is so much larger but in general the test time is not measuring the real time of the API call.

Can you share the disassembly you're getting for your TestSpanConstructorString from case B?

I will need some guidance on getting the disassembly, we can chat offline about that.

ahsonkhan · 2017-03-17T18:35:41Z

FYI, I also tried moving the Sink<T>.Instance calls out of the kernel methods, and found the difference in runtime to be negligible ...
... so let me know if you want me to move those out.

No change needed.

ahsonkhan · 2017-03-17T18:38:57Z

(and have the checks at stelems)

What is stelems?

JosephTremoulet · 2017-03-17T18:51:19Z

What is stelems?

Store array element operations

ahsonkhan · 2017-03-17T23:13:06Z

Looking at scenario B, I must be doing something incorrect, because this line is wrong/won't compile:

for (int i = 0; i < InnerIterationCount; i++)

It should instead be:

for (int i = 0; i < Benchmark.InnerIterationCount; i++)

Fixing this, I cannot reproduce the weird numbers that I initially reported for B (and it makes sense, possibly I had a local variable InnerIterationCount=1 or something).

Edit: yep, I had static int InnerIterationCount = 0; and was not using the real Benchmark.InnerIterationCount.

I tried to run tests again for two scenarios and here is what I observed:
1a) xunit.perf (No lambda)
System.Buffers.Tests.MemoryTests.TestSpanConstructorByte(length: 1, untrue: False) - 29.768
System.Buffers.Tests.MemoryTests.TestSpanConstructorString(length: 1, untrue: False) - 429.595

1b) xunit.perf (lambda)
System.Buffers.Tests.MemoryTests.TestSpanConstructorByte(length: 1, untrue: False) - 8.690
System.Buffers.Tests.MemoryTests.TestSpanConstructorString(length: 1, untrue: False) - 437.956

1a) shows Span<byte> constructor take ~30ms while 1b) shows approximately 8-10ms. Span<string> constructor uniformly takes around 400-440ms. Why a 3x reduction in runtime when we use lambdas?

Regardless, this is not blocking, and we should merge this PR!

ahsonkhan · 2017-03-17T23:21:33Z

tests/src/JIT/Performance/CodeQuality/Span/SpanBench.cs

+        // Use statics to smuggle some information from Main to Invoke when running tests
+        // from the command line.
+        static bool IsXunitInvocation = true; // xunit-perf leaves this true; command line Main sets to false
+        static int InnerIterationCount = 0;   // used to communicate iteration count from BenchmarkAttribute


Suggestion: Rename this to something other than InnerIterationCount to avoid confusion between this and BenchmarkAttribute.InnerIterationCount. If code evolves, I don't want to accidentally run 0 iterations instead by using InnerIterationCount in the wrong places where BenchmarkAttribute.InnerIterationCount should have been used.

Maybe ExposedInnerIterationCount?

I could call it CommandLineIterationCount?

That works.

JosephTremoulet · 2017-03-18T02:17:46Z

Why a 3x reduction in runtime when we use lambdas?

I really don't know. Do you see that result repeated consistently? 8ms/20ms are really short intervals to try to measure accurately.

JosephTremoulet · 2017-03-18T02:20:20Z

Regardless, this is not blocking, and we should merge this PR!

Great, but I still don't know if it's ok to add System.Memory to tests/src/Common/test_dependencies/project.json like that. @wtgodbe, care to weigh in?

I also notice that the System.Memory references over in jit/config/benchmark get updated by @dotnet-bot when it updates references, and I don't know where the logic is that controls that to know if it would automatically start updating this reference in tandem or not. @stephentoub, maybe you know?

ahsonkhan · 2017-03-18T02:33:52Z

Do you see that result repeated consistently?

I tried a few times. I could try again with larger iterations.

jkotas · 2017-03-18T06:10:25Z

if it's ok to add System.Memory to tests/src/Common/test_dependencies/project.json

Yes, it is ok to add it.

I don't know where the logic is that controls

The auto update logic goes through all project.jsons and updates all version numbers it can find. It should kick in automatically for System.Memory, without any other special opt-in.

- Re-enable the tests that were disabled in e859c30. - Re-work how tests are invoked from the command line to require less boilerplate - Now each test has a single [Benchmark] entrypoint - That entrypoint invokes its test's single inner-loop by wrapping it in a lambda that it passes to a new helper Invoke method (shared across all tests) which handles the xunit vs. command-line differences.. - Main finds the entrypoints by using reflection to search for the [Benchmark] attributes, so the explicit list of stringified test names is no longer needed and the command line will run the same set of tests that xunit-perf does. - The new SpanAPI tests now get invoked when this tests is run from the command line as well. - Add [NoInlining] to the SpanAPI tests' kernels. - Add some heap writes and conditional writes to prevent deadcode optimization from eliminating tests' kernels (assuming we don't do interprocedural constant propagation or deadcode, or store sinking and final value calculation, or PRE...). - Split the Index SpanAPI tests into one version that uses a loop-invariant index and thus should get hoisted out of the loop, and another version that uses variant indices.

ahsonkhan · 2017-03-25T02:09:59Z

Part of dotnet/corefxlab#1314

JosephTremoulet requested review from ahsonkhan and AndyAyersMS March 16, 2017 00:58

dnfclas added the cla-already-signed label Mar 16, 2017

JosephTremoulet force-pushed the SpanBench branch from a557b48 to d90dfad Compare March 16, 2017 14:37

ahsonkhan reviewed Mar 16, 2017

View reviewed changes

JosephTremoulet commented Mar 16, 2017

View reviewed changes

ahsonkhan approved these changes Mar 16, 2017

View reviewed changes

ahsonkhan reviewed Mar 17, 2017

View reviewed changes

JosephTremoulet commented Mar 17, 2017

View reviewed changes

AndyAyersMS approved these changes Mar 17, 2017

View reviewed changes

ahsonkhan reviewed Mar 17, 2017

View reviewed changes

JosephTremoulet force-pushed the SpanBench branch from 198fa37 to 65226a2 Compare March 18, 2017 16:21

JosephTremoulet merged commit 455d57c into dotnet:master Mar 18, 2017

JosephTremoulet deleted the SpanBench branch March 18, 2017 18:06

karelz modified the milestone: 2.0.0 Aug 28, 2017

Update SpanBench test #10216

Update SpanBench test #10216

Conversation

JosephTremoulet commented Mar 16, 2017

JosephTremoulet commented Mar 16, 2017

JosephTremoulet commented Mar 16, 2017 • edited Loading

AndyAyersMS commented Mar 16, 2017

JosephTremoulet commented Mar 16, 2017

JosephTremoulet commented Mar 16, 2017

JosephTremoulet commented Mar 16, 2017

ahsonkhan commented Mar 16, 2017

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

JosephTremoulet commented Mar 16, 2017

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jkotas commented Mar 16, 2017

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

ahsonkhan Mar 16, 2017 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

ahsonkhan commented Mar 16, 2017

ahsonkhan commented Mar 16, 2017

JosephTremoulet left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

ahsonkhan commented Mar 16, 2017

ahsonkhan commented Mar 16, 2017

ahsonkhan Mar 17, 2017 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

ahsonkhan Mar 17, 2017 • edited Loading

Choose a reason for hiding this comment

ahsonkhan commented Mar 17, 2017 • edited Loading

JosephTremoulet left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

AndyAyersMS left a comment

Choose a reason for hiding this comment

JosephTremoulet commented Mar 17, 2017

AndyAyersMS commented Mar 17, 2017

JosephTremoulet commented Mar 17, 2017

AndyAyersMS commented Mar 17, 2017

ahsonkhan commented Mar 17, 2017 • edited Loading

ahsonkhan commented Mar 17, 2017

ahsonkhan commented Mar 17, 2017

JosephTremoulet commented Mar 17, 2017

ahsonkhan commented Mar 17, 2017 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

JosephTremoulet commented Mar 18, 2017

JosephTremoulet commented Mar 16, 2017 •

edited

Loading

ahsonkhan Mar 16, 2017 •

edited

Loading

ahsonkhan Mar 17, 2017 •

edited

Loading

ahsonkhan Mar 17, 2017 •

edited

Loading

ahsonkhan commented Mar 17, 2017 •

edited

Loading

ahsonkhan commented Mar 17, 2017 •

edited

Loading

ahsonkhan commented Mar 17, 2017 •

edited

Loading