-
Notifications
You must be signed in to change notification settings - Fork 2.7k
Conversation
@AndyAyersMS @ahsonkhan PTAL. This seems to be exposing some bug (the checked jit asserts when compiling the tests that I'm un-commenting here), so not ready to merge until I track that down, but it's looking like a jit bug so I think this test change is complete enough for review. |
a557b48
to
d90dfad
Compare
These lab failures are not what I expected:
I'd say the references are messed up, but somehow we're building fine and then failing to run... since the SpanAPI tests weren't previously run when invoked from the command line, it's possible it's been like this for a while but the tests just haven't been running. It runs when I build locally, though. I'm investigating, but if anybody has suggestions I'm all ears. |
I believe there is a separate place where test runtime dependencies are declared, though this area has been changing recently so my understanding may outdated. You might see if tests/src/Common/test_dependencies/project.lock.json has been updated locally or needs to be updated (there maybe some explicit way to update this now). @wtgodbe any ideas? |
Ah, thanks for the tip. I can indeed repro now with a fresh sync and a clean enlistment -- looks like GenerateLayout is no longer including a copy of System.Memory.dll. Must have been a recent change, since this worked for me a day or two ago... |
I tried adding |
That turns out to be harder than I thought. I got rid of the |
Can a lambda capture a span (which is stack only)? cc @KrzysztofCwalina, @jkotas |
#region TestFillAllSpan | ||
[Benchmark(InnerIterationCount = FillAllIterations)] | ||
[InlineData(DefaultLength)] | ||
public static void FillAllSpan(int length) | ||
{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can remove this test now since we have a Span.Fill method that does this, no? Or is this meant to test writing into a span as well using the indexer?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's reasonable to expect people will keep writing open-coded loops like this and have them in the benchmark. I'm not wedded to the idea, though, if you feel strongly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it's reasonable to expect people will keep writing open-coded loops
Agreed.
It turned out to be trivial to move all the lambda-captured spans into the lamdba methods themselves in this case. I can now run the test locally again, and I've pushed the updates. Still don't know if just adding |
{ | ||
foreach (var iteration in Benchmark.Iterations) | ||
using (iteration.StartMeasurement()) | ||
innerLoop((int)Benchmark.InnerIterationCount); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is interesting. How would we run the benchmarks now? None of the tests with "Benchmark" attributes have iteration.StartMeasurement(). I thought the methods with Benchmark attributes are the entry point.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
xunit-performance calls the [Benchmark]-attributed methods with logic similar to what's in Main now. Those methods now build a delegate and pass it to this method, which does iteration.StartMeasurement(). Note that the references to Benchmark
in method bodies are referring to the Xunit.Benchmark
class, not the BenchmarkAttribute
that [Benchmark]
instantiates; the static class fetches the data from the attributes (again, similar to what Main now does) and makes it available to the benchmark code via properties on the Benchmark
class (much like what I'm doing with InnerIterationCount
for the command-line case now).
No. Check dotnet/csharplang#264 for list of other limitations. |
static void TestSpanIndexHoistable<T>(T[] array, int length, int iterationCount) | ||
{ | ||
var sink = Sink<T>.Instance; | ||
var span = new Span<T>(array); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't this test also measuring the construction of Span from an array and not just indexer? When compared to the array indexer this would give worst results.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, when I wrapped the tests in lambdas to make them callable from the command line, I lost the ability to put setup code in the same method as the kernel but outside the measurement loop. So then there was a decision to be made about whether the spans created like this should be created inside the kernel method or created outside the loop and passed in. With the SpanAPI tests, I opted to create the spans as locals in the kernel methods; passing the spans in would make measuring on win-x64 skewed since it would see them as byrefs (pending fixing #10049), so I thought this would best mirror the codegen we want to be tracking. My expectation is (as you surmised in a later comment) that the loop's run-time will dwarf out the span creation time because it iterates so many times.
All that said, for IndexHoistable in particular, we'd hope to get to a point that the kernel loop doesn't even include the store... at that point I'd hope the running-time is effectively zero and the difference between creating the span here and not doing it in the array case would be too small to measure.
For any of these tests that you feel it's important, we could split them back out and have one method that gets invoked from Xunit (which could have a span local that gets initialized outside the timing loop) and another that gets invoked from the command line (which would look like this and we'd hope the codegen was similar enough). My inclination is not to do that.
Regarding the choice of passing the spans into the method or not, while I made the spans local to the kernel methods for the SpanAPI tests, I made them parameters in the other version. I just thought it would be good to have both approaches measured, and now hopefully the other tests will see a bump when #10049 gets implemented.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My inclination is not to do that.
I agree with this as well. As long as the benchmark results are reasonable (and comparisons valid), I see no reason to change the tests.
[MethodImpl(MethodImplOptions.NoInlining)] | ||
static void TestArrayIndexHoistable<T>(T[] array, int length, int iterationCount) | ||
{ | ||
var sink = Sink<T>.Instance; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the execution time of Sink<T>.Instance
and Sink<Span<T>>.Instance
similar and negligible?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe so. Any type initialization should get triggered during the warm-up runs, so then the measured code will just verify that the type is initialized and load the field.
|
||
static void InvokeTestSpanDangerousCreate<T>(int length) | ||
{ | ||
TestClass<T> testClass = new TestClass<T>(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
By doing this, how are we still testing the non-reference case of Span<byte>
? Both Span<string>
and Span<byte>
go down this path.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TestSpanDangerousCreateByte
calls InvokeTestSpanDangerousCreate<byte>
calls Invoke
calls the lambda call TestSpanDangerousCreate<byte>
calls Span<byte>.DangerousCreate
. Am I missing something?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Am I missing something?
Nope, I was confused.
Question: In the call graph you specified, at what point do we start measuring?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Inside Invoke
, we start measuring just before invoking the delegate that's bound to the lambda body that calls TestSpanDangerousCreate<byte>
, and stop measuring once that delegate call returns.
{ | ||
var sink = Sink<T>.Instance; | ||
var span = new Span<T>(array); | ||
int mask = (length < 2 ? 0 : (length < 8 ? 1 : 7)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How is the index variant for lengths >= 10 which is true for 3/4 test cases?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It follows the sequence 0,1,2,3,4,5,6,7,0,1,2,3,4,5,6,7,0,1,2,3,... By "variant" here I just meant "not the same on each iteration of the loop" so that the JIT can't hoist the span access out of the loop.
static void TestSpanClear<T>(T[] array, int iterationCount) | ||
{ | ||
var span = new Span<T>(array); | ||
for (int i = 0; i < iterationCount; i++) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Measures construction + clear. Is the hopes/goal that since we clear "iterationCount" times that it would dwarf the constructor time? If not, when comparing to Array.Clear, span has a disadvantage.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, the loop will both iterate and zero out the whole span on each iteration; construction will just load the length out of the array.
// Under a condition that we know is false but the jit doesn't, | ||
// add a read from 'byteSpan' to make sure it's not dead, and an assignment | ||
// to 'span' so the AsBytes call won't get hoisted. | ||
if (untrue) { sink.Data = byteSpan[0]; span = new Span<T>(); } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is interesting. I get the read from 'byteSpan', but why do we need to assign to 'span'? If we don't, will the call to AsBytes gets hoisted outside of the for loop and will only run once? Is that what "hoisted" means?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I haven't checked if we currently hoist AsBytes, but yes it could and yes that's what "hoisted" means and yes that's why I included the assignment to span.
|
||
static void BubbleSortSpanBase() | ||
[MethodImpl(MethodImplOptions.NoInlining)] | ||
static void TestSpanSliceStringChar(string s, int iterationCount, bool untrue) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Propagate the function name change to methods that are called? i.e. Add 'Wrapper' to this method name?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the other tests, the [Benchmark] method is suffixed with a type and the kernel method is not. Since the kernel method isn't generic in this case, there wasn't a type to suffix the [Benchmark] method with, so I added a "Wrapper" suffix to it. It's a bit easier to inspect jitting of the kernel methods if they aren't overloads, is why I bothered. That said, if you want them to have the same name, I'll change it, though wouldn't it be better to remove "Wrapper" from the other method?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's a bit easier to inspect jitting of the kernel methods if they aren't overloads, is why I bothered.
Fair point. No change needed then.
"BubbleSortSpanBase", "BubbleSortArrayBase", | ||
"QuickSortSpanBase", "QuickSortArrayBase"*/ | ||
); | ||
BenchmarkAttribute benchAttr = m.GetCustomAttribute<BenchmarkAttribute>(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is awesome :) Good to know.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
:)
For a standalone application, I had to include a PackageReference to System.Memory or else I would get compiler errors like this: So, I would imagine this is necessary for the tests to run via command line. That would be my assumption. How come this was not necessary for when benchmarks run via xunit? |
What is the motivation for this? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So, I would imagine this is necessary for the tests to run via command line. That would be my assumption. How come this was not necessary for when benchmarks run via xunit?
If I back out the change to the test_dependencies file, then even running through xunit I get the MissingFileException. I have no idea why it's working in the lab. (nor do I have any idea what's the right way to resolve this)
Split the Index SpanAPI tests into one version that uses a
loop-invariant index and thus should get hoisted out of the loop, and
another version that uses variant indices.
What is the motivation for this?
In the first version of these tests I saw, the index varied from iteration to iteration of the loop, and so the index operation would remain in the loop and the iteration could would multiply its measurement. In the current version, the index is loop-invariant and so it would be legal to hoist the load out of the loop, at which point I'd expect the test to run fast enough that we can't really meaningfully measure it. So I could see the value both ways -- the Variant one would let us measure indexing, and the Hoistable one would let us track whether the optimizer can hoist the load out of the loop (with the caveat that we'd want to translate its measurement numbers into "effectively zero" and "effectively not zero"). Given the issues, I'm not wedded to keeping the Hoistable one if you think it would be better removed.
{ | ||
foreach (var iteration in Benchmark.Iterations) | ||
using (iteration.StartMeasurement()) | ||
innerLoop((int)Benchmark.InnerIterationCount); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
xunit-performance calls the [Benchmark]-attributed methods with logic similar to what's in Main now. Those methods now build a delegate and pass it to this method, which does iteration.StartMeasurement(). Note that the references to Benchmark
in method bodies are referring to the Xunit.Benchmark
class, not the BenchmarkAttribute
that [Benchmark]
instantiates; the static class fetches the data from the attributes (again, similar to what Main now does) and makes it available to the benchmark code via properties on the Benchmark
class (much like what I'm doing with InnerIterationCount
for the command-line case now).
#region TestFillAllSpan | ||
[Benchmark(InnerIterationCount = FillAllIterations)] | ||
[InlineData(DefaultLength)] | ||
public static void FillAllSpan(int length) | ||
{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's reasonable to expect people will keep writing open-coded loops like this and have them in the benchmark. I'm not wedded to the idea, though, if you feel strongly.
|
||
static void InvokeTestSpanDangerousCreate<T>(int length) | ||
{ | ||
TestClass<T> testClass = new TestClass<T>(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TestSpanDangerousCreateByte
calls InvokeTestSpanDangerousCreate<byte>
calls Invoke
calls the lambda call TestSpanDangerousCreate<byte>
calls Span<byte>.DangerousCreate
. Am I missing something?
static void TestSpanIndexHoistable<T>(T[] array, int length, int iterationCount) | ||
{ | ||
var sink = Sink<T>.Instance; | ||
var span = new Span<T>(array); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, when I wrapped the tests in lambdas to make them callable from the command line, I lost the ability to put setup code in the same method as the kernel but outside the measurement loop. So then there was a decision to be made about whether the spans created like this should be created inside the kernel method or created outside the loop and passed in. With the SpanAPI tests, I opted to create the spans as locals in the kernel methods; passing the spans in would make measuring on win-x64 skewed since it would see them as byrefs (pending fixing #10049), so I thought this would best mirror the codegen we want to be tracking. My expectation is (as you surmised in a later comment) that the loop's run-time will dwarf out the span creation time because it iterates so many times.
All that said, for IndexHoistable in particular, we'd hope to get to a point that the kernel loop doesn't even include the store... at that point I'd hope the running-time is effectively zero and the difference between creating the span here and not doing it in the array case would be too small to measure.
For any of these tests that you feel it's important, we could split them back out and have one method that gets invoked from Xunit (which could have a span local that gets initialized outside the timing loop) and another that gets invoked from the command line (which would look like this and we'd hope the codegen was similar enough). My inclination is not to do that.
Regarding the choice of passing the spans into the method or not, while I made the spans local to the kernel methods for the SpanAPI tests, I made them parameters in the other version. I just thought it would be good to have both approaches measured, and now hopefully the other tests will see a bump when #10049 gets implemented.
[MethodImpl(MethodImplOptions.NoInlining)] | ||
static void TestArrayIndexHoistable<T>(T[] array, int length, int iterationCount) | ||
{ | ||
var sink = Sink<T>.Instance; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe so. Any type initialization should get triggered during the warm-up runs, so then the measured code will just verify that the type is initialized and load the field.
{ | ||
var sink = Sink<T>.Instance; | ||
var span = new Span<T>(array); | ||
int mask = (length < 2 ? 0 : (length < 8 ? 1 : 7)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It follows the sequence 0,1,2,3,4,5,6,7,0,1,2,3,4,5,6,7,0,1,2,3,... By "variant" here I just meant "not the same on each iteration of the loop" so that the JIT can't hoist the span access out of the loop.
static void TestSpanClear<T>(T[] array, int iterationCount) | ||
{ | ||
var span = new Span<T>(array); | ||
for (int i = 0; i < iterationCount; i++) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, the loop will both iterate and zero out the whole span on each iteration; construction will just load the length out of the array.
// Under a condition that we know is false but the jit doesn't, | ||
// add a read from 'byteSpan' to make sure it's not dead, and an assignment | ||
// to 'span' so the AsBytes call won't get hoisted. | ||
if (untrue) { sink.Data = byteSpan[0]; span = new Span<T>(); } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I haven't checked if we currently hoist AsBytes, but yes it could and yes that's what "hoisted" means and yes that's why I included the assignment to span.
|
||
static void BubbleSortSpanBase() | ||
[MethodImpl(MethodImplOptions.NoInlining)] | ||
static void TestSpanSliceStringChar(string s, int iterationCount, bool untrue) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the other tests, the [Benchmark] method is suffixed with a type and the kernel method is not. Since the kernel method isn't generic in this case, there wasn't a type to suffix the [Benchmark] method with, so I added a "Wrapper" suffix to it. It's a bit easier to inspect jitting of the kernel methods if they aren't overloads, is why I bothered. That said, if you want them to have the same name, I'll change it, though wouldn't it be better to remove "Wrapper" from the other method?
"BubbleSortSpanBase", "BubbleSortArrayBase", | ||
"QuickSortSpanBase", "QuickSortArrayBase"*/ | ||
); | ||
BenchmarkAttribute benchAttr = m.GetCustomAttribute<BenchmarkAttribute>(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
:)
I am for keeping it. I just wanted to get an understanding of what insight we could get from that data point. |
LGTM |
using (iteration.StartMeasurement()) | ||
for (int i = 0; i < Benchmark.InnerIterationCount; i++) | ||
span = new Span<byte>(array); | ||
InvokeTestSpanConstructor<byte>(length); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just learned (from @KrzysztofCwalina) how you could run benchmark tests with Type passed in as InlineData and avoid having multiple type specific definitions (like TestSpanConstructorByte and TestSpanConstructorString). In essence, have generic benchmark methods.
So, instead of having:
...
public static void TestSpanConstructorByte(int length)
...
public static void TestSpanConstructorString(int length)
...
You could do this:
[Benchmark(InnerIterationCount = 100000)]
[InlineData(1, typeof(byte))]
[InlineData(10, typeof(byte))]
[InlineData(100, typeof(byte))]
[InlineData(1000, typeof(byte))]
[InlineData(1, typeof(string))]
[InlineData(10, typeof(string))]
[InlineData(100, typeof(string))]
[InlineData(1000, typeof(string))]
public static void TestSpanConstructor(int length, Type type)
{
MethodInfo method = typeof(MemoryTests).GetMethod("InvokeTestSpanConstructor").MakeGenericMethod(type);
object[] parameters = { length };
method.Invoke(null, parameters);
/*foreach (var iteration in Benchmark.Iterations)
{
using (iteration.StartMeasurement())
{
for (int i = 0; i < Benchmark.InnerIterationCount; i++)
method.Invoke(null, parameters);
}
}*/
}
public static void InvokeTestSpanConstructor<T>(int length)
{
var array = new T[length];
Invoke((int innerIterationCount) => TestSpanConstructor<T>(array, innerIterationCount, false),
"TestSpanConstructor<{0}>({1})", typeof(T).Name, length);
}
@JosephTremoulet, do you think refactoring the tests this way is a good idea?
Edit: However, doing so will prevent us from passing in different InnerIterationCount for byte vs string (currently, since string is slow, we have BaseIterations / 100).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMO the different overloads are easier to read since they don't need the reflection magic, and they also have the benefit you mentioned in your edit. I don't mind changing it if you disagree, though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMO the different overloads are easier to read since they don't need the reflection magic
Agreed, especially since we usually only have two types. Initially there were 4 so this change would have made more sense, then. I thought I would at least share the info since we both were curious on how to run benchmarks on generic types.
Edit: Ignore scenario B and the questions that come from the results of B. I made a typo in the code giving wrong results. InnerIterationCount was 0 instead of BaseIterations (i.e. millions). I tried to compare the results of the following: [Benchmark(InnerIterationCount = BaseIterations)]
[InlineData(1)]
public static void TestSpanConstructorByte1(int length)
{
InvokeTestSpanConstructor<byte>(length);
}
[Benchmark(InnerIterationCount = BaseIterations)]
[InlineData(1)]
public static void TestSpanConstructorString1(int length)
{
InvokeTestSpanConstructor<string>(length);
}
public static void InvokeTestSpanConstructor<T>(int length)
{
var array = new T[length];
Invoke((int innerIterationCount) => TestSpanConstructor<T>(array, innerIterationCount, false), "TestSpanConstructor<{0}>({1})", typeof(T).Name, length);
}
/*[MethodImpl(MethodImplOptions.NoInlining)]*/
static void TestSpanConstructor<T>(T[] array, int iterationCount, bool untrue)
{
var sink = Sink<T>.Instance;
for (int i = 0; i < iterationCount; i++)
{
var span = new Span<T>(array);
// Under a condition that we know is false but the jit doesn't,
// add a read from 'span' to make sure it's not dead, and an assignment
// to 'array' so the constructor call won't get hoisted.
if (untrue) { sink.Data = span[0]; array = new T[iterationCount]; }
}
} B: [Benchmark(InnerIterationCount = BaseIterations)]
[InlineData(1, false)]
public static void TestSpanConstructorByte(int length, bool untrue)
{
var array = new byte[length];
var sink = Sink<byte>.Instance;
foreach (var iteration in Benchmark.Iterations)
using (iteration.StartMeasurement())
for (int i = 0; i < InnerIterationCount; i++)
{
var span = new Span<byte>(array);
if (untrue) { sink.Data = span[0]; array = new byte[InnerIterationCount]; }
}
}
[Benchmark(InnerIterationCount = BaseIterations)]
[InlineData(1, false)]
public static void TestSpanConstructorString(int length, bool untrue)
{
var array = new string[length];
var sink = Sink<string>.Instance;
foreach (var iteration in Benchmark.Iterations)
using (iteration.StartMeasurement())
for (int i = 0; i < InnerIterationCount; i++)
{
var span = new Span<string>(array);
if (untrue) { sink.Data = span[0]; array = new string[InnerIterationCount]; }
}
} C: [Benchmark(InnerIterationCount = BaseIterations)]
[InlineData(1, typeof(byte))]
[InlineData(1, typeof(string))]
public static void TestSpanConstructor(int length, Type type)
{
MethodInfo method = typeof(MemoryTests).GetMethod("InvokeTestSpanConstructor").MakeGenericMethod(type);
object[] parameters = { length };
method.Invoke(null, parameters);
} Here are the results:
A and C yield very similar results (approximately 50ms for byte, and 540ms for string). However, I would imagine the "real" results come from B, i.e. essentially 0ms for byte and 0.009 ms for string (for the given iteration count). Given that, aren't we getting inaccurate data when we refactor the tests to go through the delegates and Invoke call? How will we reason about such data? Is it just to highlight the proportional difference between byte/string and compare one run to another as code changes and see improvements? Does the "real" execution time not matter so much? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried to compare the results of the following...
A and C yield very similar results (approximately 50ms for byte, and 540ms for string). However... B, [reports]. essentially 0ms for byte and 0.009 ms for string
Things I see that are different between what's being measured in A and C vs B:
- A and C both include a delegate call (from
Invoke
to the lambda) and a direct call (from the lambda toTestSpanConstructor<T>
. These are fixed costs (evidently < 50ms) that we should be able to drown out by making the iteration count high enough (e.g. target 1000ms). - A and C both have the static field load
Sink<T>.Instance
in the measured portion. That's also a fixed cost as above, though in this case we could move that out toInvokeTestSpanConstructor
and passsink
in toTestSpanConstructor<T>
as an argument the same way we pass inarray
... I guess I should probably do that, finding the address of the static field probably involves a dictionary lookup in the string case. - A and C both call the Span constructor from a generic method (
TestSpanConstructor<T>
), B from a nongeneric method (TestSpanConstructorString
). I wonder if that makes A and C susceptible to #10031/#10051. If so, that could explain why they get such different measurements for string vs byte. You could rewrite B to callTestSpanConstructor<T>
and see if the time for the string case suddenly jumps up (assuming you're trying this via xunit-perf, you could even move the timing loops intoTestSpanConstructor<T>
when trying that).
How will we reason about such data? Is it just to highlight the proportional difference between byte/string and compare one run to another as code changes and see improvements? Does the "real" execution time not matter so much?
I think (supposing my hypothesis about #10031 is true) that the extra costs in A/C are all small fixed costs that can be made irrelevant by selecting appropriate iteration counts.
using (iteration.StartMeasurement()) | ||
for (int i = 0; i < Benchmark.InnerIterationCount; i++) | ||
span = new Span<byte>(array); | ||
InvokeTestSpanConstructor<byte>(length); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMO the different overloads are easier to read since they don't need the reflection magic, and they also have the benefit you mentioned in your edit. I don't mind changing it if you disagree, though.
|
||
static void InvokeTestSpanDangerousCreate<T>(int length) | ||
{ | ||
TestClass<T> testClass = new TestClass<T>(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Inside Invoke
, we start measuring just before invoking the delegate that's bound to the lambda body that calls TestSpanDangerousCreate<byte>
, and stop measuring once that delegate call returns.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems like this is a better way to handle xunit-perf/exe duality then what I've been doing elsewhere.
@ahsonkhan, I'm having trouble reproducing your results. I verified that FYI, I also tried moving the |
The jit code to optimize away the runtime type tests is fairly fragile, anything other than |
I wouldn't expect us to be able to clean this one up; I think it is the moral equivalent of array store type checks, which is in the Though as you get farther in propagating exact types, we could start applying that knowledge for this and array store checks when the array allocation is in scope. Do you want an issue for that? |
Maybe I am too pessimistic -- there are patterns in morph for the In a non-shared context we should be able to know the types well enough. |
I was actually testing/running against slow span which likely doesn't have a lot of the .net core 2.0 changes in the JIT/etc. I think you have addressed the main point of my comparison. I wasn't really showcasing why string or generic is so much larger but in general the test time is not measuring the real time of the API call.
I will need some guidance on getting the disassembly, we can chat offline about that. |
No change needed. |
What is stelems? |
Store array element operations |
Looking at scenario B, I must be doing something incorrect, because this line is wrong/won't compile: for (int i = 0; i < InnerIterationCount; i++) It should instead be: for (int i = 0; i < Benchmark.InnerIterationCount; i++) Fixing this, I cannot reproduce the weird numbers that I initially reported for B (and it makes sense, possibly I had a local variable InnerIterationCount=1 or something). Edit: yep, I had I tried to run tests again for two scenarios and here is what I observed: 1b) xunit.perf (lambda) 1a) shows Span<byte> constructor take ~30ms while 1b) shows approximately 8-10ms. Span<string> constructor uniformly takes around 400-440ms. Why a 3x reduction in runtime when we use lambdas? Regardless, this is not blocking, and we should merge this PR! |
// Use statics to smuggle some information from Main to Invoke when running tests | ||
// from the command line. | ||
static bool IsXunitInvocation = true; // xunit-perf leaves this true; command line Main sets to false | ||
static int InnerIterationCount = 0; // used to communicate iteration count from BenchmarkAttribute |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Suggestion: Rename this to something other than InnerIterationCount to avoid confusion between this and BenchmarkAttribute.InnerIterationCount. If code evolves, I don't want to accidentally run 0 iterations instead by using InnerIterationCount in the wrong places where BenchmarkAttribute.InnerIterationCount should have been used.
Maybe ExposedInnerIterationCount?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I could call it CommandLineIterationCount
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That works.
I really don't know. Do you see that result repeated consistently? 8ms/20ms are really short intervals to try to measure accurately. |
Great, but I still don't know if it's ok to add System.Memory to I also notice that the System.Memory references over in jit/config/benchmark get updated by @dotnet-bot when it updates references, and I don't know where the logic is that controls that to know if it would automatically start updating this reference in tandem or not. @stephentoub, maybe you know? |
I tried a few times. I could try again with larger iterations. |
Yes, it is ok to add it.
The auto update logic goes through all project.jsons and updates all version numbers it can find. It should kick in automatically for System.Memory, without any other special opt-in. |
198fa37
to
65226a2
Compare
- Re-enable the tests that were disabled in e859c30. - Re-work how tests are invoked from the command line to require less boilerplate - Now each test has a single [Benchmark] entrypoint - That entrypoint invokes its test's single inner-loop by wrapping it in a lambda that it passes to a new helper Invoke method (shared across all tests) which handles the xunit vs. command-line differences.. - Main finds the entrypoints by using reflection to search for the [Benchmark] attributes, so the explicit list of stringified test names is no longer needed and the command line will run the same set of tests that xunit-perf does. - The new SpanAPI tests now get invoked when this tests is run from the command line as well. - Add [NoInlining] to the SpanAPI tests' kernels. - Add some heap writes and conditional writes to prevent deadcode optimization from eliminating tests' kernels (assuming we don't do interprocedural constant propagation or deadcode, or store sinking and final value calculation, or PRE...). - Split the Index SpanAPI tests into one version that uses a loop-invariant index and thus should get hoisted out of the loop, and another version that uses variant indices.
Part of dotnet/corefxlab#1314 |
boilerplate
in a lambda that it passes to a new helper Invoke method (shared
across all tests) which handles the xunit vs. command-line
differences..
[Benchmark] attributes, so the explicit list of stringified test
names is no longer needed and the command line will run the same
set of tests that xunit-perf does.
command line as well.
optimization from eliminating tests' kernels (assuming we don't do
interprocedural constant propagation or deadcode, or store sinking and
final value calculation, or PRE...).
loop-invariant index and thus should get hoisted out of the loop, and
another version that uses variant indices.