Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[RyuJIT/ARM32] low performance compared to amd64 investigation (System.Tests.Perf_Char.Char_ToUpper test). #13049

Closed
viewizard opened this issue Jul 9, 2019 · 12 comments
Labels

Comments

@viewizard
Copy link
Member

Initial thread was started by @alpencolt https://github.com/dotnet/coreclr/issues/23520
Initial performance test results generated by @alpencolt
https://gist.github.com/alpencolt/0580af0be86e49bb9d89508dabcd8615

I investigate "System.Tests.Perf_Char.Char_ToUpper" test performance drop for arm32, compared to amd64. First 3 tests looks fine, but on fourth we have huge performance drop.

yield return new object[] { 'A', new CultureInfo("en-US") }; // ASCII upper case
yield return new object[] { 'a', new CultureInfo("en-US") }; // ASCII lower case
yield return new object[] { '\u0130', new CultureInfo("en-US") }; // non-ASCII, English
yield return new object[] { '\u4F60', new CultureInfo("zh-Hans") }; // non-ASCII, Chinese
| Slower                                                                           | diff/base | Base Median (ns) | Diff Median (ns) | Modality  |
| -------------------------------------------------------------------------------- | --------- | ----------------:| ----------------:| ---------- |
| System.Tests.Perf_Char.Char_ToUpper(c: '你', cultureName: zh-Hans)                | 133.18    |            33.79 |          4500.00 |           |
| System.Tests.Perf_Char.Char_ToUpper(c: 'İ', cultureName: en-US)                  | 2.28      |            29.59 |            67.55 |           |
| System.Tests.Perf_Char.Char_ToUpper(c: 'A', cultureName: en-US)                  | 1.86      |             6.46 |            12.00 |           |
| System.Tests.Perf_Char.Char_ToUpper(c: 'a', cultureName: en-US)                  | 1.26      |             6.04 |             7.60 |           |

I logged disassembly for both, amd64 and arm32 (Char_ToUpper_amd.txt
Char_ToUpper_arm.txt), but didn't find any issues in asm, that may drop performance so much.

I find out, that the drop by somehow connected to CultureInfo:

  1. If I change CultureInfo in test to "zh-Hans"/"zh" or even to "zh-CN" (zn_CN.UTF8 locale generated in system), test show me performance drop on all (ASCII and non-ASCII) symbols.
  2. If I change CultureInfo in test to not existent, say, "tv-TV", all is ok (at least performance drop no so huge, only in 2.2-2.5 times).
  3. If I change CultureInfo in test to any non "zh*", all is ok (at least performance drop no so huge, only in 2.2-2.5 times).

On both (amd64 and arm32) systems installed Ubuntu 16.04 with same libICU version:

$ apt-cache policy libicu55
libicu55:
  Installed: 55.1-7ubuntu0.4
  Candidate: 55.1-7ubuntu0.4

I see, libICU usage in test, but I don't see any CultureInfo-libICU connection in this test related code, all what I see is u_toupper() calls without any initialization during test execution.

Looks like I am stuck. Could someone route me, please?

@tarekgh
Copy link
Member

tarekgh commented Jul 10, 2019

I see, libICU usage in test, but I don't see any CultureInfo-libICU connection in this test related code, all what I see is u_toupper() calls without any initialization during test execution.

u_toupper is ICU API. the following is the code going from TextInfo to ICU.

https://github.com/dotnet/coreclr/blob/27790ab2dbee25d2e9c6fc41863aa7e983552a3f/src/System.Private.CoreLib/shared/System/Globalization/TextInfo.Unix.cs#L51
https://github.com/dotnet/coreclr/blob/27790ab2dbee25d2e9c6fc41863aa7e983552a3f/src/corefx/System.Globalization.Native/pal_casing.c#L19

Looks like I am stuck. Could someone route me, please?

Is it possible you can measure the perf of the ICU u_toupper()/u_tolower(), just to isolate if the issue is really the ICU or we have other issue in the managed side.

CC @adamsitnik

@viewizard
Copy link
Member Author

I did additional tests for ARM32 in order to get some u_toupper()/u_tolower() calls performance.

$ locale -a
C
C.UTF-8
POSIX
en_DK.utf8
en_US
en_US.iso88591
en_US.utf8
ja_JP.utf8
ru_RU.utf8
ru_UA.utf8
sl_SI.utf8
zh_CN
zh_CN.gb2312
zh_CN.utf8
zh_HK.utf8
zh_SG
zh_SG.gb2312
zh_SG.utf8
zh_TW
zh_TW.big5
zh_TW.utf8

[2019/07/10 08:16:52][INFO] |       Method | c | cultureName |         Mean |         Error |        StdDev |       Median |          Min |          Max | Gen 0/1k Op | Gen 1/1k Op | Gen 2/1k Op | Allocated Memory/Op |
[2019/07/10 08:16:52][INFO] |------------- |-- |------------ |-------------:|--------------:|--------------:|-------------:|-------------:|-------------:|------------:|------------:|------------:|--------------------:|
[2019/07/10 08:16:52][INFO] | Char_ToUpper | A |       en-US |     7.224 ns |     0.0874 ns |     0.0817 ns |     7.188 ns |     7.143 ns |     7.384 ns |           - |           - |           - |                   - |
[2019/07/10 08:16:52][INFO] | Char_ToUpper | A |       tv-TV |     8.489 ns |     1.9606 ns |     1.8339 ns |     7.435 ns |     7.243 ns |    12.296 ns |           - |           - |           - |                   - |
[2019/07/10 08:16:52][INFO] | Char_ToUpper | A |       zh-CN | 3,631.250 ns |   691.3996 ns |   796.2168 ns | 3,479.500 ns | 2,625.000 ns | 4,875.000 ns |           - |           - |           - |                   - |
[2019/07/10 08:16:52][INFO] | Char_ToUpper | A |     zh-Hans | 3,489.550 ns |   703.2143 ns |   809.8227 ns | 3,083.500 ns | 2,709.000 ns | 5,458.000 ns |           - |           - |           - |                   - |
[2019/07/10 08:16:52][INFO] | Char_ToUpper | a |       en-US |     7.455 ns |     0.1281 ns |     0.1136 ns |     7.417 ns |     7.321 ns |     7.638 ns |           - |           - |           - |                   - |
[2019/07/10 08:16:52][INFO] | Char_ToUpper | a |       tv-TV |     7.400 ns |     0.0925 ns |     0.0772 ns |     7.404 ns |     7.280 ns |     7.527 ns |           - |           - |           - |                   - |
[2019/07/10 08:16:52][INFO] | Char_ToUpper | a |       zh-CN | 3,570.750 ns |   652.1534 ns |   751.0208 ns | 3,249.500 ns | 2,708.000 ns | 4,833.000 ns |           - |           - |           - |                   - |
[2019/07/10 08:16:52][INFO] | Char_ToUpper | a |     zh-Hans | 3,395.778 ns |   714.4587 ns |   764.4628 ns | 3,208.500 ns | 2,666.000 ns | 5,167.000 ns |           - |           - |           - |                   - |

Accordingly to my logs, in this test u_toupper() was called 0 times.

@tarekgh
Copy link
Member

tarekgh commented Jul 10, 2019

Accordingly to my logs, in this test u_toupper() was called 0 times.

From the table it looks you are using ASCII characters ('A' and 'a'). we have optimization for the ASCII characters and will not call ICU in such cases. you can try it with non ASCII characters and you should get u_toupper get called. could you paste the perf comparison data data in ASCII and non ASCII scenarios? Sorry I don't have handy VMs to do it.

@janvorli I recall you mentioned before it is not right to compare the perf of 64-bit against 32-bit. is this right?

@viewizard
Copy link
Member Author

Sure, here is non ASCII test results for ARM32:

[2019/07/10 08:16:52][INFO] |       Method | c | cultureName |         Mean |         Error |        StdDev |       Median |          Min |          Max | Gen 0/1k Op | Gen 1/1k Op | Gen 2/1k Op | Allocated Memory/Op |
[2019/07/10 08:16:52][INFO] |------------- |-- |------------ |-------------:|--------------:|--------------:|-------------:|-------------:|-------------:|------------:|------------:|------------:|--------------------:|
[2019/07/10 08:16:52][INFO] | Char_ToUpper | İ |       en-US |    78.088 ns |     0.8403 ns |     0.7860 ns |    78.662 ns |    77.056 ns |    79.074 ns |           - |           - |           - |                   - |
[2019/07/10 08:16:52][INFO] | Char_ToUpper | İ |       tv-TV |    67.843 ns |     1.1325 ns |     1.0594 ns |    68.344 ns |    66.212 ns |    69.199 ns |           - |           - |           - |                   - |
[2019/07/10 08:16:52][INFO] | Char_ToUpper | İ |       zh-CN | 5,035.000 ns | 1,237.1635 ns | 1,375.1040 ns | 4,625.000 ns | 3,666.000 ns | 7,625.000 ns |           - |           - |           - |                   - |
[2019/07/10 08:16:52][INFO] | Char_ToUpper | İ |     zh-Hans | 5,017.526 ns | 1,392.2104 ns | 1,547.4381 ns | 4,250.000 ns | 3,583.000 ns | 7,958.000 ns |           - |           - |           - |                   - |
[2019/07/10 08:16:52][INFO] | Char_ToUpper | 你 |       en-US |    64.208 ns |     0.1734 ns |     0.1538 ns |    64.145 ns |    64.064 ns |    64.584 ns |           - |           - |           - |                   - |
[2019/07/10 08:16:52][INFO] | Char_ToUpper | 你 |       tv-TV |    72.827 ns |     0.5658 ns |     0.5292 ns |    73.052 ns |    72.058 ns |    73.830 ns |           - |           - |           - |                   - |
[2019/07/10 08:16:52][INFO] | Char_ToUpper | 你 |       zh-CN | 4,688.632 ns | 1,322.2356 ns | 1,469.6613 ns | 3,959.000 ns | 3,292.000 ns | 7,583.000 ns |           - |           - |           - |                   - |
[2019/07/10 08:16:52][INFO] | Char_ToUpper | 你 |     zh-Hans | 4,804.789 ns | 1,311.4575 ns | 1,457.6815 ns | 4,000.000 ns | 3,666.000 ns | 8,042.000 ns |           - |           - |           - |                   - |

But, the point is the same - for tests with "zh*" cultureName, I see huge performance drop and this is not connected to ICU u_toupper() calls time for sure, plus, ASCII tests with same performance drop on "zh*" cultureName tests don't call u_toupper() at all.

@tarekgh
Copy link
Member

tarekgh commented Jul 12, 2019

That is very interesting. The only place that can depends on the culture operation when using Ascii character is

https://github.com/dotnet/corefx/blob/59b60a232ecf3df2319fadd0193f5dcb42948e29/src/Common/src/CoreLib/System/Globalization/TextInfo.cs#L578

But this code should be executed once per culture.

Where is the code of the test System.Tests.Perf_Char.Char_ToUpper?

@viewizard
Copy link
Member Author

Where is the code of the test System.Tests.Perf_Char.Char_ToUpper?

RAW logs are quite big (189,6 MB for ARM32 and 125,7MB for AMD64), you can download packed files here:
ARM32:
https://github.com/viewizard/perf_test_logs/releases/download/0.0.1/Char_ToUpper_arm32_full.log.xz
AMD64:
https://github.com/viewizard/perf_test_logs/releases/download/0.0.1/Char_ToUpper_amd64_full.log.xz

@tarekgh
Copy link
Member

tarekgh commented Jul 12, 2019

@viewizard I am asking where I can find the C# code of this test?

@tarekgh
Copy link
Member

tarekgh commented Jul 12, 2019

Thanks @viewizard for the pointer.

Could you please try update the test to something like the following and remeasure? I am just trying to validate the theory that the one time initialization in the text info is the place which showing the perf difference.

    [BenchmarkCategory(Categories.CoreFX)]
    public class Perf_Char
    {
        private static CultureInfo s_en_US = new CultureInfo("en-US");
        private static CultureInfo s_zh_Hans = new CultureInfo("zh-Hans");

        public static IEnumerable<object[]> Char_ChangeCase_MemberData()
        {
           // force one time initialization
           char a = char.ToUpper('a', s_en_US);
           char b = char.ToUpper('a', s_zh_Hans);

            yield return new object[] { 'A', s_en_US }; // ASCII upper case
            yield return new object[] { 'a', s_en_US }; // ASCII lower case
            yield return new object[] { '\u0130', s_en_US }; // non-ASCII, English
            yield return new object[] { '\u4F60', s_zh_Hans }; // non-ASCII, Chinese
        }

        [Benchmark]
        [ArgumentsSource(nameof(Char_ChangeCase_MemberData))]
        public char Char_ToLower(char c, CultureInfo cultureName) => char.ToLower(c, cultureName); // the argument is called "cultureName" instead of "culture" to keep benchmark ID in BenchView, do NOT rename it

        [Benchmark]
        [ArgumentsSource(nameof(Char_ChangeCase_MemberData))]
        public char Char_ToUpper(char c, CultureInfo cultureName)=> char.ToUpper(c, cultureName);
    }
}

@viewizard
Copy link
Member Author

@tarekgh, here is new log for ARM32 with code you provided above:

[2019/07/15 08:17:54][INFO] |       Method | c | cultureName |      Mean |     Error |    StdDev |    Median |       Min |       Max | Gen 0/1k Op | Gen 1/1k Op | Gen 2/1k Op | Allocated Memory/Op |
[2019/07/15 08:17:54][INFO] |------------- |-- |------------ |----------:|----------:|----------:|----------:|----------:|----------:|------------:|------------:|------------:|--------------------:|
[2019/07/15 08:17:54][INFO] | Char_ToUpper | A |       en-US |  7.433 ns | 0.1736 ns | 0.1624 ns |  7.429 ns |  7.219 ns |  7.788 ns |           - |           - |           - |                   - |
[2019/07/15 08:17:54][INFO] | Char_ToUpper | a |       en-US |  7.552 ns | 0.2064 ns | 0.1830 ns |  7.483 ns |  7.376 ns |  8.038 ns |           - |           - |           - |                   - |
[2019/07/15 08:17:54][INFO] | Char_ToUpper | İ |       en-US | 68.384 ns | 0.8234 ns | 0.7300 ns | 68.858 ns | 67.341 ns | 69.124 ns |           - |           - |           - |                   - |
[2019/07/15 08:17:54][INFO] | Char_ToUpper | 你 |     zh-Hans | 64.715 ns | 0.3562 ns | 0.3332 ns | 64.885 ns | 64.269 ns | 65.265 ns |           - |           - |           - |                   - |

Just in case, here i new AMD64 log with code you provided above:

[2019/07/15 11:26:50][INFO] |       Method | c | cultureName |      Mean |     Error |    StdDev |    Median |       Min |       Max | Gen 0/1k Op | Gen 1/1k Op | Gen 2/1k Op | Allocated Memory/Op |
[2019/07/15 11:26:50][INFO] |------------- |-- |------------ |----------:|----------:|----------:|----------:|----------:|----------:|------------:|------------:|------------:|--------------------:|
[2019/07/15 11:26:50][INFO] | Char_ToUpper | A |       en-US |  4.566 ns | 0.0854 ns | 0.0799 ns |  4.574 ns |  4.382 ns |  4.689 ns |           - |           - |           - |                   - |
[2019/07/15 11:26:50][INFO] | Char_ToUpper | a |       en-US |  3.802 ns | 0.0645 ns | 0.0603 ns |  3.777 ns |  3.742 ns |  3.922 ns |           - |           - |           - |                   - |
[2019/07/15 11:26:50][INFO] | Char_ToUpper | İ |       en-US | 18.111 ns | 0.2511 ns | 0.2349 ns | 18.098 ns | 17.810 ns | 18.562 ns |           - |           - |           - |                   - |
[2019/07/15 11:26:50][INFO] | Char_ToUpper | 你 |     zh-Hans | 18.123 ns | 0.3878 ns | 0.3627 ns | 17.988 ns | 17.674 ns | 18.720 ns |           - |           - |           - |                   - |

@tarekgh
Copy link
Member

tarekgh commented Jul 15, 2019

@viewizard thanks a lot for providing the data. This looks ICU is somehow contributing to the difference as you can see even en-US culture showing a big difference too in case of non ASCII character.

I believe next step is to get a micro benchmark numbers to see the profiling internal methods and interops costs.
@janvorli @adamsitnik do you know how we can do micro benchmark profiling on ARM 32 and 64 on Linux? I couldn't find any doc pointer for that.

@msftgits msftgits transferred this issue from dotnet/coreclr Jan 31, 2020
@msftgits msftgits added this to the Future milestone Jan 31, 2020
@maryamariyan maryamariyan added the untriaged New issue has not been triaged by the area owner label Feb 26, 2020
@tarekgh tarekgh removed the untriaged New issue has not been triaged by the area owner label Jun 17, 2020
Copy link
Contributor

Due to lack of recent activity, this issue has been marked as a candidate for backlog cleanup. It will be closed if no further activity occurs within 14 more days. Any new comment (by anyone, not necessarily the author) will undo this process.

This process is part of our issue cleanup automation.

@dotnet-policy-service dotnet-policy-service bot added backlog-cleanup-candidate An inactive issue that has been marked for automated closure. no-recent-activity labels Dec 17, 2024
@dotnet-policy-service dotnet-policy-service bot removed this from the Future milestone Dec 31, 2024
@github-actions github-actions bot locked and limited conversation to collaborators Jan 31, 2025
@dotnet-policy-service dotnet-policy-service bot removed no-recent-activity backlog-cleanup-candidate An inactive issue that has been marked for automated closure. labels Jan 31, 2025
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Projects
None yet
Development

No branches or pull requests

4 participants