Add Encode(Span<char>) API #39900

steveharter · 2019-07-30T21:31:49Z

Fixes #39523

For performance, this adds the API linked above for encoding char. It also changes the semantics of System.Text.Json to use the replacement character for bad Utf16 surrogate pairs instead of throwing, which makes it consistent with Utf8 and the default behavior of System.Text.Encoding.Json.
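For readers who want to see the shape of the call, here is a minimal sketch that mirrors the test added in this PR. It uses only the overload shown in that test; the variable names and the second call with a lone surrogate are illustrative, and it is written as a C# 9 top-level program purely for brevity (an assumption of this example, not part of the change).

```csharp
using System;
using System.Buffers;
using System.Text.Encodings.Web;

// U+1F4A9 is stored as a surrogate pair, i.e. two UTF-16 chars.
ReadOnlySpan<char> source = "\U0001f4a9".AsSpan();
Span<char> destination = new char[12];

// Encode straight from UTF-16 to UTF-16, with no detour through UTF-8.
OperationStatus status = JavaScriptEncoder.Default.Encode(
    source, destination, out int charsConsumed, out int charsWritten, isFinalBlock: true);

Console.WriteLine($"{status}: consumed {charsConsumed}, wrote {charsWritten}");
Console.WriteLine(destination.Slice(0, charsWritten).ToString());

// A lone high surrogate in the final block. Per the description above, bad
// surrogates are replaced rather than causing an exception; the exact output
// is printed here, not asserted.
Span<char> buffer = new char[12];
OperationStatus loneStatus = JavaScriptEncoder.Default.Encode(
    "\uD83D".AsSpan(), buffer, out _, out int loneWritten, isFinalBlock: true);

Console.WriteLine($"{loneStatus}: {buffer.Slice(0, loneWritten).ToString()}");
```

The first call should produce the same escaped text that the existing TextWriter-based test in JavaScriptStringEncoderTests.cs expects for this input.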

Improves performance of Utf16 encoding (with custom encode) by ~25% by avoiding Utf16->Utf8->Utf16 conversions:

Before
|            Method |       Mean |     Error |    StdDev |     Median |        Min |        Max | Gen 0/1k Op | Gen 1/1k Op | Gen 2/1k Op | Allocated Memory/Op |
|------------------ |-----------:|----------:|----------:|-----------:|-----------:|-----------:|------------:|------------:|------------:|--------------------:|
| EscapeUtf16Custom | 1,745.3 ns | 19.710 ns | 16.459 ns | 1,749.1 ns | 1,719.2 ns | 1,774.9 ns |      0.2635 |           - |           - |             1.66 KB |
|  EscapeUtf8Custom | 1,606.9 ns |  8.175 ns |  6.827 ns | 1,607.7 ns | 1,593.5 ns | 1,618.8 ns |      0.2657 |           - |           - |             1.66 KB |

After
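|            Method |       Mean |     Error |    StdDev |     Median |        Min |        Max | Gen 0/1k Op | Gen 1/1k Op | Gen 2/1k Op | Allocated Memory/Op |
|------------------ |-----------:|----------:|----------:|-----------:|-----------:|-----------:|------------:|------------:|------------:|--------------------:|
| EscapeUtf16Custom | 1,285.4 ns | 17.460 ns | 16.332 ns | 1,279.2 ns | 1,264.2 ns | 1,312.6 ns |      0.2684 |           - |           - |             1.66 KB |
|  EscapeUtf8Custom | 1,150.1 ns |  9.990 ns |  9.344 ns | 1,149.0 ns | 1,133.8 ns | 1,167.3 ns |      0.2702 |           - |           - |             1.66 KB |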

Also by changing code in System.Text.Json, improves performance of Utf8 escaping when no escaping needs to occur (~15% faster), or occurs later in the string:

Before
|            Method | Formatted | SkipValidation |     Escaped |      Mean |     Error |    StdDev |    Median |       Min |       Max | Gen 0/1k Op | Gen 1/1k Op | Gen 2/1k Op | Allocated Memory/Op |
|------------------ |---------- |--------------- |------------ |----------:|----------:|----------:|----------:|----------:|----------:|------------:|------------:|------------:|--------------------:|
|  WriteStringsUtf8 |     False |          False |  AllEscaped | 54.647 ms | 0.1615 ms | 0.1432 ms | 54.612 ms | 54.468 ms | 54.922 ms |           - |           - |           - |               120 B |
|  WriteStringsUtf8 |     False |          False |  OneEscaped |  9.646 ms | 0.0793 ms | 0.0742 ms |  9.615 ms |  9.583 ms |  9.812 ms |           - |           - |           - |               120 B |
|  WriteStringsUtf8 |     False |          False | NoneEscaped |  6.866 ms | 0.2275 ms | 0.2620 ms |  6.759 ms |  6.555 ms |  7.303 ms |           - |           - |           - |               120 B |


After
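|            Method | Formatted | SkipValidation |     Escaped |      Mean |     Error |    StdDev |    Median |       Min |       Max | Gen 0/1k Op | Gen 1/1k Op | Gen 2/1k Op | Allocated Memory/Op |
|------------------ |---------- |--------------- |------------ |----------:|----------:|----------:|----------:|----------:|----------:|------------:|------------:|------------:|--------------------:|
|  WriteStringsUtf8 |     False |          False |  AllEscaped | 52.957 ms | 0.5980 ms | 0.5594 ms | 52.929 ms | 52.145 ms | 53.978 ms |           - |           - |           - |               120 B |
|  WriteStringsUtf8 |     False |          False |  OneEscaped |  8.678 ms | 0.0992 ms | 0.0879 ms |  8.688 ms |  8.529 ms |  8.845 ms |           - |           - |           - |               120 B |
|  WriteStringsUtf8 |     False |          False | NoneEscaped |  5.604 ms | 0.0700 ms | 0.0655 ms |  5.610 ms |  5.495 ms |  5.757 ms |           - |           - |           - |               120 B |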

FWIW the ASCII escaping in System.Text.Json is ~25% faster than System.Text.Encodings.Web.JavaScriptEncoder, which is why there is some duplicate functionality between the two; S.T.J only calls S.T.E.W for non-ASCII input or when using a non-default encoder.

steveharter · 2019-07-31T17:49:17Z

Note: the second commit addressing feedback was just rebased onto the original (local issues with out-of-sync history were causing unnecessary commits to be shown).

ahsonkhan · 2019-07-31T19:12:35Z

src/System.Text.Encodings.Web/tests/JavaScriptStringEncoderTests.cs

            using (var writer = new StringWriter())
            {
                System.Text.Encodings.Web.JavaScriptEncoder.Default.Encode(writer, "\U0001f4a9");
                Assert.Equal("\\uD83D\\uDCA9", writer.GetStringBuilder().ToString());
            }
+
+            Span<char> destination = new char[12];
+            OperationStatus status = System.Text.Encodings.Web.JavaScriptEncoder.Default.Encode("\U0001f4a9".AsSpan(), destination, out int charsConsumed, out int charsWritten, isFinalBlock: true);


At some point, we should probably fix up the namespace of these tests (rename Microsoft.Framework.WebEncoders to System.Text.Encodings.Web.Tests) (outside of this PR, ofc).

https://github.com/dotnet/corefx/issues/40073

ahsonkhan · 2019-07-31T19:49:07Z

> Improves performance of Utf16 encoding (with custom encode) by ~25% by avoiding Utf16->Utf8->Utf16 conversions:
>
> Also by changing code in System.Text.Json, improves performance of Utf8 escaping when no escaping needs to occur (~15% faster), or occurs later in the string

Great. We should get this in for 3.0, imo. This should bring back some of the perf that regressed previously when we moved to JavaScriptEncoder (#39415):

cc @ericstj

> FWIW the ASCII escaping in System.Text.Json is ~25% faster than System.Text.Encodings.Web.JavaScriptEncoder - thus the reason why there is some duplicate functionality between the two; S.T.J will only call S.T.E.W for non-ASCII or when using a non-default encoder.

What's the reason for that? Is there a way for us to bring some of this perf improvement into S.T.E.W itself?

steveharter · 2019-08-05T20:02:20Z

> cc @ericstj
> FWIW the ASCII escaping in System.Text.Json is ~25% faster than System.Text.Encodings.Web.JavaScriptEncoder - thus the reason why there is some duplicate functionality between the two; S.T.J will only call S.T.E.W for non-ASCII or when using a non-default encoder.
>
> What's the reason for that? Is there a way for us to bring some of this perf improvement into S.T.E.W itself?

S.T.E.W is slower because of its class hierarchy (JavaScriptEncoder : TextEncoder): the default encoder and custom encoders share the same (slower) code, which is driven by a flexible combination of valid and invalid ranges that the caller can specify. That code can't have the micro-optimizations that we have in S.T.J.

However, we could create a new internal type (DefaultJavaScriptEncoder in S.T.E.W) and move code from S.T.J to S.T.E.W. Using a custom encoder would still be ~25% slower, since it wouldn't use that fast code.

Also note that S.T.E.W was already made 3x-4x faster from the original (for Utf8).
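To make "using a custom encoder" concrete, here is a minimal sketch of what that looks like from the caller's side. It relies only on public APIs (JavaScriptEncoder.Create, JsonWriterOptions.Encoder, Utf8JsonWriter); the Write helper and the sample string are made up for illustration, and it is written as a C# 9 top-level program for brevity.

```csharp
using System;
using System.IO;
using System.Text;
using System.Text.Encodings.Web;
using System.Text.Json;
using System.Text.Unicode;

// Hypothetical helper: writes one JSON string value with the given encoder
// and returns the resulting JSON text.
string Write(JavaScriptEncoder encoder)
{
    using var stream = new MemoryStream();
    using (var writer = new Utf8JsonWriter(stream, new JsonWriterOptions { Encoder = encoder }))
    {
        writer.WriteStringValue("Привет");
    }
    // The writer has been disposed (and therefore flushed) at this point.
    return Encoding.UTF8.GetString(stream.ToArray());
}

// Default encoder: non-ASCII characters are escaped.
string withDefault = Write(JavaScriptEncoder.Default);

// Custom encoder: the caller chooses which ranges may pass through unescaped.
string withCustom = Write(JavaScriptEncoder.Create(UnicodeRanges.BasicLatin, UnicodeRanges.Cyrillic));

Console.WriteLine(withDefault);
Console.WriteLine(withCustom);
```

The caller-specified ranges in the second call are the flexibility described above, and they are also why that path cannot take the ASCII-only shortcuts.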

GrabYourPitchforks · 2019-08-05T21:49:28Z

It'd be good to have a unit test validating the changes for the latest iteration 390e6c7, but otherwise LGTM. Thanks!

ahsonkhan · 2019-08-06T23:23:36Z

src/System.Text.Encodings.Web/src/System/Text/Encodings/Web/TextEncoder.cs

@@ -607,7 +606,10 @@ internal static OperationStatus EncodeUtf8Shim(TextEncoder encoder, ReadOnlySpan
                        return OperationStatus.DestinationTooSmall;
                    }

-                    return EncodeIntoBuffer(destinationPtr, destination.Length, sourcePtr, source.Length, out charsConsumed, out charsWritten, firstCharacterToEncode, isFinalBlock);
+                    fixed (char* destinationPtr = destination)


What would be the benefit of deferring the creation of this fixed block? In general, is it best to keep fixed blocks as small as possible, and is there no value in batching them together with other fixed block setup?

cc @jkotas

ahsonkhan · 2019-08-06T23:26:29Z

src/System.Text.Json/src/System/Text/Json/Writer/JsonWriterHelper.Escaping.cs

        [MethodImpl(MethodImplOptions.AggressiveInlining)]
-        // Cast to (byte) for performance; avoids array bounds check.
-        private static bool NeedsEscaping(char value) => value > LastAsciiCharacter || AllowList[(byte)value] == 0;
+        private static bool NeedsEscaping(char value) => value > LastAsciiCharacter || AllowList[value] == 0;


nit: implement this in terms of NeedsEscapingNoBoundsCheck

private static bool NeedsEscaping(char value) => value > LastAsciiCharacter || NeedsEscapingNoBoundsCheck(value);
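As a rough illustration of the pattern in this nit, here is a minimal sketch of an ASCII allow-list lookup. The names NeedsEscaping, NeedsEscapingNoBoundsCheck, AllowList and LastAsciiCharacter come from the diff and the comment above; the table contents, the IndexOfFirstCharToEscape helper and the top-level-program form are assumptions made for this example and are not the actual S.T.J code.

```csharp
using System;

const char LastAsciiCharacter = (char)127;

// Hypothetical allow list: 1 = may be written as-is, 0 = must be escaped.
// Illustrative contents only; the real table in S.T.J is stricter.
byte[] allowList = new byte[LastAsciiCharacter + 1];
for (int c = 0x20; c < 0x7F; c++)
{
    allowList[c] = 1;
}
allowList['"'] = 0;
allowList['\\'] = 0;

// The caller guarantees value <= LastAsciiCharacter. The name is taken from
// the review comment; this sketch does not try to elide the bounds check.
bool NeedsEscapingNoBoundsCheck(char value) => allowList[value] == 0;

// Range test first, then the table lookup, as suggested in the nit.
bool NeedsEscaping(char value) => value > LastAsciiCharacter || NeedsEscapingNoBoundsCheck(value);

// Hypothetical scan: index of the first char that needs escaping, or -1.
int IndexOfFirstCharToEscape(ReadOnlySpan<char> text)
{
    for (int i = 0; i < text.Length; i++)
    {
        if (NeedsEscaping(text[i]))
        {
            return i;
        }
    }
    return -1;
}

Console.WriteLine(IndexOfFirstCharToEscape("plain ascii".AsSpan())); // -1
Console.WriteLine(IndexOfFirstCharToEscape("say \"hi\"".AsSpan()));  // 4
Console.WriteLine(IndexOfFirstCharToEscape("caf\u00E9".AsSpan()));   // 3
```

Having one method own the table lookup keeps the range test and the lookup from drifting apart if the table ever changes.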

ahsonkhan · 2019-08-06T23:27:56Z

src/System.Text.Json/tests/JsonTestHelper.cs

@@ -696,6 +697,7 @@ public static void AssertContentsNotEqual(string expectedValue, ArrayBufferWrite
                    );

            // Temporary hack until we can use the same escape algorithm on both sides and make sure we want uppercase hex.
+            // Todo: create new AssertContentsNotEqualAgainJsonNet to avoid calling NormalizeToJsonNetFormat when not necessary.


Please create tracking issues for TODOs like this. They are easier to track/manage than searching for "todo" in source. It also increases visibility for others to pick up issues like this.

For performance adds the new API for encoding char. Also changes the semantics of System.Text.Json to use the replacement character for Utf16 for bad surrogate pairs instead of throwing - this makes it consistent with Utf8 and the default behavior of System.Text.Encoding.Json. Commit migrated from dotnet/corefx@0cb8c78

steveharter self-assigned this Jul 30, 2019

steveharter added area-System.Text.Encodings.Web area-System.Text.Json labels Jul 30, 2019

steveharter requested a review from ahsonkhan July 30, 2019 21:45

steveharter assigned GrabYourPitchforks Jul 30, 2019

GrabYourPitchforks reviewed Jul 30, 2019

gfoidl reviewed Jul 31, 2019

watfordgnf reviewed Jul 31, 2019

gfoidl reviewed Jul 31, 2019

steveharter force-pushed the Utf16Escaping branch from f96562e to b9e8703 July 31, 2019 17:27

Add Utf16 Encode() to S.T.Encoding.Web and uptake in S.T.Json

9fbd6ee

steveharter force-pushed the Utf16Escaping branch from b9e8703 to 9fbd6ee July 31, 2019 17:47

Netfx build fix

8094c3c

ahsonkhan reviewed Jul 31, 2019

ahsonkhan added this to the 3.0 milestone Jul 31, 2019

ahsonkhan reviewed Aug 1, 2019

Review feedback

11f7f60

GrabYourPitchforks reviewed Aug 5, 2019

Return NeedMoreData only for trailing high unpaired surrogate

390e6c7

GrabYourPitchforks approved these changes Aug 5, 2019

Add additional surrogate tests; avoid creating destPtr until needed

bc7eb52

steveharter merged commit 0cb8c78 into dotnet:master Aug 6, 2019

steveharter deleted the Utf16Escaping branch August 6, 2019 19:30

steveharter mentioned this pull request Aug 6, 2019

[release/3.0] Add Encode(Span<char>) API (#39900) #40072

Merged

ahsonkhan reviewed Aug 6, 2019

karelz modified the milestones: 3.0, 5.0 Aug 9, 2019

AndyAyersMS mentioned this pull request Jan 31, 2020

Detect unnecessary AggressiveInlining annotations dotnet/runtime#13186

Open

This was referenced Feb 1, 2020

Fix up System.Text.Encodings.Web test namespace to be consistent with other corefx projects dotnet/runtime#30509

Closed

Use Strings resx file for the exception messages within System.Text.Encodings.Web dotnet/runtime#30510

Closed

ahsonkhan mentioned this pull request Feb 15, 2020

Create new AssertContentsAgainstJsonNet to avoid calling NormalizeToJsonNetFormat when not necessary. dotnet/runtime#32351

Closed
