Skip to content
This repository has been archived by the owner on Jan 23, 2023. It is now read-only.
/ corefx Public archive

Add Encode(Span<char>) API #39900

Merged
merged 5 commits into from
Aug 6, 2019
Merged

Conversation

steveharter
Copy link
Member

@steveharter steveharter commented Jul 30, 2019

Fixes #39523

For performance adds the API linked above for encoding char. Also changes the semantics of System.Text.Json to use the replacement character for Utf16 for bad surrogate pairs instead of throwing - this makes it consistent with Utf8 and the default behavior of System.Text.Encoding.Json.

Improves performance of Utf16 encoding (with custom encode) by ~25% by avoiding Utf16->Utf8->Utf16 conversions:

Before
|            Method |       Mean |    Error  |   StdDev  |     Median |        Min |        Max | Gen 0/1k Op | Gen 1/1k Op | Gen 2/1k Op | Allocated Memory/Op |
| EscapeUtf16Custom | 1,745.3 ns | 19.710 ns | 16.459 ns | 1,749.1 ns | 1,719.2 ns | 1,774.9 ns |      0.2635 |           - |           - |             1.66 KB |
|  EscapeUtf8Custom | 1,606.9 ns |  8.175 ns |  6.827 ns | 1,607.7 ns | 1,593.5 ns | 1,618.8 ns |      0.2657 |           - |           - |             1.66 KB |

After
| EscapeUtf16Custom | 1,285.4 ns | 17.460 ns | 16.332 ns | 1,279.2 ns | 1,264.2 ns | 1,312.6 ns |      0.2684 |           - |           - |             1.66 KB |
|  EscapeUtf8Custom | 1,150.1 ns |  9.990 ns |  9.344 ns | 1,149.0 ns | 1,133.8 ns | 1,167.3 ns |      0.2702 |           - |           - |             1.66 KB |

Also by changing code in System.Text.Json, improves performance of Utf8 escaping when no escaping needs to occur (~15% faster), or occurs later in the string:

Before
|            Method | Formatted | SkipValidation |     Escaped |      Mean |     Error |    StdDev |    Median |       Min |       Max | Gen 0/1k Op | Gen 1/1k Op | Gen 2/1k Op | Allocated Memory/Op |
|------------------ |---------- |--------------- |------------ |----------:|----------:|----------:|----------:|----------:|----------:|------------:|------------:|------------:|--------------------:|
|  WriteStringsUtf8 |     False |          False |  AllEscaped | 54.647 ms | 0.1615 ms | 0.1432 ms | 54.612 ms | 54.468 ms | 54.922 ms |           - |           - |           - |               120 B |
|  WriteStringsUtf8 |     False |          False |  OneEscaped |  9.646 ms | 0.0793 ms | 0.0742 ms |  9.615 ms |  9.583 ms |  9.812 ms |           - |           - |           - |               120 B |
|  WriteStringsUtf8 |     False |          False | NoneEscaped |  6.866 ms | 0.2275 ms | 0.2620 ms |  6.759 ms |  6.555 ms |  7.303 ms |           - |           - |           - |               120 B |


After
|  WriteStringsUtf8 |     False |          False |  AllEscaped | 52.957 ms | 0.5980 ms | 0.5594 ms | 52.929 ms | 52.145 ms | 53.978 ms |           - |           - |           - |               120 B |
|  WriteStringsUtf8 |     False |          False |  OneEscaped |  8.678 ms | 0.0992 ms | 0.0879 ms |  8.688 ms |  8.529 ms |  8.845 ms |           - |           - |           - |               120 B |
|  WriteStringsUtf8 |     False |          False | NoneEscaped |  5.604 ms | 0.0700 ms | 0.0655 ms |  5.610 ms |  5.495 ms |  5.757 ms |           - |           - |           - |               120 B |

FWIW the ASCII escaping in System.Text.Json is ~25% faster than System.Text.Encoding.Web.JavaScriptEncoder - thus the reason why there is some duplicate functionality between the two; S.T.J will only call S.T.E.W for non-ASCII or when using a non-default encoder.

@steveharter
Copy link
Member Author

Note: the second commit addressing feedback was just rebased onto the original (due to local issues with out-of-sync history causing unnecessary commits being shown)

using (var writer = new StringWriter())
{
System.Text.Encodings.Web.JavaScriptEncoder.Default.Encode(writer, "\U0001f4a9");
Assert.Equal("\\uD83D\\uDCA9", writer.GetStringBuilder().ToString());
}

Span<char> destination = new char[12];
OperationStatus status = System.Text.Encodings.Web.JavaScriptEncoder.Default.Encode("\U0001f4a9".AsSpan(), destination, out int charsConsumed, out int charsWritten, isFinalBlock: true);
Copy link

@ahsonkhan ahsonkhan Jul 31, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At some point, we should probably fix up the namespace of these tests (rename Microsoft.Framework.WebEncoders to System.Text.Encodings.Web.Tests) (outside of this PR, ofc).

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ahsonkhan
Copy link

ahsonkhan commented Jul 31, 2019

Improves performance of Utf16 encoding (with custom encode) by ~25% by avoiding Utf16->Utf8->Utf16 conversions:

Also by changing code in System.Text.Json, improves performance of Utf8 escaping when no escaping needs to occur (~15% faster), or occurs later in the string

Great. We should get this in for 3.0, imo. This should bring back some of the perf that was regressed previously when we moved to JavascriptEncoder (#39415):
image

cc @ericstj

FWIW the ASCII escaping in System.Text.Json is ~25% faster than System.Text.Encoding.Web.JavaScriptEncoder - thus the reason why there is some duplicate functionality between the two; S.T.J will only call S.T.E.W for non-ASCII or when using a non-default encoder.

What's the reason for that? Is there a way for us to bring some of this perf improvement into S.T.E.W itself?

@ahsonkhan ahsonkhan added this to the 3.0 milestone Jul 31, 2019
@steveharter
Copy link
Member Author

cc @ericstj
FWIW the ASCII escaping in System.Text.Json is ~25% faster than System.Text.Encoding.Web.JavaScriptEncoder - thus the reason why there is some duplicate functionality between the two; S.T.J will only call S.T.E.W for non-ASCII or when using a non-default encoder.

What's the reason for that? Is there a way for us to bring some of this perf improvement into S.T.E.W itself?

The reason S.T.E.W is slower is because there is a class hierarchy (JavaScriptEncoder : TextEncoder) where both the default encoder and custom share the same (slower) code which is driven off of a combination of valid and invalid ranges which are flexible and can be specified by the caller -- and this code can't have the micro-optimizations that we have in S.T.J.

However we could create a new internal type (DefaultJavaScriptEncoder in S.T.E.W) and move code from S.T.J to S.T.E.W however using a custom encoder is still going to be 25% slower since it won't use that fast code.

Also note that S.T.E.W was already made 3x-4x faster from the original (for Utf8).

@GrabYourPitchforks
Copy link
Member

It'd be good to have a unit test validating the changes for the latest iteration 390e6c7, but otherwise LGTM. Thanks!

@steveharter steveharter merged commit 0cb8c78 into dotnet:master Aug 6, 2019
@steveharter steveharter deleted the Utf16Escaping branch August 6, 2019 19:30
steveharter added a commit to steveharter/dotnet_corefx that referenced this pull request Aug 6, 2019
For performance adds the new API for encoding char. Also changes the semantics of System.Text.Json to use the replacement character for Utf16 for bad surrogate pairs instead of throwing - this makes it consistent with Utf8 and the default behavior of System.Text.Encoding.Json.
@@ -607,7 +606,10 @@ internal static OperationStatus EncodeUtf8Shim(TextEncoder encoder, ReadOnlySpan
return OperationStatus.DestinationTooSmall;
}

return EncodeIntoBuffer(destinationPtr, destination.Length, sourcePtr, source.Length, out charsConsumed, out charsWritten, firstCharacterToEncode, isFinalBlock);
fixed (char* destinationPtr = destination)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What would be the benefit of deferring creating this fixed block? In general, is it best to keep your fixed blocks as small as possible and is there no value in trying to batch them together with other fixed block setup?

cc @jkotas

[MethodImpl(MethodImplOptions.AggressiveInlining)]
// Cast to (byte) for performance; avoids array bounds check.
private static bool NeedsEscaping(char value) => value > LastAsciiCharacter || AllowList[(byte)value] == 0;
private static bool NeedsEscaping(char value) => value > LastAsciiCharacter || AllowList[value] == 0;

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: implement this in terms of NeedsEscapingNoBoundsCheck

private static bool NeedsEscaping(char value)
    => value > LastAsciiCharacter || NeedsEscapingNoBoundsCheck(value);

@@ -696,6 +697,7 @@ public static void AssertContentsNotEqual(string expectedValue, ArrayBufferWrite
);

// Temporary hack until we can use the same escape algorithm on both sides and make sure we want uppercase hex.
// Todo: create new AssertContentsNotEqualAgainJsonNet to avoid calling NormalizeToJsonNetFormat when not necessary.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please create tracking issues for TODOs like this. They are easier to track/manage then searching for "todo" in source. Also, increases visibility for others to pick up issues like this.

wtgodbe pushed a commit that referenced this pull request Aug 8, 2019
For performance adds the new API for encoding char. Also changes the semantics of System.Text.Json to use the replacement character for Utf16 for bad surrogate pairs instead of throwing - this makes it consistent with Utf8 and the default behavior of System.Text.Encoding.Json.
@karelz karelz modified the milestones: 3.0, 5.0 Aug 9, 2019
picenka21 pushed a commit to picenka21/runtime that referenced this pull request Feb 18, 2022
For performance adds the new API for encoding char. Also changes the semantics of System.Text.Json to use the replacement character for Utf16 for bad surrogate pairs instead of throwing - this makes it consistent with Utf8 and the default behavior of System.Text.Encoding.Json.

Commit migrated from dotnet/corefx@0cb8c78
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add Encode(ReadOnlySpan<char>) method to TextEncoder for performance
8 participants