Reland "Migrate darwin common "framework_shared" target to ARC #37049" #37883
Conversation
…et to ARC flutter#37049" (flutter#37219)" (flutter#37320)"

This reverts commit cf22fc2.
I don't have any specific suggestions for optimizing that code under ARC. That said, I'm also not super concerned about the exact perf of that piece of code; anyone passing huge amounts of data over method channels as a
Agreed, I would not recommend holding back ARC migration on this benchmark regression. I would expect a non-trivial reduction in the number of crashes in Flutter applications over time by migrating to ARC, in addition to easier maintenance (which frees eng resources for other tasks--like going back and optimizing critical codepaths), both of which are very significant benefits.
That seems expected to me; the primary Foundation types all use CoW under the hood, I believe.
Why not just pull this logic into its own function in a file that is compiled with MRC? You can even add the `inline` keyword and hope it gets inlined:

```objc
NSArray* FlutterStandardReaderReadArray(FlutterStandardReader* reader) {
  UInt32 length = [reader readSize];
  NSMutableArray* array = [NSMutableArray arrayWithCapacity:length];
  for (UInt32 i = 0; i < length; i++) {
    id value = [reader readValue];
    [array addObject:(value == nil ? [NSNull null] : value)];
  }
  return array;
}
```

I'd have to look at the assembly output to understand why it is slower; my guess is that the extra retain and release calls for
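One way to act on this suggestion is to compile just the extracted file without ARC while the rest of the target stays ARC. A rough sketch, assuming a GN build like the engine's (the target and file names here are hypothetical):

```gn
# Hypothetical GN fragment: compile one source file with MRC while the
# rest of the target uses ARC. In Xcode, the per-file equivalent is
# adding -fno-objc-arc as a compiler flag on that source file.
source_set("standard_codec_mrc_helpers") {
  sources = [ "FlutterStandardCodecHelpers.m" ]  # hypothetical file name
  cflags_objc = [ "-fno-objc-arc" ]
}
```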
Here's the loop with ARC:

```
LBB2_2: ; in Loop: Header=BB2_4 Depth=1
ldr x0, [x24, _OBJC_CLASSLIST_REFERENCES_$_.1@PAGEOFF]
bl _objc_msgSend$null
mov x29, x29
bl _objc_retainAutoreleasedReturnValue
mov x23, x0
mov x0, x21
mov x2, x23
bl "_objc_msgSend$addObject:"
mov x0, x23
bl _objc_release
LBB2_3: ; in Loop: Header=BB2_4 Depth=1
mov x0, x22
bl _objc_release
subs w20, w20, #1
b.eq LBB2_6
LBB2_4: ; =>This Inner Loop Header: Depth=1
mov x0, x19
bl _objc_msgSend$readValue
mov x29, x29
bl _objc_retainAutoreleasedReturnValue
mov x22, x0
cbz x0, LBB2_2
; %bb.5: ; in Loop: Header=BB2_4 Depth=1
mov x0, x21
mov x2, x22
bl "_objc_msgSend$addObject:"
b LBB2_3
```

In order to get rid of all the extra retains and releases I used the `__unsafe_unretained` qualifier:

```objc
- (NSArray*)readArray {
  UInt32 length = [self readSize];
  NSMutableArray* array = [NSMutableArray arrayWithCapacity:length];
  for (UInt32 i = 0; i < length; i++) {
    __unsafe_unretained id value = [self readValue];
    [array addObject:(value == nil ? [NSNull null] : value)];
  }
  return array;
}
```

Which produces:

```
LBB2_2: ; in Loop: Header=BB2_3 Depth=1
ldr x0, [x23, _OBJC_CLASSLIST_REFERENCES_$_.1@PAGEOFF]
bl _objc_msgSend$null
mov x29, x29
bl _objc_retainAutoreleasedReturnValue
mov x22, x0
mov x0, x21
mov x2, x22
bl "_objc_msgSend$addObject:"
mov x0, x22
bl _objc_release
subs w20, w20, #1
b.eq LBB2_5
LBB2_3: ; =>This Inner Loop Header: Depth=1
mov x0, x19
bl _objc_msgSend$readValue
mov x29, x29
bl _objc_unsafeClaimAutoreleasedReturnValue
cbz x0, LBB2_2
; %bb.4: ; in Loop: Header=BB2_3 Depth=1
mov x2, x0
mov x0, x21
bl "_objc_msgSend$addObject:"
subs w20, w20, #1
b.ne LBB2_3
```

I didn't benchmark them, but it should be faster, although maybe not as fast as MRC because of the `_objc_unsafeClaimAutoreleasedReturnValue` call.

Edit: the "unsafe" prefix is scary, but in this case you are fine since the autorelease pool has a retain on the object.
To be clear, after investigating the problem I don't feel there is any legitimate reason to accept this loss of performance; this is something we work hard to optimize. You can either pull the loop out into an MRC file or use the `__unsafe_unretained` approach.
I tried

Having this method in MRC is OK to avoid a performance regression. However, there are many performance regressions like this as we migrate to ARC (some might be caught by benchmarks, some might not). If we are going to live with a partially-MRC embedder for performance reasons, we have to define some rules to draw a line between what we migrate and what we don't.
Probably not worth debugging since those functions like
The line here is that this is a particularly sensitive benchmark that potential customers evaluate when making decisions about adopting Flutter or not.
I guess the rule for the migration for now would be: if we break a benchmark, we need to evaluate it case by case and decide if we should have some code remain in MRC.
What's the real-world use case for a huge number of objects in a list over platform channels?
Where is the guarantee that it isn't a use-case? Platform channel performance is of particular importance to customers evaluating Flutter for add-to-app. That isn't a hypothetical. I'm happy to discuss privately. I'll also add that the list doesn't have to be large; chatty smaller payloads would experience the same overhead. Decoding messages happens on the UI thread, where we are talking about scales of 16ms to maintain smooth animations.
Nowhere. My position is based on the widely held industry view that premature optimization is generally a bad thing, so the bar for making code more complex and more difficult to maintain is not "do it unless you can prove that performance will never be a problem", but rather "only do it if there's concrete evidence that it is".
What is the impact (either in ms or percentage of overall message time) of just this retain/release in a list with, say, 10 objects? Maybe I'm misunderstanding the benchmark; I thought it was visible here only because of having a very high element count.
Does this benchmark reflect real-world-scenario performance though? Or is it accidentally and artificially highlighting one part that would be much smaller in most use cases?
Yea, this code has been production code for many years and we've got direct feedback from large clients about platform channel performance, specifically when evaluating for add-to-app scenarios where there is more communication over channels. I think we are beyond "premature optimization" =) The benchmark isn't absurd: the list is 1000 elements of various small types, and the regression is 13% slower. Here's where the test payload is created: https://github.com/flutter/flutter/blob/71139099c5c4069f13dba4387b98af6158735f62/dev/benchmarks/platform_channels_benchmarks/lib/main.dart#L199
Was it about Lists of 1000s of objects? I'm well aware that platform channel performance has been an issue before. I've been consistently saying that I don't see any reason to believe that the performance of this particular case is likely to be important, not that no platform channel performance matters.
There is a difference between "more communication" and "communication with these specific properties".
I have asked several questions that would help me evaluate whether that's actually true:
I haven't seen answers to those questions, just assertions that platform channel performance matters in some general way—which, again, I am not, and was not, disputing.
I didn't say that it was absurd, just that it doesn't seem like it would reflect typical use cases. Just saying it isn't absurd doesn't address my questions, it just dismisses them.
I think looking at it that way is an unfair characterization. Ultimately what you are asking for is analytics about real-world usage of platform channels; then we could know exactly which cases matter most. We don't have that, can't have that, and even if we did we'd still have to generalize a solution that balances the needs of all usages. The best we can do is extrapolate based on what users have told us. That information is confidential; I'm happy to discuss privately. Xiao is another good resource if you want. I think you are getting a bit bogged down on maintenance being easier with ARC, but if you look at the blame for FlutterCodecs.mm, the majority of it was written 6 years ago and hasn't changed in any meaningful way. In fact, it's more sensitive to change than probably any other part of the engine. So what really is the benefit of converting to ARC a file that doesn't change and has years of production testing, at the cost of performance? Even if, for argument's sake, that performance drop is rare?
I am not asking for a complete accounting for every single current and future use of platform channels, I'm asking for very specific information about what we already know.
Agreed, which is why I'm asking what we already know about use cases based on what we've been told. Unlike your request above that I prove an impossible negative, I'm just asking whether we have actual evidence that this benchmark corresponds to a real-world use case (since it doesn't at all match up with anything I've personally seen, and I'm struggling to imagine a non-contrived use case). Either we have such evidence, in which case this is an easy question to answer, resolving my concern; or we don't, in which case I continue to be skeptical that this benchmark reflects real-world outcomes rather than being a potentially misleading microbenchmark (and in which case I'd like to know the regression for usage that looks like everything I've seen in the wild, which is lists several orders of magnitude smaller than this).
I can follow up privately if needed, but let me rephrase my question to avoid issues of confidentiality first: have you seen a real-world use case that looks anything like a single message containing a 1000-object list?
Having pockets of MRC code in an ARC codebase is a significant liability, because it is extremely easy for people to change it and/or review it without realizing that it's MRC:
When I say that it's extremely easy, I say this as someone with years of experience bouncing back and forth between ARC and non-ARC reviews (in the context of readability reviews in particular), and even when I knew to be vigilant because it was new code I still sometimes got it wrong. If we fast-forward, say, a year, at which point this could easily be the only MRC Obj-C code in the engine, that would become even more likely because nobody would be expecting to see MRC code in a review. Having the file be rarely edited is a double-edged sword: while it reduces the opportunity for mistakes, it also means people are even less likely to be on the lookout for them, making mistakes more likely when those rare changes do come up.
Three more important datapoints:
Can you point me to the part of the conversation about
This continues to conflate "real world use case of platform channel performance being important" and "real world use case of this particular piece of platform channels being important".
I'm not sure where in the thread that is, perhaps you can point me to the specific part offline. But payload size doesn't tell us anything relevant to this benchmark regression without knowing what the composition of the payload is. Adding fixed-size-per-item overhead to list decoding will have ~zero effect on a list of two 15kb objects, and a lot of effect on a list of 30,000 1-byte objects.
The specific piece that decodes per-list-item entries is O(n), yes, but unless the benchmark only tests that one exact piece of code (in which case it would definitely be a really unrealistic microbenchmark), then the overall benchmark won't scale linearly.
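The payload-composition point above can be sketched with a toy model: if ARC adds a fixed cost per decoded item (an extra retain/release pair), total overhead tracks item count, not byte count. The per-item cost below is an arbitrary placeholder, not a measured number:

```python
def decode_overhead_ns(item_count, per_item_cost_ns=10):
    """Extra decode time attributable to fixed per-item overhead.

    per_item_cost_ns is a made-up illustrative figure, not a profiled one.
    """
    return item_count * per_item_cost_ns

# A list of two 15 KB blobs: only 2 items, so ~zero added overhead.
few_large = decode_overhead_ns(2)

# A list of 30,000 1-byte items: far more overhead despite a smaller payload.
many_small = decode_overhead_ns(30_000)

print(few_large, many_small)  # 20 300000
```

The ratio between the two cases is 15,000x even though the second payload is half the size, which is the distinction being drawn between "payload size" and "payload composition".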
Also, wasn't this the investigation that concluded that specifically optimizing platform channel encode/decode performance, even though it had showed up in CPU profiles, actually had almost no measurable real-world impact on startup time? Because if so, that seems to directly support the argument that microoptimization of this particular piece of code is not so high priority that it should block ARC migration.
You can see readValue's cost in the CPU traces. Keep in mind these were recorded on Android and, at least for the first CPU trace, capture just the UI thread. The reading cost there is equal to the writing cost (it's helpful to do a pivot on StandardMessageCodec).
Right, we cannot include every use case as a benchmark, so it requires a bit of extrapolation.
We don't have analytics about the actual composition of the payload. What we do know is that they are using the StandardMessageCodec, so it is unlikely to be one item. The idea that the contributing factor to the regression is list size is a theory, one that was called into question when Chris extracted that function and compiled it in MRC but didn't see an improvement. We know the channel that is generating these large payloads, and we have contacts on the team that may know the payloads if we want to dig deeper.
I don't think I understand this point. That particular benchmark was designed for test coverage, to make sure we are testing all data types, and the size was chosen so that its probability of being seen in the wild is greater than 1%. We can always create another one if we want. Stepping back, here are the takeaways you should have from looking at the conversation with customer:money:
Now the question is: what is the probability that this regression will not negatively affect any users? I suspect that should be very low given our limited data. So hopefully we can dismiss the notion that this regression wouldn't affect any user, or that this microbenchmark is unrepresentative, on account of that being unlikely. So now the calculus becomes simply: "Is the cost of maintaining one MRC file that rarely changes and has 6 years of testing greater than the cost of negatively affecting users' performance?"
That is not the only question. "How many users will be affected" and "how much will users be affected" are also very important questions.
There are several implicit value judgements baked into that "simple" distillation of the problem. I could just as accurately say that the calculus is "Is the cost of permanently maintaining a mixed MRC/ARC codebase, thus increasing the probability of future crash and leak bugs, greater than the cost of regressing a benchmark whose direct applicability to real-world performance has not been clearly established?" My opinion is that we should move forward with the ARC migration without a carve-out for the code covered by this benchmark, giving ourselves a solid foundation for future engine work. If in the future we have further evidence that platform channel encoding/decoding in particular is a sufficiently high priority for revisiting for optimization, we can investigate the best way to optimize it (which, per comments in the related doc, would probably not be via MRC ObjC, but some other solution).
I completely agree.
Okay, I'll try to distill all the variables used for making the decision. If we can agree on those at least I can feel confident that you are making this judgement with all the available information. Making a decision without having all the available information is the worst outcome here. It's hard for me to believe that if I conveyed what I know about all the work I've done on platform channels, you'd think migrating one file from MRC was worth it. So humor me in trying to lay out everything I know again. Since we are dealing with uncertain variables, I'll give them rough probabilities backed up with the limited data we have:
Hopefully that better explains how I see this choice and is a complete list. I understand ARC is better than MRC and we should adopt it. Stuart, you know me, I tried to migrate all of the engine to ARC months ago; it was a huge push. All the same, deciding to migrate this particular file in the face of these benchmarks is not pragmatic or supported by what we know about the code's history or our users' feedback. That's all ignoring what sort of message this decision communicates about our priorities, especially when these benchmarks have been called out by potential customers as key datapoints when deciding even to invest in Flutter.
Please answer my question above then.
My understanding from working on that team is that there were no easy solutions and that multiple fronts had to be addressed to reach the total goal. In the buganizer issue linked above you see a customer mentioning that it's hard to measure real-world performance because it's very noisy. They had to look at P95 numbers to detect a 2% decrease in startup time as a result of avoiding platform channel payload copies. While that is hard to detect, a 2% decrease is actually not insignificant, as you'll see below. Here's another example which is tied to the performance of the encoder: We noticed that

Note that while background platform channels provide a workaround for this regression, it isn't always possible to use them. So on the question of whether encoding changes were measurable, what we do know is that the variance of real-world benchmarks can hide significant changes. That's why we must consider the microbenchmarks, real-world benchmarks, and profiles when drawing conclusions; no single measurement paints the whole picture. If you want to privately send me some results you are looking at, I can better respond to them.
> In the buganizer issue linked above you see a customer mentioning that it's hard to measure real world performance because it's very noisy. They had to look at P95 numbers to detect a 2% decrease in startup time as a result of avoiding platform channel payload copies.
That is not what it says though. It says that with a benchmark resolution of 2% at P95 the benchmark showed no improvement.
> While that is hard to detect, a 2% decrease is actually not insignificant as you'll see below.
"2%" and "0% +/-2%" are not the same thing.
> Here's another example which is tied to the performance of the encoder: We noticed that customer:money was bottlenecked on the platform thread
It also has a lot to do with real-world thread contention, which this benchmark does not measure. The fact that encoder/decoder benchmark performance and real-world end-to-end performance are not at all the same thing is the core of my position.
(Edited to add one other point I forgot to include:

> 2. At some point I introduced platform channel analytics that can be turned on locally and customer:money did use it. The link is in that b/ conversation. You can see there are channels that are sending payloads of 27kb.

I looked at that spreadsheet, and it's showing total bytes, not individual messages, and it doesn't say how many messages there are. So what we know is that 27kb was sent total. Which again gets back to my concern about the huge potential difference between one huge message vs lots of small ones here.)
> we created the background platform channels which allows the encoding (and execution of the handlers) to happen on a background thread.
You are continuing to argue against a straw man that I have been very clear is not my position. I have, at no point, suggested that end-to-end, real-world platform channel performance does not matter.
Nobody in this review has advocated for the removal of background platform channel support. To the extent that the background channel data is relevant at all to the PR at hand, I believe it supports my position: in the real-world example you are drawing all of this data from, the outcome was that
- optimizing serialization had no measurable impact on startup time, while
- moving the end-to-end process to another thread had a significant effect on startup time.
So the evidence is that encoder/decoder performance by itself, which is what this PR discussion is about, was actually not a significant driver of real-world perf outcomes based on the data we have, and that other parts of the system play a significantly larger role.
I don't think continued discussion involving me will be productive here, as we are both now repeating things we've already said. I am comfortable with the tradeoff of proceeding here for the reasons I've articulated at length above (and summarized at the end of this comment), so I'm approving this from my standpoint.
@cyanglaz given the disagreement here, please don't land based on my approval alone. At least @jmagman and @zanderso should approve (or not) before proceeding, and potentially Ian as well if this needs to be escalated.
Yea, I understand the need to move forward, and at this point I feel my energy would be better spent fixing what is, in my estimation, a mistake instead of trying to convince you that it is a mistake. I would like you to at least bring this to the attention of @xster, since he has close connections to our users, is aware of this problem, and fields many questions about this topic, in case he has any more information that would be relevant. @xster quick recap of the discussion: This PR migrates StandardMessageCodec to ARC, which regresses a microbenchmark where the payload is 14k. We have seen
We converted some retains to copies, which may be contributing to some of the regression. The original PR didn't raise the question about whether copying the data is the correct course here.
Requesting changes not because of the larger issue discussed, but because I want to make sure we don't land this without an explanation of why some retains were changed to copies =)
I ran the profiler on this patch locally and verified I got a 10% regression. Here are the traces (isolating this test and cranking the number of iterations to 25000): encode time, decode time, bottom up.

So the extra retain/releases ARC inserts are the problem, like we suspected, but it's not strictly list size that drives the regression. It would be more correct to say the recursion depth, so the total number of items in the payload. You can see the costs of

I tried to duplicate the MRC code that we had previously using a combination of `__autoreleasing`, `__unsafe_unretained`, and `ns_returns_autoreleased`:

```objc
- (nullable id)readValue __attribute__((ns_returns_autoreleased)) {
  __autoreleasing id result = [self readValueOfType:[self readByte]];
  return result;
}

- (nullable id)readValueOfType:(UInt8)type __attribute__((ns_returns_autoreleased)) {
  FlutterStandardField field = (FlutterStandardField)type;
  __autoreleasing id result;
  switch (field) {
    // ...
    case FlutterStandardFieldList: {
      UInt32 length = [self readSize];
      NSMutableArray* array = [NSMutableArray arrayWithCapacity:length];
      for (UInt32 i = 0; i < length; i++) {
        __unsafe_unretained id value = [self readValue];
        [array addObject:(value == nil ? [NSNull null] : value)];
      }
      result = array;
      break;
    }
  }
  return result;
}
```

I couldn't get the right combination to make it duplicate the performance of the MRC code. Suffice it to say, there are 2 things I know now:
TL;DR: I tried to fix the regression but came away convinced it will be more difficult than I thought to get MRC performance with ARC enabled, and that if we do achieve it, the result will be harder to maintain than the MRC code.
The copy I mentioned looked fine. If we just migrate the codec to CoreFoundation, this sounds good to me.
Talked to @gaaclarke offline about the CoreFoundation approach. Since it would be good practice for me, I will prototype the migration to CoreFoundation to see the benchmark results.
After investigation, detailed in this doc, we have determined that migrating to static C functions and CoreFoundation is not trivial. Discussed with @gaaclarke offline, and we think we should move on, accept the regression for now, and come back to revisit platform channel performance later.
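For context on why dropping to C functions was attractive: outside of Objective-C method returns, the compiler does not insert implicit retain/release traffic, so reference counting becomes explicit and can be done exactly once per object. A toy C model of that ownership discipline (this is not real CoreFoundation; every name here is invented for illustration):

```c
#include <stdlib.h>

/* Toy refcounted object standing in for a CFTypeRef-style handle. */
typedef struct {
  int refcount;
} ToyRef;

/* "Create" rule: the caller owns exactly one reference. Nothing else
 * touches the count unless the programmer writes a retain/release. */
ToyRef* toy_create(void) {
  ToyRef* obj = (ToyRef*)malloc(sizeof(ToyRef));
  obj->refcount = 1;
  return obj;
}

void toy_retain(ToyRef* obj) {
  obj->refcount++;
}

/* Returns the count remaining after the release; frees at zero. */
int toy_release(ToyRef* obj) {
  int remaining = --obj->refcount;
  if (remaining == 0) {
    free(obj);
  }
  return remaining;
}
```

The contrast with the assembly earlier in the thread is the point: in plain C the programmer decides where each retain happens, whereas ARC conservatively pairs a retain/release around every method-returned object in the decode loop.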
Thanks @cyanglaz for looking into this. I strongly believe this migration to ARC should be limited to logic that does not affect benchmarks, on the grounds that customer needs trump developer comfort, especially when the gravity of the benchmark is weighed against the utility of having this one file be ARC. But assuming the team has considered that and does agree with the tradeoff: this code is correct, and a fair effort was made to improve it.
…ared" target to ARC flutter#37049" (flutter#37219)" (flutter#37320)" (flutter/engine#37883)
…117148) * df430c4fd Revert "[Impeller] Speculatively attempt to fix Metal PSO construction errors on host targets." (flutter/engine#38292) * 41dd4f5e1 Revert "Revert "[Impeller] Speculatively attempt to fix Metal PSO construction errors on host targets." (#38292)" (flutter/engine#38301) * 8ce9a3f71 Roll Skia from 3171deabd88a to b368746d696a (13 revisions) (flutter/engine#38294) * 4881fe25e Revert "Revert "reland "Migrate darwin common "framework_shared" target to ARC #37049" (#37219)" (#37320)" (flutter/engine#37883) * 3b2302c8d [Impeller] Remove validation log when the pipeline library is collected before pipeline is setup. (flutter/engine#38306) * a04997c81 [Impeller] Disable impeller_unittests. (flutter/engine#38307) * fc71faad0 License script improvements (flutter/engine#38148) * 6a2560c35 [Windows] Synthesize modifier keys events on pointer events (flutter/engine#38138) * b1d407563 Roll Skia from b368746d696a to 3f81f95176ce (11 revisions) (flutter/engine#38312) * b25fcf748 Roll Skia from 3f81f95176ce to 46e8f2a18a3d (3 revisions) (flutter/engine#38314) * 948699bba Collapse bounds calculations into DisplayListBuilder (flutter/engine#34365) * 38367de84 Roll Fuchsia Mac SDK from u-tC0QEGUT4xQ4KOo... to VEOIaacOA75U7PYyz... (flutter/engine#38316) * 29196519c Roll Skia from 46e8f2a18a3d to 9f728d78f10d (1 revision) (flutter/engine#38317)
…et to ARC flutter#37049" (flutter#37219)" (flutter#37320)" (flutter#37883) This reverts commit cf22fc2.
Relands #37219, which was reverted in #37320 due to a benchmark regression: flutter/flutter#114697
There is no change in this PR compared to #37219.
The benchmark regression is caused by ARC, specifically here: https://github.com/flutter/engine/blob/main/shell/platform/darwin/common/framework/Source/FlutterStandardCodec.mm#L458-L464 .
This regression is unavoidable and has been discussed in several places online: https://jimmymandersson.medium.com/understanding-arcs-effect-on-your-apps-performance-377e39076a88
I created two simple sample test cases comparing ARC and MRC performance, and the results (from my machine) show that the ARC version of the test took 8ms and the MRC version took 6ms: https://github.com/cyanglaz/MrcVSArc
ARC has many benefits, so I think we should still migrate to it. Please advise if there are any ways to optimize the ARC code in this PR to reduce the regression.
I have also tested the performance with all the "copy"s reverted to "retain", but the improvement is very minimal. I'd suggest keeping them as copy so the code is safer.
part of flutter/flutter#112232
cc @zanderso @stuartmorgan @gaaclarke for thoughts.