EXC_BAD_ACCESS in decodeMessage(decoder:) due to stack overflow #1034
I'm guessing this is the PR that led to this problem: #900
Hi, author of #900 here. This doesn't look like a particularly fun bug—thanks for the diagnosis. The rationale for #900 was to reduce ARC / heap pressure, primarily for use cases involving lots of small messages. In other languages, message fields are typically stored by reference rather than by value.
Hi, thanks for the response! To answer your questions:
Given that the encoder now assumes the protobufs will be value types, we could add some flexibility here by modifying the codegen. There are a lot of use cases where the reduced allocator pressure and improved data locality of simple structs is really valuable, and I'd like to find a way to keep it. So, what if we considered a codegen option that would change this behaviour to instead generate class-backed structs: that is, structs whose only stored member is a class, and whose fields are all computed properties. Those fields would still be stored properties on the class, giving good load performance, but the messages would no longer be nested inline; they'd be scattered over the heap as if they were classes, which would keep the struct sizes down.
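A minimal sketch of what such a class-backed struct might look like (all names here are hypothetical, not actual generated code), using copy-on-write to keep value semantics:

```swift
// Hypothetical sketch of a "class-backed struct": the struct's only stored
// member is a reference to a heap-allocated storage class, and all message
// fields are computed properties that forward to it.
final class _FooStorage {
    var name: String = ""
    var count: Int32 = 0
}

struct Foo {
    private var _storage = _FooStorage()

    // Copy-on-write preserves value semantics despite the shared class.
    private mutating func uniqueStorage() -> _FooStorage {
        if !isKnownUniquelyReferenced(&_storage) {
            let copy = _FooStorage()
            copy.name = _storage.name
            copy.count = _storage.count
            _storage = copy
        }
        return _storage
    }

    var name: String {
        get { _storage.name }
        set { uniqueStorage().name = newValue }
    }

    var count: Int32 {
        get { _storage.count }
        set { uniqueStorage().count = newValue }
    }
}
```

With this layout, `MemoryLayout<Foo>.size` is a single pointer no matter how many fields the message has, and copies only share storage until one of them is mutated.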
If the stack usage in the generated function is not a Swift compiler bug, then it might make sense to actually count oneof fields while computing useHeapStorage in the MessageGenerator. If that isn't sufficient then it seems like weighting fields by their size in the useHeapStorage calculation makes sense too, as suggested. Either way I think understanding why the function is using so much stack is a great idea. Perhaps we could do better just in how we lay out the code, like by using nested functions. @n8gray, out of curiosity, if you add a field of type "Tree Root Message" to the declaration of the problematic message (and obviously never set it), does the generated code then work? This will trigger useHeapStorage because of the declaration cycle. I am not in favor of adding an option for this. I mean, just pondering the question of how you would document it reveals that it wouldn't be clear or usable.
ObjC protos are the reference model, and they cause no end of folks getting threading wrong, especially with a deep chain like you have. Folks have to remember that once they set a message into a field they can't touch it again, since the parent message is referencing it. But if the messages are used for models, or to collect things and send them off for logging, it is way too easy to accidentally make changes, set the message as a field on another message to send to a server, and then continue to mutate it for the next thing to send out. Even if you try to copy the message on field setting (which greatly slows things down), there are still potential issues with repeated fields and map fields, since you can't force a copy on the way into those structures without rolling your own custom ones. So value types make it much harder to have threading issues.

The current calculation for when to use storage has always been a bit of a minimal guess, and it's too shallow. I've got some ideas I've toyed with in the past to start computing real costs of objects to better decide when to transition to storage, which should help the general problem here. But all of those past thoughts have been purely around the stack/heap costs; I haven't looked too much at the exact cost of the oneof enums, and definitely haven't thought about the cost of the extended-lifetime calls. We should be able to adapt to a better computed cost as a trade-off between the old and current approaches, which will hopefully solve this specific problem, and it can always continue to be tuned in the future. I'll try to take a look at this early next week.

@Lukasa the Storage model already is class-backed storage. As mentioned in the FAQ, options that impact how things generate are rarely a good path, because inevitably you end up with someone having a dependency on protos generated one way, and someone else depending on the same message(s) but wanting to generate them with another set of options.
This is exactly what's happening—SwiftSyntax had the same issue with its visitation methods for syntax trees because they contained very large `switch` statements. I don't think we need to do something that drastic here. There's a much easier change that I think we can make to reduce stack usage for debug builds—we just update the generator so that we nest the body of each case in an immediately executed closure:

```swift
switch fieldNumber {
case 1: try { try decoder.decodeSingularMessageField(value: &_storage._m1) }()
case 2: try { try decoder.decodeSingularMessageField(value: &_storage._m2) }()
case 3: try { try decoder.decodeSingularStringField(value: &_storage._m3) }()
case 100: try {
  var v: SomeCaseMessage1?
  if let current = _storage._render {
    try decoder.handleConflictingOneOf()
    if case .case1(let m) = current {v = m}
  }
  try decoder.decodeSingularMessageField(value: &v)
  if let v = v {_storage._case1 = .case1(v)}
}()
...
```
(I'm probably missing some explicit `self`s in the closures above.) The double-`try` looks odd, but the inner one covers the throwing decoder call and the outer one covers the call to the throwing closure. This will result in an extra function call for debug builds, but optimized builds should remove the closure completely. But by moving each case into its own "function", we ensure that it only occupies space on the stack when that case is executed, instead of the sum of all the cases being on the stack for the entire execution of the outer function.

There are other improvements we should look into to remove the likelihood that the memory size of messages is ever excessively and surprisingly large, but in the meantime, the above should be a straightforward change that would still be useful independent of other improvements.
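As a standalone sketch of the same pattern (with hypothetical names; this is not the actual generated decoder), the closure infers `throws` from its body, which is what makes the `try { ... }()` form compile:

```swift
enum DecodeError: Error { case malformed }

// A stand-in for a throwing decoder call.
func decodeInt(_ raw: Int) throws -> Int {
    guard raw >= 0 else { throw DecodeError.malformed }
    return raw
}

// Each case body is nested in an immediately executed closure, so in debug
// builds its locals occupy stack space only while that case actually runs.
func decodeField(fieldNumber: Int, raw: Int, into value: inout Int) throws {
    switch fieldNumber {
    case 1: try {
        value = try decodeInt(raw)
    }()
    default:
        throw DecodeError.malformed
    }
}
```

Capturing the `inout` parameter is legal here because an immediately applied closure literal is non-escaping.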
@allevato Brilliant! That sounds like a great solution to the stack frame problem. @thomasvl It’s good to hear you’re willing to take this on! I agree it’s best to avoid knobs and switches when possible, so I hope you can find heuristics that strike the right balance and avoid these problems with deep hierarchies.
@cburrows I’d try out your idea for adding a recursive reference but I’ve reverted all my pods and brews back to 1.7 at this point. I’m sure it would succeed at forcing that message to use heap allocation, but there are many other messages in our app that will have the same problem.
I can clarify the thinking behind the current design a bit. First of all, "reference vs. value semantics" and "inline vs. indirect storage" are two separate questions.
The first question involves what happens when you copy and then mutate a value: reference semantics cause all copies to reflect the mutations; value semantics mean that mutations do not affect other copies. (In particular, as @thomasvl pointed out, value semantics have real advantages for multi-threaded code.) The latter question involves how the storage actually gets allocated. Traditional Obj-C style classes provide reference semantics and indirect storage. Naive use of Swift structs gives you value semantics and inline storage. But value semantics are not incompatible with indirect storage; indeed, much of the standard library (String, Array, etc.) provides value semantics with indirect storage by using a struct or enum as a facade for a separate heap-allocated data store. So to answer the real questions:
The difficult piece has always been deciding what fields should be stored inline and what should be stored indirectly. There is really only one hard requirement: reference cycles cannot be stored entirely inline. But otherwise, we have a lot of options to consider:
I'm not eager to provide options to control the layout. I think we might have to do that someday, but I'd rather focus our attention now on finding better heuristics so that most users can get good results without having to manually tweak a lot of options. Here are some ideas I would like to explore:
@n8gray -- It looks like your message hierarchy might be entirely tree-structured, which raises an interesting question: Would it make more sense for each sub message to be a separate heap allocation? Or would it be better for the top-level message to be allocated on the heap as a single 6k structure with the submessages inlined into that storage?
@tbkka Thanks for laying out the history and trade-offs in a clear and concise way. As I mentioned earlier, the protobufs in our app represent the structure and data of complex interactive data visualizations, so they are large, deep trees.

With that in mind, the answer to your question is that pushing the inlining down one level is not going to help, as the next level will likely be just as bad, and if it’s not today it could easily be tomorrow. (In fact the “Root object” we are discussing is not top-level at all; it just happened to be at a level that had a lot of oneof cases and triggered the stack problem.) Aggressive inline storage is never going to work for our use case as a general strategy. It’s a great optimization to prevent extra boxing at the leaves of the tree, and to get cost-free abstraction when you want to wrap a single item or two in a message for whatever reason, but that’s probably it. We need a solution that reliably scales to lots of fields, lots of depth, and lots of oneof cases. Indirect storage is reliable that way, at the expense of performance penalties at the leaves of the tree. Fine for us, not great for #733. The more I think about it, the harder it seems to find a strategy that will work for both. Hopefully you folks have more imagination. 😉
Catching back up on the comments.
In general, a function's stack usage in debug builds will grow proportionally to the sum of the local storage of all the scopes it contains. Binding an associated value in a `case` pattern requires local storage for that binding, so instead of writing:

```swift
switch foo {
case .bar(let value): try doSomething(with: value)
}
```

we'd do this:

```swift
switch foo {
case .bar: try {
  guard case .bar(let value) = foo else { preconditionFailure() }
  try doSomething(with: value)
}()
}
```

Using this pattern, the binding's storage only exists inside the closure's own frame. If we go and make any of these changes, we should consider them carefully—we're giving the compiler/optimizer more work to do, for example, so we need to make sure that by reducing stack usage we're not raising compile time (for both types of builds), especially for targets with large messages.
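A self-contained version of that rewrite, with a hypothetical enum standing in for the oneof (names invented for the example):

```swift
enum OneofField {
    case bar(Int)
    case baz
}

func doSomething(with value: Int) throws -> Int { value * 2 }

// The associated value is re-derived inside the closure, so the binding's
// storage lives only in the closure's own debug-build stack frame rather
// than in the outer function's frame.
func handle(_ foo: OneofField) throws -> Int {
    var result = 0
    switch foo {
    case .bar: try {
        guard case .bar(let value) = foo else { preconditionFailure() }
        result = try doSomething(with: value)
    }()
    case .baz:
        break
    }
    return result
}
```

The `guard case` can never actually fail, since the outer `case .bar:` already matched; the `preconditionFailure()` just satisfies the compiler.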
That almost seems like it might rely on another compiler bug, then.
@n8gray Thanks for clarifying.
I wasn't really proposing pushing the inlining down exactly one level (there's no practical mechanism for determining exactly what that would mean), but rather trying to better understand whether your concern was about a few large top-level messages or a larger number of intermediate messages. As you pointed out, inlining still looks like a good choice for small messages, and #733 is really about how we can avoid unnecessary boxing for very small messages (~32 bytes in that case).
How big are your leaf messages? What would happen in your use case if messages smaller than about 64 bytes were stored inline? (Allocating a single object on the heap has about that much overhead by the time you add up the Swift object header, malloc data structures, and allocation rounding.) |
So I was being a little loose with terminology above. Here's the behavior:
It's hard to generalize since we have over 100 message types.
Also, re this being a duplicate of #1014: I don't believe so. #1014 is crashing in some of the Swift initialization code. Yes, the amount of stack used during decode also plays into that, but it is always going to be possible for that Swift internal runtime code to need more space than what is available in some context. Fixing the issues here won't guarantee that some other case couldn't still fail in the same way #1014 fails.
Some more quick findings. Consider a decode `switch` whose simple cases are each wrapped in a closure:

```swift
switch fieldNumber {
case 1: try { try decoder.decodeSingularMessageField(value: &_storage._m1) }()
case 2: try { try decoder.decodeSingularMessageField(value: &_storage._m2) }()
case 3: try { try decoder.decodeSingularStringField(value: &_storage._m3) }()
```

Whether the message uses a storage class or not also makes a difference in how much stack is needed to pass a stored property `inout`. So if we go the direction of wrapping each case in a closure, we could possibly avoid the closures for fields like cases 1, 2, 3 in messages that don't use a storage class.
@allevato opened a forums thread on this also: https://forums.swift.org/t/uneconomical-stack-usage-in-non-optimized-builds/39268
…f enum equality. Progress on: apple#1034
…f visiting Progress on: apple#1034
First PR up. In theory, we could attempt to compute the cost of each field and only use this closure workaround when things are over some size threshold, but given that a message or message chain can be recursive, there's no way to tell how expensive those would be, so those cases would still have to use closures for everything. It seems better to just keep the generator simple: if we generate a closure, always generate a closure. i.e., less variation in the generated code, less to prove correct.
The helper is fileprivate, so the compiler should always inline it, but this will help with stack usage for non-optimized builds (apple#1034).
…f isInitialized Progress on: apple#1034
@n8gray if you can test with master, I'd be curious how much the just-merged changes help your situation. I'll work next on when Storage is used, but a first data point on your real-world case would also be useful.
Thanks, I'll try to check it out today.
@thomasvl The new codegen has solved the stack issue.
Look at the recursive cost of message fields also, so a nesting of messages all just under the limit doesn't result in a top level message over the limit. This also addresses some of the issues from apple#1034 and hopefully from apple#1014. This is also work on apple#735
Second round of work, which makes the calculation around use of Storage transitive (and thus consistent), is now on HEAD. That should reduce the usage a little more. We'll let this sit on HEAD for a few days before considering cutting a new release to get these changes out to everyone. There are other places where we likely could do more codegen changes.
I just did some testing on 6806de7 and it looks a lot better. The sizes of the message structs I mentioned before have dropped significantly:
I still have misgivings about letting structs grow to 256 bytes but the worst case seems to be under control. Thanks! |
Yea, as mentioned on the other issue and commits, we've now got a framework in place to actually use real costs going forward which should be much better. |
Note: This is likely a duplicate of #1014 but I didn't want to cram all of this into a comment in a long chain. Feel free to close as duplicate if you want.
Summary/TLDR
The change to move computed properties backed by `_storage` to properties stored directly on the message structs has caused massive size increases for messages at the top of a deep hierarchy. The switch code generated for `oneof` parsing is sensitive to these size increases, leading to massive stack frames. In debug mode without optimization these stack frames can easily exhaust the 512KB stack of a background dispatch thread and cause a crash.

Details
We are experiencing a crash in debug builds after upgrading SwiftProtobuf and `protoc-gen-swift` from 1.7 to 1.10.2. Our crash happens in the `decodeMessage(decoder:)` method of a message containing a `oneof` with lots of cases. From what I can tell, the stack frame generated for the `withExtendedLifetime` closure in that method is huge, about 800KB. Since the decoding is happening on a background thread with a 512KB stack, the app crashes as soon as this closure is invoked.

The message is of the form:
The code that gets generated for `decodeMessage(decoder:)` has the form:

If I set a breakpoint at the line marked `LINE A` and check the stack pointer, then set a breakpoint at `LINE B` and do the same, I find that the difference is about 800KB.

Here's one example:
I have a couple of observations to explain this:
Observation 1:
In the earlier version of swift-protobuf, almost all fields of Message structs were stored in `_storage`, a reference-typed container, and exposed as computed vars. In the current version they seem much more often (but not always?) to be implemented as properties on the message itself. Thus the size of a leaf message struct with no child messages became O(n) instead of O(1), where `n` is the number of fields in the message.

The story gets a lot worse when you have child messages. Since messages are structs, children get embedded directly in their parents, so the size of a non-leaf message grows like the sum of its entire message sub-tree. So the size scales more like O(n^m), where n is the average number of children per level and m is the average number of levels in the message hierarchy. Since Swift always likes to allocate structs on the stack, this can lead to massive stack frames.
This agrees with what I saw when I measured the size of the `SomeCaseMessageN` messages using `MemoryLayout<SomeCaseMessageN>.size`. In the older code, they were all 24 bytes, with a few exceptions; the largest was 80 bytes. In the new code the sizes tended to increase by 10x-20x. 224 and 448 bytes were typical sizes, with several messages growing above 1000 bytes. One grew to 5256 bytes, while the biggest was 6080 bytes.
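The inline-growth effect described above is easy to reproduce with plain structs; a minimal illustration (all types invented for the example, not from the generated code):

```swift
// A "leaf" message stand-in with three inline fields.
struct Leaf {
    var a: Int64 = 0
    var b: Int64 = 0
    var c: Int64 = 0
}

// A parent embedding three leaves inline: its size is the sum of its
// children's sizes, so deep hierarchies compound multiplicatively.
struct Parent {
    var x = Leaf()
    var y = Leaf()
    var z = Leaf()
}

// An indirect parent keeps the payload behind a class reference, so the
// struct itself stays pointer-sized regardless of the tree's depth.
final class Box { var parent = Parent() }
struct IndirectParent { var storage = Box() }
```

Here `MemoryLayout<Leaf>.size` is 24 bytes, `MemoryLayout<Parent>.size` is 72 bytes, and `MemoryLayout<IndirectParent>.size` is a single pointer.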
If the compiler is smart, it will know that all the cases in the switch statement inside that closure are mutually independent. Only one of them will ever execute per invocation, so their local variables can share space on the stack, and the size of the stack frame will be the size of the biggest case in the switch. Since we're in debug mode it might not be making that optimization, and instead allocating separate space for every case in the switch. Given the size impact of the changes I discussed above, this could add up to a lot of space.
In fact, when I change the optimization mode from `none` to `size` or `speed`, the message sizes don't change but the size of the stack frame drops to about 26KB. That's still an oversized stack frame eating 5% of the stack in release mode, but at least it's not bigger than the entire stack!

For reference, with the code generated from 1.7, the stack frame is about 16KB with optimization mode `none`. With optimization mode `speed`, the stack frame drops to just 400B.
the stack frame drops to just 400B.Wrapping up
There seem to be problems with putting so many stored properties on messages. Perhaps this could be dialed back to where it was before?
Environment:
OS: macOS Catalina 10.15.6
Xcode: Version 11.6
Swift 5