Pass SemanticModel as an argument where it is required #25839
Conversation
📝 Build failure was because #25681 isn't merged yet
```diff
@@ -31,7 +31,7 @@ internal partial class FindReferencesSearchEngine
         var symbol = symbolAndFinder.symbolAndProjectId;
         var finder = symbolAndFinder.finder;

-        await ProcessDocumentAsync(document, symbol, finder).ConfigureAwait(false);
+        await ProcessDocumentAsync(document, model, symbol, finder).ConfigureAwait(false);
```
thank you @sharwell
One question: is this the only place where you pass in the model from document.GetSemanticModelAsync? No other place?
If that is true, there seems to be an issue with weak reference caching.
I agree. This indicates a much deeper and more concerning bug. As we can see in this method, the model is acquired at the top and then held all the way to the finally block. It is part of the contract of Roslyn that, as long as this model is held, Document.TryGetSemanticModel should succeed and must always return that same semantic model.
This approach works around that contract not holding. But having that contract be violable is very worrying.
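To make the contract concrete, here is a hedged sketch (illustrative only; the assertion is not actual Roslyn code, just a restatement of the contract described above):

```csharp
// While a strong reference to 'model' is held, the contract says
// Document.TryGetSemanticModel must succeed and return that same instance.
var model = await document.GetSemanticModelAsync(cancellationToken).ConfigureAwait(false);
try
{
    // ... arbitrarily deep async processing of 'document' ...

    // Per the contract, this should never fire while 'model' is rooted:
    if (!document.TryGetSemanticModel(out var cached) || !ReferenceEquals(cached, model))
        throw new InvalidOperationException("Workspace semantic model contract violated");
}
finally
{
    GC.KeepAlive(model);
}
```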
@CyrusNajmabadi Agreed, something is up. If we want the perf optimization that's fine, but this is incredibly easy to break. :-)
@sharwell I obviously got the removal of the evil TryGetSemanticModel calls; is the threading of SemanticModel through just so we're not refetching it all the time? Was there evidence that the refetch wasn't happening, and thus we have a perf problem somewhere, as I think @heejaechang is alluding to?
My hypothesis from information provided so far is it's possible to have a case where an ongoing FAR operation ends up only holding a SemanticModel via a weak reference. Passing the semantic model through ensures a true root is held for the duration of processing the associated document.
I suggested SemanticModel as a parameter, rather than doing TryGetSemanticModel inside of the extension methods, since the contract that someone else MUST hold onto the semanticModel before calling those two extension methods is not visible anywhere, so it is fragile and easy to break. Since this call can be buried inside, it must propagate to the top; otherwise, same issue. Or we can change TryGetSemanticModel to await document.GetSemanticModelAsync like I did, to make sure the extension methods just work in any condition. ... That being said, @sharwell actually did the work to propagate the parameter to the top, so now we can actually be sure, in the current implementation, of all the callers who end up calling these 2 extension methods. If this (#25839 (review)) is the only top call that passes in a model from await document.GetSemanticModelAsync, then we have another problem (using TryGetSemanticModel should still be fixed, in my opinion, regardless of this). Either among the calls,
This would be weird, given that TryGet is synchronous, and would then be calling something sync.
If we just deprecated TryGetSemanticModel it'd be a happy day for me, so I don't mind in the slightest replacing it with GetSemanticModelAsync, which was the original change IIRC for this. My guess is at one point that helper wasn't async and so we did this to be "fancy". I despise fancy. I'm now trying to understand why we're propagating it everywhere. Is that just since we know everybody will need it, so it's cheaper than having them re-call it, or something else? If it's a correctness fix, I'm terrified.
both extension methods are already async. And in the whole find-all-references code, those 2 extension methods are the only ones that use TryGetSemanticModel. If we want TryGetSemanticModel, we should change everything in find all references except the top one to use TryGetSemanticModel.
My hypothesis from information provided so far is it's possible to have a case where an ongoing FAR operation ends up only holding a SemanticModel via a weak reference, and if that reference gets collected before one of the methods that assumed it was available gets called, then it crashes. This should not be possible. See the code here: Lines 24 to 39 in 2c383ed
We not only get the semantic model, but we also hold onto it explicitly until the finally afterwards. It should not be the case that the only thing holding on is a weak-ref, because it was always supposed to be the case that the outer ProcessDocumentQueueAsync method was the one that created and held onto the model once. If we are failing TryGetSemanticModel then something bad is happening. Either we are getting the model for another document (which doesn't seem likely given your PR, where we can see that the semantic model only flows in from one location), or there is some sort of bug in TryGetSemanticModel and it is not abiding by the contract we originally had for our workspace TryGetXXX methods.
I would be fine with this as well. :) However, as long as we have this method, we should try to figure out why it is not working properly right now in this scenario.
+1 on
we should deprecate TryGetSemanticModel (not delete it, but make it private) and just make GetSemanticModelAsync return ValueTask. That will let us have the same perf.
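A ValueTask-based accessor along these lines could keep the cached fast path allocation-free (hedged sketch, not the actual Roslyn signature change; ComputeSemanticModelAsync is a hypothetical helper):

```csharp
// Completes synchronously (no Task allocation) when a cached model is alive;
// falls back to the async computation otherwise.
public ValueTask<SemanticModel> GetSemanticModelAsync(CancellationToken cancellationToken)
{
    if (this.TryGetSemanticModel(out var model))
        return new ValueTask<SemanticModel>(model);

    return new ValueTask<SemanticModel>(ComputeSemanticModelAsync(cancellationToken));
}
```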
Note: TryGetSemanticModel was not used here for perf. It was used for correctness. It was a statement in the code of "if i can't get the semantic model immediately, then something is very bad, because i'm being called in a context where my parent didn't create and cache that semantic model." That's why i'm opposed to the simple fix of using 'await GetSemanticModelAsync'. It would certainly work around the crash. But it would not address the fundamental problem of: how did we get to a state where we needed the semantic model, but it hadn't already been cached? (or, how did we get to the state where we needed the semantic model, it had been cached, but TryGetSemanticModel fails?) Both are concerning (though it seems like it's the latter thing that is happening).
if we are rejecting using await document.GetSemanticModelAsync in the extension methods, then those methods should not get the semantic model from anywhere else except at the top. Otherwise, what is the point of not doing await document.GetSemanticModelAsync in the extension methods, but doing it in the caller right before calling them?
Note: it's a public part of the Roslyn API. So we'd have to go through compat council on this.
There's a fourth case. Our contract between GetSemanticModelAsync and TryGetSemanticModel has a bug (i.e. a race condition). It is somehow possible to call GetSemanticModelAsync and get a model, but then have a race that causes TryGetSemanticModel to fail. At this point i have no clue which of the above cases is going on. And i don't think we should just take something that fixes the symptom, rather than understanding the underlying cause. Otherwise, we'll just hit this elsewhere in other code that depends on this contract. -- Alternatively, if we deprecate, then we are essentially saying: no one can or should depend on the contract. In which case, i'm fine with us then just side-stepping the issue entirely.
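A self-contained sketch of the suspected interaction (simplified; this is not Roslyn's implementation, just the general unsynchronized WeakReference pattern under discussion):

```csharp
using System;

class WeakCache<T> where T : class
{
    private readonly WeakReference<T> _weak = new WeakReference<T>(null);

    public T GetValue(Func<T> compute)
    {
        if (_weak.TryGetTarget(out var value))
            return value;

        // Race: two threads can both miss, each compute a distinct value,
        // and each call SetTarget. The last writer wins, so the other caller
        // holds a value the cache no longer references; once the winner's
        // value is collected, TryGetTarget fails even though the loser still
        // strongly holds its (different) instance.
        value = compute();
        _weak.SetTarget(value);
        return value;
    }
}
```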
it doesn't need to be another document (another file); it can be a document forked from the original document (same file). I believe one of these 3 is happening.
The point is: with our understanding of roslyn, and how FindRefs is structured, what is going on should be impossible. It implies that there is a bug between the interactions of TryGetSemanticModel and GetSemanticModelAsync. That bug may be in roslyn itself, or it may be deeper. i.e. we have a race condition in the workspace, or there's something even scarier deeper below in the runtime. The former is probably more likely. But maybe it's the latter. The overall point is: we should not be only trying to work around this bug with a FindReferences code change. That fixes the FindReferences manifestation of the bug, but the underlying bug still remains. Either we should find and fix the underlying bug, or we should deprecate and disallow "TryGetSemanticModel" so that there is no more "interaction" between the two that could even lead to a problem.
tagging @jinujoseph, it is your call.
@heejaechang I don't understand your 3 choices. What do each mean? And what are the potential consequences of each? |
for
True. Though it would be super weird for FindRefs to do any forking of documents. I can't think of any reason or case where that would ever happen. But an audit would be a fine way to validate that.
True. Though again, i don't see how that happens. And if it does, it indicates a deeper bug. FindRefs is a basic async flow. There are no fire-and-forget parts of it. And if something is doing that, it is violating part of the design of FindRefs, where all document processing is supposed to happen within ProcessDocumentQueueAsync, and all work for a document is supposed to be done by the time we call that finally block.
I don't think that one should even be on the table. It would literally be an example of us not understanding a bug and trying to avoid having to understand it, just to fix a symptom rather than a cause. Not fixing/understanding the cause just means we'll hit this sort of thing elsewhere. We should always put in the effort to understand the true cause, and at that point, a determination can be made as to the appropriate fix.
so I wrote the 3 cases where it can happen. Since I didn't audit the code, I don't know whether someone forked a document or not. Since I didn't audit the code, I don't know whether one of the calls inside of the top call escaped the scope (tasks and await can easily do that without one realizing), or whether it's a CLR issue. I believe @genlu encountered an issue with weak references before due to a CLR bug, right?
I'm a little concerned about doing things in this order. Because, as we've seen in the past, that investigation rarely happens. I would just go with '1'. We should investigate to figure out what is going on. Either FindRefs isn't working as it is supposed to**, or TryGetSemanticModel is not. In either case, adding 'await' fixes the symptom, but doesn't address the underlying problem. -- ** Note: if FindRefs is not working as it's supposed to, i would still want to understand that. Because it's very intentional that we process a single document at a time, and that we expect to only have a single semantic model alive at a time. If FindRefs has changed such that that isn't so, we need to know.
"We are verifying the correctness of our cache" is to me a statement of performance, not correctness -- the cache is a perf fix, nothing more. 😄
Can you give an example of how 'await' can easily do that? We need to know, as we use similar patterns all over the place (especially in oop). i.e. it's super common for us to have:

```
try
{
    await lots of code
}
finally
{
    // do work
}
```

FindRefs was written with the expectation that that scoping would not be escaped, and thus it could trust that the lifetime of the model would exist across all those async calls. If that's not the case, there is a lot of roslyn that is going to have problems. For example, if that finally does any sort of resource cleanup, etc., then all of that async code is suspect if it can "easily escape the scope".
that's why I wrote it as @jinujoseph's call. My opinion is: sure, if she has the resources to go and figure it out, then why not. Practically, whether it is going to be worth the effort, I don't know. Asking someone to take a look for several hours, and if one can't figure it out then drop it, sure, that will work as well.
If we don't have the resources to actually investigate root causes for bugs, then that would be deeply concerning :) Such an environment would lead to a preponderance of root issues, continually papered over with fixes of symptoms. I would like for Roslyn to never get to that stage :) If we identify that there is likely some other root issue at hand, some amount of due diligence should always go into understanding it. This is truly something that impacts and affects the long term health of the project. -- Note: this is similar to an issue we had before, where we had an assumption that if a document was open you could always map between certain objects if you were on the UI thread. I believe that rename critically depended on this. At one point we broke this, and a simple band-aid fix was proposed to address the problem. However, we went and actually investigated and figured out the problem inside our system where we were doing things improperly.
change 'await' to 'async'. If someone doesn't await a task properly, it can always escape the scope.
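One hedged illustration of how a missing await lets work escape the try/finally scope (not from the Roslyn codebase; ProcessDocumentAsync is just the method name used in this PR):

```csharp
var model = await document.GetSemanticModelAsync(cancellationToken).ConfigureAwait(false);
try
{
    // BUG: the task is not awaited, so this work keeps running after the
    // finally block executes and after 'model' may no longer be rooted here.
    _ = ProcessDocumentAsync(document, model, symbol, finder);
}
finally
{
    // Any cleanup here can now race with the escaped task above.
}
```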
Another concern for me is that we have TryGetSyntaxTree/GetSyntaxTreeAsync and TryGetSemanticModel/GetSemanticModelAsync, and they don't have nearly identical logic. That seems super wonky to me, as i would expect them to have almost exactly the same logic around caching values and ensuring consistency between the pairs.
That works for me. Note: the cache can't be removed entirely. It needs to be passed along so that the individual finders can benefit from the caching it does of other intermediary values.
if you want to keep your approach, you probably want to remove the other places where it uses await document.GetSemanticModelAsync. Otherwise, the code will be inconsistent, where the semantic model is all passed in but some places just call await document.GetSemanticModelAsync again. I actually like my simple approach, where the extension method just calls await document.GetSemanticModelAsync. If one believes that assert is important, one can always add TryGetSemanticModel as an assert before the await, and explicitly do an NFW.
I assume of course if the caching of SemanticModels was important, we have a test that ensures that somehow and a perf test that shows the perf benefit....right? 😄
LGTM. Heejae's suggestion also makes sense though. If we have any other cases in the ReferenceFinders where we use .GetSemanticModelAsync, we can probably just remove them and pass along this semantic model.
FindReferenceCache is static since the Finder API was not a factory but a MEF service; it felt weird to put it as an argument to each call, so it ended up as the static cache thing. If people are okay to pass the cache in as an argument, +1. If not, it would also be nice to make each finder a factory, so that one can create a finder with the cache from the factory that gets imported from MEF. That would also let other static caches in the finders, such as (http://source.roslyn.io/#Microsoft.CodeAnalysis.Workspaces/FindSymbols/FindReferences/DependentTypeFinder.cs,51), be non-static and not based on the evil ConditionalWeakTable. (By the way, there are many more.) ... We could also use that as a stepping stone to make multiple find reference calls on the same solution more efficient (such as HR or code lens), by sharing caches. Right now, some static ConditionalWeakTable caches keyed off the solution might be re-used, and some might not. One can't reason about it since it is spread across many places deep in the finders, and the caller can't control those.
I'm not sure what this means. FindReferences is supposed to just be an API. External clients call it. Under the covers it can create whatever data it wants and can pass that to all its internal impl details.
I dont know about those, but FindReferenceCache is important since it only caches stuff that showed up in ETL traces as something each finder does repeatedly. If I remember correctly, it improved average find-all-references perf from 12 seconds to 1 second. But now with IOperation, we might be able to just use IOperation rather than having this cache, because semanticModel.GetSymbolInfo does the same thing repeatedly. IOperation basically does what the FindReferenceCache is trying to do.
I don't believe so. For example, it looks like the stuff you're caching are things like:
i.e. a mapping from a string to all the tokens matching that string in the syntax tree. This isn't something IOperation can provide. And, it's something we would want to only happen once per identifier, so that it could be reused over all the finders.
Basically, it's a way to memoize common operations that individual finders do while allowing them to operate in parallel. Such a thing does not have to be static. It can just be state that is passed to all of them.
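A non-static version could simply be per-search state created once and handed to every finder (sketch; the type and members here are hypothetical, not the actual FindReferenceCache API):

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;

// Created once per FindReferences call and passed to each finder, instead of
// living in a static ConditionalWeakTable keyed off the solution.
class PerSearchCache
{
    // e.g. memoizes identifier text -> matching token positions in a document.
    private readonly ConcurrentDictionary<string, IReadOnlyList<int>> _tokensByText =
        new ConcurrentDictionary<string, IReadOnlyList<int>>();

    public IReadOnlyList<int> GetMatchingTokenPositions(
        string identifier, Func<string, IReadOnlyList<int>> compute)
    {
        return _tokensByText.GetOrAdd(identifier, compute);
    }
}
```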
what I meant is not exposing that directly, but it can be something like this, and then do FindReferencesSearchEngine(... finders)
Updated all instances under FindReferences where a method operates on a single
that's true. I was more focusing on this cache, which is solely there because calling GetSymbolInfo many, many times is relatively expensive (setting up the context for the binder), even if there is a bound node cache. IOperation does solve that problem.
@sharwell :) +1 I think just changing TryGetSemanticModel to await document.GetSemanticModelAsync still feels simpler though :) Especially since this also loses the ability to assert on the TryGetSemanticModel-returning-null case.
That is definitely not MEF. :) You can see that here: It is pure static data.
I def prefer Sharwell's approach. A core design idea in FindRefs is that we do all the processing for a document at once. So it should always be the case that we have a single SemanticModel that we are processing at once. I think it would be too easy for us to accidentally start processing a different document, causing us to realize other semantic models accidentally. :) This is actually super critical for performance, as we may end up literally walking the entire solution, and we don't want to destroy the GC by doing major allocations across many documents simultaneously. By ensuring we only look at one doc at a time, we greatly limit the overall memory we'll need at any point in time to whatever is needed for a single document.
oh. I thought they were MEF-imported so that people can extend finders like the formatter. I guess not. But anyway, that's even better. We can change this to not use singletons, but create a new instance of each finder for each search, without having a factory abstraction. And drop all the static caches.
having 2 arguments that must be from the same snapshot creates a new issue, where one might be given 2 objects from 2 different snapshots, like a semantic model not actually from the document. But I am fine with @sharwell's approach as well. And if passing in 2 objects from different snapshots is a problem, we can use http://source.roslyn.io/#Microsoft.CodeAnalysis.Workspaces/Workspace/Solution/SemanticDocument.cs,9 to make sure things that are related together move together.
@Pilchie for ask mode
```csharp
// it looks like cache is gone. reset the cache.
original.SetTarget(result);
return result;
```
Looks like we have 5 other calls to SetTarget; should we be worried about race conditions in any of them?
https://github.com/dotnet/roslyn/search?utf8=%E2%9C%93&q=SetTarget&type=
it depends on whether the weak reference is purely for caching or to preserve observable reference equality.
Now that I think about it more, I am not sure whether we ever guaranteed reference equality for the semantic model.
We do so for compilation/symbols (original definitions and so on, but not locals, generic instances, etc.) and syntax trees/nodes. But we create a new semantic model for Compilation.GetSemanticModel.
So I am not sure why document.GetSemanticModel suddenly tries to guarantee that. Otherwise, it is purely for perf purposes.
Symbol equality is tied to the compilation, not the semantic model. The same goes for nodes.
Anywhere we want reference equality preserved, we use AsyncLazy or ValueSource. The fact that SemanticModel suddenly does not use them seems like a tip that we never supported reference equality and the weak reference is there solely for perf reasons.
Otherwise, we could change the GetSemanticModelAsync implementation to use AsyncLazy like the others?
Now that I think about it more, I am not sure whether we ever guaranteed reference equality for the semantic model.
From my recollection, this was intended to be a contract of the Workspace. So much so that we put these sorts of asserts in our features.
Otherwise, we could change the GetSemanticModelAsync implementation to use AsyncLazy like the others?
I'm not sure if that would work. AsyncLazy can say that it does not want to cache a result (which is what we'd want, to prevent unbounded memory usage). But if you don't ask to cache results, there's nothing (afaict) that will ensure it will return the existing value if you ask it for its value multiple times (while someone is holding onto a previous result).
Note: we could consider CachedWeakValueSource. I think the purpose of it is to be able to compute values on demand, not hold them alive, but return the same value each time as long as the initial value is alive.
Note: looking at the impl, it uses double-checked locking. So i already don't trust it. But, at least it seems like this is the problem space it's trying to solve. :)
➡️ There are two cases I couldn't rule out, and filed #25881 for them.
Customer scenario
After clicking in a document, the IDE displays a gold bar that a problem was encountered, and highlighting references to the item under the caret does not work.
Bugs this fixes
Fixes #25836
Workarounds, if any
Restart the IDE if this occurs.
Risk
Low. The code now explicitly passes references where an implicit reference was used in the past. A rarely-hit race condition is fixed by placing a lock on a slow path in code.
Performance impact
Minimal. A new lock is used for synchronization, but only on a slow path.
Is this a regression from a previous update?
No.
Root cause analysis
The race condition requires three threads to interact with the garbage collector, so this situation is relatively rare. Under the "good path" of the race, a value normally held by a WeakReference<T> is placed in a cache, so a strong reference prevents the value from getting garbage collected. On the "bad path", the value placed in the cache is not the same as the value held by the weak reference, allowing the weak reference to be garbage collected during a Find References operation.

It looks like there is a race condition in GetSemanticModelAsync. Two concurrent calls can return different semantic models in the following sequence, where A is the thread that calls it here:

roslyn/src/Workspaces/Core/Portable/FindSymbols/FindReferences/FindReferencesSearchEngine_DocumentProcessing.cs
Line 24 in 2c383ed

1. C enters GetSemanticModelAsync
2. C returns Model1
3. Model1 gets collected by the GC
4. A enters GetSemanticModelAsync for the same document
5. B enters GetSemanticModelAsync for the same document
6. A and B reach the following line:
   roslyn/src/Workspaces/Core/Portable/Workspace/Solution/Document.cs
   Line 286 in 2c383ed
7. A sets the target to Model2 and returns Model2
8. B sets the target to Model3 and returns Model3
9. A places Model2 in the FindReferenceCache
10. Model3 gets collected by the GC
11. A subsequent call to TryGetSemanticModel returns null
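The "lock on a slow path" fix described above might look roughly like this (hedged sketch, not the actual change in Document.cs):

```csharp
using System;

class WeakCache<T> where T : class
{
    private readonly object _gate = new object();
    private readonly WeakReference<T> _weak = new WeakReference<T>(null);

    public T GetValue(Func<T> compute)
    {
        // Fast path: no lock while the cached value is alive.
        if (_weak.TryGetTarget(out var value))
            return value;

        // Slow path: serialize recomputation so all concurrent callers agree
        // on a single instance, and SetTarget matches what callers hold.
        lock (_gate)
        {
            if (!_weak.TryGetTarget(out value))
            {
                value = compute();
                _weak.SetTarget(value);
            }
        }
        return value;
    }
}
```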
How was the bug found?
User feedback.
Test documentation updated?
N/A