Final operators stats are not always propagated #5172
Comments
cc @dain
Question is -- are stats "best effort", or guaranteed? In tests we would want them to be guaranteed, but maybe that's not necessary in general, and additional synchronisation or latency at completion is not warranted?
I'm not sure
IIRC the DriverContext is supposed to be single threaded. Maybe the problem is in the code that is reading these. @sopel39 can you describe the problem we are actually seeing in these tests?
@dain it looks like stats from some drivers (pipelines?) are not propagated back to the coordinator before the query finishes. All of the linked flaky tests share that characteristic.
In addition to the risk of losing driver stats, driver context cleanup is also skipped in that case (code)
Would you be able to track the root cause of the issue? |
For my case, it's because DriverContext#failed(..) is called before DriverContext#finished() when the query is interrupted by an exception (code), so PipelineContext#driverFinished() never gets a chance to run and remove the driverContext.
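The lifecycle bug described above can be sketched with a minimal model. All names here (PipelineModel, DriverModel, etc.) are illustrative stand-ins for Trino's DriverContext/PipelineContext, not the real API; the sketch only shows how an early return in finished() after failed() leaks the driver context:

```java
import java.util.ArrayList;
import java.util.List;

class PipelineModel {
    final List<DriverModel> drivers = new ArrayList<>();

    DriverModel addDriver() {
        DriverModel d = new DriverModel(this);
        drivers.add(d);
        return d;
    }

    void driverFinished(DriverModel d) {
        drivers.remove(d); // cleanup: merge stats, drop the driver context
    }
}

class DriverModel {
    private final PipelineModel pipeline;
    private boolean done;

    DriverModel(PipelineModel pipeline) { this.pipeline = pipeline; }

    void failed(Throwable cause) {
        done = true; // marks done, but never calls pipeline.driverFinished
    }

    void finished() {
        if (done) {
            return; // early exit: cleanup below is skipped after failed()
        }
        done = true;
        pipeline.driverFinished(this);
    }
}

public class Main {
    public static void main(String[] args) {
        PipelineModel p = new PipelineModel();
        DriverModel d = p.addDriver();
        d.failed(new RuntimeException("query interrupted"));
        d.finished(); // too late: returns early, context stays registered
        System.out.println("leaked driver contexts: " + p.drivers.size());
    }
}
```

Under this simplified model, calling failed() before finished() leaves the driver registered in the pipeline forever, which mirrors the skipped-cleanup symptom described in the comment.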
After quite some time of reading code and analyzing logs, I think I figured it out.
I've submitted PR #9733 with my attempt to fix those issues. I tested it using a loop of 10K queries in sequence; without those changes, the first lost stats occurred within the first 100 queries.
Does that happen on early cancellation? Otherwise the worker couldn't do any actual work.
Could you point to that in the code? I don't think that's the case.
They are best effort. If we can get stats, great, but do not add latency to the query.
I can imagine use cases where having accurate stats is worth a small additional latency, e.g. chargeback.
Task may have its stats populated and state updated to FINISHED during the createTaskInfo() call, which could potentially create TaskInfo with FINISHED state, but with some of the stats missing. Creating TaskStatus first makes sure that stats are already present. This handles a rare case of flaky tests reported in trinodb#5172
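The read-ordering fix described in that commit message can be sketched as follows. This is a hedged, simplified model (TaskSnapshotModel and its methods are invented names, not Trino's real classes); it assumes the invariant that stats are fully merged before the state transitions to FINISHED. Under that invariant, capturing the state first (analogous to creating TaskStatus first) means an observed FINISHED state guarantees the subsequently read stats are complete, whereas reading stats first could pair pre-completion stats with a FINISHED state:

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.atomic.AtomicReference;

class TaskSnapshotModel {
    final AtomicReference<String> state = new AtomicReference<>("RUNNING");
    final AtomicLong rows = new AtomicLong();

    // Writer side: merge final stats first, then publish FINISHED.
    void complete(long finalRows) {
        rows.set(finalRows);
        state.set("FINISHED");
    }

    // Fixed reader order: capture the state (the "TaskStatus") first,
    // the stats after. If state reads FINISHED, rows were already set.
    String createTaskInfoFixed() {
        String observedState = state.get();
        long observedRows = rows.get();
        return observedState + ":" + observedRows;
    }
}

public class SnapshotDemo {
    public static void main(String[] args) {
        TaskSnapshotModel t = new TaskSnapshotModel();
        t.complete(42);
        System.out.println(t.createTaskInfoFixed());
    }
}
```

The buggy interleaving would be the mirror image: read rows (still 0), then have complete() run, then read state (now FINISHED), producing "FINISHED:0".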
Looking at io.prestosql.operator.DriverContext#finished, it seems that a driver might be marked as done while its stats are not yet populated into the pipeline stats (pipelineContext.driverFinished(this); happens after). This can mark the task as finished (and the final task info set) with the driver stats lost.

Relates to: #5120
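The race in the issue description comes down to publication order. The sketch below is a minimal model under assumed simplified semantics (DriverFinishModel and its fields are hypothetical names, not Trino's code): if the done flag is published before the driver's stats are merged into the pipeline, a snapshot taken in between observes "done" with rows missing; merging first closes that window:

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicLong;

class DriverFinishModel {
    final AtomicBoolean done = new AtomicBoolean();
    final AtomicLong pipelineRows = new AtomicLong();

    // Buggy order from the issue: publish completion, merge stats after.
    void finishedBuggy(long driverRows) {
        done.set(true);                     // a snapshot here sees done=true...
        pipelineRows.addAndGet(driverRows); // ...but rows are not merged yet
    }

    // Safe order: merge the driver's stats before publishing completion.
    void finishedFixed(long driverRows) {
        pipelineRows.addAndGet(driverRows);
        done.set(true);
    }
}

public class FinishDemo {
    public static void main(String[] args) {
        DriverFinishModel m = new DriverFinishModel();
        m.finishedFixed(10);
        System.out.println("done=" + m.done.get() + " rows=" + m.pipelineRows.get());
    }
}
```

With the fixed ordering, any observer that sees done == true is guaranteed (by the happens-before edge of the atomic write) to also see the merged row count.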