Final operators stats are not always propagated #5172

Closed
sopel39 opened this issue Sep 15, 2020 · 15 comments · Fixed by #9888 or #9913
Labels
bug Something isn't working

Comments

@sopel39
Member

sopel39 commented Sep 15, 2020

Looking at io.prestosql.operator.DriverContext#finished, it seems that a driver might be marked as done while its stats have not yet been populated into the pipeline stats (pipelineContext.driverFinished(this) happens afterwards). This could mark the task as finished (and set the final task info) with the driver stats lost.

Relates to: #5120
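
For illustration, here is a minimal, standalone sketch of that window (plain Java, not Trino code; the class and field names are made up). A reader that observes the driver's done flag can snapshot the pipeline-level counters before the driver's own stats have been merged in:

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicLong;

public class DriverStatsRace
{
    static final AtomicBoolean driverDone = new AtomicBoolean();
    static final AtomicLong pipelineRows = new AtomicLong();

    public static void main(String[] args)
            throws InterruptedException
    {
        long driverRows = 1_000;

        // Models the driver side: the "done" flag flips before the driver's
        // stats are merged into the pipeline-level counters.
        Thread driver = new Thread(() -> {
            driverDone.set(true);               // driver marked as done
            pipelineRows.addAndGet(driverRows); // stats merged afterwards
        });

        // Models a reader building final task info: it can observe "done"
        // while the pipeline counters still miss the driver's contribution.
        Thread reporter = new Thread(() -> {
            if (driverDone.get() && pipelineRows.get() < driverRows) {
                System.out.println("final info built without the driver's rows");
            }
        });

        driver.start();
        reporter.start();
        driver.join();
        reporter.join();
    }
}
```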

@sopel39
Member Author

sopel39 commented Sep 15, 2020

cc @dain

@findepi findepi added the bug Something isn't working label Sep 15, 2020
@findepi
Member

findepi commented Sep 15, 2020

The question is: are stats "best effort", or guaranteed?

In tests we would want them to be guaranteed, but maybe that's not necessary in general, and additional synchronisation or latency at completion is not warranted?

@sopel39
Member Author

sopel39 commented Sep 15, 2020

I'm not sure DriverContext is the problem, though, since it's not marked as @ThreadSafe and it seems to be accessed from a TaskExecutor thread.

@dain
Member

dain commented Oct 14, 2020

IIRC the DriverContext is supposed to be single-threaded. Maybe the problem is in the code that reads these stats.

@sopel39 can you describe the problem we are actually seeing in these tests?

@sopel39
Member Author

sopel39 commented Oct 14, 2020

@dain it looks like the stats from some drivers (pipelines?) are not propagated back to the coordinator before the query finishes.
This causes certain assertions to fail (e.g. the input row count is lower than expected).

All linked flaky tests share that characteristic.

@yansun7
Contributor

yansun7 commented May 17, 2021

In addition to the risk of losing driver stats, the driver context cleanup is also skipped in that case (code).
In my tests, I've seen heap dumps where many DriverContext objects are not collected by GC in time due to this issue.

@sopel39
Member Author

sopel39 commented May 17, 2021

In addition to the risk of losing driver stats, the driver context cleanup is also skipped in that case (code).
In my tests, I've seen heap dumps where many DriverContext objects are not collected by GC in time due to this issue.

Would you be able to track down the root cause of the issue?

@yansun7
Contributor

yansun7 commented May 17, 2021

In my case, it's because DriverContext#failed(..) is called before DriverContext#finished() when the query is interrupted by an exception (code), so PipelineContext#driverFinished() never gets called to remove the driverContext.
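
To make that path concrete, here is a simplified, self-contained sketch of the pattern described above (illustrative names only, not Trino's actual implementation; it assumes the failure path marks the driver as done without removing it from the pipeline, so the later finished() call short-circuits):

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.atomic.AtomicBoolean;

public class DriverContextLeakSketch
{
    static class PipelineContext
    {
        final List<DriverContext> drivers = new CopyOnWriteArrayList<>();

        void driverFinished(DriverContext context)
        {
            // The only place a driver context is removed from its pipeline.
            drivers.remove(context);
        }
    }

    static class DriverContext
    {
        final PipelineContext pipeline;
        final AtomicBoolean done = new AtomicBoolean();

        DriverContext(PipelineContext pipeline)
        {
            this.pipeline = pipeline;
            pipeline.drivers.add(this);
        }

        void failed(Throwable cause)
        {
            // Marks the driver as done, but does not remove it from the pipeline.
            done.set(true);
        }

        void finished()
        {
            if (!done.compareAndSet(false, true)) {
                // Already marked done by failed(): the removal below never runs,
                // so the context stays referenced by the pipeline.
                return;
            }
            pipeline.driverFinished(this);
        }
    }

    public static void main(String[] args)
    {
        PipelineContext pipeline = new PipelineContext();
        DriverContext driver = new DriverContext(pipeline);
        driver.failed(new RuntimeException("query interrupted"));
        driver.finished();
        System.out.println("contexts still held by the pipeline: " + pipeline.drivers.size()); // prints 1
    }
}
```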

@atanasenko
Member

atanasenko commented Oct 21, 2021

After quite some time spent reading code and analyzing logs, I think I figured it out.
There are at least three issues that cause test flakiness in stats. All of them stem from the asynchronous nature of task updates coming from workers to the coordinator, and from StateMachine listener events, which can also race with other code that updates TaskInfo instances.

  • The most frequent one occurs when the worker's SqlTask status transitions to FINISHED while the initial SqlTaskExecution is still being created; at that point the TaskHolder reference is still empty and has no stats to provide to the final TaskInfo.
  • The second, less frequent but still prominent, occurs when the final TaskInfo on the coordinator is constructed from a final task status and a partial TaskInfo received earlier from the worker, which might not have all the stats collected yet, while the final TaskInfo on the worker is built just a bit later (a toy model of this case is sketched after this comment).
  • The third is similar to the second, but in this case the final status of the task is set during substage cancellation triggered by the parent's FLUSHING status. Sometimes the final TaskInfo (or even TaskStatus) has not yet reached the coordinator, meaning its stage is not yet completed. Upon cancellation of a substage, any stats received by the coordinator subsequently are ignored.

I've submitted PR #9733 with my attempt to fix those issues. I tested it by running 10K queries in a sequential loop; without those changes, the first lost stats appeared within the first 100 queries.
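
For the second race above, a toy model (illustrative names only, not Trino's actual classes) of how a coordinator can assemble a "final" TaskInfo from a final status plus a stale, partial TaskInfo:

```java
public class FinalTaskInfoRace
{
    enum TaskState { RUNNING, FINISHED }

    record TaskStats(long inputRows) {}
    record TaskInfo(TaskState state, TaskStats stats) {}

    // Coordinator-side view of a remote task.
    static class RemoteTaskHandle
    {
        volatile TaskInfo lastTaskInfo = new TaskInfo(TaskState.RUNNING, new TaskStats(0));

        // Called when a final TaskStatus arrives before the final TaskInfo does:
        // the state is final, but the attached stats are whatever the last
        // (possibly partial) TaskInfo carried.
        TaskInfo buildFinalTaskInfo()
        {
            return new TaskInfo(TaskState.FINISHED, lastTaskInfo.stats());
        }
    }

    public static void main(String[] args)
    {
        RemoteTaskHandle handle = new RemoteTaskHandle();
        // The worker has processed 1000 rows, but the coordinator only ever saw
        // a partial update with 400 rows before the final status arrived.
        handle.lastTaskInfo = new TaskInfo(TaskState.RUNNING, new TaskStats(400));
        TaskInfo finalInfo = handle.buildFinalTaskInfo();
        System.out.println("reported final input rows: " + finalInfo.stats().inputRows()); // 400, not 1000
    }
}
```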

@sopel39
Member Author

sopel39 commented Oct 22, 2021

The most frequent one occurs when the worker's SqlTask status transitions to FINISHED while the initial SqlTaskExecution is still being created; at that point the TaskHolder reference is still empty and has no stats to provide to the final TaskInfo.

Does that happen on early cancellation? Otherwise the worker couldn't have done any actual work (io.trino.execution.SqlTask#updateTask wasn't called), so the stats should be empty.

@sopel39
Member Author

sopel39 commented Oct 22, 2021

Upon cancellation of a substage, any stats received by the coordinator subsequently are ignored.

Could you point to that in the code? I don't think that's the case.

@itsinthebag

Not sure this is totally related, but in my case operatorSummaries is present while operatorInfo is sometimes missing. This only happened for TableScanOperator and ScanFilterAndProjectOperator.
correct operatorStats:
[screenshot: operator stats with operatorInfo present]

operatorStats with missing operatorInfo:
[screenshot: operator stats without operatorInfo]

@dain
Member

dain commented Nov 5, 2021

The question is: are stats "best effort", or guaranteed?

They are best effort. If we can get stats, great, but we should not add latency to the query.

@findepi
Member

findepi commented Nov 8, 2021

They are best effort. If we can get stats, great, but we should not add latency to the query.

I can imagine use cases where having accurate stats is worth a small additional latency, e.g. chargeback.
What additional latency tax would this incur?

atanasenko added a commit to starburstdata/trino that referenced this issue Nov 9, 2021
Task may have its stats populated and state updated to FINISHED during
the createTaskInfo() call, which could potentially create TaskInfo with
FINISHED state, but with some of the stats missing.
Creating TaskStatus first makes sure that stats are already present.

This handles a rare case of flaky tests reported in trinodb#5172
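
For readers following along, here is a toy model of the ordering described in the commit message above (illustrative names, not the actual Trino code; it assumes a task's final stats are populated before its state flips to FINISHED):

```java
public class SnapshotOrdering
{
    enum TaskState { RUNNING, FINISHED }

    record TaskStatus(TaskState state) {}
    record TaskStats(long processedRows) {}
    record TaskInfo(TaskStatus status, TaskStats stats) {}

    interface Task
    {
        TaskStatus captureStatus();
        TaskStats captureStats();
    }

    // Problematic order: the stats snapshot is taken before the status snapshot,
    // so the reported state can be newer than the stats it is paired with.
    static TaskInfo createTaskInfoStatsFirst(Task task)
    {
        TaskStats stats = task.captureStats();
        TaskStatus status = task.captureStatus();
        return new TaskInfo(status, stats);
    }

    // Fixed order: the status snapshot is taken first, so any stats captured
    // afterwards are at least as fresh as the reported state.
    static TaskInfo createTaskInfoStatusFirst(Task task)
    {
        TaskStatus status = task.captureStatus();
        TaskStats stats = task.captureStats();
        return new TaskInfo(status, stats);
    }

    // Deterministic stand-in for the race: the task finalizes its stats and
    // flips to FINISHED right after the first snapshot call it serves.
    static class RacingTask
            implements Task
    {
        private int snapshots;
        private TaskState state = TaskState.RUNNING;
        private long rows = 400;

        private void onSnapshot()
        {
            if (++snapshots == 1) {
                rows = 1_000;
                state = TaskState.FINISHED;
            }
        }

        @Override
        public TaskStatus captureStatus()
        {
            TaskStatus snapshot = new TaskStatus(state);
            onSnapshot();
            return snapshot;
        }

        @Override
        public TaskStats captureStats()
        {
            TaskStats snapshot = new TaskStats(rows);
            onSnapshot();
            return snapshot;
        }
    }

    public static void main(String[] args)
    {
        TaskInfo broken = createTaskInfoStatsFirst(new RacingTask());
        TaskInfo fixed = createTaskInfoStatusFirst(new RacingTask());
        // stats-first:  FINISHED / 400 rows -- a finished task reported with partial stats
        // status-first: RUNNING / 1000 rows -- never claims FINISHED with stale stats
        System.out.println("stats-first:  " + broken.status().state() + " / " + broken.stats().processedRows() + " rows");
        System.out.println("status-first: " + fixed.status().state() + " / " + fixed.stats().processedRows() + " rows");
    }
}
```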
@findepi
Member

findepi commented Nov 9, 2021

Additional fix: #9913
Follow-up cleanup: #9898

sopel39 pushed a commit that referenced this issue Nov 9, 2021
Task may have its stats populated and state updated to FINISHED during
the createTaskInfo() call, which could potentially create TaskInfo with
FINISHED state, but with some of the stats missing.
Creating TaskStatus first makes sure that stats are already present.

This handles a rare case of flaky tests reported in #5172
sumannewton pushed a commit to sumannewton/trino that referenced this issue Jan 17, 2022
Task may have its stats populated and state updated to FINISHED during
the createTaskInfo() call, which could potentially create TaskInfo with
FINISHED state, but with some of the stats missing.
Creating TaskStatus first makes sure that stats are already present.

This handles a rare case of flaky tests reported in trinodb#5172