Final operators stats are not always propagated #5172

Closed
sopel39 opened this issue Sep 15, 2020 · 15 comments · Fixed by #9888 or #9913
Labels
bug Something isn't working

Comments

@sopel39
Member

sopel39 commented Sep 15, 2020

Looking at io.prestosql.operator.DriverContext#finished, it seems that a driver might be marked as done while its stats have not yet been populated into the pipeline stats (pipelineContext.driverFinished(this) happens afterwards). This could mark the task as finished (and set the final task info) with the driver stats lost.

Relates to: #5120
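
For illustration, here is a minimal, standalone sketch of that window (plain Java, not Trino code; the class and field names are made up). A reader that observes the driver's done flag can snapshot the pipeline-level counters before the driver's own stats have been merged in:

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicLong;

public class DriverStatsRace
{
    static final AtomicBoolean driverDone = new AtomicBoolean();
    static final AtomicLong pipelineRows = new AtomicLong();

    public static void main(String[] args)
            throws InterruptedException
    {
        long driverRows = 1_000;

        // Models the driver side: the "done" flag flips before the driver's
        // stats are merged into the pipeline-level counters.
        Thread driver = new Thread(() -> {
            driverDone.set(true);               // driver marked as done
            pipelineRows.addAndGet(driverRows); // stats merged afterwards
        });

        // Models a reader building final task info: it can observe "done"
        // while the pipeline counters still miss the driver's contribution.
        Thread reporter = new Thread(() -> {
            if (driverDone.get() && pipelineRows.get() < driverRows) {
                System.out.println("final info built without the driver's rows");
            }
        });

        driver.start();
        reporter.start();
        driver.join();
        reporter.join();
    }
}
```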

@sopel39
Member Author

sopel39 commented Sep 15, 2020

cc @dain

@findepi findepi added the bug Something isn't working label Sep 15, 2020
@findepi
Member

findepi commented Sep 15, 2020

The question is: are stats "best effort", or guaranteed?

In tests we would want them to be guaranteed, but maybe that's not necessary in general, and additional synchronisation or latency at completion is not warranted?

@sopel39
Member Author

sopel39 commented Sep 15, 2020

I'm not sure DriverContext is the problem, though, since it's not marked as @ThreadSafe and it seems to be accessed from a TaskExecutor thread.

@dain
Member

dain commented Oct 14, 2020

IIRC the DriverContext is supposed to be single-threaded. Maybe the problem is in the code that reads these stats.

@sopel39 can you describe the problem we are actually seeing in these tests?

@sopel39
Member Author

sopel39 commented Oct 14, 2020

@dain it looks like the stats from some drivers (pipelines?) are not propagated back to the coordinator before the query finishes.
This causes certain assertions to fail (e.g. the input row count is lower than expected).

All linked flaky tests share that characteristic.

@yansun7
Contributor

yansun7 commented May 17, 2021

In addition to the risk of losing driver stats, the driver context cleanup is also skipped in that case (code).
In my tests, I've seen heap dumps where many DriverContext objects are not collected by GC in time due to this issue.

@sopel39
Member Author

sopel39 commented May 17, 2021

In addition to the risk of losing driver stats, the driver context cleanup is also skipped in that case (code).
In my tests, I've seen heap dumps where many DriverContext objects are not collected by GC in time due to this issue.

Would you be able to track down the root cause of the issue?

@yansun7
Contributor

yansun7 commented May 17, 2021

In my case, it's because DriverContext#failed(..) is called before DriverContext#finished() when the query is interrupted by an exception (code), so PipelineContext#driverFinished() never gets called to remove the driverContext.
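
To make that path concrete, here is a simplified, self-contained sketch of the pattern described above (illustrative names only, not Trino's actual implementation; it assumes the failure path marks the driver as done without removing it from the pipeline, so the later finished() call short-circuits):

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.atomic.AtomicBoolean;

public class DriverContextLeakSketch
{
    static class PipelineContext
    {
        final List<DriverContext> drivers = new CopyOnWriteArrayList<>();

        void driverFinished(DriverContext context)
        {
            // The only place a driver context is removed from its pipeline.
            drivers.remove(context);
        }
    }

    static class DriverContext
    {
        final PipelineContext pipeline;
        final AtomicBoolean done = new AtomicBoolean();

        DriverContext(PipelineContext pipeline)
        {
            this.pipeline = pipeline;
            pipeline.drivers.add(this);
        }

        void failed(Throwable cause)
        {
            // Marks the driver as done, but does not remove it from the pipeline.
            done.set(true);
        }

        void finished()
        {
            if (!done.compareAndSet(false, true)) {
                // Already marked done by failed(): the removal below never runs,
                // so the context stays referenced by the pipeline.
                return;
            }
            pipeline.driverFinished(this);
        }
    }

    public static void main(String[] args)
    {
        PipelineContext pipeline = new PipelineContext();
        DriverContext driver = new DriverContext(pipeline);
        driver.failed(new RuntimeException("query interrupted"));
        driver.finished();
        System.out.println("contexts still held by the pipeline: " + pipeline.drivers.size()); // prints 1
    }
}
```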

@atanasenko
Member

atanasenko commented Oct 21, 2021

After quite some time spent reading code and analyzing logs, I think I figured it out.
There are at least three issues that cause test flakiness in stats. All of them stem from the asynchronous nature of task updates coming from workers to the coordinator, and from StateMachine listener events, which can also race with other code that updates TaskInfo instances.

  • The most frequent one occurs when the worker's SqlTask status transitions to FINISHED while the initial SqlTaskExecution is still being created; at that point the TaskHolder reference is still empty and has no stats to provide to the final TaskInfo.
  • The second, less frequent but still prominent, occurs when the final TaskInfo on the coordinator is constructed from a final task status and a partial TaskInfo received earlier from the worker, which might not have all the stats collected yet, while the final TaskInfo on the worker is built just a bit later (a toy model of this case is sketched after this comment).
  • The third is similar to the second, but in this case the final status of the task is set during substage cancellation triggered by the parent's FLUSHING status. Sometimes the final TaskInfo (or even TaskStatus) has not yet reached the coordinator, meaning its stage is not yet completed. Upon cancellation of a substage, any stats received by the coordinator subsequently are ignored.

I've submitted PR #9733 with my attempt to fix those issues. I tested it by running 10K queries in a sequential loop; without those changes, the first lost stats appeared within the first 100 queries.
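
For the second race above, a toy model (illustrative names only, not Trino's actual classes) of how a coordinator can assemble a "final" TaskInfo from a final status plus a stale, partial TaskInfo:

```java
public class FinalTaskInfoRace
{
    enum TaskState { RUNNING, FINISHED }

    record TaskStats(long inputRows) {}
    record TaskInfo(TaskState state, TaskStats stats) {}

    // Coordinator-side view of a remote task.
    static class RemoteTaskHandle
    {
        volatile TaskInfo lastTaskInfo = new TaskInfo(TaskState.RUNNING, new TaskStats(0));

        // Called when a final TaskStatus arrives before the final TaskInfo does:
        // the state is final, but the attached stats are whatever the last
        // (possibly partial) TaskInfo carried.
        TaskInfo buildFinalTaskInfo()
        {
            return new TaskInfo(TaskState.FINISHED, lastTaskInfo.stats());
        }
    }

    public static void main(String[] args)
    {
        RemoteTaskHandle handle = new RemoteTaskHandle();
        // The worker has processed 1000 rows, but the coordinator only ever saw
        // a partial update with 400 rows before the final status arrived.
        handle.lastTaskInfo = new TaskInfo(TaskState.RUNNING, new TaskStats(400));
        TaskInfo finalInfo = handle.buildFinalTaskInfo();
        System.out.println("reported final input rows: " + finalInfo.stats().inputRows()); // 400, not 1000
    }
}
```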

@sopel39
Member Author

sopel39 commented Oct 22, 2021

The most frequent one occurs when the worker's SqlTask status transitions to FINISHED while the initial SqlTaskExecution is still being created; at that point the TaskHolder reference is still empty and has no stats to provide to the final TaskInfo.

Does that happen on early cancellation? Otherwise the worker couldn't have done any actual work (io.trino.execution.SqlTask#updateTask wasn't called), so the stats should be empty.

@sopel39
Member Author

sopel39 commented Oct 22, 2021

Upon cancellation of a substage, any stats received by the coordinator subsequently are ignored.

Could you point to that in the code? I don't think that's the case.

@itsinthebag

Not sure this is totally related, but in my case operatorSummaries is present while operatorInfo is sometimes missing. This only happened for TableScanOperator and ScanFilterAndProjectOperator.
correct operatorStats:
[screenshot: operator stats with operatorInfo present]

operatorStats with missing operatorInfo:
[screenshot: operator stats without operatorInfo]

@dain
Member

dain commented Nov 5, 2021

The question is: are stats "best effort", or guaranteed?

They are best effort. If we can get stats, great, but we should not add latency to the query.

@findepi
Member

findepi commented Nov 8, 2021

They are best effort. If we can get stats, great, but we should not add latency to the query.

I can imagine use cases where having accurate stats is worth a small additional latency, e.g. chargeback.
What additional latency tax would this incur?

atanasenko added a commit to starburstdata/trino that referenced this issue Nov 9, 2021
Task may have its stats populated and state updated to FINISHED during
the createTaskInfo() call, which could potentially create TaskInfo with
FINISHED state, but with some of the stats missing.
Creating TaskStatus first makes sure that stats are already present.

This handles a rare case of flaky tests reported in trinodb#5172
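
For readers following along, here is a toy model of the ordering described in the commit message above (illustrative names, not the actual Trino code; it assumes a task's final stats are populated before its state flips to FINISHED):

```java
public class SnapshotOrdering
{
    enum TaskState { RUNNING, FINISHED }

    record TaskStatus(TaskState state) {}
    record TaskStats(long processedRows) {}
    record TaskInfo(TaskStatus status, TaskStats stats) {}

    interface Task
    {
        TaskStatus captureStatus();
        TaskStats captureStats();
    }

    // Problematic order: the stats snapshot is taken before the status snapshot,
    // so the reported state can be newer than the stats it is paired with.
    static TaskInfo createTaskInfoStatsFirst(Task task)
    {
        TaskStats stats = task.captureStats();
        TaskStatus status = task.captureStatus();
        return new TaskInfo(status, stats);
    }

    // Fixed order: the status snapshot is taken first, so any stats captured
    // afterwards are at least as fresh as the reported state.
    static TaskInfo createTaskInfoStatusFirst(Task task)
    {
        TaskStatus status = task.captureStatus();
        TaskStats stats = task.captureStats();
        return new TaskInfo(status, stats);
    }

    // Deterministic stand-in for the race: the task finalizes its stats and
    // flips to FINISHED right after the first snapshot call it serves.
    static class RacingTask
            implements Task
    {
        private int snapshots;
        private TaskState state = TaskState.RUNNING;
        private long rows = 400;

        private void onSnapshot()
        {
            if (++snapshots == 1) {
                rows = 1_000;
                state = TaskState.FINISHED;
            }
        }

        @Override
        public TaskStatus captureStatus()
        {
            TaskStatus snapshot = new TaskStatus(state);
            onSnapshot();
            return snapshot;
        }

        @Override
        public TaskStats captureStats()
        {
            TaskStats snapshot = new TaskStats(rows);
            onSnapshot();
            return snapshot;
        }
    }

    public static void main(String[] args)
    {
        TaskInfo broken = createTaskInfoStatsFirst(new RacingTask());
        TaskInfo fixed = createTaskInfoStatusFirst(new RacingTask());
        // stats-first:  FINISHED / 400 rows -- a finished task reported with partial stats
        // status-first: RUNNING / 1000 rows -- never claims FINISHED with stale stats
        System.out.println("stats-first:  " + broken.status().state() + " / " + broken.stats().processedRows() + " rows");
        System.out.println("status-first: " + fixed.status().state() + " / " + fixed.stats().processedRows() + " rows");
    }
}
```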
@findepi
Member

findepi commented Nov 9, 2021

Additional fix: #9913
Follow-up cleanup: #9898

sopel39 pushed a commit that referenced this issue Nov 9, 2021
Task may have its stats populated and state updated to FINISHED during
the createTaskInfo() call, which could potentially create TaskInfo with
FINISHED state, but with some of the stats missing.
Creating TaskStatus first makes sure that stats are already present.

This handles a rare case of flaky tests reported in #5172
sumannewton pushed a commit to sumannewton/trino that referenced this issue Jan 17, 2022
Task may have its stats populated and state updated to FINISHED during
the createTaskInfo() call, which could potentially create TaskInfo with
FINISHED state, but with some of the stats missing.
Creating TaskStatus first makes sure that stats are already present.

This handles a rare case of flaky tests reported in trinodb#5172