
Improve logging in DeltaScheduler #8909

Closed
vladsud opened this issue Jan 27, 2022 · 4 comments

vladsud commented Jan 27, 2022

Forking from #8908 concrete work:

I see 531K "InboundOpsProcessingTime" events in one day.
So yes, DeltaScheduler does seem to be hit a lot.

One thing I do not like is that we issue the event only if we hit 2K idle states.
It feels like once we set a timer in batchEnd(), we have to report an event that it happened, and ideally include:

  1. How many ops we had already processed by that moment.
  2. How many ops remained in the queue.
  3. Once done (reached idle), how many ops were actually processed (this will differ from #2, as ops keep coming in).

BTW, it seems this.isScheduling can be removed. It's always true when we process ops, and always false when we stop processing ops. Given that we always alternate between these states, there is no reason to check the value: we already know what it is.
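To illustrate the three counts proposed above, here is a minimal sketch. This is not the actual DeltaScheduler API; the class and field names (MiniScheduler, opsProcessedAtBatchEnd, etc.) are invented for illustration, and the timer is modelled synchronously.

```typescript
// Hypothetical sketch of per-turn telemetry: one event per scheduling turn,
// carrying the three counts suggested in the comment above.
interface SchedulingTurnEvent {
  opsProcessedAtBatchEnd: number; // ops processed when the timer was set
  opsRemainingAtBatchEnd: number; // queue length when the timer was set
  opsProcessedAtIdle: number;     // final count; can exceed the sum of the
                                  // two above, since ops keep arriving
}

class MiniScheduler {
  private processed = 0;
  private pending: number[] = [];
  private turnStartProcessed = 0;
  public readonly events: SchedulingTurnEvent[] = [];

  public enqueue(op: number): void {
    this.pending.push(op);
  }

  // Process a slice of ops, then "set a timer" (modelled synchronously here)
  // and record the counts known at this point.
  public batchEnd(sliceSize: number): void {
    for (let i = 0; i < sliceSize && this.pending.length > 0; i++) {
      this.pending.shift();
      this.processed++;
    }
    this.events.push({
      opsProcessedAtBatchEnd: this.processed - this.turnStartProcessed,
      opsRemainingAtBatchEnd: this.pending.length,
      opsProcessedAtIdle: 0, // filled in when we reach idle
    });
  }

  // Reached idle: drain whatever arrived since batchEnd, finalize the event.
  public idle(): void {
    while (this.pending.length > 0) {
      this.pending.shift();
      this.processed++;
    }
    const event = this.events[this.events.length - 1];
    event.opsProcessedAtIdle = this.processed - this.turnStartProcessed;
    this.turnStartProcessed = this.processed;
  }
}
```

In a usage run where 5 ops are queued, batchEnd() processes 3, and 2 more ops arrive before idle, the event reports 3 processed / 2 remaining at batchEnd but 7 processed at idle, which is exactly the discrepancy point 3 is meant to capture.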

@vladsud vladsud added bug Something isn't working perf labels Jan 27, 2022
@vladsud vladsud added this to the February 2022 milestone Jan 27, 2022

vladsud commented Jan 27, 2022

I can't find an obvious bug in the logic, but the data does not match my intuition about how it should work.
Given that the graph mentioned in the parent bug tracks acks of our own ops, there should not be that many ops to process in between (unless we were already holding ops when sending an op). It might be worth having this data (as a boolean) on the fluid:telemetry:OpPerf:OpRoundtripTime event.

Do we test that DeltaScheduler does not pause ops if there are not many of them / they are processed fast?
Glancing at the existing UTs, processOp() always injects a 30ms delay per op, which suggests we only test the opposite.

Can we convert the existing tests to use a mock timer?

Please take a closer look; it would be sad if our latencies are affected not by network / service, but by silly bugs.
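A mock-timer test along the lines asked for above could look like this sketch. It does not use the real DeltaScheduler or its test harness; MockTimer and processOps are hypothetical stand-ins showing the shape of the test: with virtual time, we can assert both that fast ops are never paused and that slow ops are, without real 30ms delays.

```typescript
// Hypothetical mock timer: schedules callbacks against virtual time,
// fired explicitly via tick() instead of real setTimeout.
type TimerCallback = () => void;

class MockTimer {
  private queue: { at: number; cb: TimerCallback }[] = [];
  private now = 0;

  public setTimeout(cb: TimerCallback, delayMs: number): void {
    this.queue.push({ at: this.now + delayMs, cb });
  }

  public pendingCount(): number {
    return this.queue.length;
  }

  // Advance virtual time, firing due callbacks in order.
  public tick(ms: number): void {
    this.now += ms;
    this.queue.sort((a, b) => a.at - b.at);
    while (this.queue.length > 0 && this.queue[0].at <= this.now) {
      this.queue.shift()!.cb();
    }
  }
}

// Hypothetical processing loop: pause (i.e. schedule a resume on the timer)
// only when the virtual cost of the batch exceeds the budget.
function processOps(
  opCount: number,
  opCostMs: number,
  budgetMs: number,
  timer: MockTimer,
): { paused: boolean } {
  let elapsed = 0;
  for (let i = 0; i < opCount; i++) {
    elapsed += opCostMs;
    if (elapsed > budgetMs) {
      timer.setTimeout(() => { /* resume remaining ops here */ }, 1);
      return { paused: true };
    }
  }
  return { paused: false };
}
```

With this shape, the missing test case is one line: processOps with zero (or tiny) per-op cost must report paused === false and leave nothing scheduled on the timer.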


vladsud commented Jan 30, 2022

Please note that #8912 tracks the more generic problem of tracking who pauses the inbound queue and for how long.


vladsud commented Mar 30, 2022

This issue is related to #9505. I'd say this bug is lower priority than removing unneeded pauses in DeltaScheduler.
#8912 should also bring more clarity on how long it takes for ops to get processed / whether they get stuck in the queue due to pausing.


vladsud commented Apr 12, 2022

Closing as fixed by the above-mentioned PR.

@vladsud vladsud closed this as completed Apr 12, 2022