Tasks can be run several times under certain conditions #2644
Comments
That's strange, I don't think Worker B waits to send the status of the Parent task like the picture demonstrates. It should run
I haven't used luigi for almost 2 years now, so I might be a bit rusty (read: plain wrong). :)
@Tarrasch, I figured out how to resolve my issue without any changes: using
@GoodDok is there any documentation which should be updated to make this more clear?
@Tarrasch @dlstadther please don't close the issue for a while. It seems like the issue was reproduced even with the discovered parameters, but I want to make sure.
Take your time and let us know of any discoveries.
@Tarrasch, I've created yet another unit test (tried to make it simpler) that involves only one task (
Note: I added
I’ve never seen this in many years of
Hi @kwilcox, yes, most of the time I was testing it with
@GoodDok, again huge thanks for taking the time to deep dive into this. As for your test case, whatever you see when using
Also, how often does this rerun for you in prod? In most cases, if it doesn't happen too often, I wouldn't be worried (your tasks should be able to run twice without bad side effects), but it of course depends on how much money you pay when a task runs (and how much money you spend on debugging this).
@Tarrasch, most of the time I was testing it with
Alright. I see what happens now, and yes, your diagram looks correct to me. To me it's basically this series of state transitions
I have to admit that this is a fully valid bug you've found and that the current model of the luigi scheduler isn't able to handle this gracefully. Normally when a task is
Given how our general recommendation to luigi users is to "rerun until it works", it would make sense that there's a period of "PENDING-distrust" directly after task completion (or actually after any DONE-marking, that would work too). Either the tasks can be made configurable for this distrust time or, probably better, it's dynamic based on how long the completion check took for the task that reported it as DONE (and you add that again to "double it" to get the distrust time). What do you think?
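To make the proposal concrete, here is a toy model (plain Python, not luigi's actual scheduler code; all names are assumptions) of the "PENDING-distrust" idea: a PENDING status report arriving within a cooldown window after a task was marked DONE is treated as stale and ignored.

```python
import time


class MiniScheduler:
    """Toy model of a scheduler applying a 'PENDING-distrust' cooldown
    after a task is marked DONE. Not luigi's real implementation."""

    def __init__(self, cooldown_secs=10.0):
        self.cooldown_secs = cooldown_secs
        self.status = {}   # task_id -> last accepted status string
        self.done_at = {}  # task_id -> timestamp of last DONE report

    def add_task(self, task_id, status, now=None):
        """Record a status report; return the status actually kept."""
        now = time.monotonic() if now is None else now
        if status == "PENDING":
            done_at = self.done_at.get(task_id)
            # A stale PENDING shortly after DONE is distrusted and dropped.
            if done_at is not None and now - done_at < self.cooldown_secs:
                return self.status[task_id]
        if status == "DONE":
            self.done_at[task_id] = now
        self.status[task_id] = status
        return status


# The race from the diagram: worker 1 marks Parent DONE, then worker 2's
# late PENDING report arrives 3s later and is ignored.
s = MiniScheduler(cooldown_secs=10)
s.add_task("Parent", "DONE", now=100.0)
s.add_task("Parent", "PENDING", now=103.0)   # within cooldown: dropped
print(s.status["Parent"])                    # still DONE
s.add_task("Parent", "PENDING", now=115.0)   # after cooldown: accepted
print(s.status["Parent"])                    # PENDING
```

With a dynamic variant, `cooldown_secs` would instead be derived per task from how long its completion check took.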
@Tarrasch I believe it's a problem that should be considered thoroughly (see lines 187 to 192 in ff2be7e).
But I'm not so familiar with the codebase and don't understand what kind of consequences such a change may have. Besides that, maybe there are some other options to consider; at the end of the day, the collaborators are the ones who make the final decision.
It won't be "fail and retry afterwards"; it's just that the scheduler will not give the task to the worker to run. However, if there are common parents between the worker that already ran the task and the ignored worker, they will continue to "cooperate". I'm not sure if we should do a complete() check before a task is run. The one downside I see: if complete() is slow, you don't want it to run twice, and if it's fast, you won't have this problem to begin with anyway. But yes, I agree we should discuss this, though I'm not sure how many luigi users care about this, given how long it took us to find this scheduling bug.
Well, sounds like a good way to solve the problem. I agree, double
I don't know, maybe it's not that important, but maybe we're facing an example of survivorship bias: the bug is pretty hard to reproduce, so someone may have moved away from Luigi without being able to reproduce or fix the problem.
You're certainly right about the survivorship bias. Also, it's likely those guys had the same issue. I'll be here to help review if you decide to take a shot at it.
I thought about this again. What about just having a configurable distrust time only on the server? No need to force the client to send information on how long complete() takes to run. I'm pretty sure nobody will be hurt by having it default to, say, 10s. What do you think?
@Tarrasch I like your idea to set a configurable distrust time on the server, and a 10s period seems reasonable to me. Maybe I'll think about applying this approach to resolve the issue.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. If closed, you may revisit when your time allows and reopen! Thank you for your contributions.
Waiting for the PR review: #2645
P.S. @Tarrasch @dlstadther @honnix, guys, please help me with reopening this one; I see no available options to do it myself.
stable_done_cooldown_secs does not appear in the configuration docs, changing previous behavior...
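For reference, a setting with that name would be configured in luigi.cfg under the scheduler section. This is a sketch based on the discussion above (check that your luigi version actually supports the option before relying on it):

```ini
[scheduler]
# Distrust window: ignore a PENDING status report for a task
# that was marked DONE within the last N seconds.
stable_done_cooldown_secs = 10
```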
Hi,
I encountered a problem of repeatedly running tasks in my pipeline on Luigi 2.8.0.
Basically, it consists of several tasks:

- A(input_dt) — the lowest-level luigi task; its result is stored in a DB, so the exists check on its target takes some time;
- B(input_dt) — has A(input_dt) as its dependency;
- C(input_dt) — triggered by cron once every 10 min, depends on B(input_dt) and B(input_dt - 10 min).

As you may see, each B task is checked twice, and sometimes this leads to duplicate runs.
I tried to reproduce the bug for a long time and finally wrote a test that reproduces it locally.
Here's the test result log: test_result.log; logger.info("PARENT IS RUNNING") was executed twice.
AFAIK, the main difference between luigi.build in the test and the real execution is in the no_lock flag, but this safeguard does not work in the case of the real pipeline, since C tasks started from cron have different parameters.
I suspect the main problem is that while process2 checks Child's state, Parent's state can be changed from PENDING to DONE by process1, so the message to the scheduler with PENDING status after DONE status triggers the Parent task again.
Can you please take a look? Thanks in advance!
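To make the double-check concrete, here is a toy model (plain Python, not luigi; the minute arithmetic is a simplification) of the dependency layout described above: each cron tick launches C(t), which requires both B(t) and B(t - 10 min), so consecutive C instances share a B task and can race on it.

```python
from collections import Counter


def c_requires(t):
    # C(t) depends on B(t) and B(t - 10 min); t is in minutes for simplicity.
    return [("B", t), ("B", t - 10)]


# Three consecutive cron runs of C, 10 minutes apart.
checks = Counter()
for tick in (0, 10, 20):
    for dep in c_requires(tick):
        checks[dep] += 1

# B(0) and B(10) are each required by two different C instances,
# so two workers can check (and potentially run) the same B task.
print(checks[("B", 0)], checks[("B", 10)])  # → 2 2
```

This is exactly the overlap that defeats the no_lock safeguard mentioned above: the two C tasks have different parameters, so the lock never deduplicates them.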