Problem
In the solution implemented in #1321, it is impossible to ensure that frames killed by one of the memory-related flows are marked to be retried without changing RQD. The way this was implemented in Cuebot relies on updating the frame's int_exit_status in the database with Dispatcher.EXIT_STATUS_MEMORY_FAILURE and expecting that no other logic will update the same frame until it finally gets killed and a FrameCompleteReport is processed by Cuebot.
This logic is currently implemented in dispatchSupport.updateFrameMemoryError, which is called by DispatchRqdKillFrameMemory.java.
After a FrameCompleteReport is processed, the same database field is read by FrameCompleteHandler.java in handlePostFrameCompleteOperations, where frameDetail.exitStatus is queried from the database:
/**
 * An exit status of 33 indicates that the frame was killed by the
 * application due to a memory issue and should be retried. In this
 * case, disable the optimizer and raise the memory by what is
 * specified in the show's service override, service or 2GB.
 */
if (report.getExitStatus() == Dispatcher.EXIT_STATUS_MEMORY_FAILURE
        || report.getExitSignal() == Dispatcher.EXIT_STATUS_MEMORY_FAILURE
        || frameDetail.exitStatus == Dispatcher.EXIT_STATUS_MEMORY_FAILURE) {
    ...
}
The problem with this logic is that a lot can happen between the killRequest being created and the FrameCompleteReport finally being processed, after RQD receives the request, kills the frame, and queues a FrameCompleteReport to be sent back.
Proposed Solution
The proposed solution is to change RQD so that it starts sending 33 as the exitStatus when a frame is killed due to memory issues. Reading Cuebot's code, it looks like it already supports this feature in the same code snippet mentioned above (report.getExitStatus() == Dispatcher.EXIT_STATUS_MEMORY_FAILURE), but RQD at no point sets the exitStatus to 33.
In the current kill logic on RQD, the KillRunningFrame function in rqdservicers.py gets the frame from rqCore and calls kill on the frame:
def KillRunningFrame(self, request, context):
    """RPC call that kills the running frame with the given id"""
    log.warning("Request received: killRunningFrame")
    frame = self.rqCore.getRunningFrame(request.frame_id)
    if frame:
        frame.kill(message=request.message)
    else:
        log.warning("Wasn't able to find frame(%s) to kill", request.frame_id)
    return rqd.compiled_proto.rqd_pb2.RqdStaticKillRunningFrameResponse()
kill, in turn, simply finds the frame's pid and calls killpg to eliminate the process with SIGKILL (exit_signal=9): rqnetwork.py
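For reference, that kill path boils down to something like the sketch below. This is a simplified illustration, not the verbatim RQD code: the real method in rqnetwork.py also handles logging, Windows, and the case where the process has already exited.

import os
import signal

class RunningFrame(object):
    # Sketch only: the real class in rqnetwork.py carries much more state.
    def __init__(self, pid):
        self.pid = pid

    def kill(self, message=""):
        # Send SIGKILL to the frame's whole process group so the forked
        # command and all of its children die together. Because the child
        # exits on a signal, the parent's wait() later returns a negative
        # returncode (-9), which is reported as exit_signal=9.
        os.killpg(os.getpgid(self.pid), signal.SIGKILL)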
The RQD main process monitors all running frames from rqcore.py in runLinux(): the process is forked with Popen, RQD waits for its completion, and then it updates the frame cache with the final exitStatus and exitSignal:
returncode = frameInfo.forkedCommand.wait()

# Find exitStatus and exitSignal
if returncode < 0:
    # Exited with a signal
    frameInfo.exitStatus = 1
    frameInfo.exitSignal = -returncode
else:
    frameInfo.exitStatus = returncode
    frameInfo.exitSignal = 0
The proposed solution is for RunningFrame's kill function (rqnetwork) to not only kill the proc but also update RunningFrame's exit status to 33 in rqcore's cache (FrameAttendantThread.frameInfo). With this, we can make it so that rqcore.py's runLinux() doesn't overwrite that exitStatus with the generic value, which is always 1 for killed procs, as shown in the snippet above.
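A hedged sketch of what that could look like, under stated assumptions: the EXIT_STATUS_MEMORY_FAILURE constant on the RQD side, the waitForExit helper name, and the exitStatus-is-None check are illustrative, not existing RQD code.

import os
import signal

EXIT_STATUS_MEMORY_FAILURE = 33  # mirrors Cuebot's Dispatcher constant

class RunningFrame(object):
    # Sketch of the proposed change to rqnetwork.py.
    def __init__(self, frameInfo):
        self.frameInfo = frameInfo

    def kill(self, message=""):
        # Record the special status on rqcore's cached frame info *before*
        # killing, so the attendant thread later reports 33 instead of 1.
        self.frameInfo.exitStatus = EXIT_STATUS_MEMORY_FAILURE
        os.killpg(os.getpgid(self.frameInfo.pid), signal.SIGKILL)

def waitForExit(frameInfo):
    # Sketch of the matching change to runLinux() in rqcore.py.
    returncode = frameInfo.forkedCommand.wait()
    if returncode < 0:
        # Exited with a signal: keep a pre-set exitStatus (e.g. 33)
        # instead of unconditionally overwriting it with 1.
        if frameInfo.exitStatus is None:
            frameInfo.exitStatus = 1
        frameInfo.exitSignal = -returncode
    else:
        frameInfo.exitStatus = returncode
        frameInfo.exitSignal = 0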
To make this possible, we need to change rqd.proto to add a new field to RqdStaticKillRunningFrameRequest: an enum carrying the KillCause, currently defined in HostReportHandler.java. That enum should be removed from the Java code and turned into a protobuf enum in rqd.proto, so that both server and client share the same understanding of KillCause (a usage sketch follows the enum below).
public enum KillCause {
    FrameOverboard("This frame is using way more than it had reserved."),
    HostUnderOom("Frame killed by host under Oom pressure"),
    FrameTimedOut("Frame timed out"),
    FrameLluTimedOut("Frame LLU timed out"),
    FrameVerificationFailure("Frame failed to be verified on the database");

    private final String message;

    private KillCause(String message) {
        this.message = message;
    }

    @Override
    public String toString() {
        return message;
    }
}
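On the RQD side, KillRunningFrame could then translate the new field into the right kill behavior. A hypothetical sketch: the kill_cause field, the FRAME_OVERBOARD / HOST_UNDER_OOM enum value names, and the memoryFailure keyword are all illustrative, pending the actual rqd.proto definition.

def KillRunningFrame(self, request, context):
    """RPC call that kills the running frame with the given id"""
    log.warning("Request received: killRunningFrame")
    frame = self.rqCore.getRunningFrame(request.frame_id)
    if frame:
        # Hypothetical: only memory-related causes should make the frame
        # report exitStatus=33; timeouts keep the default kill behavior.
        memoryFailure = request.kill_cause in (
            rqd.compiled_proto.rqd_pb2.FRAME_OVERBOARD,
            rqd.compiled_proto.rqd_pb2.HOST_UNDER_OOM)
        frame.kill(message=request.message, memoryFailure=memoryFailure)
    else:
        log.warning("Wasn't able to find frame(%s) to kill", request.frame_id)
    return rqd.compiled_proto.rqd_pb2.RqdStaticKillRunningFrameResponse()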