[rqd/cuebot] Improve RQD kill logic #1608

Open · ramonfigueiredo opened this issue Dec 3, 2024 · 1 comment

Labels: enhancement (Improvement to an existing feature)

Problem

In the solution implemented in #1321, it is impossible to ensure that frames killed by one of the memory-related flows are marked to be retried without changing RQD. The way this was implemented on Cuebot relies on updating the frame's int_exit_status in the database with Dispatcher.EXIT_STATUS_MEMORY_FAILURE and expecting that no other logic will update the same frame until it finally gets killed and a FrameCompleteReport is processed by Cuebot.

This logic is currently implemented in dispatchSupport.updateFrameMemoryError, called from DispatchRqdKillFrameMemory.java:

@Override
public void run() {
    long startTime = System.currentTimeMillis();
    try {
        if (dispatchSupport.updateFrameMemoryError(frame) && !isTestMode) {
            rqdClient.killFrame(hostname, frame.getFrameId(), message);
        }
    } catch (RqdClientException e) {
        logger.info("Failed to contact host " + hostname + ", " + e);
    } finally {
        long elapsedTime = System.currentTimeMillis() - startTime;
        logger.info("RQD communication with " + hostname +
                " took " + elapsedTime + "ms");
    }
}

After a FrameCompleteReport is processed, the same database field is read by FrameCompleteHandler.java in handlePostFrameCompleteOperations, where frameDetail.exitStatus is queried from the database:

/*
 * An exit status of 33 indicates that the frame was killed by the
 * application due to a memory issue and should be retried. In this
 * case, disable the optimizer and raise the memory by what is
 * specified in the show's service override, service or 2GB.
 */
if (report.getExitStatus() == Dispatcher.EXIT_STATUS_MEMORY_FAILURE
        || report.getExitSignal() == Dispatcher.EXIT_STATUS_MEMORY_FAILURE
        || frameDetail.exitStatus == Dispatcher.EXIT_STATUS_MEMORY_FAILURE) {
    ...
}

The problem with this logic is that a lot can happen between the kill request being created and the FrameCompleteReport finally being processed: RQD still has to receive the request, kill the frame, and queue a FrameCompleteReport to be sent back, and in that window another flow may overwrite the frame's int_exit_status.
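
As an illustration of that window (plain Python against a stand-in dict, not OpenCue code), a check-then-act race like this can drop the retry marker:

# Illustration only: the "database" is a dict, the flows run sequentially.
EXIT_STATUS_MEMORY_FAILURE = 33  # Cuebot's Dispatcher constant

frame_row = {"int_exit_status": 0}

# 1. Cuebot's memory-kill flow stamps the row, then fires the kill request.
frame_row["int_exit_status"] = EXIT_STATUS_MEMORY_FAILURE

# 2. While the kill request is in flight, some other flow updates the same
#    frame, e.g. a report for a frame that exited on its own is processed.
frame_row["int_exit_status"] = 1

# 3. When the kill-triggered FrameCompleteReport is finally handled, the
#    memory-failure marker is gone and the frame is not marked for retry.
assert frame_row["int_exit_status"] != EXIT_STATUS_MEMORY_FAILURE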

Proposed Solution

The proposed solution is to change RQD so it starts sending 33 as the exitStatus when a frame is killed due to memory issues. Reading Cuebot's code, it looks like it already supports this feature in the same code snippet mentioned above (report.getExitStatus() == Dispatcher.EXIT_STATUS_MEMORY_FAILURE), but RQD at no point sets the exitStatus to 33.

The way the current kill logic works on RQD, the KillRunningFrame function in rqdservicers.py gets the frame from rqCore and calls kill on the frame:

def KillRunningFrame(self, request, context):
    """RPC call that kills the running frame with the given id"""
    log.warning("Request received: killRunningFrame")
    frame = self.rqCore.getRunningFrame(request.frame_id)
    if frame:
        frame.kill(message=request.message)
    else:
        log.warning("Wasn't able to find frame(%s) to kill", request.frame_id)
    return rqd.compiled_proto.rqd_pb2.RqdStaticKillRunningFrameResponse()

which simply finds the frame's pid and calls killpg to eliminate the process with SIGKILL (exit_signal=9), in rqnetwork.py:

def kill(self, message=""):
    """Kills the frame"""
    log.info("Request received: kill")
    ...
    try:
        if platform.system() == "Windows":
            subprocess.Popen('taskkill /F /T /PID %i' % self.pid, shell=True)
        else:
            os.killpg(self.pid, rqd.rqconstants.KILL_SIGNAL)
    finally:
        log.warning("kill() successfully killed frameId=%s pid=%s", self.frameId, self.pid)
        rqd.rqutil.permissionsLow()
    ...

The RQD main process monitors all running frames from rqcore.py in runLinux(): the process is forked with Popen, RQD waits for its completion, and then updates the frame cache with the final exitStatus and exitSignal:

returncode = frameInfo.forkedCommand.wait()

# Find exitStatus and exitSignal
if returncode < 0:
    # Exited with a signal
    frameInfo.exitStatus = 1
    frameInfo.exitSignal = -returncode
else:
    frameInfo.exitStatus = returncode
    frameInfo.exitSignal = 0
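
For reference, here is a standalone illustration (POSIX only, not RQD code) of what that mapping yields for a SIGKILLed child:

import signal
import subprocess

# Spawn a long-running child and kill it the way RQD's kill() does.
proc = subprocess.Popen(["sleep", "60"])
proc.send_signal(signal.SIGKILL)  # RQD uses os.killpg on the process group

returncode = proc.wait()  # Popen.wait() returns -9 for a SIGKILLed child

# The same mapping as in runLinux() above:
if returncode < 0:
    exit_status, exit_signal = 1, -returncode
else:
    exit_status, exit_signal = returncode, 0

print(exit_status, exit_signal)  # -> 1 9: nothing says "memory failure"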

The proposed solution is that RunningFrame's kill function (rqnetwork) not only kills the proc but also updates the RunningFrame's exit status to 33 in rqcore's cache (FrameAttendantThread.frameInfo). With this, we can make it so that rqcore.py's runLinux() doesn't replace that exitStatus with the returncode-based value, which is always 1 for killed procs, as stated in the snippet above.
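
A minimal sketch of that change, assuming a cause-aware kill() and access to the cached frameInfo; everything beyond exitStatus/exitSignal (the cause parameter, the enum value strings, updateFrameExit) is a hypothetical name, not a final design:

import os
import signal

EXIT_STATUS_MEMORY_FAILURE = 33  # Cuebot's Dispatcher.EXIT_STATUS_MEMORY_FAILURE


def kill(self, message="", cause=None):
    """Kill the frame; for memory-related causes, pre-stamp the cached
    exit status so the final report carries 33 instead of 1."""
    if cause in ("HOST_UNDER_OOM", "FRAME_OVERBOARD"):  # hypothetical names
        self.frameInfo.exitStatus = EXIT_STATUS_MEMORY_FAILURE
    os.killpg(self.pid, signal.SIGKILL)  # rqd.rqconstants.KILL_SIGNAL in RQD


def updateFrameExit(frameInfo, returncode):
    """What runLinux() could do after forkedCommand.wait()."""
    if returncode < 0:
        # Keep a pre-stamped memory-failure status instead of
        # unconditionally overwriting it with 1.
        if frameInfo.exitStatus != EXIT_STATUS_MEMORY_FAILURE:
            frameInfo.exitStatus = 1
        frameInfo.exitSignal = -returncode
    else:
        frameInfo.exitStatus = returncode
        frameInfo.exitSignal = 0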

To make this possible, we need to change rqd.proto to add a new field to RqdStaticKillRunningFrameRequest: an enum carrying the KillCause. KillCause is currently defined in HostReportHandler.java; it should be removed from the Java code and turned into a protobuf enum in rqd.proto, so that both server and client share the same understanding of KillCause (see the servicer sketch after the enum below).

public enum KillCause {
    FrameOverboard("This frame is using way more than it had reserved."),
    HostUnderOom("Frame killed by host under Oom pressure"),
    FrameTimedOut("Frame timed out"),
    FrameLluTimedOut("Frame LLU timed out"),
    FrameVerificationFailure("Frame failed to be verified on the database");

    private final String message;

    private KillCause(String message) {
        this.message = message;
    }

    @Override
    public String toString() {
        return message;
    }
}
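
On the RQD side, the servicer could then thread the cause through to kill(); a hypothetical sketch, assuming rqd.proto gains a KillCause enum and a kill_cause field on RqdStaticKillRunningFrameRequest (the generated field names are assumptions):

def KillRunningFrame(self, request, context):
    """RPC call that kills the running frame with the given id"""
    log.warning("Request received: killRunningFrame")
    frame = self.rqCore.getRunningFrame(request.frame_id)
    if frame:
        # kill_cause is the hypothetical new enum field; kill() can use it
        # to decide whether to pre-stamp exitStatus with 33.
        frame.kill(message=request.message, cause=request.kill_cause)
    else:
        log.warning("Wasn't able to find frame(%s) to kill", request.frame_id)
    return rqd.compiled_proto.rqd_pb2.RqdStaticKillRunningFrameResponse()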
ramonfigueiredo added the enhancement label on Dec 3, 2024
ramonfigueiredo (Collaborator, Author) commented:

FYI @DiegoTavares
