Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Inconsistent http response between presto-native and presto-spark during error #19609

Open
vermapratyush opened this issue May 10, 2023 · 12 comments
Labels
prestissimo Presto Native Execution

Comments

@vermapratyush
Copy link
Member

vermapratyush commented May 10, 2023

The interaction between presto-native cpp server and presto-spark during unhappy flows is not consistent.

In the API TaskResource#createOrUpdateTaskImpl (

} catch (const velox::VeloxException& e) {
), when a VELOX exception is thrown response has a status code of 200 with the message body containing exception error message. When the API throws non-velox exception, a 5XX response is returned with the exception's error message (but no stacktrace).

In TaskResource::deleteTask (

} catch (const velox::VeloxException& e) {
), Velox and non-velox exception, both are handled the same way and 500 response is sent along with error message (yet no stacktrace).

There are few problems with this approach:

  1. Clients needs to handle both the scenarios differently adding to complexity. This affects how retries are to be handled.
  2. The lack of stacktrace in the error message makes it difficult to debug issues.

Some proposal to the problem

  1. Make the HTTP APIs RESTful, this would mean leverage the HTTP response appropriately.
    1. Any internal server error should have 5XX status code, this includes velox and non-velox error
    2. Any user error is 4XX, example: VELOX_USER_* are 4xx.
  2. 4XX and 5XX are allowed to have http response body, continue to pass appropriate error message to the client for error categorisation.
  3. Use boost::exception to throw exceptions. boost::exception provides a much structured format for exception and includes stacktrace. VELOX_CHECK and VELOX_USER_CHECK already captures stack trace, extend the functionality and surface it as PRESTO_* macros with mirroring funcitonality.

If we go ahead with proposal 1 and 2, would we need to modify how presto coordinator handles http status code?

@vermapratyush
Copy link
Member Author

cc @mbasmanova

@mbasmanova
Copy link
Contributor

CC: @spershin @amitkdutta @tanjialiang @xiaoxmeng

@mbasmanova
Copy link
Contributor

Proposals 1 and 2 make sense to me. Let's make these changes unless there are objections.

Use boost::exception to throw exceptions. boost::exception provides a much structured format for exception and includes stacktrace.

This one I'm not sure about. In Velox, we use VELOX_CHECK and VELOX_USER_CHECK macros to throw exceptions. These generate stacktraces as long as that functionality is enabled via gflags.

// Used in common/base/VeloxException.cpp
DEFINE_bool(
    velox_exception_user_stacktrace_enabled,
    false,
    "Enable the stacktrace for user type of VeloxException");

DEFINE_bool(
    velox_exception_system_stacktrace_enabled,
    true,
    "Enable the stacktrace for system type of VeloxException");

DEFINE_int32(
    velox_exception_user_stacktrace_rate_limit_ms,
    0, // effectively turns off rate-limiting
    "Min time interval in milliseconds between stack traces captured in"
    " user type of VeloxException; off when set to 0 (the default)");

DEFINE_int32(
    velox_exception_system_stacktrace_rate_limit_ms,
    0, // effectively turns off rate-limiting
    "Min time interval in milliseconds between stack traces captured in"
    " system type of VeloxException; off when set to 0 (the default)");

@mbasmanova mbasmanova added the prestissimo Presto Native Execution label May 10, 2023
@mbasmanova
Copy link
Contributor

CC: @miaoever

@mbasmanova
Copy link
Contributor

CC: @majetideepak

@vermapratyush
Copy link
Member Author

I see, VELOX_CHECK and VELOX_USER_CHECK indeed captures the stacktrace. In that case, we could extend and provide PRESTO_* macros which mirror their respective VELOX_* functionality.

@vermapratyush
Copy link
Member Author

vermapratyush commented May 10, 2023

If we go ahead with proposal 1 and 2, would we need to modify how presto coordinator handles http status code?

@amitkdutta
Copy link
Contributor

@vermapratyush Thanks for creating the issue. This is a great finding. Looking at

} catch (const velox::VeloxException& e) {
// Creating an empty task, putting errors inside so that next status
// fetch from coordinator will catch the error and well categorize it.
taskInfo = taskManager_.createOrUpdateErrorTask(
taskId, std::current_exception());
} catch (const std::exception& e) {
http::sendErrorResponse(downstream, e.what());
return;

It appears the worker is relying on coordinator to classify the error. It creates an empty task and with exception details.

I am thinking why we don't use same thing for non velox exception as well. Basically create an empty tasks and put message for following sections. Then co-ordinator needs no change.

        } catch (const std::exception& e) {
          http::sendErrorResponse(downstream, e.what());
          return;
        }

@vermapratyush
Copy link
Member Author

vermapratyush commented May 10, 2023

@amitkdutta

I am thinking why we don't use same thing for non velox exception as well. Basically create an empty tasks and put message for following sections. Then co-ordinator needs no change.

The assumption here is that coordinator does not rely on http status code to make decisions and we are ok with changing http status code from 5xx to 2xx along with change to payload structure (only for current 5xx response).

Also, if we go this route, what do we do for other APIs like TaskResource::deleteTask. They will still be returning 5XX with a non-json response body, which would be different from TaskResource::createOrUpdateTask

@amitkdutta
Copy link
Contributor

@vermapratyush Makes sense. I think our best option is to read the coordinator side code and decide based on that. Otherwise we end with random items :(. CC: @pranjalssh @rschlussel @arunthirupathi for design help here.

@rschlussel
Copy link
Contributor

A quick look at https://github.com/prestodb/presto/blob/master/presto-main/src/main/java/com/facebook/presto/server/remotetask/HttpRemoteTask.java makes me think that we basically only have a failure response when there was some connection issue and execution failures will just show that info in the TaskStatus itself (maybe TaskState and ExecutionFailureInfo fields) https://github.com/prestodb/presto/blob/master/presto-main/src/main/java/com/facebook/presto/execution/TaskStatus.java. But I would defer to @ajaygeorge who knows much more about the worker <-> coordinator communication.

@ajaygeorge
Copy link
Contributor

ajaygeorge commented May 12, 2023

@vermapratyush
Answering your offline question below .

How is the response body of a 5XX response processed in coordinator code?

Failure handling is different for the JSON path and thrift path.
for json we have the SimpleHttpResponseHandler and for thrift we have a ThriftHttpResponseHandler
assuming prestissimo is still on JSON, see how it is handled in SimpleHttpResponseHandler

if (response.getStatusCode() == OK.code() && response.hasValue()) {
callback.success(response.getValue());
}
else if (response.getStatusCode() == HttpStatus.SERVICE_UNAVAILABLE.code()) {
callback.failed(new ServiceUnavailableException(uri));
}
else {
// Something is broken in the server or the client, so fail immediately (includes 500 errors)
Exception cause = response.getException();
if (cause == null) {
if (response.getStatusCode() == OK.code()) {
cause = new PrestoException(errorCode, format("Expected response from %s is empty", uri));
}
else {
cause = new PrestoException(errorCode, createErrorMessage(response));
}
}
else {
cause = new PrestoException(errorCode, format("Unexpected response from %s", uri), cause);
}
callback.fatal(cause);

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
prestissimo Presto Native Execution
Projects
None yet
Development

No branches or pull requests

5 participants