Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve server error handling and logging. #189

Closed
quinnj opened this issue Feb 4, 2018 · 17 comments
Closed

Improve server error handling and logging. #189

quinnj opened this issue Feb 4, 2018 · 17 comments

Comments

@quinnj
Copy link
Member

quinnj commented Feb 4, 2018

Here's the logs from recent FemtoCleaner.jl:

https://gist.github.com/Keno/e3f911d71efbbd41a22446aa78268ed2

I see several issues that we should address:

  • a lot of the warnings like W- ArgumentError("stream is closed or unusable") are completely out of context; it's hard to follow "thread of execution" for serving a single request; ideally we could do better w/ logging to tie messages for a specific request together
  • we're still seeing E- ArgumentError("function requires !(isempty(bytes))"): see BoundsError when closing MbedTLS.SSLContext #148
  • we should specifically catch | e = read: connection reset by peer (ECONNRESET) errors and just emit a warning; just because the client closed their socket isn't really cause for alarm on the server side
  • I've never seen E- SystemError("uv_getaddrinfocb", 0, nothing), but we should try to figure out what's happening here
  • FemtoCleaner.jl stops serving requests all together by the end of the gist, which is obviously unacceptable See Analysis of EOF Errors with GitHub API (Seems Resolved) #214

In general, I think things are just too obfuscated in our logs and w/ all the try-catch-rethrowing we're doing.

If anyone can jump in right away and help clean things up, that'd be great; I won't have time until Monday night to really dive in.

@samoconnor
Copy link
Contributor

WRT "function requires !(isempty(bytes))" I don't think that assertion exists in the latest version.

@samoconnor
Copy link
Contributor

@quinnj,
I agree that the mixing of logs from different requests is an issue. It might be best to reinstate the scheme you had where server logs could be separated per server instance. Or it might be better to have a log stream per request, that can be dumped to a master log in one atomic block when the request completes (or throws an exception). I'm not sure if this is something the new logging infrastructure will help with or not. See: JuliaLang/julia#25109 (comment)

I think we should treat things like SystemError("uv_getaddrinfocb", ArgumentError("stream is closed or unusable"), read: connection reset by peer (ECONNRESET) as upstream bugs and get them fixed in Base. A while back I made this change: https://github.com/JuliaLang/julia/pull/15879 to replace a specific type of UVErrorwithDNSError. I think more of this is needed to hide the underlying UVErrors` and emit more descriptive error types.

SystemError("uv_getaddrinfocb" comes from here and here. The spec says: "EAI_SYSTEM Other system error, check errno for details.". We would need to dig deeper to find out what happened in this case. I'm not sure if Base has access to the underlying errno.

ArgumentError("stream is closed or unusable"): it don't seem right to have a precondition error for a condition that the caller can't verify before the call. i.e. Even if I check isopen before calling the offending IO function, the stream could asynchronously become "closed or unusable" before the precondition is checked. This probably should be EofError or IOClosedError or similar.

Likewise the UVError:read: connection reset by peer (ECONNRESET) should be wrapped in something more specific. It would be nice if all the reasons that an IO connection can disappear from underneath you resulted in a something like IOClosedError, or even just IOError, with a code field for more info.

just because the client closed their socket isn't really cause for alarm on the server side

Agreed, but we need to be careful to only downgrade error importance in the case where the client explicitly closed the connection. If connections on my server are silently failing sometimes, I need to know how often that is happening so I can investigate to see if there is a fixable root cause. e.g. perhaps we should keep a count and/or a seperate log of the times and types of IO errors.

@samoconnor
Copy link
Contributor

See JuliaLang/julia#25109 (comment), it looks like the new logging code will help us here.

@samoconnor
Copy link
Contributor

@quinnj, you may have noticed this little debug web page that pops up and displays a live view of the connection pool state while you run test/async.jl:

HTTP.jl/test/async.jl

Lines 14 to 31 in b7b8e0f

@async HTTP.listen() do http
startwrite(http)
write(http, """
<html><head>
<title>HTTP.jl Connection Pool</title>
<meta http-equiv="refresh" content="1">
<style type="text/css" media="screen">
td { padding-left: 5px; padding-right: 5px }
</style>
</head>
<body><pre>
""")
write(http, "<body><pre>")
buf = IOBuffer()
HTTP.ConnectionPool.showpoolhtml(buf)
write(http, take!(buf))
write(http, "</pre></body>")
end

Now, this is a nasty hack using a 1 seconds brute-force refresh.
But, perhaps for server side stuff, we could have a simple server-console web page that shows a list of all of the instances of listen and a log of all of the requests. Clicking on a request could reveal the fine-grained logs for that request. We could use use SSE or WebSockets to stream log messages to the web page. This might help with the problem of "all the logs on one text console" by separating logs per listen, or per connection, or per request.

I'm not thinking of a full-blown logging system with persistent storage, hooks into CloudWatchLogs etc, just an in-memory thing that is only active if the browser is connected. Like a GUI-mode verbose option.

@samoconnor samoconnor changed the title Server problems Improve server error handling and logging. Mar 7, 2018
@essenciary
Copy link

essenciary commented Jun 20, 2018

I see there are a few issues regarding logging. The first problem I encountered is that logging can't be turned off by passing devnull as the IO:

web_server = HTTP.Servers.Server((req) -> begin
    setup_http_handler(req, req.response)
  end, devnull)

Because of this, there's a lot of spam, making it hard to follow my own logging -- I get a dump for each css and js file and image referenced in a web page!

The best thing would be to be able to choose between logging all requests, only dynamic requests (so no static files, these are a lot and usually not interesting), and no logging. An option would be to have a flag in the Response object to set the logging level per Response?

@essenciary
Copy link

Also, we need to be able to pass a custom logger.
When an app is running in production, on a server, the output should go both to the STDOUT and a number of loggers (text file, logging API, etc) so it can be later reviewed (especially errors).

@matbesancon
Copy link

Regarding the errors, it could be interesting to handle them in a more explicit way (ie any pottentially failing function returning a Result from ResultTypes.jl or equivalent. It's hard to know where to catch

@c42f
Copy link
Contributor

c42f commented Oct 16, 2018

I'm not sure where you guys got to with this, but to comment on the logging side of things (and without knowing much about HTTP.jl at all):

  • Tracking a particular connection or transaction is easy, you just need to tag the log record with a connection_id and transaction_id of some kind. Adding key=value pairs like this in the new logging system is completely freeform in the key names and value types. I'd suggest a uuid or some such thing to uniquely identify the connection and an incrementing integer for the transaction number (or just the sequence number? my knowledge of these things is limited), but whatever makes most sense to you. So something like:
connection = Connection(#=...=#)
connection.id = uuid4() # Or whatever
# ...

# Suppose we send a response
@info "HTTP Respose" content=message connection_id=connection.id sequence=transaction.sequence

# Suppose the connection times out.  Here's a warning for that:
@warn "Connection timeout"  connection_id=connection.id

Then you can capture the log records, filtering and grouping them by connection_id. It would be really easy to write a Logger to dump these records to disk in binary format, and later read them back into a DataFrame or as a DataSource. I think the JuliaData tools would be great for structured log analysis.

  • @essenciary I think I saw you were using Memento inside Genie. Memento was designed before 0.7 so I think there's some impedance mismatch with the new logging frontend interface. However you should be able to use Memento.substitue!() to make all HTTP.jl logs go to your Memento backend of choice.

For the record, I think that HTTP is an ideal library to motivate the next iteration of stdlib/Logging which needs to focus on a more capable set of backend logging tools. stdlib/Logging can't be developed without motivation from people with "serious" logging needs like HTTP and Genie. See JuliaLang/julia#29567 for related discussion.

I'm not sure where prototyping new features should go but MicroLogging works for me and I'm super happy to add collaborators.

@c42f
Copy link
Contributor

c42f commented Oct 16, 2018

Oh and regarding logging-then-throwing vs throwing-then-logging vs doing both — my take on this FWIW (from JuliaLang/julia#19979 (comment)):

In general I think the pattern of logging an exception and rethrowing it isn't great: Somewhere at an outer scope the exception will also be caught and reported in some way (probably via logging but without rethrowing). Logging and rethrowing just leads to duplicate error information due to several catch/log/rethrows down the stack. This is very confusing in log files. I'd say it's much better to preserve the information and make it available as a full stack of exceptions at outer scope instead.

Once JuliaLang/julia#28878 is (presumably) merged this will be easy.

@samoconnor
Copy link
Contributor

(Note that in HTTP.jl, Servers.jl is intended to be a low-level minimal HTTP server implementation. Whereas Handlers.jl is intended to add middleware/request routing features.)
Here's my take on error handling in Servers.jl:

  • We should not log errors at all.
  • We should catch errors thrown by the user's response processing function only so that we can close the connection and clean up. We should then transparently rethrow.
  • We should not send error message detail to the client. We should just send a generic 500 Internal Server Error message. Sending stack traces to the client could leak all kinds of sensitive information.
  • The only errors we should catch and not rethrow are IO errors where we are sure of the cause and where we have a sensible strategy for retrying, recovering, cleaning up etc.

As far as logging goes, I think that should be done at a higher layer than Servers.jl e.g.:

  • The Hanlders.jl framework could have a logging module that sits at the top of the routing stack and logs new connections, new transactions, errors in handlers, closed connections etc etc.
  • Such a module could have an option (or a variant) for sending stack traces to the client (this is sometimes useful when debugging internal systems).
  • It would then be easy to swap and plug logging modules for different purposes (Julia's logging system, syslog, CloudWatch logs, elastic search, a browser/websockets console etc etc).
  • It would also be straightforward to use no logging layer at all in cases where high throughput and low memory use are priorities.

There are probably a handful of log-worthy events that can occur in the Servers.jl layer before control is handed to the Hanlders.jl framework. It is probably good to have a way to pass these to the Handlers.jl framework so that they can be logged along with the other logs. e.g.:

  • If the client disconnects before sending headers, currently the user's response processing function is not called, so the Handlers framework would have no chance of logging this event.
  • In this case we could call the response processing function with a HTTP.Stream object that throws a connection reset error as soon as a read is attempted.

@c42f
Copy link
Contributor

c42f commented Oct 16, 2018

Sending stack traces to the client could leak all kinds of sensitive information

Right!

It would then be easy to swap and plug logging modules for different purposes (Julia's logging system, syslog, CloudWatch logs, elastic search, a browser/websockets console etc etc).

Regarding swapping out loggers, disabling logging, etc, this is exactly what the julia logging system is designed for, in the sense that you can implement and install your own AbstractLogger types (one for syslog, one for CloudWatch, etc). If you feel the need to use something other than the Base logging frontend (ie, the existing logging macros) then either (a) I've failed to communicate how it's meant to work or (b) the design has some critical failure ;-) Either way, we should chat if it comes to that!

In terms of disabling logging, just installing the NullLogger on the server task will do that for you.

@samoconnor
Copy link
Contributor

Regarding swapping out loggers, disabling logging, etc, this is exactly what the julia logging system is designed for, in the sense that you can implement and install your own AbstractLogger types (one for syslog, one for CloudWatch, etc). If you feel the need to use something other than the Base logging frontend (ie, the existing logging macros) then either (a) I've failed to communicate how it's meant to work or (b) the design has some critical failure ;-)

I must admit that I haven't had time to get into the details of the new logging system.

If we only need to have one logging module in the Hanlders layer (because all the different backends can be handled by the AbstractLogger interface) that's great!

In terms of disabling logging, just installing the NullLogger on the server task will do that for you.

To what degree does this disable logging?

Are we talking macros that evaluate to empty expression :() so the compiler can do inlining/optimisation etc as if the logging was commented out? Or is there still an if statement or a function call at each logging site?

For some things the distinction isn't important, but for others is.

On a related note: are there any practical limitations on the log event rate or overall log data size?
e.g. If I have a machine with 32G of ram, and I want to run with detailed logging turned on for a minute or two producing 20G of log data and 100s of millions of log events, is there any reason this won't work (assuming a suitable RAM-backed AbstractLogger backend).

@c42f
Copy link
Contributor

c42f commented Oct 16, 2018

To what degree does this disable logging?

The intention is that the cost of a disabled log statement would be a couple of loads, integer compare and branch (when disabled based on log level only). Thus it should be extremely cheap but still a nonzero cost. There are several layers of early testing in order to avoid doing any work if possible. For the full detail of the current early filtering system, see the logging documentation: https://docs.julialang.org/en/v1/stdlib/Logging/index.html#Early-filtering-and-message-handling-1

If you want verbose tracing which is compiled out completely in production, this implies compiling a version of each function with tracing on and with tracing off. There's no real way around this fact and if you do the code elision at the macro level it plays complete havoc with precompilation. Lucikly there's a trick which can be used to dynamically recompile such functions using the julia JIT if you truly need it. I'd be interested to see a use case which truly required this.

are there any practical limitations on the log event rate or overall log data size

This is completely up to the AbstractLogger backend which you can implement yourself. If it caches the events to RAM, you just need lots of RAM. If it pumps them out to disk you just need a nice SSD array, etc.

@samoconnor
Copy link
Contributor

a couple of loads, integer compare and branch

That's the runtime cost, but I was thinking more about the cost in terms of loosing potential optimisations. Consider a loop for x in itr... where x is a SubString and the body of the loop has a log message that includes that string. If the log macro results in :() then the compiler might see that no references to the nominal SubString object ever escape the body of the loop and the SubString object can be completely optimised away. However, I would worry that if the compiler can see that the SubString might sometimes be passed into some logging function that does who knows what with it, then it will have to actually heap-allocate the object.

If you want verbose tracing which is compiled out completely in production, this implies compiling a version of each function with tracing on and with tracing off. There's no real way around this fact and if you do the code elision at the macro level it plays complete havoc with precompilation.

Right.

I understand that this kind of logging in optimisation-critical loops is not the core focus of the logging system. In HTTP.jl we have a macro for fine-grained performance-killing debug logging (which implies recompiling the whole package):

HTTP.jl/src/debug.jl

Lines 5 to 8 in e0fe578

macro debug(n::Int, s)
DEBUG_LEVEL >= n ? :(println(debug_header(), $(esc(s)))) :
:()
end

We could always call the logging system from inside our debug macro. But it might be nice for the logging system to have an optional 1st class way of statically removing logging. This could be as simple as a global const STATIC_LOG_LEVEL. Or it could be a macro that says "everything in this block should be compiled without logs below level x". Something along the lines of @inbounds?.

Lucikly there's a trick which can be used to dynamically recompile such functions ...

The pattern for this in HTTP.jl is that we have a nested stack type that includes different layers depending on the request kw args. e.g. setting verbose=3 causes a whole extra DebugLayer to be inserted into the stack (this layer logs every IO call and every byte on the wire).

HTTP.jl/src/HTTP.jl

Lines 592 to 593 in e0fe578

(verbose >= 3 ||
DEBUG_LEVEL >= 3 ? DebugLayer : NoLayer){

We end up with different versions of request being compiled for different combinations of options, and hopefully, if you select a simple set of options (retry=false, redirect=false, status_exception=false) you end up with a very simple stack that can be completely inlined.

HTTP.jl/src/HTTP.jl

Lines 388 to 399 in e0fe578

## Request Execution Stack
The Request Execution Stack is separated into composable layers.
Each layer is defined by a nested type `Layer{Next}` where the `Next`
parameter defines the next layer in the stack.
The `request` method for each layer takes a `Layer{Next}` type as
its first argument and dispatches the request to the next layer
using `request(Next, ...)`.
The example below defines three layers and three stacks each with
a different combination of layers.

This is completely up to the AbstractLogger backend which you can implement yourself. If it caches the events to RAM, you just need lots of RAM. If it pumps them out to disk you just need a nice SSD array, etc.

Ok, nice. It would be cool to have an example AbstractLogger like:

  • create a 20GB mmap shared with a helper process.
  • log to the mmap in some super-efficient way (write to an IOBuffer? just serialise the arguments to the log function?)
  • loop back to the start if the log fills up.
  • when the main process crashes or ends the helper process can take its time to pass the logs to some real logging backend.

@quinnj
Copy link
Member Author

quinnj commented Oct 17, 2018

create a 20GB mmap shared with a helper process.
log to the mmap in some super-efficient way (write to an IOBuffer? just serialise the arguments to the log function?)
loop back to the start if the log fills up.

Time to bring back FIFOBuffer? 😉

@c42f
Copy link
Contributor

c42f commented Oct 17, 2018

I understand that this kind of logging in optimisation-critical loops is not the core focus of the logging system. In HTTP.jl we have a macro for fine-grained performance-killing debug logging (which implies recompiling the whole package):

Here's a trick you can use at a module level to dynamically recompile the module and avoid having to edit the package source code:

module A

# A function returning a constant boolean which will be redefined, causing dependent functions to be recompiled
_trace_enabled() = false

function enable_trace_logging(enable::Bool=true)
    if enable != _trace_enabled()
        eval(:(_trace_enabled() = $enable))
    end
    nothing
end

function test()
    if _trace_enabled()
        @info "A message"
    end
end

end

Thence:

julia> @code_typed A.test()
CodeInfo(
14 1 ─     goto #3 if not false                                                                                                                                                                                 │
   2nothing::Nothing15 3return                                                                                                                                                                                               │
) => Nothing

julia> A.enable_trace_logging()
_trace_logging (generic function with 1 method)

julia> @code_typed A.test()
CodeInfo(
15 1 ── %1   = Base.CoreLogging.Info::Const(Info, false)                                                                                                                     │╻            macro expansion
   │    %2   = π ($(QuoteNode(Info)), Const(Info, false))                                                                                                                    ││           
   │    %3   = Base.CoreLogging._min_enabled_level::Const(RefValue{LogLevel}(Debug), false)                                                                                  ││           
   │    %4   = (Base.getfield)(%3, :x)::LogLevel                                                                                                                             ││╻            getindex
   │    %5   = (Base.getfield)(%4, :level)::Int32                                                                                                                            │││╻╷╷╷         <=%6   = (Base.slt_int)(%5, 0)::Bool                                                                                                                                   ││││╻            <
[ ... ] etc etc

obviously this trick could be wrapped up in a macro to make it more convenient.

@c42f
Copy link
Contributor

c42f commented Oct 17, 2018

it might be nice for the logging system to have an optional 1st class way of statically removing logging.

Agreed. I've toyed with the idea of introducing the above trick to Base logging somehow. I think there's a need for a verbose tracing system to be part of the Base logging system, but the details have yet to be sorted out. BTW, high performance logging is absolutely within the design remit of stdlib/Logging if possible. The big challenge is to simultaneously keep it simple enough that the average package will use it rather than println ;-)

The pattern for this in HTTP.jl is that we have a nested stack type that includes different layers depending on the request kw args. e.g. setting verbose=3 causes a whole extra DebugLayer to be inserted into the stack (this layer logs every IO call and every byte on the wire).

Very neat, this makes complete sense. This obviously requires a library which is dedicated to thinking about these kind of performance issues, and willing to explicitly add the extra context on which to dispatch (ie, either DebugLayer or NoLayer). The Base logging macros do insert more boilerplate than I'd like in their current form and this could definitely affect inlining and subsequent optimizations.

From a module composability point of view, Base logging has to pass logger contexts implicitly rather than explicitly. In particular, module A should be able to call functions in module B, and module B should be able to create log events without module A knowing about it. Furthermore, someone calling a function in module A should be able to intercept and collect the log events originating from B. I ended up doing this by attaching the logger context to the Task which implies a certain amount of dynamic dispatch when a non-filtered log event is created. I felt this was the only way to make the API simple enough that the average module author would feel like using it!

Ok, nice. It would be cool to have an example AbstractLogger like:

  • create a 20GB mmap shared with a helper process.
  • log to the mmap in some super-efficient way (write to an IOBuffer? just serialise the arguments to the log function?)
    ...

Yes, I'd suggest an AbstractLogger which does binary serialization of the event into a (circular?) buffer is the right thing here.

@quinnj quinnj closed this as completed Oct 9, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants