-
Notifications
You must be signed in to change notification settings - Fork 125
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add design doc for logging #1555
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@orenbm added two small comments
design/logging.md
Outdated
|
||
### Backwards compatibility | ||
|
||
It's important to say that the UX will not change by default. We will do this by changing |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While this doesn't change for new setups, it will change for users that have a specific value set for CONJUR_LOG_LEVEL
. I think it's worth mentioning that when not using the default value, there will be an expected behavior change.
design/logging.md
Outdated
let's address each component to decide on the action items: | ||
- `evoke`: this component doesn't write to the log in INFO level so it won't be affected by the change. | ||
- `conjur-ui`: this components writes messages in DEBUG, INFO and WARN and should be addressed. | ||
- `ldap-sync`: this components writes messages in DEBUG, INFO and WARN and should be addressed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Syslog-ng logs as well
design/logging.md
Outdated
1. DEBUG - messages that can help Conjur developers investigate an issue. These messages will include | ||
stack traces and DB messages. | ||
2. INFO - messages that can help Conjur users and operators investigate an issue. These messages will include | ||
the messages that we print in our logs (messages with the `CONJ` prefix). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that messages with prefix 'CONJ' with suffix 'E' need to be WARN (CONJ00007E for example). If i take your example:
WARN 2020/05/17 17:39:59 +0000 [pid=401] [origin=127.0.0.1] [request_id=f39fb7e2-59f9-4f3a-89a9-bb9a50b09860] [tid=413] Authentication Error: #<Errors::Authentication::Security::RoleNotFound: CONJ00007E 'not_in_conjur' not found>
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree generally. However, this particular line has a long history a needs approval from our security architects before we can this can show up in the Conjur logs by default (which is currently WARN
level).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the current default log level is info
. I think that this message should be written in INFO
(and the code should change accordingly). I address this issue in the doc.
f7a84e7
to
2dcd33d
Compare
design/logging.md
Outdated
- Conjur | ||
- DAP |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jvanderhoof @alexkalish do we publish a version for changes like this or are they accumulated to the next release? same question for Conjur and DAP.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It really depends on the need. If we want to get a change into the hands of a customer, we'll cut a release.
design/logging.md
Outdated
### Backwards compatibility | ||
|
||
As you can see, the UX has changed a bit, as the log will contain more data by default. | ||
The default log level is INFO, so instead of getting this log for the use-case described above: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd recommend we change the default log level to WARN
. INFO
feels much too verbose to consider running in production (what I'd expect the default level to be acceptable for).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks @jvanderhoof . I was going backwards and forward on this one and it's the biggest question of this design. Let's get some thoughts from @alexkalish, @InbalZilberman and @rafis3 on this and decide.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that WARN
is a fine default. Using a product for the first time in production, I would prefer to not worry about high log volume.
Code Climate has analyzed commit 2dcd33d2 and detected 0 issues on this pull request. The test coverage on the diff in this pull request is 100.0% (50% is the threshold). This pull request will bring the total coverage in the repository to 85.4% (0.0% change). View more on Code Climate. |
|
||
We will not introduce new tests in this effort and the regression tests will be enough. | ||
|
||
## Documentation |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we please also add all messages to the DAP Wiki? This source will be used by the field and support (as well as our internal engineers 😄).
@orenbm, a big thanks for taking the lead on this. I really think it's going to impact our team and customers in a really positive way! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great initiative. Super important IMO
not the application. The issue should be resolved soon, but the application can log the exception and continue running. | ||
|
||
We will go over our log messages and change their log level according to the above. | ||
DB calls will stay in DEBUG, whereas most of the logs that we write on our own (i.e `CONJ<some-number>D`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you elaborate why we are coupling the log code with the log level? Would appreciate a reference (if there's such)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do you think about decoupling the log level from the message name? With our current approach, changing a message level requires we change the name message level (CONJ00012I
-> CONJ00012W
if we want to change a message from "Info" to "Warn". The log level is present in the error message output (INFO
, DEBUG
, etc.). Can we change the codes to remove the log level (ex. CONJ<some-number>
instead of CONJ<some-number>D
for a "Debug" level message)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with @jvanderhoof. It would be nice to have stable log codes that do not change based on their log level. But maybe that 🚢 has sailed, since our codes already include them? Also, I think it's probably more important that error codes do not change, as opposed to informational log messages.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that we can still change that to not have the codes as it is better. We can go over our docs are remove the suffix from all the instances.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@orenbm: I think that our codes match the other PAS products, right? I would like to drop them too, but are we breaking a cross-product requirement? Could @InbalZilberman perhaps ask about that?
design/logging.md
Outdated
as they will be written to the log in `INFO` level (today they are written in `DEBUG`). | ||
The reason they should be logged in `INFO` and not in `ERROR` | ||
is that an `ERROR` log indicates `errors that cause an operation to fail (e.g., | ||
missing data or files) but not the application`. In some of our error messages, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not critical but note that datadog/rails have also fatal which we don't use so I guess our Errors can also be Fatal
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, we don't have any log messages that crash our server.
design/logging.md
Outdated
### Affected Components | ||
|
||
- Conjur | ||
- DAP |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you include the UI in this effort?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, updated now.
With a few minor suggestions, I really like the direction of this proposal. What I'm not clear on is the process for identifying the messages that should be re-classified into each level (debug, info, warn, error). It feels like we'll need to establish a common level across all of Conjur. This is going to be a substantial effort (but will be very valuable for troubleshooting). Given your familiarity with |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks great, Oren. From a PO perspective, I think it's fine to move forward. Just please be sure to address the concerns/comments of other reviewers.
not the application. The issue should be resolved soon, but the application can log the exception and continue running. | ||
|
||
We will go over our log messages and change their log level according to the above. | ||
DB calls will stay in DEBUG, whereas most of the logs that we write on our own (i.e `CONJ<some-number>D`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with @jvanderhoof. It would be nice to have stable log codes that do not change based on their log level. But maybe that 🚢 has sailed, since our codes already include them? Also, I think it's probably more important that error codes do not change, as opposed to informational log messages.
design/logging.md
Outdated
### Backwards compatibility | ||
|
||
As you can see, the UX has changed a bit, as the log will contain more data by default. | ||
The default log level is INFO, so instead of getting this log for the use-case described above: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that WARN
is a fine default. Using a product for the first time in production, I would prefer to not worry about high log volume.
`authn-oidc` request with a username that is not defined in Conjur. If the log level | ||
is set to INFO (as the default value is), then the log will look like this: | ||
``` | ||
INFO 2020/05/18 09:49:20 +0000 [pid=411] [origin=127.0.0.1] [request_id=0487ea6a-a974-481a-8b7f-d7d62e96aec6] [tid=421] Started POST "/authn-oidc/keycloak/cucumber/authenticate" for 127.0.0.1 at 2020-05-18 09:49:20 +0000 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So, I really dislike the default Rails request logging in production. Spreading a request across multiple lines is fine for development, but is not conducive to log aggregation and reporting. In the past, I've always configured lograge for production environments. @orenbm, @jvanderhoof: What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the new approach - where the default log level is warn - we won't have these messages.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So, if we go to WARN
, we lose basic request logging? I missed that detail and I'm not sure if that makes sense. I would expect basic request logging at the default log level in an application like this. Maybe we should go back to INFO
. Thoughts, @jvanderhoof?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, you can read about it more here.
to summarize, if we look at the use-case presented in the doc (unsuccessful request).
The current log is:
INFO 2020/05/18 09:49:20 +0000 [pid=411] [origin=127.0.0.1] [request_id=0487ea6a-a974-481a-8b7f-d7d62e96aec6] [tid=421] Started POST "/authn-oidc/keycloak/cucumber/authenticate" for 127.0.0.1 at 2020-05-18 09:49:20 +0000
INFO 2020/05/18 09:49:21 +0000 [pid=411] [origin=127.0.0.1] [request_id=0487ea6a-a974-481a-8b7f-d7d62e96aec6] [tid=421] Processing by AuthenticateController#authenticate_oidc as */*
INFO 2020/05/18 09:49:21 +0000 [pid=411] [origin=127.0.0.1] [request_id=0487ea6a-a974-481a-8b7f-d7d62e96aec6] [tid=421] Parameters: {"service_id"=>"keycloak", "account"=>"cucumber"}
INFO 2020/05/18 09:49:21 +0000 [pid=411] [origin=127.0.0.1] [request_id=0487ea6a-a974-481a-8b7f-d7d62e96aec6] [tid=421] Completed 401 Unauthorized in 94ms
If we will have WARN
as default in the new behaviour, the log will be empty.
If we will have INFO
as default in the new behaviour, the log will be:
INFO 2020/05/17 17:39:58 +0000 [pid=401] [origin=127.0.0.1] [request_id=f39fb7e2-59f9-4f3a-89a9-bb9a50b09860] [tid=413] Started POST "/authn-oidc/keycloak/cucumber/authenticate" for 127.0.0.1 at 2020-05-17 17:39:58 +0000
INFO 2020/05/17 17:39:58 +0000 [pid=401] [origin=127.0.0.1] [request_id=f39fb7e2-59f9-4f3a-89a9-bb9a50b09860] [tid=413] Processing by AuthenticateController#authenticate_oidc as */*
INFO 2020/05/17 17:39:58 +0000 [pid=401] [origin=127.0.0.1] [request_id=f39fb7e2-59f9-4f3a-89a9-bb9a50b09860] [tid=413] Parameters: {"service_id"=>"keycloak", "account"=>"cucumber"}
INFO 2020/05/17 17:39:58 +0000 [pid=401] [origin=127.0.0.1] [request_id=f39fb7e2-59f9-4f3a-89a9-bb9a50b09860] [tid=413] CONJ00023 Concurrency limited cache concurrent requests updated to '1'
INFO 2020/05/17 17:39:58 +0000 [pid=401] [origin=127.0.0.1] [request_id=f39fb7e2-59f9-4f3a-89a9-bb9a50b09860] [tid=413] CONJ00007 Working with Identity Provider https://keycloak:8443/auth/realms/master
INFO 2020/05/17 17:39:58 +0000 [pid=401] [origin=127.0.0.1] [request_id=f39fb7e2-59f9-4f3a-89a9-bb9a50b09860] [tid=413] CONJ00008 Identity Provider discovery succeeded
INFO 2020/05/17 17:39:58 +0000 [pid=401] [origin=127.0.0.1] [request_id=f39fb7e2-59f9-4f3a-89a9-bb9a50b09860] [tid=413] CONJ00009 Fetched Identity Provider keys from provider successfully
INFO 2020/05/17 17:39:58 +0000 [pid=401] [origin=127.0.0.1] [request_id=f39fb7e2-59f9-4f3a-89a9-bb9a50b09860] [tid=413] CONJ00016 Rate limited cache updated successfully
INFO 2020/05/17 17:39:58 +0000 [pid=401] [origin=127.0.0.1] [request_id=f39fb7e2-59f9-4f3a-89a9-bb9a50b09860] [tid=413] CONJ00021 Concurrency limited cache updated successfully
INFO 2020/05/17 17:39:58 +0000 [pid=401] [origin=127.0.0.1] [request_id=f39fb7e2-59f9-4f3a-89a9-bb9a50b09860] [tid=413] CONJ00023 Concurrency limited cache concurrent requests updated to '0'
INFO 2020/05/17 17:39:58 +0000 [pid=401] [origin=127.0.0.1] [request_id=f39fb7e2-59f9-4f3a-89a9-bb9a50b09860] [tid=413] CONJ00017 Fetched Identity Provider keys from cache successfully
INFO 2020/05/17 17:39:58 +0000 [pid=401] [origin=127.0.0.1] [request_id=f39fb7e2-59f9-4f3a-89a9-bb9a50b09860] [tid=413] CONJ00005 Token decoded successfully
INFO 2020/05/17 17:39:58 +0000 [pid=401] [origin=127.0.0.1] [request_id=f39fb7e2-59f9-4f3a-89a9-bb9a50b09860] [tid=413] CONJ00004 Extracted username 'not_in_conjur' from ID token field 'preferred_username'
INFO 2020/05/17 17:39:59 +0000 [pid=401] [origin=127.0.0.1] [request_id=f39fb7e2-59f9-4f3a-89a9-bb9a50b09860] [tid=413] Authentication Error: #<Errors::Authentication::Security::RoleNotFound: CONJ00007 'not_in_conjur' not found>
INFO 2020/05/17 17:39:59 +0000 [pid=401] [origin=127.0.0.1] [request_id=f39fb7e2-59f9-4f3a-89a9-bb9a50b09860] [tid=413] Completed 401 Unauthorized in 143ms
So we need to choose between less than today to more than today. I actually agree with you that the UX will be better without those 4 lines because it adds data to the log that we can live without. They are also quite DEBUG
y. If the user wants more data they can move to INFO
and get the 4 lines above + our additional info.
I really don't think there's a use case where someone will want just the basic info that we log today without the additional data that we log with the CONJ
prefix.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@sgnn7 can you take a look at this thread? In this one i state the outcome of having INFO
as the default value and having WARN
as the default value.
Please say what you think is better, having the above in mind.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that both options in this example (Warn = no logs, Info = too much logging) are not great experience for both the operator and the developer. No messages would mean no tracing of functionality without changing the log level and the other one is just too noisy. The best blend of the two would be taking the output of the INFO and pushing all mostly irrelevant messages to debug level so you would get a much better condensed idea of what happened without flooding the logs:
INFO 2020/05/17 17:39:58 +0000 [pid=401] [origin=127.0.0.1] [request_id=f39fb7e2-59f9-4f3a-89a9-bb9a50b09860] [tid=413] Started POST "/authn-oidc/keycloak/cucumber/authenticate" for 127.0.0.1 at 2020-05-17 17:39:58 +0000
INFO 2020/05/17 17:39:58 +0000 [pid=401] [origin=127.0.0.1] [request_id=f39fb7e2-59f9-4f3a-89a9-bb9a50b09860] [tid=413] Processing by AuthenticateController#authenticate_oidc as */*
INFO 2020/05/17 17:39:58 +0000 [pid=401] [origin=127.0.0.1] [request_id=f39fb7e2-59f9-4f3a-89a9-bb9a50b09860] [tid=413] Parameters: {"service_id"=>"keycloak", "account"=>"cucumber"}
INFO 2020/05/17 17:39:59 +0000 [pid=401] [origin=127.0.0.1] [request_id=f39fb7e2-59f9-4f3a-89a9-bb9a50b09860] [tid=413] Authentication Error: #<Errors::Authentication::Security::RoleNotFound: CONJ00007 'not_in_conjur' not found>
INFO 2020/05/17 17:39:59 +0000 [pid=401] [origin=127.0.0.1] [request_id=f39fb7e2-59f9-4f3a-89a9-bb9a50b09860] [tid=413] Completed 401 Unauthorized in 143ms
Edit: We may want to snip out the middle two lines too if we really wanted to but knowing that an action was taking place should be logged by the server.
I know some of this seems like we are adding unnecessary info in the logs but they are critical from a support standpoint since they don't require changing the customer's infrastructure to narrow down problems when they happen.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Funny enough, this is the current state :)
I actually think that this level doesn't give value as it is somewhere in the middle. I write about it more here.
I would actually like to get some preferences from SEs as they work with customers. I'll send a link to this thread in Slack as i can't tag them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
More info is better unless it creates other complications like filling up disk space or exposing stuff people shouldn't see. One advantage of docker logs is "--since ..." which lets you easily view recent log messages w/o parsing timestamps. Current default log levels are pretty useless. When you bump the log level to debug you get all the backtrace garbage to wade through. So I'm for whatever gets more useful info into the logs as a responsible default. Currently when debugging K8s 401s I turn set LOG_LEVEL to DEBUG, RAILS_ENV to production (not sure if that's required), deploy the Follower or app, then run "docker logs --since 1m | grep CONJ" and that generally shows what the issue is. In a perfect world I'd just need to do "docker logs --since 1m" w/o changing log levels or grepping.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks @jodyhuntatx, that's very helpful!
@jvanderhoof @alexkalish @Tovli I changed the default log level to |
will be logged in INFO. | ||
|
||
We will also remove the suffix from the log codes as the log level is already written | ||
in the log message (e.g `INFO 2020/05/17 17:39:58 +0000 [pid=401] [origin=127.0.0.1] [request_id=f39fb7e2-59f9-4f3a-89a9-bb9a50b09860] [tid=413] CONJ00008I Identity Provider discovery succeeded`). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This needs to be ensured to retain that info in some way when shipped to syslog/DB
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, the log level is written also in Syslog.
|
||
Currently, the default log level for Conjur OSS and DAP is `INFO`. We will change | ||
it to `WARN` so the default level will not have all the log messages that are now | ||
added to the `INFO` level from `DEBUG`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure I'm on board with upping the log level to warn. The reason for my opinion here is that when problems occur, customers don't want to modify their servers to enable debugging but want to only send us the logs which I think will not provide enough troubleshooting information when set to "warn" even with the changes suggested for authentication issues. This makes support issues harder to fix and we would not be able to fix many inconsistent issues since we would be doing things reactively rather than proactively for collecting the info.
From a customer's perspective, they also may want to see when a person logs in, retrieves a secret, etc and other informational info that is not a warning level message and should not be one. I am not a fan of log4j but I do like their breakdown of log levels and some other vantage points from this SO thread. In general though, we also have to consider that we have a security product and that our logging is much more valuable than logging of a random system service.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
but that's true also today. Today, we want to get the logs in DEBUG level as without it it doesn't have enough value for us in debugging. So the current default level also doesn't help us.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for the delay in answering here as it's been pretty busy on the community team.
INFO
should ideally provide enough data for figuring out most problems and recording most actions on a high-enough level for good auditing/debugging while also reducing noise. I totally agree that we don't get enough data right now for resolving issues from the default INFO level but I don't see how we would be making this better by increasing the default logging level threshold which will end up reducing this output. I do see the point about making auth errors more obvious but I don't believe that conjur should throw warnings for anything that's not a real warning (eg. low disk space, not enough available file descriptors, unset authenticators string, OOM, etc).
Closing this proposal as we didn't get to an agreement on the desired UX. However, I did create this PR that addresses one of the issues raised in this design and that will give us a lot of value. |
Connected to #1539
Add a design doc for improving the Conjur log. More info can be found in the connected issue and in the design itself.
It's best to view the doc here and add your comments here.