Add design doc for logging #1555

orenbm · 2020-05-18T10:34:24Z

Connected to #1539

Add a design doc for improving the Conjur log. More info can be found in the connected issue and in the design itself.

It's best to view the doc here and add your comments here.

design/logging.md

rafis3

@orenbm added two small comments

rafis3 · 2020-05-18T15:12:44Z

design/logging.md

+
+### Backwards compatibility
+
+It's important to say that the UX will not change by default. We will do this by changing


While this doesn't change for new setups, it will change for users that have a specific value set for CONJUR_LOG_LEVEL. I think it's worth mentioning that when not using the default value, there will be an expected behavior change.

rafis3 · 2020-05-18T15:27:38Z

design/logging.md

+    let's address each component to decide on the action items:
+    - `evoke`: this component doesn't write to the log in INFO level so it won't be affected by the change.
+    - `conjur-ui`: this components writes messages in DEBUG, INFO and WARN and should be addressed.
+    - `ldap-sync`: this components writes messages in DEBUG, INFO and WARN and should be addressed.


Syslog-ng logs as well

eladkug · 2020-05-18T19:46:17Z

design/logging.md

+1. DEBUG - messages that can help Conjur developers investigate an issue. These messages will include
+           stack traces and DB messages.
+2. INFO - messages that can help Conjur users and operators investigate an issue. These messages will include
+          the messages that we print in our logs (messages with the `CONJ` prefix).


I think that messages with prefix 'CONJ' with suffix 'E' need to be WARN (CONJ00007E for example). If i take your example:
WARN 2020/05/17 17:39:59 +0000 [pid=401] [origin=127.0.0.1] [request_id=f39fb7e2-59f9-4f3a-89a9-bb9a50b09860] [tid=413] Authentication Error: #<Errors::Authentication::Security::RoleNotFound: CONJ00007E 'not_in_conjur' not found>.

I agree generally. However, this particular line has a long history a needs approval from our security architects before we can this can show up in the Conjur logs by default (which is currently WARN level).

the current default log level is info. I think that this message should be written in INFO (and the code should change accordingly). I address this issue in the doc.

orenbm · 2020-05-19T12:24:49Z

design/logging.md

+- Conjur
+- DAP


@jvanderhoof @alexkalish do we publish a version for changes like this or are they accumulated to the next release? same question for Conjur and DAP.

It really depends on the need. If we want to get a change into the hands of a customer, we'll cut a release.

jvanderhoof · 2020-05-19T13:22:40Z

design/logging.md

+### Backwards compatibility
+
+As you can see, the UX has changed a bit, as the log will contain more data by default.
+The default log level is INFO, so instead of getting this log for the use-case described above:


I'd recommend we change the default log level to WARN. INFO feels much too verbose to consider running in production (what I'd expect the default level to be acceptable for).

thanks @jvanderhoof . I was going backwards and forward on this one and it's the biggest question of this design. Let's get some thoughts from @alexkalish, @InbalZilberman and @rafis3 on this and decide.

I think that WARN is a fine default. Using a product for the first time in production, I would prefer to not worry about high log volume.

codeclimate · 2020-05-19T13:24:36Z

Code Climate has analyzed commit 2dcd33d2 and detected 0 issues on this pull request.

The test coverage on the diff in this pull request is 100.0% (50% is the threshold).

This pull request will bring the total coverage in the repository to 85.4% (0.0% change).

View more on Code Climate.

jvanderhoof · 2020-05-19T13:25:13Z

design/logging.md

+
+We will not introduce new tests in this effort and the regression tests will be enough.
+
+## Documentation


Can we please also add all messages to the DAP Wiki? This source will be used by the field and support (as well as our internal engineers 😄).

jvanderhoof · 2020-05-19T13:27:13Z

@orenbm, a big thanks for taking the lead on this. I really think it's going to impact our team and customers in a really positive way!

Tovli

Great initiative. Super important IMO

Tovli · 2020-05-19T18:50:50Z

design/logging.md

+  not the application. The issue should be resolved soon, but the application can log the exception and continue running.
+
+We will go over our log messages and change their log level according to the above.
+DB calls will stay in DEBUG, whereas most of the logs that we write on our own (i.e `CONJ<some-number>D`)


Can you elaborate why we are coupling the log code with the log level? Would appreciate a reference (if there's such)

What do you think about decoupling the log level from the message name? With our current approach, changing a message level requires we change the name message level (CONJ00012I -> CONJ00012W if we want to change a message from "Info" to "Warn". The log level is present in the error message output (INFO, DEBUG, etc.). Can we change the codes to remove the log level (ex. CONJ<some-number> instead of CONJ<some-number>D for a "Debug" level message)?

I agree with @jvanderhoof. It would be nice to have stable log codes that do not change based on their log level. But maybe that 🚢 has sailed, since our codes already include them? Also, I think it's probably more important that error codes do not change, as opposed to informational log messages.

I think that we can still change that to not have the codes as it is better. We can go over our docs are remove the suffix from all the instances.

@orenbm: I think that our codes match the other PAS products, right? I would like to drop them too, but are we breaking a cross-product requirement? Could @InbalZilberman perhaps ask about that?

Tovli · 2020-05-19T18:52:12Z

design/logging.md

+as they will be written to the log in `INFO` level (today they are written in `DEBUG`). 
+The reason they should be logged in `INFO` and not in `ERROR`
+is that an `ERROR` log indicates `errors that cause an operation to fail (e.g., 
+missing data or files) but not the application`. In some of our error messages,


Not critical but note that datadog/rails have also fatal which we don't use so I guess our Errors can also be Fatal

yes, we don't have any log messages that crash our server.

Tovli · 2020-05-19T18:53:49Z

design/logging.md

+### Affected Components
+
+- Conjur
+- DAP


you include the UI in this effort?

yes, updated now.

jvanderhoof · 2020-05-21T22:13:01Z

With a few minor suggestions, I really like the direction of this proposal. What I'm not clear on is the process for identifying the messages that should be re-classified into each level (debug, info, warn, error). It feels like we'll need to establish a common level across all of Conjur. This is going to be a substantial effort (but will be very valuable for troubleshooting).

Given your familiarity with authn-azure, would it make sense to start in this authenticator? authn-K8s is another great option given the high number of support cases we deal with related to Kubernetes/OCP. Once we establish a common pattern, we can iteratively push through the rest of the Conjur workflows.

alexkalish

This looks great, Oren. From a PO perspective, I think it's fine to move forward. Just please be sure to address the concerns/comments of other reviewers.

alexkalish · 2020-05-22T14:34:40Z

design/logging.md

+  not the application. The issue should be resolved soon, but the application can log the exception and continue running.
+
+We will go over our log messages and change their log level according to the above.
+DB calls will stay in DEBUG, whereas most of the logs that we write on our own (i.e `CONJ<some-number>D`)


I agree with @jvanderhoof. It would be nice to have stable log codes that do not change based on their log level. But maybe that 🚢 has sailed, since our codes already include them? Also, I think it's probably more important that error codes do not change, as opposed to informational log messages.

alexkalish · 2020-05-22T14:36:47Z

design/logging.md

+### Backwards compatibility
+
+As you can see, the UX has changed a bit, as the log will contain more data by default.
+The default log level is INFO, so instead of getting this log for the use-case described above:


I think that WARN is a fine default. Using a product for the first time in production, I would prefer to not worry about high log volume.

alexkalish · 2020-05-22T14:41:37Z

design/logging.md

+`authn-oidc` request with a username that is not defined in Conjur. If the log level
+is set to INFO (as the default value is), then the log will look like this:
+```
+ INFO 2020/05/18 09:49:20 +0000 [pid=411] [origin=127.0.0.1] [request_id=0487ea6a-a974-481a-8b7f-d7d62e96aec6] [tid=421] Started POST "/authn-oidc/keycloak/cucumber/authenticate" for 127.0.0.1 at 2020-05-18 09:49:20 +0000


So, I really dislike the default Rails request logging in production. Spreading a request across multiple lines is fine for development, but is not conducive to log aggregation and reporting. In the past, I've always configured lograge for production environments. @orenbm, @jvanderhoof: What do you think?

In the new approach - where the default log level is warn - we won't have these messages.

So, if we go to WARN, we lose basic request logging? I missed that detail and I'm not sure if that makes sense. I would expect basic request logging at the default log level in an application like this. Maybe we should go back to INFO. Thoughts, @jvanderhoof?

Yes, you can read about it more here.

to summarize, if we look at the use-case presented in the doc (unsuccessful request).

The current log is:

INFO 2020/05/18 09:49:20 +0000 [pid=411] [origin=127.0.0.1] [request_id=0487ea6a-a974-481a-8b7f-d7d62e96aec6] [tid=421] Started POST "/authn-oidc/keycloak/cucumber/authenticate" for 127.0.0.1 at 2020-05-18 09:49:20 +0000 INFO 2020/05/18 09:49:21 +0000 [pid=411] [origin=127.0.0.1] [request_id=0487ea6a-a974-481a-8b7f-d7d62e96aec6] [tid=421] Processing by AuthenticateController#authenticate_oidc as */* INFO 2020/05/18 09:49:21 +0000 [pid=411] [origin=127.0.0.1] [request_id=0487ea6a-a974-481a-8b7f-d7d62e96aec6] [tid=421] Parameters: {"service_id"=>"keycloak", "account"=>"cucumber"} INFO 2020/05/18 09:49:21 +0000 [pid=411] [origin=127.0.0.1] [request_id=0487ea6a-a974-481a-8b7f-d7d62e96aec6] [tid=421] Completed 401 Unauthorized in 94ms

If we will have WARN as default in the new behaviour, the log will be empty.
If we will have INFO as default in the new behaviour, the log will be:

INFO 2020/05/17 17:39:58 +0000 [pid=401] [origin=127.0.0.1] [request_id=f39fb7e2-59f9-4f3a-89a9-bb9a50b09860] [tid=413] Started POST "/authn-oidc/keycloak/cucumber/authenticate" for 127.0.0.1 at 2020-05-17 17:39:58 +0000 INFO 2020/05/17 17:39:58 +0000 [pid=401] [origin=127.0.0.1] [request_id=f39fb7e2-59f9-4f3a-89a9-bb9a50b09860] [tid=413] Processing by AuthenticateController#authenticate_oidc as */* INFO 2020/05/17 17:39:58 +0000 [pid=401] [origin=127.0.0.1] [request_id=f39fb7e2-59f9-4f3a-89a9-bb9a50b09860] [tid=413] Parameters: {"service_id"=>"keycloak", "account"=>"cucumber"} INFO 2020/05/17 17:39:58 +0000 [pid=401] [origin=127.0.0.1] [request_id=f39fb7e2-59f9-4f3a-89a9-bb9a50b09860] [tid=413] CONJ00023 Concurrency limited cache concurrent requests updated to '1' INFO 2020/05/17 17:39:58 +0000 [pid=401] [origin=127.0.0.1] [request_id=f39fb7e2-59f9-4f3a-89a9-bb9a50b09860] [tid=413] CONJ00007 Working with Identity Provider https://keycloak:8443/auth/realms/master INFO 2020/05/17 17:39:58 +0000 [pid=401] [origin=127.0.0.1] [request_id=f39fb7e2-59f9-4f3a-89a9-bb9a50b09860] [tid=413] CONJ00008 Identity Provider discovery succeeded INFO 2020/05/17 17:39:58 +0000 [pid=401] [origin=127.0.0.1] [request_id=f39fb7e2-59f9-4f3a-89a9-bb9a50b09860] [tid=413] CONJ00009 Fetched Identity Provider keys from provider successfully INFO 2020/05/17 17:39:58 +0000 [pid=401] [origin=127.0.0.1] [request_id=f39fb7e2-59f9-4f3a-89a9-bb9a50b09860] [tid=413] CONJ00016 Rate limited cache updated successfully INFO 2020/05/17 17:39:58 +0000 [pid=401] [origin=127.0.0.1] [request_id=f39fb7e2-59f9-4f3a-89a9-bb9a50b09860] [tid=413] CONJ00021 Concurrency limited cache updated successfully INFO 2020/05/17 17:39:58 +0000 [pid=401] [origin=127.0.0.1] [request_id=f39fb7e2-59f9-4f3a-89a9-bb9a50b09860] [tid=413] CONJ00023 Concurrency limited cache concurrent requests updated to '0' INFO 2020/05/17 17:39:58 +0000 [pid=401] [origin=127.0.0.1] [request_id=f39fb7e2-59f9-4f3a-89a9-bb9a50b09860] [tid=413] CONJ00017 Fetched Identity Provider keys from cache successfully INFO 2020/05/17 17:39:58 +0000 [pid=401] [origin=127.0.0.1] [request_id=f39fb7e2-59f9-4f3a-89a9-bb9a50b09860] [tid=413] CONJ00005 Token decoded successfully INFO 2020/05/17 17:39:58 +0000 [pid=401] [origin=127.0.0.1] [request_id=f39fb7e2-59f9-4f3a-89a9-bb9a50b09860] [tid=413] CONJ00004 Extracted username 'not_in_conjur' from ID token field 'preferred_username' INFO 2020/05/17 17:39:59 +0000 [pid=401] [origin=127.0.0.1] [request_id=f39fb7e2-59f9-4f3a-89a9-bb9a50b09860] [tid=413] Authentication Error: #<Errors::Authentication::Security::RoleNotFound: CONJ00007 'not_in_conjur' not found> INFO 2020/05/17 17:39:59 +0000 [pid=401] [origin=127.0.0.1] [request_id=f39fb7e2-59f9-4f3a-89a9-bb9a50b09860] [tid=413] Completed 401 Unauthorized in 143ms

So we need to choose between less than today to more than today. I actually agree with you that the UX will be better without those 4 lines because it adds data to the log that we can live without. They are also quite DEBUGy. If the user wants more data they can move to INFO and get the 4 lines above + our additional info.

I really don't think there's a use case where someone will want just the basic info that we log today without the additional data that we log with the CONJ prefix.

@sgnn7 can you take a look at this thread? In this one i state the outcome of having INFO as the default value and having WARN as the default value.

Please say what you think is better, having the above in mind.

I think that both options in this example (Warn = no logs, Info = too much logging) are not great experience for both the operator and the developer. No messages would mean no tracing of functionality without changing the log level and the other one is just too noisy. The best blend of the two would be taking the output of the INFO and pushing all mostly irrelevant messages to debug level so you would get a much better condensed idea of what happened without flooding the logs:

INFO 2020/05/17 17:39:58 +0000 [pid=401] [origin=127.0.0.1] [request_id=f39fb7e2-59f9-4f3a-89a9-bb9a50b09860] [tid=413] Started POST "/authn-oidc/keycloak/cucumber/authenticate" for 127.0.0.1 at 2020-05-17 17:39:58 +0000 INFO 2020/05/17 17:39:58 +0000 [pid=401] [origin=127.0.0.1] [request_id=f39fb7e2-59f9-4f3a-89a9-bb9a50b09860] [tid=413] Processing by AuthenticateController#authenticate_oidc as */* INFO 2020/05/17 17:39:58 +0000 [pid=401] [origin=127.0.0.1] [request_id=f39fb7e2-59f9-4f3a-89a9-bb9a50b09860] [tid=413] Parameters: {"service_id"=>"keycloak", "account"=>"cucumber"} INFO 2020/05/17 17:39:59 +0000 [pid=401] [origin=127.0.0.1] [request_id=f39fb7e2-59f9-4f3a-89a9-bb9a50b09860] [tid=413] Authentication Error: #<Errors::Authentication::Security::RoleNotFound: CONJ00007 'not_in_conjur' not found> INFO 2020/05/17 17:39:59 +0000 [pid=401] [origin=127.0.0.1] [request_id=f39fb7e2-59f9-4f3a-89a9-bb9a50b09860] [tid=413] Completed 401 Unauthorized in 143ms

Edit: We may want to snip out the middle two lines too if we really wanted to but knowing that an action was taking place should be logged by the server.

I know some of this seems like we are adding unnecessary info in the logs but they are critical from a support standpoint since they don't require changing the customer's infrastructure to narrow down problems when they happen.

Funny enough, this is the current state :)

I actually think that this level doesn't give value as it is somewhere in the middle. I write about it more here.

I would actually like to get some preferences from SEs as they work with customers. I'll send a link to this thread in Slack as i can't tag them.

More info is better unless it creates other complications like filling up disk space or exposing stuff people shouldn't see. One advantage of docker logs is "--since ..." which lets you easily view recent log messages w/o parsing timestamps. Current default log levels are pretty useless. When you bump the log level to debug you get all the backtrace garbage to wade through. So I'm for whatever gets more useful info into the logs as a responsible default. Currently when debugging K8s 401s I turn set LOG_LEVEL to DEBUG, RAILS_ENV to production (not sure if that's required), deploy the Follower or app, then run "docker logs --since 1m | grep CONJ" and that generally shows what the issue is. In a perfect world I'd just need to do "docker logs --since 1m" w/o changing log levels or grepping.

thanks @jodyhuntatx, that's very helpful!

orenbm · 2020-06-01T11:42:30Z

@jvanderhoof @alexkalish @Tovli I changed the default log level to WARN and added a section for removing the suffix from the CONJ codes. can you please re-review?

sgnn7 · 2020-06-01T15:12:13Z

design/logging.md

+will be logged in INFO.
+
+We will also remove the suffix from the log codes as the log level is already written 
+in the log message (e.g `INFO 2020/05/17 17:39:58 +0000 [pid=401] [origin=127.0.0.1] [request_id=f39fb7e2-59f9-4f3a-89a9-bb9a50b09860] [tid=413] CONJ00008I Identity Provider discovery succeeded`).


This needs to be ensured to retain that info in some way when shipped to syslog/DB

Yes, the log level is written also in Syslog.

sgnn7 · 2020-06-01T15:23:50Z

design/logging.md

+
+Currently, the default log level for Conjur OSS and DAP is `INFO`. We will change
+it to `WARN` so the default level will not have all the log messages that are now
+added to the `INFO` level from `DEBUG`.


I'm not sure I'm on board with upping the log level to warn. The reason for my opinion here is that when problems occur, customers don't want to modify their servers to enable debugging but want to only send us the logs which I think will not provide enough troubleshooting information when set to "warn" even with the changes suggested for authentication issues. This makes support issues harder to fix and we would not be able to fix many inconsistent issues since we would be doing things reactively rather than proactively for collecting the info.

From a customer's perspective, they also may want to see when a person logs in, retrieves a secret, etc and other informational info that is not a warning level message and should not be one. I am not a fan of log4j but I do like their breakdown of log levels and some other vantage points from this SO thread. In general though, we also have to consider that we have a security product and that our logging is much more valuable than logging of a random system service.

but that's true also today. Today, we want to get the logs in DEBUG level as without it it doesn't have enough value for us in debugging. So the current default level also doesn't help us.

Sorry for the delay in answering here as it's been pretty busy on the community team.

INFO should ideally provide enough data for figuring out most problems and recording most actions on a high-enough level for good auditing/debugging while also reducing noise. I totally agree that we don't get enough data right now for resolving issues from the default INFO level but I don't see how we would be making this better by increasing the default logging level threshold which will end up reducing this output. I do see the point about making auth errors more obvious but I don't believe that conjur should throw warnings for anything that's not a real warning (eg. low disk space, not enough available file descriptors, unset authenticators string, OOM, etc).

orenbm · 2020-06-29T11:52:11Z

Closing this proposal as we didn't get to an agreement on the desired UX. However, I did create this PR that addresses one of the issues raised in this design and that will give us a lot of value.

orenbm added this to the R&D Boost milestone May 18, 2020

orenbm requested review from alexkalish, InbalZilberman, rafis3 and a team May 18, 2020 10:34

orenbm self-assigned this May 18, 2020

orenbm commented May 18, 2020

View reviewed changes

design/logging.md Outdated Show resolved Hide resolved

rafis3 reviewed May 18, 2020

View reviewed changes

eladkug reviewed May 18, 2020

View reviewed changes

orenbm force-pushed the logs-design branch from e82ce2a to eba9ad7 Compare May 19, 2020 12:00

orenbm requested a review from shaharglazner May 19, 2020 12:01

orenbm force-pushed the logs-design branch 3 times, most recently from f7a84e7 to 2dcd33d Compare May 19, 2020 12:23

orenbm commented May 19, 2020

View reviewed changes

jvanderhoof reviewed May 19, 2020

View reviewed changes

Tovli suggested changes May 19, 2020

View reviewed changes

alexkalish reviewed May 22, 2020

View reviewed changes

orenbm mentioned this pull request May 27, 2020

Enumerate Login and Authenticator Error Message Sources #1464

Closed

Add design doc for logging

4271083

orenbm force-pushed the logs-design branch from 2dcd33d to 4271083 Compare June 1, 2020 11:41

sgnn7 reviewed Jun 1, 2020

View reviewed changes

orenbm closed this Jun 29, 2020

ismarc deleted the logs-design branch April 13, 2022 19:09


		### Backwards compatibility

		It's important to say that the UX will not change by default. We will do this by changing


		We will not introduce new tests in this effort and the regression tests will be enough.

		## Documentation

Add design doc for logging #1555

Add design doc for logging #1555

Conversation

orenbm commented May 18, 2020 • edited Loading

rafis3 left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

codeclimate bot commented May 19, 2020

Choose a reason for hiding this comment

jvanderhoof commented May 19, 2020

Tovli left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jvanderhoof commented May 21, 2020

alexkalish left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

sgnn7 Jun 11, 2020 • edited Loading

Choose a reason for hiding this comment

orenbm Jun 14, 2020 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

orenbm commented Jun 1, 2020

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

orenbm commented Jun 29, 2020

orenbm commented May 18, 2020 •

edited

Loading

sgnn7 Jun 11, 2020 •

edited

Loading

orenbm Jun 14, 2020 •

edited

Loading