Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add design doc for logging #1555

Closed
wants to merge 1 commit into from
Closed

Add design doc for logging #1555

wants to merge 1 commit into from

Conversation

orenbm
Copy link
Member

@orenbm orenbm commented May 18, 2020

Connected to #1539

Add a design doc for improving the Conjur log. More info can be found in the connected issue and in the design itself.

It's best to view the doc here and add your comments here.

@orenbm orenbm added this to the R&D Boost milestone May 18, 2020
@orenbm orenbm requested review from alexkalish, InbalZilberman, rafis3 and a team May 18, 2020 10:34
@orenbm orenbm self-assigned this May 18, 2020
Copy link
Member

@rafis3 rafis3 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@orenbm added two small comments


### Backwards compatibility

It's important to say that the UX will not change by default. We will do this by changing
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While this doesn't change for new setups, it will change for users that have a specific value set for CONJUR_LOG_LEVEL. I think it's worth mentioning that when not using the default value, there will be an expected behavior change.

let's address each component to decide on the action items:
- `evoke`: this component doesn't write to the log in INFO level so it won't be affected by the change.
- `conjur-ui`: this components writes messages in DEBUG, INFO and WARN and should be addressed.
- `ldap-sync`: this components writes messages in DEBUG, INFO and WARN and should be addressed.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Syslog-ng logs as well

1. DEBUG - messages that can help Conjur developers investigate an issue. These messages will include
stack traces and DB messages.
2. INFO - messages that can help Conjur users and operators investigate an issue. These messages will include
the messages that we print in our logs (messages with the `CONJ` prefix).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that messages with prefix 'CONJ' with suffix 'E' need to be WARN (CONJ00007E for example). If i take your example:
WARN 2020/05/17 17:39:59 +0000 [pid=401] [origin=127.0.0.1] [request_id=f39fb7e2-59f9-4f3a-89a9-bb9a50b09860] [tid=413] Authentication Error: #<Errors::Authentication::Security::RoleNotFound: CONJ00007E 'not_in_conjur' not found>.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree generally. However, this particular line has a long history a needs approval from our security architects before we can this can show up in the Conjur logs by default (which is currently WARN level).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the current default log level is info. I think that this message should be written in INFO (and the code should change accordingly). I address this issue in the doc.

@orenbm orenbm requested a review from shaharglazner May 19, 2020 12:01
@orenbm orenbm force-pushed the logs-design branch 3 times, most recently from f7a84e7 to 2dcd33d Compare May 19, 2020 12:23
Comment on lines 533 to 534
- Conjur
- DAP
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jvanderhoof @alexkalish do we publish a version for changes like this or are they accumulated to the next release? same question for Conjur and DAP.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It really depends on the need. If we want to get a change into the hands of a customer, we'll cut a release.

### Backwards compatibility

As you can see, the UX has changed a bit, as the log will contain more data by default.
The default log level is INFO, so instead of getting this log for the use-case described above:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd recommend we change the default log level to WARN. INFO feels much too verbose to consider running in production (what I'd expect the default level to be acceptable for).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks @jvanderhoof . I was going backwards and forward on this one and it's the biggest question of this design. Let's get some thoughts from @alexkalish, @InbalZilberman and @rafis3 on this and decide.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that WARN is a fine default. Using a product for the first time in production, I would prefer to not worry about high log volume.

@codeclimate
Copy link

codeclimate bot commented May 19, 2020

Code Climate has analyzed commit 2dcd33d2 and detected 0 issues on this pull request.

The test coverage on the diff in this pull request is 100.0% (50% is the threshold).

This pull request will bring the total coverage in the repository to 85.4% (0.0% change).

View more on Code Climate.


We will not introduce new tests in this effort and the regression tests will be enough.

## Documentation
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we please also add all messages to the DAP Wiki? This source will be used by the field and support (as well as our internal engineers 😄).

@jvanderhoof
Copy link
Contributor

@orenbm, a big thanks for taking the lead on this. I really think it's going to impact our team and customers in a really positive way!

Copy link
Contributor

@Tovli Tovli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great initiative. Super important IMO

not the application. The issue should be resolved soon, but the application can log the exception and continue running.

We will go over our log messages and change their log level according to the above.
DB calls will stay in DEBUG, whereas most of the logs that we write on our own (i.e `CONJ<some-number>D`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you elaborate why we are coupling the log code with the log level? Would appreciate a reference (if there's such)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you think about decoupling the log level from the message name? With our current approach, changing a message level requires we change the name message level (CONJ00012I -> CONJ00012W if we want to change a message from "Info" to "Warn". The log level is present in the error message output (INFO, DEBUG, etc.). Can we change the codes to remove the log level (ex. CONJ<some-number> instead of CONJ<some-number>D for a "Debug" level message)?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with @jvanderhoof. It would be nice to have stable log codes that do not change based on their log level. But maybe that 🚢 has sailed, since our codes already include them? Also, I think it's probably more important that error codes do not change, as opposed to informational log messages.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that we can still change that to not have the codes as it is better. We can go over our docs are remove the suffix from all the instances.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@orenbm: I think that our codes match the other PAS products, right? I would like to drop them too, but are we breaking a cross-product requirement? Could @InbalZilberman perhaps ask about that?

as they will be written to the log in `INFO` level (today they are written in `DEBUG`).
The reason they should be logged in `INFO` and not in `ERROR`
is that an `ERROR` log indicates `errors that cause an operation to fail (e.g.,
missing data or files) but not the application`. In some of our error messages,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not critical but note that datadog/rails have also fatal which we don't use so I guess our Errors can also be Fatal

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, we don't have any log messages that crash our server.

### Affected Components

- Conjur
- DAP
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you include the UI in this effort?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, updated now.

@jvanderhoof
Copy link
Contributor

With a few minor suggestions, I really like the direction of this proposal. What I'm not clear on is the process for identifying the messages that should be re-classified into each level (debug, info, warn, error). It feels like we'll need to establish a common level across all of Conjur. This is going to be a substantial effort (but will be very valuable for troubleshooting).

Given your familiarity with authn-azure, would it make sense to start in this authenticator? authn-K8s is another great option given the high number of support cases we deal with related to Kubernetes/OCP. Once we establish a common pattern, we can iteratively push through the rest of the Conjur workflows.

Copy link
Contributor

@alexkalish alexkalish left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great, Oren. From a PO perspective, I think it's fine to move forward. Just please be sure to address the concerns/comments of other reviewers.

not the application. The issue should be resolved soon, but the application can log the exception and continue running.

We will go over our log messages and change their log level according to the above.
DB calls will stay in DEBUG, whereas most of the logs that we write on our own (i.e `CONJ<some-number>D`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with @jvanderhoof. It would be nice to have stable log codes that do not change based on their log level. But maybe that 🚢 has sailed, since our codes already include them? Also, I think it's probably more important that error codes do not change, as opposed to informational log messages.

### Backwards compatibility

As you can see, the UX has changed a bit, as the log will contain more data by default.
The default log level is INFO, so instead of getting this log for the use-case described above:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that WARN is a fine default. Using a product for the first time in production, I would prefer to not worry about high log volume.

`authn-oidc` request with a username that is not defined in Conjur. If the log level
is set to INFO (as the default value is), then the log will look like this:
```
INFO 2020/05/18 09:49:20 +0000 [pid=411] [origin=127.0.0.1] [request_id=0487ea6a-a974-481a-8b7f-d7d62e96aec6] [tid=421] Started POST "/authn-oidc/keycloak/cucumber/authenticate" for 127.0.0.1 at 2020-05-18 09:49:20 +0000
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, I really dislike the default Rails request logging in production. Spreading a request across multiple lines is fine for development, but is not conducive to log aggregation and reporting. In the past, I've always configured lograge for production environments. @orenbm, @jvanderhoof: What do you think?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the new approach - where the default log level is warn - we won't have these messages.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, if we go to WARN, we lose basic request logging? I missed that detail and I'm not sure if that makes sense. I would expect basic request logging at the default log level in an application like this. Maybe we should go back to INFO. Thoughts, @jvanderhoof?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, you can read about it more here.

to summarize, if we look at the use-case presented in the doc (unsuccessful request).

The current log is:

INFO 2020/05/18 09:49:20 +0000 [pid=411] [origin=127.0.0.1] [request_id=0487ea6a-a974-481a-8b7f-d7d62e96aec6] [tid=421] Started POST "/authn-oidc/keycloak/cucumber/authenticate" for 127.0.0.1 at 2020-05-18 09:49:20 +0000
 INFO 2020/05/18 09:49:21 +0000 [pid=411] [origin=127.0.0.1] [request_id=0487ea6a-a974-481a-8b7f-d7d62e96aec6] [tid=421] Processing by AuthenticateController#authenticate_oidc as */*
 INFO 2020/05/18 09:49:21 +0000 [pid=411] [origin=127.0.0.1] [request_id=0487ea6a-a974-481a-8b7f-d7d62e96aec6] [tid=421]   Parameters: {"service_id"=>"keycloak", "account"=>"cucumber"}
 INFO 2020/05/18 09:49:21 +0000 [pid=411] [origin=127.0.0.1] [request_id=0487ea6a-a974-481a-8b7f-d7d62e96aec6] [tid=421] Completed 401 Unauthorized in 94ms

If we will have WARN as default in the new behaviour, the log will be empty.
If we will have INFO as default in the new behaviour, the log will be:

INFO 2020/05/17 17:39:58 +0000 [pid=401] [origin=127.0.0.1] [request_id=f39fb7e2-59f9-4f3a-89a9-bb9a50b09860] [tid=413] Started POST "/authn-oidc/keycloak/cucumber/authenticate" for 127.0.0.1 at 2020-05-17 17:39:58 +0000
 INFO 2020/05/17 17:39:58 +0000 [pid=401] [origin=127.0.0.1] [request_id=f39fb7e2-59f9-4f3a-89a9-bb9a50b09860] [tid=413] Processing by AuthenticateController#authenticate_oidc as */*
 INFO 2020/05/17 17:39:58 +0000 [pid=401] [origin=127.0.0.1] [request_id=f39fb7e2-59f9-4f3a-89a9-bb9a50b09860] [tid=413]   Parameters: {"service_id"=>"keycloak", "account"=>"cucumber"}
 INFO 2020/05/17 17:39:58 +0000 [pid=401] [origin=127.0.0.1] [request_id=f39fb7e2-59f9-4f3a-89a9-bb9a50b09860] [tid=413] CONJ00023 Concurrency limited cache concurrent requests updated to '1'
 INFO 2020/05/17 17:39:58 +0000 [pid=401] [origin=127.0.0.1] [request_id=f39fb7e2-59f9-4f3a-89a9-bb9a50b09860] [tid=413] CONJ00007 Working with Identity Provider https://keycloak:8443/auth/realms/master
 INFO 2020/05/17 17:39:58 +0000 [pid=401] [origin=127.0.0.1] [request_id=f39fb7e2-59f9-4f3a-89a9-bb9a50b09860] [tid=413] CONJ00008 Identity Provider discovery succeeded
 INFO 2020/05/17 17:39:58 +0000 [pid=401] [origin=127.0.0.1] [request_id=f39fb7e2-59f9-4f3a-89a9-bb9a50b09860] [tid=413] CONJ00009 Fetched Identity Provider keys from provider successfully
 INFO 2020/05/17 17:39:58 +0000 [pid=401] [origin=127.0.0.1] [request_id=f39fb7e2-59f9-4f3a-89a9-bb9a50b09860] [tid=413] CONJ00016 Rate limited cache updated successfully
 INFO 2020/05/17 17:39:58 +0000 [pid=401] [origin=127.0.0.1] [request_id=f39fb7e2-59f9-4f3a-89a9-bb9a50b09860] [tid=413] CONJ00021 Concurrency limited cache updated successfully
 INFO 2020/05/17 17:39:58 +0000 [pid=401] [origin=127.0.0.1] [request_id=f39fb7e2-59f9-4f3a-89a9-bb9a50b09860] [tid=413] CONJ00023 Concurrency limited cache concurrent requests updated to '0'
 INFO 2020/05/17 17:39:58 +0000 [pid=401] [origin=127.0.0.1] [request_id=f39fb7e2-59f9-4f3a-89a9-bb9a50b09860] [tid=413] CONJ00017 Fetched Identity Provider keys from cache successfully
 INFO 2020/05/17 17:39:58 +0000 [pid=401] [origin=127.0.0.1] [request_id=f39fb7e2-59f9-4f3a-89a9-bb9a50b09860] [tid=413] CONJ00005 Token decoded successfully
 INFO 2020/05/17 17:39:58 +0000 [pid=401] [origin=127.0.0.1] [request_id=f39fb7e2-59f9-4f3a-89a9-bb9a50b09860] [tid=413] CONJ00004 Extracted username 'not_in_conjur' from ID token field 'preferred_username'
 INFO 2020/05/17 17:39:59 +0000 [pid=401] [origin=127.0.0.1] [request_id=f39fb7e2-59f9-4f3a-89a9-bb9a50b09860] [tid=413] Authentication Error: #<Errors::Authentication::Security::RoleNotFound: CONJ00007 'not_in_conjur' not found>
 INFO 2020/05/17 17:39:59 +0000 [pid=401] [origin=127.0.0.1] [request_id=f39fb7e2-59f9-4f3a-89a9-bb9a50b09860] [tid=413] Completed 401 Unauthorized in 143ms

So we need to choose between less than today to more than today. I actually agree with you that the UX will be better without those 4 lines because it adds data to the log that we can live without. They are also quite DEBUGy. If the user wants more data they can move to INFO and get the 4 lines above + our additional info.

I really don't think there's a use case where someone will want just the basic info that we log today without the additional data that we log with the CONJ prefix.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sgnn7 can you take a look at this thread? In this one i state the outcome of having INFO as the default value and having WARN as the default value.

Please say what you think is better, having the above in mind.

Copy link
Contributor

@sgnn7 sgnn7 Jun 11, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that both options in this example (Warn = no logs, Info = too much logging) are not great experience for both the operator and the developer. No messages would mean no tracing of functionality without changing the log level and the other one is just too noisy. The best blend of the two would be taking the output of the INFO and pushing all mostly irrelevant messages to debug level so you would get a much better condensed idea of what happened without flooding the logs:

 INFO 2020/05/17 17:39:58 +0000 [pid=401] [origin=127.0.0.1] [request_id=f39fb7e2-59f9-4f3a-89a9-bb9a50b09860] [tid=413] Started POST "/authn-oidc/keycloak/cucumber/authenticate" for 127.0.0.1 at 2020-05-17 17:39:58 +0000
 INFO 2020/05/17 17:39:58 +0000 [pid=401] [origin=127.0.0.1] [request_id=f39fb7e2-59f9-4f3a-89a9-bb9a50b09860] [tid=413] Processing by AuthenticateController#authenticate_oidc as */*
 INFO 2020/05/17 17:39:58 +0000 [pid=401] [origin=127.0.0.1] [request_id=f39fb7e2-59f9-4f3a-89a9-bb9a50b09860] [tid=413]   Parameters: {"service_id"=>"keycloak", "account"=>"cucumber"}
 INFO 2020/05/17 17:39:59 +0000 [pid=401] [origin=127.0.0.1] [request_id=f39fb7e2-59f9-4f3a-89a9-bb9a50b09860] [tid=413] Authentication Error: #<Errors::Authentication::Security::RoleNotFound: CONJ00007 'not_in_conjur' not found>
 INFO 2020/05/17 17:39:59 +0000 [pid=401] [origin=127.0.0.1] [request_id=f39fb7e2-59f9-4f3a-89a9-bb9a50b09860] [tid=413] Completed 401 Unauthorized in 143ms

Edit: We may want to snip out the middle two lines too if we really wanted to but knowing that an action was taking place should be logged by the server.

I know some of this seems like we are adding unnecessary info in the logs but they are critical from a support standpoint since they don't require changing the customer's infrastructure to narrow down problems when they happen.

Copy link
Member Author

@orenbm orenbm Jun 14, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Funny enough, this is the current state :)

I actually think that this level doesn't give value as it is somewhere in the middle. I write about it more here.

I would actually like to get some preferences from SEs as they work with customers. I'll send a link to this thread in Slack as i can't tag them.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

More info is better unless it creates other complications like filling up disk space or exposing stuff people shouldn't see. One advantage of docker logs is "--since ..." which lets you easily view recent log messages w/o parsing timestamps. Current default log levels are pretty useless. When you bump the log level to debug you get all the backtrace garbage to wade through. So I'm for whatever gets more useful info into the logs as a responsible default. Currently when debugging K8s 401s I turn set LOG_LEVEL to DEBUG, RAILS_ENV to production (not sure if that's required), deploy the Follower or app, then run "docker logs --since 1m | grep CONJ" and that generally shows what the issue is. In a perfect world I'd just need to do "docker logs --since 1m" w/o changing log levels or grepping.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks @jodyhuntatx, that's very helpful!

@orenbm
Copy link
Member Author

orenbm commented Jun 1, 2020

@jvanderhoof @alexkalish @Tovli I changed the default log level to WARN and added a section for removing the suffix from the CONJ codes. can you please re-review?

will be logged in INFO.

We will also remove the suffix from the log codes as the log level is already written
in the log message (e.g `INFO 2020/05/17 17:39:58 +0000 [pid=401] [origin=127.0.0.1] [request_id=f39fb7e2-59f9-4f3a-89a9-bb9a50b09860] [tid=413] CONJ00008I Identity Provider discovery succeeded`).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This needs to be ensured to retain that info in some way when shipped to syslog/DB

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, the log level is written also in Syslog.


Currently, the default log level for Conjur OSS and DAP is `INFO`. We will change
it to `WARN` so the default level will not have all the log messages that are now
added to the `INFO` level from `DEBUG`.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I'm on board with upping the log level to warn. The reason for my opinion here is that when problems occur, customers don't want to modify their servers to enable debugging but want to only send us the logs which I think will not provide enough troubleshooting information when set to "warn" even with the changes suggested for authentication issues. This makes support issues harder to fix and we would not be able to fix many inconsistent issues since we would be doing things reactively rather than proactively for collecting the info.

From a customer's perspective, they also may want to see when a person logs in, retrieves a secret, etc and other informational info that is not a warning level message and should not be one. I am not a fan of log4j but I do like their breakdown of log levels and some other vantage points from this SO thread. In general though, we also have to consider that we have a security product and that our logging is much more valuable than logging of a random system service.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but that's true also today. Today, we want to get the logs in DEBUG level as without it it doesn't have enough value for us in debugging. So the current default level also doesn't help us.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the delay in answering here as it's been pretty busy on the community team.

INFO should ideally provide enough data for figuring out most problems and recording most actions on a high-enough level for good auditing/debugging while also reducing noise. I totally agree that we don't get enough data right now for resolving issues from the default INFO level but I don't see how we would be making this better by increasing the default logging level threshold which will end up reducing this output. I do see the point about making auth errors more obvious but I don't believe that conjur should throw warnings for anything that's not a real warning (eg. low disk space, not enough available file descriptors, unset authenticators string, OOM, etc).

@orenbm
Copy link
Member Author

orenbm commented Jun 29, 2020

Closing this proposal as we didn't get to an agreement on the desired UX. However, I did create this PR that addresses one of the issues raised in this design and that will give us a lot of value.

@orenbm orenbm closed this Jun 29, 2020
@ismarc ismarc deleted the logs-design branch April 13, 2022 19:09
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants