Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Log instead of store CC fails by default BA-5721 #5095

Merged
merged 1 commit into from
Aug 22, 2019
Merged

Conversation

kshakir
Copy link
Contributor

@kshakir kshakir commented Jul 31, 2019

No description provided.


```hocon
call-caching {
log-cache-hit-failures: true
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be reasonable to simply log these at log level INFO instead of introducing a custom boolean?

"informational messages" seem like a great fit for that log level.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The option has been removed.


```hocon
call-caching {
add-cache-hit-failures-to-metadata: false
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we sure we need this? The ticket was pretty specific that there exists no customer requirement around recording to metadata, and I can't see ourselves really needing this as developers either.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like leaving this in and defaulting to true to avoid surprising anyone currently relying on it in other instances.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ditto on no more options for cromwell admins.

Copy link
Contributor

@cjllanwarne cjllanwarne left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some comments, a lot of "I wasn't sure why this refactor happened"

Re "should we have a config option for logging" my gut feeling is like Adam, I'd be inclined to wait until this option is requested before adding it to the maintenance burden. OTOH since it's already wired in and it would be more work to undo that, I'm not sure that I'd particularly prioritize removing it either... 🤷‍♂

Now where did I leave my fence, I could do with a good sit down...

CHANGELOG.md Outdated

By default, call cache failure messages are now sent to the logs instead of the workflow metadata. Failure message
delivery to the logs and/or metadata may be adjusted via new configuration keys. See [the Cromwell call caching
documentation](https://cromwell.readthedocs.io/en/stable/cromwell_features/CallCaching/) for more information.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add a PR reference (per tech talk this morning)

[Configuration](../Configuring#call-caching) and the behavior can be modified via
[Workflow Options](../wf_options/Overview). If you are adding Workflow options, do not set
[`read_from_cache` or `write_to_cache`](../wf_options/Overview#call-caching-options) = false, as it will impact the
following process.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OOI: not that I mind, just curious what triggered this multi-lining.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I manually tested the links. I think someone moved the page down a directory under cromwell_features, but didn't fix the links on the page with a ../ prefix. So over on cromwell.rtfd.io a bunch of links on these lines actually link to the 404 page. Briefly looked into a link checker to add to testCheckPublish.sh but didn't see anything quick. If anyone knows of one they like we can add it to the build.

@@ -0,0 +1,19 @@
# Optional call-caching configuration.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TOL also move the engine conf section here from core/.../reference.conf?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reverted the whole thing. We'll fix the reference-confs-belong-with-the-code-that-references-them in future PRs.


# Whether to put cache hit failure reasons into workflow metadata.
# (default: false)
add-cache-hit-failures-to-metadata = false
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would default this to true to avoid changing the ground from underneath users.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The ticket description from product indicate that this is not a user-facing feature, so I don't think it matters

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No more config.

log-cache-hit-failures = true

# Whether to put cache hit failure reasons into workflow metadata.
# (default: false)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we're in reference.conf is there any value in stating that this is the default?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI the other reference.conf options that are previously here in this stanza are listed as (paraphrasing) some-key: "some-value" # (default some-value). So the "default" comment was also added for consistency to try not to confuse others looking through the whole file.

labels: Map[String, String],
serviceRegistry: ActorRef): IOChecked[Unit] = {
val defaultLabel = "cromwell-workflow-id" -> s"cromwell-$workflowId"
Monad[IOChecked].pure(labelsToMetadata(workflowId, labels + defaultLabel, serviceRegistry))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand why we want to wrap Unit returns as pure IoChecked monads. An in-person explanation might be good?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note: this change was triggered by IntelliJ warning about a dupe block of code. That and your comment also helped me notice a bug where one of these IOChecked[Unit]s were being created but never run. 👍

Happy to chat in-person if I'm off target here. The SomeBox[Unit] vs. Unit is because the latter only allows throwing execptions, not returning them. Could have been as simple as a Try[Unit], but in this case the previously-duplicated-and-now-consolidated block was written to use the feature rich IOChecked[Unit].

Copy link
Contributor

@cjllanwarne cjllanwarne Aug 16, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But I don't get why this would ever throw an exception? It seems like you're assuming it won't either because later on you use .get in the production WorkflowStoreSubmitActor?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeh, I didn't write it. Just de-duped it. I concur the function probably won't throw an exception. I'm pretty sure one-of-the-original-dupes was originally wrapped so it would fit into this IOChecked[_] for-comprehension.

instrumentJobComplete(response)
pushExecutionEventsToMetadataService(jobDescriptorKey, eventList)
recordExecutionStepTiming(stateName.toString, currentStateDuration)
context stop self
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like you're removing a lot more than just the metadata writing?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you look at what was removed, it was removed twice twice and copypastad back once. FTFY'ed the rest of the file while here cleaning up an IntelliJ warning.


def errorOrCallCachingBoolean(path: String): ErrorOr[Boolean] = {
import common.validation.Validation._
validate(callCachingConfig.getBoolean(path))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand why you switched from Option to ErrorOr to indicate "found or not found" here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See the new tests. All the configs are using ErrorOrs now. When misconfigured with something like my-config-enabled = tru, a naked getBoolean will quick-crash instead of continuing to validate other config values.

@@ -120,14 +126,36 @@ object MaterializeWorkflowDescriptorActor {
}
}

val errorOrMaybePrefixes = workflowOptions.getVectorOfStrings("call_cache_hit_path_prefixes")
val errorOrInvalidateBadCacheResults = errorOrCallCachingBoolean("invalidate-bad-cache-results")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems weird to treat a missing config value as a per-workflow validation error rather than applying a default?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nothing's ever missing AFAIK, and the newly added tests validate this.

The defaults are still there in the reference.conf (the belt). I just removed the duplicated defaults that were never used by main scala (the suspenders).

val errorOrInvalidateBadCacheResults = errorOrCallCachingBoolean("invalidate-bad-cache-results")
val errorOrLogCacheHitFailures = errorOrCallCachingBoolean("log-cache-hit-failures")
val errorOrAddCacheHitFailuresToMetadata = errorOrCallCachingBoolean("add-cache-hit-failures-to-metadata")
val errorOrCallCachingOptions = (
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we do stick to ErrorOrs, could we do the "config" part "once per instance" instead of "once per workflow"?

Copy link
Contributor Author

@kshakir kshakir Aug 16, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That makes sense. My only hesitation is the changing logic. Currently the logic only crashes-on-a-bad-config for workflows that request call caching. I can take a shot at updating this and add a line to the ChangeLog and docs describing the change in behavior. Edit: Instead filed https://broadworkbench.atlassian.net/browse/BA-5923

@aednichols
Copy link
Collaborator

Don't just create the IO, run it.

That's one way to avoid side effects, for sure

Copy link
Contributor

@cjllanwarne cjllanwarne left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yet to be convinced on the whole "wrap it in an IO then call .get on it" thing, but otherwise I only have TOL comments

@@ -81,6 +87,12 @@ Cromwell would search cache hits in all of the `gs://alice_bucket`, `gs://bob_bu

If no `call_cache_hit_path_prefixes` are specified then all matching cache hits will be considered.

***Call cache failure logging***

When searching for previous results to cache, the first three times (per call) that caching is unable to copy results
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TOL: Mildly awkward sentence structure

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

English hard is.

**Runtime Attributes**

As well as call inputs and the command to run, call caching considers the following [runtime attributes](https://cromwell.readthedocs.io/en/develop/RuntimeAttributes/) of a given task when determining whether to call cache:
As well as call inputs and the command to run, call caching considers the following [runtime
attributes](../RuntimeAttributes) of a given task when determining whether to call cache:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This link didn't work for me when I tested it with the View file option

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the test. I tried them and... it didn't work for me either. 🤦‍♂ It's supposed to be '../../Page/' as apparently mkdocs adds a trailing / to the urls.

By the way, these links won't work on GitHub as these files are hosted on RTFD. Use mkdocs serve and then try out the link locally.

s"Failed copying cache results for job $jobDescriptorKey (${reason.getClass.getSimpleName}: ${reason.getMessage})"
if (invalidationRequired) {
// Whenever invalidating a cache result, always log why the invalidation occurred
workflowLogger.warn( s"$problemSummary, invalidating cache entry.")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TOL: Weird whitespace?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Whoo ps

data
} else if (data.logCacheHitFailureCount < 3) {
workflowLogger.info(problemSummary)
data.copy(logCacheHitFailureCount = data.logCacheHitFailureCount + 1)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TOL: Part of me likes the idea of:

  • Always appending 1 to the count regardless
  • Renaming it something like cacheHitFailureCount
  • Having a single log/metric emitted once the call cache sequence has completes, along the lines of "Call cache hit process had ${data.cacheHitFailureCount} total hit failures before completing successfully/unsuccessfully"

@@ -43,7 +43,7 @@ final case class WorkflowStoreSubmitActor(store: WorkflowStore, serviceRegistryA
val wfTypeVersion = cmd.source.workflowTypeVersion.getOrElse("Unspecified version")
log.info("{} ({}) workflow {} submitted", wfType, wfTypeVersion, workflowSubmissionResponse.id)
val labelsMap = convertJsonToLabelsMap(cmd.source.labelsJson)
publishLabelsToMetadata(workflowSubmissionResponse.id, labelsMap)
publishLabelsToMetadata(workflowSubmissionResponse.id, labelsMap, serviceRegistryActor).toErrorOr.toTry.get
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens if the publish to metadata fails? Would we get a workflow store crash? Do we have suitable recovery logic for that?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No recovery logic added. Maybe there should be some. This was a de-dupe based on a suggestion by IntelliJ. Someone had copypasta'd the exact same publishLabelsToMetadata into two different places.

labels: Map[String, String],
serviceRegistry: ActorRef): IOChecked[Unit] = {
val defaultLabel = "cromwell-workflow-id" -> s"cromwell-$workflowId"
Monad[IOChecked].pure(labelsToMetadata(workflowId, labels + defaultLabel, serviceRegistry))
Copy link
Contributor

@cjllanwarne cjllanwarne Aug 16, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But I don't get why this would ever throw an exception? It seems like you're assuming it won't either because later on you use .get in the production WorkflowStoreSubmitActor?

@kshakir kshakir requested a review from cjllanwarne August 19, 2019 17:56
@kshakir
Copy link
Contributor Author

kshakir commented Aug 19, 2019

I think I got everything, hopefully without creating new issues. 🤞 Failing tests are due to GPU issues that I believe are being investigated elsewhere.

***Call cache failure logging***

When Cromwell fails to cache a job from a previous result the reason will be logged. To reduce the verbosity of the logs
only the first three failure reasons will be logged per job. Cromwell will continue to try copying previous results
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

per job => per shard of each job?

@kshakir kshakir force-pushed the ks_log_cache_fail branch from 5800f00 to 7ab5582 Compare August 22, 2019 12:44
@kshakir kshakir merged commit b18b976 into develop Aug 22, 2019
@kshakir kshakir deleted the ks_log_cache_fail branch August 22, 2019 18:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants