-
Notifications
You must be signed in to change notification settings - Fork 359
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Log instead of store CC fails by default BA-5721 #5095
Conversation
|
||
```hocon | ||
call-caching { | ||
log-cache-hit-failures: true |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it be reasonable to simply log these at log level INFO instead of introducing a custom boolean?
"informational messages" seem like a great fit for that log level.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The option has been removed.
|
||
```hocon | ||
call-caching { | ||
add-cache-hit-failures-to-metadata: false |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are we sure we need this? The ticket was pretty specific that there exists no customer requirement around recording to metadata, and I can't see ourselves really needing this as developers either.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like leaving this in and defaulting to true
to avoid surprising anyone currently relying on it in other instances.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ditto on no more options for cromwell admins.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some comments, a lot of "I wasn't sure why this refactor happened"
Re "should we have a config option for logging" my gut feeling is like Adam, I'd be inclined to wait until this option is requested before adding it to the maintenance burden. OTOH since it's already wired in and it would be more work to undo that, I'm not sure that I'd particularly prioritize removing it either... 🤷♂
Now where did I leave my fence, I could do with a good sit down...
CHANGELOG.md
Outdated
|
||
By default, call cache failure messages are now sent to the logs instead of the workflow metadata. Failure message | ||
delivery to the logs and/or metadata may be adjusted via new configuration keys. See [the Cromwell call caching | ||
documentation](https://cromwell.readthedocs.io/en/stable/cromwell_features/CallCaching/) for more information. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add a PR reference (per tech talk this morning)
[Configuration](../Configuring#call-caching) and the behavior can be modified via | ||
[Workflow Options](../wf_options/Overview). If you are adding Workflow options, do not set | ||
[`read_from_cache` or `write_to_cache`](../wf_options/Overview#call-caching-options) = false, as it will impact the | ||
following process. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OOI: not that I mind, just curious what triggered this multi-lining.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I manually tested the links. I think someone moved the page down a directory under cromwell_features
, but didn't fix the links on the page with a ../
prefix. So over on cromwell.rtfd.io a bunch of links on these lines actually link to the 404 page. Briefly looked into a link checker to add to testCheckPublish.sh but didn't see anything quick. If anyone knows of one they like we can add it to the build.
@@ -0,0 +1,19 @@ | |||
# Optional call-caching configuration. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TOL also move the engine
conf section here from core/.../reference.conf
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reverted the whole thing. We'll fix the reference-confs-belong-with-the-code-that-references-them in future PRs.
|
||
# Whether to put cache hit failure reasons into workflow metadata. | ||
# (default: false) | ||
add-cache-hit-failures-to-metadata = false |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would default this to true
to avoid changing the ground from underneath users.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The ticket description from product indicate that this is not a user-facing feature, so I don't think it matters
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No more config.
log-cache-hit-failures = true | ||
|
||
# Whether to put cache hit failure reasons into workflow metadata. | ||
# (default: false) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since we're in reference.conf
is there any value in stating that this is the default?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FYI the other reference.conf options that are previously here in this stanza are listed as (paraphrasing) some-key: "some-value" # (default some-value)
. So the "default" comment was also added for consistency to try not to confuse others looking through the whole file.
labels: Map[String, String], | ||
serviceRegistry: ActorRef): IOChecked[Unit] = { | ||
val defaultLabel = "cromwell-workflow-id" -> s"cromwell-$workflowId" | ||
Monad[IOChecked].pure(labelsToMetadata(workflowId, labels + defaultLabel, serviceRegistry)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand why we want to wrap Unit
returns as pure IoChecked
monads. An in-person explanation might be good?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note: this change was triggered by IntelliJ warning about a dupe block of code. That and your comment also helped me notice a bug where one of these IOChecked[Unit]
s were being created but never run. 👍
Happy to chat in-person if I'm off target here. The SomeBox[Unit]
vs. Unit
is because the latter only allows throwing execptions, not returning them. Could have been as simple as a Try[Unit]
, but in this case the previously-duplicated-and-now-consolidated block was written to use the feature rich IOChecked[Unit]
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But I don't get why this would ever throw an exception? It seems like you're assuming it won't either because later on you use .get
in the production WorkflowStoreSubmitActor
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeh, I didn't write it. Just de-duped it. I concur the function probably won't throw an exception. I'm pretty sure one-of-the-original-dupes was originally wrapped so it would fit into this IOChecked[_]
for-comprehension.
instrumentJobComplete(response) | ||
pushExecutionEventsToMetadataService(jobDescriptorKey, eventList) | ||
recordExecutionStepTiming(stateName.toString, currentStateDuration) | ||
context stop self |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like you're removing a lot more than just the metadata writing?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you look at what was removed, it was removed twice twice and copypastad back once. FTFY'ed the rest of the file while here cleaning up an IntelliJ warning.
|
||
def errorOrCallCachingBoolean(path: String): ErrorOr[Boolean] = { | ||
import common.validation.Validation._ | ||
validate(callCachingConfig.getBoolean(path)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand why you switched from Option
to ErrorOr
to indicate "found or not found" here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See the new tests. All the configs are using ErrorOr
s now. When misconfigured with something like my-config-enabled = tru
, a naked getBoolean
will quick-crash instead of continuing to validate other config values.
@@ -120,14 +126,36 @@ object MaterializeWorkflowDescriptorActor { | |||
} | |||
} | |||
|
|||
val errorOrMaybePrefixes = workflowOptions.getVectorOfStrings("call_cache_hit_path_prefixes") | |||
val errorOrInvalidateBadCacheResults = errorOrCallCachingBoolean("invalidate-bad-cache-results") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems weird to treat a missing config value as a per-workflow validation error rather than applying a default?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nothing's ever missing AFAIK, and the newly added tests validate this.
The defaults are still there in the reference.conf
(the belt). I just removed the duplicated defaults that were never used by main scala (the suspenders).
val errorOrInvalidateBadCacheResults = errorOrCallCachingBoolean("invalidate-bad-cache-results") | ||
val errorOrLogCacheHitFailures = errorOrCallCachingBoolean("log-cache-hit-failures") | ||
val errorOrAddCacheHitFailuresToMetadata = errorOrCallCachingBoolean("add-cache-hit-failures-to-metadata") | ||
val errorOrCallCachingOptions = ( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we do stick to ErrorOr
s, could we do the "config" part "once per instance" instead of "once per workflow"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That makes sense. My only hesitation is the changing logic. Currently the logic only crashes-on-a-bad-config for workflows that request call caching. I can take a shot at updating this and add a line to the ChangeLog and docs describing the change in behavior. Edit: Instead filed https://broadworkbench.atlassian.net/browse/BA-5923
That's one way to avoid side effects, for sure |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yet to be convinced on the whole "wrap it in an IO
then call .get
on it" thing, but otherwise I only have TOL comments
@@ -81,6 +87,12 @@ Cromwell would search cache hits in all of the `gs://alice_bucket`, `gs://bob_bu | |||
|
|||
If no `call_cache_hit_path_prefixes` are specified then all matching cache hits will be considered. | |||
|
|||
***Call cache failure logging*** | |||
|
|||
When searching for previous results to cache, the first three times (per call) that caching is unable to copy results |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TOL: Mildly awkward sentence structure
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
English hard is.
**Runtime Attributes** | ||
|
||
As well as call inputs and the command to run, call caching considers the following [runtime attributes](https://cromwell.readthedocs.io/en/develop/RuntimeAttributes/) of a given task when determining whether to call cache: | ||
As well as call inputs and the command to run, call caching considers the following [runtime | ||
attributes](../RuntimeAttributes) of a given task when determining whether to call cache: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This link didn't work for me when I tested it with the View file
option
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the test. I tried them and... it didn't work for me either. 🤦♂ It's supposed to be '../../Page/' as apparently mkdocs adds a trailing /
to the urls.
By the way, these links won't work on GitHub as these files are hosted on RTFD. Use mkdocs serve
and then try out the link locally.
s"Failed copying cache results for job $jobDescriptorKey (${reason.getClass.getSimpleName}: ${reason.getMessage})" | ||
if (invalidationRequired) { | ||
// Whenever invalidating a cache result, always log why the invalidation occurred | ||
workflowLogger.warn( s"$problemSummary, invalidating cache entry.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TOL: Weird whitespace?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Whoo ps
data | ||
} else if (data.logCacheHitFailureCount < 3) { | ||
workflowLogger.info(problemSummary) | ||
data.copy(logCacheHitFailureCount = data.logCacheHitFailureCount + 1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TOL: Part of me likes the idea of:
- Always appending 1 to the count regardless
- Renaming it something like
cacheHitFailureCount
- Having a single log/metric emitted once the call cache sequence has completes, along the lines of
"Call cache hit process had ${data.cacheHitFailureCount} total hit failures before completing successfully/unsuccessfully"
@@ -43,7 +43,7 @@ final case class WorkflowStoreSubmitActor(store: WorkflowStore, serviceRegistryA | |||
val wfTypeVersion = cmd.source.workflowTypeVersion.getOrElse("Unspecified version") | |||
log.info("{} ({}) workflow {} submitted", wfType, wfTypeVersion, workflowSubmissionResponse.id) | |||
val labelsMap = convertJsonToLabelsMap(cmd.source.labelsJson) | |||
publishLabelsToMetadata(workflowSubmissionResponse.id, labelsMap) | |||
publishLabelsToMetadata(workflowSubmissionResponse.id, labelsMap, serviceRegistryActor).toErrorOr.toTry.get |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What happens if the publish to metadata fails? Would we get a workflow store crash? Do we have suitable recovery logic for that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No recovery logic added. Maybe there should be some. This was a de-dupe based on a suggestion by IntelliJ. Someone had copypasta'd the exact same publishLabelsToMetadata
into two different places.
labels: Map[String, String], | ||
serviceRegistry: ActorRef): IOChecked[Unit] = { | ||
val defaultLabel = "cromwell-workflow-id" -> s"cromwell-$workflowId" | ||
Monad[IOChecked].pure(labelsToMetadata(workflowId, labels + defaultLabel, serviceRegistry)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But I don't get why this would ever throw an exception? It seems like you're assuming it won't either because later on you use .get
in the production WorkflowStoreSubmitActor
?
I think I got everything, hopefully without creating new issues. 🤞 Failing tests are due to GPU issues that I believe are being investigated elsewhere. |
***Call cache failure logging*** | ||
|
||
When Cromwell fails to cache a job from a previous result the reason will be logged. To reduce the verbosity of the logs | ||
only the first three failure reasons will be logged per job. Cromwell will continue to try copying previous results |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
per job
=> per shard of each job
?
5800f00
to
7ab5582
Compare
No description provided.