Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add implementation details for DELETE #14

Merged
merged 4 commits into from
Nov 27, 2021
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
63 changes: 63 additions & 0 deletions flows/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -27,3 +27,66 @@ __POST /eth/v1/keystores__
- any individual failure should not stop the batch process, but at worst cause a
failure to load an individual keystore.
4. Collate the response objects in the same order as the request keystores, and return a `200` status with the batch results.

## Delete

__DELETE /eth/v1/keystores__

1. Deserialize the JSON request from the caller. On failure return an HTTP `400` status. Let `request` be
the deserialized request.

_NOTE_: after this point the response will be a 200 OK. Any 500 response is to be considered a
bug.

2. Remove each key from `request.pubkeys` one-by-one, recording its provisional status:
1. Remove the key from the in-memory registry of keys so that it is no
longer actively signing messages. If no matching key is found, provisionally
set the key's status to `not_found` and continue to the next key.

2. (Optional) If the keymanager keeps a database of known keys on disk (i.e. metadata),
remove or disable the key in this database. Disabling the key first may
be useful for certain storage schemes, to provide crash safety.
This is implementation dependent.

If the delete/disable fails, set the key's status to `error` with the
reason for the failure and continue to the next key.
Comment on lines +46 to +52
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This stuff about disabling/deleting may be too Lighthouse specific to include here. I acknowledge that not all clients will have a separate on-disk record as we do, nor the same semantics. The reason we have to do the disable-then-delete dance is:

  1. Lighthouse will attempt to load any key listed as enabled in the metadata DB, so we have to delete from the metadata DB before deleting the keystore file to prevent Lighthouse from trying to load a deleted keystore in the case where the keystore was deleted but the call crashed or was aborted before the metadata entry could be deleted.
  2. On the other hand, we can't just delete the metadata and then the keystore because Lighthouse will reload keystores that are found on disk and re-add them to the metadata DB. In case of a crash before deleting the keystore, the key may remain active (although no response would have been sent in this case, so this would technically be OK, but we err on the side of caution).

Happy to remove these sections if they're deemed too complicated

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

marking it optional is fine i think - this is a flow guideline and it's never going to cover everyones implementation...

The more important thing I was hoping to capture is basically - batch process first, export slashing data second, and the whole idea of failure conditions...
The rest is a guide and concrete examples like you've added certainly make it a lot clearer...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is very helpful regardless


3. Delete the keystore file or keystore material from disk. If this fails, set the provisional
status to `error` with an appropriate message and continue to the next key.

4. (Optional) If the key was disabled rather than deleted in (ii),
permanently remove the key metadata from the on-disk database. If this fails, set the
provisional status to `error` with an appropriate message and continue to the next key.

5. Set the key's provisional status to `deleted` (all previous steps must have succeeded).

3. For all keys from `request.pubkeys` with provisional status equal to `deleted` or `not_found`,
export slashing protection data in EIP-3076 format.

- If the export fails, set the final status for all provisionally `deleted` or `not_found`
keys to `error`, with the `error` message describing the slashing protection failure.
The response should include a slashing protection object with valid `metadata` and
`data == []`.
Comment on lines +66 to +69
Copy link
Collaborator Author

@michaelsproul michaelsproul Nov 25, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm uncertain whether a 500 response is preferable to this. Currently Lighthouse will send a 500 in this case, which is simpler than having to craft an empty slashing_protection object with valid metadata. However I think returning a 200 with statuses is more consistent.

Would appreciate your thoughts on this @rolfyone

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So i think given that it's a batch api, exporting slashing data isn't great, but its still a 200 with statuses of removal of keys, potentially without slashing protection data populated, otherwise corrective action via list would be needed to even know what keys got removed...

I think for delete given we have to process the batch first, the only non 200 response I can see at the moment is when the request body is invalid or empty...

It would mean they couldn't get the slashing data, but its arguably more important that they know they've deleted a bunch of keys at a bare minimum...

My hope is the slashing protection data export could be fairly robust, and exporting as many as can be found is better than erroring out of that part of the process...
If no keys are found during that export, i'd probably not include the key in the response object...

{
data: []
}

I think this description pretty much sells to me that slashing_protection in the response has to be optional, otherwise we may have to export empty slashing exports, which would be sub-optimal

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

one thing that may be worth a thought/comment about.

the rest of the logic here is order important, but slashing protection data isn't. I'd be applying set logic to the keys being passed in, we only need 1 set of slashing export data per key, but we still need to apply the 'not_active' logic to all the duplicates if that makes sense...

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I think I'm coming around to the "always return 200 unless the request is completely malformed approach" as well.

still a 200 with statuses of removal of keys, potentially without slashing protection data populated, otherwise corrective action via list would be needed to even know what keys got removed...
It would mean they couldn't get the slashing data, but its arguably more important that they know they've deleted a bunch of keys at a bare minimum...

I'm not sure if I'm understanding this correctly, but it sounds like you're saying in the case of slashing protection export error we should still return deleted as the status and just omit each failed key from the slashing_protection data. I think this is unsafe because it looks identical to a successful delete for an inactive key (i.e. one with no slashing protection data). I think if there's an error during slashing protection export we have to convey that to the caller, and the only mechanism we have for doing that is the error status (or a 500 error code). I.e. the semantics should be "if slashing protection export explicitly failed for a key, then its status is error regardless of whether it was actually deleted". In the case of a batch export failing, then it sets an error status for all keys that hadn't already errored. This is unfortunate in that it leaves the system in a quasi unknown state, but as you said, we should make sure that slashing protection export is as robust as possible so that this doesn't occur often.

These semantics are also more consistent with an implementation that wants to export slashing protection one-at-a-time for each key, where each export may fail or succeed independently.

I think this description pretty much sells to me that slashing_protection in the response has to be optional, otherwise we may have to export empty slashing exports, which would be sub-optimal

I think given the above, the slashing protection data with data == [] is an OK compromise. The metadata can be pulled out of thin air and doesn't pose any risk.

Comment on lines +66 to +69
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm ok with the slashing_protection field being optional - it does make sense if the data is empty to have none passed rather than requiring an empty box with metadata...


- On success, for all keys with provisional status `not_found`, set their final status
accordingly:
* If slashing protection data was found, set the final status to `not_active`.
* Otherwise, set the final status to `not_found`.
michaelsproul marked this conversation as resolved.
Show resolved Hide resolved

- On success set the final status for all `deleted` keys to `deleted` regardless of whether
slashing protection data was found. It could be that a key lacks slashing protection data
because it was never active. The caller may warn the user in such a case if it wishes.

4. Collate the final statuses for each key in the same order as the request and send them in a
correctly-structured response along with the slashing protection data from (3).

Notes:

- The flow above is only one example of a conformant implementation, implementations are free to
implement the API in any way they wish while remaining compatible. E.g. if a keymanager stores all
key data in a single database and is able to perform an atomic delete for all keys, then it may do
this. Likewise if an implementation would like to export slashing protection data for each key one
at a time and then merge these into the final response, then it may do this.
Comment on lines +85 to +89
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is a good disclaimer...

The main thing to take away from these docs is more like

  • order of processing for delete means batch first
  • order of processing for insert means batch LAST.

this has huge implications for erroring out, and that's what is particularly helpful to align on, without being overly perscriptive of delete/disable, except to say that if a key is deleted, it is no longer able to be retrieved by the validator client that previously had access to the keystore...

I also think fleshing this out has made a few implementation details a lot clearer, so it's been a good exercise.

- The keymanager must never delete the slashing protection data for a key.
- The keymanager must never return slashing protection data for keys that are not part of the
request, and should never return slashing protection data for a key with an `error` status.
3 changes: 3 additions & 0 deletions keymanager-oapi.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -162,6 +162,9 @@ paths:

DELETE should never return a 404 response, even if all pubkeys from request.pubkeys have no extant keystores
nor slashing protection data.

Slashing protection data must only be returned for keys from `request.pubkeys` for which a
`deleted` or `not_active` status is returned.
security:
- bearerAuth: []
tags:
Expand Down