Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(tee): TEE Prover Gateway #2270

Closed
wants to merge 19 commits into from
Closed

feat(tee): TEE Prover Gateway #2270

wants to merge 19 commits into from

Conversation

pbeza
Copy link
Collaborator

@pbeza pbeza commented Jun 19, 2024

What ❔

The TEE Prover Gateway is a service component within our system infrastructure that functions as an intermediary between the TEE enclave and the server's HTTP API, introduced in commit eca98cc (#1993). It first registers TEE attestation using the /tee/register_attestation endpoint, then regularly invokes the server's HTTP API via the /tee/proof_inputs endpoint to obtain proof-related data, and finally submits the proof through the /tee/submit_proofs/<l1_batch_number> endpoint.

Why ❔

This PR contributes to the effort outlined in the docs:

Checklist

  • PR title corresponds to the body of PR (we generate changelog entries from PRs).
  • Tests for the changes have been added / updated.
  • Documentation comments have been added / updated.
  • Code has been formatted via zk fmt and zk lint.
  • Spellcheck has been run via zk spellcheck.

@pbeza
Copy link
Collaborator Author

pbeza commented Jun 24, 2024

@haraldh @popzxc @RomanBrodetski, this PR isn't fully polished yet. It needs better error handling, less copy-pasting, and the performance optimizations mentioned below (unless you decide to refactor it completely). However, the code works and gets the job done. Before refining it further, I need some advice to make sure it's heading in the right direction. The main changes are in prover/prover_tee_gateway. The rest is just some boilerplate configuration stuff and moving around existing code.

The idea behind this PR is best described here, and here:

/// This application is a TEE verifier (a.k.a. a prover, or worker) that interacts with three
/// endpoints of the TEE prover interface API:
/// 1. `/tee/proof_inputs` - Fetches input data about a batch for the TEE verifier to process.
/// 2. `/tee/submit_proofs/<l1_batch_number>` - Submits the TEE proof, which is a signature of the
/// root hash.
/// 3. `/tee/register_attestation` - Registers the TEE attestation that binds a key used to sign a
/// root hash to the enclave where the signing process occurred. This effectively proves that the
/// signature was produced in a trusted execution environment.
///
/// Conceptually it works as follows:
/// 1. Generate a new priv-public key pair.
/// 2. Generate an attestation quote for the public key.
/// 3. Register the attestation via the `/tee/register_attestation` endpoint.
/// 4. Run a loop:
/// a. Fetch the next batch data via the `/tee/proof_inputs` endpoint.
/// b. Verify the batch data.
/// c. If verification is successful, sign the root hash of the batch data with the private key.
/// d. Submit the signature (a.k.a. proof) via the `/tee/submit_proofs/<l1_batch_number>`
/// endpoint.

I tested it locally using our SGX-enabled machines in --simulation mode, meaning no SGX enclave was used yet. It will be relatively easy to switch to an actual SGX enclave once we have it running in simulation mode.

patrick@patrick:~/zksync-era/prover$ RUST_LOG=info cargo run --package zksync_prover_tee_gateway -- --config-path ../../test.yaml --simulation
Attestation quote was successfully registered
2024-06-24T10:36:25.235123Z  INFO zksync_prover_tee_gateway: Starting TEE Prover Gateway
2024-06-24T10:36:25.236227Z  INFO zksync_prover_tee_gateway::api_data_fetcher: Starting periodic job: TeeVerifierInputDataFetcher with frequency: 1s
2024-06-24T10:36:25.236384Z  INFO zksync_prover_tee_gateway::api_data_fetcher: Sending request to http://127.0.0.1:3320/tee/proof_inputs
2024-06-24T10:36:25.237476Z  INFO vise_exporter::exporter: Created metrics exporter with 144 metrics in 45 groups from crates zksync_merkle_tree 0.1.0, zksync_object_store 0.1.0, zksync_eth_client 0.1.0, zksync_prover_tee_gateway 0.1.0, zksync_web3_decl 0.1.0, ...
2024-06-24T10:36:25.239396Z  INFO vise_exporter::exporter: Starting Prometheus exporter web server on 0.0.0.0:3314
2024-06-24T10:36:29.357001Z  INFO zksync_prover_tee_gateway::api_data_fetcher: Sending request to http://127.0.0.1:3320/tee/submit_proofs/1
2024-06-24T10:36:30.364839Z  INFO zksync_prover_tee_gateway::api_data_fetcher: Sending request to http://127.0.0.1:3320/tee/proof_inputs
2024-06-24T10:36:35.273205Z  INFO zksync_prover_tee_gateway::api_data_fetcher: Sending request to http://127.0.0.1:3320/tee/submit_proofs/2
2024-06-24T10:36:36.281514Z  INFO zksync_prover_tee_gateway::api_data_fetcher: Sending request to http://127.0.0.1:3320/tee/proof_inputs
2024-06-24T10:36:39.766674Z  INFO zksync_prover_tee_gateway::api_data_fetcher: Sending request to http://127.0.0.1:3320/tee/submit_proofs/3
2024-06-24T10:36:40.774033Z  INFO zksync_prover_tee_gateway::api_data_fetcher: Sending request to http://127.0.0.1:3320/tee/proof_inputs
2024-06-24T10:36:46.074133Z  INFO zksync_prover_tee_gateway::api_data_fetcher: Sending request to http://127.0.0.1:3320/tee/submit_proofs/4
2024-06-24T10:36:47.082928Z  INFO zksync_prover_tee_gateway::api_data_fetcher: Sending request to http://127.0.0.1:3320/tee/proof_inputs
2024-06-24T10:36:55.756997Z  INFO zksync_prover_tee_gateway::api_data_fetcher: Sending request to http://127.0.0.1:3320/tee/submit_proofs/5
2024-06-24T10:36:56.765820Z  INFO zksync_prover_tee_gateway::api_data_fetcher: Sending request to http://127.0.0.1:3320/tee/proof_inputs
2024-06-24T10:37:09.389936Z  INFO zksync_prover_tee_gateway::api_data_fetcher: Sending request to http://127.0.0.1:3320/tee/submit_proofs/6
2024-06-24T10:37:10.397319Z  INFO zksync_prover_tee_gateway::api_data_fetcher: Sending request to http://127.0.0.1:3320/tee/proof_inputs
2024-06-24T10:37:11.358309Z  INFO zksync_prover_tee_gateway: Stop signal received, shutting down
2024-06-24T10:37:11.358889Z  INFO vise_exporter::exporter: Stop signal received, Prometheus metrics exporter is shutting down
2024-06-24T10:37:11.359218Z  INFO vise_exporter::exporter: Prometheus metrics exporter server shut down
2024-06-24T10:37:16.360379Z  WARN zksync_utils::wait_for_tasks: Failed to terminate actors in 5s
2024-06-24T10:37:27.412702Z  INFO zksync_prover_tee_gateway::api_data_fetcher: Sending request to http://127.0.0.1:3320/tee/submit_proofs/7

(notice the ~10-second time gap between /tee/submit_proofs calls! 🐢)

State of the SQL db after running the above command:

zksync_local=# SELECT * FROM tee_proof_generation_details ORDER BY l1_batch_number LIMIT 10;
 l1_batch_number |       status       |                                                             signature                                                              |                                pubkey                                |                               proof                                | tee_type |         created_at         |         updated_at         |      prover_taken_at       
-----------------+--------------------+------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------+--------------------------------------------------------------------+----------+----------------------------+----------------------------+----------------------------
               1 | generated          | \xae1e7e3997f1a125adcb4d1e6d1ad1348750804d8611492bc068813f5725d6521925e3aa19733544f5b12b0aefd9128bbeab378190280466adf0593e0099a42d | \x03065d641f791d7d5336a5fe666dd2c7d6dd59868d7300355e9e52177a4a809dee | \xcc8e8d36891b92eb080a01d22590e0a42ef074b884b2548c4e4101b6067b37a0 | sgx      | 2024-06-20 11:18:38.23828  | 2024-06-24 10:36:29.360226 | 2024-06-24 10:36:25.245783
               2 | generated          | \xc97aa8af5f4ace74765cf88dac32e25b7223f08eac35e8afebd15416857246344606f03a42429d5a29ca799f4aee5387496f30d37f413001787e85821f66181c | \x03065d641f791d7d5336a5fe666dd2c7d6dd59868d7300355e9e52177a4a809dee | \xcebf3e584510e88fe73fe4ec5f6f03fb790b598a09318182afd3ee5856c7e1aa | sgx      | 2024-06-20 11:21:42.299929 | 2024-06-24 10:36:35.275415 | 2024-06-24 10:36:30.36784
               3 | generated          | \x7950c131e0485ed439b783151b69b2b09305023196fb68a8fdd9871e8bfc946a4a13621cd6c110d337c828838599e0e525df0175dea53efef64787eae355f01f | \x03065d641f791d7d5336a5fe666dd2c7d6dd59868d7300355e9e52177a4a809dee | \xa98979b4bb45781e4d5ab4204b7965718825fbaccf531b86e890d191e1f63a96 | sgx      | 2024-06-20 11:21:44.316315 | 2024-06-24 10:36:39.768866 | 2024-06-24 10:36:36.284384
               4 | generated          | \x0f9fca2c61b7491a1e925e6ac4ad97bac31f0ede8e8cddfe1ce5038167cdc5401a5a2b9b72fa526a501ac0cbc6ea184f5958188f1f74748732ed1a710745ca52 | \x03065d641f791d7d5336a5fe666dd2c7d6dd59868d7300355e9e52177a4a809dee | \xe3cb8b2b8a01d150df3875936eb2b5a14d252f46775e888fc83ba4069bccc596 | sgx      | 2024-06-20 11:22:00.35344  | 2024-06-24 10:36:46.076382 | 2024-06-24 10:36:40.777835
               5 | generated          | \xf3aa6de1713e32a295369ea5a5b2ac17b0f0c7d33910c4a6872e6f70f0c9f18f4cd33adafc7a9cb9ee8316b0ca298f66c4625981384ed84bded4c440a64cea1d | \x03065d641f791d7d5336a5fe666dd2c7d6dd59868d7300355e9e52177a4a809dee | \x5fb0fc34f9d3d99b228708e38974f1261c9a1d66a35f8ddf905ae38bff88f2f0 | sgx      | 2024-06-20 11:22:02.41282  | 2024-06-24 10:36:55.759575 | 2024-06-24 10:36:47.086305
               6 | generated          | \xddd00e264dc7b07c802c300fb22b36e23e4349ed9ebe391c8e72a2d9847856f406d8e3b0dc5b6aad4ffb0a032fd38e79561be910ed179ced3f891c2e98866f9e | \x03065d641f791d7d5336a5fe666dd2c7d6dd59868d7300355e9e52177a4a809dee | \x2494f589891fcd100f183c4289a73cdf17556cddef7b82a5134df7a1b4cd7fb0 | sgx      | 2024-06-20 11:22:11.467021 | 2024-06-24 10:37:09.392016 | 2024-06-24 10:36:56.769566
               7 | picked_by_prover   |                                                                                                                                    |                                                                      |                                                                    |          | 2024-06-20 11:22:14.514738 | 2024-06-24 10:37:10.400956 | 2024-06-24 10:37:10.400956
               8 | ready_to_be_proven |                                                                                                                                    |                                                                      |                                                                    |          | 2024-06-20 11:22:23.558881 | 2024-06-21 11:04:07.218994 | 
               9 | ready_to_be_proven |                                                                                                                                    |                                                                      |                                                                    |          | 2024-06-20 11:22:25.613403 | 2024-06-21 11:04:27.581246 | 
              10 | ready_to_be_proven |                                                                                                                                    |                                                                      |                                                                    |          | 2024-06-20 11:22:35.686172 | 2024-06-21 11:04:49.115012 | 
(10 rows)

My main concerns/doubts around my code changes:

  1. I was 'inspired' by the prover/prover_fri_gateway package, which plays a similar role to what we want to implement in this PR, as it uses the same API. Would it be a good idea to reuse some of its code (specifically impl PeriodicApiStruct and trait PeriodicApi)? If so, I will refactor both packages to avoid copy-pasting:
    // TODO inspired (i.e. copy-pasted) by prover/prover_fri_gateway/src/api_data_fetcher.rs
    // TODO remove this file completely and reuse existing one
  2. Following up on the above, there's an issue with the current impl PeriodicApiStruct implementation. It sends an HTTP request and then handles the response in the same Tokio task, which blocks sending more HTTP requests (a classic problem). In our case, the response handling involves running the verify() function, which can block for a few seconds, as seen in the tests I ran. Isn't this a problem? Should I refactor the loop so that we can process responses concurrently?
  3. Why there is this tokio::time::sleep after handling the response?
    _ = sleep(self.poll_duration) => {}

    Why was it set to 1000 seconds (~17 minutes) for the original FRI Prover Gateway?
    api_poll_duration_secs: 1000

    I set it to 1 second (it could probably be 0 seconds too) for the TEE prover gateway since, as far as I know, we don't need any artificial delays, right?

Copy link
Member

@popzxc popzxc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this should be implemented as a part of the prover workspace. Prover has its own reasons for design decisions, and a bunch of technical debt to address, so reusing its code should have strong reasons for it. Here used abstractions don't seem to be suitable (which is indicated, for example, by the fact that you use () as job ID, TeeProofGenerationDataRequest is an empty structure, and get_next_request always return Some(..)).
Here the code looks fairly simple, so we probably can implement a stand-alone binary for it. Using node framework will allow you to reuse some of the existing layers.

For example, you may do something like this (pseudocode):

use zksync_node_framework::task::Task;

struct TeeProver { /* some fields */ }

#[async_trait::async_trait]
impl Task for TeeProver {
    fn id() -> TaskId {
      "tee_prover".into()
    }

    fn run(self: Box<Self>, stop_receiver: StopReceiver) -> anyhow::Result<()> {
      self.register_attestation(&mut stop_receiver).await?;
      while !stop_receiver.0.borrow() {
        let job = self.get_job(&mut stop_receiver).await?; // Wait until the job is available.
        let output = self.verify(job, &mut stop_receiver).await?; // Generate the proof.
        self.submit_proof(output, &mut stop_receiver).await?; // Submit the proof.
      }
      Ok(())
    }
}

Then you can implement a wiring layer for it, and the resulting binary will be composed as:

fn main() -> anyhow::Result<()> {
  /* setup observability, load configs */

  ZkStackServiceBuilder::new()
    .add_layer(sigint_layer) // handle sigint
    .add_layer(prometheus_exporter_layer) // export metrics
    .add_layer(tee_prover_layer) // run the prover
    .build()?
    .run()
}

This binary will live in core/bin and IMHO it will be significantly easier to read.

@pbeza
Copy link
Collaborator Author

pbeza commented Jun 25, 2024

Thanks for the suggestions @popzxc!

I have a quick question, as I'm not yet fluent with the node framework and I want to provide more context on how we want to run the TEE Prover Gateway. Does the node framework make it easy to build a standalone binary, decoupled from the zksync-era server, so that I could run it with the tee-key-preexec binary like this: ./tee_key_preexec ./tee_prover_gateway? Or is it acceptable to run the whole zksync-era node with the tee-key-preexec as follows: ./tee_key_preexec zk server --components tee_prover_gateway? What's the preferred way?

(We need tee-key-preexec to 'inject' some environment variables into the TEE Prover Gateway.)

cc @haraldh

@popzxc
Copy link
Member

popzxc commented Jun 25, 2024

@pbeza the node framework is not tied to the server -- the last code snippet I provided is a sample for a stand-alone binary (which has nothing to do with the server).

@pbeza
Copy link
Collaborator Author

pbeza commented Jun 26, 2024

The code changes needed for this PR are too major to continue work on it. Closing it and going with the simpler #2333 instead.

@pbeza pbeza closed this Jun 26, 2024
@pbeza pbeza mentioned this pull request Jun 26, 2024
5 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants