gas/metering, terminate runaway vats #516

Closed
warner opened this issue Feb 6, 2020 · 6 comments
Labels
SwingSet package: SwingSet

Comments

@warner
Member

warner commented Feb 6, 2020

Once #398 is implemented, we'll have a source-to-source transformation that will instrument user-supplied code with meter checks. The resulting API is still under development, but will probably involve adding a meter endowment (an integer) to the Compartment in which the rewritten code is evaluated, and watching for a RangeError to be thrown when executing the code.

We'll need to define how this is managed in the SwingSet world. We have some interesting source material to work with (KeyKOS, Meters, Keepers, Ethereum's "gas"), and some new constraints (I don't think we can synchronously invoke a Keeper while the overrunning code is paused, waiting for a decision). It's pretty cheap for us to either terminate the Vat, or allow it to run to completion. But if the keeper wants to pause it for later, we must in fact terminate it, and then reload it from a previous checkpoint (which currently requires us to replay the entire transcript, which is super expensive).

We have a lot of decisions to make about user-level control of metering questions. But the simplest place to start is a fixed number of computrons for each message delivery, where the only goal is to catch a runaway loop. We'll respond to this by unconditionally terminating the vat (#514). Higher-level code like Zoe/ContractHost will attempt to put the untrusted user-supplied code into a new Vat, so a runaway contract won't threaten Zoe's ability to maintain offer safety (in particular refund safety).
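A minimal sketch of that starting point, assuming a hand-rolled meter rather than the real transform output. The names `makeMeter`, `makeVat`, `deliver`, and `COMPUTRONS_PER_DELIVERY` are hypothetical and are not part of SwingSet; the meter checks are written by hand where the transform would inject them.

```javascript
// Hypothetical sketch: a fixed computron budget for each message delivery.
// None of these names are the SwingSet API.
function makeMeter(budget) {
  let remaining = budget;
  return {
    // Injected meter checks would call this; it throws once the budget is spent.
    decrement(cost = 1) {
      remaining -= cost;
      if (remaining < 0) {
        throw new RangeError('meter exhausted');
      }
    },
  };
}

const COMPUTRONS_PER_DELIVERY = 1000; // assumed value, for illustration only

function makeVat(handler) {
  return { handler, terminated: false };
}

// Run one delivery under a fresh meter. A runaway handler terminates the vat.
// (This sketch treats any RangeError as exhaustion, which is a simplification.)
function deliver(vat, message) {
  if (vat.terminated) {
    return { ok: false, reason: 'vat terminated' };
  }
  const meter = makeMeter(COMPUTRONS_PER_DELIVERY);
  try {
    return { ok: true, result: vat.handler(message, meter) };
  } catch (err) {
    if (err instanceof RangeError) {
      vat.terminated = true; // unconditional termination
      return { ok: false, reason: 'meter exhausted' };
    }
    throw err;
  }
}

// A well-behaved handler and a runaway one.
const goodVat = makeVat((msg, meter) => {
  meter.decrement();
  return msg.toUpperCase();
});
const runawayVat = makeVat((msg, meter) => {
  for (;;) {
    meter.decrement();
  }
});
```

Delivering to `runawayVat` ends with the vat marked as terminated, and later deliveries to it are refused, while `goodVat` is unaffected because each vat has its own meter.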

@warner warner added the SwingSet package: SwingSet label Feb 6, 2020
@warner
Member Author

warner commented Feb 6, 2020

Metering gets even more interesting when we consider replicated consistency (chain-based swingset machines). As discussed in today's meeting, we're likely to land in one of two worlds:

  • 1: Every replica performs a cycle-accurate simulation of the specified computation, and every replica either finishes the crank or decides it is taking too long. This is likely to be the most expensive, both in overhead (time consumed by injected metering code, or fine-grained cycles of a well-specified VM like Ethereum's EVM), and in specification consequences (the metering transform is part of the spec, as is every detail of the JS engine's behavior).
  • 2: We use @erights's "Pack of Watchdogs" idea, which tolerates merely approximate agreement as to when a crank can be abandoned, but introduces a new batch of consensus science and incentive engineering to ensure the group of replicas behaves correctly.

@erights
Member

erights commented Feb 6, 2020

For reference, the "Pack of Watchdogs" paper is at https://medium.com/@erights/a-pack-of-watchdogs-is-cheaper-than-gas-7e118edfb4cc

@zarutian
Contributor

I do not know if this belongs here or not, but here goes:
As someone who has used watchdog timers on MCUs, I have to note that those timers are often ticked down by the same clock signal that drives instruction execution. In other words, they are pretty much equivalent to gas metering, hardware-implementation-wise (especially when the cycle count of a given instruction does not vary).

Control transfer upon a watchdog bark has traditionally been a soft reset, with loss of where the MCU was when the watchdog barked. Nowadays, most MCUs, such as the ATMEGA328p, have the option to treat the watchdog bark as a non-maskable vectored interrupt. This means that, if the programmer so desires, the MCU can continue whatever it was doing when the dog barked, or do the equivalent of a task switch or keeper invocation.

What this src2src transform does is equivalent to ticking down a watchdog at the start of a basic block by the statically known length of that block.

As a src2src transform is being done anyway, why not instrument all assignments to capture state updates done during an event-loop turn?
It is a trade-off between somewhat slower execution during a turn and a full transcript replay.

Hope this gave some insight.

@michaelfig
Member

For an initial, hacky implementation in SwingSet that would at least prove the concept, I propose adding an endowment to all static vats that allows swapping the global meter, and doing the resetting of the meter in the kernel. This would not be any worse than the status quo (static vats can cause the kernel to hang), and dynamic code would be explicitly loaded into a metering-enabled environment that does not have access to this endowment.

This mechanism will soon be replaced by the proper implementation: metering as an option to dynamic vat creation (still using the same kernel modifications, but not the endowment), and removal of support of the hack from Zoe and Spawner as they change to use dynamic vats.

@warner
Member Author

warner commented Jun 19, 2020

Updates on our metering plans:

Background

@michaelfig's packages/transform-metering transformation injects calls like getMeter().decrement() at the beginning of every code block, and prevents the input/guest code from using getMeter itself. The transformed code is meant to be evaluated with a getMeter endowment (specifically a name added to the global lexical scope, and specifically not a global where the guest code might use globalThis['getMeter'] to retrieve it).

The companion tame-metering package injects getGlobalMeter().decrement() into all the builtin objects that might be used to consume too much (e.g. Array(1e9).map(Object.create)), and provides a setGlobalMeter() function to control it.

(the method isn't actually named decrement, but it's easier to explain this way)

If either decrement call causes the meter to underflow/expire/exhaust, it throws an exception. The transformation injects decrement into catch blocks too, meaning this exception cannot be caught by code subject to the same meter that just expired. It can, however, be caught by code that was not subject to the same transform (e.g. the catcher uses a different getMeter, whose meter is still full).
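To make the catch-block behaviour concrete, here is a hand-written approximation of what instrumented guest code could look like. It is only a sketch: the real transform output differs, `decrement` is the stand-in name used above rather than the actual method, and `makeMeter`, `makeGuest`, and `host` are hypothetical.

```javascript
// Hypothetical meter: once the budget is spent, every later call throws too.
function makeMeter(budget) {
  let remaining = budget;
  return {
    decrement() {
      remaining -= 1;
      if (remaining < 0) {
        throw new RangeError('meter exhausted');
      }
    },
  };
}

// Guest source before instrumentation (for comparison):
//
//   function guest() {
//     try { for (;;) {} } catch (e) { return 'caught'; }
//   }
//
// Hand-instrumented equivalent: a check at the start of every block,
// including the catch block.
function makeGuest(getMeter) {
  return function guest() {
    getMeter().decrement();
    try {
      getMeter().decrement();
      for (;;) {
        getMeter().decrement();
      }
    } catch (e) {
      // The meter is still exhausted, so this check throws again and the
      // guest never reaches its own handler code.
      getMeter().decrement();
      return 'caught';
    }
  };
}

const guestMeter = makeMeter(100);
const guest = makeGuest(() => guestMeter);

// The host was not subject to the same meter, so it can catch the exception.
function host() {
  try {
    return guest();
  } catch (e) {
    return `host saw ${e.name}`;
  }
}
```

In this sketch `host()` observes the `RangeError`, while the guest's own `catch` block never gets to return `'caught'`.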

The intended pattern is:

  • each "metering domain" is defined by a single meter object, and a getMeter() function which 1: calls setGlobalMeter(meter), and 2: returns that meter
  • the metering domain covers all the code that was evaluated with that particular getMeter as an endowment (assuming it was subject to the transform), plus any code it eval()s, plus (eventually) any code it evaluates by using a new Compartment (somehow, TBD) (assuming it isn't explicitly moved into a new domain)
  • any time control passes to a new metering domain, the "globalMeter" will follow, because of what getMeter does
  • control transfer might happen because:
    • 1: code in one domain makes a synchronous call through a "Near" reference to code in a different domain. In this case, the caller might be able to catch the underflow exception.
    • 2: code returns out to a caller in a different domain
    • 3: code makes an eventual-send to code in a different metering domain but still in the same vat: the new "turn" will execute under the new meter, but it shares a "crank" with the caller
    • 4: code in some other vat sends a message to an object in a metered domain: the first turn of the crank starts out under the new meter
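The first kind of control transfer (a synchronous call between two domains) can be sketched as follows. This is an illustration under assumptions, not the transform-metering or tame-metering code: `meteredMap` is a hypothetical stand-in for a tamed builtin, `makeMeter` and `makeGetMeter` are simplified, and the injected checks are written by hand.

```javascript
// Stand-ins for the tame-metering global meter hooks (simplified).
let globalMeter = null;
function setGlobalMeter(meter) {
  globalMeter = meter;
}
function getGlobalMeter() {
  return globalMeter;
}

function makeMeter(name, budget) {
  let remaining = budget;
  return {
    name,
    decrement(cost = 1) {
      remaining -= cost;
      if (remaining < 0) {
        throw new RangeError(`${name} meter exhausted`);
      }
    },
    level: () => remaining,
  };
}

// Each metering domain gets its own getMeter, which 1: installs the domain's
// meter as the global meter and 2: returns that meter.
function makeGetMeter(meter) {
  return function getMeter() {
    setGlobalMeter(meter);
    return meter;
  };
}

// Hypothetical tamed builtin: charges whichever meter is currently global.
function meteredMap(array, fn) {
  const meter = getGlobalMeter();
  if (meter) {
    meter.decrement(array.length);
  }
  return array.map(fn);
}

const meterA = makeMeter('A', 100);
const meterB = makeMeter('B', 100);
const getMeterA = makeGetMeter(meterA);
const getMeterB = makeGetMeter(meterB);

// Domain B code (hand-instrumented).
function doubleAll(xs) {
  getMeterB().decrement();
  return meteredMap(xs, x => x * 2);
}

// Domain A code (hand-instrumented), calling synchronously into domain B.
function run() {
  getMeterA().decrement();
  const doubled = doubleAll([1, 2, 3]);
  // In this sketch, builtins stay charged to B until A's next check runs.
  const afterCall = getGlobalMeter().name;
  getMeterA().decrement(); // stands in for the check at A's next code block
  const result = meteredMap(doubled, x => x + 1);
  return { result, afterCall, atEnd: getGlobalMeter().name };
}
```

Running `run()` charges the first `map` to B's meter and the second to A's, because the global meter follows whichever domain most recently executed an injected check.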

Current (old-SES) implementation

In current trunk, each Vat is given a special registerEndOfCrank endowment that allows it to schedule meter-refilling calls to happen after each crank. They also get replaceGlobalMeter to control the metering of builtins. The Zoe vat uses these to build a suitable getMeter that is imposed upon guest code.

Initial New-SES implementation

My plan for the first phase of metering under new-SES (whose goal is "do the minimum amount of work that doesn't make things noticeably worse", and yes I'm relying upon some things to escape notice) will instead provide makeGetMeter and transformMetering to vats. It will pass them in a vatPowers argument to the build-a-root-object function, rather than as ambient globals, so vats can partition their internal authorities better.

These should be passed into the importBundle function, in an option that activates/imposes metering on the guest code it is importing. makeGetMeter is used to create a getMeter endowment that wraps a new Meter object. The kernel remains aware of the meter, and is responsible for refilling it between cranks. transformMetering is a powerless string-to-string transformation function which cannot currently be bundled into vats because it imports babel, which imports things like fs and path (some day we may switch to babel-standalone, which could be bundled directly, and then we can get rid of transformMetering).

Other than passing these through, Vats will be nominally unaware of meters. They can create new metering domains with importBundle, but they won't (nominally) have access to the Meter objects, and they are not responsible for refilling them. The metering rules are:

  • the top-most static vat code (e.g. vat-zoe.js) is not metered
  • every crank is a new day
  • on each day, the sum of the execution time/space/etc within a single metering domain is finite
  • when that time/space/etc is exhausted, all code within the exhausted domain starts throwing an exception, even catch blocks
  • if the stack has code from a non-exhausted metering domain, that code can catch the exception
    • this can only happen via synchronous/Near calls
    • if the exhausted domain got control via an eventual-send, or an external message, nothing can catch the exception
  • the code that gets interrupted cannot learn that it was interrupted until some future crank, and even then, only if either:
    • someone else was able to catch the exception and notified them about it
    • the interrupted code would have set a sentinel at the end of the operation, and the future crank sees that it remains unset
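A rough sketch of the "every crank is a new day" rule together with the sentinel approach from the last bullet. Everything here is hypothetical (`makeRefillableMeter`, `runCrank`, the `state` record), and `runCrank` stands in for the kernel, which is the only party that observes the exhaustion in this sketch.

```javascript
// Hypothetical meter with a separate refill facet held by the "kernel".
function makeRefillableMeter(budget) {
  let remaining = budget;
  return {
    meter: {
      decrement(cost = 1) {
        remaining -= cost;
        if (remaining < 0) {
          throw new RangeError('meter exhausted');
        }
      },
    },
    refill() {
      remaining = budget;
    },
  };
}

const { meter, refill } = makeRefillableMeter(50);

// Guest state that survives across cranks.
const state = { inProgress: false, completed: 0, interruptedSeen: 0 };

// Hand-instrumented guest code. It sets a sentinel before working and clears
// it at the end, so a later crank can notice an interrupted operation.
function guestWork(iterations) {
  meter.decrement();
  if (state.inProgress) {
    // A previous crank never reached the end of its operation.
    state.interruptedSeen += 1;
  }
  state.inProgress = true;
  for (let i = 0; i < iterations; i += 1) {
    meter.decrement();
  }
  state.inProgress = false;
  state.completed += 1;
}

// Stand-in for the kernel: run one crank, then always refill the meter.
function runCrank(iterations) {
  try {
    guestWork(iterations);
    return 'ok';
  } catch (e) {
    if (!(e instanceof RangeError)) {
      throw e;
    }
    return 'exhausted';
  } finally {
    refill();
  }
}
```

After an exhausted crank, the guest only learns what happened on its next crank, when it finds the sentinel still set.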

In this phase, the unmetered top-most static vat code does not have a way to reset the "globalMeter" when it obtains control. If it calls into metered code and then gets control back again, the builtins will still be operating on the guest code's meter, which might expire at a surprising time. The current old-SES implementation allows this code to call setGlobalMeter(null) to reset the built-in, but:

  • the metered code might call into the unmetered code, rather than returning back to it, so a secure Zoe/ZCF would need to guard against all inbound pathways (every object that was ever given to the guest), which doesn't sound like fun
  • the metered code might do an eventual-send into the unmetered code, so it must also guard against every object that was given to any external vat, since those vats might pass the reference back into the guest code, who might then do a same-crank different-turn eventual-send, which wouldn't reset the globalMeter
  • this is a regression, but it will be resolved in the next phase

In this phase, Vats are not killed when a meter expires. The only consequence of a meter expiring is that exceptions are thrown by that code until the meter is refilled (which happens only, and always, at the end of the crank). Execution within the metering domain might suddenly stop at any time during a crank, without the metered code being aware of it. This could leave broken invariants lying around. The stop can only be caused by the execution time/etc of the metered code; however, code from some other same-vat metering domain could deliberately call into this one multiple times in a single crank, enough to provoke an underflow at some critical location. In this phase, we tolerate this possibility.

meter-per-Vat implementation

The next step is to add metering to the top-level code of all dynamic vats, and maybe also specially-marked static vats (the configuration object could have a flag to enable metering, and we'd activate it on the Zoe vat until we move to split-Zoe vat-per-contract-instance). We don't necessarily want to impose metering on vats that don't need it, because of the performance hit (which we still need to measure).

We'll probably start with a "bottomless" top-level vat meter. It would never expire, but the fact that its getMeter() calls through to setGlobalMeter means it will fix the misbehavior in the previous phase, without requiring action/caution on the part of ZCF. This meter is active for the top-level code upon entry, which includes returns to this code from other metering domains.

Then, we'll make this into a normal meter: it expires, but the kernel refills it between cranks. In this mode, the initial vat code might be halted at surprising (invariant-violating) times, just as above. We may then decide to react to exhaustion of the vat's top-level meter by killing the vat entirely ("death before confusion"). In the future, this may be relaxed to merely rewind the vat upon meter exhaustion (assuming some keeper/handler mechanism to avoid walking into the same hole twice), which removes the confusion problem.

Any code that wants to protect itself against exhaustion-based confusion will need to run in its own vat, enable terminate- or rewind-on-exhaustion, and not share its vat with any other metering domain.

future meter-within-Vat implementation

For now, we're mainly trying to prevent runaway (misbehaving) contract code from denying service to other (well-behaving) contracts/instances. But sooner or later we'll want to incentivize something by "charging" more for expensive computation, and we'll be interested in measuring resource usage of execution. At that point, vats need to become more aware of meters and their current contents.

Zoe contract instance vats, in particular, will have a "ZCF" component (that acts a bit like a supervisor object), that sits next to the metered guest code. @Chris-Hibbert points out that it might seem rude to charge contract operators for the time spent by the supervisor we imposed upon them. So we may want to measure the time/space/etc consumed by each piece separately.

To support that, I'm thinking we should rearrange the Meter object a bit. The current Meter object is a collection of functions that can be invoked to increment/decrement the counters it contains, and is paired with a refillFacet that resets those counters. The injected code mostly moves the counters in just one direction (towards exhaustion), but the "Stack Meter" goes in both (as frames are pushed and popped). As a result, having access to your own Meter (in this shape) lets you escape metering: you can just refill it yourself.

I think we should reshape it into an object which lets you observe the level, and pass it into importBundle, but not increment/decrement the counters. Maybe there would be a WeakMap shared between the injected metering code and the Meter creator. If you have one Meter, you can split off some of its tokens to create a second Meter, but you can't create new execution tokens this way.
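One possible shape for that idea, as a sketch only. The names `makeMeterHandle`, `level`, `split`, and `charge` are hypothetical, and `charge` models whatever the injected metering code would do with its private access to the shared WeakMap.

```javascript
// Private token counts, shared only between the meter creator and the
// injected metering code (modelled here by `charge`).
const counters = new WeakMap();

// The handle exposes observation and splitting, but no way to add tokens.
function makeMeterHandle(tokens) {
  const handle = Object.freeze({
    level() {
      return counters.get(handle);
    },
    // Move `amount` tokens into a new handle. The total never grows.
    split(amount) {
      const current = counters.get(handle);
      if (!Number.isSafeInteger(amount) || amount < 0 || amount > current) {
        throw new RangeError('cannot split that many tokens');
      }
      counters.set(handle, current - amount);
      return makeMeterHandle(amount);
    },
  });
  counters.set(handle, tokens);
  return handle;
}

// Only code holding `counters` can consume tokens. Holders of a handle cannot.
function charge(handle, cost = 1) {
  const current = counters.get(handle);
  if (current === undefined) {
    throw new TypeError('not a meter handle');
  }
  if (current < cost) {
    throw new RangeError('meter exhausted');
  }
  counters.set(handle, current - cost);
}
```

For example, a parent created with `makeMeterHandle(100)` can call `split(30)` to fund a child. The parent then reports a level of 70 and the child a level of 30, and neither handle has a method that could raise the combined total.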

Some Meters could be bottomless, and/or refilled by the kernel between cranks. Others would be more bounded, and not refill automatically. The ZCF component could be given an unbounded meter to operate with, from which it can create bounded ones for the contract's metering domain.

pre-paid/post-paid messages

All of the above looks at metering applied to certain pieces of code, delimited by importBundle boundaries. We've also discussed metering being applied to specific messages. In Ethereum, each message includes a gasLimit, and enough ETH to cover it. This pays for all computation that happens in the same transaction (the equivalent of our "crank", given that Ethereum doesn't have Promises or turns or eventual-sends), and if it runs out, the entire transaction is rewound (except for the transfer of the forfeited gas to the block miner). If we did something similar, each message would include a purse of execution tokens, which would populate a new Meter, used for the execution of the initial message handler. That handler might examine the message and decide "hey, this is great, this is worth spending my own money on", and then switch to a different Meter for the rest of the execution.

In Ethereum, all the gas for each transaction must come from the initial (private-key-signing) sender. It's like a clockwork vending machine with no springs: you must push the button hard enough to provide all the kinetic energy necessary to complete the computation. This makes it difficult to build interestingly complex multi-step systems. In an auction, the last bid will cause very different actions than the previous ones (maybe sending out payments and refunds), which might need more gas, but the submitter of that bid might not even know what they're about to trigger, and might not provide enough. Ethereum contracts have dealt with this by moving to a "withdrawal" model: the bids cause local state updates (but never send funds to anyone else), and all participants are obligated to come back later to withdraw their winnings/refunds (with enough gas to cover their own needs).

I think we want to enable "spring-loaded computation", where the vending machine uses stored energy to decouple the individual event trigger's contribution from the overall machinery being activated. However we cannot let this become a denial-of-service vector. The attacker should not be able to push the button so frequently that the mere "are we done yet" check causes our stored execution tokens to become exhausted.

Exposing an object to the rest of the world means sharing some authority with the rest of the world. But you should be able to share useful authorities without also sharing a "deplete all my stored execution tokens" authority. It may help to have the initial execution paid for by the submitter, but then allow subsequent execution to run off stored tokens.

In the model above, this would be implemented by having the Vat's externally-visible objects run in a metering domain whose Meter was empty and not automatically refilled by anybody. These bastion objects would have closely-held references to objects in a second metering domain which has a fully-charged Meter, and it only forwards the requests that it likes. Incoming messages from other vats would have to carry purses with sufficient execution tokens to power the bastion object long enough to pass judgement and pass along the request.
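A sketch of the bastion pattern under heavy assumptions: messages are plain records carrying a numeric `purse`, the costs are made-up constants, and the meter checks are written by hand. None of `makeMeter`, `bastion`, `inner`, or `JUDGEMENT_COST` comes from the actual codebase.

```javascript
// Hypothetical meter that refuses to go below zero.
function makeMeter(tokens) {
  let remaining = tokens;
  return {
    decrement(cost = 1) {
      if (remaining < cost) {
        remaining = 0;
        throw new RangeError('meter exhausted');
      }
      remaining -= cost;
    },
    level: () => remaining,
  };
}

// Second metering domain: runs on stored ("spring-loaded") tokens.
const innerMeter = makeMeter(1000);
const inner = {
  closeAuction(bids) {
    innerMeter.decrement(10 * bids.length); // assumed cost model
    return Math.max(...bids);
  },
};

const JUDGEMENT_COST = 5; // assumed cost of inspecting one message

// Externally visible bastion. It has no tokens of its own, so each message
// must bring a purse that powers the judgement step.
const bastion = {
  receive(message) {
    const messageMeter = makeMeter(message.purse);
    messageMeter.decrement(JUDGEMENT_COST); // throws if the sender underpaid
    const acceptable =
      message.method === 'closeAuction' &&
      Array.isArray(message.bids) &&
      message.bids.length > 0;
    if (!acceptable) {
      return { accepted: false };
    }
    // Only requests the bastion likes reach the stored tokens.
    return { accepted: true, result: inner.closeAuction(message.bids) };
  },
};
```

An underpaid or unwanted message never touches `innerMeter`, so pushing the button repeatedly costs the sender tokens without depleting the stored ones.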

Somewhere in this, we need to enable the second metering domain to protect itself against confusion. The bastion object might need to inspect the second Meter to check that it has enough tokens left to complete the action. Or the inner domain might be put into a separate vat entirely, where we can eventually use state rollback to prevent early-termination confusion. Then we might have the second vat run its computation entirely on its own meter, relying upon the limited access to that vat to protect it against attack.

And, somehow, all of this needs to be folded into an escalator scheduler, where additional tokens are spent to bid for execution priority slots. These tokens probably won't be seen by the vat at all, but rather are consumed by the scheduler.

@warner
Member Author

warner commented Sep 2, 2020

this is basically done

@warner warner closed this as completed Sep 2, 2020