Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve logic for selecting untrusted Os for fast verification #2097

Closed
yondonfu opened this issue Nov 9, 2021 · 11 comments
Closed

Improve logic for selecting untrusted Os for fast verification #2097

yondonfu opened this issue Nov 9, 2021 · 11 comments
Assignees

Comments

@yondonfu
Copy link
Member

yondonfu commented Nov 9, 2021

Is your feature request related to a problem? Please describe.
A clear and concise description of what the problem is. Ex. I'm always frustrated when [...]

During fast verification, B defaults to selecting 2 sessions from the untrusted pool.

Describe the solution you'd like
A clear and concise description of what you want to happen.

There isn't a particularly strong reason to select 2 sessions from the untrusted pool and B can instead default to selecting 1 session from the untrusted pool which is easier to reason about and also decreases the # of Os that a single segment can bounce around to while we're still testing fast verification (sending the same segment to additional Os will be less of a big deal once we ensure that the workflow works well when just using a single untrusted O).

We could consider making this # configurable, but I don't see a strong need to do so right now, so we may just be able to use 1.

@leszko
Copy link
Contributor

leszko commented Nov 9, 2021

I may be wrong, but I think that changing the default value to 1 has some implications. @jailuthra @yondonfu PTAL, wdyt?

Problem to consider

Consider the following scenario. We have 4 orchestrators:

  • OTR - trusted
  • OUN1, OUN2 - untrusted orchestrator doing correct transcoding
  • OUNM - untrusted malicious orchestrator, sending garbage instead of transcoding

Now, B selects OTR to the trusted orchestrator pool and randomly 2 orchestrators into the untrusted pool. What can happen now?

  1. OUN1 and OUN2 are selected into the untrusted pool - in that case
  • transcoding is done by one of the untrusted orchestrators OUN1 or OUN2 (which is correct)
  • transcoding bounces between OUN1 and OUN2, since the verified transcoder is selected randomly from all transcoders that provide the correct output (bouncing is not optimal)
  1. OUN1 and OUNM are selected into the untrusted pool - in that case
    • transcoding is always done by OUN1 (which is correct)

If we change the size of the untrusted pool to 1, then it's what can happen

  1. OUN1 is selected into the untrusted pool - in that case, everything works fine, no bouncing, and all the transcoding is done by OUN1
  2. OUNM is selected into the untrusted pool - in that case, all transcoding is done by OTR and it never changes, which is not correct (since we want to use untrusted orchestrators in the first place)

Possible solutions

  1. Instead of decreasing the size of the untrusted pool, we can make sure that it's always the verified orchestrator that is first checked in chooseResults(). Then, we'll not experience the bouncing between orchestrators

  2. Implement removing (or even blacklisting) the orchestrator from the pool if it didn't transcode the segment correctly.

@jailuthra
Copy link
Contributor

jailuthra commented Nov 9, 2021

Ah good catch, I completely missed that we currently don't remove an orchestrator from the selector if it fails verification.

Implement removing (or even blacklisting) the orchestrator from the pool if it didn't transcode the segment correctly.

IIRC this was a stretch goal in the spec but for the initial implementation we decided to not call suspendAndRemove on Os that fails verification. Primary reason was that we didn't know what false-positive/false-negative rate to expect in the wild for verification. Now that we have some information around it, maybe we can at least suspend and remove a failing O from the selector. WDYT @yondonfu @darkdarkdragon?

Instead of decreasing the size of the untrusted pool, we can make sure that it's always the verified orchestrator that is first checked in chooseResults(). Then, we'll not experience the bouncing between orchestrators

Yeah, this should solve the issue of excessive swaps between untrusted Os. For the secondary motivation of cutting down the session load on Os, we'd still want to reduce the pool size to 1.

But whenever we want to switch back again to > 1 untrusted Os, reducing unnecessary swaps would be useful - so would be great to fix it now regardless!

@leszko
Copy link
Contributor

leszko commented Nov 9, 2021

Thanks for the comment @jailuthra. If the motivation is also to cut down the session load for the segments under the fast verification (so that it's always sent to (1 trusted O and 1 untrusted O) and not to (1 trusted O and 2 untrusted O)), then I think there is no need to implement the switch for now.

In such a case, I think we should at least remove the failing O from the untrusted selector (blacklist it in selectSessions()). Wdyt?

@yondonfu
Copy link
Member Author

Now that we have some information around it, maybe we can at least suspend and remove a failing O from the selector.

I think there are two mechanisms to consider here:

  • Temporarily suspending an O for a particular stream so that B can try another O for that stream. At the moment, the suspension mechanism in B is on a per stream basis. As a result, the O can still be used for another stream that the B is processing. I see the primary purpose of this mechanism to allow the B to rotate through other Os for a stream if it seems like the current O is not doing an adequate job.
  • Temporarily banning an O which would prevent the O from being used at all for any stream being processed by B. A possible implementation of this mechanism could be to check if the fast verification failure rate for a B exceeds some threshold over a period of time with the threshold being defined based on historic data on false negative rates of the phash check. I see the primary purpose of this mechanism to allow the B to flag an O that is suspected to be malicious so that it stops using it all.

IIUC the current state of things is:

  • The suspension mechanism is implemented, but B currently does not suspend a B for failing the phash check for fast verification.
  • The ban mechanism is not implemented.

If the above statements are true, then my suggestion would be:

  • Hold off on the ban mechanism until we're more certain about the false negative rate.
  • Update the suspension mechanism so that B does suspend a B for failing fast verification.

But, now that we're discussing this I think I do see the benefit of selecting more than 1 untrusted O. If we only select 1 untrusted O and then suspend that O for a fast verification failure then we'd expect B to swap to the trusted O at a rate roughly equal to the false negative rate of the phash check. An improvement to this could be to select 2 untrusted Os and only suspend an O for failing fast verification if the other untrusted O passes fast verification. If both Os fail fast verification, then skip the suspension. I'm assuming that the probability of a double false negative for both untrusted Os is less than the probability of a single false negative, but I'm not certain about this. cc @oscar-davids for thoughts on this statement.

If we implement the above, then we can also implement the below suggestion from @leszko to minimize swaps in the happy case where the verified O passes fast verification:

Instead of decreasing the size of the untrusted pool, we can make sure that it's always the verified orchestrator that is first checked in chooseResults().

@jailuthra @darkdarkdragon @leszko WDYT?

@darkdarkdragon
Copy link
Contributor

@yondonfu

Update the suspension mechanism so that B does suspend a O for failing fast verification.

I was sure we already suspend failed O, but turned out we don't.
Surely we should do this first thing.

I'm assuming that the probability of a double false negative for both untrusted Os is less than the probability of a single false negative, but I'm not certain about this

We can implement this and see how it will work on practice.

we can make sure that it's always the verified orchestrator that is first checked in chooseResults()

Sure we should do this.

@leszko
Copy link
Contributor

leszko commented Nov 10, 2021

Thanks for all the comments. I think that the comment from @yondonfu summarizes the current state of the art.

So I think we suggest the following changes:

  1. Implement suspending O when it fails fast verification AND the other untrusted O passes the verification
  2. Implement checking verified orchestrator first in chooseResults()
  3. Keep selecting 2 sessions from the untrusted pool for fast verification

@jailuthra are you ok with such changes?

@oscar-davids is it correct that "double false negative for both untrusted Os is less than the probability of a single false negative"?

@jailuthra
Copy link
Contributor

So I think we suggest the following changes:

  1. Implement suspending O when it fails fast verification AND the other untrusted O passes the verification
  2. Implement checking verified orchestrator first in chooseResults()
  3. Keep selecting 2 sessions from the untrusted pool for fast verification

Sounds good to me 👍

@oscar-davids
Copy link
Contributor

oscar-davids commented Nov 10, 2021

is it correct that "double false negative for both untrusted Os is less than the probability of a single false negative"?

This problem can be formulated as choosing nodes out of n nodes when there is m dishonest nodes. This probability follows hypergeometric distribution.
Assuming there is m dishonest nodes out of n nodes, the equation below holds:

Pr (x=2) = (m¦2)((n-m)¦0)/((n¦2) ) = (m/n)*((m-1)/(n-1))
Pr (x=1) = (m/n)

In other words, when we choose two nodes, we are likely to choose dishonest node at Pr(x=2) probability, which is apparently smaller than Pr(x=1).

@leszko
Copy link
Contributor

leszko commented Nov 10, 2021

Thanks @oscar-davids for the comment. So, I think it's worth implementing what @yondonfu suggested: suspending O when it fails fast verification AND the other untrusted O passes the verification. I'll work on that.

btw. I've fired a PR to fix this Implement checking verified orchestrator first in chooseResults() #2100. PTAL @yondonfu @jailuthra . Note that while I did local testing, I still encountered some bouncing between different orchestrators, because of the orchestrator swapping mechanism. I'll try to dig deeper into it.

@yondonfu yondonfu changed the title Default to selecting 1 session from the untrusted pool for fast verification Improve logic for selecting untrusted Os for fast verification Nov 10, 2021
@leszko
Copy link
Contributor

leszko commented Nov 12, 2021

An update from my side:

  1. We (me and @jailuthra) checked together and it seems that Check verified session first while choosing the result from multiple untrusted sessions #2100 should fix the excessive session swapping (I double-checked the local setup and I still sometimes experience session swapping, but it's probably because of the overload of my local machine).

  2. I've sent a second PR server: Add suspending sessions that did not pass the pHash verification #2103 . Together with Check verified session first while choosing the result from multiple untrusted sessions #2100 , they should close this GH issue.

@leszko
Copy link
Contributor

leszko commented Nov 22, 2021

Fixed by #2100 and #2103

@leszko leszko closed this as completed Nov 22, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants