This repository has been archived by the owner on Oct 11, 2024. It is now read-only.

Not all orders are being shared between browsers #803

Open
albrow opened this issue May 5, 2020 · 7 comments
Labels
browser bug Something isn't working

Comments

@albrow
Contributor

albrow commented May 5, 2020

Context

Augur has reported this issue on Discord.

Please provide any relevant information about your setup

  • Are you running Mesh in the browser or as a standalone server? Are you running Mesh inside of Docker or directly running the binary?

All browsers across all OS.
For example, this is currently an issue between Chrome, Firefox, and Safari across Linux (Mint with an Ubuntu 20.04 base), Windows 10, macOS Mojave, and iOS 13.

  • What version of Mesh are you running? Be as specific as possible (e.g., 8.0.1 instead of latest or 8).

9.2.1 and 9.3.0

Expected Behavior

Setting aside fills/expiries, all orders should be propagated between two browser nodes.

Current Behavior

One node is receiving > 7k orders. The other is only receiving 1390.

Failure Information (for bugs)

Two different clients with the same contracts


chwy (in New Jersey) is getting >7k orders during the getOrdersAsync().

I get 1390

and they never come in / it's not eventually consistent
You know what, I never finished upgrading to 9.3.0. I wonder if that would fix this...

Steps to Reproduce

(See comment below).

Failure Logs

(See screenshots above).

@albrow albrow added browser bug Something isn't working labels May 5, 2020
@pgebheim
Contributor

pgebheim commented May 5, 2020

Platform

All browsers across all OS.
For example, this is currently an issue between Chrome, Firefox, and Safari across Linux (Mint with an Ubuntu 20.04 base), Windows 10, macOS Mojave, and iOS 13.

Mesh Version

9.2.1 and 9.3.0

Description

Beginning last Thursday April 30, around 3pm EST our product manager noticed weird issues where orders were not appearing between two of his devices.

We subsequently upgraded from 9.2.1 to 9.3.0 and are seeing the same behavior.

With further debugging, we've discovered that our call to getOrdersAsync returns wildly different numbers of orders, and that a client will never receive orders that were not available in that initial fetch.

Subsequent re-runs of the application without clearing database state will discover more orders.

I'll respond with detailed logs of subsequent runs of the mesh. This is easily reproducible on https://v2.augur.net where debug logging is deployed, but doesn't have level 6 mesh logs turned on.

Example runs:

| Run | Orders Retrieved |
| --- | --- |
| 0 | 495 |
| 1 | 893 |
| 2 | 996 |
| 3 | 1390 |

At the same time other devs are reporting:
5500 orders
7200 orders
3960 orders.
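
To put a number on the gap between two clients, the order hashes each one holds could be diffed. The snippet below is only a sketch: `OrderSummary`, `missingOrderHashes`, and the sample hashes are made up for illustration, and it assumes each order returned by the bulk fetch can be reduced to a unique `orderHash` string.

```typescript
// Hypothetical minimal shape: only the order hash matters for this comparison.
interface OrderSummary {
  orderHash: string;
}

// Return the hashes present in `reference` but absent from `local`.
function missingOrderHashes(
  local: OrderSummary[],
  reference: OrderSummary[],
): string[] {
  const seen = new Set(local.map((o) => o.orderHash));
  return reference
    .map((o) => o.orderHash)
    .filter((hash) => !seen.has(hash));
}

// Example with made-up hashes (not real orders).
const nodeA: OrderSummary[] = [{ orderHash: "0x01" }, { orderHash: "0x02" }];
const nodeB: OrderSummary[] = [
  { orderHash: "0x01" },
  { orderHash: "0x02" },
  { orderHash: "0x03" },
];

// Orders that node B holds and node A has not received.
console.log(missingOrderHashes(nodeA, nodeB));
```

Running the diff in both directions would show whether one client simply holds a subset of the other's orders or whether the two sets have diverged.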

Repro materials

Logs

Forthcoming

@pgebheim
Contributor

pgebheim commented May 5, 2020

Logs / Info

Current State

Before doing this test, I loaded https://v2.augur.net with a primed browser (i.e. one that has been connected to the mesh for hours) and checked how many orders it was getting. Our syncing process first calls getOrdersAsync() to receive bulk orders, prints out stats, and then responds to OrderEvent messages coming in off the mesh.
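
Roughly, that syncing flow has the shape sketched below. This is not our actual code and not the real Mesh API surface: `MeshLike`, `OrderLike`, `SyncStats`, and `startSync` are stand-ins invented for illustration, modelling only the bulk fetch and the event subscription described above.

```typescript
// Hypothetical stand-in for an order; only the hash is needed for counting.
interface OrderLike {
  orderHash: string;
}

// Hypothetical stand-in for the Mesh client (assumed shape, not the real API).
interface MeshLike {
  getOrdersAsync(): Promise<OrderLike[]>;
  onOrderEvents(handler: (events: OrderLike[]) => void): void;
}

interface SyncStats {
  fromBulkFetch: number;
  fromOrderEvents: number;
}

// Bulk fetch first, print stats, then keep counting orders that arrive as events.
// The returned stats object keeps being updated by the event handler.
async function startSync(
  mesh: MeshLike,
  known: Set<string>,
): Promise<SyncStats> {
  const stats: SyncStats = { fromBulkFetch: 0, fromOrderEvents: 0 };

  const bulk = await mesh.getOrdersAsync();
  for (const order of bulk) {
    if (!known.has(order.orderHash)) {
      known.add(order.orderHash);
      stats.fromBulkFetch += 1;
    }
  }
  console.log(`getOrdersAsync: ${stats.fromBulkFetch} orders`);

  mesh.onOrderEvents((events) => {
    for (const order of events) {
      // Count each order once, even if it was already seen in the bulk fetch.
      if (!known.has(order.orderHash)) {
        known.add(order.orderHash);
        stats.fromOrderEvents += 1;
      }
    }
  });

  return stats;
}
```

The two counters correspond to the "getOrdersAsync" and "Order Events" rows reported for each load below.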

For that browser I received 6436 orders from getOrdersAsync():
image

Fresh State Tests

For this test I will switch over to Chrome, clear all application state, and reload https://v2.augur.net, noting the total order numbers retrieved.

Load 1

| Source | Count |
| --- | --- |
| getOrdersAsync | 0 |
| Order Events (after 5 mins) | 1376 |
| Total | 1376 |

image
image

After this, no other order events are emitted.

Load 2

The tab was reloaded without clearing any state (all 0x db entries are intact).

| Source | Count |
| --- | --- |
| getOrdersAsync | 1376 |
| Order Events (after 5 mins) | 1369 |
| Total | 2745 |

image
image

After this, no more Order Events are emitted.

Load 3

The tab was reloaded without clearing any state (all 0x db entries are intact).

| Source | Count |
| --- | --- |
| getOrdersAsync | 3195 |
| Order Events (after 5 mins) | 0 |
| Total | 3195 |

image

From IndexedDB:
image

Load 4

The tab was reloaded without clearing any state (all 0x db entries are intact).

| Source | Count |
| --- | --- |
| getOrdersAsync | 3195 |
| Order Events (after 5 mins) | 0 |
| Total | 3195 |

image

So at this point this browser appears to have hit a steady state and isn't receiving new bulk events.
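
By "steady state" I mean that the total stops changing from one reload to the next. As a small illustration using the Chrome totals reported above, the check could look like the sketch below; `firstSteadyLoad` is a made-up helper name, not part of any tooling we run.

```typescript
// Totals per load as reported in the Chrome runs above (loads 1 to 4).
const totalsPerLoad: number[] = [1376, 2745, 3195, 3195];

// Return the 1-based number of the first load whose total is unchanged on the
// following load, or undefined if the totals are still moving.
function firstSteadyLoad(totals: number[]): number | undefined {
  for (let i = 0; i + 1 < totals.length; i++) {
    if (totals[i] === totals[i + 1]) {
      return i + 1;
    }
  }
  return undefined;
}

console.log(firstSteadyLoad(totalsPerLoad)); // 3
```

Note that the plateau of 3195 is still well short of the 6436 orders the primed Firefox browser reported.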

Back to Firefox

Back to Firefox, just to check on the original browser. This ran very slowly, locked up the browser ~8 times, and showed the "A Webpage is slowing down your browser. Close or wait?" message as well.

However, according to this we've still received over 6k orders from the mesh, with numbers exactly matching the first test.
image

Firefox with a cleared state

For completeness, let's clear the state of Firefox, which previously reported 6436 orders.

Load 1

Similar results to Chrome above: it starts with 0 orders and then gets some in blocks. This time it received far fewer than in the previous loads.

Source Count
getOrdersAsync 0
Order Events (after 5 mins) 124
Total 124

image

image

0x Mesh Debug Logs

I'll run these tests with a local instance with the 0x debug logs and attach per run.

@albrow
Contributor Author

albrow commented May 5, 2020

I am able to reproduce similar results by following the instructions you shared. I'm coordinating with our devops engineer to see if there is anything wrong with our Mesh nodes on Kovan. I also separately reached out to you on Discord to confirm the latest custom order filter as that may be a contributing factor.

@albrow
Contributor Author

albrow commented May 6, 2020

I spent a lot more time investigating this today. Here's what I found out:

  1. The order filter and other configuration options match between our Mesh nodes on Kovan and Augur running in the browser. So we have to throw that explanation out the window.
  2. Taking a close look at the logs, I can see that Mesh in the browser is attempting to run through the ordersync protocol with several different peers. This is expected. What is fairly unexpected is that many of the peers we try take a long time to give us all the orders. I can see that Mesh is still trying to finish ordersync minutes after starting up. We don't know who these peers are, and due to the nature of p2p networks we can't really know. It's possible they are running an older version of Mesh or they are running on underpowered machines. It's a hard problem to solve, but we might be able to make some changes within Mesh to handle slow peers better.
  3. I manually changed Mesh to only connect to one of our Kovan nodes and not attempt ordersync with any other nodes. Running Augur in the browser with this modified version of Mesh yielded much better results. Orders were retrieved much faster overall. However, we reached a hiccup where the underlying WebSocket connection was terminated and then we couldn't continue with ordersync. I've almost always had issues like this with WebSocket connections, regardless of programming language. I think we just need to do a better job of handling disconnects and timeouts in the ordersync protocol.
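
To make points 2 and 3 concrete, one possible shape for handling slow peers and dropped connections is to bound each ordersync attempt with a timeout and move on to the next peer when an attempt fails. This is only a sketch under assumptions, not what Mesh does today: `withTimeout`, `FetchFromPeer`, and `syncFromFirstHealthyPeer` are names invented here, and the real protocol works with paginated requests rather than a single call per peer.

```typescript
// Reject if `work` does not settle within `ms` milliseconds.
function withTimeout<T>(work: Promise<T>, ms: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(
      () => reject(new Error(`timed out after ${ms} ms`)),
      ms,
    );
    work.then(
      (value) => {
        clearTimeout(timer);
        resolve(value);
      },
      (err) => {
        clearTimeout(timer);
        reject(err);
      },
    );
  });
}

// Hypothetical: fetches the hashes of every order a single peer will share.
type FetchFromPeer = (peerId: string) => Promise<string[]>;

// Try peers in order; skip any that are too slow or whose connection drops.
async function syncFromFirstHealthyPeer(
  peerIds: string[],
  fetchFromPeer: FetchFromPeer,
  timeoutMs: number,
): Promise<{ peerId: string; orderHashes: string[] }> {
  for (const peerId of peerIds) {
    try {
      const orderHashes = await withTimeout(fetchFromPeer(peerId), timeoutMs);
      return { peerId, orderHashes };
    } catch (err) {
      console.warn(`ordersync with ${peerId} failed, trying next peer:`, err);
    }
  }
  throw new Error("no peer completed ordersync");
}
```

The timeout value is a trade-off: too short and healthy but distant peers get skipped, too long and a single slow peer still stalls startup for minutes.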

@albrow
Contributor Author

albrow commented Jun 9, 2020

While the CPU optimizations we're working on will help with this, I'm starting to think that they won't be enough to fully address the problem. I also opened #822 to improve the ordersync protocol itself.

@albrow
Contributor Author

albrow commented Jul 1, 2020

Quick update: We have seen that CPU optimizations have an impact here. We also recently opened #848 which should help further. Planning to implement some of the other improvements mentioned in #822 soon.

@pgebheim
Contributor

pgebheim commented Jul 4, 2020

With each change it's getting better. I think #822 will help a lot, and in the long term more intelligent routing for https will also help.


2 participants