Add constraint to prevent duplicate headers #150

rmulhol · 2019-09-26T19:01:52Z

Disallow inserts of headers with the same number, hash, and node
fingerprint, since it will enable duplicate log fetching for the
same header

Resolves #148

Minimal change set to prevent concurrent processes from inserting duplicate headers. Would be cool to revisit how we can simplify the application code while preserving our cascade delete functionality (that handles reorgs).

Also important to note that we can still end up with duplicate logs since headers with the same number and hash come from different nodes 🤔

i-norden

Would it be better for the unique constraint to be on the hash and block number alone, so that we don't get duplicate headers from different nodes and don't have to worry about duplicate logs as a result? Is there any case where we would want duplicate headers from different nodes? Otherwise,

rmulhol · 2019-09-26T20:54:57Z

Would it be better for the unique constraint to be on the hash and block number alone, so that we don't get duplicate headers from different nodes and don't have to worry about duplicate logs as a result? Is there any case where we would want duplicate headers from different nodes? Otherwise,

I'm torn about this. I think we do in theory want to be able to be syncing the same header from different nodes, but there's probably a good bit of further work to be done to facilitate enabling that in the context of our various operations.

One core thing to consider is that the "node" differs not just between instances (e.g. geth vs parity) but also with same instance pointed at different networks (e.g. mainnet vs kovan). Just another place where we probably need more work to truly say we support running against multiple networks at one time, but worth consideration.

Perhaps it makes sense to lock down on one instance/network for now, but I think I'd be inclined to kick that over to another issue/PR. If we go that route, then we can probably strip out a lot of the code around nodeID, nodeFingerprint, etc. This PR basically just resolves one race condition.

rmulhol · 2019-09-26T20:58:10Z

That's also an interesting constraint (UNIQUE (hash, block_number)) because I think on conflict we'd want to delete the offending row (to trigger our cascades) before inserting a new record.. 🤔

i-norden · 2019-09-26T23:23:35Z

This would definitely be a separate PR but what about changing eth_node_id and eth_node_fingerprint to arrays, and if a header conflicts on (UNIQUE (hash, block_number)) the header is tagged with unique eth_node_fingerprint and eth_node_id? Currently the amount of headers we store will scale linearly with the number of nodes with unique fingerprints that we sync against, even if nearly all the data is identical as in the case of geth vs parity.

That's also an interesting constraint (UNIQUE (hash, block_number)) because I think on conflict we'd want to delete the offending row (to trigger our cascades) before inserting a new record.. 🤔

Hmm I'm probably missing something bigger picture but I don't see why we would upsert on a (UNIQUE (hash, block_number)) conflict, if the header is identical why not keep the original, the data that references it, and continue referencing it?

rmulhol · 2019-09-27T14:11:06Z

You're totally right about that constraint. For some reason I was thinking of it as a uniqueness constraint on the block number or block number + node fingerprint, which would both efficiently upsert reorged headers but (I think) fail to cascade delete now-invalidated nested data.

Unique by block number + hash would not trigger problems on cascade delete since constraint violations there mean literal data duplication rather than reorg or anything like that.

rmulhol · 2019-09-27T14:13:07Z

I like the idea of storing node id and fingerprint as an array on the header! I'd want to think a bit about how we're using those fields in queries like MissingHeaders to better estimate the work involved and performance implications, but it definitely seems like something that could provide a win in terms of reducing data duplication (and the potential multiplier effect duplicate headers entail)

rmulhol · 2019-09-27T14:18:54Z

Last thing I'll say is that I think it'd be totally awesome if we could boil all of this down to a single SQL query to:

(1) try to insert a header
(2) don't update if block number + hash are the same (or add id/fingerprint to arrays)
(3) delete the old and replace with the new header if the passed hash doesn't match the already-persisted hash for that block number + node
(4) return the id of the header with the passed block number + hash

Feels like it could be pretty ugly and tricky, but looking at this issue has me pretty skeptical of trying to manage header uniqueness in application code.

m0ar · 2019-10-28T15:24:26Z

db/migrations/00023_create_headers_table.sql

@@ -8,7 +8,8 @@ CREATE TABLE public.headers
    block_timestamp      NUMERIC,
    check_count          INTEGER NOT NULL DEFAULT 0,
    eth_node_id          INTEGER NOT NULL REFERENCES eth_nodes (id) ON DELETE CASCADE,
-    eth_node_fingerprint VARCHAR(128)
+    eth_node_fingerprint VARCHAR(128),
+    UNIQUE (block_number, hash, eth_node_fingerprint)


I'm always super anxious about us half-supporting syncing with several clients, since I'm not certain we actually account for that everywhere ._.

Yeah, agreed. Really what this is accomplishing, though, is just making sure that running against multiple chains (e.g. mainnet/kovan) won't cause destructive operations against one another. Definitely would like to do more to make multi-client support for one chain robust

m0ar · 2019-10-28T15:25:41Z

pkg/datastore/postgres/repositories/header_repository.go

 	if err != nil {
-		log.Error("insertHeader: error inserting header: ", err)
+		if err == sql.ErrNoRows {
+			return 0, ErrValidHeaderExists


Any corresponding caller error matching, since we're returning an error instead of just logging now?

Yep! Header sync will continue on this error https://github.com/vulcanize/vulcanizedb/blob/staging/pkg/history/populate_headers.go#L56

- Disallow inserts of headers with the same number, hash, and node fingerprint, since it will enable duplicate log fetching for the same header

rmulhol requested review from m0ar, yaoandrew, elizabethengelman, Gslaughl and i-norden September 26, 2019 19:01

i-norden approved these changes Sep 26, 2019

View reviewed changes

m0ar approved these changes Oct 28, 2019

View reviewed changes

Add constraint to prevent duplicate headers

e252229

- Disallow inserts of headers with the same number, hash, and node fingerprint, since it will enable duplicate log fetching for the same header

rmulhol force-pushed the unique-headers-constraint-v2 branch from 8f4343d to e252229 Compare October 28, 2019 19:57

rmulhol merged commit 25f9c6c into staging Oct 28, 2019

rmulhol deleted the unique-headers-constraint-v2 branch October 28, 2019 20:16

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add constraint to prevent duplicate headers #150

Add constraint to prevent duplicate headers #150

rmulhol commented Sep 26, 2019

i-norden left a comment

rmulhol commented Sep 26, 2019 •

edited

Loading

rmulhol commented Sep 26, 2019

i-norden commented Sep 26, 2019 •

edited

Loading

rmulhol commented Sep 27, 2019

rmulhol commented Sep 27, 2019

rmulhol commented Sep 27, 2019

m0ar Oct 28, 2019

rmulhol Oct 28, 2019

m0ar Oct 28, 2019

rmulhol Oct 28, 2019

Add constraint to prevent duplicate headers #150

Add constraint to prevent duplicate headers #150

Conversation

rmulhol commented Sep 26, 2019

i-norden left a comment

Choose a reason for hiding this comment

rmulhol commented Sep 26, 2019 • edited Loading

rmulhol commented Sep 26, 2019

i-norden commented Sep 26, 2019 • edited Loading

rmulhol commented Sep 27, 2019

rmulhol commented Sep 27, 2019

rmulhol commented Sep 27, 2019

m0ar Oct 28, 2019

Choose a reason for hiding this comment

rmulhol Oct 28, 2019

Choose a reason for hiding this comment

m0ar Oct 28, 2019

Choose a reason for hiding this comment

rmulhol Oct 28, 2019

Choose a reason for hiding this comment

rmulhol commented Sep 26, 2019 •

edited

Loading

i-norden commented Sep 26, 2019 •

edited

Loading