This repository has been archived by the owner on Aug 31, 2021. It is now read-only.

Add constraint to prevent duplicate headers #150

Merged
rmulhol merged 1 commit from unique-headers-constraint-v2 into staging on Oct 28, 2019

Conversation

@rmulhol (Contributor) commented Sep 26, 2019

  • Disallow inserts of headers with the same number, hash, and node
    fingerprint, since it will enable duplicate log fetching for the
    same header

Resolves #148

Minimal change set to prevent concurrent processes from inserting duplicate headers. Would be cool to revisit how we can simplify the application code while preserving our cascade delete functionality (that handles reorgs).

Also important to note that we can still end up with duplicate logs, since headers with the same number and hash can come from different nodes 🤔
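
To make that concrete, here's an illustrative sketch (not from the PR; values are made up and other required columns such as eth_node_id are omitted) of how the new constraint still admits the same header twice when the fingerprints differ:

-- Illustrative only: with UNIQUE (block_number, hash, eth_node_fingerprint),
-- both inserts succeed because the fingerprints differ, so anything fetching
-- logs per header row would fetch the same block's logs twice.
INSERT INTO public.headers (block_number, hash, eth_node_fingerprint)
VALUES (8000000, '0xabc123', 'geth-node-fingerprint');

INSERT INTO public.headers (block_number, hash, eth_node_fingerprint)
VALUES (8000000, '0xabc123', 'parity-node-fingerprint');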

@i-norden (Collaborator) left a comment

Would it be better for the unique constraint to be on the hash and block number alone, so that we don't get duplicate headers from different nodes and don't have to worry about duplicate logs as a result? Is there any case where we would want duplicate headers from different nodes? Otherwise, :shipit:

@rmulhol (Author) commented Sep 26, 2019

> Would it be better for the unique constraint to be on the hash and block number alone, so that we don't get duplicate headers from different nodes and don't have to worry about duplicate logs as a result? Is there any case where we would want duplicate headers from different nodes? Otherwise, :shipit:

I'm torn about this. I think we do, in theory, want to be able to sync the same header from different nodes, but there's probably a good bit of further work to be done to enable that in the context of our various operations.

One core thing to consider is that the "node" differs not just between instances (e.g. geth vs parity) but also with the same instance pointed at different networks (e.g. mainnet vs kovan). That's another place where we probably need more work before we can truly say we support running against multiple networks at one time, but it's worth considering.

Perhaps it makes sense to lock down on one instance/network for now, but I think I'd be inclined to kick that over to another issue/PR. If we go that route, then we can probably strip out a lot of the code around nodeID, nodeFingerprint, etc. This PR basically just resolves one race condition.

@rmulhol (Author) commented Sep 26, 2019

That's also an interesting constraint (UNIQUE (hash, block_number)) because I think on conflict we'd want to delete the offending row (to trigger our cascades) before inserting a new record.. 🤔

@i-norden (Collaborator) commented Sep 26, 2019

This would definitely be a separate PR, but what about changing eth_node_id and eth_node_fingerprint to arrays, so that if a header conflicts on (UNIQUE (hash, block_number)) the existing header is tagged with the new node's unique eth_node_fingerprint and eth_node_id? Currently the number of headers we store will scale linearly with the number of uniquely-fingerprinted nodes we sync against, even if nearly all the data is identical, as in the case of geth vs parity.
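
For what it's worth, here's a rough Postgres sketch of that array idea (purely hypothetical, not code from this repo; it assumes the alternative UNIQUE (block_number, hash) constraint is in place and omits the other header columns):

-- Hypothetical migration: store fingerprints as an array on the header.
-- eth_node_id would need similar treatment plus a rework of its foreign key,
-- since a REFERENCES constraint can't point at an array column directly.
ALTER TABLE public.headers
    ALTER COLUMN eth_node_fingerprint TYPE VARCHAR(128)[]
    USING ARRAY[eth_node_fingerprint];

-- Hypothetical upsert: on a (block_number, hash) conflict, tag the existing
-- header with the new node's fingerprint instead of inserting a second row.
INSERT INTO public.headers (block_number, hash, eth_node_fingerprint)
VALUES ($1, $2, ARRAY[$3::VARCHAR])
ON CONFLICT (block_number, hash) DO UPDATE
SET eth_node_fingerprint = array_append(headers.eth_node_fingerprint, $3::VARCHAR)
WHERE NOT headers.eth_node_fingerprint @> ARRAY[$3::VARCHAR];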

> That's also an interesting constraint (UNIQUE (hash, block_number)) because I think on conflict we'd want to delete the offending row (to trigger our cascades) before inserting a new record.. 🤔

Hmm, I'm probably missing something bigger-picture, but I don't see why we would upsert on a (UNIQUE (hash, block_number)) conflict. If the header is identical, why not keep the original, keep the data that references it, and continue referencing it?

@rmulhol (Author) commented Sep 27, 2019

You're totally right about that constraint. For some reason I was thinking of it as a uniqueness constraint on the block number, or on block number + node fingerprint; either of those would efficiently upsert reorged headers but would (I think) fail to cascade delete now-invalidated nested data.

Unique by block number + hash would not cause problems on cascade delete, since constraint violations there mean literal data duplication rather than a reorg or anything like that.

@rmulhol (Author) commented Sep 27, 2019

I like the idea of storing node id and fingerprint as an array on the header! I'd want to think a bit about how we're using those fields in queries like MissingHeaders to better estimate the work involved and the performance implications, but it definitely seems like something that could provide a win in terms of reducing data duplication (and the potential multiplier effect that duplicate headers entail).

@rmulhol (Author) commented Sep 27, 2019

Last thing I'll say is that I think it'd be totally awesome if we could boil all of this down to a single SQL query to:

(1) try to insert a header
(2) don't update if block number + hash are the same (or add id/fingerprint to arrays)
(3) delete the old header and replace it with the new one if the passed hash doesn't match the already-persisted hash for that block number + node
(4) return the id of the header with the passed block number + hash

Feels like it could be pretty ugly and tricky, but looking at this issue has me pretty skeptical of trying to manage header uniqueness in application code.
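
Purely as a sketch of that shape (not the project's actual query; it ignores the array variant and other columns, and assumes $1 = block_number, $2 = hash, $3 = eth_node_fingerprint), something along these lines seems plausible in Postgres:

-- Hypothetical single statement: drop any reorged header for this node/block
-- (which fires the existing ON DELETE CASCADE cleanup), insert the new header
-- if it isn't already present, then return the surviving row's id.
WITH deleted AS (
    DELETE FROM public.headers
    WHERE block_number = $1
      AND eth_node_fingerprint = $3
      AND hash <> $2
    RETURNING id
), inserted AS (
    INSERT INTO public.headers (block_number, hash, eth_node_fingerprint)
    VALUES ($1, $2, $3)
    ON CONFLICT (block_number, hash, eth_node_fingerprint) DO NOTHING
    RETURNING id
)
SELECT id FROM inserted
UNION ALL
SELECT id FROM public.headers
WHERE block_number = $1 AND hash = $2 AND eth_node_fingerprint = $3
LIMIT 1;

The fallback SELECT covers the DO NOTHING case, where an identical header already exists. In a concurrent race the statement can still return no rows (a conflict against a row the statement's snapshot can't see), which looks like the case the sql.ErrNoRows / ErrValidHeaderExists handling further down is guarding against.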

@@ -8,7 +8,8 @@ CREATE TABLE public.headers
     block_timestamp NUMERIC,
     check_count INTEGER NOT NULL DEFAULT 0,
     eth_node_id INTEGER NOT NULL REFERENCES eth_nodes (id) ON DELETE CASCADE,
-    eth_node_fingerprint VARCHAR(128)
+    eth_node_fingerprint VARCHAR(128),
+    UNIQUE (block_number, hash, eth_node_fingerprint)
Contributor commented:

I'm always super anxious about us half-supporting syncing with several clients, since I'm not certain we actually account for that everywhere ._.

@rmulhol (Author) replied:

Yeah, agreed. Really what this is accomplishing, though, is just making sure that running against multiple chains (e.g. mainnet/kovan) won't cause one chain's sync to perform destructive operations against the other's data. Definitely would like to do more to make multi-client support for one chain robust.

    if err != nil {
        log.Error("insertHeader: error inserting header: ", err)
        if err == sql.ErrNoRows {
            return 0, ErrValidHeaderExists
Contributor commented:

Any corresponding caller error matching, since we're returning an error instead of just logging now?

- Disallow inserts of headers with the same number, hash, and node
  fingerprint, since it will enable duplicate log fetching for the
  same header
@rmulhol force-pushed the unique-headers-constraint-v2 branch from 8f4343d to e252229 on October 28, 2019 19:57
@rmulhol merged commit 25f9c6c into staging on Oct 28, 2019
@rmulhol deleted the unique-headers-constraint-v2 branch on October 28, 2019 20:16
Successfully merging this pull request may close these issues: headers can be duplicated

3 participants