
Cache statediffs when DB connection is dropped #129

Closed
i-norden opened this issue Oct 9, 2021 · 7 comments

i-norden commented Oct 9, 2021

If the DB connection is dropped or fails, the geth client should continue to create statediff objects and cache them up to some configurable limit (in memory and/or written out to files, i.e. journaling analogous to mempool journaling). When the connection is regained or the node is restarted, the client should first write these cached diffs to the database, then go back to tracking diffs at head.
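
A minimal sketch of what such a cache could look like, assuming a gob-encodable payload; DiffCache, Payload, and the journaling details here are illustrative stand-ins, not existing code:

```go
package statediff

import (
	"encoding/gob"
	"errors"
	"os"
	"sync"
)

// Payload stands in for the service's statediff payload type; the fields are
// placeholders chosen only so the sketch is self-contained.
type Payload struct {
	BlockNumber uint64
	DiffRLP     []byte
}

var ErrCacheFull = errors.New("statediff cache is full")

// DiffCache buffers diffs in memory up to a configurable limit and appends
// each one to an on-disk journal, analogous to mempool journaling.
type DiffCache struct {
	mu         sync.Mutex
	diffs      []Payload
	maxEntries int
	enc        *gob.Encoder
	journal    *os.File
}

func NewDiffCache(journalPath string, maxEntries int) (*DiffCache, error) {
	f, err := os.OpenFile(journalPath, os.O_CREATE|os.O_WRONLY|os.O_APPEND, 0o644)
	if err != nil {
		return nil, err
	}
	return &DiffCache{maxEntries: maxEntries, journal: f, enc: gob.NewEncoder(f)}, nil
}

// Add buffers a diff and journals it; callers see ErrCacheFull once the
// configurable limit is hit and must decide what to do with further diffs.
func (c *DiffCache) Add(p Payload) error {
	c.mu.Lock()
	defer c.mu.Unlock()
	if len(c.diffs) >= c.maxEntries {
		return ErrCacheFull
	}
	if err := c.enc.Encode(p); err != nil {
		return err
	}
	c.diffs = append(c.diffs, p)
	return nil
}

// Drain hands back the buffered diffs (oldest first) and empties the cache,
// so the caller can replay them into Postgres on reconnect.
func (c *DiffCache) Drain() []Payload {
	c.mu.Lock()
	defer c.mu.Unlock()
	out := c.diffs
	c.diffs = nil
	return out
}
```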


abdulrabbani00 commented Mar 10, 2022

@i-norden - I will get started on this task. I think it's a good way for me to build up my understanding of how we connect to Postgres and how we perform writes to it.

Starting Point

I am going to browse through the codebase to get a general sense of the following:

  • How geth connects to Postgres.
    • Understand what actions geth takes when initially connecting to Postgres. I will more than likely need to add logic here to clear the queue.
  • How statediff objects are created in memory.
  • Check to see if there is any "built-in" caching mechanism I can leverage.

Proposed Changes

The following is what I had in mind:

  • When geth connects to the Postgres database, it should do the following (see the sketch after this list):
    • Check to see if we have any statediff objects in the cache.
    • If statediff objects are present in the cache, write them to the database first.
    • Once the cache is cleared:
      • Send a signal to stop the cache service.
      • Start writing new objects directly to the database.
  • If the connection to Postgres is dropped at any point:
    • Check the upper limit for objects allowed in the cache.
    • Continue to run the statediff service, but write events to the cache instead of writing them to the database.
    • Continue to write events to the cache until:
      • The cache is full.
      • A signal is received indicating that we no longer need to write events to the cache.
  • If the cache fills up, keep a record of the last block that was added to the cache. Then, either:
    • Once the cache has more "room", continue the statediff service from the last block and keep writing to the cache?
      • This lets us clear the cache and write new events to it at the same time.
    • Or clear the cache entirely first?
      • Once the cache has been cleared entirely, continue the statediff service from the last known block.
      • This doesn't leverage concurrency between writing objects to the DB and processing new statediff objects, which might be okay if clearing the cache takes a relatively short amount of time. It also saves us the extra step of adding events to the cache while simultaneously picking them up off of it.
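
A sketch of the reconnect flow from the list above, under the same assumptions as the cache sketch in the opening comment; writeToDB and stopCaching are hypothetical stand-ins for whatever the real service exposes:

```go
// onReconnect drains the cache into Postgres, then signals the cache service
// to stop so new diffs are written directly to the database again.
func onReconnect(cache *DiffCache, writeToDB func(Payload) error, stopCaching chan<- struct{}) error {
	pending := cache.Drain()
	// 1. Flush cached diffs to the database first, in the order they arrived.
	for i, p := range pending {
		if err := writeToDB(p); err != nil {
			// Connection dropped again: put the unwritten diffs back and bail.
			for _, rest := range pending[i:] {
				_ = cache.Add(rest)
			}
			return err
		}
	}
	// 2. Signal the cache service to stop so new diffs go straight to the DB.
	close(stopCaching)
	return nil
}
```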

Input

What should we do for "Once the cache has more 'room', we could continue the statediff service from the last block and write to the cache"? @i-norden

Testing

Testing this component will be critical. Here are a few thoughts:

  • We will need generic unit tests which can write a statediff object to the cache. Depending on the caching mechanism we choose, we might need to create a mock "cache" (a sketch follows this list).
  • Testing will predominantly consist of integration tests:
    • Test that the cache is properly written to when geth loses its connection to the database.
    • Test that we write cached events to the database prior to tracking the head when the connection is regained.
    • Ensure that whatever behavior we choose for handling new events while cached events are being written to the database is performed correctly.
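
A possible shape for the generic unit test mentioned above, reusing the hypothetical DiffCache sketch from the opening comment in place of a real mock:

```go
package statediff

import (
	"path/filepath"
	"testing"
)

func TestCacheWritesOnDisconnect(t *testing.T) {
	cache, err := NewDiffCache(filepath.Join(t.TempDir(), "journal.gob"), 10)
	if err != nil {
		t.Fatal(err)
	}
	// In a real test this would be a fixture statediff for a known block.
	diff := Payload{BlockNumber: 1, DiffRLP: []byte{0x01}}
	if err := cache.Add(diff); err != nil {
		t.Fatalf("expected diff to be cached, got %v", err)
	}
	if got := cache.Drain(); len(got) != 1 || got[0].BlockNumber != 1 {
		t.Fatalf("expected the cached diff back, got %v", got)
	}
}
```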

Writing Integration Tests

I am not sure that we currently have a method for writing integration tests and running them against a live instance of geth. With that in mind, it might make sense for me to work on getting Foundry integrated into our existing stack.

The best way to proceed might be:

  • Tackle the following PR by integrating Foundry into our stack.
  • Once that PR is finished, I can work on this issue with Foundry and add integration tests to the running instance of geth.

@abdulrabbani00

Going to pick this ticket back up: @i-norden @erikdies

@i-norden

Looks good. The main remark I have is that this is going to be trickier than perhaps conveyed here at a high level, because of how tightly coupled our statediffing process is to Postgres. When operating in the direct writing mode there is no independent "statediff object"; the statediff object is all the INSERT statements piling up inside the pending Postgres transaction.

This is for performance reasons: if we first create a statediff object as some distinct Go type, then we have to iterate over that entire object once again when we go to map all the relations and insert them into the DB.

We have vestigial methods (e.g. BuildStateDiffObject in builder.go) from when we tried to make the statediff a distinct intermediary object so that we could decouple the DB processing into a separate, remote process (e.g. the now-archived ipld-eth-indexer repo). So I think the way to approach this is to switch into this mode when writing to the cache, and when reading from the cache we'll have to process these objects in the same way they were processed in ipld-eth-indexer.
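
A rough sketch of that mode switch; Service, Args, Params, dbHealthy, indexDirectly, and cache.AddObject are assumed names for illustration, with BuildStateDiffObject being the vestigial builder method mentioned above:

```go
// writeStateDiff picks a path based on DB health; everything except
// BuildStateDiffObject (the method in builder.go) is a hypothetical name.
func (s *Service) writeStateDiff(args Args, params Params) error {
	if s.dbHealthy() {
		// Normal path: the "statediff object" is just the INSERT statements
		// accumulating inside the pending Postgres transaction.
		return s.indexDirectly(args, params)
	}
	// Degraded path: build the distinct intermediary object, as the archived
	// ipld-eth-indexer flow did, and cache it for processing on reconnect.
	obj, err := s.Builder.BuildStateDiffObject(args, params)
	if err != nil {
		return err
	}
	return s.cache.AddObject(obj)
}
```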


i-norden commented Mar 16, 2022

Actually, I think the better approach, from both a performance and an engineering-complexity perspective, is to switch over to using the file-writing indexer if/when we lose our Postgres connection.

And so the cache simply becomes a bunch of SQL files we can then load into Postgres.
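
For illustration, the replay step on reconnect could look something like the following; replaySQLFiles and the one-batch-of-statements-per-file layout are assumptions, not the indexer's actual behavior:

```go
// Sketch only: assumes the file-writing indexer leaves .sql files in dir, and
// that db is a *sql.DB opened with a Postgres driver (e.g. pgx stdlib).
import (
	"database/sql"
	"os"
	"path/filepath"
	"sort"
)

func replaySQLFiles(db *sql.DB, dir string) error {
	files, err := filepath.Glob(filepath.Join(dir, "*.sql"))
	if err != nil {
		return err
	}
	sort.Strings(files) // replay in filename (i.e. block) order
	for _, f := range files {
		stmts, err := os.ReadFile(f)
		if err != nil {
			return err
		}
		if _, err := db.Exec(string(stmts)); err != nil {
			return err
		}
		// Each file is effectively a cache entry; drop it once applied.
		if err := os.Remove(f); err != nil {
			return err
		}
	}
	return nil
}
```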


i-norden commented Mar 16, 2022

If that is the approach we take, one thing we need to do is add a configuration option to the file-writing indexer that tells it to write out SQL statements that include ON CONFLICT DO NOTHING clauses. The current indexer only writes out the statements without those clauses, because that output was intended for historical batch processing in eth-statediff-service prior to having applied PK constraints to the tables.
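
To make the difference concrete, here are the two output shapes side by side; the table and columns below are placeholders, not the real schema:

```go
// Illustrative only: table, columns, and values are placeholders.
const (
	// Current file-writer output, meant for historical batch loading before
	// PK constraints were applied to the tables:
	insertPlain = `INSERT INTO eth.header_cids (block_hash, cid)
VALUES ('0xabc...', 'bagiacgza...');`

	// Proposed output behind a config flag, safe to replay against tables
	// that now carry PK constraints:
	insertIdempotent = `INSERT INTO eth.header_cids (block_hash, cid)
VALUES ('0xabc...', 'bagiacgza...')
ON CONFLICT (block_hash) DO NOTHING;`
)
```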

@abdulrabbani00

Most of this makes a lot of sense; the last bit doesn't just yet (I am sure it will once I get a better understanding of what is going on).

At this point, I also agree that simply dumping some SQL files and loading them up upon "reconnection" is easier from a testing perspective, and probably from an engineering perspective too. Thank you for the rich insight. I'm going to continue diving into the codebase and trying to understand what is happening.

@abdulrabbani00

@i-norden - I believe we can close this issue since the known_gaps table covers this use case.
