Docs for pruning and some internal renaming #4505
Changes from all commits
2096d90
017671c
fb0aca5
705db27
6504d97
3c3ad93
1280949
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,99 @@ | ||
## Pruning deployments | ||
|
||
Subgraphs, by default, store a full version history for entities, allowing | ||
consumers to query the subgraph as of any historical block. Pruning is an | ||
operation that deletes entity versions from a deployment older than a | ||
certain block, so it is no longer possible to query the deployment as of | ||
prior blocks. In GraphQL, the only affected queries are those with a | ||
constraint `block: { number: <n> }`, or a similar constraint by block hash, | ||
where `n` is before the block to which the deployment is pruned. Queries | ||
that are run at a block height greater than that are not affected by pruning, and there is no | ||
difference between running these queries against an unpruned and a pruned | ||
deployment. | ||
|
||
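To illustrate which queries are affected, the following sketch builds time-travel queries with a block constraint by number or by hash. The `tokens` entity, its `id` field, and the helper name `time_travel_query` are hypothetical and serve only as an example; the `block: { ... }` argument is the constraint described above.

```python
from typing import Optional


def time_travel_query(number: Optional[int] = None,
                      block_hash: Optional[str] = None) -> str:
    """Build a GraphQL query, optionally pinned to a block.

    `tokens` and `id` are hypothetical schema names used for illustration.
    """
    if number is not None and block_hash is not None:
        raise ValueError("constrain by number or by hash, not both")
    if number is not None:
        args = f"(block: {{ number: {number} }})"
    elif block_hash is not None:
        args = f'(block: {{ hash: "{block_hash}" }})'
    else:
        # No block constraint: the query runs at the deployment head and
        # behaves the same on pruned and unpruned deployments.
        args = ""
    return f"{{ tokens{args} {{ id }} }}"
```

Only the first two forms can fail on a pruned deployment, and only when the requested block lies before the block to which the deployment was pruned.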
Because pruning reduces the amount of data in a deployment, it reduces the | ||
amount of storage needed for that deployment, and is beneficial for both | ||
query performance and indexing speed. Especially compared to the default of | ||
keeping all history for a deployment, it can often reduce the amount of | ||
data for a deployment by a very large amount and speed up queries | ||
considerably. See [caveats](#caveats) below for the downsides. | ||
|
||
The block `b` to which a deployment is pruned is controlled by how many | ||
blocks `history_blocks` of history to retain; `b` is calculated internally | ||
using `history_blocks` and the latest block of the deployment when the | ||
prune operation is performed. When pruning finishes, it updates the | ||
`earliest_block` for the deployment. The `earliest_block` can be retrieved | ||
through the `index-node` status API, and `graph-node` will return an error | ||
for any query that tries to time-travel to a point before | ||
`earliest_block`. The value of `history_blocks` must be greater than | ||
`ETHEREUM_REORG_THRESHOLD` to make sure that reverts can never conflict | ||
with pruning. | ||
|
||
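The relationship between `history_blocks`, the latest block, and `earliest_block` can be sketched as follows. This is an illustration under a stated assumption, not `graph-node`'s actual implementation: the text only says that `b` is derived from `history_blocks` and the latest block, and the sketch assumes the simplest reading, `b = latest_block - history_blocks`. The function names and the numbers in the usage note are made up for the example.

```python
def prune_target_block(latest_block: int, history_blocks: int,
                       reorg_threshold: int) -> int:
    """Return the block `b` below which entity versions would be removed.

    Assumption: b = latest_block - history_blocks (clamped at 0). The
    documentation does not spell out the exact internal calculation.
    """
    if history_blocks <= reorg_threshold:
        # Mirrors the rule that history_blocks must exceed the reorg threshold.
        raise ValueError("history_blocks must be greater than the reorg threshold")
    return max(latest_block - history_blocks, 0)


def can_time_travel(query_block: int, earliest_block: int) -> bool:
    """Queries before `earliest_block` are rejected; later ones are unaffected."""
    return query_block >= earliest_block
```

With a latest block of 1,000,000, `history_blocks` of 10,000, and an illustrative reorg threshold of 250, the sketch yields 990,000 as the prune target, and a query at block 989,999 would be rejected.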
Pruning is started by running `graphman prune`. That command will perform | ||
an initial prune of the deployment and set the subgraph's `history_blocks` | ||
setting which is used to periodically check whether the deployment has | ||
accumulated more history than that. Whenever the deployment does contain | ||
more history than that, the deployment is automatically repruned. If | ||
ongoing pruning is not desired, pass the `--once` flag to `graphman | ||
prune`. Ongoing pruning can be turned off by setting `history_blocks` to a | ||
Review comment: To check my understanding, the turning off pointer here is saying that if you pruned once with (say) 10,000 blocks (setting
Reply: yes, that's exactly what I meant here |
||
very large value with the `--history` flag. | ||
|
||
Repruning is performed whenever the deployment has more than | ||
`history_blocks * GRAPH_STORE_HISTORY_SLACK_FACTOR` blocks of history. The | ||
environment variable `GRAPH_STORE_HISTORY_SLACK_FACTOR` therefore controls | ||
how often repruning is performed: with | ||
`GRAPH_STORE_HISTORY_SLACK_FACTOR=1.5` and `history_blocks` set to 10,000, | ||
a reprune will happen every 5,000 blocks. After the initial pruning, a | ||
reprune therefore happens every `history_blocks * | ||
(GRAPH_STORE_HISTORY_SLACK_FACTOR - 1)` blocks. This value should be set high | ||
enough so that repruning occurs relatively infrequently to not cause too | ||
much database work. | ||
|
||
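The reprune arithmetic can be checked with a small sketch. The function names are hypothetical; the logic follows the rule stated above that a reprune is triggered once the deployment holds more than `history_blocks * GRAPH_STORE_HISTORY_SLACK_FACTOR` blocks of history.

```python
def needs_reprune(current_history: int, history_blocks: int,
                  slack_factor: float) -> bool:
    """True once accumulated history exceeds history_blocks * slack_factor."""
    return current_history > history_blocks * slack_factor


def blocks_between_reprunes(history_blocks: int, slack_factor: float) -> float:
    """Blocks that accumulate after a prune before the next one is triggered.

    A prune leaves `history_blocks` of history; the next prune starts when
    history reaches history_blocks * slack_factor, so the gap is the difference.
    """
    return history_blocks * (slack_factor - 1)
```

For `history_blocks` of 10,000 and a slack factor of 1.5, `blocks_between_reprunes` gives 5,000, matching the example in the text. A slack factor close to 1 makes repruning nearly continuous.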
Pruning uses two different strategies for how to remove unneeded data: | ||
rebuilding tables and deleting old entity versions. Deleting old entity | ||
versions is straightforward: this strategy deletes rows from the underlying | ||
tables. Rebuilding tables copies the data that should be kept from the | ||
existing tables into new tables and then replaces the existing tables with | ||
these much smaller tables. Which strategy to use is determined for each | ||
table individually, and governed by the settings for | ||
`GRAPH_STORE_HISTORY_REBUILD_THRESHOLD` and | ||
Review comment: Are these thresholds 0-1 (i.e. 0.5 is 50%)? Or 0-100?
Reply: Yes, it's between 0 and 1, added that to the text |
||
`GRAPH_STORE_HISTORY_DELETE_THRESHOLD`, both numbers between 0 and 1: if we | ||
estimate that we will remove more than `REBUILD_THRESHOLD` of the table, | ||
the table will be rebuilt. If we estimate that we will remove a fraction | ||
between `DELETE_THRESHOLD` and `REBUILD_THRESHOLD` of the table, unneeded | ||
entity versions will be deleted. If we estimate that we will remove less than | ||
`DELETE_THRESHOLD`, the table is not changed at all. With both strategies, | ||
operations are broken into batches that should each take | ||
`GRAPH_STORE_BATCH_TARGET_DURATION` seconds to avoid causing very | ||
long-running transactions. | ||
|
||
Pruning, in most cases, runs in parallel with indexing and does not block | ||
it. When the rebuild strategy is used, pruning does block indexing while it | ||
copies non-final entities from the existing table to the new table. | ||
|
||
The initial prune started by `graphman prune` prints a progress report on | ||
the console. For the ongoing prune runs that are periodically performed, | ||
the following information is logged: a message `Start pruning historical | ||
entities` which includes the earliest and latest block, a message `Analyzed | ||
N tables`, and a message `Finished pruning entities` with details about how | ||
much was deleted or copied and how long that took. Pruning analyzes tables, | ||
if that seems necessary, because its estimates of how much of a table is | ||
likely not needed are based on Postgres statistics. | ||
|
||
### Caveats | ||
|
||
Pruning is a user-visible operation and does affect some of the things that | ||
can be done with a deployment: | ||
|
||
* because it removes history, it restricts how far back time-travel queries | ||
Review comment: maybe worth linking to the time travel docs page?
Reply: Just looked at the time-travel doc, and it's super low-level about how rows in the db are manipulated. Seems we miss more of a user-level explanation of it. |
||
can be performed. This will only be an issue for entities that keep | ||
lifetime statistics about some object (e.g., a token) and are used to | ||
produce time series: after pruning, it is only possible to produce a time | ||
series that goes back no more than `history_blocks`. It is very | ||
beneficial though for entities that keep daily or similar statistics | ||
about some object as it removes data that is not needed once the time | ||
period is over, and does not affect how far back time series based on | ||
these objects can be retrieved. | ||
* it restricts how far back a graft can be performed. Because it removes | ||
history, it becomes impossible to graft more than `history_blocks` before | ||
the current deployment head. |
Review comment: Is that initial prune now async (i.e., it doesn't block indexing)?
Reply: Good point, added a paragraph for that. It blocks indexing with the rebuild strategy while it copies non-final entities. I also added another paragraph explaining what log output to look for.