diff --git a/NEWS.md b/NEWS.md
index 215de1f71d5..f825a7f16f0 100644
--- a/NEWS.md
+++ b/NEWS.md
@@ -2,6 +2,10 @@ ## Unreleased
+- the behavior for `graphman prune` has changed: running just `graphman
+  prune` will mark the subgraph for ongoing pruning in addition to
+  performing an initial pruning. To avoid ongoing pruning, use `graphman
+  prune --once` ([docs](./docs/implementation/pruning.md))
 - the materialized views in the `info` schema (`table_sizes`,
   `subgraph_sizes`, and `chain_sizes`) that provide information about the
   size of various database objects are now automatically refreshed every 6
   hours. [#4461](https://github.com/graphprotocol/graph-node/pull/4461)
 
 ### Fixes
diff --git a/docs/environment-variables.md b/docs/environment-variables.md
index 04433d4d0e3..635abc040c5 100644
--- a/docs/environment-variables.md
+++ b/docs/environment-variables.md
@@ -227,14 +227,14 @@ those.
   1.1 means that the subgraph will be pruned every time it contains 10%
   more history (in blocks) than its history limit. The default value is 1.2
   and the value must be at least 1.01
-- `GRAPH_STORE_HISTORY_COPY_THRESHOLD`,
-  `GRAPH_STORE_HISTORY_DELETE_THRESHOLD`: when pruning, prune by copying the
-  entities we will keep to new tables if we estimate that we will remove
-  more than a factor of `COPY_THRESHOLD` of the deployment's history. If we
-  estimate to remove a factor between `COPY_THRESHOLD` and
-  `DELETE_THRESHOLD`, prune by deleting from the existing tables of the
+- `GRAPH_STORE_HISTORY_REBUILD_THRESHOLD`,
+  `GRAPH_STORE_HISTORY_DELETE_THRESHOLD`: when pruning, prune by copying
+  the entities we will keep to new tables if we estimate that we will
+  remove more than a factor of `REBUILD_THRESHOLD` of the deployment's
+  history. If we estimate to remove a factor between `REBUILD_THRESHOLD`
+  and `DELETE_THRESHOLD`, prune by deleting from the existing tables of the
   deployment. If we estimate to remove less than `DELETE_THRESHOLD`
   entities, do not change the table. Both settings are floats, and default
-  to 0.5 for the `COPY_THRESHOLD` and 0.05 for the `DELETE_THRESHOLD`; they
-  must be between 0 and 1, and `COPY_THRESHOLD` must be bigger than
+  to 0.5 for the `REBUILD_THRESHOLD` and 0.05 for the `DELETE_THRESHOLD`;
+  they must be between 0 and 1, and `REBUILD_THRESHOLD` must be bigger than
   `DELETE_THRESHOLD`.
diff --git a/docs/implementation/README.md b/docs/implementation/README.md
index 441c5f279aa..31d4eb694a6 100644
--- a/docs/implementation/README.md
+++ b/docs/implementation/README.md
@@ -9,3 +9,4 @@ the code should go into comments.
 * [Time-travel Queries](./time-travel.md)
 * [SQL Query Generation](./sql-query-generation.md)
 * [Adding support for a new chain](./add-chain.md)
+* [Pruning](./pruning.md)
diff --git a/docs/implementation/pruning.md b/docs/implementation/pruning.md
new file mode 100644
index 00000000000..4faf66f4e31
--- /dev/null
+++ b/docs/implementation/pruning.md
@@ -0,0 +1,99 @@
+## Pruning deployments
+
+Subgraphs, by default, store a full version history for entities, allowing
+consumers to query the subgraph as of any historical block. Pruning is an
+operation that deletes entity versions from a deployment older than a
+certain block, so that it is no longer possible to query the deployment as
+of prior blocks. In GraphQL, those are only queries with a constraint
+`block { number: <n> }` or a similar constraint by block hash where `<n>`
+is before the block to which the deployment is pruned.
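+
+For example, with a hypothetical subgraph that has a `Token` entity (the
+entity and its fields are illustrative, not part of `graph-node`), the
+following query pins its results to a specific block and can only be
+answered if that block is at or after the block to which the deployment
+was pruned:
+
+```graphql
+{
+  tokens(block: { number: 5000000 }) {
+    id
+    totalSupply
+  }
+}
+```
+
+If the deployment has been pruned to a block after 5000000, `graph-node`
+answers this query with an error instead.
+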
+Queries that are run at a block height greater than that are not affected
+by pruning, and there is no difference between running these queries
+against an unpruned and a pruned deployment.
+
+Because pruning reduces the amount of data in a deployment, it reduces the
+amount of storage needed for that deployment, and is beneficial for both
+query performance and indexing speed. Compared to the default of keeping
+all history for a deployment, it can often reduce the amount of data
+dramatically and speed up queries considerably. See [caveats](#caveats)
+below for the downsides.
+
+The block `b` to which a deployment is pruned is controlled by
+`history_blocks`, the number of blocks of history to retain; `b` is
+calculated internally from `history_blocks` and the latest block of the
+deployment when the prune operation is performed. When pruning finishes,
+it updates the `earliest_block` for the deployment. The `earliest_block`
+can be retrieved through the `index-node` status API, and `graph-node`
+will return an error for any query that tries to time-travel to a point
+before `earliest_block`. The value of `history_blocks` must be greater
+than `ETHEREUM_REORG_THRESHOLD` to make sure that reverts can never
+conflict with pruning.
+
+Pruning is started by running `graphman prune`. That command performs an
+initial prune of the deployment and sets the subgraph's `history_blocks`
+setting, which is used to periodically check whether the deployment has
+accumulated more history than that. Whenever it does, the deployment is
+automatically repruned. If ongoing pruning is not desired, pass the
+`--once` flag to `graphman prune`. Ongoing pruning can also be turned off
+later by setting `history_blocks` to a very large value with the
+`--history` flag.
+
+Repruning is performed whenever the deployment has more than
+`history_blocks * GRAPH_STORE_HISTORY_SLACK_FACTOR` blocks of history. The
+environment variable `GRAPH_STORE_HISTORY_SLACK_FACTOR` therefore controls
+how often repruning is performed: with
+`GRAPH_STORE_HISTORY_SLACK_FACTOR=1.5` and `history_blocks` set to 10,000,
+a reprune will happen every 5,000 blocks. After the initial pruning, a
+reprune therefore happens every `history_blocks *
+(GRAPH_STORE_HISTORY_SLACK_FACTOR - 1)` blocks. This value should be set
+high enough that repruning happens relatively infrequently and does not
+cause too much database work.
+
+Pruning uses two different strategies for how to remove unneeded data:
+rebuilding tables and deleting old entity versions. Deleting old entity
+versions is straightforward: this strategy deletes rows from the
+underlying tables. Rebuilding tables copies the data that should be kept
+from the existing tables into new tables and then replaces the existing
+tables with these much smaller tables. Which strategy to use is determined
+for each table individually, and governed by the settings for
+`GRAPH_STORE_HISTORY_REBUILD_THRESHOLD` and
+`GRAPH_STORE_HISTORY_DELETE_THRESHOLD`, both numbers between 0 and 1: if
+we estimate that we will remove more than `REBUILD_THRESHOLD` of the
+table, the table will be rebuilt. If we estimate that we will remove a
+fraction between `REBUILD_THRESHOLD` and `DELETE_THRESHOLD` of the table,
+unneeded entity versions will be deleted. If we estimate that we will
+remove less than `DELETE_THRESHOLD`, the table is not changed at all.
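+
+As a minimal sketch of the rules above (the names here are made up for
+this illustration; the real logic lives in `PruneRequest::strategy` and
+uses Postgres statistics to estimate the removal fraction), strategy
+selection and the repruning trigger come down to a few comparisons:
+
+```rust
+#[derive(Debug)]
+enum Strategy {
+    Rebuild, // copy the rows we keep into a new table and swap it in
+    Delete,  // delete unneeded rows from the existing table
+}
+
+/// Pick a strategy from the estimated fraction of a table that pruning
+/// would remove; `None` means the table is left alone.
+fn choose_strategy(removal: f64, rebuild_threshold: f64, delete_threshold: f64) -> Option<Strategy> {
+    if removal >= rebuild_threshold {
+        Some(Strategy::Rebuild)
+    } else if removal >= delete_threshold {
+        Some(Strategy::Delete)
+    } else {
+        None
+    }
+}
+
+/// A deployment is repruned once it holds more than
+/// `history_blocks * GRAPH_STORE_HISTORY_SLACK_FACTOR` blocks of history.
+fn reprune_due(history: u64, history_blocks: u64, slack_factor: f64) -> bool {
+    history as f64 > history_blocks as f64 * slack_factor
+}
+
+fn main() {
+    // With the default thresholds (0.5 and 0.05): removing an estimated
+    // 60% of a table rebuilds it, 10% deletes rows, 1% leaves it alone.
+    for removal in [0.6, 0.1, 0.01] {
+        println!("{removal}: {:?}", choose_strategy(removal, 0.5, 0.05));
+    }
+
+    // With history_blocks = 10,000 and a slack factor of 1.5, repruning
+    // triggers above 15,000 blocks of history, i.e., every 5,000 blocks
+    // after the initial prune.
+    assert!(reprune_due(15_001, 10_000, 1.5));
+    assert!(!reprune_due(12_000, 10_000, 1.5));
+}
+```
+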
+With both strategies, operations are broken into batches that should each
+take about `GRAPH_STORE_BATCH_TARGET_DURATION` seconds, to avoid causing
+very long-running transactions.
+
+Pruning, in most cases, runs in parallel with indexing and does not block
+it. When the rebuild strategy is used, pruning does block indexing while
+it copies non-final entities from the existing table to the new table.
+
+The initial prune started by `graphman prune` prints a progress report on
+the console. For the ongoing prune runs that are performed periodically,
+the following information is logged: a message `Start pruning historical
+entities` which includes the earliest and latest block, a message
+`Analyzed N tables`, and a message `Finished pruning entities` with
+details about how much was deleted or copied and how long that took.
+Pruning analyzes tables when that seems necessary, since its estimates of
+how much of a table is likely not needed are based on Postgres statistics.
+
+### Caveats
+
+Pruning is a user-visible operation and does affect some of the things that
+can be done with a deployment:
+
+* because it removes history, it restricts how far back time-travel queries
+  can be performed. This will only be an issue for entities that keep
+  lifetime statistics about some object (e.g., a token) and are used to
+  produce time series: after pruning, it is only possible to produce a time
+  series that goes back no more than `history_blocks`. Pruning is very
+  beneficial, though, for entities that keep daily or similar statistics
+  about some object, as it removes data that is not needed once the time
+  period is over, and it does not affect how far back time series based on
+  these objects can be retrieved.
+* it restricts how far back a graft can be performed. Because pruning
+  removes history, it becomes impossible to graft more than
+  `history_blocks` before the current deployment head.
diff --git a/graph/src/components/store/mod.rs b/graph/src/components/store/mod.rs
index c18352e4315..a1199ac22ae 100644
--- a/graph/src/components/store/mod.rs
+++ b/graph/src/components/store/mod.rs
@@ -1208,7 +1208,7 @@ pub enum PrunePhase {
 impl PrunePhase {
     pub fn strategy(&self) -> PruningStrategy {
         match self {
-            PrunePhase::CopyFinal | PrunePhase::CopyNonfinal => PruningStrategy::Copy,
+            PrunePhase::CopyFinal | PrunePhase::CopyNonfinal => PruningStrategy::Rebuild,
             PrunePhase::Delete => PruningStrategy::Delete,
         }
     }
@@ -1247,9 +1247,9 @@ pub trait PruneReporter: Send + 'static {
 /// Select how pruning should be done
 #[derive(Clone, Copy, Debug, Display, PartialEq)]
 pub enum PruningStrategy {
-    /// Copy the data we want to keep to new tables and swap them out for
-    /// the existing tables
-    Copy,
+    /// Rebuild by copying the data we want to keep to new tables and swap
+    /// them out for the existing tables
+    Rebuild,
     /// Delete unneeded data from the existing tables
     Delete,
 }
@@ -1270,12 +1270,12 @@ pub struct PruneRequest {
     pub final_block: BlockNumber,
     /// The latest block, i.e., the subgraph head
     pub latest_block: BlockNumber,
-    /// Use the copy strategy when removing more than this fraction of
-    /// history. Initialized from `ENV_VARS.store.copy_threshold`, but can
-    /// be modified after construction
-    pub copy_threshold: f64,
+    /// Use the rebuild strategy when removing more than this fraction of
+    /// history.
+    /// Initialized from `ENV_VARS.store.rebuild_threshold`, but
+    /// can be modified after construction
+    pub rebuild_threshold: f64,
     /// Use the delete strategy when removing more than this fraction of
-    /// history but less than `copy_threshold`. Initialized from
+    /// history but less than `rebuild_threshold`. Initialized from
     /// `ENV_VARS.store.delete_threshold`, but can be modified after
     /// construction
     pub delete_threshold: f64,
@@ -1293,11 +1293,11 @@ impl PruneRequest {
         first_block: BlockNumber,
         latest_block: BlockNumber,
     ) -> Result<Self, StoreError> {
-        let copy_threshold = ENV_VARS.store.copy_threshold;
+        let rebuild_threshold = ENV_VARS.store.rebuild_threshold;
         let delete_threshold = ENV_VARS.store.delete_threshold;
-        if copy_threshold < 0.0 || copy_threshold > 1.0 {
+        if rebuild_threshold < 0.0 || rebuild_threshold > 1.0 {
             return Err(constraint_violation!(
-                "the copy threshold must be between 0 and 1 but is {copy_threshold}"
+                "the rebuild threshold must be between 0 and 1 but is {rebuild_threshold}"
             ));
         }
         if delete_threshold < 0.0 || delete_threshold > 1.0 {
@@ -1331,19 +1331,20 @@ impl PruneRequest {
             earliest_block,
             final_block,
             latest_block,
-            copy_threshold,
+            rebuild_threshold,
             delete_threshold,
         })
     }
 
     /// Determine what strategy to use for pruning
     ///
-    /// We are pruning `history_pct` of the blocks from a table that has a ratio
-    /// of `version_ratio` entities to versions. If we are removing more than
-    /// `copy_threshold` percent of the versions, we prune by copying, and if we
-    /// are removing more than `delete_threshold` percent of the versions, we
-    /// prune by deleting. If we would remove less than `delete_threshold`
-    /// percent of the versions, we don't prune.
+    /// We are pruning `history_pct` of the blocks from a table that has a
+    /// ratio of `version_ratio` entities to versions. If we are removing
+    /// more than `rebuild_threshold` percent of the versions, we prune by
+    /// rebuilding; if we are removing a fraction between `delete_threshold`
+    /// and `rebuild_threshold`, we prune by deleting. If we would remove
+    /// less than `delete_threshold` percent of the versions, we don't
+    /// prune.
     pub fn strategy(&self, stats: &VersionStats) -> Option<PruningStrategy> {
         // If the deployment doesn't have enough history to cover the reorg
         // threshold, do not prune
@@ -1356,8 +1357,8 @@ impl PruneRequest {
         // that `history_pct` will tell us how much of that data pruning
         // will remove.
         let removal_ratio = self.history_pct(stats) * (1.0 - stats.ratio);
-        if removal_ratio >= self.copy_threshold {
-            Some(PruningStrategy::Copy)
+        if removal_ratio >= self.rebuild_threshold {
+            Some(PruningStrategy::Rebuild)
         } else if removal_ratio >= self.delete_threshold {
             Some(PruningStrategy::Delete)
         } else {
diff --git a/graph/src/env/store.rs b/graph/src/env/store.rs
index f89f394bf17..8492b0e1b49 100644
--- a/graph/src/env/store.rs
+++ b/graph/src/env/store.rs
@@ -85,11 +85,11 @@ pub struct EnvVarsStore {
     pub batch_target_duration: Duration,
     /// Prune tables where we will remove at least this fraction of entity
-    /// versions by copying. Set by `GRAPH_STORE_HISTORY_COPY_THRESHOLD`.
-    /// The default is 0.5
-    pub copy_threshold: f64,
+    /// versions by rebuilding the table. Set by
+    /// `GRAPH_STORE_HISTORY_REBUILD_THRESHOLD`. The default is 0.5
+    pub rebuild_threshold: f64,
     /// Prune tables where we will remove at least this fraction of entity
-    /// versions, but fewer than `copy_threshold`, by deleting. Set by
+    /// versions, but fewer than `rebuild_threshold`, by deleting.
+    /// Set by `GRAPH_STORE_HISTORY_DELETE_THRESHOLD`. The default is 0.05
     pub delete_threshold: f64,
     /// How much history a subgraph with limited history can accumulate
@@ -134,7 +134,7 @@ impl From<InnerStore> for EnvVarsStore {
             connection_idle_timeout: Duration::from_secs(x.connection_idle_timeout_in_secs),
             write_queue_size: x.write_queue_size,
             batch_target_duration: Duration::from_secs(x.batch_target_duration_in_secs),
-            copy_threshold: x.copy_threshold.0,
+            rebuild_threshold: x.rebuild_threshold.0,
             delete_threshold: x.delete_threshold.0,
             history_slack_factor: x.history_slack_factor.0,
         }
@@ -180,8 +180,8 @@ pub struct InnerStore {
     write_queue_size: usize,
     #[envconfig(from = "GRAPH_STORE_BATCH_TARGET_DURATION", default = "180")]
     batch_target_duration_in_secs: u64,
-    #[envconfig(from = "GRAPH_STORE_HISTORY_COPY_THRESHOLD", default = "0.5")]
-    copy_threshold: ZeroToOneF64,
+    #[envconfig(from = "GRAPH_STORE_HISTORY_REBUILD_THRESHOLD", default = "0.5")]
+    rebuild_threshold: ZeroToOneF64,
     #[envconfig(from = "GRAPH_STORE_HISTORY_DELETE_THRESHOLD", default = "0.05")]
     delete_threshold: ZeroToOneF64,
     #[envconfig(from = "GRAPH_STORE_HISTORY_SLACK_FACTOR", default = "1.2")]
diff --git a/node/src/bin/manager.rs b/node/src/bin/manager.rs
index b67afff336a..ba9ea30fe41 100644
--- a/node/src/bin/manager.rs
+++ b/node/src/bin/manager.rs
@@ -253,12 +253,12 @@ pub enum Command {
     Prune {
         /// The deployment to prune (see `help info`)
         deployment: DeploymentSearch,
-        /// Prune by copying when removing more than this fraction of
-        /// history. Defaults to GRAPH_STORE_HISTORY_COPY_THRESHOLD
+        /// Prune by rebuilding tables when removing more than this fraction
+        /// of history. Defaults to GRAPH_STORE_HISTORY_REBUILD_THRESHOLD
         #[clap(long, short)]
-        copy_threshold: Option<f64>,
+        rebuild_threshold: Option<f64>,
         /// Prune by deleting when removing more than this fraction of
-        /// history but less than copy_threshold. Defaults to
+        /// history but less than rebuild_threshold. Defaults to
         /// GRAPH_STORE_HISTORY_DELETE_THRESHOLD
         #[clap(long, short)]
         delete_threshold: Option<f64>,
@@ -1390,7 +1390,7 @@ async fn main() -> anyhow::Result<()> {
         Prune {
             deployment,
             history,
-            copy_threshold,
+            rebuild_threshold,
             delete_threshold,
             once,
         } => {
@@ -1400,7 +1400,7 @@ async fn main() -> anyhow::Result<()> {
                 primary_pool,
                 deployment,
                 history,
-                copy_threshold,
+                rebuild_threshold,
                 delete_threshold,
                 once,
             )
diff --git a/node/src/manager/commands/prune.rs b/node/src/manager/commands/prune.rs
index 52288dcab09..c169577ee65 100644
--- a/node/src/manager/commands/prune.rs
+++ b/node/src/manager/commands/prune.rs
@@ -161,7 +161,7 @@ pub async fn run(
     primary_pool: ConnectionPool,
     search: DeploymentSearch,
     history: usize,
-    copy_threshold: Option<f64>,
+    rebuild_threshold: Option<f64>,
     delete_threshold: Option<f64>,
     once: bool,
 ) -> Result<(), anyhow::Error> {
@@ -198,8 +198,8 @@ pub async fn run(
         status.earliest_block_number,
         latest,
     )?;
-    if let Some(copy_threshold) = copy_threshold {
-        req.copy_threshold = copy_threshold;
+    if let Some(rebuild_threshold) = rebuild_threshold {
+        req.rebuild_threshold = rebuild_threshold;
     }
     if let Some(delete_threshold) = delete_threshold {
         req.delete_threshold = delete_threshold;
diff --git a/store/postgres/src/deployment_store.rs b/store/postgres/src/deployment_store.rs
index 32e040f95e4..ab8956c7a75 100644
--- a/store/postgres/src/deployment_store.rs
+++ b/store/postgres/src/deployment_store.rs
@@ -1290,10 +1290,10 @@ impl DeploymentStore {
         site: Arc<Site>,
         req: PruneRequest,
     ) -> Result<(), StoreError> {
-        let logger = logger.cheap_clone();
-        retry::forever_async(&logger, "prune", move || {
+        let logger2 = logger.cheap_clone();
+        retry::forever_async(&logger2, "prune", move || {
             let store = store.cheap_clone();
-            let reporter = OngoingPruneReporter::new(store.logger.cheap_clone());
+            let reporter = OngoingPruneReporter::new(logger.cheap_clone());
             let site = site.cheap_clone();
             async move { store.prune(reporter, site, req).await.map(|_| ()) }
         })
@@ -1969,7 +1969,7 @@ impl PruneReporter for OngoingPruneReporter {
 
     fn prune_batch(&mut self, _table: &str, rows: usize, phase: PrunePhase, _finished: bool) {
         match phase.strategy() {
-            PruningStrategy::Copy => self.rows_copied += rows,
+            PruningStrategy::Rebuild => self.rows_copied += rows,
             PruningStrategy::Delete => self.rows_deleted += rows,
         }
     }
diff --git a/store/postgres/src/relational/prune.rs b/store/postgres/src/relational/prune.rs
index 80b06b9af93..2a848cc0c2f 100644
--- a/store/postgres/src/relational/prune.rs
+++ b/store/postgres/src/relational/prune.rs
@@ -345,30 +345,29 @@ impl Layout {
     /// Remove all data from the underlying deployment that is not needed to
     /// respond to queries before block `earliest_block`. The `req` is used
-    /// to determine which strategy should be used for pruning, copy or
+    /// to determine which strategy should be used for pruning, rebuild or
     /// delete.
     ///
     /// Blocks before `req.final_block` are considered final and it is
     /// assumed that they will not be modified in any way while pruning is
     /// running.
     ///
-    /// The copy strategy implemented here works well for situations in
+    /// The rebuild strategy implemented here works well for situations in
     /// which pruning will remove a large amount of data from the subgraph
     /// (say, at least 50%)
     ///
-    /// The strategy for `prune_by_copying` is to copy all data that is
-    /// needed to respond to queries at block heights at or after
-    /// `earliest_block` to a new table and then to replace the existing
-    /// tables with these new tables atomically in a transaction. Copying
-    /// happens in two stages that are performed for each table in turn: we
-    /// first copy data for final blocks without blocking writes, and then
-    /// copy data for nonfinal blocks. The latter blocks writes by taking a
-    /// lock on the row for the deployment in `subgraph_deployment` (via
-    /// `deployment::lock`) The process for switching to the new tables
-    /// needs to take the naming of various database objects that Postgres
-    /// creates automatically into account so that they all have the same
-    /// names as the original objects to ensure that pruning can be done
-    /// again without risking name clashes.
+    /// The strategy for rebuilding is to copy all data that is needed to
+    /// respond to queries at block heights at or after `earliest_block` to
+    /// a new table and then to replace the existing tables with these new
+    /// tables atomically in a transaction. Rebuilding happens in two stages
+    /// that are performed for each table in turn: we first copy data for
+    /// final blocks without blocking writes, and then copy data for
+    /// nonfinal blocks. The latter blocks writes by taking an advisory lock
+    /// on the deployment (via `deployment::lock`). The process for
+    /// switching to the new tables needs to take the naming of various
+    /// database objects that Postgres creates automatically into account
+    /// so that they all have the same names as the original objects to
+    /// ensure that pruning can be done again without risking name clashes.
     ///
     /// The reason this strategy works well when a lot (or even the
     /// majority) of the data needs to be removed is that in the more
@@ -380,8 +379,8 @@ impl Layout {
     /// tables. But a full vacuum takes an `access exclusive` lock which
     /// prevents both reads and writes to the table, which means it would
     /// also block queries to the deployment, often for extended periods of
-    /// time. The `prune_by_copying` strategy never blocks reads, it only
-    /// ever blocks writes.
+    /// time. The rebuild strategy never blocks reads; it only ever blocks
+    /// writes.
     pub fn prune(
         &self,
         logger: &Logger,
@@ -414,7 +413,7 @@ impl Layout {
         for (table, strat) in &prunable_tables {
             reporter.start_table(table.name.as_str());
             match strat {
-                PruningStrategy::Copy => {
+                PruningStrategy::Rebuild => {
                     if recreate_dst_nsp {
                         catalog::recreate_schema(conn, dst_nsp.as_str())?;
                         recreate_dst_nsp = false;
diff --git a/store/postgres/tests/graft.rs b/store/postgres/tests/graft.rs
index 5fdb48dd03e..c401afeaa2e 100644
--- a/store/postgres/tests/graft.rs
+++ b/store/postgres/tests/graft.rs
@@ -569,7 +569,7 @@ fn prune() {
         );
     }
 
-    for strategy in [PruningStrategy::Copy, PruningStrategy::Delete] {
+    for strategy in [PruningStrategy::Rebuild, PruningStrategy::Delete] {
         run_test(move |store, src| async move {
             store
                 .set_history_blocks(&src, -3, 10)
@@ -612,12 +612,12 @@ fn prune() {
             let mut req = PruneRequest::new(&src, 3, 1, 0, 6)?;
             // Change the thresholds so that we select the desired strategy
            match strategy {
-                PruningStrategy::Copy => {
-                    req.copy_threshold = 0.0;
+                PruningStrategy::Rebuild => {
+                    req.rebuild_threshold = 0.0;
                     req.delete_threshold = 0.0;
                 }
                 PruningStrategy::Delete => {
-                    req.copy_threshold = 1.0;
+                    req.rebuild_threshold = 1.0;
                     req.delete_threshold = 0.0;
                 }
             }